Quantifying dependencies for sensitivity analysis with multivariate input sample data
We present a novel method for quantifying dependencies in multivariate datasets, based on estimating the Rényi entropy by minimum spanning trees (MSTs). The length of the MSTs can be used to order pairs of variables from strongly to weakly dependent, making it a useful tool for sensitivity analysis with dependent input variables. It is well-suited for cases where the input distribution is unknown and only a sample of the inputs is available. We introduce an estimator to quantify dependency based on the MST length, and investigate its properties with several numerical examples. To reduce the computational cost of constructing the exact MST for large datasets, we explore methods to compute approximations to the exact MST, and find the multilevel approach introduced recently by Zhong et al. (2015) to be the most accurate. We apply our proposed method to an artificial testcase based on the Ishigami function, as well as to a real-world testcase involving sediment transport in the North Sea. The results are consistent with prior knowledge and heuristic understanding, as well as with variance-based analysis using Sobol indices in the case where these indices can be computed.
keywords:Rényi entropy, dependent data, sensitivity analysis, large datasets, minimum spanning trees
Msc: 65C60, 62H12, 62-07
Sensitivity analysis (SA) is a core topic in the field of Uncertainty Quantification (UQ) Mai10 (); 2Sul15b (), in which the uncertainties in simulation models of complex systems are explored. SA addresses the question how uncertainty in the output of a model can be allocated to different uncertain model inputs 2Cre09 (); Hel06 (); Hom96 (); 2Ioo15 (); 2Krz06 (); 2Liu06 (); 2Oak04 (); Sal02b (); 2Sal10 (); Sal08 (); Sob93 (); Sud08 (); Tru06 (); Wag95 (); Yam05 (). Being able to answer this question allows one to focus on model inputs that induce the largest uncertainties in the output, bringing benefits such as more efficient numerical exploration of model uncertainties, or guidance on how to reduce model output uncertainty most effectively.
A variety of methods for SA exists, both local and global methods, see e.g. the chapter on SA in 3Gha17 (), as well as the references mentioned above, for an overview. Many of these methods are intended for cases where the inputs are mutually independent, however there are situations where the assumption of independent inputs is unrealistic.
Input dependencies give several complications when using well-established methods. In that case, most methods either fail or make restrictive assumptions on the input, the output function, the number of available simulations, or combinations hereof. Examples of restrictions include assuming linear dependencies, normally distributed input variables, or being able to perform enough simulations to allow for Monte-Carlo methods 2Cha15 (); 2Cha12 (); 2Dav09 (); 2Dou13 (); 2Kuc12 (); 2Mar12 (); 2Xu08 (). The most well-known method is using Sobol’s indices Sob93 (); 3Sob01 (). In case of dependent inputs, the functions in the Sobol decomposition are not orthogonal. Due to this, the indices may become negative, and therefore the interpretation of the indices becomes less clear Nav14 (). Also, the contribution due to dependency can cancel the contribution due to the variable itself.
The setting under consideration in this paper is one in which the probability distribution of the multivariate input is unknown (both the joint distribution and the marginals), and only a sample (dataset) of the input is available. Thus, not only the dependency structure (i.e., between which variables dependencies exist) is unknown, but also the shape of these dependencies. Our aim in this study is to develop a methodology to detect and quantify the dependencies and their strength from the input data, without making assumptions on the shape.
3Fil17 () developed a Bayesian nonparametric procedure that leads to an analytic quantification for dependence versus independence. However, this procedure does not quantify explicitly the strength of the dependency. Furthermore, the inference resulting from the Pólya trees used in the procedure is known to depend strongly on the choice of the partitioning of the data points, although a partial optimization of the partitioning is proposed as well. This optimization increases the evidence in favour of dependence, but this is not a problem as long as the order of pairs of variables is considered in contrast to the numerical value of the evidence.
We build our methodology on the concept of entropy. Entropy is a notion that is used in various fields, e.g. statistical physics, information theory and mathematics, and that can represent dependencies between variables that go beyond linear relationships (correlation). Its usefulness for SA was discussed before in 2Aud08 (); 2Liu06 (). However, although entropy has a clear mathematical foundation, the estimation of entropy from data sampled from a continuous distribution is not straightforward. Both the use of binning (as in 2Aud08 ()) and the Kullback-Leibler entropy (as in 2Liu06 ()) are difficult, as the results with binning can be sensitive to the choice of bins, while the method to compute the Kullback-Leibler entropy can be sensitive to the parameters of the kernel-density estimation that it uses.
An appealing and elegant method for estimating entropies is due to Hero et al. 2Her03 (); 2Her02 (); 2Her98 (), who proposed to compute minimum spanning trees (MSTs) of the data and to use the length of the MST in an estimator for Rényi entropy. In this study, we employ this method for the purpose of SA, by using the length of the MST as a measure of dependence between variables. In principle, MSTs can be computed exactly, however the computation becomes expensive in case of large datasets. Therefore we test several approximate methods to compute the MST length with reduced computational cost.
In Section 2 we briefly review some theory regarding entropy and estimators of entropy. Afterwards, in Section 3 the approximation methods are discussed. Section 4 consists of validation of the proposed estimator and determining the consistency and robustness of the approximations to it. Section 5 links dependency quantification to sensitivity analysis, of which Section 6 describes one example and one application. Section 7 concludes.
2 Dependence and entropy
In this section we summarize a few basic definitions and concepts regarding dependence and entropy, as well as the connection between the two. This paves the way for defining our proposed estimator for dependency between two variables.
Entropy is a concept which quantifies uncertainty or randomness in a system. It has its origins in classical thermodynamics, but generalizations to other fields, including mathematics, can be made. Shannon’s entropy (information theory) is defined as follows 2Cov06 ():
where represents the possible states of the discrete random variable and their probability of occurring. Its continuous analogue (differential entropy) is given by 2Cov06 ()
where is the domain of the continuous random variable . Rényi has extended Shannon entropy to
for , see 2Cov06 (). In the limit , the Rényi entropy (3), sometimes referred to as -entropy, reduces to the differential entropy (2). For large values of , the events with high probability density determine the value of the entropy, whereas different values of the probability density are weighted more equally when is close to .
2.2 Entropy as measure of dependence
It is well-known that the dependence between two random variables and can be characterized by the mutual information (MI), i.e. the difference between the Shannon entropy of the joint distribution and the sum of the Shannon entropies of the marginals,
see 2Cov06 (). attains it maximum if and are completely dependent, in that case and . Thus, quantifies the strength of dependencies: the higher , the stronger the dependency.
In a similar manner, dependence can be quantified in terms of the Rényi entropy of the joint distribution . The difference can be negative (unlike mutual information), however if we eliminate the effect of the marginal distributions we can use to quantify dependency. The lower , the stronger the dependence. The marginals need to be eliminated because the Rényi entropy is not scale-invariant. Therefore, the value of is strongly influenced by the scaling of and . We can eliminate the effect of the marginals by transforming the data so that all marginals become identical.
We achieve this by applying the rank-transform Con12 (); 3Spe04 (), together with centering, to the input data. The marginals of the rank-transformed data are discrete representations of the uniform distribution . That is, each transformed input variable attains the values , for the number of datapoints and the index of the datapoints. Because of this transform, the Rényi entropy of the transformed input variables becomes a straightforward quantifier of dependence. Since the transformation is monotonic in each of the dimensions, the ordering of the datapoints along each coordinate axis is preserved. The only change is that after the transformation, the distances between datapoints along the coordinate axes are equal (original distances are distorted). Furthermore, outliers are in general less pronounced in the rank-transform than in the original form. Altogether, the rank-transform leaves the structure of the original data intact. We give some numerical examples of the rank-transform and its effects towards the end of this section.
2.3 Estimator of the Rényi entropy
It is not straightforward how to estimate the entropy of a given dataset. One approach is to estimate the distribution underlying the dataset, and compute the entropy from this estimated distribution. Both parametric and non-parametric methods (e.g. kernel density estimation, binning) can be used to estimate the distribution, however the results are quite sensitive to the details of the method used Silverman (). Therefore we employ a different approach for entropy estimation in this study, one in which the data is used directly and no estimation of the distribution is needed so that we can circumvent the sensitivities due to estimating the distribution. This approach can be used to estimate the Rényi entropy with , however not the Shannon entropy. Hence our focus on the Rényi entropy here.
In 2Dud81 (), Shannon entropy has been used for a uniformity test on the unit interval, and 2Kra04 () proposed two classes of estimators for the mutual information based on -nearest neighbor distances. However, it has been shown 2Gao15 () that the number of samples required scales exponentially with the MI itself, which makes accurate estimation between strongly dependent variables almost impossible. An improved estimator is proposed, but requires the setting of an extra parameter.
Hero & Michel describe in three papers 2Her03 (); 2Her02 (); 2Her98 () direct methods to estimate the Rényi entropy, which are based on constructing minimal graphs spanning the datapoints in the domain. The estimator, which is asymptotically unbiased, is
with denoting the dataset, with the dimensionality of the domain of and the number of data points. is a constant only depending on and the definition of . Finally, is a functional, defined as
where is the set of spanning trees on and denotes an edge. The parameter can be freely chosen within the interval , where denotes the dimension (number of input variables) of . Furthermore, must satisfy the constraint . The choice of determines , and the given bound on guarantees to be in . Estimator (5) is strongly consistent for . Cases with are not considered. If is chosen to be and denotes the Euclidean distance between datapoints in , then describes the length of the minimum spanning tree (MST) on the dataset. If the dependence between two input variables is computed, then and thus . The effect of varying and thereby is largely outside of the scope of this study. Note that is computationally a convenient choice. Furthermore, it has been proven that for a related quantity, the -divergence, the theoretically optimal value for distinguishing between two densities which are close to each other on the basis of -divergence is 3Her01 (). Also, the only value of for which the -divergence is monotonically related to a true distance metric between the densities is 3Her01 ().
Here, we propose to use the Rényi entropy with instead of the Shannon entropy since (i) Equation (5) can not conveniently be computed for , because it requires taking the limit of to 0, which also influences and (ii) is both numerically and theoretically a convenient choice as argued above.
Because of the scaling by the rank-transform as proposed earlier, in case of independence the MST will approximately cover the full domain (the unit square, if ). In case of dependence, either the density is nonuniform, or some areas of the domain contain no datapoints. In both cases, the average edge length decreases and so does the total length of the MST. This is illustrated in Figure 1. If the length becomes shorter, the entropy estimate becomes smaller and the estimated dependence increases. Note that there is no assumption on the shape or structure of the dependence.
2.4 Quantifier of dependence
We propose to use the following quantity to estimate (or rather quantify) the dependence between two (rank-transformed) variables:
where is a dataset of the variables containing datapoints. We set , as discussed in the previous section. With and the dimension of the data () given, it follows that . We note that . Thus, there is a linear relation between and , as and are constants as long as the dimension () and do not change. Thus, for quantifying and comparing dependencies of different two-dimensional datasets, using (7) is equivalent to using (5), yet simpler in practice as we do not have to provide the constant , which cannot be obtained easily. In fact, the relation between (7) and (5) is affine.
Besides obtaining an ordering in terms of the dependency strength, we can use (7) to construct a reference level for distinguishing between independent and dependent variables. This construction consists of computing on different datasets of rank-transformed bivariate uniformly distributed data. From this we obtain an empirical distribution, based on samples, for for independent variables (we recall that data from independent variables always leads to data uniformly distributed on a grid after the rank-transform). The reference level can then be defined as the -quantile of the empirical distribution for small, since decreases with increasing dependence. This leads to the statistical test for dependence with the null hypothesis (independence) rejected if
in which is the 0.01 or 0.05 quantile of the empirical distribution for for independent variables.
2.5 Proof of concept
First, we compute the quantifier of dependence (7) on multiple datasets with varying distributions to analyze its behavior. Then, its convergence and robustness for increasing values of are studied. For the distributions considered in this section, it is possible to compute the Rényi entropy with high accuracy for the scaled (i.e., rank-transformed) distribution such that a comparison can be made between the behavior of (7) and (3).
We use datasets sampled from the following distributions: (i) a bivariate uniform distribution, (ii) a standard normal distribution with varying correlation coefficient , (iii) a constant density on a region within the unit square with area , in which the data is focused in the lower left corner (corner distribution), and (iv) a constant density on a region within the unit square with area , but in which the data is focused on a symmetry axis (line distribution). For the latter two cases, the region includes values near the minimum and maximum of both variables, such that the ranges of the region for varying stay the same. These distributions are also referred to as shape distributions. Figure 5 shows examples of these distributions. For each of the four distributions, datasets are generated with datapoints in every dataset. and are varied in steps of from to .
In Figure 2, the empirical cumulative distribution function (CDF) of (7) is plotted for data from the bivariate uniform distribution. It can be seen that the distribution is concentrated in a rather narrow interval. At the end of this section we explore how this interval becomes more or less narrow as the amount of data changes.
In Figure 3, the empirical CDF of is plotted for both the normal distribution (case (ii)) and the two shape distributions (case (iii) and (iv)), for different small values of and . More precisely, for each dataset sampled from one of these distributions, we first apply the rank-transform and then evaluate (7) for the rank-transformed data. The empirical CDF for the uniform distribution, already shown in Figure 2, is included in black for comparison. If or , there is no dependence anymore and the empirical CDF obtained from the normal distribution or shape distributions coincides with the empirical CDF from the bivariate uniform distribution (modulo differences due to finite sample size, ).
It can be seen that for the shape distributions, the quantifier of dependence is more distinctive for small values of than it is for the normal distribution with small . Hence, weak dependencies in the shape distributions can more easily be detected than in the normal distribution. To demonstrate the behavior of for the full range of values of and , we plot the mean together with the empirical 95% confidence intervals in Figure (b)b. In the case of , the distribution of the uniform (independent) distribution is recovered, indicated by a different color.
From Figure (a)a, it can be seen why the differences between the CDFs are small in case of the normal distribution: for the normal distribution, the entropy is nearly flat as a function of , for small values. Figure (b)b shows the rank-transform has a more severe effect on the corner than on the line distribution, since the estimates for the corner distribution decrease slower. The distortion of the shape of the graph for the corner distribution between Figure (a)a and (b)b is caused by the distortion of the distances between data points after the rank-transform. Note that both shape distributions have the same Rényi entropy if the rank-transform is not applied. The effect of the rank-transform in this case can be seen in Figure 5. The closer is to 1, the more the data falls in the two boxes for the corner distribution, due to the combination of skewness and discontinuity. This does not hold for the line distribution. In the limit of and , the entropy of the rank-transformed corner distribution goes to , while it goes to for the rank-transformed line distribution. Hence, it is consistent that does not go to for the corner distribution. We note that the shape distributions are quite artificial and have discontinuous density, while datasets in practice are usually samples from a distribution with a smooth density and a less artificial shape.
For completeness, we give the values for with the normal distribution with parameter and for varying in Figure 6. It can be seen that the differences are small, apart from a shift that is constant in . This constant shift has no consequence for our goal of quantifying dependencies by means of entropy. In particular, for all values of shown, the entropy is nearly flat as a function of as . Thus, other values than offer no advantage in this respect, supporting our choice for .
We conclude this section by investigating the effect of varying the size of the dataset, . We compute the empirical CDF of using , and . For this computation we increase to to obtain a smooth CDF. The resulting empirical CDFs are shown in Figure 7. It can be seen that the distribution becomes narrower with increasing , as was to be expected. Thus, larger makes the comparison of estimates easier due to the smaller confidence intervals involved.
We note that for higher values of , the computations become expensive due to the computation of the edge lengths before computing the MST. The order of this operation is . Kruskal’s method for computing MSTs 2Kru56 () runs in time, but the steps are faster than the ones needed for computing the edge lengths. Prim’s method for computing MSTs 2Pri57 () can be faster (), but the implementation is more involved. The high cost of computing the exact MST motivates us to investigate approximate algorithms in the next section.
3 Approximation methods
For large datatsets (i.e., large ), the computational cost of evaluating the estimator (7) becomes prohibitively high, due to the computational complexity of constructing the MST. In this section we discuss methods to reduce the computational cost by approximating the value of (7) evaluated on a large dataset. We consider three types of approximation: first, MSTs can be computed on multiple subsets of the data. The second type aggregates data points into clusters, such that only one MST calculation is performed on the cluster centers. The third type clusters the data points, constructs MSTs on each cluster and combines them in a smart way 2Zho15 ().
3.1 Sampling-based MST
In this method, the dataset is split in subsets and (7) is computed on each of the subsets. Thus, estimates of are obtained. Their arithmetic mean becomes the new estimate, while their variance is a measure for the quality of the new estimate. The number can be chosen freely, although it represents a trade-off between accuracy and computation time (since both accuracy and computational time decrease with increasing ).
The splitting can be done in various ways, of which random and stratified are the most straightforward ones. In the random splitting, datapoints are allocated randomly to one of the subsets, which all have size (rounding is neglected). In the proportional splitting, clusters are generated with the -means method 3Cel13 (); Ste56 () and these are considered the strata. The data points in each stratum are then proportionally allocated to the subsets. The number of clusters involved () would be another parameter, which we choose equal to .
3.2 Cluster-based MST
Another way of reducing the computational burden would be to cluster the data in a large number of clusters and compute the MST on the cluster centers. In this case, the clustering method can be chosen, as well as the number of clusters. To be consistent with the previous method, we define here the number of clusters to be , such that the number of points in the MST is similar to the previous method. Furthermore, it is possible to include the size (weight) of the clusters as well. Two different clustering methods are investigated: -means Ste56 () and PCA-based clustering 2Egg18 (). The construction of the MST is performed both without and with weighting. The weighting is harmonic, which implies that in the construction of the MST, the edge between datapoints and , , is replaced by
where is the number of cluster and , are the weights of the clusters, computed by the fraction of datapoints in that cluster.
3.3 Multilevel MST
This method is developed by Zhong et al. 2Zho15 () and is also called FMST (fast MST). It is based on the idea that to find a neighbor of a data point, it is not necessary to consider the complete dataset of size . In the FMST method, the dataset is partitioned in clusters via the -means clustering method, and MSTs are constructed on each of the clusters. In a next step, a MST is constructed on the cluster centers to determine which clusters get connected at their boundaries. Between these clusters, the shortest connecting edges are computed. Because this is heuristic, the complete process is repeated with the midpoints of the edges connecting different MSTs being the initial cluster centers for the -means method. The resulting two MSTs are merged into a new graph and its MST is computed as final outcome of the method. 2Zho15 () report good results with this method, which has a computational complexity of . Errors occur only if points that should be connected end up in different clusters twice, which does not occur often and has only a minor effect. For high-dimensional datasets, erroneous edges are included more often, but the effect thereof is smaller. Both are due to the curse of dimension.
The accuracy, robustness and computational time of the methods explained before are tested on bivariate uniform datasets with and bivariate strongly dependent datasets with . The strongly dependent datasets are chosen as projections of the dataset used in Section 6.1. Hence, the dataset exists of combinations of , , and . Since this leads to 6 projections per dataset, we choose in this case. In these cases, it is computationally expensive but still feasible to compute the full (exact) MST, and we do so to be able to compare results from the three approximate methods with the method based on the full MST. For the approximate methods, we choose the parameter to be . The results can be found in Figure 8. We mention here the effect of the implementation of -means, as the choice of initialization can affect the results. We use the -means++ initialization algorithm, repeat the clustering 10 times from different initializations and select the result with the best clustering criterion.
The error estimate is computed as the absolute difference between the approximated and exact value of . The FMST has consistently small error, while other methods have larger error. In our opinion, it is important the approximation is accurate and robust, i.e., has little variation in its error over multiple runs of data with the same distribution. Furthermore, it should perform well for both dependent and independent data. Therefore, we propose to use the FMST to compute approximations to the MST in case where the dataset is large (say, ).
4 Validation of the proposed FMST estimator
In this section, we further investigate using the FMST method with the estimator (7). First, the effect of the size of the dataset is investigated together with the robustness of the FMST estimator for relatively small . Then, the FMST estimator for dependence is tested for behavior and consistency using datasets sampled from the three distributions considered before (uniform, normal and shape distributions).
It is straightforward that approximations to the MST using the correct edge distances overestimate the length of the MST. Therefore, we compare the empirical distributions of based on the MST (Figure 7) to the ones based on the FMST in Figure 9. The number of repetitions is for , and for and . It can be seen that the distributions are different, but only shifted. Furthermore, the width of the distribution decreases to almost zero for and . This means the estimator is robust to sampling effects. Note that there is still a difference between the distributions for and . This is due to the bias caused by approximating the MST, which reduces for increasing .
We repeated the experiment from Section 2.5, but now with the estimator based on FMST to evaluate the effect of the FMST approximation on different data distributions. The results are in Figure 10. Again, the behavior of the estimator based on FMST is the same as for MST, but only shifted upwards. This bias is very small for the line distribution. In these tests, in order to compare them to the previous results.
Based on these results, we conclude that the FMST estimator is a good and robust approximation of the MST estimator. It has a positive bias, but for the purpose of comparing and ranking strength of dependencies as quantified by the FMST estimator, this bias does not pose a problem.
5 From dependency quantification to sensitivity analysis
As already briefly mentioned in the introduction, the method proposed here to quantify dependencies via the MST can contribute to performing SA. If we have a -dimensional dataset consisting of data for input variables () and one output variable , we can quantify the dependencies by computing for all combinations . The result can help to quantify sensitivity. In particular, it enables us to rank the input variables from most important (strongest dependency of input and output, largest sensitivity) to least important (weakest dependency of input and output, smallest sensitivity).
We note that the first step in our proposed method consists of performing a rank-transform on the data, before computing the MST. Clearly, this step is needed as well in case the data includes an output variable.
The ranking of input variables from most to least important and quantification of sensitivity of the output with respect to different inputs is a main goal of SA. Different from other SA methods, there is no assumption of independent inputs needed for the method we propose. The idea to use the quantification of dependencies between inputs and outputs by means of entropy for purposes of SA was proposed before by 2Aud08 (); 2Liu06 (), however different methods were used in these studies to estimate entropy (or a proxy thereof) compared to what we propose. Furthermore, in these studies the case with multiple dependent inputs was not considered.
It must be mentioned here that by considering combinations of one input and one output, higher-order effects (interactions) cannot be explored. MSTs on higher-dimensional combinations of input and output variables may be a way to generalize this approach for higher-order effects. We leave this for further study. Note that the number of combinations to be computed grows fast with the number of inputs.
For comparison purposes, in the next section we include an example with independent input variables and one output variable, of which the Sobol’s (total) indices can be computed analytically. These serve as a reference to test the proposed method on its qualities for sensitivity analysis. We compare our estimates to Sobol’s and Sobol’s total indices.
The proposed methods are applied to two testcases in this section. In the first case, we use the Ishigami function 3Ish90 () and evaluate it using randomly sampled synthetic data as inputs. We consider data from two different input distributions, one without dependencies (uniform) and one with strong dependencies. On all combinations of variables (input and output), the dependence is quantified and, for the uniform dataset, compared to the values of the Sobol (total) indices. In the second testcase, measurements on wave conditions at sea are investigated, of which the resulting sediment transport is computed.
6.1 Ishigami function
The parameters and are chosen to be and , respectively, in accordance with 2Cre09 (). One dataset is four-dimensional and uniformly distributed, while the other one is generated as
and range-normalized to the unit hypercube. Note that there is one variable () which is not used in the Ishigami function. This is on purpose to show the effect of confounders. The MSTs are approximated with both the MST and the FMST method and (7) is used to compute the dependencies both between input variables and between the input variables and the output variable. The ordering is , , , et cetera. Because of this ordering, the pairs numbered 4, 7, 9 and 10 represent combinations of one input and the output variable (, respectively).
The datasets have been generated times with samples each and the results are in Figure 11. First of all, it can be seen that the estimates are robust, as for each set (or pair) the estimates are all in a narrow interval (indicated by the minimum and maximum estimates).
Furthermore, it can be seen that for the uniform dataset (Figure (a)a), only the input variables and are found to have an effect on the output. For most sets, the dependence estimate is slightly below -0.4, the value attained in case of independence. One can compare this with Figure 9, where it can be seen that in case of two uniformly distributed independent variables, the estimate is slightly below -0.4 in case of . Indeed, the input variables are all independent in case of this uniform dataset. Also, by construction the Ishigami function does not depend on . The independence of and is nontrivial, however it is consistent with the analysis using Sobol indices (as discussed below). is not directly dependent on , although there is an interaction effect with . This interaction effect is not detected with the analysis of pairwise dependencies.
The exceptions to independence are sets 4 and 7 (pairs and ), for which the estimate is lower, indicating dependence. The dependence of and is stronger than between and , visible in the smaller value for set 7. The (in)dependence of on the various inputs is illustrated in Figure 12 where scatterplots of the input-output pairs are shown.
For the strongly dependent dataset, the analysis is less straightforward. Again, the combinations and are strongly dependent, while shows a weak dependence. Because of the structure of the dataset, all input variables are mutually dependent. Furthermore, the dependencies within the input data are stronger than the dependencies of input variables with the output, except for the combination. The relative ordering of dependencies (strongest for , weaker for , very weak for and ) is consistent with the definition of the Ishigami function and the intuition gleaned from the right panels of Figure 12.
We continue by comparing the means of the estimates for the input-output combinations for the uniform dataset based on FMST with the values of its Sobol indices and Sobol total indices in Table 1.
For the Ishigami function, the sum of all and first-order interaction terms equals 1, and only , and are nonzero. Since , is not involved in interaction terms, while does not have an effect by itself. The term contributes to both and . From the numerical values for the Sobol indices, summarized in Table 1, we conclude that is most important by itself, while is most important when interactions with other variables are included.
The proposed estimator behaves similar to . It is approximately in case of independence (exact value depends on , higher leads to a smaller value, see Figure 9), and drops below this value if the variables are dependent. The values of in Table 1 lead to the same conclusion as those of , of being the most important here, followed by . The estimator does not have the sum property, nor does it compute estimates for interaction effects. This may be possible by MSTs on higher-dimensional combinations of input and output variables, as mentioned in Section 5.
In practice, it might be beneficial to compute all the Sobol indices when the input variables are independent and not large in number. However, our prime interest in this study is in general cases with dependent input variables given to us in the form of datasets. In these cases, Sobol indices are difficult to calculate, especially if one wants to take the dependencies into account. By contrast, can be computed easily for dependent variables. For dependent variables, the expressions for the Sobol indices become more comprehensive and the interpretation becomes less straightforward, due to the possibility of negative values for the indices.
6.2 Sediment transport
In this testcase, we consider data of physical quantities related to sediment transport in the sea. The data consists of measurements of three variables, and simulations of one intermediate variable and one output variable obtained at the North Sea coast near the town of Noordwijk (the Netherlands), between June 19, 1984 and October 5, 1987. The measured variables are root-mean-square wave height (m), peak wave period (s), wave direction () (relative to shore-normal). The output is the cross-shore sediment transport (m/s) computed with the Unibest-TC module 3Rue07 (); 2Wal12 (); 2Wal00 (). Also available is the long-shore sediment transport (m/s) obtained from brute force simulation without computation of the bed changes. Depending on the application, it can be treated as an input or an output variable.
We explore the dependencies between the input variables and output variables . However, before discussing the results, we want to point out a phenomenon frequently occurring in real-world datasets. Because of the accuracy of the measuring device, there are only 60 different values for . This makes the data, although rank-transformed including the multiples, differing substantially from uniform. We correct for this in the rank-transform by random positioning within groups with the same value. Similarly, for and only 379 and 353 unique values, respectively, are present in the dataset. This leads to a few observed pairs occurring twice, so that the distance between these pairs is zero. Some MST algorithms treat this as a non-existing edge, resulting in larger values of the MST length. We circumvent this by adding a small noise to the normalized data matrix (), such that the datapoints are not exactly at the same location anymore, although the effect is too small to affect the ordering by the rank-transform.
The resulting normalized lengths are given in Table 2.
It is found that the strongest dependency is between and . This is not a surprise, since the long-shore and cross-shore sediment transport are expected to be dependent. The weakest dependencies are between and and between and . The latter indicates that the root-mean-square wave height and wave direction are only slightly correlated.
In this study we have proposed a novel method for quantifying dependencies in multivariate datasets by means of the minimum spanning tree (MST). The length of the MST is directly related to the Rényi entropy of the data, which in turn is a suitable quantity to assess dependency. Our approach to assess the relative strength of dependencies via the length of the MST can be used as an aid for sensitivity analysis, as discussed in Section 5 and demonstrated with numerical examples in Section 6.
Constructing the MST is computationally expensive for large datasets. To reduce the computational cost, algorithms that approximate the exact MST can be considered. We have compared three such approximation algorithms (Section 3), and found the multilevel FMST algorithm due to 2Zho15 () to be the most accurate.
The method proposed here can be considered as an alternative to the widely used variance-based sensitivity analysis methods. Because it is based on entropy rather than variance, it is easier to incorporate the dependencies. Furthermore, it is well-suited for SA in cases where the input distributions are unknown and only a sample (dataset) of the inputs is available.
We have tested our approach on the Ishigami function as well as on a real-world testcase involving sediment transport in the North Sea. In both cases we obtained rankings of dependencies consistent with prior knowledge or heuristic understanding. Moreover, in the Ishigami test case we included an example with independent input variables, making it possible to compare our approach with the analysis based on Sobol indices. The results from our proposed method were consistent with those from the Sobol indices.
Altogether, the approach proposed here is a suitable and useful method to quantify dependencies and to aid in the performance of SA. It gives consistent results and remains computationally feasible even for large datasets by using the FMST algorithm. A limitation of the method as presented in this paper is that it does not account for interaction effects, see the discussion in Section 5. However, as also briefly discussed in Section 5, we anticipate that this method can be generalized by constructing MSTs in three (or more) dimensions, making it possible to account for interaction effects as well. We intend to explore this generalization in a future study.
We would like to thank Bruna de Queiroz and Deltares for providing the sediment transport data. This research is part of the EUROS program, which is supported by NWO domain Applied and Engineering Sciences and partly funded by the Ministry of Economic Affairs.
- (1) O. P. Le Maître, O. M. Knio, Spectral Methods for Uncertainty Quantification: With Applications to Computational Fluid Dynamics, Scientific Computation, Springer, 2010. doi:10.1007/978-90-481-3520-2.
- (2) T. J. Sullivan, Introduction to Uncertainty Quantification, Vol. 63, Springer, 2015. doi:10.1007/978-3-319-23395-6.
- (3) T. Crestaux, O. Le Maître, J.-M. Martinez, Polynomial chaos expansion for sensitivity analysis, Reliability Engineering & System Safety 94 (7) (2009) 1161–1172. doi:10.1016/j.ress.2008.10.008.
- (4) J. C. Helton, J. D. Johnson, C. J. Sallaberry, C. B. Storlie, Survey of sampling-based methods for uncertainty and sensitivity analysis, Reliability Engineering & System Safety 91 (10) (2006) 1175–1209. doi:10.1016/j.ress.2005.11.017.
- (5) T. Homma, A. Saltelli, Importance measures in global sensitivity analysis of nonlinear models, Reliability Engineering & System Safety 52 (1) (1996) 1–17. doi:10.1016/0951-8320(96)00002-6.
- (6) B. Iooss, P. Lemaître, A Review on Global Sensitivity Analysis Methods, in: Uncertainty Management in Simulation-Optimization of Complex Systems, Springer, 2015, pp. 101–122. doi:10.1007/978-1-4899-7547-8_5.
- (7) B. Krzykacz-Hausmann, An approximate sensitivity analysis of results from complex computer models in the presence of epistemic and aleatory uncertainties, Reliability Engineering & System Safety 91 (10) (2006) 1210–1218. doi:10.1016/j.ress.2005.11.019.
- (8) H. Liu, W. Chen, A. Sudjianto, Relative Entropy Based Method for Probabilistic Sensitivity Analysis in Engineering Design, Journal of Mechanical Design 128 (2) (2006) 326–336. doi:10.1115/1.2159025.
- (9) J. E. Oakley, A. O’Hagan, Probabilistic sensitivity analysis of complex models: a Bayesian approach, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66 (3) (2004) 751–769. doi:10.1111/j.1467-9868.2004.05304.x.
- (10) A. Saltelli, Making best use of model evaluations to compute sensitivity indices, Computer Physics Communications 145 (2) (2002) 280–297. doi:10.1016/s0010-4655(02)00280-1.
- (11) A. Saltelli, P. Annoni, I. Azzini, F. Campolongo, M. Ratto, S. Tarantola, Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index, Computer Physics Communications 181 (2) (2010) 259–270. doi:10.1016/j.cpc.2009.09.018.
- (12) A. Saltelli, M. Ratto, T. Andres, F. Campolongo, J. Cariboni, D. Gatelli, M. Saisana, S. Tarantola, Global sensitivity analysis: the primer, John Wiley & Sons, 2008. doi:10.1002/9780470725184.
- (13) I. M. Sobol’, Sensitivity Estimates for Nonlinear Mathematical Models, Mathematical Modeling & Computational Experiments 1 (4) (1993) 407–414.
- (14) B. Sudret, Global sensitivity analysis using polynomial chaos expansions, Reliability Engineering & System Safety 93 (7) (2008) 964–979. doi:10.1016/j.ress.2007.04.002.
- (15) T. G. Trucano, L. P. Swiler, T. Igusa, W. L. Oberkampf, M. Pilch, Calibration, validation, and sensitivity analysis: What’s what, Reliability Engineering & System Safety 91 (10) (2006) 1331–1357. doi:10.1016/j.ress.2005.11.031.
- (16) H. M. Wagner, Global Sensitivity Analysis, Operations Research 43 (6) (1995) 948–969. doi:10.1287/opre.43.6.948.
- (17) Y. Yamanishi, Y. Tanaka, Sensitivity Analysis in Functional Principal Component Analysis, Computational Statistics 20 (2) (2005) 311–326. doi:10.1007/bf02789706.
- (18) R. Ghanem, D. Higdon, H. Owhadi, Handbook of uncertainty quantification, Springer, 2017. doi:10.1007/978-3-319-12385-1.
- (19) G. Chastaing, F. Gamboa, C. Prieur, Generalized Sobol sensitivity indices for dependent variables: Numerical methods, Journal of Statistical Computation and Simulation 85 (7) (2015) 1306–1333. doi:10.1080/00949655.2014.960415.
- (20) G. Chastaing, F. Gamboa, C. Prieur, et al., Generalized Hoeffding-Sobol decomposition for dependent variables - application to sensitivity analysis, Electronic Journal of Statistics 6 (2012) 2420–2448. doi:10.1214/12-ejs749.
- (21) S. Da Veiga, F. Wahl, F. Gamboa, Local Polynomial Estimation for Sensitivity Analysis on Models With Correlated Inputs, Technometrics 51 (4) (2009) 452–463. doi:10.1198/tech.2009.08124.
- (22) S. Dougherty, Sensitivity analysis of models with input codependencies, Ph.D. thesis, Queen’s University, Kingston (2013).
- (23) S. Kucherenko, S. Tarantola, P. Annoni, Estimation of global sensitivity indices for models with dependent variables, Computer Physics Communications 183 (4) (2012) 937–946. doi:10.1016/j.cpc.2011.12.020.
- (24) T. A. Mara, S. Tarantola, Variance-based sensitivity indices for models with dependent inputs, Reliability Engineering & System Safety 107 (2012) 115–121. doi:10.1016/j.ress.2011.08.008.
- (25) C. Xu, G. Z. Gertner, Uncertainty and sensitivity analysis for models with correlated parameters, Reliability Engineering & System Safety 93 (10) (2008) 1563–1573. doi:10.1016/j.ress.2007.06.003.
- (26) I. M. Sobol’, Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates, Mathematics and Computers in Simulation 55 (1) (2001) 271–280. doi:10.1016/s0378-4754(00)00270-6.
- (27) M. Navarro, J. A. S. Witteveen, J. Blom, Polynomial chaos expansion for general multivariate distributions with correlated variables (June 2014). arXiv:1406.5483.
- (28) S. Filippi, C. C. Holmes, et al., A Bayesian Nonparametric Approach to Testing for Dependence Between Random Variables, Bayesian Analysis 12 (4) (2017) 919–938. doi:10.1214/16-ba1027.
- (29) B. Auder, B. Iooss, Global sensitivity analysis based on entropy, in: Safety, reliability and risk analysis-Proceedings of the ESREL 2008 Conference, 2008, pp. 2107–2115. doi:10.1.1.595.3509.
- (30) A. O. Hero, J. Costa, B. Ma, Asymptotic Relations Between Minimal Graphs and -entropy, Tech. Rep. 334, Comm. and Sig. Proc. Lab.(CSPL), Dept. EECS, University of Michigan, Ann Arbor (2003).
- (31) A. O. Hero, B. Ma, O. J. J. Michel, J. Gorman, Applications of Entropic Spanning Graphs, IEEE signal processing magazine 19 (5) (2002) 85–95. doi:10.1109/MSP.2002.1028355.
- (32) A. O. Hero, O. J. J. Michel, Robust Entropy Estimation Strategies Based on Edge Weighted Random Graphs, in: SPIE’s International Symposium on Optical Science, Engineering, and Instrumentation, International Society for Optics and Photonics, 1998, pp. 250–261. doi:10.1117/12.323804.
- (33) T. M. Cover, J. A. Thomas, Elements of information theory 2nd edition, John Wiley & Sons, 2006. doi:10.1002/047174882X.
- (34) W. J. Conover, The rank transformationâan easy and intuitive way to connect many nonparametric methods to their parametric counterparts for seamless teaching introductory statistics courses, Wiley Interdisciplinary Reviews: Computational Statistics 4 (5) (2012) 432–438. doi:10.1002/wics.1216.
- (35) C. Spearman, The proof and measurement of association between two things, The American journal of psychology 15 (1) (1904) 72–101. doi:10.2307/1412159.
- (36) B. W. Silverman, Density Estimation for Statistics and Data Analysis, Springer US, 1986. doi:10.1007/978-1-4899-3324-9.
- (37) E. J. Dudewicz, E. C. van der Meulen, Entropy-Based Tests of Uniformity, Journal of the American Statistical Association 76 (376) (1981) 967–974. doi:10.1080/01621459.1981.10477750.
- (38) A. Kraskov, H. Stögbauer, P. Grassberger, Estimating mutual information, Physical review E 69 (6) (2004) 066138. doi:10.1103/physreve.69.066138.
- (39) S. Gao, G. Ver Steeg, A. Galstyan, Efficient Estimation of Mutual Information for Strongly Dependent Variables, in: AISTATS, 2015.
- (40) A. O. Hero, B. Ma, O. Michel, J. Gorman, Alpha-divergence for classification, indexing and retrieval, Tech. Rep. CSPL-328, Communication and Signal Processing Laboratory, University of Michigan (2001).
- (41) J. B. Kruskal, On the shortest spanning subtree of a graph and the traveling salesman problem, Proceedings of the American Mathematical society 7 (1) (1956) 48–50. doi:10.1090/S0002-9939-1956-0078686-7.
- (42) R. C. Prim, Shortest connection networks and some generalizations, Bell Labs Technical Journal 36 (6) (1957) 1389–1401. doi:10.1002/j.1538-7305.1957.tb01515.x.
- (43) C. Zhong, M. Malinen, D. Miao, P. Fränti, A fast minimum spanning tree algorithm based on -means, Information Sciences 295 (2015) 1–17. doi:10.1016/j.ins.2014.10.012.
- (44) M. E. Celebi, H. A. Kingravi, P. A. Vela, A comparative study of efficient initialization methods for the k-means clustering algorithm, Expert systems with applications 40 (1) (2013) 200–210. doi:10.1016/j.eswa.2012.07.021.
- (45) H. Steinhaus, Sur la division des corps matériels en parties, Bulletin de l’Académie Polonaise des Sciences IV (12) (1956) 801–804.
- (46) A. W. Eggels, D. T. Crommelin, J. A. S. Witteveen, Clustering-based collocation for uncertainty propagation with multivariate dependent inputs, to appear in International Journal for Uncertainty Quantification.
- (47) T. Ishigami, T. Homma, An Importance Quantification Technique in Uncertainty Analysis for Computer Models, in: Uncertainty Modeling and Analysis, 1990. Proceedings., First International Symposium on, IEEE, 1990, pp. 398–403. doi:10.1109/isuma.1990.151285.
- (48) B. Ruessink, Y. Kuriyama, A. Reniers, J. Roelvink, D. Walstra, Modeling cross-shore sandbar behavior on the timescale of weeks, Journal of Geophysical Research: Earth Surface 112 (F3) (2007) F03028. doi:doi:10.1029/2006JF000730.
- (49) D. Walstra, A. Reniers, R. Ranasinghe, J. Roelvink, B. Ruessink, On bar growth and decay during interannual net offshore migration, Coastal Engineering 60 (2012) 190–200. doi:10.1016/j.coastaleng.2011.10.002.
- (50) D. J. R. Walstra, Unibest-TC Userguide, Tech. rep., WL Delft Hydraulics (2000).