Quantitative model validation techniques: new insights
Abstract
This paper develops new insights into quantitative methods for the validation of computational model prediction. Four types of methods are investigated, namely classical and Bayesian hypothesis testing, a reliabilitybased method, and an area metricbased method. Traditional Bayesian hypothesis testing is extended based on interval hypotheses on distribution parameters and equality hypotheses on probability distributions, in order to validate models with deterministic/stochastic output for given inputs. Formulations and implementation details are outlined for both equality and interval hypotheses. Two types of validation experiments are considered  fully characterized (all the model/experimental inputs are measured and reported as point values) and partially characterized (some of the model/experimental inputs are not measured or are reported as intervals). Bayesian hypothesis testing can minimize the risk in model selection by properly choosing the model acceptance threshold, and its results can be used in model averaging to avoid Type I/II errors. It is shown that Bayesian interval hypothesis testing, the reliabilitybased method, and the area metricbased method can account for the existence of directional bias, where the mean predictions of a numerical model may be consistently below or above the corresponding experimental observations. It is also found that under some specific conditions, the Bayes factor metric in Bayesian equality hypothesis testing and the reliabilitybased metric can both be mathematically related to the value metric in classical hypothesis testing. Numerical studies are conducted to apply the above validation methods to gas damping prediction for radio frequency (RF) microelectromechanical system (MEMS) switches. The model of interest is a general polynomial chaos (gPC) surrogate model constructed based on expensive runs of a physicsbased simulation model, and validation data are collected from fully characterized experiments.
1Introduction
Model validation is defined as the process of determining the degree to which a model is an accurate representation of the real world from the perspective of the intended use of the model [?]. Qualitative validation methods such as graphical comparisons between model predictions and experimental data are widely used in engineering. However, statisticsbased quantitative methods are needed to supplement subjective judgments and to systematically account for errors and uncertainty in both model prediction and experimental observation [?].
Previous research efforts include the application of statistical hypothesis testing methods in the context of model validation [?], and development of validation metrics as measures of agreement between model prediction and experimental observation [?]. Some discussions on the pros and cons of these validation methods can be found in [?]. Based on these existing methods and the related studies, this paper is motivated by several issues which remain unclear in the practice of model validation: (1) validation with fully characterized vs. partially characterized experimental data; (2) validation of deterministic vs. stochastic model predictions; (3) accounting for the existence of directional bias; and (4) choice of thresholds in different validation metrics.
First, there are two possible types of validation data, resulting from (1) fully characterized experiments (i.e., all the inputs of the model/experiment are measured and reported as point values) or (2) partially characterized experiments (i.e., some inputs of the model/experiment are not measured or are reported as intervals). For instance, some input variables of the model/experiment may not be measured, but we may have expert opinions about the possible ranges or probability distributions of these input variables, and thus this experiment is “partially” characterized. In other words, there will be more uncertainty in the data from partially characterized experiments than from fully characterized experiments, due to the uncertainty in the input variables. Some partially characterized experiments with limited uncertainty may be considered for validation by practitioners. The term “input” is referred to as the variables in a model that can be measured in experiments. We assume that the same set of variables goes into the model and validation experiments as inputs, and we are comparing the outputs of the model and experiments during validation. Therefore, the terms “model inputs” and “ experimental inputs” mean the same thing in this paper. When a model is developed, the physical quantity is postulated to be a function of a set of variables . This function is not exactly known and hence is approximated using a model with output . is observable through some experiments and are the measurable inputs variables to the experiments. Note that are the variables that cannot be measured in the experiments and are called as “model parameters”. A simple example of the measurable experimental inputs is the amplitude of loading applied on a cantilever beam, while the deflection of the beam is the measured quantity. Also note that the diagnostic quality and the bias in experiments are not considered as “input”. Instead, they are classified as components of the measurement uncertainty, which is represented by in this paper. While most of the previous studies only focus on validation with fully characterized experimental data, this paper explores the use of both types of data in various validation methods.
Second, due to the existence of aleatory and epistemic uncertainty, both the model prediction (denoted as ) and the physical quantity to be predicted (denoted as ) can be uncertain, and this has been the dominant case studied in the literature [?]. However, in practice it is possible that either or can be considered as deterministic. Note that is deterministic means that for given values of the model input variables, the output prediction of the model is deterministic. The application of various validation methods to these different cases will be covered in this paper.
Third, in this study, we defined two terms to characterize the difference between model prediction and validation data  bias and directional bias. Bias is defined as the difference between the mean value of model predictions and the statistical mean value of experiment observations, and the term “directional bias” means that the direction of bias remains unchanged as one varies the inputs of model and experiment. This paper explores various validation methods in order to account for the existence of the directional bias.
Fourth, although different validation metrics are usually developed to measure the agreement between model prediction and validation data from different perspectives, this paper shows that under certain conditions some of the validation metrics can be mathematically related. These relationships may help decision makers to select appropriate validation metrics and the corresponding model acceptance/rejection thresholds.
Various quantitative validation metrics, including the value in classical hypothesis testing [?], the Bayes factor in Bayesian hypothesis testing [?], a reliabilitybased metric [?], and an areabased metric [?], are investigated in this paper. Based on the original definition of Bayes factor, we formulate two types of Bayesian hypothesis testing, one on the accuracy of predicted mean and standard deviation of model prediction, and the other one on the entire predicted probability distribution of the model prediction. These two formulations of Bayesian hypothesis testing can be applied to both fully characterized and partially characterized experiments. The use of these two types of experimental data in the other validation methods is also investigated. The first formulation of Bayesian hypothesis testing, along with the modified reliabilitybased method and the area metricbased method, takes into account the existence of directional bias. The mathematical relationships among the metrics used in classical hypothesis testing, Bayesian hypothesis testing, and the reliabilitybased method are investigated.
Section 2 presents the general procedure of quantitative model validation in the presence of uncertainty. Section 3 and Section 4 investigate the aforementioned model validation methods for (1) fully characterized and partially characterized experimental data, (2) application to the case when model prediction and the quantity to be predicted may or may not be uncertain, (3) sensitivity to the existence of the directional bias, and (4) the mathematical relationships among some of these validation methods. A numerical example is presented in Section 5 to illustrate the validation of a MEMS switch damping model, which is a generalized polynomial chaos (gPC) surrogate model [?] that has been constructed to predict the squeezefilm damping coefficient. The gPC model is used to replace the expensive microscale fluid simulation model and thus expedite the probabilistic analysis of the MEMS device.
2Quantitative validation of model prediction
Suppose a computational model is constructed to predict an unknown physical quantity. Quantitative model validation methods involve the comparison between model prediction and experimental observation. In this paper, we use the following notations  represents the “true value” of the system response  is the model prediction of this true response  is the experimental observation of
The development of validation metrics is usually based on assumptions on , , and , and these assumptions relate to the various sources of uncertainty and the types of available validation data. In order to select appropriate validation methods, the first step is to identity the sources of uncertainty and the type of validation data.
As mentioned earlier, the available validation data can be from fully characterized or partially characterized experiments. In the case of fully characterized experiments, the model/experimental inputs are measured and reported as point values. The true value of the physical quantity () and the output of model () corresponding to these measured values of will be deterministic if there are no other uncertainty sources existing in the physical system and the model. Note that and can still be stochastic because of other uncertainty sources other than the input uncertainty. For example, the Young’s modulus of a certain material can be stochastic due to variations in the material microstructure, and the output of a regression model for given inputs is stochastic because of the random residual term. If the experiment is partially characterized, some of the inputs are not measured or are reported as intervals, and thus the uncertainty in should be considered. In the Bayesian approach, the lack of knowledge (epistemic uncertainty) about is represented through a probability distribution (subjective probability). Then, since both and are considered as functions of , they also get treated through probability distributions. Nonprobabilistic approaches have also been proposed to handle the epistemic uncertainty; in this paper, we only focus on probabilistic methods.
Note that results from the addition of measurement uncertainty to the true value of the physical quantity , i.e., , where represents measurement uncertainty. Hence, the uncertainty in the experimental observation () can be split into two parts, the uncertainty in the physical system response () and the measurement uncertainty in experiments (). It should be noted that experimental data with poor quality can hardly provide any useful information on the validity of a model. The discussions in this paper are restricted to the cases where uncertainty in data (due to the uncertainty in measuring experimental input and output variables) is limited.
Table 1 summarizes the applicability of the various validation methods investigated in this paper to the different scenarios discussed above, and more details will be presented in Sections Section 3 and Section 4.
Experimental data 
Quantity (to be predicted) 
Prediction (from model)  Applicable methods 
Fully characterized 
Stochastic  Deterministic  1,2,4 
Deterministic  Stochastic  1,2,4,5  
Stochastic  Stochastic  1,2,3,4,5  
Partially characterized 
Stochastic  Stochastic  1,2,3,4,5 
After selecting a validation method and computing the corresponding metric, another important aspect of model validation is to decide if one should accept or reject the model prediction based on the computed metric and the selected threshold. Section 3 and Section 4 will provide some discussions on the decision threshold. The flowchart in Figure 1 describes a systematic procedure for quantitative model validation.
3Hypothesis testingbased methods
Statistical binary hypothesis testing involves deciding between the plausibility of two hypotheses  the null hypothesis (denoted as ) and the alternative hypothesis (denoted as ). is usually something that one believes could be true, whereas may be a rival of [?]. For example, given a model for damping coefficient prediction, can be the hypothesis that the model prediction is equal to the actual damping coefficient, and correspondingly states that the model prediction is not equal to the actual damping coefficient. The null hypothesis will be rejected if it fails the test, and will not be rejected if it passes the test. Two types of error can possibly occur from this exercise: rejecting the correct hypothesis (type I error), or failing to reject the incorrect hypothesis (type II error). In the context of model validation, it should be noted that the underlying subject matter domain knowledge is also necessary for the implementation of the hypothesis testingbased methods, especially in the formulation of test hypotheses ( and ) and the selection of model acceptance threshold. To formulate appropriate and for the validation of a model with stochastic output prediction , we need to be clear about the physical interpretation of “model being correct”. In other words, we need to decide whether or not the accurate prediction of certain order moments or the entire PDF of suggests that the model is correct.
3.1Classical hypothesis testing
Classical hypothesis testing is well established and has been explained in detail in many statistics textbooks. A brief overview is given here, only to facilitate the development of mathematical relationships between various validation methods in later sections.
In classical hypothesis testing, a test statistic is first formulated and the probability distributions of this statistic under the null and alternative hypothesis are derived theoretically or by approximations. Thereafter, one can compute the value of the test statistic based on validation data and thus calculate the corresponding value, which is the probability that the test statistic falls outside a range defined by the computed value of the test statistic under the null hypothesis. The value can be considered as an indicator of how good the null hypothesis is, since a better corresponds to a narrower range defined by the computed value of the test statistic and thus a higher probability of the test statistic falling outside the range.
The practical outcome of model validation should be to provide useful information for decision making in terms of model selection. The decisions whether or not to reject the null hypothesis can be made based on the acceptable probabilities of making type I and type II errors (specified by the decision maker). The concept of significance level defines the maximum probability of making type I error, and the probability of making type II error can be estimated based on and the probability distribution of the test statistic under . The null hypothesis will be rejected if the computed value is less than , or the computed exceeds the acceptable value. Correspondingly, we will reject the model if is rejected, and accept the model if is not rejected. An alternative approach to comparing value and is to use confidence intervals. A confidence interval can be constructed based on the confidence level , and the null hypothesis will be rejected if the confidence interval does not include the predicted value from the model.
It should be note that accepting the model (or failing to reject ) indicates that the accuracy of the model is acceptable, but it does not prove that the model (or ) is true. Also note that the comparison between value and significance level becomes meaningless when the sample size of experimental data is large. Since almost no null hypothesis is true, the value will decrease as the sample size increases, and thus will tend to be rejected at a given significance level as the sample size grows large [?]. In addition, the overinterpretation of values and the corresponding significance testing results can be misleading and dangerous for model validation. Criticisms on overstressing values and significance levels can be found in [?].
Various test statistics have been developed corresponding to different types of hypotheses. The test or test can be used to test the hypothesis that the mean of a normal random variable is equal to the model prediction; the chisquare test can be used to test the hypothesis that the variance of a normal random variable is equal to the model prediction; and the hypothesis that the observed data come from a specific probability distribution can be tested using methods such as the chisquare test, the KolmogorovSmirnov (KS) test, the AndersonDarling test and the Cramer test [?]. The tests on the variance or the probability distribution require relatively large amounts of validation data and thus only the tests on the distribution mean are discussed in this subsection, namely the test and the test.
The test is based on Student’s distribution. Suppose the quantity to be predicted is a normal random variable with mean and standard deviation , and the measurement noise is also a normal random variable with zero mean and standard deviation . Thus, the experimental observation . For the sake of simplicity, let . The validation data is a set of random samples of with size (i.e., , , ..., ) and the corresponding sample mean is and sample standard deviation is . The variable is modeled with a distribution with degrees of freedom. Therefore, if one assumes that the model mean prediction (if model prediction is deterministic, equals to the prediction value) is the mean of , i.e., the null hypothesis is , then the corresponding test statistic (follows the same distribution) is
The value for the twotailed test (considering both ends of the distribution) can be obtained as
where is the cumulative distribution function (CDF) of a distribution with degrees of freedom. If the chosen significance level is , then one will reject the null hypothesis if , or fail to reject if .
The test requires a sample size in order to estimate the sample standard deviation . If , the test can be used instead by assuming that the standard deviation of is equal to the standard deviation of model prediction , i.e., and . Thus, the variable follows the standard normal distribution. Under the null hypothesis , the test statistic is
The corresponding value for the twotailed test can be computed as
where is the CDF of the standard normal variable. Similar to the test, one will reject if , or fail to reject if .
To compute the probability of making type II error , an alternative hypothesis is needed and a commonly seen formulation is . In test, under the alternative hypothesis , the statistic follows a noncentral distribution with noncentrality parameter [?], the probability of making type II error can then be estimated as
where the term is called the power of the test in rejecting . Similarly, in the test can be estimated as
Note that the above discussion is for the case that both and are stochastic. If is deterministic, the standard deviation becomes zero; if is deterministic, becomes zero. However, the computation procedure of value remains the same.
Applying classical hypothesis testing to fully characterized experiments is straightforward as one can directly compare the data with the model predictions for given inputs. For partially characterized experiments, some of the inputs of the model/experiments are available in the form of intervals or probability distributions based on measurements or expert opinions. Let data that have inputs with the same intervals or probability distributions form a sample set. The aforementioned test and test can then be conducted by comparing the mean of the sample set with the mean of the unconditional probability distribution of model output (“unconditional” means that the probability distribution is not dependent on the point values of inputs). The unconditional probability distribution of model output can be obtained by propagating uncertainty from the input variables to the output variable [?].
3.2Bayesian hypothesis testing
In probability theory, Bayes’ theorem reveals the relation between two conditional probabilities, e.g., the probability of occurrence of an event given the occurrence of an event (denoted as ), and the probability of occurrence of the event given the occurrence of the event (denoted as ). This relation can be written as [?]
Suppose event is the observation of a single validation data point , and event can be either the hypothesis is true or the hypothesis is true. Using Bayes’ theorem, we can calculate the ratio between the posterior probabilities of the two hypotheses given validation data as
where and are the prior probabilities of and respectively, representing the prior knowledge one has on the validity of these two hypotheses before collecting experimental data; and and are the posterior probabilities of and respectively, representing the updated knowledge one has after analyzing the collected experimental data. The likelihood function in Equation 5 is the conditional probability of observing the data given the hypothesis ( or ), and the ratio is known as the Bayes factor [?] and is used as the validation metric.
The original intent of the Bayes factor was to compare the data support for two models [?], and thus the two hypotheses become : model is true and : model is true. If and are the parameters of the two competing models respectively, the Bayes factor is calculated as
where and are the probability density distributions of and respectively.
In the context of validating a single model, and need to be formulated differently. Rebba and Mahadevan [?] proposed the equalitybased formulation () and the intervalbased formulation () for Bayesian hypothesis testing, where is the model prediction for a particular input .
Consider the case when both the model prediction and the quantity to be predicted are random variables. Two null hypotheses can be formulated: (1) the hypothesis that the difference between the means of and , and the difference between the standard deviations of and , are within desired intervals respectively; (2) the hypothesis that the PDF of is equal to the PDF of . With the first formulation, it is straightforward to derive the likelihood functions under the null and alternative hypothesis, and the existence of directional bias can be reflected in the test, as will be shown below. The advantages of the second formulation are that it avoids the setting of interval width in the first formulation, and leads to a direct test on probability distributions instead of distribution parameters. For the case that either or is deterministic, the first formulation can still be applicable by setting the standard deviation of the deterministic quantity to be zero; however, the second formulation only applies to the case when both and are stochastic. These two formulations are applicable to both fully characterized and partially characterized experiments. Note that in the case where the model output follows a tailheavy distribution, formulating hypotheses on higher order moments (instead of the mean and standard deviation) may be necessary in order to assess the validity of the model. In this paper, the prediction distribution of the damping model in the application example is close to a Gaussian distribution. Therefore, we only consider hypotheses on the first two moments (mean and standard deviation) and the entire PDF for the purpose of illustration.
Interval hypothesis on distribution parameters The interval hypothesis can be formulated as , and . and are the means of and respectively, and and are the standard deviations of and respectively. , , and are constants which define the width of interval. Note that , .
Under the interval hypothesis , can be any value between [, ]. So , and the PDF . Similarly, , and the PDF . Thus
In the presence of measurement noise, the experimental observation is a random variable with conditional probability . Hence the likelihood function under the null hypothesis can be derived as
Under the alternative hypothesis , can be any value outside [, ], but the uniform distribution is not applicable to infinite space in practical cases. To avoid this issue, we can assume that the possible values of are within a finite interval [, ] based on the underlying physics. Therefore , and the PDF . Similarly, , and the PDF . thus
where A is calculated as
The likelihood function under can then be derived as
The Bayes factor for the Bayesian interval hypothesis testing can be calculated by dividing in Equation 8 by in Equation 10.
It is straightforward to apply this method to the case that is deterministic and the case that is deterministic. For the first case, let be zero and the rest of the computation remains the same. For the second case, the interval assumption will only be made on and , since we know is zero. The other steps will be the same as above.
The directional bias defined in Section 1 can be captured by conducting two separate Bayesian interval hypothesis tests. In the first test, we set and , and thus under the null hypothesis . In the second test, we set and , and thus under the null hypothesis . The model will fail if any of these two null hypotheses fails the corresponding test. Therefore, the existence of directional bias will increase the chance of a model to fail the combined test. Figure 2 illustrates this combined test using the concept of data space. Suppose is the overall validation data space, is the set of data which does not support the model in the first Bayesian interval hypothesis test, and is the set of data which does not support the model in the second test. Then, the union of and is the set of data that does not support the model combining these two tests.
Equality hypothesis on probability density functions To further validate the entire distribution of predicted by a probabilistic model, or can be formulated correspondingly as the predicted distribution being or not being the true distribution of the quantity to be predicted , i.e., , and . The Bayes factor in this case becomes
where is the conditional probability of observing noisy data given that the actual value of is ; is the PDF of under the null hypothesis and hence ; is the PDF of under the alternative hypothesis . If no extra information about is available, it can be assumed as a noninformative uniform PDF. Note that the bounds of this uniform distribution will affect the value of the estimated Bayes factor, and thus it should be carefully selected according to the available information.
Note that is proportional to the value of the PDF of conditioned on which is evaluated at , i.e., . Therefore, Equation 11 can be rewritten as
Validation data from fully/partially characterized experiments If the validation data is from a fully characterized experiment, i.e., all the input parameters of the experiment are measured and the point values of are available, and used in the Bayesian interval hypothesis testing are the mean and standard deviation of the model prediction given the input , and the PDF of () used in the Bayesian equality hypothesis testing is conditioned on . If the experiment is partially characterized, i.e., some of input corresponding to observation are not measured or are reported as intervals, we can assume that have the PDF based on reported intervals or expert opinions. One can first calculate the unconditional PDF of model prediction via propagating uncertainty from to model output
and then calculate and from . If data from both fully characterized and partially characterized experiments are available, we can first calculated Bayes factors corresponding to these two types of data points separately using different and (in the Bayesian interval hypothesis testing), or (in the Bayesian equality hypothesis testing) as shown above, and then multiply these Bayes factors to obtain an overall Bayes factor, as discussed below.
Bayesian hypothesis testing with multiple data points If there are () validation data points available and the corresponding experiments are conducted independently, i.e., no correlation exists between different data points, according to the basic rule of probability theory, the probability of observing the whole data set is the product of the probabilities of observing individual data points , . Since the likelihood functions and are essentially probabilities of observing the data, after computing the Bayes factor for each data point, these individual Bayes factors can be multiplied to compute the overall Bayes factor under the assumption that the observations are independent, as
If the number of data points is relatively large and most of ’s are larger than one, the product of individual Bayes factors may also be a large number. In such a case it is more convenient to express Bayes factor on a logarithmic scale as
Interpretation of Bayesian hypothesis testing results If the Bayes factor calculated is greater than 1, it is indicated that the data favors the null hypothesis; if the Bayes factor is less than 1, it is indicated that the data favors the alternative hypothesis. In addition, Jeffreys [?] gave a heuristic interpretation of Bayes factor in terms of the level of support that the hypotheses obtain from data. The threshold value of Bayes factor can be related to the socalled Bayes risk in detection theory [?], which is the sum of costs due to different decision scenarios  failing to reject the true/wrong hypothesis and rejecting the true/wrong hypothesis. It has been shown that appropriate selection of can help reduce the Bayes risk [?]. If one assumes that the cost of making correct decisions (failing to reject the true hypothesis or rejecting the wrong hypothesis) is zero, the costs of type I and type II error are the same, and the prior probabilities of the null and alternative hypothesis being true are equal, then the resulting [?]. However, It should be noted that as a part of the decision making process, the choice of thresholds for Bayes factor inevitably contains subjective elements.
Before collecting validation data, there may be no evidence to support or reject the model. In that case, it may be reasonable to assume that the prior probabilities of the null hypothesis and alternative hypothesis are equal (). In that case, a simple expression of the posterior probability of the null hypothesis can be derived in terms of the Bayes factor [?], which is a convenient metric to assess the confidence in the model prediction:
An advantage of Bayesian hypothesis testing is that the posterior probabilities of and obtained from the validation exercise can both be used through a Bayesian modelaveraging approach [?] to reflect the effect of the model validation result on the uncertainty in model output, as shown in Equation 14
where is the predicted PDF of combining the PDFs of under the null and alternative hypothese. Therefore, instead of completely accepting a single model, one can include the risk of using this model in further calculations. This helps to avoid both Type I and Type II errors, i.e., accepting a wrong model or rejecting a correct model.
3.3Relationship between value and Bayes factor
Although the value in classical hypothesis testing and the Bayes factor are based on different philosophical assumptions and formulated differently, it has been shown that these two metrics can be mathematically related for some special cases [?]. In the discussion below, the Bayes factor based on the hypothesis of probability density functions for a fully characterized experiment is found related to the value in test and test, if the model prediction is a normal random variable with mean and standard deviation .
Starting from the formula of Bayes factor in Equation 12, since we assume that the PDF of the quantity to be predicted under the alternative hypothesis is uniform, the integration term in the denominator is not affected by the target model and thus can be treated as a constant . Based on the null hypothesis , the quantity to be predicted . Recall the relationship , and , we know that . Thus the numerator of Equation 12 can be calculated as
where is the PDF of the standard norm random variable.
If the variance of measurement noise is negligible compared to the variance of , i.e., , we have . Also note that for a single data point . Therefore Equation 12 becomes
Based on Eqs. Equation 1 and Equation 3, we have
Substituting Equation 16 into Equation 15, we obtain
where is the probability density function of a standard normal variable.
From Equation 17, we can see that the Bayes factor can be related to either the statistic or the statistic, and hence it can be related to the value in both test and test. Combining Eqs. Equation 4 and Equation 17, we obtain the relation between Bayes factor and the value in the test as
where is the inverse standard normal CDF. Similarly, the relation between Bayes factor and the value in the test can be obtained by combining Eqs. Equation 2 and Equation 17 as
where is the inverse CDF of a distribution with degrees of freedom.
If the chosen significance level in test or test is , the corresponding threshold Bayes factor can be calculated using Equation 18 or Equation 19 by letting . In that case, the test/test with significance level and Bayesian hypothesis testing with the corresponding threshold value will both give the same model validation result.
4Nonhypothesis testingbased methods
Besides the binary hypothesis testing methods discussed above, various other validation metrics have been developed to quantify the agreement between models and experimental data from other perspectives, such as the Mahalanobis distance for models with multivariate output [?], the weighted validation databased metric [?], the KullbackLeibler divergence in the area of signal processing [?] and for the design of validation experiments [?], the probability boundbased metric [?], the confidence intervalbased metric [?], the reliabilitybased metric [?], and the area metric [?]. The weighted validation databased metric introduced by Hills and Leslie [?] is designed for the case when the importance of different validation experiments is different. The confidence intervalbased validation method proposed by Oberkampf et al. [?] computes the confidence interval of error, which is defined as the difference between the model mean prediction and the true mean of the variable to be predicted. An average absolute error metric and an average absolute confidence indicator are also computed. However, it is not clear how to apply this method to validation of a multivariate model with data from discrete test combinations, as the method requires integration over the space of test inputs. Therefore, only the reliabilitybased metric and the area metric are discussed in this paper.
4.1Reliabilitybased metric
The reliability metric proposed by Rebba and Mahadevan [?] is a direct measure of model prediction quality and is relatively easy to compute. It is defined as the probability of the difference between model prediction and observed data being less than a given quantity
As mentioned in Section 2, experimental observation is random due to measurement uncertainty and model output may be uncertain in the Bayesian framework. Therefore, as the difference between two random variables, in the Bayesian framework we interpret as a random variable. The probability distribution of can be obtained from the probability distributions of and . For instance, if the model prediction , and the corresponding observation (see discussion in Section 3.1), the difference . For the sake of simplicity, let . In this case, the reliabilitybased metric can be derived as
In this paper, experimental data are considered as the samples of the random variable . Therefore, if the number of experimental data is relatively large, e.g., , the sample variance can be assumed as a good estimator of (the true variance of ), which is needed to compute the reliability metric. If is small and no prior information on is available, we can assume that , which is the same assumption used in test. By assuming further that the mean of validation data is equal to , Equation 20 can be rewritten as
By substituting Equation 16 into Equation 21, the relation between the reliabilitybased metric and the test statistic in the test or test is obtained as
By combining Eqs. Equation 2, Equation 4 and Equation 22, the reliabilitybased metric can be further related to the value in the test or test as
If one chooses to test models based on a threshold reliability value calculated by letting in Equation 23 above, the result of model validation will be the same as that in the test or test with significance level .
Note that the threshold used in the reliabilitybased method represents the minimum probability of the difference falling within an interval [], and the decision of accepting/rejecting a model can be made based on the decision maker’s acceptable level of model reliability.
Since the reliabilitybased metric is the probability of being within a given interval, it can also reflect the existence of directional bias by modifying the intervals. Similar to the Bayesian interval hypothesis testing, we can take two different intervals and , and calculate the corresponding values of metric and as:
By comparing the values of and with the threshold (half of the original threshold value because the width of intervals considered is half of the original one), the model will be failed if either or is less than .
Note that for the case that the quantity to be predicted is deterministic, becomes zero, and for the case that the model prediction is deterministic, becomes zero.
4.2Area metricbased method
The area metric proposed by Ferson et al. [?] is attractive due to its capability to incorporate fully characterized experiments using the socalled “upooling” procedure, and thus to validate models with sparse data on multiple experimental combinations [?]. For a single experimental combination with input , suppose is the corresponding cumulative probability function (CDF) of model output and is the observed data, one can compute a value for this experimental combination as
Based on the probability integral transform theory in statistics [?], if the observed data is a random sample from the probability distribution of model output, the computed will be a random sample from the standard uniform distribution, and thus the empirical CDF of all the ’s () should match the CDF of the standard uniform random variable. The difference between these two CDF curves is a measure of the disparity between model predictions and experimental observations. Hence, a model validation metric can be developed as [?]
where is the empirical CDF of all the ’s and is the CDF of the standard uniform random variable. If the value of is small/large, the model predictions are correspondingly close/not close to experimental observations.
If the model prediction is deterministic, the CDF function in Equation 25 cannot be constructed, and hence the area metricbased method is not applicable to testing a computational model with deterministic output.
The area metric can also reflect the existence of directional bias, i.e., the experimental observations are consistently below or above the corresponding mean predictions of numerical model. This is due to the use of CDF of model output in Equation 25. For example, if the model outputs at different test combinations are Gaussian random variables, the ’s in Equation 25 become Gaussian CDFs. Hence, the values of will all be less than 0.5 if ’s are smaller than the mean of the corresponding Gaussian variables. Therefore, instead of being uniformly distributed between [0,1], ’s are distributed between [0,0.5], causing a large area between the empirical CDF of and the standard uniform CDF.
Compared to the hypothesis testing methods and the reliabilitybased method, the area metricbased method lacks a direct interpretation of model acceptance threshold, i.e., it is not clear how to set up an appropriate threshold to decide when one should reject/accept a model. This is a disadvantage for the area metricbased approach.
5Numerical example
In this section, the aforementioned model validation methods are illustrated via an application example on damping prediction for MEMS switches. The quantity to be predicted, the damping coefficient, is considered as a random variable due to the lack of understanding in physical modeling, in other words, the epistemic uncertainty of damping coefficient is represented by a subjective probability distribution following the Bayesian way of thinking; the corresponding computational model is also stochastic as will be shown in Section ?. The validation data are obtained from fully characterized experiments, and it is found that the directional bias defined in Section 1 exists between model prediction and validation data.
5.1Damping model and experimental data
Despite the superior performance provided in terms of signal loss and isolation compared with silicon devices [?], the use of RF MEMS switches in applications requiring high reliability is hindered by significant variations in device lifetime [?]. Rigorous quantification of the uncertainty sources contributing to the observed life variations is necessary in order to achieve the design of reliable devices. Within the framework of uncertainty quantification in the modeling of RF MEMS switches, the validation of squeezefilm damping model emerges as a crucial issue due to two factors: (1) damping strongly affects the dynamic behavior of the MEMS switch and therefore its lifetime [?]; (2) it is difficult to accurately model microscale fluid damping and available models are applicable to limited regimes [?].
Uncertainty quantification in microscale squeezefilm damping prediction
For the purpose of illustration, this study considers damping prediction using the NavierStokes slip jump model [?]. Two major sources of uncertainty have been shown to affect the prediction of gas damping [?]. The first one is epistemic uncertainty related to the lack of understanding of fundamental failure modes and related physical models. The second one is aleatory uncertainty in model parameters and inputs due to variability in either the fabrication process or in the operating environment. Uncertainty quantification approaches usually require large numbers of deterministic numerical simulations. In order to reduce the computational cost, a generalized polynomial chaos (gPC) surrogate model [?] is constructed and trained using solutions of the NavierStokes (NS) equation for a few input points, thus avoiding repetitively solving the NS equation. Note that several other surrogate modeling techniques are also available, including Kriging or Gaussian Process (GP) interpolation [?], support vector machine (SVM) [?], relevance vector machine [?], etc. The gPC model is used for the purpose of illustration. This model approximates the target stochastic function using orthogonal polynomials in terms of the random inputs [?]. A order gPC model that approximates a random function can be written as
where ’s are the orthonormal polynomial bases such as Legendre polynomials, Hermite polynomials, and WienerAskey polynomials; is the dimension of input and is the order of the polynomial; is the error of the gPC model; ’s are coefficients and can be obtained as
where is constant for a given polynomial basis , and is a set of nodes and weights of the quadrature rule for numerical integration.
Based on the calculated damping coefficient values at the quadrature nodes by solving the NavierStokes Slip Jump model, the gPC model can be constructed using Eqs. Equation 27 and Equation 28. For any given input , is deterministic, while the residual term is random. Under the GaussMarkov assumption, asymptotically follows a Gaussian distribution with zero mean, and the variance can be estimated as [?]
where is a function of model input ; the vector = [,,...,; the matrix = ,,...,; and .
Therefore, for a given input , the prediction of damping coefficient based on the gPC model is a random variable with Gaussian distribution . The methods presented in Sections Section 3 and Section 4 will be applied to the validation of this predicted distribution.
The example RF MEMS switch modeled as a membrane is shown in Figure 3. To construct a gPC model for the damping coefficient, the input parameters need to be specified first. A probabilistic sensitivity analysis shows that the membrane thickness , the gap height , and the frequency are the major sources of variability in the damping coefficient, and hence these three parameters are included in the gPC model, i.e., . Four different gas pressures  18798.45 Pa, 28664.31 Pa, 43596.41 Pa, and 66661.19 Pa  are considered and correspondingly four gPC models are constructed. This example uses a third order gPC model with Legendre polynomial bases [?].
It should be noted that the validity of the surrogate model does not guarantee the validity of the original model. We only have access to the surrogate model and validation experimental data; therefore in this example we are only assessing the validity of the surrogate model.
Experimental data for validation
In the experiment, seven devices with different geometric dimensions are considered. For each of the four pressures mentioned above, 5 repetitive tests are conducted on each of the seven devices, and hence 140 data points are obtained. Since the input parameters are recorded for each of the data points, these experiments are fully characterized. However, there is only one data point for each test combination due to the fact that each of the 140 input value combinations is different from others.
Fig. ? shows a graphical comparison between the mean gPC model prediction and experimental data under the four different pressures by aggregating predictions and data with respect to the 35 test combinations for each pressure value. The top/bottom points are correspondingly the maximum/minimum value of model mean predictions and experimental data, and the square/diamond markers are the average values of predictions/data on the 35 test combinations. A more detailed graphical comparison showing mean prediction of the gPC model vs. experimental data on each of the individual test combinations is provided in Figs. ?.
From the graphical comparison, we can see that the gPC model performs better under the middle two values of pressure. Also note that there is a systematic bias between the gPC model and experimental observations at the low pressure value (18798.45 Pa), i.e., the mean predictions of the gPC model are always larger than the experimental data.
5.2Validation based on binary hypothesis testing
Classical hypothesis testing
Because the sample size for each experimental combination is only 1, the test is not applicable and instead test is used. The values calculated using Equation 4 are shown in Fig. ?. The dashed lines in Fig. ? represent the significance level . The model is considered to have failed at the experimental combinations where the corresponding values fall below the dashed line. Note that a more rigorous test will need to include the probability of making type II error (). The individual numbers of failures of the four gPC models are shown in Table 2.
Pressure (Pa)  18798.45  28664.31  43596.41  66661.19 

Number of failures  10  5  7  20 
Failure percentage  28.6%  14.3%  20.0%  57.1% 
Bayesian hypothesis testing
Interval hypothesis on distribution parameters As discussed in Section 3.2, combination of two Bayesian hypothesis tests based on the interval null hypotheses and respectively can reflect the existence of directional bias. In practical, the parameters , , and that define the intervals can be determined based on the strictness requirement of validation. For the purpose of illustration, we set , , and . and that define the possible range of are set as 0 and 1 respectively since the MEMS device considered is underdamped. and are set to be 0.001 and 0.05 respectively. The results of Bayesian interval hypothesis testings are calculated using Equation 7  Equation 10, and are shown in Fig. ? and Table 3.
Pressure (Pa)  18798.45  28664.31  43596.41  66661.19  


Number of failures  10  5  0  10  
Overall Bayes factor  3.1  58.3  92.9  44.1  
(r)37  Number of failures  5  4  5  14  
Overall Bayes factor  63.9  87.1  74.1  1.4  
(r)37  Combined test  Number of failure  10  5  5  16 
Failure percentage  28.6%  14.3%  14.3%  45.7%  
Equality hypothesis on probability density functions In this study, the possible values of damping coefficient range from 0 to 1 since the system is underdamped. Hence the limit of integration in the denominator of Equation 12 is [0, 1], while the limit of integration in the numerator is .
The performance of the gPC models in Bayesian hypothesis testing are shown in Fig. ? and Table 4. The values of Bayes factor are calculated using Equation 12, and the threshold Bayes factor (this threshold value is chosen based on the discussion in Section 3.2). Although the performance of the gPC model in terms of failure percentage is different for the two hypothesis tests as shown in Table 2 and Table 4, if one increases the threshold Bayes factor to 2.88, which is calculated using Equation 18 with in Section 3.3, the result of Bayesian hypothesis testing in terms of the number of failures becomes the same as in the test in Section ?. The reason for this coincidence has been explained in Section 3.3. Note that the performance of the second gPC model (for pressure = 28664.31 Pa) remains the same when is raised from 1 to 2.88, and this can be easily observed from Fig. ?.
Pressure (Pa)  18798.45  28664.31  43596.41  66661.19 

Number of failures  5  5  3  15 
Failure percentage  14.3%  14.3%  8.6%  42.9% 
Overall Bayes factor (logscale)  7.4  57.2  72.3  10.2 
By comparing the results based on interval hypothesis on distribution parameters and equality hypothesis on probability density functions (Tables Table 3 and Table 4), it can be observed that the performance of the gPC model for pressure 18798.45 Pa in the first test is significantly worse than in the second test, while the models for the other three pressures have similar failure percentages in these two tests. As shown in Fig. ?, the data are all located below the mean predictions of this gPC model, which indicates the existence of directional bias, and thus the gPC model for pressure 18798.45 Pa performs worse in the Bayesian interval hypothesis testing.
5.3Validation using nonhypothesis testingbased methods
Reliabilitybased metric
Fig. ? and Table 5 show the calculated values of the reliabilitybased metric , and (Eq. Equation 20 and Equation 24), the failure percentage of models with and the decision criterion . This decision criterion is obtained using Equation 23 with the significance level , and thus the results of validation (comparing with ) in terms of failure percentage are the same as in the test in Section ?. It can also observed that the failure percentage of the gPC model for pressure 18798.45 Pa increases significantly in the test that comparing and with due to the existence of directional bias.
Pressure (Pa)  18798.45  28664.31  43596.41  66661.19  

vs. 
Number of failures  10  5  7  20 
Failure percentage  28.6%  14.3%  20.0%  57.1%  
(r)26 and vs.  Number of failures  20  7  12  25 
Failure percentage  57.1%  20.0%  34.3%  71.4%  
Area metricbased method
The area metric for the four gPC models can be computed using Eqs. Equation 25 and Equation 26, and the results are shown in Fig. ? and Table 6. Note that the gPC model for pressure 18798.45 Pa has the highest area value and thus performs worst. This is due to the directional bias between mean predictions and experimental data, and it is reflected in the area metric as discussed in Section 4.2.
Pressure (Pa)  18798.45  28664.31  43596.41  66661.19 

0.543  0.146  0.151  0.250  
5.4Discussion
This section demonstrated a numerical example of validating the gPC surrogate model for the RF switch damping coefficient using the validation methods presented in Sections Section 3 and Section 4, and 140 fully characterized experimental data points. Based on the performance of the gPC model in these validation tests, it can be concluded that the prediction of the gPC model has better agreement with observation under the middle two values of pressure (28664.31 Pa and 43596.41 Pa), whereas less agreement can be found under the lowest and highest pressure values (18798.45 Pa and 66661.19 Pa). The decision on model acceptance can be formed based on the failure percentages with the hypothesis testing methods and the reliabilitybased method, and the values of areabased metric, given the desired acceptance threshold. It is shown that the test and the reliabilitybased metric give the same results in terms of failure percentage when is selected based on the significance level used in test. Similarly, classical and Bayesian hypothesis testing give the same result by choosing a specific threshold Bayes factor as shown in Section 3.3. It is also shown that the existence of directional bias can be reflected in the Bayesian interval hypothesis testing, reliabilitybased method with modified intervals, and the area metricbased method. Models that have directional bias will perform worse in these three validation methods than in classical hypothesis testing and in Bayesian hypothesis testing with equality hypothesis on probability density functions.
6Conclusion
This paper explored various quantitative validation methods, including classical hypothesis testing, Bayesian hypothesis testing, a reliabilitybased method, and an area metricbased method, in order to validate computational model prediction. The numerical example featured a generalized polynomial chaos (gPC) surrogate model which predicts the microscale squeezefilm damping coefficient for RF MEMS switches.
An Bayesian interval hypothesis testingbased method is formulated, which validates the accuracy of the predicted mean and standard deviation from a model, taking into account the existence of directional bias. Further, Bayesian hypothesis testing to validate the entire PDF of model prediction is formulated. These two formulations of Bayesian hypothesis testing can be applied to both fully characterized and partially characterized experiments, and the case when multiple validation points are available. It is shown that while the classical hypothesis testing is subject to type I and type II error, the Bayesian hypothesis testing can minimize such risk by (1) selecting a riskbased threshold, and (2) subsequent model averaging using posterior probabilities. It is observed that under some conditions, the value in the test or test can be mathematically related to the Bayes factor and the reliabilitybased metric.
The area metric is also sensitive to the direction of bias between model predictions and experimental data, and so is the reliabilitybased method. The Bayesian model validation result and reliabilitybased metric can be directly incorporated in longterm failure and reliability analysis of the device, thus explicitly accounting for model uncertainty, whereas the areabased metric lacks a direct interpretation for its results. In addition, due to the use of likelihood function in the Bayesian hypothesis testing, the Bayesian model validation method can be extended to the case that the validation data is in the form of interval, as shown in [?].
Acknowledgment
This paper is based upon research partly supported by the Department of Energy [National Nuclear Security Administration] under Award Number DEFC5208NA28617 to Purdue University (Principal Investigator: Prof. Jayathi Murthy), and Subaward to Vanderbilt University. The support is gratefully acknowledged. The authors also thank the U. S. DOE (NNSA) PSAAP Center for Prediction of Reliability, Integrity and Survivability of Microsystems (PRISM) at Purdue University for providing the models and validation data for the numerical example.