# Assessing Modeling Variability in Autonomous Vehicle Accelerated Evaluation

###### Abstract

Safety evaluation of autonomous vehicles has been studied extensively in recent years, and one line of work considers Monte Carlo based evaluation. Monte Carlo based evaluation usually estimates the probability of safety-critical events as a safety measure, using Monte Carlo samples generated from a stochastic model that is constructed from real-world data. In this paper, we propose an approach to assess the potential estimation error in the evaluation procedure caused by data variability. The proposed method merges the classical bootstrap method for estimating input uncertainty with a likelihood ratio based scheme that reuses experiment results. The proposed approach is economical and efficient in terms of implementation cost when assessing input uncertainty for autonomous vehicle evaluation.

## I Introduction

The competitive race toward mass deployment of autonomous vehicles (AVs) driving side-by-side with human-driven vehicles on public roads calls for an accurate and highly precise safety evaluation framework to ensure safe driving. However, achieving meaningful precision is challenging when the safety-critical events under study are rare under naturalistic conditions. A recent method that adopts Monte Carlo simulation, with importance sampling as a variance reduction scheme, has produced appealing results. In [1], it is shown that the efficiency is enhanced by ten thousand times with the incorporation of large-scale driving data sets and extensive statistical models employed in the framework. This improved efficiency is highly appealing for AV researchers because the required testing effort is otherwise overly demanding: an estimated 8.8 billion driving miles are required to provide ‘sufficient’ evidence to compare the safety of AV driving and human driving from logged data [2].

A common framework adopted to estimate the safety measure is Monte Carlo simulation. It generates a large number of samples and simulates an experiment for each sample. A statistical analysis is then performed on the experiment results and conclusions are drawn, taking into account the stochastic nature of the sample generator and the experiment settings. By properly integrating the Monte Carlo approach into the modeling scheme, researchers have been able to estimate the risk of driving situations based on control and trajectory prediction [3, 4] and of corner cases in various driving scenarios [1, 5].

In these works, the AV system is viewed as a black box and a Monte Carlo simulation is used to evaluate its safety performance. Using this view, the evaluation of a vehicle system is estimated from Monte Carlo samples of variables that represent the uncertainties in the system, e.g. traffic environment or noise in control and observation. Usually, the Monte Carlo samples are drawn from empirical distributions of the real-world data, or stochastic models fitted from real-world data. Since the reliability of the Monte Carlo based tests largely depends on the correctness of these underlying distribution models, whose parameters are only estimated values from the data, the variability of the data has a direct effect on the test result. In this paper, this effect is referred to as input uncertainty. In order to provide a convincing test evaluation, the input uncertainty of the evaluation procedure needs to be addressed and highlighted as part of the test results.

The goal of this paper is to provide a way to construct a confidence interval for the evaluation results as a quantitative measurement of the input uncertainty. In particular, we focus on the evaluation procedures using parametric stochastic models that are fitted from real-world data. Under this setting, the input uncertainty imposes variability to the estimation of parameters and consequently perturbs the evaluation results. Such a problem setting is consistent with the accelerated evaluation approach studied in [6, 1, 5].

In this paper, we propose an extension of the classic bootstrap technique [7] for assessing the input uncertainty in Monte Carlo based evaluation approaches. The main contribution of our approach is the likelihood ratio based scheme for estimation in bootstrap replications, which reuses the Monte Carlo estimators from the initial evaluation of safety measures. The proposed approach significantly reduces the computation burden of the standard bootstrap implementation and adapts naturally to the accelerated evaluation approach. The likelihood ratio based scheme can efficiently utilize the accelerating distribution constructed in the initial evaluation stage and provides a good importance sampling estimator for each bootstrap replication in the subsequent stages.

This paper is structured as follows: Section II introduces the notation and sets up the problem. Section III reviews classic bootstrap schemes and presents the proposed approach. Section IV illustrates the implementation details of the proposed approach using numerical examples and demonstrates it on an AV evaluation example.

## II Problem Setting Under AV Testing

In this section, we set up the notation for the AV evaluation problem. We first define the notation in a general setting, and then formulate and discuss the quantification of input uncertainty in that setting. We then link the general setting to Accelerated Evaluation. This part serves both as a practical example that illustrates the problem of interest and as preparation for introducing the proposed approach, which fits the Accelerated Evaluation setting well.

### II-A General Setting

The goal of a Monte Carlo based test approach is to understand the performance of the AV system under an uncertain environment. We use $X$ to denote the uncertain factors of the AV system. Note that $X$ is usually vector valued, i.e. $X = (X^{(1)}, \dots, X^{(d)})$, where each element represents one attribute of the environment. We use $x$ to denote a realization of the random vector $X$.

Next, we define $\theta$ as the parameters of a parametric stochastic model. The stochastic model has a density function $f(x;\theta)$. We assume that $\theta_0$ is the truth, in other words, we have $X \sim f(x;\theta_0)$.

Here, we use $h(x)$ to denote the performance measurement of an AV system under environment $x$, which is referred to as the performance function. Since $X$ is random and follows $f(x;\theta_0)$, our goal is to estimate the average performance measure $\mu(\theta_0)$, where

$$\mu(\theta) = \mathbb{E}_{\theta}[h(X)] = \int h(x)\, f(x;\theta)\,dx. \tag{1}$$

Usually the performance function $h$ is defined by a complex system, and the expectation (1) is hard to compute analytically even if $f(x;\theta_0)$ is fully known. Hence a Monte Carlo approach is applied to estimate $\mu(\theta_0)$. Assume we have samples $X_1, \dots, X_n$ generated from a certain distribution (which need not be $f(x;\theta_0)$, nor even belong to the same parametric model) and denote the single-sample estimator as $H(X_i;\theta_0)$. We can estimate $\mu(\theta_0)$ as

$$\hat{\mu}(\theta_0) = \frac{1}{n}\sum_{i=1}^{n} H(X_i;\theta_0). \tag{2}$$

For instance, in the crude Monte Carlo approach, we have $X_i \sim f(x;\theta_0)$ and $H(X_i;\theta_0) = h(X_i)$. Each evaluation of the performance function at a certain sample is referred to as one experiment trial.
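As a rough illustration of the sample-average estimator in (2), the following Python sketch runs crude Monte Carlo on a toy problem. The exponential environment model, the threshold `2.0`, and the helper name `crude_monte_carlo` are assumptions made for this example only; they are not the AV model studied in this paper.

```python
import math
import random


def crude_monte_carlo(h, sampler, n, rng):
    """Average the performance function h over n independent draws."""
    total = 0.0
    for _ in range(n):
        x = sampler(rng)   # one realization of the uncertain environment
        total += h(x)      # one experiment trial
    return total / n


# Toy assumption: X ~ Exponential(rate=1) and h(x) = 1 if x > 2, else 0.
rng = random.Random(0)
estimate = crude_monte_carlo(
    h=lambda x: 1.0 if x > 2.0 else 0.0,
    sampler=lambda r: r.expovariate(1.0),
    n=100_000,
    rng=rng,
)
exact = math.exp(-2.0)  # closed-form P(X > 2) for this toy model
print(f"estimate={estimate:.4f}, exact={exact:.4f}")
```

In this toy case the event is not rare, so crude Monte Carlo is adequate; the rare-event case is where the cost of each experiment trial starts to dominate.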

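In this paper, we consider the situation where $\theta_0$ is unknown but a finite number of data points from $f(x;\theta_0)$ is available. We use the maximum likelihood estimate (MLE) $\hat{\theta}$ for the parameter of the stochastic model. Note that although $\hat{\mu}(\theta)$ is an unbiased estimator of $\mu(\theta)$ for any fixed $\theta$, i.e. $\mathbb{E}[\hat{\mu}(\theta)] = \mu(\theta)$, the parameter $\hat{\theta}$ is estimated from data of $X$ and is therefore uncertain due to the variability of the samples.

If we treat $\hat{\theta}$ as the true parameter of the stochastic model and simulate the average performance measure, the estimate is given by

$$\hat{\mu}(\hat{\theta}) = \frac{1}{n}\sum_{i=1}^{n} H(X_i;\hat{\theta}), \tag{3}$$

which is an estimator for $\mu(\hat{\theta})$ instead of $\mu(\theta_0)$. The influence of input uncertainty can be revealed by a decomposition of the variance of $\hat{\mu}(\hat{\theta})$:

$$\mathrm{Var}\big(\hat{\mu}(\hat{\theta})\big) = \mathrm{Var}\big(\mathbb{E}[\hat{\mu}(\hat{\theta}) \mid \hat{\theta}]\big) + \mathbb{E}\big[\mathrm{Var}(\hat{\mu}(\hat{\theta}) \mid \hat{\theta})\big]. \tag{4}$$

In this decomposition, the first term is the input uncertainty and the second term is referred to as the simulation uncertainty. We note that if we ignore the variation of $\hat{\theta}$, only $\mathrm{Var}(\hat{\mu}(\hat{\theta}) \mid \hat{\theta})$ would be considered as the variance of the estimator.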
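To make the two terms in (4) concrete, consider the special case of crude Monte Carlo with an indicator performance function, writing $\mu(\theta)$ for the average performance under parameter $\theta$ and $\hat{\theta}$ for the fitted parameter (this notation and the special case are assumptions chosen for illustration). Conditional on $\hat{\theta}$, the $n$ trials are independent Bernoulli variables with success probability $\mu(\hat{\theta})$, so:

```latex
\begin{aligned}
\mathbb{E}\big[\hat{\mu}(\hat{\theta}) \mid \hat{\theta}\big] &= \mu(\hat{\theta}), \\
\mathrm{Var}\big(\hat{\mu}(\hat{\theta}) \mid \hat{\theta}\big) &= \frac{\mu(\hat{\theta})\,\big(1-\mu(\hat{\theta})\big)}{n}, \\
\mathrm{Var}\big(\hat{\mu}(\hat{\theta})\big) &=
  \underbrace{\mathrm{Var}\big(\mu(\hat{\theta})\big)}_{\text{input uncertainty}}
  + \underbrace{\frac{1}{n}\,\mathbb{E}\big[\mu(\hat{\theta})\,\big(1-\mu(\hat{\theta})\big)\big]}_{\text{simulation uncertainty}}.
\end{aligned}
```

Only the second term shrinks as $n$ grows; the first term is driven by the amount of real-world data used to fit $\hat{\theta}$.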

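Usually a confidence interval is constructed as a reference for the accuracy of the estimation. For a confidence interval $[L, U]$ with confidence level $1-\alpha$, we want to have

$$P\big(L \le \mu(\theta_0) \le U\big) \ge 1-\alpha, \tag{5}$$

i.e. we want the confidence interval to cover the truth with probability at least $1-\alpha$ (in the ideal case, we want equality).

When the variation of $\hat{\theta}$ is not considered, we construct the confidence interval based on the asymptotic normality of the sample mean of the estimator. In particular, we use

$$L = \hat{\mu}(\hat{\theta}) - z_{1-\alpha/2}\,\frac{\hat{\sigma}}{\sqrt{n}} \tag{6}$$

and

$$U = \hat{\mu}(\hat{\theta}) + z_{1-\alpha/2}\,\frac{\hat{\sigma}}{\sqrt{n}}, \tag{7}$$

where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$ quantile of the standard Gaussian distribution and $\hat{\sigma}^2$ is the sample variance of $H(X_1;\hat{\theta}), \dots, H(X_n;\hat{\theta})$. In fact, this confidence interval is only valid for

$$P\big(L \le \mu(\hat{\theta}) \le U \mid \hat{\theta}\big) \approx 1-\alpha, \tag{8}$$

which only considers the simulation uncertainty and ignores whether the truth $\mu(\theta_0)$ is covered or not. This is potentially misleading, because the interval might fail to cover the truth (which is highly likely if $\hat{\theta}$ is not close to $\theta_0$). In this paper, we discuss approaches to construct a confidence interval that targets covering $\mu(\theta_0)$ with confidence level $1-\alpha$.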
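The following Python sketch illustrates how a normal-approximation interval built only from simulation outputs can be narrow and still miss the truth when the fitted parameter is off. The exponential model, the true rate `1.0`, the fitted rate `1.2`, the threshold `2.0`, and the function name `simulation_only_ci` are all hypothetical choices for this example.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def simulation_only_ci(outputs, alpha=0.05):
    """Normal-approximation interval from per-trial outputs.

    It reflects simulation uncertainty only, not input uncertainty.
    """
    n = len(outputs)
    center = mean(outputs)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * stdev(outputs) / math.sqrt(n)
    return center - half_width, center + half_width


# Toy assumption: the true rate is 1.0 but the fitted rate is 1.2,
# and the event of interest is X > 2.
true_rate, fitted_rate, threshold = 1.0, 1.2, 2.0
rng = random.Random(1)
outputs = [1.0 if rng.expovariate(fitted_rate) > threshold else 0.0
           for _ in range(100_000)]
lower, upper = simulation_only_ci(outputs)
truth = math.exp(-true_rate * threshold)
print(f"interval=({lower:.4f}, {upper:.4f}), truth={truth:.4f}")
print("covers truth:", lower <= truth <= upper)
```

With many trials the interval concentrates around the performance under the fitted rate, which in this toy setup is well below the performance under the true rate.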

### II-B Accelerated Evaluation Setting

In Accelerated Evaluation approaches [6, 1, 5], the traffic environment is considered as the uncertainty for the AV system under study and is represented by a vector $X$. A parametric stochastic model $f(x;\theta)$ is used to represent the uncertainty. The model can be static, which only represents the initial condition [6, 1], or dynamic, which represents stochastic processes with a finite time horizon [5]. The underlying assumption is that the parametric model contains the true distribution. In other words, there exists a parameter $\theta_0$ such that $X \sim f(x;\theta_0)$.

The performance function is defined as $h(x) = I_{\varepsilon}(x)$, which indicates whether a certain type of safety-critical event (e.g. a crash) occurs to the AV system under the environment $x$ (1 if the event occurs, 0 otherwise). Note that $\varepsilon$ denotes the set of safety-critical events. For example, $I_{\varepsilon}(x)$ can be the test result of a lane change scenario test in computer simulation or even of a real-world on-track test. Therefore, it can be rather expensive to run an experiment trial for $I_{\varepsilon}(x)$.

In Accelerated Evaluation, the average performance measure is the probability of the safety-critical event, which is revealed by the equality

$$\mathbb{E}_{\theta_0}[I_{\varepsilon}(X)] = P_{\theta_0}(X \in \varepsilon). \tag{9}$$

This measure is used as the criterion for the safety of a tested AV, and studies have considered constructing efficient estimators for it. We use $p$ to denote the value of this probability.

In the AV testing context, we expect the safety-critical event to be very rare, i.e. its probability $p$ is extremely small. Under this setting, crude Monte Carlo approaches are inefficient in estimating this extremely small probability. The inefficiency is reflected in the large relative error (the standard deviation of the estimator divided by $p$) of the crude Monte Carlo estimator. To explain this intuitively, consider that every $X_i$ drawn from $f(x;\theta_0)$ returns $I_{\varepsilon}(X_i) = 1$ with probability $p$, and therefore a huge number of samples is required to observe a single safety-critical event. The computation cost of obtaining an accurate estimate (in terms of relative error) is usually prohibitive due to expensive experiment trials.

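To improve the efficiency in estimating the probability $p$ of the safety-critical event, [1] uses an importance sampling estimator to reduce the variance. Instead of drawing samples from $f(x;\theta_0)$, we construct an accelerating distribution $\tilde{f}(x)$ based on information about $f(x;\theta_0)$ and $\varepsilon$. With samples $X_1, \dots, X_n$ from $\tilde{f}$, we use the unbiased estimator

$$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} I_{\varepsilon}(X_i)\,\frac{f(X_i;\theta_0)}{\tilde{f}(X_i)}. \tag{10}$$

With a good selection of $\tilde{f}$, the importance sampling estimator can be very efficient. [6] has shown that the importance sampling estimator can achieve the same accuracy as the crude Monte Carlo estimator using only a small fraction of the crude Monte Carlo samples.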
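A minimal Python sketch of the importance sampling estimator in (10) is given below for a toy rare event. The exponential model, the threshold `20.0`, and the choice of an accelerating rate equal to `1 / threshold` (a common mean-shifting heuristic, not necessarily the construction used in [1] or [6]) are assumptions for illustration; `exp_pdf` and `importance_sampling` are hypothetical helper names.

```python
import math
import random


def exp_pdf(x, rate):
    """Density of an Exponential(rate) distribution."""
    return rate * math.exp(-rate * x) if x >= 0.0 else 0.0


def importance_sampling(threshold, rate, accel_rate, n, rng):
    """Estimate P(X > threshold) for X ~ Exponential(rate) using samples
    drawn from the accelerating distribution Exponential(accel_rate)."""
    total = 0.0
    for _ in range(n):
        x = rng.expovariate(accel_rate)
        if x > threshold:  # indicator of the rare event
            total += exp_pdf(x, rate) / exp_pdf(x, accel_rate)  # likelihood ratio
    return total / n


threshold, rate = 20.0, 1.0
rng = random.Random(2)
# Heuristic assumption: tilt so that the accelerated mean sits at the threshold.
estimate = importance_sampling(threshold, rate, accel_rate=1.0 / threshold,
                               n=100_000, rng=rng)
exact = math.exp(-rate * threshold)
print(f"estimate={estimate:.3e}, exact={exact:.3e}")
```

Here the target probability is on the order of $10^{-9}$, so crude Monte Carlo with the same number of samples would almost surely observe no event at all.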

## III Measurement of Input Uncertainty

In this section, we first introduce some well-studied bootstrap frameworks. We then propose our approach based on these techniques.

### III-A Classic Bootstrap Approach

The bootstrap technique dates back to [7, 8] and was developed to estimate the variability of statistical estimators without collecting new samples. [9] considers a parametric version of the bootstrap for assessing the input uncertainty in simulation. For further reading on input uncertainty quantification, one can refer to [10, 11, 12, 13, 14, 15] and [16], Section 7.2.

We first clarify some notation to avoid confusion. Note that the random vector $X$ and its samples appear in both the input modeling part and the simulation part. We use $Y_j$'s to denote the samples that we collected from the real world and used to estimate $\hat{\theta}$. We use $X_i$'s to represent the samples in the simulation part, which are generated from a certain distribution and are used to evaluate the estimator $\hat{\mu}$.

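In general, a bootstrap scheme for quantifying input uncertainty is as follows. We first generate $B$ samples $\hat{\theta}^*_1, \dots, \hat{\theta}^*_B$ that approximately follow the true distribution of $\hat{\theta}$. For each $\hat{\theta}^*_b$, we generate samples $X^b_1, \dots, X^b_n$ from $f(x;\hat{\theta}^*_b)$ and estimate $\mu(\hat{\theta}^*_b)$ using

$$\hat{\mu}(\hat{\theta}^*_b) = \frac{1}{n}\sum_{i=1}^{n} H(X^b_i;\hat{\theta}^*_b). \tag{11}$$

After computing $\hat{\mu}(\hat{\theta}^*_1), \dots, \hat{\mu}(\hat{\theta}^*_B)$, we take the $\alpha/2$ and $(1-\alpha/2)$ empirical quantiles of these estimates as the lower and upper bounds of the confidence interval, respectively. We denote the lower bound as $L$ and the upper bound as $U$.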
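The generic scheme above can be sketched in Python as follows. This is only an illustration under toy assumptions: the bootstrap parameters are drawn from a made-up Gaussian around a fitted rate of `1.0`, the simulator is an exponential toy model, and `empirical_quantile`, `classic_bootstrap_ci`, and `simulate` are hypothetical helper names.

```python
import math
import random


def empirical_quantile(values, q):
    """Smallest value v such that at least a fraction q of values are <= v."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[index]


def classic_bootstrap_ci(boot_params, simulate, alpha=0.05):
    """Run a fresh simulation for every bootstrap parameter, then take percentiles."""
    estimates = [simulate(theta) for theta in boot_params]
    return (empirical_quantile(estimates, alpha / 2.0),
            empirical_quantile(estimates, 1.0 - alpha / 2.0))


rng = random.Random(3)


def simulate(rate, n=2_000, threshold=2.0):
    """Toy stand-in for n fresh experiment trials under Exponential(rate)."""
    hits = sum(1 for _ in range(n) if rng.expovariate(rate) > threshold)
    return hits / n


# Hypothetical bootstrap parameters scattered around a fitted rate of 1.0.
boot_params = [rng.gauss(1.0, 0.05) for _ in range(100)]
lower, upper = classic_bootstrap_ci(boot_params, simulate)
print(f"interval=({lower:.4f}, {upper:.4f}) using {100 * 2_000} trials")
```

Note that every bootstrap parameter triggers a new batch of trials here, which is the cost the likelihood ratio scheme later avoids.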

Here, we introduce three different schemes for generating the samples $\hat{\theta}^*_b$ that are straightforward and easy to implement. The advantages of these three schemes will be further discussed in the numerical experiments in Section IV. For a sound empirical study on the performance of bootstrap schemes, refer to [17].

These bootstrap schemes assume that we start with a sample set $\{Y_1, \dots, Y_k\}$ and that the MLE $\hat{\theta}$ is estimated using these samples. Note that in the discussed approaches, we restrict the resampling size to be equal to the original sample size, namely $k$. This is not required for the bootstrap technique, but we adopt this setting for convenience and simplicity.

#### III-A1 Direct Bootstrap

Direct bootstrap considers the sample set $\{Y_1, \dots, Y_k\}$ as an empirical distribution, say $\hat{F}$, and uses it as an approximation of the real distribution of $X$. We draw $k$ samples from $\hat{F}$, i.e. resample from $\{Y_1, \dots, Y_k\}$ with replacement, and then use these samples to estimate $\hat{\theta}^*_b$ (using MLE). We repeat this procedure $B$ times to obtain $\hat{\theta}^*_1, \dots, \hat{\theta}^*_B$.
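A short Python sketch of direct bootstrap for a one-parameter exponential model follows. The exponential model, the synthetic data standing in for real-world samples, and the function names are assumptions for this example.

```python
import random
from statistics import mean


def exponential_mle(data):
    """MLE of the rate of an exponential distribution."""
    return len(data) / sum(data)


def direct_bootstrap(data, num_replications, rng):
    """Resample the data with replacement and refit the MLE each time."""
    k = len(data)  # resampling size equals the original sample size
    return [exponential_mle(rng.choices(data, k=k))
            for _ in range(num_replications)]


rng = random.Random(4)
data = [rng.expovariate(2.0) for _ in range(200)]  # toy "real-world" data
rate_hat = exponential_mle(data)
boot_rates = direct_bootstrap(data, num_replications=500, rng=rng)
print(f"MLE={rate_hat:.3f}, bootstrap mean={mean(boot_rates):.3f}")
```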

#### III-A2 Parametric Bootstrap

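Here we use the fitted model $f(x;\hat{\theta})$ as an approximation of the real distribution of $X$. We draw $k$ samples from $f(x;\hat{\theta})$ and use them to estimate $\hat{\theta}^*_b$. We repeat this $B$ times and collect $\hat{\theta}^*_1, \dots, \hat{\theta}^*_B$.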
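The parametric variant can be sketched in the same toy setting. Again, the exponential model, the synthetic data, and the function names are hypothetical and chosen only to keep the example short.

```python
import random
from statistics import mean


def exponential_mle(data):
    """MLE of the rate of an exponential distribution."""
    return len(data) / sum(data)


def parametric_bootstrap(data, num_replications, rng):
    """Draw synthetic data from the fitted model and refit the MLE each time."""
    k = len(data)
    rate_hat = exponential_mle(data)
    boot_rates = []
    for _ in range(num_replications):
        synthetic = [rng.expovariate(rate_hat) for _ in range(k)]
        boot_rates.append(exponential_mle(synthetic))
    return boot_rates


rng = random.Random(8)
data = [rng.expovariate(2.0) for _ in range(200)]  # toy "real-world" data
rate_hat = exponential_mle(data)
boot_rates = parametric_bootstrap(data, num_replications=500, rng=rng)
print(f"MLE={rate_hat:.3f}, bootstrap mean={mean(boot_rates):.3f}")
```

The only difference from direct bootstrap is the source of the resampled data: the fitted parametric model rather than the empirical distribution.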

#### III-A3 Sample Parameters from Asymptotic Distribution

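Since $\hat{\theta}$ is estimated using MLE, we know the asymptotic behavior of $\hat{\theta}$. That is, when $k \to \infty$, we have

$$\sqrt{k}\,(\hat{\theta} - \theta_0) \Rightarrow N\big(0, I^{-1}(\theta_0)\big), \tag{12}$$

where $I^{-1}(\theta_0)$ is the inverse of Fisher's information matrix of the parametric distribution $f(x;\theta_0)$. Since $\theta_0$ is unknown, we can use its MLE $\hat{\theta}$ to obtain an approximation of the asymptotic distribution, $N\big(\hat{\theta}, I^{-1}(\hat{\theta})/k\big)$.

In practice, a closed form of Fisher's information matrix might not be available. Instead, one can use the empirical Fisher's information matrix, which is an estimate based on the samples. That is,

$$\hat{I}(\hat{\theta}) = \frac{1}{k}\sum_{j=1}^{k} \nabla_{\theta}\log f(Y_j;\hat{\theta})\,\nabla_{\theta}\log f(Y_j;\hat{\theta})^{\top}, \tag{13}$$

where $Y_j$'s are the samples we use to fit the model. Thus, we can directly sample $\hat{\theta}^*_b$ from $N\big(\hat{\theta}, I^{-1}(\hat{\theta})/k\big)$ or $N\big(\hat{\theta}, \hat{I}^{-1}(\hat{\theta})/k\big)$. Note that this scheme avoids the computation cost of resampling and re-estimating $\hat{\theta}^*_b$.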
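For a one-parameter exponential model, sampling from the asymptotic distribution can be sketched as below. The closed-form Fisher information `1 / rate**2` is a standard fact for the exponential distribution; the use of the averaged squared score as the empirical version, the synthetic data, and the function name `asymptotic_parameters` are assumptions for this illustration.

```python
import math
import random
from statistics import mean


def asymptotic_parameters(data, num_replications, rng, empirical=False):
    """Draw rate parameters from the Gaussian approximation of the MLE."""
    k = len(data)
    rate_hat = k / sum(data)
    if empirical:
        # Average squared score, with d/d(rate) log f(y; rate) = 1/rate - y.
        fisher = sum((1.0 / rate_hat - y) ** 2 for y in data) / k
    else:
        # Closed-form Fisher information of Exponential(rate).
        fisher = 1.0 / rate_hat ** 2
    std = math.sqrt(1.0 / (k * fisher))
    return [rng.gauss(rate_hat, std) for _ in range(num_replications)]


rng = random.Random(9)
data = [rng.expovariate(2.0) for _ in range(200)]  # toy "real-world" data
rate_hat = len(data) / sum(data)
closed_form = asymptotic_parameters(data, 500, rng, empirical=False)
empirical = asymptotic_parameters(data, 500, rng, empirical=True)
print(f"MLE={rate_hat:.3f}, closed={mean(closed_form):.3f}, empirical={mean(empirical):.3f}")
```

No data are resampled and no model is refitted, which is where the computational saving comes from. One caveat of this sketch is that Gaussian draws are not constrained to the valid parameter range, so with very small `k` a draw could be non-positive.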

### III-B The Proposed Approach: A Likelihood Ratio Based Estimation for Bootstrap

To motivate the proposed approach, we first consider the computation cost of a classic bootstrap scheme. No matter which bootstrap scheme we use, after we obtain the bootstrapped parameters $\hat{\theta}^*_1, \dots, \hat{\theta}^*_B$, we need to estimate each $\mu(\hat{\theta}^*_b)$ using $\hat{\mu}(\hat{\theta}^*_b)$. To obtain a good empirical quantile, we usually require the number of bootstrap replications $B$ to be larger than 30 (usually 100 or more). Also, in order to reduce the simulation uncertainty and avoid obtaining an over-covered confidence interval, we want the number of simulation samples $n$ to be as large as possible. The total number of experiment trials will be $nB$, which is $B$ times the cost of estimating the probability once. When the experiment is expensive and time-consuming, the price for assessing the input uncertainty might not be affordable. Here, we propose an approach that can assess the input uncertainty with no additional cost for experiment trials.

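Assume we have already estimated the average performance measure $\mu(\hat{\theta})$ from samples $X_1, \dots, X_n$ from $\tilde{f}$ using (3), where $\tilde{f}$ can be $f(x;\hat{\theta})$ or an appropriate accelerating distribution for $f(x;\hat{\theta})$. Then, we obtain bootstrap parameters $\hat{\theta}^*_1, \dots, \hat{\theta}^*_B$ using any bootstrap scheme. For each $\hat{\theta}^*_b$, instead of generating a new sample set from $f(x;\hat{\theta}^*_b)$, we use the same set of samples $X_1, \dots, X_n$, and estimate $\mu(\hat{\theta}^*_b)$ using

$$\hat{\mu}(\hat{\theta}^*_b) = \frac{1}{n}\sum_{i=1}^{n} h(X_i)\,\frac{f(X_i;\hat{\theta}^*_b)}{\tilde{f}(X_i)}. \tag{14}$$

We should note that each $\hat{\mu}(\hat{\theta}^*_b)$ is still an unbiased estimator (provided $\tilde{f}(x) > 0$ wherever $h(x) f(x;\hat{\theta}^*_b) > 0$), i.e. we have

$$\mathbb{E}_{\tilde{f}}\big[\hat{\mu}(\hat{\theta}^*_b)\big] = \mu(\hat{\theta}^*_b). \tag{15}$$

Note that by estimating $\mu(\hat{\theta}^*_b)$ in this way, we do not need to evaluate $h$ (which is hidden in $\hat{\mu}$) at any new realization of $X$.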
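The reuse idea can be sketched in Python as follows: the experiment trials are run once under an accelerating distribution, and each bootstrap parameter only changes the likelihood ratio weights. The exponential model, the threshold `10.0`, the accelerating rate `0.1`, the made-up Gaussian bootstrap parameters, and the helper names are assumptions for this example rather than the setup used in this paper.

```python
import math
import random


def exp_pdf(x, rate):
    """Density of an Exponential(rate) distribution."""
    return rate * math.exp(-rate * x) if x >= 0.0 else 0.0


def run_trials(threshold, accel_rate, n, rng):
    """Run the expensive experiment trials once; store inputs and outcomes."""
    xs = [rng.expovariate(accel_rate) for _ in range(n)]
    hs = [1.0 if x > threshold else 0.0 for x in xs]  # h(X_i), evaluated once
    return xs, hs


def reweighted_estimate(xs, hs, rate, accel_rate):
    """Estimate the measure under Exponential(rate) by reweighting stored trials."""
    total = sum(h * exp_pdf(x, rate) / exp_pdf(x, accel_rate)
                for x, h in zip(xs, hs) if h != 0.0)
    return total / len(xs)


threshold, accel_rate = 10.0, 0.1
rng = random.Random(7)
xs, hs = run_trials(threshold, accel_rate, n=50_000, rng=rng)

# Hypothetical bootstrap parameters; any bootstrap scheme could supply them.
boot_rates = [rng.gauss(1.0, 0.05) for _ in range(200)]
estimates = sorted(reweighted_estimate(xs, hs, r, accel_rate) for r in boot_rates)
lower, upper = estimates[4], estimates[194]  # 2.5% and 97.5% empirical quantiles
print(f"interval=({lower:.3e}, {upper:.3e}) from {len(xs)} trials in total")
```

The performance function is evaluated only inside `run_trials`; the loop over bootstrap parameters involves density evaluations alone, which are typically cheap compared with an experiment trial.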

This approach fits accelerated evaluation particularly well, i.e. when $\tilde{f}$ is a good accelerating distribution for $f(x;\hat{\theta})$ and the estimator is defined by (10). From the asymptotic distribution of $\hat{\theta}$, we know that the bootstrap parameters $\hat{\theta}^*_b$'s should be distributed around $\hat{\theta}$, especially when the number of samples $k$ is large. We can therefore expect that $\tilde{f}$ would also be a good accelerating distribution for $f(x;\hat{\theta}^*_b)$. For instance, if we use exponential tilting of the Exponential distribution, the optimal $\tilde{f}$ for a certain performance function is the same for any parameter value of the exponential distribution. By using the proposed approach, we save $nB$ experiment trials compared to the classical bootstrap approaches.

## IV Numerical Experiments

In this section, we present some numerical experiments to illustrate the proposed approach and discuss some implementation details. We first discuss the performance of the three bootstrap schemes under different scenarios. We then use a simple illustrative problem to demonstrate the proposed approach. Lastly, we apply the proposed approach on an AV testing example problem.

### IV-A Comparison of Bootstrap Schemes

In Section III-A, we introduced three bootstrap schemes. Here we use numerical studies to show the advantages of each scheme and provide a guideline for choosing a suitable scheme under different conditions.

The purpose of the experiment is to check whether the $\hat{\theta}^*_b$'s generated using these bootstrap schemes roughly follow the true distribution of $\hat{\theta}$ with different numbers of samples $k$. In the experiment, we first generate $k$ samples from $f(x;\theta_0)$. For each bootstrap scheme, we use these samples to generate the bootstrap parameters $\hat{\theta}^*_1, \dots, \hat{\theta}^*_B$. We use the $\alpha/2$ and $(1-\alpha/2)$ empirical quantiles of these $\hat{\theta}^*_b$'s as the lower and upper bounds of a confidence interval and check whether $\theta_0$ is covered. We repeat this procedure 1000 times with independently generated sample sets. We use the coverage of the truth to test the accuracy of the confidence interval obtained from these schemes. The target confidence level in our experiments is 95%.
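A scaled-down version of this kind of coverage check can be sketched in Python for the parametric bootstrap. The true rate `1.0`, the sample size `k=50`, the 100 bootstrap replications, and the 200 repetitions are hypothetical values chosen so that the example runs quickly; they are not the settings behind the reported tables.

```python
import math
import random


def exponential_mle(data):
    """MLE of the rate of an exponential distribution."""
    return len(data) / sum(data)


def parametric_bootstrap_interval(data, num_replications, alpha, rng):
    """Percentile interval of refitted rates from the fitted exponential model."""
    k = len(data)
    rate_hat = exponential_mle(data)
    boot = sorted(
        exponential_mle([rng.expovariate(rate_hat) for _ in range(k)])
        for _ in range(num_replications)
    )
    lo_index = max(0, math.ceil(alpha / 2.0 * num_replications) - 1)
    hi_index = min(num_replications - 1,
                   math.ceil((1.0 - alpha / 2.0) * num_replications) - 1)
    return boot[lo_index], boot[hi_index]


def coverage(true_rate, k, num_replications, num_repeats, alpha, rng):
    """Fraction of independent repetitions whose interval covers the true rate."""
    hits = 0
    for _ in range(num_repeats):
        data = [rng.expovariate(true_rate) for _ in range(k)]
        lower, upper = parametric_bootstrap_interval(data, num_replications, alpha, rng)
        hits += lower <= true_rate <= upper
    return hits / num_repeats


rng = random.Random(5)
coverage_rate = coverage(true_rate=1.0, k=50, num_replications=100,
                         num_repeats=200, alpha=0.05, rng=rng)
print(f"coverage={coverage_rate:.3f}")
```

With only 200 repetitions the estimated coverage is itself noisy, so it should be read as a rough indication rather than a precise figure.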

The experiment results with different and different distribution models are tabulated in Tables I and II. In the table, “Direct” represents direct bootstrap, “Parametric” represents parametric bootstrap, “Asym Cls” stands for the asymptotic distribution scheme using closed form Fisher’s information and “Asym Est” stands for the asymptotic distribution scheme using empirical Fisher’s information.

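Table I: Coverage of the confidence intervals from each bootstrap scheme (exponential distribution).

Samples | Approach | Object | Coverage |
---|---|---|---|
k=10 | Direct | Exponential parameter | 84.70% |
k=10 | Parametric | Exponential parameter | 92.20% |
k=10 | Asym Cls | Exponential parameter | 88.30% |
k=10 | Asym Est | Exponential parameter | 90.20% |
k=20 | Direct | Exponential parameter | 91.40% |
k=20 | Parametric | Exponential parameter | 93.10% |
k=20 | Asym Cls | Exponential parameter | 93.30% |
k=20 | Asym Est | Exponential parameter | 92.00% |
k=100 | Direct | Exponential parameter | 94.10% |
k=100 | Parametric | Exponential parameter | 95.10% |
k=100 | Asym Cls | Exponential parameter | 94.30% |
k=100 | Asym Est | Exponential parameter | 95.20% |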

Firstly, we consider the exponential distribution for $X$. From Table I, we observe that when $k=10$, the coverage rates of all schemes have an obvious gap to the target 95%. For the direct bootstrap, this is because the empirical distribution is built from too few samples and hence is not a very good approximation of the true distribution. For the parametric bootstrap, this is caused by the error in estimating $\hat{\theta}$. For the asymptotic approaches, the poor performance is caused by both the poor estimation of $\hat{\theta}$ and the small value of $k$ (note that the asymptotic behavior requires $k \to \infty$). Among these approaches, the parametric bootstrap has the smallest gap. This is partly because the assumption of the correct parametric model remedies the error from the variability of the samples. As we increase the value of $k$, the gap between the target coverage and the obtained coverage reduces. When we use $k=100$, the coverage rates of all schemes are already close to the target.

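Table II: Coverage of the confidence intervals from each bootstrap scheme (Gaussian distribution).

Samples | Approach | Object | Coverage |
---|---|---|---|
k=20 | Direct | Mean | 92.10% |
k=20 | Direct | Variance | 88.60% |
k=20 | Parametric | Mean | 92.30% |
k=20 | Parametric | Variance | 93.00% |
k=20 | Asym Cls | Mean | 92.90% |
k=20 | Asym Cls | Variance | 92.80% |
k=20 | Asym Est | Mean | 92.60% |
k=20 | Asym Est | Variance | 93.50% |
k=100 | Direct | Mean | 95.20% |
k=100 | Direct | Variance | 91.70% |
k=100 | Parametric | Mean | 94.80% |
k=100 | Parametric | Variance | 93.70% |
k=100 | Asym Cls | Mean | 95.00% |
k=100 | Asym Cls | Variance | 93.50% |
k=100 | Asym Est | Mean | 94.90% |
k=100 | Asym Est | Variance | 93.40% |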

We observe similar performance in the experiment results on the Gaussian distribution presented in Table II. In this set of experiments, we note that the direct bootstrap cannot obtain good coverage for the variance parameter even with 100 samples.

In summary, the parametric bootstrap provides better coverage of the truth, especially when the number of samples is very small. When the number of samples is large enough, the coverage of these schemes is similar. With a sufficient sample size, the asymptotic schemes have the upper hand in the efficiency of generating the parameters.

### IV-B Illustrative Problem

We consider a simple probability estimation problem to demonstrate the effectiveness of the proposed approach in providing a valid confidence interval. This is shown in two aspects: a) we show the coverage of the proposed approach is close to the target, b) we show the confidence interval width is relatively narrow.

Table III: Coverage and confidence interval width in the illustrative problem.

Samples | 100 | 1000 | 10000 |
---|---|---|---|
Coverage CF | 0.9432 | 0.9451 | 0.9505 |
CI Width CF | 1.33e-05 | 8.85e-07 | 2.20e-07 |
Coverage LR | 0.9426 | 0.9444 | 0.9486 |
CI Width LR | 1.33e-05 | 8.85e-07 | 2.20e-07 |
Coverage SU | 0.0177 | 0.0630 | 0.1903 |
CI Width SU | 8.28e-08 | 3.08e-08 | 2.72e-08 |

We consider estimating the probability of , where follows a standard Gaussian distribution and we use . This problem is chosen because the probability has an analytic solution, which makes it easier to validate whether the constructed confidence interval covers the truth. We use different sample sizes for estimating $\hat{\theta}$. We use bootstrap samples to construct the confidence interval. For the estimation of $\mu(\hat{\theta}^*_b)$, we consider two approaches. The first is the proposed approach with 10,000 importance sampling samples. To show that the constructed CI has a relatively narrow width, we also consider using the analytic solution of $\mu(\hat{\theta}^*_b)$ for each $\hat{\theta}^*_b$ as a baseline (so that there is no simulation uncertainty). We repeat this for 10,000 total replications and compute the coverage of the confidence interval.

The experiment results are summarized in Table III. “Coverage CF” and “CI Width CF” represent the coverage rate and confidence interval width computed using the closed form probability for each bootstrap parameter. “Coverage LR” and “CI Width LR” represent the results for the proposed approach that uses likelihood ratio estimation. “Coverage SU” and “CI Width SU” represent the results using (6) and (7) that only consider simulation uncertainty (with input uncertainty ignored).

We have two main observations from these experiment results: a) the proposed likelihood ratio scheme provides a good estimate of the probability of interest. This claim is supported by the similar coverage rates and confidence interval widths of the closed form baseline and our proposed approach. b) the confidence interval that does not incorporate the input uncertainty is misleading. This observation is revealed by its low coverage rates (especially when the sample size is smaller, which means more variability) and narrow confidence interval width. This experiment shows that the proposed approach provides valid confidence intervals for the estimation and that ignoring the input uncertainty might be problematic.

### IV-C Accelerated Evaluation Example

To demonstrate the proposed approach, we consider the AV evaluation problem and autonomous vehicle model discussed in [1]. The lane change test scenario is shown in Figure 1, where we evaluate the safety level of a test AV by estimating the probability of a crash when a frontal car cuts into the lane. The traffic environment in this scenario is represented by $v_L$, the initial velocity of the frontal vehicle, $R_L$, the initial range between the two vehicles, and $TTC_L$, the time-to-collision value defined by $TTC_L = -R_L / \dot{R}_L$.

In our problem, we consider the frontal car to have an initial velocity , which is a common speed in highway driving. We extract 12,304 lane change scenario samples identified from the SPMD dataset [18] with similar velocity. We use the samples to fit and with exponential distribution. We used the cross-entropy method to find an optimal accelerating distribution by exponential tilting for and and generated samples from the accelerating distribution. We then use the proposed approach to construct confidence interval for the input uncertainty.

In Figure 2, we present the probability estimate and the two types of confidence interval we construct given different numbers of samples. The confidence interval for simulation uncertainty is computed using (6) and (7). We observe that the confidence interval for simulation uncertainty has a much smaller width than the confidence interval for input uncertainty. This observation indicates that if the input uncertainty is ignored, the evaluation results can be misleading. For instance, if we use the confidence upper bound to interpret the safety level of a vehicle, the input uncertainty upper bound is roughly 1.5 times the simulation uncertainty upper bound; hence using only the simulation uncertainty would underestimate the risk of crash.

Figure 3 shows how the widths of the two intervals change as the number of experiment trials $n$ increases. As known in the literature, the width of the simulation uncertainty confidence interval shrinks in the order of $1/\sqrt{n}$. This trend can be easily observed from the figure. On the other hand, since we are not changing the number of samples we use to estimate $\hat{\theta}$, the input uncertainty confidence interval should not change as $n$ increases. We can confirm this in the figure, where the confidence interval does not change much when the number of experiment trials is sufficiently large. When $n$ is small, the interval width of input uncertainty is not as stable as when $n$ is large. This is because when we do not have enough samples, the simulation uncertainty becomes large and perturbs the estimation of the input uncertainty. Therefore, accounting for the input uncertainty in AV evaluation that uses a Monte Carlo approach, especially when the safety-critical cases are rare, plays a crucial role in deriving accurate conclusions.
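The shrinking width of the simulation-only interval can be reproduced on a toy problem with the short Python sketch below. The exponential model, the threshold `2.0`, the three sample sizes, and the function name `ci_width` are assumptions for this example and are unrelated to the data behind Figure 3.

```python
import math
import random
from statistics import NormalDist, stdev


def ci_width(outputs, alpha=0.05):
    """Width of the normal-approximation interval for the mean of outputs."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * z * stdev(outputs) / math.sqrt(len(outputs))


rng = random.Random(6)
widths = {}
for n in (1_000, 4_000, 16_000):
    # Toy trials: indicator of X > 2 with X ~ Exponential(rate=1).
    outputs = [1.0 if rng.expovariate(1.0) > 2.0 else 0.0 for _ in range(n)]
    widths[n] = ci_width(outputs)
    print(n, f"{widths[n]:.4f}")
```

Each fourfold increase in the number of trials roughly halves the width, while nothing in this computation reflects how much data was used to fit the input model.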

## V Discussion

In this paper, we propose an approach to assess the input uncertainty in Monte Carlo based AV test methods, which requires zero additional experiment trials. The proposed approach is shown to be computationally efficient and easy to implement while providing valid confidence intervals that incorporate input uncertainty. In the future, we plan to extend our study to model-free input uncertainty analysis to cover a wider range of application domains.

## References

- [1] D. Zhao, H. Lam, H. Peng, S. Bao, D. J. LeBlanc, K. Nobukawa, and C. S. Pan, “Accelerated evaluation of automated vehicles safety in lane-change scenarios based on importance sampling techniques,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 3, pp. 595–607, 2017.
- [2] N. Kalra and S. M. Paddock, “Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability?” Transportation Research Part A: Policy and Practice, vol. 94, pp. 182–193, 2016.
- [3] A. Broadhurst, S. Baker, and T. Kanade, “Monte Carlo road safety reasoning,” in IEEE Proceedings. Intelligent Vehicles Symposium, 2005. IEEE, 2005, pp. 319–324.
- [4] A. Eidehall and L. Petersson, “Statistical threat assessment for general road scenes using Monte Carlo sampling,” IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 1, pp. 137–147, 2008.
- [5] D. Zhao, X. Huang, H. Peng, H. Lam, and D. J. LeBlanc, “Accelerated evaluation of automated vehicles in car-following maneuvers,” IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 3, pp. 733–744, 2018.
- [6] Z. Huang, H. Lam, D. J. LeBlanc, and D. Zhao, “Accelerated evaluation of automated vehicles using piecewise mixture models,” IEEE Transactions on Intelligent Transportation Systems, no. 99, pp. 1–11, 2017.
- [7] B. Efron, “Bootstrap methods: another look at the jackknife,” in Breakthroughs in statistics. Springer, 1992, pp. 569–593.
- [8] B. Efron, The jackknife, the bootstrap, and other resampling plans. SIAM, 1982, vol. 38.
- [9] R. C. Cheng and W. Holland, “Sensitivity of computer simulation experiments to errors in input data,” Journal of Statistical Computation and Simulation, vol. 57, no. 1-4, pp. 219–241, 1997.
- [10] R. Barton, S. Chick, R. Cheng, S. Henderson, A. Law, B. Schmeiser, L. Leemis, L. Schruben, and J. Wilson, “Panel discussion on current issues in input modeling,” in Proceedings of the 2002 Winter Simulation Conference. IEEE, 2002, pp. 353–369.
- [11] S. G. Henderson, “Input model uncertainty: Why do we care and what should we do about it?” in Winter Simulation Conference, vol. 1, 2003, pp. 90–100.
- [12] S. E. Chick, “Bayesian ideas and discrete event simulation: why, what and how,” in Proceedings of the 38th conference on Winter simulation. Winter Simulation Conference, 2006, pp. 96–105.
- [13] R. R. Barton, “Input uncertainty in output analysis,” in Proceedings of the Winter Simulation Conference. Winter Simulation Conference, 2012, p. 6.
- [14] E. Song, B. L. Nelson, and C. D. Pegden, “Advanced tutorial: Input uncertainty quantification,” in Proceedings of the Winter Simulation Conference 2014. IEEE, 2014, pp. 162–176.
- [15] H. Lam, “Advanced tutorial: Input uncertainty and robust analysis in stochastic simulation,” in 2016 Winter Simulation Conference (WSC). IEEE, 2016, pp. 178–192.
- [16] B. Nelson, Foundations and methods of stochastic simulation: a first course. Springer Science & Business Media, 2013.
- [17] R. R. Barton, B. L. Nelson, and W. Xie, “A framework for input uncertainty analysis,” in Proceedings of the Winter Simulation Conference. Winter Simulation Conference, 2010, pp. 1189–1198.
- [18] D. Bezzina and J. R. Sayer, “Safety Pilot: Model Deployment Test Conductor Team Report,” UMTRI, Tech. Rep., 2014.