Sensitivity to Serial Dependency of Input Processes: A Robust Approach
Henry Lam
Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI 48109, khlam@umich.edu
Procedures for assessing the impact of serial dependency on performance analysis are usually built on parametrically specified models. In this paper, we propose a robust, nonparametric approach to carry out this assessment, by computing the worst-case deviation of the performance measure due to arbitrary dependence. The approach is based on optimizations, posited on the model space, that have constraints specifying the level of dependency measured by a nonparametric distance to some nominal i.i.d. input model. We study approximation methods for these optimizations via simulation and analysis-of-variance (ANOVA). Numerical experiments demonstrate how the proposed approach can discover the hidden impacts of dependency beyond those revealed by conventional parametric modeling and correlation studies.
Key words: sensitivity analysis; serial dependency; nonparametric; simulation; optimization
While stylized models in operations and service systems often assume independently generated inputs, studies have pointed to the ubiquity of serial dependency in many real-life applications. Examples include arrival streams in file transfers (Ware et al. (1998)), video frames (Melamed et al. (1992)) and call centers (Ibrahim et al. (2012)). For decision makers, ignoring serial dependency in the input model for performance analysis potentially leads to substantial errors (e.g., Livny et al. (1993)). Being able to assess the impact of dependency on a given performance analysis is, therefore, crucial for reliable decision making.
The common way of assessing the effect of dependency is model-based. Namely, one builds a parametric dependent model for the input, and carries out the performance analysis at different values of the parameter, which typically corresponds to correlation. While this line of study is no doubt convenient, it relies heavily on the correctness of the model: if the truth does not lie in the parametric model family, there is no guarantee that the estimate of the effect of dependency is accurate. As an illustration, consider the following simple example:
Example 1
Consider two uniform random variables $X$ and $Y$ on the interval $[0,1]$. Suppose the task is to assess how dependency between $X$ and $Y$ affects a quantity of interest, say $E[h(X,Y)]$ for some function $h$. One way is to construct a Gaussian copula on $(X,Y)$ with a correlation parameter, vary this parameter starting from 0, and observe how $E[h(X,Y)]$ behaves. Despite being straightforward to apply, this approach may not provide valid information if the true dependency structure between $X$ and $Y$ is non-Gaussian (and it might not even be expressible in closed form).
This issue arises more generally for stochastic inputs on a process level. For instance:
Example 2
Consider a first-come-first-serve single-server queue. Suppose the task is to assess what happens to a performance measure, say the probability of overload, if the interarrival times are not i.i.d. but are serially dependent. One can build a first-order autoregressive model with Gaussian noise, i.e. AR(1), and embed it into the interarrival sequence with exponential marginal distribution (the so-called autoregressive-to-anything (ARTA) procedure; see Cario and Nelson (1996)). Simulating the overload probability at different values of the first-order lag parameter away from zero in the AR(1) model gives an assessment of the effect of serial dependency. Again, the AR(1) model is specific. Even within the class of all processes that are dependent only at lag one (or equivalently, Markovian processes), there can be many choices that are nonlinear or non-Gaussian. If the true dependency structure is any of these other possibilities, the present assessment scheme can underestimate the actual impact of dependency.
This paper investigates a new scheme for assessing the implication of dependency on performance measures like those above for any form of dependency. Rather than tuning a correlation parameter in a pre-specified parametric family of dependent models, the proposed approach relies on tuning a unit of measurement for the level of dependency that applies universally to all models. This unit of measurement is the so-called Pearson's $\phi^2$-coefficient (Cramér (1999)), which can be defined on any bivariate joint probability distribution regardless of its parametric form. With this unit, the central ingredient of our methodology is a worst-case analysis: we aim to find the most extreme values of the performance measure of interest among all dependent models whose $\phi^2$-coefficient lies within a tunable threshold. This can be formulated as optimization programs over the space of models, with constraints on the type of input processes and the dependency level. We will illustrate our methodology on Examples 1 and 2 in the sequel.
Our worst-case framework is based on the lag-dependent class of stationary processes, which generalizes autoregressive processes without restricting to a parametric form, and has been used in nonparametric hypothesis tests in econometrics (e.g., Robinson (1991), Granger et al. (2004), Granger and Lin (1994), Chan and Tran (1992)). In general, worst-case optimizations over such a nonparametric class of dependent processes are nonconvex and intractable. To handle this issue, we develop approximations of the optimizations under the asymptotic regime in which the $\phi^2$-coefficient shrinks to zero. Our main contributions lie in the extraction of conceptual insights and the computational methods for these approximations. First, we show that the worst-case dependency effects, measured by the deviations of the optimal objective values from a baseline independent model, can generally be approximated via the "interaction" variance of a suitable analysis-of-variance (ANOVA) decomposition of the performance measure. The magnitude of this interaction variance depends on the degree to which the performance measure can be separated additively among its input variates. Second, based on these insights, we build simulation-based algorithms that combine with ANOVA procedures to numerically compute the interaction variance, and in turn approximate the worst-case values of the optimizations. We provide detailed analyses of the design, sampling efficiencies and statistical properties of our algorithms.
Asymptotic methods for approximating worst-case optimizations have recently been proposed in Lam (2013), which studies uncertainty on the marginal distributions among i.i.d. variates. Regarding the motivation for quantifying model misspecification, our results going beyond Lam (2013) to handle dependency structures are important for the following reason: whereas marginal distributions can often be inferred consistently from data via the empirical distribution, or via goodness-of-fit selections among plenty of available parametric models, the counterparts for simulable dependent models are much less developed. Empirical methods for building transition or conditional densities in dependent processes involve kernel estimators (e.g. Fan and Yao (2003)) that require bandwidth tuning whose statistical errors can be difficult to quantify, and the resulting model can also be difficult to simulate. Moreover, the choices of parametric models for serially dependent input processes are limited, mostly to linear dependence. Thus, unlike the estimation of marginal distributions, the errors in even the best statistical fits for serial dependency structures may not go away with more abundant input data. One major view suggested by this paper is therefore to take the pragmatic approach of running a simple baseline model, and robustifying it by looking at the worst-case performance among models within a calibrated distance, such as the $\phi^2$-coefficient, that measures the amount of dependency not captured by the baseline. Our developed results set a foundation for more extended investigations along this line.
We close this introduction with a brief review of other related work. The literature of dependency modeling in stochastic simulation mostly surrounds the fitting of versatile, parametrically based models, e.g., ARTA (Cario and Nelson (1996), Biller and Nelson (2005)) and TES (Melamed et al. (1992), Melamed (1991)) in the context of serial dependence, and NORTA (a sister scheme of ARTA; Cario and Nelson (1997)), chessboard distributions (Ghosh and Henderson (2002)) and tree-dependent variables (Bedford and Cooke (2002)) for cross-sectional and high-dimensional dependence. Simulation-based studies on the effect of dependency in single-server queues have been carried out by Livny et al. (1993) and Mitchell et al. (1977) among others. There are also some analytical results on the correlation effects in other stylized queueing models (e.g., Pang and Whitt (2012a, b)).
Our formulation is inspired by the literature on distributionally robust optimization and model uncertainty, which considers stochastic optimization problems where the underlying probabilistic model is itself not fully specified and instead believed to lie in some feasible set (e.g. Ben-Tal et al. (2013), Goh and Sim (2010)). In particular, Delage and Ye (2010) consider moment constraints, including cross-moments to represent correlation structure. Glasserman and Xu (2014) consider the scenario of bivariate variables with conditional marginal constraints. Our formulation specifically utilizes a statistical distance to measure the level of uncertainty centered around a nominal model, which has been used in economics (Hansen and Sargent (2008)), finance (Glasserman and Xu (2014, 2013)) and stochastic control (Petersen et al. (2000), Nilim and El Ghaoui (2005), Iyengar (2005)).
The remainder of this paper is organized as follows. Section 2 introduces the notion of the $\phi^2$-coefficient and some notation. Section 3 discusses the worst-case optimization programs for bivariate settings. Section 4 discusses more general input processes within the 1-dependence class. Section 5 gives the details of numerical implementations and illustrations. Section 6 extends the results to the 2-dependence class. Lastly, Section 7 discusses future extensions and concludes. All proofs are presented in the Appendix and the Supplementary Materials.
The $\chi^2$-distance between two probability distributions, say $P_1$ and $P_2$, is defined as
\[ \chi^2(P_1, P_2) = E_2\left[\left(\frac{dP_1}{dP_2} - 1\right)^2\right], \]
where $dP_1/dP_2$ is the Radon-Nikodym derivative between $P_1$ and $P_2$, and $E_2[\cdot]$ denotes the expectation under $P_2$. As a canonical example, if $P_1$ and $P_2$ are absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^d$ with densities $f_1$ and $f_2$, then $dP_1/dP_2$ is represented by $f_1/f_2$, and $\chi^2(P_1,P_2) = \int \left(f_1(x)/f_2(x) - 1\right)^2 f_2(x)\,dx$.
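For distributions on a common finite support the definition reduces to a finite sum, which the following minimal sketch computes (the two pmfs used in the illustration are arbitrary choices, not from the paper):

```python
def chi_square_distance(p1, p2):
    """Chi-square distance between two pmfs on a common finite support:
    sum over the support of (p1[i]/p2[i] - 1)^2 weighted by p2[i].
    Requires p2[i] > 0 wherever p1[i] > 0 (absolute continuity)."""
    return sum((a / b - 1.0) ** 2 * b for a, b in zip(p1, p2))
```

Identical distributions give distance 0; for instance, `chi_square_distance([0.6, 0.4], [0.5, 0.5])` works out to $0.2^2 \cdot 0.5 + 0.2^2 \cdot 0.5 = 0.04$.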
The $\phi^2$-coefficient, defined on a bivariate probability distribution, is the $\chi^2$-distance between this distribution and its independent counterpart. To be clear, for a bivariate random vector $(X,Y)$, let us denote $P$ as its probability distribution, and $\bar P$ as the distribution having the same marginal distributions as $P$ but with $X$ and $Y$ being independent; e.g., if $P$ is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^2$ with density $p(x,y)$, then $\bar P$ has density $p_X(x)p_Y(y)$, where $p_X(x) = \int p(x,y)\,dy$ and $p_Y(y) = \int p(x,y)\,dx$. The $\phi^2$-coefficient of $P$ is defined as
\[ \phi^2(P) = \chi^2(P, \bar P). \]
(It is also common to consider the square root of $\phi^2$, which is called the $\phi$-coefficient.)
For example, suppose $P$ is a bivariate normal distribution with zero means, unit variances and correlation $\rho$. Then $\bar P$ is a bivariate normal distribution with zero means, unit variances and zero correlation, and
\[ \phi^2(P) = \int\!\!\int \left(\frac{\varphi_\rho(x,y)}{\varphi(x)\varphi(y)} - 1\right)^2 \varphi(x)\varphi(y)\,dx\,dy = \frac{\rho^2}{1-\rho^2}, \]
where $\varphi(\cdot)$ denotes the standard normal density and $\varphi_\rho(\cdot,\cdot)$ denotes the bivariate normal density with standard marginals and correlation $\rho$.
Note that the $\phi^2$-coefficient of any independent joint distribution is zero. By definition, it is calculated by fixing the marginal distributions, thus isolating only the amount of dependency. Importantly, its definition is not restricted to a particular parametric model.
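For a discrete bivariate distribution the $\phi^2$-coefficient can be computed directly from the joint probability table; a minimal sketch (the tables in the usage notes are arbitrary illustrations):

```python
import numpy as np

def phi_square(joint):
    """Pearson's phi^2-coefficient of a bivariate pmf given as a 2-D table:
    the chi-square distance between the table and the product of its marginals."""
    joint = np.asarray(joint, dtype=float)
    # independent counterpart: outer product of the two marginals
    indep = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    return float((((joint / indep - 1.0) ** 2) * indep).sum())
```

Any product table (independent $X$ and $Y$) returns 0; the table `[[0.3, 0.2], [0.2, 0.3]]`, whose marginals are both $(0.5, 0.5)$, has cell deviations of $\pm 0.05$ from the independent value $0.25$, giving $\phi^2 = 4 \cdot (0.05/0.25)^2 \cdot 0.25 = 0.04$.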
To illustrate the basic idea of our method, we first focus on the bivariate setting. Consider a bivariate random vector $(X,Y) \in \mathcal{X}\times\mathcal{Y}$ for some spaces $\mathcal{X}$ and $\mathcal{Y}$, and a performance measure denoted $E[h(X,Y)]$, where $h:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$. Our premise is that a baseline probability distribution $P_0$ has been chosen for $(X,Y)$, where $X$ and $Y$ are independent under $P_0$. We denote $P_{0,X}$ and $P_{0,Y}$ as the (potentially different) marginal distributions of $X$ and $Y$ respectively. In other words, $P_0 = P_{0,X}\times P_{0,Y}$. We denote $E_0[\cdot]$ and $\mathrm{Var}_0(\cdot)$ as the expectation and variance respectively under $P_0$. We also denote $\mathcal{A}$ as the space of all distributions that are absolutely continuous with respect to $P_0$.
Our scheme relies on the worst-case optimization programs
(1) \[ \max/\min\; E_f[h(X,Y)] \quad \text{s.t.}\quad \chi^2(P_f, P_0)\le\eta,\quad P_{f,X}=P_{0,X},\quad P_{f,Y}=P_{0,Y},\quad P_f\in\mathcal{A}. \]
The decision variable is $P_f$, and we use $E_f[\cdot]$ to denote the expectation under $P_f$. The constraints $P_{f,X}=P_{0,X}$ and $P_{f,Y}=P_{0,Y}$ state that $P_f$ has the same marginal distributions as $P_0$. The constraint $\chi^2(P_f,P_0)\le\eta$ controls the level of dependency of $P_f$ to be within $\eta$ units of $\phi^2$-coefficient or, equivalently (because the marginals are fixed), $\eta$ units of $\chi^2$-distance away from $P_0$. Overall, (1) gives the worst-case values (on both the upper and lower ends) of the performance measure among all models with the same marginals as the baseline and $\phi^2$-coefficient at most $\eta$.
The optimality of (1) is characterized by the following proposition:
Proposition 1
Define $R(x,y) = h(x,y) - E_0[h(x,Y)] - E_0[h(X,y)] + E_0[h(X,Y)]$. Suppose $h$ is bounded and $\mathrm{Var}_0(R(X,Y)) > 0$. Then, for sufficiently small $\eta$, the optimal values of the max and min formulations in (1) satisfy
(2) \[ \max E_f[h(X,Y)] = E_0[h(X,Y)] + \sqrt{\eta\,\mathrm{Var}_0(R(X,Y))} \]
and
(3) \[ \min E_f[h(X,Y)] = E_0[h(X,Y)] - \sqrt{\eta\,\mathrm{Var}_0(R(X,Y))} \]
respectively.
Therefore, obtaining the worst-case performance measure requires only calculating the standard deviation of $R(X,Y)$. The intuition behind $R$ boils down to the question: with fixed marginal distributions of $X$ and $Y$, how does dependency between $X$ and $Y$ affect $E[h(X,Y)]$? Obviously, when $h$ is additively separable, i.e. $h(x,y) = h_1(x) + h_2(y)$ for some functions $h_1$ and $h_2$, dependency does not exert any effect on the value of $E[h(X,Y)]$. In other words, any effect of dependency on $E[h(X,Y)]$ must come from the structure of $h$ beyond additive separability, and this is exactly captured by the quantity $R$. To see this, by the definition of $R$, we can decompose $h$ as
(4) \[ h(x,y) = \mu + a(x) + b(y) + R(x,y), \quad \text{where } \mu = E_0[h(X,Y)],\; a(x) = E_0[h(x,Y)] - \mu,\; b(y) = E_0[h(X,y)] - \mu. \]
Using the terminology from ANOVA (Cox and Reid (2002)), $a(x)$ and $b(y)$ can be regarded as the "main effects" of the "factors" $x$ and $y$, and $R(x,y)$ can be regarded as the "interaction effect" that captures any nonlinear interaction beyond the main effects. In particular, when $h$ is additively separable, one can easily see that $R$ is exactly zero. The standard deviation of $R(X,Y)$ represents the magnitude of the interaction: the larger it is, the faster the rate of deviation from the baseline with respect to the level of dependency measured by $\eta$.
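The decomposition (4) is easy to exercise numerically. The sketch below decomposes a table of values $h(x_i, y_j)$ into grand mean, main effects and interaction, taking the hypothetical non-separable choice $h(x,y) = xy$ on a uniform grid over $[0,1]^2$ (our illustration, not the paper's); for this $h$ the interaction term is $(x-1/2)(y-1/2)$, with standard deviation $1/12$:

```python
import numpy as np

def anova_decompose(H):
    """Decompose a table of values h(x_i, y_j) (grid points equally likely)
    into grand mean, main effects a, b, and interaction term R as in (4)."""
    mu = H.mean()
    a = H.mean(axis=1) - mu            # main effect of the first variate
    b = H.mean(axis=0) - mu            # main effect of the second variate
    R = H - mu - a[:, None] - b[None, :]
    return mu, a, b, R

# hypothetical non-separable example: h(x, y) = x * y on a 200 x 200 grid
x = (np.arange(200) + 0.5) / 200.0
H = np.outer(x, x)
mu, a, b, R = anova_decompose(H)
interaction_sd = np.sqrt((R ** 2).mean())  # here R = (x - 1/2)(y - 1/2), sd ~ 1/12
```

The interaction term is doubly centered (its row and column means vanish), and for any additively separable $h$ it is identically zero.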
Example 1 Revisited. We illustrate the use of Proposition 1 with Example 1. Consider an objective $E[h(X,Y)]$ with uniform marginal distributions for $X$ and $Y$ over $[0,1]$. One standard way to assess the effect of dependency is to construct a Gaussian copula (Mai and Scherer (2012)), i.e. $C_\rho(u,v) = \Phi_\rho(\Phi^{-1}(u), \Phi^{-1}(v))$, as the bivariate distribution function of $(X,Y)$, where $\Phi^{-1}$ is the quantile function of a standard normal variable and $\Phi_\rho$ is the distribution function of a bivariate normal distribution with standard marginals and correlation $\rho$. Instead of this choice, a modeler could also use other copulas, such as:
the Gumbel copula $C_\theta(u,v) = \exp\{-((-\ln u)^\theta + (-\ln v)^\theta)^{1/\theta}\}$, $\theta \ge 1$;
the Clayton copula $C_\theta(u,v) = (u^{-\theta} + v^{-\theta} - 1)^{-1/\theta}$, $\theta > 0$;
the Ali-Mikhail-Haq (AMH) copula $C_\theta(u,v) = uv/(1 - \theta(1-u)(1-v))$, $\theta \in [-1,1)$.
The effect of dependency in these models can each be assessed by varying the parameter $\rho$ or $\theta$ (with $\rho = 0$ corresponding to independence for the Gaussian copula, $\theta = 1$ for Gumbel, and $\theta = 0$ for Clayton and AMH). Each model operates on its own specific unit and form of dependency.
Alternatively, without any presumption on the true model, we can use the $\phi^2$-coefficient as the universal measurement of dependency. Figure 1 shows the upper and lower worst-case bounds, as functions of $\eta$, obtained by simply calculating (2) and (3) in Proposition 1. For comparison, we also plot the values of $E[h(X,Y)]$ when evaluated at the specific copulas above with increasing or decreasing parameters. The worst-case bounds can be seen to dominate all these copulas. It appears that among them, Gaussian and AMH are the closest to the worst-case bounds, whereas Clayton is farther away and Gumbel has the least effect. To give some perspective, if a modeler uses the Gaussian model to assess dependency by only varying the correlation parameter, he/she is restricted to the Gaussian dots in Figure 1, which have a gap from the worst-case scenario and hence may underestimate the actual effect of dependency if the true model is not Gaussian. On the other hand, if the modeler uses the $\phi^2$-coefficient, then he/she can output a range of all possible performance values that covers any model form, including Gaussian, hence avoiding the issue of potential model misspecification.
As a further detail, the worst-case joint densities that achieve the depicted upper and lower bounds in Figure 1 are
\[ p_{\pm}(x,y) = 1 \pm \sqrt{\eta}\,\frac{R(x,y)}{\sqrt{\mathrm{Var}_0(R(X,Y))}}, \qquad (x,y)\in[0,1]^2. \]
This can be seen easily from the proof of Proposition 1 (in the Appendix).
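To make Proposition 1 concrete, here is a small Monte Carlo sketch of the worst-case bounds, taking the hypothetical objective $h(x,y) = xy$ with uniform marginals (our choice for illustration); for this $h$ the interaction term is $(x-1/2)(y-1/2)$ and its exact standard deviation is $1/12$:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 1_000_000
X, Y = rng.uniform(size=n), rng.uniform(size=n)   # independent baseline P_0
h = X * Y
# For h(x,y) = xy: R(x,y) = h - E0[h(x,Y)] - E0[h(X,y)] + E0[h] = (x-1/2)(y-1/2)
R = (X - 0.5) * (Y - 0.5)
sd_R = R.std()                                    # exact value is 1/12

eta = 0.1                                         # illustrative threshold
upper = h.mean() + np.sqrt(eta) * sd_R            # worst-case upper bound (2)
lower = h.mean() - np.sqrt(eta) * sd_R            # worst-case lower bound (3)
```

The baseline value $E_0[XY] = 1/4$ then sits between the two bounds $1/4 \pm \sqrt{\eta}/12$.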
In practice, using our method requires interpreting and choosing $\eta$. Some preliminary sense of its scale can be drawn by comparing with specific models. For instance, in Example 1 above, the $\phi^2$-coefficient of a Gaussian copula with correlation $\rho$ is $\rho^2/(1-\rho^2)$, so any value of $\eta$ can be translated into an equivalent Gaussian correlation level. In general, we suggest calibrating $\eta$ from historical data (see the concluding section for a discussion and related references).
We now turn our focus to serial dependency in more general stochastic input sequences. To fix some notation, we denote an input process as $X_1, X_2, \dots, X_T$, where $T$ is the time horizon. Let the performance measure of interest be $E[h(X_1,\dots,X_T)]$, where $h:\mathcal{X}^T\to\mathbb{R}$ is a known cost function (i.e., one that can be evaluated by computer, but that does not necessarily have a closed form). Our premise is that the modeler adopts a baseline model for the input process that describes $X_1,\dots,X_T$ as i.i.d. random variables. As before, we denote the probability distribution of the baseline model as $P_0$, with corresponding expectation $E_0[\cdot]$ and variance $\mathrm{Var}_0(\cdot)$.
For instance, in the queueing setting in Example 2, $X_1,\dots,X_T$ is the sequence of interarrival times, $h$ can be the indicator of whether the $T$-th customer's waiting time exceeds a threshold, so that $E[h(X_1,\dots,X_T)]$ is the overload probability, and the baseline model is described by the joint distribution of the i.i.d. sequence of exponential $X_t$'s.
To define our scope of assessment, we introduce the concept of $p$-dependence, commonly used in nonparametric time series modeling (see, e.g., Robinson (1991)). A $p$-dependent process is a process where, conditioned on the last $p$ steps, the current state is independent of all states that are more than $p$ steps back in the past, i.e. for any $t$, conditioning on $(X_{t-p},\dots,X_{t-1})$, the state $X_t$ is independent of $(X_1,\dots,X_{t-p-1})$. This definition in particular contains AR($p$) models and is the natural nonlinear generalization of autoregression. Our assessment scheme for serial dependency operates on the space of all $p$-dependent processes that are stationary, where the lag parameter $p$ is pre-specified to be 1 or 2. The most basic and important case, $p = 1$, is discussed in this section.
On a high level, the worstcase optimization is written as
(5) \[ \max/\min\; E_f[h(X_1,\dots,X_T)] \quad \text{s.t.}\quad P_f \text{ lies in an } \eta\text{-neighborhood of } P_0;\; P_f \text{ generates a stationary 1-dependent process};\; \text{the marginal distribution of each } X_t \text{ under } P_f \text{ equals that under } P_0;\; P_f\in\mathcal{A}, \]
where $\max/\min$ denotes a pair of max and min problems, and $\mathcal{A}$ denotes the set of all distributions on $\mathcal{X}^T$ that are absolutely continuous with respect to the i.i.d. baseline distribution $P_0$. The second and the third constraints restrict attention to 1-dependent processes with the same marginal structure, thus isolating the effect of dependency. Note that the decision variable is the probability measure $P_f$ of the whole sequence $(X_1,\dots,X_T)$.
The question remains how to define the neighborhood in the first constraint in terms of the $\phi^2$-coefficient. By definition, for a 1-dependent process under stationarity, the distribution of any two consecutive states completely characterizes the process. Hence one can define an $\eta$-neighborhood of $P_0$ to consist of any $P_f$ that satisfies $\chi^2(P_{f,(t,t+1)}, \bar P_{f,(t,t+1)}) \le \eta$, where $P_{f,(t,t+1)}$ and $\bar P_{f,(t,t+1)}$ are the joint distribution of $(X_t,X_{t+1})$ under $P_f$ and its independent counterpart, for any $t$, under stationarity. Since the baseline model entails i.i.d. $X_t$'s and the marginals are fixed, this neighborhood is equivalent to $\chi^2(P_{f,(t,t+1)}, P_{0,(t,t+1)}) \le \eta$.
Therefore, formulation (5) can be written mathematically as
(6) \[ \max/\min\; E_f[h(X_1,\dots,X_T)] \quad \text{s.t.}\quad \chi^2\big(P_{f,(t,t+1)},\, P_{f,t}\times P_{f,t+1}\big)\le\eta,\quad P_{f,t} = P_{0,t} \text{ for all } t,\quad P_f \text{ stationary and 1-dependent},\quad P_f\in\mathcal{A}, \]
where $P_{f,t}$ and $P_{0,t}$ are shorthand for the marginal distributions of $X_t$ under $P_f$ and $P_0$ respectively (which clearly do not rely on $t$ under stationarity).
Unlike (1), the optimization (6) is generally nonconvex. This challenge sits on top of the fact that its decision variable lies in the huge space of probability measures on sample paths of time horizon $T$. Therefore, we consider an approximation of (6) via an asymptotic analysis that shrinks $\eta$ to zero. Given $h$, we define the function $G$ as
(7) \[ G(x,y) = \sum_{t=1}^{T-1} E_0[h(X_1,\dots,X_T) \mid X_t = x, X_{t+1} = y]. \]
We further define $R$ as
(8) \[ R(x,y) = G(x,y) - E_0[G(x,Y)] - E_0[G(X,y)] + E_0[G(X,Y)], \]
where $X$ and $Y$ are i.i.d. random variables, each under the marginal distribution of $P_0$.
With the above definitions, our main result is that:
Theorem 1
Assume $h$ is bounded and $\mathrm{Var}(R(X,Y)) > 0$. The optimal values of the max and min formulations in (6) possess the expansions
(9) \[ \max E_f[h(X_1,\dots,X_T)] = E_0[h(X_1,\dots,X_T)] + \sqrt{\eta\,\mathrm{Var}(R(X,Y))} + O(\eta) \]
and
(10) \[ \min E_f[h(X_1,\dots,X_T)] = E_0[h(X_1,\dots,X_T)] - \sqrt{\eta\,\mathrm{Var}(R(X,Y))} + O(\eta) \]
respectively as $\eta \to 0$, where $R$ is defined in (8), and $X$ and $Y$ are i.i.d. random variables, each distributed under the marginal distribution of $P_0$.
The function $R$ captures the nonlinear interaction effect analogous to that in the bivariate setting, but now it is applied to the function $G$ instead of the cost function $h$ itself. The function $G$ can be interpreted as a summary of the infinitesimal effect of a change in the bivariate distribution of any pair of consecutive elements in the input sequence.
Note that the first-order coefficients in the asymptotic expansions for the max and min formulations in Theorem 1 differ only by their sign, whereas the second-order coefficients (i.e. the "curvature") are the same for both formulations. Thus, in terms of first- or second-order approximation, obtaining either the upper or lower bound gives the other one "for free".
We now discuss how to compute the expansions in Theorem 1, focusing on the first-order coefficient. Our first observation is that we can write the function $G$ as
(11) \[ G(x,y) = \sum_{t=1}^{T-1} E_0[h(X_1,\dots,X_{t-1},x,y,X_{t+2},\dots,X_T)]. \]
Here $h(X_1,\dots,X_{t-1},x,y,X_{t+2},\dots,X_T)$ denotes the cost function with $X_t$ fixed at $x$ and $X_{t+1}$ fixed at $y$, and all other $X_s$'s independently randomly generated under the marginal distribution of $P_0$. Algorithm 1 generates an unbiased sample for $\mathrm{Var}(R(X,Y))$.

Algorithm 1 (estimation of $\mathrm{Var}(R(X,Y))$):
1. Generate outer samples $x_1,\dots,x_{n_1}$ and $y_1,\dots,y_{n_2}$, all i.i.d. under the marginal distribution of $P_0$.
2. For each pair $(i,j)$, $i=1,\dots,n_1$, $j=1,\dots,n_2$, and for each $k = 1,\dots,m$, generate an inner sample $G_{ijk} = \sum_{t=1}^{T-1} h(X_1,\dots,X_{t-1},x_i,y_j,X_{t+2},\dots,X_T)$, where the background variates $X_s$ are drawn i.i.d. under the marginal distribution of $P_0$, freshly for each $t$ and $k$.
3. For each pair $(i,j)$, compute the cell mean $\bar G_{ij\cdot} = (1/m)\sum_k G_{ijk}$, together with the row, column and grand means $\bar G_{i\cdot\cdot}$, $\bar G_{\cdot j\cdot}$ and $\bar G_{\cdot\cdot\cdot}$.
4. Compute the interaction mean square and the residual mean square
\[ MSI = \frac{m}{(n_1-1)(n_2-1)}\sum_{i,j}\big(\bar G_{ij\cdot} - \bar G_{i\cdot\cdot} - \bar G_{\cdot j\cdot} + \bar G_{\cdot\cdot\cdot}\big)^2, \qquad MSE = \frac{1}{n_1 n_2(m-1)}\sum_{i,j,k}\big(G_{ijk} - \bar G_{ij\cdot}\big)^2. \]
5. Output $(MSI - MSE)/m$.
Algorithm 1 is essentially a manifestation of the "interaction effect" interpretation of $R$ discussed earlier. Cast in the setting of a two-way random effects model (Cox and Reid (2002)), each group of outer samples can be regarded as the realizations of a factor. The inner samples are the responses for each realized pair of factor levels. The two mean squares computed in the algorithm are then the interaction mean square and the residual mean square, respectively, in the corresponding two-way ANOVA table. The final output is an unbiased estimator of the interaction variance in the random effects model, which leads to:
Theorem 2
Algorithm 1 gives an unbiased sample for $\mathrm{Var}(R(X,Y))$.
Moreover, this estimator possesses the following sampling error bound:
Theorem 3
Let $\sigma^2 = \mathrm{Var}(G(X,Y))$, where $X$ and $Y$ are i.i.d., each under the marginal distribution of $P_0$. The variance of the estimator from Algorithm 1 is of order $O(1/n_1 + 1/n_2)$ as the outer sample sizes grow, with the implicit constant depending on the moments of $G$ and the inner sample size $m$.
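A sketch of Algorithm 1 in Python, exercised on a toy cost function of our own choosing, $h(x_1,x_2,x_3) = x_1 x_2 + x_2 x_3$ with uniform marginals; a direct calculation for this case gives $G(x,y) = 2xy + (x+y)/2$, hence $R(x,y) = 2(x-1/2)(y-1/2)$ and $\mathrm{Var}(R(X,Y)) = 1/36$, which the estimator should recover:

```python
import numpy as np

rng = np.random.default_rng(0)

def algorithm1(h, T, sample_marginal, n=100, m=4):
    """Two-way ANOVA estimator of the interaction variance Var(R(X, Y)).
    h: cost function taking an array of length T;
    sample_marginal(size): i.i.d. draws from the baseline marginal."""
    x = sample_marginal(n)                     # outer samples, factor 1
    y = sample_marginal(n)                     # outer samples, factor 2
    G = np.empty((n, n, m))
    for i in range(n):
        for j in range(n):
            for k in range(m):
                total = 0.0
                for t in range(T - 1):         # unbiased sample of G(x_i, y_j)
                    path = sample_marginal(T)
                    path[t], path[t + 1] = x[i], y[j]
                    total += h(path)
                G[i, j, k] = total
    cell = G.mean(axis=2)
    row = cell.mean(axis=1, keepdims=True)
    col = cell.mean(axis=0, keepdims=True)
    grand = cell.mean()
    # interaction and residual mean squares of the two-way ANOVA table
    msi = m * ((cell - row - col + grand) ** 2).sum() / ((n - 1) ** 2)
    mse = ((G - cell[:, :, None]) ** 2).sum() / (n * n * (m - 1))
    return (msi - mse) / m
```

For an additively separable cost function the output should be close to zero, reflecting the absence of interaction.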
To construct a confidence interval for $\sqrt{\mathrm{Var}(R(X,Y))}$, one can generate a number of replications, say $K$, of the output in Algorithm 1. Say they are $W_1,\dots,W_K$. Using the delta method (Asmussen and Glynn (2007)), an asymptotically valid $(1-\alpha)$-level confidence interval for $\sqrt{\mathrm{Var}(R(X,Y))}$ is given by
\[ \sqrt{\bar W} \pm t_{K-1,1-\alpha/2}\,\frac{S}{2\sqrt{K\bar W}}, \]
where $\bar W$ is the sample mean, and $S^2$ is the sample variance of the $W_i$'s. The quantity $t_{K-1,1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of the $t$ distribution with $K-1$ degrees of freedom.
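The interval above takes only a few lines to compute; a minimal sketch (the $t$-quantile is passed in as an argument to keep the code dependency-free):

```python
import math

def sqrt_variance_ci(W, t_quantile):
    """Delta-method CI for sqrt(Var(R)) from replications W of Algorithm 1.
    t_quantile: the (1 - alpha/2) quantile of the t distribution with
    len(W) - 1 degrees of freedom."""
    K = len(W)
    mean = sum(W) / K
    s2 = sum((w - mean) ** 2 for w in W) / (K - 1)   # sample variance
    point = math.sqrt(mean)
    # Var(sqrt(Wbar)) ~ Var(Wbar) / (4 * Wbar) by the delta method
    half = t_quantile * math.sqrt(s2 / K) / (2.0 * point)
    return point - half, point + half
```

For instance, with $K = 3$ replications $(1, 2, 3)$ and the $97.5\%$ $t$-quantile $t_{2,0.975} \approx 4.303$, the interval is centered at $\sqrt{2}$ with half-width about $0.878$.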
Consider the single-server queue example (Example 2). The baseline model is described by i.i.d. exponential interarrival times and service times. We are interested in assessing the impact on performance measures when the interarrival times are not i.i.d. but instead serially dependent.
First, we check the qualitative behavior of the first-order coefficient $\sqrt{\mathrm{Var}(R(X,Y))}$ in Theorem 1. We consider the performance measure $P(W_T > b)$, where $W_T$ is the waiting time of the $T$-th customer and $b$ is a threshold. The service rate of the baseline model is fixed at 1, with the arrival rate set to a constant below 1. We compute the first-order coefficient for different values of $T$ and $b$. For the first set of experiments, we fix $b$ and vary $T$; for the second set, we fix $T$ and vary $b$. For each setting we run replications of Algorithm 1 to obtain point estimates and confidence intervals (CIs), with fixed outer and inner sample sizes in Algorithm 1.
Figure 2 and Table 1 show the point estimates and CIs of $\sqrt{\mathrm{Var}(R(X,Y))}$ against $T$. Note the steady increase in its magnitude with $T$. Intuitively, the coefficient should increase with $T$, since $T$ represents the number of variates from the time-series input process consumed in each replication and hence should positively relate to the impact of serial dependency. The depicted trend coincides with this intuition. On the other hand, Figure 3 and Table 2 show an increasing-then-decreasing trend against $b$: the coefficient first rises and then falls as $b$ grows, showing that the impact of dependency is biggest for tail probabilities in the central region, i.e., for thresholds that are neither too big nor too small.
To put things into context, under the baseline model the expected waiting time of the $T$-th customer is around 3; the corresponding baseline tail probabilities, together with the point estimate and 95% CI of the first-order coefficient for the expected waiting time, are reported alongside.
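For readers who wish to reproduce the baseline side of such experiments, the waiting time sequence obeys the Lindley recursion and is straightforward to simulate; in the sketch below, the parameter values (arrival rate 0.5, service rate 1, $T = 500$, $b = 1$) are our illustrative choices, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(7)

def waiting_time_tail(lam, mu, T, b, n_paths):
    """Estimate P(W_T > b) for a FCFS single-server queue with i.i.d.
    Exp(lam) interarrival and Exp(mu) service times, via the Lindley
    recursion W_{t+1} = max(W_t + S_t - A_{t+1}, 0), starting empty."""
    W = np.zeros(n_paths)                 # first customer waits zero
    for _ in range(T - 1):
        S = rng.exponential(1.0 / mu, n_paths)
        A = rng.exponential(1.0 / lam, n_paths)
        W = np.maximum(W + S - A, 0.0)
    return float(np.mean(W > b))
```

For a stable queue and large $T$ the estimate approaches the stationary M/M/1 value $\rho\, e^{-(\mu-\lambda)b}$, which provides a convenient sanity check.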
Next, we assess the performance of our worst-case bounds against some parametric alternatives. Let $X_t$ denote the interarrival times. We consider:

Embedded AR(1) process: $X_t = q\big(\Phi_{0,\sigma^2/(1-\alpha^2)}(Z_t)\big)$, where $q$ is the quantile function of Exp($\lambda$) and $\Phi_{0,s^2}$ is the distribution function of a normal random variable with mean 0 and variance $s^2$. The process $Z_t$ follows a stationary AR(1) model, i.e. $Z_t = \alpha Z_{t-1} + \varepsilon_t$, with i.i.d. $\varepsilon_t \sim N(0,\sigma^2)$ and $Z_1 \sim N(0,\sigma^2/(1-\alpha^2))$ (which is the stationary distribution of this AR(1) model). Note that by construction $X_t$ has marginal distribution Exp($\lambda$). This is because under stationarity all $Z_t$'s have distribution $N(0,\sigma^2/(1-\alpha^2))$, and so $\Phi_{0,\sigma^2/(1-\alpha^2)}(Z_t) \sim U(0,1)$, which in turn implies $X_t \sim$ Exp($\lambda$). The $X_t$'s are correlated due to the embedded AR(1) process $Z_t$.
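A sketch of this ARTA-style generator; whatever the lag parameter $\alpha$, the marginal of the output should remain Exp($\lambda$):

```python
import math
import numpy as np

rng = np.random.default_rng(2)

def embedded_ar1_exponential(alpha, sigma, lam, T):
    """Interarrival times with Exp(lam) marginals, dependence induced by a
    latent stationary Gaussian AR(1) process (an ARTA-style construction)."""
    s = sigma / math.sqrt(1.0 - alpha**2)      # stationary std dev of Z_t
    Z = np.empty(T)
    Z[0] = rng.normal(0.0, s)                  # start in stationarity
    for t in range(1, T):
        Z[t] = alpha * Z[t - 1] + rng.normal(0.0, sigma)
    # normal CDF via erf gives uniform marginals under stationarity
    U = np.array([0.5 * (1.0 + math.erf(z / (s * math.sqrt(2.0)))) for z in Z])
    return -np.log1p(-U) / lam                 # Exp(lam) quantile transform
```

The sample mean of the output should be close to $1/\lambda$, and a positive $\alpha$ induces positive lag-one correlation.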

Embedded MC process: $X_t = q(V_t)$, where $q$ is again the quantile function of Exp($\lambda$), and $Y_t$ is a discrete-time Markov chain with two states $\{0,1\}$ and transition matrix
\[ \begin{pmatrix} 1-a(1-\beta) & a(1-\beta) \\ a\beta & 1-a\beta \end{pmatrix} \]
for some parameters $a$ and $\beta$. The distribution of $V_t$ given $Y_t$ is set to be $U(0,\beta)$ for state 0 and $U(\beta,1)$ for state 1.
Note that $\beta$ is the stationary probability of state 0 for the Markov chain $Y_t$. Thus, by construction, $V_t$ has a uniform distribution over $(0,1)$, and hence $X_t \sim$ Exp($\lambda$). The correlation of the $X_t$'s is induced by the Markov chain $Y_t$.
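A sketch of the embedded MC generator under the parametrization above (the specific transition probabilities $p_{01} = a(1-\beta)$, $p_{10} = a\beta$ are chosen so that the stationary probability of state 0 is $\beta$):

```python
import numpy as np

rng = np.random.default_rng(3)

def embedded_mc_exponential(a, beta, lam, T):
    """Interarrival times with Exp(lam) marginals modulated by a two-state
    Markov chain: state 0 emits V ~ U(0, beta), state 1 emits V ~ U(beta, 1).
    Stationary probability of state 0 is beta, so V is U(0,1) marginally."""
    p01, p10 = a * (1.0 - beta), a * beta
    state = 0 if rng.random() < beta else 1    # start in stationarity
    X = np.empty(T)
    for t in range(T):
        if state == 0:
            V = rng.uniform(0.0, beta)
            if rng.random() < p01:
                state = 1
        else:
            V = rng.uniform(beta, 1.0)
            if rng.random() < p10:
                state = 0
        X[t] = -np.log1p(-V) / lam             # Exp(lam) quantile transform
    return X
```

The sample mean of the output should again be close to $1/\lambda$; smaller $a$ makes the chain stickier and strengthens the serial dependence.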
Figures 4 to 7 compare our worst-case bounds, using only the first-order terms in Theorem 1, to the two parametric dependent models of interarrival times above. We use the mean waiting time of the $T$-th customer as our performance measure. In the figures, the curves are the lower and upper worst-case bounds (the solid ones are calculated using the point estimates of the first-order coefficient, and the dashed ones using the 95% CIs), and the dots are the estimated performance measures of each parametric model (each dot is estimated using 1 million samples, and the white dots above and below are the CIs). In Figure 4, we use the embedded AR(1) model and vary its lag parameter $\alpha$. In Figures 5 and 6, we use the embedded MC model, varying one of its two parameters while fixing the other. Finally, Figure 7 plots all the parametric models on the same scale of $\eta$. As can be seen, the parametric models are all dominated by the worst-case bounds, with the embedded AR(1) model closest to them, followed by the embedded MC models. Similar to Example 1, the key message here is that using the $\phi^2$-coefficient and Theorem 1 alleviates the possibility of underestimating the impact of dependency due to model misspecification.
The following example demonstrates another application of our proposed method. Consider the calculation of delta hedging error for a European option, studied in Glasserman and Xu (2014). Our focus here is the case where the stock price of an underlying asset deviates from the geometric Brownian motion assumption to serially dependent processes.
We adopt some notation from Glasserman and Xu (2014). Consider a European call option, i.e. the payoff of the option is $\max(S_T - K, 0)$, where $T$ is the maturity, $S_T$ is the stock price at maturity, and $K$ is the strike price. We assume a Black-Scholes baseline model for the stock, namely that the stock price $S_t$ follows a geometric Brownian motion
\[ dS_t = \mu S_t\,dt + \sigma S_t\,dB_t, \]
where $B_t$ is a standard Brownian motion, $\mu$ is the growth rate of the stock, and $\sigma$ is the volatility. Assuming trading can be executed continuously over time without transaction fees, it is well known that a seller of the option at time 0 can perfectly hedge its payoff by investing $\partial C(t,S_t)/\partial S$, i.e. the "delta", units of stock at any point of time, with the remainder of the portfolio held as cash earning the risk-free rate $r$. This self-balancing portfolio will exactly offset the payoff of the option at maturity. In practice, however, trading cannot be done continuously, and implementing the Black-Scholes strategy at discrete times will incur a discretization error.
Specifically, suppose that trading is allowed at times $t_i = i\Delta t$ for $i = 0,1,\dots,N$ with $t_N = T$, and that the trader holds the amounts of cash $c_0$ and stock $s_0$:
\[ c_0 = C(0,S_0) - \frac{\partial C}{\partial S}(0,S_0)\,S_0, \qquad s_0 = \frac{\partial C}{\partial S}(0,S_0)\,S_0 \]
at time 0, and the amounts of cash $c_i$ and stock $s_i$:
\[ c_i = c_{i-1}e^{r\Delta t} + \left(\frac{\partial C}{\partial S}(t_{i-1},S_{t_{i-1}}) - \frac{\partial C}{\partial S}(t_i,S_{t_i})\right) S_{t_i}, \qquad s_i = \frac{\partial C}{\partial S}(t_i,S_{t_i})\,S_{t_i} \]
at time $t_i$. Here $C(t,S)$ denotes the Black-Scholes price at time $t$ of a European call option with maturity $T$ and current stock price $S$, and $\partial C/\partial S(t,S)$ is the "delta" of the option at time $t$ with current stock price $S$. The delta hedging error is given by
\[ \mathcal{E} = c_{N-1}e^{r\Delta t} + s_{N-1}\frac{S_{t_N}}{S_{t_{N-1}}} - \max(S_T - K, 0), \]
and the performance measure is set to be $E[\mathcal{E}]$.
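The discrete-time hedging scheme described above is easy to simulate; the sketch below uses illustrative parameter values of our own rather than the paper's settings. A convenient check: when the growth rate equals the risk-free rate, the expected hedging error is exactly zero, since the discounted values of both the self-financing portfolio and the option price are then martingales.

```python
import math
import numpy as np

rng = np.random.default_rng(4)

def bs_call(t, S, K, T, r, sigma):
    """Black-Scholes price and delta of a European call at time t < T."""
    tau = T - t
    d1 = (math.log(S / K) + (r + 0.5 * sigma**2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return S * Phi(d1) - K * math.exp(-r * tau) * Phi(d2), Phi(d1)

def hedging_error(S0, K, T, r, sigma, mu, N):
    """One simulated path of the discrete-time delta hedging error."""
    dt = T / N
    S = S0
    price, delta = bs_call(0.0, S, K, T, r, sigma)
    cash = price - delta * S                  # self-financing start
    for i in range(1, N):
        S *= math.exp((mu - 0.5 * sigma**2) * dt + sigma * math.sqrt(dt) * rng.normal())
        _, new_delta = bs_call(i * dt, S, K, T, r, sigma)
        cash = cash * math.exp(r * dt) + (delta - new_delta) * S
        delta = new_delta
    S *= math.exp((mu - 0.5 * sigma**2) * dt + sigma * math.sqrt(dt) * rng.normal())
    return cash * math.exp(r * dt) + delta * S - max(S - K, 0.0)
```

Averaging over many paths with `mu = r` should produce a mean error near zero, while individual path errors remain nonzero due to the discretization.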
Our interest is to assess the impact on the expected hedging error if the price increments deviate from i.i.d. lognormal to a 1-dependent process. We choose the maturity $T$ and the trading interval $\Delta t$ so that the number of periods in consideration is 100. The initial stock price and the strike price are both set to be 100, i.e. the call option is at-the-money. We set the baseline geometric Brownian motion to have a fixed growth rate and volatility, together with a fixed risk-free rate.
The first-order coefficient in Theorem 1 is estimated by Algorithm 1, together with a 95% CI. Figure 8 depicts the performance of the worst-case bounds in Theorem 1 against an AR(1) model. The delta hedging error for the baseline model, i.e. geometric Brownian motion, is estimated using 1 million sample paths, as depicted by the horizontal line. The curves are the upper and lower bounds using only the first-order term in Theorem 1. As a comparison to our bounds, we compute the expected hedging error for price processes whose logarithmic increments follow the AR(1) model described earlier, with the lag parameter $\alpha$ varied around zero while keeping the stationary marginal distribution of the increments fixed at the baseline. The dots show the point estimates of the expected hedging error at different $\alpha$, each using 1 million sample paths. The fact that the dots are quite close to the worst-case bounds (although still bearing a gap) demonstrates that the AR(1) model is a conservative model choice here.
So far we have focused on bivariate and one-lag serial dependencies. In practice, the type of dependency may exceed these scopes. This section provides some discussion on how to generalize our results to higher-lag dependency, focusing especially on 2-dependent processes. Further generalizations are left to future work.
Consider a baseline $P_0$ that generates i.i.d. $X_t$'s, and suppose we are interested in assessing the impact of 2-dependency. Our formulation for the assessment is:
(12) \[ \max/\min\; E_f[h(X_1,\dots,X_T)] \quad \text{s.t.}\quad \chi^2\big(P_{f,(t,t+1)},\,\bar P_{f,(t,t+1)}\big)\le\eta_1,\quad \chi^2\big(P_{f,(t,t+1,t+2)},\,\tilde P_{f,(t,t+1,t+2)}\big)\le\eta_2, \]
together with the constraints that $P_f$ generates a stationary 2-dependent process with the same marginal distributions as $P_0$ and that $P_f\in\mathcal{A}$. The second $\chi^2$-quantity, with $\tilde P_f$ defined below, is a suitable extension of the definition of the $\phi^2$-coefficient to trivariate probability distributions. The two $\chi^2$-constraints together describe the level of dependency among all the 2-dependent stationary processes that have the same marginal distribution as $P_0$. The first constraint restricts the amount of lag-one dependence, much like formulation (5), whereas the second constraint restricts the amount of any additional lag-two dependence.
More precisely, note that for any stationary 2-dependent process, the joint distribution of any three consecutive states, i.e. that of $(X_t,X_{t+1},X_{t+2})$, completely determines the process. Next, given any $P_f$ that generates a stationary 2-dependent sequence of $X_t$'s, one can find a measure $\tilde P_f$ that generates a 1-dependent sequence such that the joint distribution of any pair of consecutive states is unchanged, i.e. $\tilde P_{f,(t,t+1)} = P_{f,(t,t+1)}$. Now, in the first constraint of (12), we have the Pearson's $\phi^2$-coefficient of the pair $(X_t,X_{t+1})$ exactly as before:
\[ \chi^2\big(P_{f,(t,t+1)},\,\bar P_{f,(t,t+1)}\big). \]
In the second constraint, we shall define the trivariate quantity as
\[ \chi^2\big(P_{f,(t,t+1,t+2)},\,\tilde P_{f,(t,t+1,t+2)}\big), \]
the $\chi^2$-distance between the distribution of three consecutive states and its 1-dependent counterpart.
This definition is illustrated with the following example:
Example 3
Suppose $(X_t,X_{t+1},X_{t+2})$ is multivariate normal with mean 0 and covariance matrix $\Sigma$ under $P_f$, where
\[ \Sigma = \begin{pmatrix} 1 & \rho_1 & \rho_2 \\ \rho_1 & 1 & \rho_1 \\ \rho_2 & \rho_1 & 1 \end{pmatrix}. \]
Then under $\tilde P_f$, $(X_t,X_{t+1},X_{t+2})$ will still remain a zero-mean multivariate normal distribution, but with the covariance matrix replaced by
\[ \tilde\Sigma = \begin{pmatrix} 1 & \rho_1 & \rho_1^2 \\ \rho_1 & 1 & \rho_1 \\ \rho_1^2 & \rho_1 & 1 \end{pmatrix}, \]
which is essentially the covariance matrix of an AR(1) model with lag parameter $\rho_1$. Details of the derivation are left to the Supplementary Materials.
With the above definitions and formulation (12), we have the following approximation: