Example 1 (Robust bounds under expert opinion)

[10pt]

Robust Analysis in Stochastic Simulation: Computation and Performance Guarantees

Business Analytics and Math Sciences, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, ghoshs@us.ibm.com

Henry Lam

Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI 48109, khlam@umich.edu

Any performance analysis based on stochastic simulation is subject to the errors inherent in misspecifying the modeling assumptions, particularly the input distributions. In situations with little support from data, we investigate the use of worst-case analysis to analyze these errors, by representing the partial, nonparametric knowledge of the input models via optimization constraints. We study the performance and robustness guarantees of this approach. We design and analyze a numerical scheme for a general class of simulation objectives and uncertainty specifications, based on a Frank-Wolfe (FW) variant of the stochastic approximation (SA) method (which we call FWSA) run on the space of input probability distributions. The key steps involve a randomized discretization of the probability spaces and the construction of simulable unbiased gradient estimators using a nonparametric analog of the classical likelihood ratio method. A convergence analysis for FWSA on non-convex problems is provided. We test the performance of our approach via several numerical examples.

Simulation-based performance analysis of stochastic models, or what is usually known as stochastic simulation, is built on input model assumptions that to some extent deviate from the truth. Consequently, the performance analysis subject to these input errors may lead to poor prediction and suboptimal decision-making. To address this important problem, the common study framework in the stochastic simulation literature focuses on output variability measures or confidence bounds that account for the input uncertainty when input data are available. Some established statistical techniques such as the bootstrap (e.g. Barton and Schruben (1993), Barton et al. (2013)), goodness-of-fit tests (e.g. Banks et al. (2009)), Bayesian inference and model selection (e.g. Chick (2001), Zouaoui and Wilson (2004)) and the delta method (e.g., Cheng and Holland (1998, 2004)) etc. have been proposed and have proven effective in many situations.

In this paper, we focus on a setting that diverges from most past literature: we are primarily interested in situations with insufficient data, or when the modeler wants to assess risk beyond what the data or the model indicates. Such situations can arise when the system, service target or operational policy in study is at a testing stage without much prior experience. To find reliable estimates for the output quantities in these settings, we investigate a worst-case approach operating on the space of input models. In this framework, the modeler represents the partial and nonparametric beliefs about the input model as constraints, and computes tight worst-case bounds among all models that satisfy them. More precisely, let be a performance measure that depends on input models, each generated from a probability distribution . The formulation for computing the worst-case bounds are

 minPi∈Ui,i=1,…,mZ(P1,…,Pm)\ \ \ \ % and\ \ \ \ maxPi∈Ui,i=1,…,mZ(P1,…,Pm) (1)

The set encodes the collection of all possible from the knowledge of the modeler. The decision variables in the optimizations in (id1) are the unknown models .

The primary motivation for using (id1) is the robustness against model misspecification, where a proper construction of the set avoids making specific assumptions beyond the modeler’s knowledge. The following three examples motivate and explain further.

###### Example 1 (Robust bounds under expert opinion)

When little information is available for an input model, a common practice in stochastic simulation is to summarize its range (say ) and mean (say ; or sometimes mode) as a triangular distribution (Banks et al. (2009), Chapter 5), where the base of the triangle denotes the range and the position of the peak is calibrated from the mean. This specific distribution, while easy to use, only crudely describes the knowledge of the modeler and may deviate from the true distribution, even if are correctly specified. Instead, using

 Ui={Pi:EPi[Xi]=μ, supp\ Pi=[a,b]} (2)

in formulation (id1), where is the random variate, is the expectation under , and is the support of , will give a valid interval that covers the true performance measure whenever are correctly specified. Moreover, when these parameters are not fully known but instead specified within a range, (1) can be relaxed to

 Ui={Pi:μ––≤EPi[Xi]≤¯¯¯μ, supp\ Xi=[a––,¯¯b]}

where denotes the range of the mean and denote the lower estimate of the lower support end and upper estimate of the upper support end respectively. The resulting bound will cover the truth as long as these ranges are supplied correctly.

###### Example 2 (Dependency modeling)

In constructing dependent input models, common approaches in the simulation literature fit the marginal description and the correlation of a multivariate model to a specified family. Examples include Gaussian copula (e.g., Lurie and Goldberg (1998), Channouf and L’Ecuyer (2009); also known as normal-to-anything (NORTA), e.g. Cario and Nelson (1997)) and chessboard distribution (Ghosh and Henderson (2002)) that uses a domain discretization. These distributions are correctly constructed up to their marginal description and correlation, provided that these information are correctly specified. However, dependency structure beyond correlation can imply errors on these approaches (e.g., Lam (2016c)). Formulation (id1) can be used to get bounds that address such dependency. For example, suppose is a bivariate input model with marginal distributions , marginal means and covariance . We can set

 Ui={Pi:PPi,1(Xi,1≤qi,1j)=ν1j,j=1,…,l1, PPi,2(Xi,2≤qi,2j)=ν2j,j=1,…,l2, E[Xi,1Xi,2]=ρi+μi,1μi,2}

where denote the random vector under , and are pre-specified quantiles and probabilities of the respective marginal distributions. Unlike previous approaches, (id1) outputs correct bounds on the truth given correctly specified marginal quantiles and correlation, regardless of the dependency structure.

###### Example 3 (Model risk)

Model risk refers broadly to the uncertainty in analysis arising from the adopted model not being fully accurate. This inaccuracy occurs as the adopted model (often known as the baseline model), typically obtained from the best statistical fit or expert opinion, deviates from the truth due to the real-world non-stationarity and the lack of full modeling knowledge or capability. To assess model risk, a recently surging literature studies the use of statistical distance as a measurement of model discrepancy (e.g., Glasserman and Xu (2014), Lam (2016b)). Given the baseline model , the idea is to represent the uncertainty in terms of the distance away from the baseline via a neighborhood ball

 Ui={Pi:d(Pi,Pib)≤ηi} (3)

where is a distance defined on the nonparametric space of distributions (i.e., without restricting to any parametric families). The bounds drawn from formulation (id1) assesses the effect of model risk due to the input models, tuned by the ball size parameter that denotes the uncertainty level. Besides risk assessment, this approach can also be used to obtain consistent confidence bounds for the true performance measure, when is taken as the empirical distribution and and are chosen suitably. For example, when is a -divergence and the model is finite discrete, one can choose as a scaled -quantile. When is the sup-norm and the model is continuous, then one can choose as the quantile of a Kolmogorov-Smirnov statistics. The statistical properties of these choices are endowed from the associated goodness-of-fit tests (Ben-Tal et al. (2013), Bertsimas et al. (2014)).

Our worst-case approach is inspired from the literature of robust optimization (Ben-Tal et al. (2009), Bertsimas et al. (2011)), which considers decision-making under uncertainty and advocates optimizing decisions over worst-case scenarios. In particular, when the uncertainty lies in the probability distributions that govern a stochastic problem, the decision is made to optimize under the worst-case distributions. This approach has been used widely in robust stochastic control (e.g. Hansen and Sargent (2008), Petersen et al. (2000)) and distributionally robust optimization (e.g. Delage and Ye (2010), Lim et al. (2006)). It has also appeared in so-called robust simulation or robust Monte Carlo in the simulation literature (Hu et al. (2012), Glasserman and Xu (2014)). However, the methodologies presented in these literature focus on structured problems where the objective function is tractable, such as linear or linearly decomposable. In contrast, for most problems in stochastic simulation is nonlinear, unstructured and is only amenable to simulation, obstructing the direct adaption of the existing methods. In view of this, our main objective is to design an efficient simulation-based method to compute the worst-case bounds for formulation (id1) that can be applied to broad classes of simulation models and input uncertainty representations.

We study a simulation-based iterative procedure for the worst-case optimizations (id1), based on a modified version of the celebrated stochastic approximation (SA) method (e.g. Kushner and Yin (2003)). Because of the iterative nature, it is difficult to directly operate on the space of continuous distributions except in very special cases. Thus, our first contribution is to provide a randomized discretization scheme that can provably approximate the continuous counterpart. This allows one to focus on the space of discrete distributions on fixed support points as the decision variable for feeding into our SA algorithm.

We develop the SA method in several aspects. First, we construct an unbiased gradient estimator for based on the idea of the Gateaux derivative for functionals of probability distributions (Serfling (2009)), which is used to obtain the direction in each subsequent iterate in the SA scheme. The need for such construction is motivated by the difficulty in naïve implementation of standard gradient estimators: An arbitrary perturbation of a probability distribution, which is the decision variable in the optimization, may shoot outside the probability simplex and results in a gradient that does not bear any probabilistic meaning and subsequently does not support simulation-based estimation. Our approach effectively restricts the direction of perturbation to within a probability simplex so that the gradient is guaranteed to be simulable. We justify it as a nonparametric version of the classical likelihood ratio method (also called the score function method) (Glynn (1990), Reiman and Weiss (1989), Rubinstein (1986)).

Next, we design and analyze our SA scheme under the uncertainty constraints. We choose to use a stochastic counterpart of the so-called Frank-Wolfe (FW) method (Frank and Wolfe (1956)), known synonymously as the conditional gradient method in deterministic nonlinear programming. For convenience we call our scheme FWSA. Note that a standard SA iteration follows the estimated gradient up to a pre-specified step size to find the next candidate iterate. When the formulation includes constraints, the common approach in the SA literature projects the candidate solution onto the feasible region in order to define the next iterate (e.g. Kushner and Yin (2003)). Instead, our method looks in advance for a feasible direction along which the next iterate is guaranteed to lie in the (convex) feasible region. In order to find this feasible direction, an optimization subproblem with a linear objective function is solved in each iteration. We base our choice of using FWSA on its computational benefit in solving these subproblems, as their linear objectives allow efficient solution scheme for high-dimensional decision variables for many choices of the set .

We characterize the convergence rate of FWSA in terms of the step size and the number of simulation replications used to estimate the gradient at each iteration. The form of our convergence bounds suggests prescriptions for the step-size and sample-size sequences that are efficient with respect to the cumulative number of sample paths simulated to generate all the gradients until the current iterate. The literature on the stochastic FW methods for non-convex problems is small. Kushner (1974) proves almost sure convergence under assumptions that can prescribe algorithmic specifications only for one-dimensional settings. During the review process of this paper, two other convergence rate studies Reddi et al. (2016) and Lafond et al. (2016) have appeared. Both of them assume the so-called -Lipschitz condition on the gradient estimator that does not apply to our setting. Consequently, our obtained convergence rates are generally inferior to their results. Nonetheless, we shall argue that our rates almost match theirs under stronger assumptions on the behavior of the iterates that we will discuss.

Finally, we provide numerical validation of our approach using two sets of experiments, one testing the performance of our proposed randomized discretization strategy, and one on the convergence of FWSA. We also illustrate the impacts from the choices of constraints and investigate the form of the resulting worst-case input distributions in these examples.

We briefly survey three lines of related work. First, our paper is related to the large literature on input model uncertainty. In the parametric regime, studies have focused on the construction of confidence intervals or variance decompositions to account for both parameter and stochastic uncertainty using data, via for instance the delta method (Cheng and Holland (1998, 2004)), the bootstrap (Barton et al. (2013), Cheng and Holloand (1997)), Bayesian approaches (Zouaoui and Wilson (2003), Xie et al. (2014), Saltelli et al. (2010, 2008)), and metamodel-assisted analysis (Xie et al. (2014, 2015)). Model selection beyond a single parametric model can be handled through goodness-of-fit or Bayesian model selection and averaging (Chick (2001), Zouaoui and Wilson (2004)). Fully nonparametric approaches using the bootstrap have also been investigated (Barton and Schruben (1993, 2001), Song and Nelson (2015)).

Second, formulation (id1) relates to the literature on robust stochastic control (Petersen et al. (2000), Iyengar (2005), Nilim and El Ghaoui (2005)) and distributionally robust optimization (Delage and Ye (2010), Goh and Sim (2010), Ben-Tal et al. (2013)), where the focus is to make decision rules under stochastic environments that are robust against the ambiguity of the underlying probability distributions. This is usually cast in the form of a minimax problem where the inner maximization is over the space of distributions. This idea has spanned across multiple areas like economics (Hansen and Sargent (2001, 2008)), finance (Glasserman and Xu (2013), Lim et al. (2011)), queueing (Jain et al. (2010)) and dynamic pricing (Lim and Shanthikumar (2007)). In the simulation context, Hu et al. (2012) compared different global warming policies using Gaussian models with uncertain mean and covariance information and coined the term “robust simulation” in their study. Glasserman and Xu (2014) and Lam (2016b) studied approximation methods for distance-based constraints in model risk quantification, via sample average approximation and infinitesimal approximation respectively. Simulation optimization under input uncertainty has also been studied in the framework of robust optimization (Fan et al. (2013), Ryzhov et al. (2012)) and via the closely related approach using risk measure (Qian et al. (2015), Zhou and Xie (2015)). Lastly, optimizations over probability distributions have also arisen as so-called generalized moment problems, applied to decision analysis (Smith (1995, 1993), Bertsimas and Popescu (2005)) and stochastic programming (Birge and Wets (1987)).

Our algorithm relates to the literature on the FW method (Frank and Wolfe (1956)) and constrained SA. The former is a nonlinear programming technique initially proposed for convex optimization, based on sequential linearization of the objective function using the gradient at the solution iterate. SA is the stochastic counterpart of gradient-based method under noisy observations. The classical work of Canon and Cullum (1968), Dunn (1979) and Dunn (1980) analyzed convergence properties of FW for deterministic convex programs. Recently, Jaggi (2013), Freund and Grigas (2014) and Hazan and Luo (2016) carried out new finite-time analysis for the FW method motivated by machine learning applications. For stochastic FW on non-convex problems, Kushner (1974) focused on almost sure convergence based on a set of assumptions about the probabilistic behavior of the iterations, which were then used to tune the algorithm for one-dimensional problems. Recently, while this paper was under review, Reddi et al. (2016) provided a complexity analysis in terms of the sample size in estimating gradients and the number of calls of the linear optimization routine. Lafond et al. (2016) studied the performance in terms of regret in an online setting. Both Reddi et al. (2016) and Lafond et al. (2016) relied on the -Lipschitz condition that our gradient estimator violated. Other types of constrained SA schemes include the Lagrangian method (Buche and Kushner (2002)) and mirror descent SA (Nemirovski et al. (2009)). Lastly, general convergence results for SA can be found in Fu (1994), Kushner and Yin (2003) and Pasupathy and Kim (2011).

The remainder of this paper is organized as follows. Section id1 describes our formulation and assumptions. Section id1 presents our performance guarantees and discretization strategy. Section id1 focuses on gradient estimation on the space of probability distributions. Section id1 introduces the FWSA procedure. Section id1 presents theoretical convergence results on FWSA. Section id1 shows some numerical performance. Section id1 concludes and suggests future research. Finally, Appendix id1 gives the remaining proofs of our theorems.

We focus on that is a finite horizon performance measure generated from i.i.d. replications from the independent input models . Let be i.i.d. random variables on the space , each generated under . The performance measure can be written as

 Z(P1,…,Pm)=EP1,…,Pm[h(X1,…,Xm)]=∫⋯∫h(x1,…,xm)T1∏t=1dP(x1t)⋯Tm∏t=1dP(xmt) (4)

where is a cost function, and denotes the expectation associated with the generation of the i.i.d. replications. We assume that can be evaluated by the computer given the inputs. In other words, the performance measure (4) can be approximated by running simulation.

(4) is the stylized representation for transient performance measures in discrete-event simulation. For example, and can be the sequences of interarrival and service times in a queue, and and are the interarrival time and service time distributions. When is the indicator function of the waiting time exceeding a threshold, (4) will denote the expected waiting time.

Next we discuss the constraints in (id1). Following the terminology in robust optimization, we call the uncertainty set for the -th input model. Motivated by the examples in the Introduction, we focus on two types of convex uncertainty sets:

1. Moment and support constraints: We consider

 Ui={Pi:EPi[fil(Xi)]≤μil,l=1,…,si, supp\ Pi=Ai} (5)

where is a generic random variable under distribution , , and . For instance, when , being or denotes the first two moments. When , denotes the cross-moment. Equalities can also be represented via (5) by including . Thus the uncertainty set (5) covers Examples 1 and 2 in the Introduction.

Furthermore, the neighborhood measured by certain types of statistical distance (Example 3) can also be cast as (5). For instance, suppose is induced by the sup-norm on the distribution function on . Suppose is a continuous distribution and the baseline distribution is discrete with support points . The constraint

 supx∈R|Fi(x)−Fib(x)|≤ηi (6)

where and denote the distribution functions for and respectively, can be reformulated as

 Fib(yj+)−ηi≤Fi(yj)≤Fib(yj−)+ηi, l=1,…,ni

where and denote the left and right limits of at , by using the monotonicity of distribution functions. Thus

 Ui={Pi:Fib(yj+)−ηi≤Ei[I(Xi≤yj)]≤Fib(yj−)+ηi, j=1,…,ni, supp\ Pi=R}

where denotes the indicator function, falls into the form of (5). Bertsimas et al. (2014) considers this reformulation for constructing uncertainty sets for stochastic optimization problems, and suggests to select as the quantile of Kolmogorov-Smirnov statistics if is the empirical distribution function constructed from i.i.d. data.

2. Neighborhood of a baseline model measured by -divergence: Consider

 Ui={Pi:dϕ(Pi,Pib)≤ηi} (7)

where denotes the -divergence from a baseline distribution given by

 dϕ(Pi,Pib)=∫ϕ(dPidPib)dPib

which is finite only when is absolutely continuous with respect to . The function is a convex function satisfying . This family covers many widely used distances. Common examples are giving the KL divergence, giving the (modified) -distance, and giving the Cressie-Read divergence. Details of -divergence can be found in, e.g., Pardo (2005) and Ben-Tal et al. (2013).

As precursed in the Introduction, in the context of simulation analysis where are the input models, in (4) is in general a complex nonlinear function. This raises challenges in solving (id1) beyond the literature of robust control and optimization that considers typically more tractable objectives. Indeed, if is a linear function in ’s, then optimizing over the two types of uncertainty sets above can both be cast as specialized classes of convex programs that can be efficiently solved. But linear is too restrictive to describe the input-output relation in simulation. To handle a broader class of and to address its simulation-based nature, we propose to use a stochastic iterative method. The next sections will discuss our methodology in relation to the performance guarantees provided by (id1).

Suppose there is a “ground true” distribution for each input model. Let and be the minimum and maximum values of the worst-case optimizations (id1). Let be the true performance measure, i.e. . The following highlights an immediate implication of using (id1):

###### Theorem 1

If for all , then .

In other words, the bounds from the worst-case optimizations form an interval that covers the true performance measure if the uncertainty sets contain the true distributions.

We discuss a discretization strategy for the worst-case optimizations for continuous input distributions. We will show that, by replacing the continuous distribution with a discrete distribution on support points that are initially sampled from some suitably chosen distribution, we can recover the guarantee in Theorem 1 up to a small error. The motivation for using discretization comes from the challenges in handling decision variables in the form of continuous distributions when running our iterative optimization scheme proposed later.

We focus on the two uncertainty sets (5) and (7). The following states our guarantee:

###### Theorem 2

Consider in (4). Assume is bounded a.s.. Suppose and are positive numbers such that for all for some . Consider the optimizations

 ^Z∗=minPi∈^Ui,i=1,…,mZ(P1,…,Pm)\ \ \ \ and\ \ \ \ ^Z∗=maxPi∈^Ui,i=1,…,mZ(P1,…,Pm) (8)

Suppose that for each , one of the two cases occurs:

1.  ^Ui={Pi:EPi[fil(Xi)]≤μil,l=1,…,si, supp\ Pi={yi1,…,yini}} (9)

where are observations drawn from a distribution such that the true distribution is absolutely continuous with respect to with satisfying , where denotes the essential supremum norm. Moreover, assume that satisfies and for all .

2.  ^Ui={Pi:dϕ(Pi,^Pib)≤ηi} (10)

where denotes the empirical distribution drawn from observations from a baseline . Assume that satisfies . Moreover, assume satisfies and . Additionally, assume for all .

Then

 ^Z∗≤Z0+Op(1√n)≤^Z∗ (11)

Theorem 2 is proved in Appendix id1. We have a few immediate remarks:

1. Optimizations (8) are the sample counterparts of the original worst-case optimizations (id1) with uncertainty sets given by (5) or (7). These sample counterparts optimize discrete distributions over support points that are sampled from either or depending on the type of uncertainty. Roughly speaking, Theorem 2 guarantees that, if the original worst-case optimizations give valid covering bounds for the true performance measure (in the spirit of Theorem 1), then so are the sample counterparts, up to an error where denotes the order of the sample size used to construct the sets of support points.

2. The condition implies that or has a tail at least as heavy as . In practice, the tail of the true distribution is not exactly known a priori. This means that it is safer to sample the support points from a heavy-tailed distribution when moment constraints are imposed, and to use a heavy-tailed baseline distribution in the case of -divergence neighborhood.

3. The conditions and state that and are in the interior of and respectively. These conditions guarantee that projected on a sample approximation of the support is feasible for (8), which helps lead to the guarantee (11). In general, the closer is to the boundary of the uncertainty set, i.e., the smaller the values of and , the larger the sample size is needed for the asymptotic behavior in (11) to kick in, a fact that is not revealed explicitly in Theorem 2. One way to control this required sample size is to expand the uncertainty set by a small margin, say , i.e., use and , in (9) and (10). Note that, in the case of moment equality constraint, say , one does have to deliberately relax the constraint to for the interior-point conditions to hold.

A probabilistic analog of Theorem 1 is:

###### Theorem 3

Suppose contains the true distribution for all with confidence , i.e. , then , where denotes the probability generated from a combination of data and prior belief.

Theorem 3 follows immediately from Theorem 1. In the frequentist framework, refers to the probability generated from data. However, Theorem 3 can also be cast in a Bayesian framework, in which can represent the prior (e.g., from expert opinion) or the posterior belief.

Theorem 3 reconciles with the established framework in distributionally robust optimization that the uncertainty set should be chosen as a confidence set for the true distribution, in order to provide a guarantee for the coverage probability on the true objective, in the case that represents the generation of data under a true model. For a moment constraint in the form , one can choose as the confidence bound of the moment. Section 3 in Delage and Ye (2010) discusses the construction of confidence regions based on more general semidefinite moment constraints. For the distance-based constraint in (6), under the assumption that is continuous, chosen as the -quantile of where is a standard Brownian bridge, gives an approximate confidence region by utilizing the limiting distribution of the Kolmogorov-Smirnov statistic. Bertsimas et al. (2014) studies more general use of goodness-of-fit statistics along this line to create confidence regions. For the -divergence-based constraint in (7), under the assumption that has finite support of size , Ben-Tal et al. (2013) proposes using in the case is taken as the empirical distribution, where is the -quantile of a -distribution with degree of freedom . This leads to an asymptotic confidence region by using divergence-based inference (Pardo (2005)). Recent works such as Lam and Zhou (2015), Lam (2016a) and Duchi et al. (2016) investigate the tightening of divergence-based regions using the empirical likelihood theory (Owen (2001)).

Similarly, the probabilistic analog of Theorem 2 is:

###### Theorem 4

Suppose all assumptions in Theorem 2 are in place except that or now holds true jointly for all with confidence under . Then .

Theorems 2 and 4 allow the translation from (id1), whose input models can be continuously represented, to (8) that is imposed over discrete distributions, by paying a small price of error. In the next section we discuss our algorithm over discrete distributions and point out clearly why the discretization is necessary. Obviously, when the input model is finite discrete, the sampling step depicted in Theorem 2 is unnecessary, and our subsequent results regarding the algorithm applies readily to this case.

Since we work in the discrete space, for simplicity we denote as the vector of probability weights for the discretized input model . This probability vector is understood to apply on the support points . Moreover, let where vec denotes a concatenation of the vectors ’s as a single vector, where . We denote as the -dimensional probability simplex. Hence . For convenience, let , so that . The performance measure in (8) can be written as . Furthermore, denote as the maximum length of replications among all input models. We also write and for simplicity. Recall that denotes the indicator function for the event . In the rest of this paper, denotes transpose, and denotes the Euclidean norm of a vector . We also write as the variance under the input distribution . Inequalities for vectors are defined component-wise.

We shall present an iterative simulation-based scheme for optimizing (8). The first step is to design a method to extract the gradient information of . Note that the standard gradient of , which we denote as , obtained through differentiation of , may not lead to any simulable object. This is because an arbitrary perturbation of may shoot out from the set of probability simplices, and the resulting gradient will be a high-degree polynomial in that may have no probabilistic interpretation and thus is not amenable to simulation-based estimation.

We address this issue by considering the set of perturbations within the simplices. Our approach resembles the Gateaux derivative on a functional of probability distribution (Serfling (2009)) as follows. Given any , define a mixture distribution , where represents a point mass on , i.e. and 1 is at the -th coordinate. The number is the mixture parameter. When , this reduces to the given distribution . We treat as a parameter and differentiate with respect to for each .

More precisely, let

 ψij(p)=ddϵZ(p1,…,pi−1,(1−ϵ)pi+ϵ1ij,pi+1,…,pm)∣∣ϵ=0

Denote , and . We show that possesses the following two properties:

###### Theorem 5

Given such that , we have:

1.  ∇Z(p)′(q−p)=m∑i=1∇iZ(p)′(qi−pi)=m∑i=1ψi(p)′(qi−pi)=ψ(p)′(q−p) (12)

for any and , where is the gradient of taken with respect to .

2.  ψij(p)=Ep[h(X)sij(Xi)] (13)

where is defined as

 sij(xi)=Ti∑t=1I(xit=yij)pij−Ti (14)

for .

The first property above states that and are identical when viewed as directional derivatives, as long as the direction lies within . Since the feasible region of optimizations (8) lies in , it suffices to focus on . The second property above states that can be estimated unbiasedly in a way similar to the classical likelihood ratio method (Glynn (1990), Reiman and Weiss (1989)), with playing the role of the score function. Since this representation holds without assuming any specific parametric form for , we view it as a nonparametric version of the likelihood ratio method.

\@trivlist

To prove 1., consider first a mixture of with an arbitrary , in the form . It satisfies

 ddϵZ(p1,…,pi−1,(1−ϵ)pi+ϵqi,pi+1,…,pm)∣∣ϵ=0=∇iZ(p)′(qi−pi)

by the chain rule. In particular, we must have

 ψij(p)=∇iZ(p)′(1ij−pi)=∂ijZ(p)−∇iZ(p)′pi (15)

where denotes partial derivative of with respect to . Writing (15) for all together gives

 ψi(p)=∇iZ(p)−(∇iZ(p)′pi)1i

where is a vector of 1. Therefore

 ψi(p)′(qi−pi)=(∇iZ(p)−(∇iZ(p)′pi)1i)′(qi−pi)=∇iZ(p)′(qi−pi)

since . Summing up over , (12) follows.

To prove 2., note that we have

 ψij(p) =ddϵZ(p1,…,pi−1,(1−ϵ)pi+ϵ1ij,pi+1,…,pm)∣∣ϵ=0 =ddϵEp1,…,pi−1,(1−ϵ)pi+ϵ1ij,pi+1,…,pm[h(X)]∣∣∣ϵ=0 =Ep[h(X)sij(Xi)] (16)

where is the score function defined as

 sij(xi)=Ti∑t=1ddϵlog((1−ϵ)pi(xit)+ϵI(xit=yij))∣∣∣ϵ=0. (17)

Here where is chosen such that . The last equality in (16) follows from the fact that

 ddϵTi∏t=1((1−ϵ)pi(xit)+ϵI(xit=yij))∣∣∣ϵ=0=ddϵTi∑t=1log((1−ϵ)pi(xit)+ϵI(xit=yij))∣∣∣ϵ=0⋅Ti∏t=1pi(xit)

Note that (17) can be further written as

 Ti∑t=1−pi(xit)+I(xit=yij)pi(xit)=−Ti+Ti∑t=1I(xit=yij)pi(xit)=−Ti+Ti∑t=1I(xit=yij)pij

The following provides a bound on the variance of the estimator for (See Appendix id1 for proof):

###### Lemma 1

Assume is bounded a.s., i.e. for some , and that . Each sample for estimating , by using one sample path of , possesses a variance bounded from above by .

The function derived via the above Gateaux derivative framework can be interpreted as a discrete version of the so-called influence function in robust statistics (Hampel (1974), Hampel et al. (2011)), which is commonly used to approximate the first order effect on a given statistics due to contamination of data. In general, the gradient represented by the influence function is defined as an operator on the domain of the random object distributed under . Thus, in the continuous case, this object has an infinite-dimensional domain, and except in very special cases (e.g., tail probabilities of i.i.d. sum; Lam and Mottet (2015)), computing and encoding it is not implementable. This is the main reason why we seek for a discretization in the first place.

With the implementable form of the gradient described in Section id1, we design a stochastic nonlinear programming technique to solve (8). We choose to use the Frank-Wolfe method because, for the types of we consider in Section id1, effective routines exist for solving the induced linearized subproblems.

FWSA works as follows: For convenience denote . To avoid repetition we focus only on the minimization formulation in (id1). First, pretending that can be computed exactly, it iteratively updates a solution sequence as follows. Given a current solution , solve

 (18)

Let the optimal solution to (18) be . The quantity gives a feasible minimization direction starting from (note that is convex). This is then used to update to via for some step size . This expression can be rewritten as , which can be interpreted as a mixture between the distributions and .

When is not exactly known, one can replace it by an empirical counterpart. Theorem 5 suggests that we can replace by , and so the empirical counterpart of (18) is

 minp∈^U^ψ(pk)′(p−pk) (19)

where is an estimator of using a sample size . Note that all components of can be obtained from these sample paths simultaneously. Letting be the optimal solution to (19), the update rule will be for some step size . The sample size at each step needs to grow suitably to compensate for the bias introduced in solving (19). All these are summarized in Procedure 1.

By (12) and the separability of uncertainty set , the subproblem at each iteration can be written as

 (20)

where is the empirical counterpart of obtained in Algorithm 1. Hence (20) can be solved by separate convex programs. The update step follows by taking , where and is the solution to the -th separate program.

The separate programs in (20) can be efficiently solved for the uncertainty sets considered in Section id1. To facilitate discussion, we denote a generic form of each separate program in (20) as

 minpi∈^Uiξ′pi (21)

for an arbitrary vector .

Case 1 in Theorem 2: Moment and support constraints. Consider where . Then (21) is a linear program.

Case 2 in Theorem 2: -divergence neighborhood. Consider