[10pt]
Robust Analysis in Stochastic Simulation: Computation and Performance Guarantees
Soumyadip Ghosh
Business Analytics and Math Sciences, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, ghoshs@us.ibm.com
Henry Lam
Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI 48109, khlam@umich.edu
Any performance analysis based on stochastic simulation is subject to the errors inherent in misspecifying the modeling assumptions, particularly the input distributions. In situations with little support from data, we investigate the use of worstcase analysis to analyze these errors, by representing the partial, nonparametric knowledge of the input models via optimization constraints. We study the performance and robustness guarantees of this approach. We design and analyze a numerical scheme for a general class of simulation objectives and uncertainty specifications, based on a FrankWolfe (FW) variant of the stochastic approximation (SA) method (which we call FWSA) run on the space of input probability distributions. The key steps involve a randomized discretization of the probability spaces and the construction of simulable unbiased gradient estimators using a nonparametric analog of the classical likelihood ratio method. A convergence analysis for FWSA on nonconvex problems is provided. We test the performance of our approach via several numerical examples.
Simulationbased performance analysis of stochastic models, or what is usually known as stochastic simulation, is built on input model assumptions that to some extent deviate from the truth. Consequently, the performance analysis subject to these input errors may lead to poor prediction and suboptimal decisionmaking. To address this important problem, the common study framework in the stochastic simulation literature focuses on output variability measures or confidence bounds that account for the input uncertainty when input data are available. Some established statistical techniques such as the bootstrap (e.g. Barton and Schruben (1993), Barton et al. (2013)), goodnessoffit tests (e.g. Banks et al. (2009)), Bayesian inference and model selection (e.g. Chick (2001), Zouaoui and Wilson (2004)) and the delta method (e.g., Cheng and Holland (1998, 2004)) etc. have been proposed and have proven effective in many situations.
In this paper, we focus on a setting that diverges from most past literature: we are primarily interested in situations with insufficient data, or when the modeler wants to assess risk beyond what the data or the model indicates. Such situations can arise when the system, service target or operational policy in study is at a testing stage without much prior experience. To find reliable estimates for the output quantities in these settings, we investigate a worstcase approach operating on the space of input models. In this framework, the modeler represents the partial and nonparametric beliefs about the input model as constraints, and computes tight worstcase bounds among all models that satisfy them. More precisely, let be a performance measure that depends on input models, each generated from a probability distribution . The formulation for computing the worstcase bounds are
(1) 
The set encodes the collection of all possible from the knowledge of the modeler. The decision variables in the optimizations in (id1) are the unknown models .
The primary motivation for using (id1) is the robustness against model misspecification, where a proper construction of the set avoids making specific assumptions beyond the modeler’s knowledge. The following three examples motivate and explain further.
Example 1 (Robust bounds under expert opinion)
When little information is available for an input model, a common practice in stochastic simulation is to summarize its range (say ) and mean (say ; or sometimes mode) as a triangular distribution (Banks et al. (2009), Chapter 5), where the base of the triangle denotes the range and the position of the peak is calibrated from the mean. This specific distribution, while easy to use, only crudely describes the knowledge of the modeler and may deviate from the true distribution, even if are correctly specified. Instead, using
(2) 
in formulation (id1), where is the random variate, is the expectation under , and is the support of , will give a valid interval that covers the true performance measure whenever are correctly specified. Moreover, when these parameters are not fully known but instead specified within a range, (1) can be relaxed to
where denotes the range of the mean and denote the lower estimate of the lower support end and upper estimate of the upper support end respectively. The resulting bound will cover the truth as long as these ranges are supplied correctly.
Example 2 (Dependency modeling)
In constructing dependent input models, common approaches in the simulation literature fit the marginal description and the correlation of a multivariate model to a specified family. Examples include Gaussian copula (e.g., Lurie and Goldberg (1998), Channouf and L’Ecuyer (2009); also known as normaltoanything (NORTA), e.g. Cario and Nelson (1997)) and chessboard distribution (Ghosh and Henderson (2002)) that uses a domain discretization. These distributions are correctly constructed up to their marginal description and correlation, provided that these information are correctly specified. However, dependency structure beyond correlation can imply errors on these approaches (e.g., Lam (2016c)). Formulation (id1) can be used to get bounds that address such dependency. For example, suppose is a bivariate input model with marginal distributions , marginal means and covariance . We can set
where denote the random vector under , and are prespecified quantiles and probabilities of the respective marginal distributions. Unlike previous approaches, (id1) outputs correct bounds on the truth given correctly specified marginal quantiles and correlation, regardless of the dependency structure.
Example 3 (Model risk)
Model risk refers broadly to the uncertainty in analysis arising from the adopted model not being fully accurate. This inaccuracy occurs as the adopted model (often known as the baseline model), typically obtained from the best statistical fit or expert opinion, deviates from the truth due to the realworld nonstationarity and the lack of full modeling knowledge or capability. To assess model risk, a recently surging literature studies the use of statistical distance as a measurement of model discrepancy (e.g., Glasserman and Xu (2014), Lam (2016b)). Given the baseline model , the idea is to represent the uncertainty in terms of the distance away from the baseline via a neighborhood ball
(3) 
where is a distance defined on the nonparametric space of distributions (i.e., without restricting to any parametric families). The bounds drawn from formulation (id1) assesses the effect of model risk due to the input models, tuned by the ball size parameter that denotes the uncertainty level. Besides risk assessment, this approach can also be used to obtain consistent confidence bounds for the true performance measure, when is taken as the empirical distribution and and are chosen suitably. For example, when is a divergence and the model is finite discrete, one can choose as a scaled quantile. When is the supnorm and the model is continuous, then one can choose as the quantile of a KolmogorovSmirnov statistics. The statistical properties of these choices are endowed from the associated goodnessoffit tests (BenTal et al. (2013), Bertsimas et al. (2014)).
Our worstcase approach is inspired from the literature of robust optimization (BenTal et al. (2009), Bertsimas et al. (2011)), which considers decisionmaking under uncertainty and advocates optimizing decisions over worstcase scenarios. In particular, when the uncertainty lies in the probability distributions that govern a stochastic problem, the decision is made to optimize under the worstcase distributions. This approach has been used widely in robust stochastic control (e.g. Hansen and Sargent (2008), Petersen et al. (2000)) and distributionally robust optimization (e.g. Delage and Ye (2010), Lim et al. (2006)). It has also appeared in socalled robust simulation or robust Monte Carlo in the simulation literature (Hu et al. (2012), Glasserman and Xu (2014)). However, the methodologies presented in these literature focus on structured problems where the objective function is tractable, such as linear or linearly decomposable. In contrast, for most problems in stochastic simulation is nonlinear, unstructured and is only amenable to simulation, obstructing the direct adaption of the existing methods. In view of this, our main objective is to design an efficient simulationbased method to compute the worstcase bounds for formulation (id1) that can be applied to broad classes of simulation models and input uncertainty representations.
We study a simulationbased iterative procedure for the worstcase optimizations (id1), based on a modified version of the celebrated stochastic approximation (SA) method (e.g. Kushner and Yin (2003)). Because of the iterative nature, it is difficult to directly operate on the space of continuous distributions except in very special cases. Thus, our first contribution is to provide a randomized discretization scheme that can provably approximate the continuous counterpart. This allows one to focus on the space of discrete distributions on fixed support points as the decision variable for feeding into our SA algorithm.
We develop the SA method in several aspects. First, we construct an unbiased gradient estimator for based on the idea of the Gateaux derivative for functionals of probability distributions (Serfling (2009)), which is used to obtain the direction in each subsequent iterate in the SA scheme. The need for such construction is motivated by the difficulty in naïve implementation of standard gradient estimators: An arbitrary perturbation of a probability distribution, which is the decision variable in the optimization, may shoot outside the probability simplex and results in a gradient that does not bear any probabilistic meaning and subsequently does not support simulationbased estimation. Our approach effectively restricts the direction of perturbation to within a probability simplex so that the gradient is guaranteed to be simulable. We justify it as a nonparametric version of the classical likelihood ratio method (also called the score function method) (Glynn (1990), Reiman and Weiss (1989), Rubinstein (1986)).
Next, we design and analyze our SA scheme under the uncertainty constraints. We choose to use a stochastic counterpart of the socalled FrankWolfe (FW) method (Frank and Wolfe (1956)), known synonymously as the conditional gradient method in deterministic nonlinear programming. For convenience we call our scheme FWSA. Note that a standard SA iteration follows the estimated gradient up to a prespecified step size to find the next candidate iterate. When the formulation includes constraints, the common approach in the SA literature projects the candidate solution onto the feasible region in order to define the next iterate (e.g. Kushner and Yin (2003)). Instead, our method looks in advance for a feasible direction along which the next iterate is guaranteed to lie in the (convex) feasible region. In order to find this feasible direction, an optimization subproblem with a linear objective function is solved in each iteration. We base our choice of using FWSA on its computational benefit in solving these subproblems, as their linear objectives allow efficient solution scheme for highdimensional decision variables for many choices of the set .
We characterize the convergence rate of FWSA in terms of the step size and the number of simulation replications used to estimate the gradient at each iteration. The form of our convergence bounds suggests prescriptions for the stepsize and samplesize sequences that are efficient with respect to the cumulative number of sample paths simulated to generate all the gradients until the current iterate. The literature on the stochastic FW methods for nonconvex problems is small. Kushner (1974) proves almost sure convergence under assumptions that can prescribe algorithmic specifications only for onedimensional settings. During the review process of this paper, two other convergence rate studies Reddi et al. (2016) and Lafond et al. (2016) have appeared. Both of them assume the socalled Lipschitz condition on the gradient estimator that does not apply to our setting. Consequently, our obtained convergence rates are generally inferior to their results. Nonetheless, we shall argue that our rates almost match theirs under stronger assumptions on the behavior of the iterates that we will discuss.
Finally, we provide numerical validation of our approach using two sets of experiments, one testing the performance of our proposed randomized discretization strategy, and one on the convergence of FWSA. We also illustrate the impacts from the choices of constraints and investigate the form of the resulting worstcase input distributions in these examples.
We briefly survey three lines of related work. First, our paper is related to the large literature on input model uncertainty. In the parametric regime, studies have focused on the construction of confidence intervals or variance decompositions to account for both parameter and stochastic uncertainty using data, via for instance the delta method (Cheng and Holland (1998, 2004)), the bootstrap (Barton et al. (2013), Cheng and Holloand (1997)), Bayesian approaches (Zouaoui and Wilson (2003), Xie et al. (2014), Saltelli et al. (2010, 2008)), and metamodelassisted analysis (Xie et al. (2014, 2015)). Model selection beyond a single parametric model can be handled through goodnessoffit or Bayesian model selection and averaging (Chick (2001), Zouaoui and Wilson (2004)). Fully nonparametric approaches using the bootstrap have also been investigated (Barton and Schruben (1993, 2001), Song and Nelson (2015)).
Second, formulation (id1) relates to the literature on robust stochastic control (Petersen et al. (2000), Iyengar (2005), Nilim and El Ghaoui (2005)) and distributionally robust optimization (Delage and Ye (2010), Goh and Sim (2010), BenTal et al. (2013)), where the focus is to make decision rules under stochastic environments that are robust against the ambiguity of the underlying probability distributions. This is usually cast in the form of a minimax problem where the inner maximization is over the space of distributions. This idea has spanned across multiple areas like economics (Hansen and Sargent (2001, 2008)), finance (Glasserman and Xu (2013), Lim et al. (2011)), queueing (Jain et al. (2010)) and dynamic pricing (Lim and Shanthikumar (2007)). In the simulation context, Hu et al. (2012) compared different global warming policies using Gaussian models with uncertain mean and covariance information and coined the term “robust simulation” in their study. Glasserman and Xu (2014) and Lam (2016b) studied approximation methods for distancebased constraints in model risk quantification, via sample average approximation and infinitesimal approximation respectively. Simulation optimization under input uncertainty has also been studied in the framework of robust optimization (Fan et al. (2013), Ryzhov et al. (2012)) and via the closely related approach using risk measure (Qian et al. (2015), Zhou and Xie (2015)). Lastly, optimizations over probability distributions have also arisen as socalled generalized moment problems, applied to decision analysis (Smith (1995, 1993), Bertsimas and Popescu (2005)) and stochastic programming (Birge and Wets (1987)).
Our algorithm relates to the literature on the FW method (Frank and Wolfe (1956)) and constrained SA. The former is a nonlinear programming technique initially proposed for convex optimization, based on sequential linearization of the objective function using the gradient at the solution iterate. SA is the stochastic counterpart of gradientbased method under noisy observations. The classical work of Canon and Cullum (1968), Dunn (1979) and Dunn (1980) analyzed convergence properties of FW for deterministic convex programs. Recently, Jaggi (2013), Freund and Grigas (2014) and Hazan and Luo (2016) carried out new finitetime analysis for the FW method motivated by machine learning applications. For stochastic FW on nonconvex problems, Kushner (1974) focused on almost sure convergence based on a set of assumptions about the probabilistic behavior of the iterations, which were then used to tune the algorithm for onedimensional problems. Recently, while this paper was under review, Reddi et al. (2016) provided a complexity analysis in terms of the sample size in estimating gradients and the number of calls of the linear optimization routine. Lafond et al. (2016) studied the performance in terms of regret in an online setting. Both Reddi et al. (2016) and Lafond et al. (2016) relied on the Lipschitz condition that our gradient estimator violated. Other types of constrained SA schemes include the Lagrangian method (Buche and Kushner (2002)) and mirror descent SA (Nemirovski et al. (2009)). Lastly, general convergence results for SA can be found in Fu (1994), Kushner and Yin (2003) and Pasupathy and Kim (2011).
The remainder of this paper is organized as follows. Section id1 describes our formulation and assumptions. Section id1 presents our performance guarantees and discretization strategy. Section id1 focuses on gradient estimation on the space of probability distributions. Section id1 introduces the FWSA procedure. Section id1 presents theoretical convergence results on FWSA. Section id1 shows some numerical performance. Section id1 concludes and suggests future research. Finally, Appendix id1 gives the remaining proofs of our theorems.
We focus on that is a finite horizon performance measure generated from i.i.d. replications from the independent input models . Let be i.i.d. random variables on the space , each generated under . The performance measure can be written as
(4) 
where is a cost function, and denotes the expectation associated with the generation of the i.i.d. replications. We assume that can be evaluated by the computer given the inputs. In other words, the performance measure (4) can be approximated by running simulation.
(4) is the stylized representation for transient performance measures in discreteevent simulation. For example, and can be the sequences of interarrival and service times in a queue, and and are the interarrival time and service time distributions. When is the indicator function of the waiting time exceeding a threshold, (4) will denote the expected waiting time.
Next we discuss the constraints in (id1). Following the terminology in robust optimization, we call the uncertainty set for the th input model. Motivated by the examples in the Introduction, we focus on two types of convex uncertainty sets:

Moment and support constraints: We consider
(5) where is a generic random variable under distribution , , and . For instance, when , being or denotes the first two moments. When , denotes the crossmoment. Equalities can also be represented via (5) by including . Thus the uncertainty set (5) covers Examples 1 and 2 in the Introduction.
Furthermore, the neighborhood measured by certain types of statistical distance (Example 3) can also be cast as (5). For instance, suppose is induced by the supnorm on the distribution function on . Suppose is a continuous distribution and the baseline distribution is discrete with support points . The constraint
(6) where and denote the distribution functions for and respectively, can be reformulated as
where and denote the left and right limits of at , by using the monotonicity of distribution functions. Thus
where denotes the indicator function, falls into the form of (5). Bertsimas et al. (2014) considers this reformulation for constructing uncertainty sets for stochastic optimization problems, and suggests to select as the quantile of KolmogorovSmirnov statistics if is the empirical distribution function constructed from i.i.d. data.

Neighborhood of a baseline model measured by divergence: Consider
(7) where denotes the divergence from a baseline distribution given by
which is finite only when is absolutely continuous with respect to . The function is a convex function satisfying . This family covers many widely used distances. Common examples are giving the KL divergence, giving the (modified) distance, and giving the CressieRead divergence. Details of divergence can be found in, e.g., Pardo (2005) and BenTal et al. (2013).
As precursed in the Introduction, in the context of simulation analysis where are the input models, in (4) is in general a complex nonlinear function. This raises challenges in solving (id1) beyond the literature of robust control and optimization that considers typically more tractable objectives. Indeed, if is a linear function in ’s, then optimizing over the two types of uncertainty sets above can both be cast as specialized classes of convex programs that can be efficiently solved. But linear is too restrictive to describe the inputoutput relation in simulation. To handle a broader class of and to address its simulationbased nature, we propose to use a stochastic iterative method. The next sections will discuss our methodology in relation to the performance guarantees provided by (id1).
Suppose there is a “ground true” distribution for each input model. Let and be the minimum and maximum values of the worstcase optimizations (id1). Let be the true performance measure, i.e. . The following highlights an immediate implication of using (id1):
Theorem 1
If for all , then .
In other words, the bounds from the worstcase optimizations form an interval that covers the true performance measure if the uncertainty sets contain the true distributions.
We discuss a discretization strategy for the worstcase optimizations for continuous input distributions. We will show that, by replacing the continuous distribution with a discrete distribution on support points that are initially sampled from some suitably chosen distribution, we can recover the guarantee in Theorem 1 up to a small error. The motivation for using discretization comes from the challenges in handling decision variables in the form of continuous distributions when running our iterative optimization scheme proposed later.
Theorem 2
Consider in (4). Assume is bounded a.s.. Suppose and are positive numbers such that for all for some . Consider the optimizations
(8) 
Suppose that for each , one of the two cases occurs:

(9) where are observations drawn from a distribution such that the true distribution is absolutely continuous with respect to with satisfying , where denotes the essential supremum norm. Moreover, assume that satisfies and for all .

(10) where denotes the empirical distribution drawn from observations from a baseline . Assume that satisfies . Moreover, assume satisfies and . Additionally, assume for all .
Then
(11) 
Theorem 2 is proved in Appendix id1. We have a few immediate remarks:

Optimizations (8) are the sample counterparts of the original worstcase optimizations (id1) with uncertainty sets given by (5) or (7). These sample counterparts optimize discrete distributions over support points that are sampled from either or depending on the type of uncertainty. Roughly speaking, Theorem 2 guarantees that, if the original worstcase optimizations give valid covering bounds for the true performance measure (in the spirit of Theorem 1), then so are the sample counterparts, up to an error where denotes the order of the sample size used to construct the sets of support points.

The condition implies that or has a tail at least as heavy as . In practice, the tail of the true distribution is not exactly known a priori. This means that it is safer to sample the support points from a heavytailed distribution when moment constraints are imposed, and to use a heavytailed baseline distribution in the case of divergence neighborhood.

The conditions and state that and are in the interior of and respectively. These conditions guarantee that projected on a sample approximation of the support is feasible for (8), which helps lead to the guarantee (11). In general, the closer is to the boundary of the uncertainty set, i.e., the smaller the values of and , the larger the sample size is needed for the asymptotic behavior in (11) to kick in, a fact that is not revealed explicitly in Theorem 2. One way to control this required sample size is to expand the uncertainty set by a small margin, say , i.e., use and , in (9) and (10). Note that, in the case of moment equality constraint, say , one does have to deliberately relax the constraint to for the interiorpoint conditions to hold.
A probabilistic analog of Theorem 1 is:
Theorem 3
Suppose contains the true distribution for all with confidence , i.e. , then , where denotes the probability generated from a combination of data and prior belief.
Theorem 3 follows immediately from Theorem 1. In the frequentist framework, refers to the probability generated from data. However, Theorem 3 can also be cast in a Bayesian framework, in which can represent the prior (e.g., from expert opinion) or the posterior belief.
Theorem 3 reconciles with the established framework in distributionally robust optimization that the uncertainty set should be chosen as a confidence set for the true distribution, in order to provide a guarantee for the coverage probability on the true objective, in the case that represents the generation of data under a true model. For a moment constraint in the form , one can choose as the confidence bound of the moment. Section 3 in Delage and Ye (2010) discusses the construction of confidence regions based on more general semidefinite moment constraints. For the distancebased constraint in (6), under the assumption that is continuous, chosen as the quantile of where is a standard Brownian bridge, gives an approximate confidence region by utilizing the limiting distribution of the KolmogorovSmirnov statistic. Bertsimas et al. (2014) studies more general use of goodnessoffit statistics along this line to create confidence regions. For the divergencebased constraint in (7), under the assumption that has finite support of size , BenTal et al. (2013) proposes using in the case is taken as the empirical distribution, where is the quantile of a distribution with degree of freedom . This leads to an asymptotic confidence region by using divergencebased inference (Pardo (2005)). Recent works such as Lam and Zhou (2015), Lam (2016a) and Duchi et al. (2016) investigate the tightening of divergencebased regions using the empirical likelihood theory (Owen (2001)).
Similarly, the probabilistic analog of Theorem 2 is:
Theorem 4
Suppose all assumptions in Theorem 2 are in place except that or now holds true jointly for all with confidence under . Then .
Theorems 2 and 4 allow the translation from (id1), whose input models can be continuously represented, to (8) that is imposed over discrete distributions, by paying a small price of error. In the next section we discuss our algorithm over discrete distributions and point out clearly why the discretization is necessary. Obviously, when the input model is finite discrete, the sampling step depicted in Theorem 2 is unnecessary, and our subsequent results regarding the algorithm applies readily to this case.
Since we work in the discrete space, for simplicity we denote as the vector of probability weights for the discretized input model . This probability vector is understood to apply on the support points . Moreover, let where vec denotes a concatenation of the vectors ’s as a single vector, where . We denote as the dimensional probability simplex. Hence . For convenience, let , so that . The performance measure in (8) can be written as . Furthermore, denote as the maximum length of replications among all input models. We also write and for simplicity. Recall that denotes the indicator function for the event . In the rest of this paper, denotes transpose, and denotes the Euclidean norm of a vector . We also write as the variance under the input distribution . Inequalities for vectors are defined componentwise.
We shall present an iterative simulationbased scheme for optimizing (8). The first step is to design a method to extract the gradient information of . Note that the standard gradient of , which we denote as , obtained through differentiation of , may not lead to any simulable object. This is because an arbitrary perturbation of may shoot out from the set of probability simplices, and the resulting gradient will be a highdegree polynomial in that may have no probabilistic interpretation and thus is not amenable to simulationbased estimation.
We address this issue by considering the set of perturbations within the simplices. Our approach resembles the Gateaux derivative on a functional of probability distribution (Serfling (2009)) as follows. Given any , define a mixture distribution , where represents a point mass on , i.e. and 1 is at the th coordinate. The number is the mixture parameter. When , this reduces to the given distribution . We treat as a parameter and differentiate with respect to for each .
More precisely, let
Denote , and . We show that possesses the following two properties:
Theorem 5
Given such that , we have:

(12) for any and , where is the gradient of taken with respect to .

(13) where is defined as
(14) for .
The first property above states that and are identical when viewed as directional derivatives, as long as the direction lies within . Since the feasible region of optimizations (8) lies in , it suffices to focus on . The second property above states that can be estimated unbiasedly in a way similar to the classical likelihood ratio method (Glynn (1990), Reiman and Weiss (1989)), with playing the role of the score function. Since this representation holds without assuming any specific parametric form for , we view it as a nonparametric version of the likelihood ratio method.
To prove 1., consider first a mixture of with an arbitrary , in the form . It satisfies
by the chain rule. In particular, we must have
(15) 
where denotes partial derivative of with respect to . Writing (15) for all together gives
where is a vector of 1. Therefore
since . Summing up over , (12) follows.
To prove 2., note that we have
(16) 
where is the score function defined as
(17) 
Here where is chosen such that . The last equality in (16) follows from the fact that
Note that (17) can be further written as
which leads to (13). \@endparenv
The following provides a bound on the variance of the estimator for (See Appendix id1 for proof):
Lemma 1
Assume is bounded a.s., i.e. for some , and that . Each sample for estimating , by using one sample path of , possesses a variance bounded from above by .
The function derived via the above Gateaux derivative framework can be interpreted as a discrete version of the socalled influence function in robust statistics (Hampel (1974), Hampel et al. (2011)), which is commonly used to approximate the first order effect on a given statistics due to contamination of data. In general, the gradient represented by the influence function is defined as an operator on the domain of the random object distributed under . Thus, in the continuous case, this object has an infinitedimensional domain, and except in very special cases (e.g., tail probabilities of i.i.d. sum; Lam and Mottet (2015)), computing and encoding it is not implementable. This is the main reason why we seek for a discretization in the first place.
With the implementable form of the gradient described in Section id1, we design a stochastic nonlinear programming technique to solve (8). We choose to use the FrankWolfe method because, for the types of we consider in Section id1, effective routines exist for solving the induced linearized subproblems.
FWSA works as follows: For convenience denote . To avoid repetition we focus only on the minimization formulation in (id1). First, pretending that can be computed exactly, it iteratively updates a solution sequence as follows. Given a current solution , solve
(18) 
Let the optimal solution to (18) be . The quantity gives a feasible minimization direction starting from (note that is convex). This is then used to update to via for some step size . This expression can be rewritten as , which can be interpreted as a mixture between the distributions and .
When is not exactly known, one can replace it by an empirical counterpart. Theorem 5 suggests that we can replace by , and so the empirical counterpart of (18) is
(19) 
where is an estimator of using a sample size . Note that all components of can be obtained from these sample paths simultaneously. Letting be the optimal solution to (19), the update rule will be for some step size . The sample size at each step needs to grow suitably to compensate for the bias introduced in solving (19). All these are summarized in Procedure 1.
Initialization: where .
Input: Step size sequence , sample size sequence , .
Procedure: For each iteration , given :
By (12) and the separability of uncertainty set , the subproblem at each iteration can be written as
(20) 
where is the empirical counterpart of obtained in Algorithm 1. Hence (20) can be solved by separate convex programs. The update step follows by taking , where and is the solution to the th separate program.
The separate programs in (20) can be efficiently solved for the uncertainty sets considered in Section id1. To facilitate discussion, we denote a generic form of each separate program in (20) as
(21) 
for an arbitrary vector .
Case 1 in Theorem 2: Moment and support constraints. Consider where . Then (21) is a linear program.
Case 2 in Theorem 2: divergence neighborhood. Consider