Approximation Algorithms for Distributionally Robust
Stochastic Optimization with
Black-Box Distributions
Abstract
Two-stage stochastic optimization is a widely used framework for modeling uncertainty, where we have a probability distribution over possible realizations of the data, called scenarios, and decisions are taken in two stages: we make first-stage decisions knowing only the underlying distribution and before a scenario is realized, and may take additional second-stage recourse actions after a scenario is realized. The goal is typically to minimize the total expected cost. A common criticism levied at this model is that the underlying probability distribution is itself often imprecise! To address this, an approach that is quite versatile and has gained popularity in the stochastic-optimization literature is the distributionally robust 2-stage model: given a collection of probability distributions, our goal now is to minimize the maximum expected total cost with respect to a distribution in this collection.
There has been almost no prior work, however, on developing approximation algorithms for distributionally robust problems when the underlying scenario set is discrete, as is the case with discrete-optimization problems. We provide a framework for designing approximation algorithms in such settings when the collection of distributions is a ball around a central distribution, and the central distribution is accessed only via a sampling black box.
We first show that one can utilize the sample average approximation (SAA) method—solve the distributionally robust problem with an empirical estimate of the central distribution—to reduce the problem to the case where the central distribution has polynomial-size support. This follows because we argue that a distributionally robust problem can be reduced in a novel way to a standard 2-stage problem with bounded inflation factor, which enables one to use the SAA machinery developed for 2-stage problems. Complementing this, we show how to approximately solve a fractional relaxation of the SAA (i.e., polynomial-scenario central-distribution) problem. Unlike in 2-stage stochastic or robust optimization, this turns out to be quite challenging. We utilize the ellipsoid method in conjunction with several new ideas to show that this problem can be approximately solved provided that we have an (approximation) algorithm for a certain max-min problem that is akin to, and generalizes, the k-max-min problem—find the worst-case scenario consisting of at most k elements—encountered in 2-stage robust optimization. We obtain such a procedure for various discrete-optimization problems; by complementing this via LP-rounding algorithms that provide local (i.e., per-scenario) approximation guarantees, we obtain the first approximation algorithms for the distributionally robust versions of a variety of discrete-optimization problems including set cover, vertex cover, edge cover, facility location, and Steiner tree, with guarantees that are, except for set cover, within constant factors of the guarantees known for the deterministic version of the problem.
1 Introduction
Stochastic-optimization models capture uncertainty by modeling it via a probability distribution over a collection of possible realizations of the data, called scenarios. An important and widely used model is the 2-stage recourse model, where one seeks to take actions both before and after the data has been realized (stages I and II) so as to minimize the expected total cost incurred. Many applications fall under this setting. An oft-cited prototypical example is 2-stage stochastic facility location, wherein one needs to decide where to set up facilities to serve clients. The client-demand pattern is uncertain, but one does have some statistical information about the demands. One can open some facilities initially, given only the distributional information about demands; after a specific demand pattern is realized (according to this distribution), one can take additional recourse actions such as opening more facilities, incurring their recourse costs. The recourse costs are usually higher than the first-stage costs, as they may entail making decisions in rapid reaction to the observed scenario (e.g., deploying resources with smaller lead time).
An issue with the above 2-stage model, which is a common source of criticism, is that the distribution modeling the uncertainty is itself often imprecise! Usually, one models the distribution to be statistically consistent with some historical data, so we really have a collection of distributions, and a more robust approach is to hedge against the worst possible distribution. This gives rise to the distributionally robust 2-stage model: the setup is similar to that of the 2-stage model, but we now have a collection D of probability distributions; our goal is to minimize the maximum expected total cost with respect to a distribution in D. Formally, if X is the set of first-stage actions and the cost associated with x ∈ X is c(x), we want to solve the following problem:

(DRO)   min_{x ∈ X} { c(x) + max_{p ∈ D} E_{A∼p}[g(x, A)] },

where g(x, A) denotes the (optimal) cost of the recourse actions needed to augment x to a feasible solution for scenario A.
Distributionally robust (DR) stochastic optimization is a versatile approach dating back to [33] that has (re)gained interest recently in the Operations Research literature, where it is sometimes called data-driven or ambiguous stochastic optimization (see, e.g., [13, 2, 28, 9] and their references). The DR 2-stage model also serves to nicely interpolate between the extremes of: (a) 2-stage stochastic optimization, which optimistically assumes that one knows the underlying distribution precisely (i.e., the collection is a singleton); and (b) 2-stage robust optimization, which abandons the distributional view and seeks to minimize the maximum cost incurred in a scenario, thereby adopting the overly cautious approach of being robust against every possible scenario regardless of how likely it is for a scenario to materialize; this can be captured by letting the collection consist of all distributions over the scenario collection of the 2-stage robust problem. Both extremes can lead to suboptimal decisions: with stochastic optimization, the optimal solution for a specific distribution could be quite suboptimal even for a “nearby” distribution;¹ with robust optimization, the presence of a single scenario, however unlikely, may force certain decisions that are undesirable for all other scenarios. (¹There are examples where two distributions are arbitrarily close, yet an optimal solution for one can be arbitrarily bad when evaluated under the other.)
Despite its modeling benefits and popularity, to our knowledge, there has been almost no prior work on developing approximation algorithms for DR 2-stage discrete optimization, and, more generally, for DR 2-stage problems with a discrete underlying scenario set (as is the case in discrete optimization). (The exception is [1], which we discuss in Section 1.2.² ²Peripherally related is [39], who consider a version of DR facility location where the uncertainty only influences the costs and not the constraints, which yields a much simpler and more restrictive model.)
1.1 Our contributions
We initiate a systematic study of distributionally robust discrete 2-stage problems from the perspective of approximation algorithms. We develop a general framework for designing approximation algorithms for these problems when the collection is a ball around a central distribution in the L∞ metric, the L1 metric (total-variation distance), or a Wasserstein metric (defined below). (Note that this still allows interpolating between stochastic and robust optimization.) We make no assumptions about the central distribution; it could have exponential-size support, and our only means of accessing it is via a sampling black box.³ We view sampling from the black box as an elementary operation, so our running-time bounds also imply sample-complexity bounds. Settings where the collection is a ball in some probability metric arise naturally when one tries to infer a scenario distribution from observed data (see, e.g., [8, 9, 40])—hence the moniker data-driven optimization—and it has been argued that defining the ball using the Wasserstein metric has various benefits [9, 40, 13, 28]. (³The DR problem remains challenging even if the central distribution has polynomial-size support, but the scenario collection is exponentially large.)
We view the frameworks that we develop for DR discrete 2-stage problems as our chief contribution, and the techniques that we devise for dealing with Wasserstein metrics as the main feature of our work (see Theorem 1 below). We demonstrate the utility of our frameworks by using them to obtain the first approximation guarantees for the distributionally robust versions of various discrete-optimization problems such as set cover, vertex cover, edge cover, facility location, and Steiner tree. The guarantees that we obtain are, in most cases, within constant factors of the guarantees known for the deterministic (and 2-stage {stochastic, robust}) counterparts of these problems (see Table 1).
Formal model description.
We study the following distributionally robust 2-stage model. We are given an underlying set of scenarios, and a ball of distributions around a central distribution over this scenario set, under some metric on probability distributions. We can take first-stage actions before a scenario is realized, incurring a first-stage cost, and second-stage recourse actions after a scenario is realized; the combination of first- and second-stage actions must yield a feasible solution for each scenario. We want to minimize the first-stage cost plus the maximum, over distributions in the ball, of the expected second-stage cost when the scenario is drawn according to that distribution.
We use the term input size to mean the encoding size of the underlying deterministic problem, along with the first- and second-stage costs and the radius of the ball. It is standard in the study of 2-stage problems in the CS literature to assume that every first-stage action has a corresponding recourse action (e.g., facilities may be opened in either stage). We use an inflation parameter to measure the maximum factor by which the cost of a first-stage action increases in the second stage. We consider the cases where the metric on distributions is the L∞ metric; the L1 metric, which is the total-variation distance; or a Wasserstein metric.
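As a concrete reference point, the two non-Wasserstein ball metrics are straightforward to compute for distributions with explicit finite support. The sketch below (function names are ours, purely illustrative) computes both for distributions represented as scenario-to-probability dicts.

```python
# Sketch: the L-infinity and L1 distances between two discrete distributions,
# each given as a dict mapping a scenario label to its probability.
# Function names are illustrative, not from the paper.

def l_inf_dist(p, q):
    """L-infinity distance: max over scenarios of |p(A) - q(A)|."""
    scenarios = set(p) | set(q)
    return max(abs(p.get(A, 0.0) - q.get(A, 0.0)) for A in scenarios)

def l1_dist(p, q):
    """L1 distance: sum over scenarios of |p(A) - q(A)|."""
    scenarios = sorted(set(p) | set(q))  # sorted for deterministic summation
    return sum(abs(p.get(A, 0.0) - q.get(A, 0.0)) for A in scenarios)

p = {"A1": 0.5, "A2": 0.5}
q = {"A1": 0.2, "A2": 0.3, "A3": 0.5}
print(l_inf_dist(p, q))  # 0.5
print(l1_dist(p, q))     # 1.0
```

A ball of radius r around a central distribution then consists of all distributions within distance r of it under the chosen function.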
To motivate and define the rich class of Wasserstein metrics, note that while the choice of metric is a problem-dependent modeling decision, we would like the ball to contain other “reasonably similar” distributions, and exclude completely unrelated distributions, as the latter could lead to overly conservative decisions, à la robust optimization. One way of measuring the similarity between two distributions is to see if they spread their probability mass on “similar” scenarios. Wasserstein metrics capture this viewpoint crisply, and lift an underlying scenario metric to a metric on distributions over scenarios. The Wasserstein distance between two distributions is the minimal cost of moving probability mass to transform one distribution into the other, where the cost of moving a unit of mass from one scenario to another is the scenario-metric distance between them. (Observe that the L1 metric is, up to scaling, the Wasserstein metric with respect to the discrete scenario metric: distance 0 if the two scenarios are equal, and 1 otherwise.)
Example: DR 2-stage facility location. As a concrete example, consider the DR version of 2-stage facility location. We have a metric space over a set of facilities and a set of clients. A scenario is a subset of the clients, indicating the set of clients that need to be served in that scenario. (We can model integer demands by creating co-located clients.) We may open a facility in stage I or II, incurring an opening cost that may differ across the two stages. In a scenario, we need to assign every realized client to a facility opened in stage I or in that scenario; the second-stage cost of a scenario consists of the stage-II opening costs of the facilities opened in that scenario plus the total client-assignment cost. The goal is to minimize the stage-I opening cost plus the maximum, over distributions in the ball, of the expected second-stage cost. Here, the input size is the encoding size of the metric space and the opening costs.
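To make the second-stage (recourse) cost concrete, here is a toy brute-force sketch; the instance data and names are made up for illustration, and real instances would be handled via LPs rather than enumeration. Given the stage-I facilities and a realized scenario, the best stage-II action opens some extra facilities at their stage-II costs and assigns each client to its nearest open facility.

```python
from itertools import combinations

# Toy sketch of the recourse cost in 2-stage facility location: given the
# facilities opened in stage I and a realized scenario (a set of clients),
# brute-force over which extra facilities to open in stage II, then assign
# each client to the nearest open facility. All data below is made up.

dist = {("f1", "c1"): 1, ("f1", "c2"): 4,
        ("f2", "c1"): 3, ("f2", "c2"): 1}
stage2_open_cost = {"f1": 2, "f2": 2}   # (typically inflated) stage-II costs

def recourse_cost(stage1_open, scenario):
    facilities = list(stage2_open_cost)
    best = float("inf")
    for r in range(len(facilities) + 1):
        for extra in combinations(facilities, r):
            opened = set(stage1_open) | set(extra)
            if not opened:
                continue  # no facility open: clients cannot be assigned
            cost = sum(stage2_open_cost[f] for f in extra)
            cost += sum(min(dist[(f, c)] for f in opened) for c in scenario)
            best = min(best, cost)
    return best

print(recourse_cost({"f1"}, {"c1", "c2"}))  # open f2 too: 2 + 1 + 1 = 4
```

The DR objective then adds the stage-I opening cost and takes the worst-case expectation of this recourse cost over distributions in the ball.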
We consider two common choices for the scenario collection: (a) the unrestricted setting, where every subset of clients is a scenario, which is the usual setting in 2-stage stochastic optimization; and (b) the bounded setting, where the scenarios are all subsets of at most k clients, which is the usual setup in 2-stage robust optimization for modeling an exponential number of scenarios [11, 23, 17]. These two settings arise for other problems as well (with a suitable ground set in place of the clients).
In addition to using the L∞ or L1 metrics, we can consider various ways of defining a scenario metric in terms of the underlying assignment-cost metric, to capture the idea that two scenarios involving demand locations in the same vicinity are deemed similar; lifting these scenario metrics to Wasserstein metrics over distributions yields a rich class of DR 2-stage facility location models. For instance, we can define the asymmetric scenario metric in which the distance from scenario A to scenario B is the maximum, over clients in A, of the distance from that client to the nearest location in B; this measures the maximum separation between clients in A and locations in B (the resulting Wasserstein metric will now be an asymmetric metric on distributions). There are other natural scenario metrics as well, such as variants of this asymmetric metric and the symmetrizations of these asymmetric metrics.
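A minimal sketch of this asymmetric scenario metric (a directed-Hausdorff-style distance) and one natural symmetrization; the 1-D client positions are a made-up stand-in for the assignment-cost metric.

```python
# Sketch of the asymmetric scenario metric described above: the distance
# from scenario A to scenario B is the largest distance from a client in A
# to its nearest location in B. The client metric `c` uses made-up 1-D
# positions purely for illustration.

pos = {"c1": 0.0, "c2": 5.0, "c3": 6.0}

def c(j, k):
    return abs(pos[j] - pos[k])

def scenario_dist(A, B):
    """Asymmetric: max over clients j in A of min over k in B of c(j, k)."""
    return max(min(c(j, k) for k in B) for j in A)

def scenario_dist_sym(A, B):
    """One natural symmetrization: the max of the two directed distances."""
    return max(scenario_dist(A, B), scenario_dist(B, A))

print(scenario_dist({"c1", "c2"}, {"c2", "c3"}))  # c1 -> nearest is c2: 5.0
print(scenario_dist({"c2", "c3"}, {"c1", "c2"}))  # c3 -> nearest is c2: 1.0
```

The asymmetry is visible in the example: the two directed distances differ, and the symmetrized distance takes the larger of the two.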
Our results.
Our main result pertains to Wasserstein metrics, which have a great deal of modeling power; so suppose the ball is defined by the Wasserstein metric with respect to some scenario metric. To gain mathematical traction, it will be convenient to move to a relaxation of the DR 2-stage problem where we allow fractional second-stage decisions. Let the recourse cost of a scenario, given the first-stage actions, now denote the optimal second-stage cost when we allow fractional second-stage actions. (We will obtain integral second-stage actions by rounding an optimal fractional solution using an LP-relative approximation algorithm for the deterministic problem.)
We relate the approximability of the DR problem to that of known tasks in 2-stage stochastic and deterministic optimization, and to the following deterministic max-min problem: given the first-stage actions, find a scenario maximizing the recourse cost minus a penalty term proportional to the scenario-metric distance of that scenario from a given reference scenario.
Notice that this max-min problem ties together three distinct sources of complexity in the DR 2-stage problem: the combinatorial complexity of the underlying optimization problem, captured by the recourse-cost term; the complexity of the scenario set; and the complexity of the scenario metric, captured by the penalty term.
Theorem 1 (Combination of Theorems 3.5 and 3.7).
Suppose that we have the following.

An approximation algorithm for the max-min problem above, in a bicriteria sense: given the first-stage actions, it returns a scenario whose objective value is within a bounded factor of the optimum, where the recourse-cost term and the penalty term may be approximated by different factors;

A local approximation algorithm for the underlying 2-stage problem, which is an algorithm that rounds a fractional first-stage solution to an integral one while incurring at most a bounded-factor blowup in the first-stage cost, and in the cost of each scenario; and

An LP-relative approximation algorithm for the underlying deterministic problem.
Then we can obtain an approximation algorithm for the DR problem, with an approximation ratio determined by the above guarantees, running in polynomial time.
Ingredients 2 and 3 can be obtained using known results for 2-stage stochastic and deterministic optimization; ingredient 1 is the new component we need to supply to instantiate Theorem 1 and obtain results for specific DR 2-stage problems. (The nonstandard, bicriteria notion of approximation for the max-min problem is necessary, as the mixed-sign objective precludes any guarantee under the standard notion of approximation; see Theorem 3.12.) In various settings, we show that an approximation for the max-min problem can be obtained by utilizing results for the simpler max-min problem without the penalty term—find a scenario maximizing the recourse cost—encountered in 2-stage robust optimization (see the proof of Theorem 3.14 in Section 3.3.6): in the bounded setting, where scenarios are subsets of size at most k, this is called the k-max-min problem [11, 23, 17]. In particular, this applies to the L1 metric, as in this case the penalty term takes a particularly simple form.
Corollary 1.
Theorem 1 (to a partial extent) and Corollary 1 thus provide novel, useful reductions from DR 2-stage optimization to 2-stage {stochastic, robust} (and deterministic) optimization. (For instance, [15] devise approximations for the max-min problem in Corollary 1 for scenario sets defined by matroid-independence and/or knapsack constraints; Corollary 1 enables us to export these guarantees to the corresponding DR 2-stage problem with the L1 metric.) In some cases, we can improve upon the guarantees in Theorem 1. For certain covering problems, [34] showed how to obtain improved guarantees via a decoupling idea; by incorporating this idea within our reduction, we can improve the guarantee in Theorem 1 and obtain a better approximation (see “Set cover” in Section 3.3).
We demonstrate the versatility of our framework by applying Theorem 1 and these refinements to obtain guarantees for the DR versions of set cover, vertex cover, edge cover, facility location, and Steiner tree (Section 3.3). These constitute the majority of problems investigated for 2-stage optimization. Our strongest results are for facility location, vertex cover, and edge cover; for Steiner tree, we obtain results in the unrestricted setting. Table 1 summarizes these results.
Table 1: Summary of the approximation guarantees we obtain (see § 2) for the DR versions of facility location, vertex cover, edge cover, set cover, and Steiner tree, both under general Wasserstein metrics (given an approximation algorithm for the associated max-min problem) and under the L∞ and L1 metrics; “–” marks settings we do not consider. For Steiner tree, we obtain a constant-factor (160) guarantee in the unrestricted setting.
Technical takeaways for DR problems with Wasserstein metrics.
The reduction in Theorem 1 is obtained by supplementing tools from 2stage {stochastic, robust} optimization with various additional ideas. Its proof consists of two main components, both of which are of independent interest.
 Sample average approximation (SAA) for DR problems.

In Section 3.1, we prove that a simple and appealing approach in stochastic optimization called the SAA method can be applied to reduce the DR problem to the setting where the central distribution has polynomial-size support. In the SAA method, we draw some samples to estimate the central distribution by its empirical distribution, and solve the distributionally robust problem with the empirical distribution as the center. We show that (roughly speaking) by taking polynomially many samples, we can ensure that an approximate oracle for the SAA objective value can be combined with an approximation algorithm for the SAA problem, to obtain an approximate solution to the original problem, with high probability (see Theorem 3.5). It is well known that a polynomial dependence of the sample size on the inflation parameter is needed even for (standard) 2-stage stochastic problems in the black-box model [34]. Our SAA result substantially expands the scope of problems for which the SAA method is known to be effective with polynomial sample size. Previously, such results were known for the special case of 2-stage stochastic problems [4, 37] (see also [24]), and multistage stochastic problems with a constant number of stages [37].
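The SAA step itself is simple to sketch. In the snippet below, the sampler and sample size are illustrative stand-ins; the paper's sample-size bounds are what make the method provably effective.

```python
import random
from collections import Counter

# Sketch of the SAA step: draw N samples from the black-box central
# distribution and use the resulting empirical distribution (support size
# at most N) as a stand-in central distribution. The black box here is a
# toy sampler with made-up scenarios and weights.

def empirical_distribution(sample_black_box, n_samples, rng):
    counts = Counter(sample_black_box(rng) for _ in range(n_samples))
    return {scenario: cnt / n_samples for scenario, cnt in counts.items()}

def toy_black_box(rng):
    # stand-in for the true (possibly exponential-support) central distribution
    return rng.choices([frozenset({"c1"}), frozenset({"c1", "c2"})],
                       weights=[0.7, 0.3])[0]

rng = random.Random(0)
p_hat = empirical_distribution(toy_black_box, 10000, rng)
# p_hat has polynomial-size support; one then solves the DR problem with
# respect to a ball of the same radius around p_hat.
print(sorted(p_hat.values()))  # roughly [0.3, 0.7]
```

The nontrivial content of the SAA result is that an (approximately) good solution for the ball around the empirical center remains good for the ball around the true center.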
 Solving the polynomialsize centraldistribution case.

Complementing the above SAA result, we show how to approximately solve the DR 2-stage problem with a polynomial-size central distribution (Section 3.2). It is natural to move to a fractional relaxation of the problem, by replacing the first-stage set by a suitable polytope. In stark contrast with 2-stage {stochastic, robust} optimization, where the fractional relaxation of the polynomial-scenario problem immediately gives a polynomial-size LP and is therefore straightforward to solve in polytime, it is substantially more challenging to even approximately solve the fractional DR problem with a polynomial-size central distribution. In fact, this is perhaps the technically more challenging part of the paper. The crux of the problem is that, while the central distribution has polynomial-size support, there are (numerous) distributions in the ball that have exponential-size support, and one needs to optimize over such distributions. In particular, if we use duality to reformulate the problem as a minimization LP, this leads to an LP with an exponential number of both constraints and variables (see the discussion in Section 3.2). Thus, while we started with a polynomial-support central distribution, we have ended up in a situation similar to that in 2-stage stochastic or robust optimization with an exponential number of scenarios!
To surmount these obstacles, we work with a convex-programming formulation of the fractional problem, and solve this approximately by leveraging the ellipsoid-based machinery in [34] (see Theorem 3.7). Not surprisingly, this poses various fresh difficulties, chiefly because we are unable to compute approximate subgradients as required by [34]. We delve into these issues, and the ideas needed to overcome them, in Section 3.2.
Approximating the max-min problem.
We use the following natural strategy: “guess” the contribution of the penalty term in an optimal solution, possibly within a constant factor, and solve the resulting constrained max-min problem, in which the scenario-metric distance is constrained. It is easy to show that an approximation to the constrained problem yields a comparable approximation for the max-min problem (Lemma 3.25). In the unrestricted setting, we will usually be able to solve the constrained problem exactly, exploiting the fact that our problems are covering problems. In the bounded setting, we cast the constrained problem as a k-max-min problem, and utilize known results for this problem.
For facility location, the result of [23] requires creating co-located clients, which does not work for us. We illuminate a novel connection between cost-sharing schemes and k-max-min problems by showing that a cost-sharing scheme for FL having certain properties can be leveraged to obtain an approximation algorithm for k-max-min {integral, fractional} FL (see the proof of Theorem 3.20). In doing so, we also end up improving the approximation factor for k-max-min FL obtained in [23]. Whereas cost-sharing schemes have played a role in 2-stage stochastic optimization, in the context of the boosted-sampling approach of [18], they have not been used previously for k-max-min problems. (The approach in [17] has some similar elements, but there is no explicit use of cost shares.) Cost-sharing schemes offer a useful tool for designing algorithms for k-max-min problems that we believe will find further application.
DR problems with the L∞ metric.
For the L∞ metric (Section 4), we directly consider the fractional relaxation of the problem. As with the Wasserstein metric, even for a polynomial-scenario central distribution, solving the resulting problem is quite challenging, since it (again) leads to an LP with exponentially many variables and constraints. We move to a proxy objective that is pointwise close to the true objective, and show that an approximate subgradient of the proxy objective can be computed efficiently at any point, even when the scenario collection is exponentially large. This enables us to use the algorithm in [34] to solve the fractional problem; rounding this solution using a local approximation algorithm yields results for the DR discrete 2-stage problem. Table 1 lists the results we obtain for the L∞ metric as well.
1.2 Related work
Stochastic optimization is a field with a vast amount of literature (see, e.g., [3, 30, 32]), but its study from an approximation-algorithms perspective is relatively recent. Various approximation results have been obtained in the 2-stage recourse model over the last 15 years in the CS and Operations Research (OR) literature (see, e.g., [36]), but more general models, such as distributionally robust stochastic optimization, have received little or no attention in this regard.
To the best of our knowledge, with the exception of [1], which we discuss below, there are no prior approximation algorithms for distributionally robust 2-stage discrete-optimization problems when the number of possible scenarios is (finite, but) exponentially large (even if the central distribution has polynomial-size support). Much of the work in the stochastic-optimization and OR literature on these problems has focused on proving suitable duality results that sometimes allow one to reformulate the DR problem more compactly. Moreover, in many cases, the results obtained are for continuous scenario spaces and with other assumptions about the recourse costs. For instance, [9, 13, 40, 20] all consider the setting where the ambiguity set is a ball in the Wasserstein metric, and provide a closed-form description of the worst-case distribution in the ball, which is then used to reformulate the DR problem under further convexity assumptions on the scenario collection. DR problems have gained attention in recent years due to their usefulness in inferring decisions from observed data while avoiding the risk of overfitting: here the ball is used to model a class of distributions from which the observed data could arise (with high confidence). Various works have advocated the use of a Wasserstein ball around the empirical distribution for this purpose [9, 40, 13, 28], but there are no results proving polynomial bounds on the number of samples needed in order to produce provably good results. Note that these works, by definition, consider the setting where the central distribution has polynomial-size support. The distributionally robust setting has also been considered for chance-constrained problems; see, e.g., [8] and the references therein.
The work of [1] in the CS literature on the correlation gap can be interpreted as studying distributionally robust discrete-optimization problems, but in a very different setting where the collection of distributions is not a ball. Instead, it is the collection of distributions that agree with some given expected values; the correlation gap quantifies the worst-case ratio of the DR objective when one chooses the optimal decisions with respect to the distribution in the collection that treats all random variables as independent, versus the optimum of the DR problem. Agrawal et al. [1] proved various bounds on the correlation gap for submodular functions and subadditive functions admitting suitable cost shares. Various other works (see, e.g., [5, 29] and the references therein) have considered such moment-based collections, but again under continuity and/or convexity assumptions about the scenario space and/or recourse costs.
We now briefly survey the work on approximation algorithms under the stochastic and robust optimization models, which the DR model generalizes. As noted above, various approximation results have been obtained for 2-stage, and even multistage, problems. In the black-box model, a common approach is the SAA method, which simply consists of solving the stochastic-optimization problem for the empirical distribution obtained by sampling. The effectiveness of this method has been analyzed both for 2-stage stochastic problems [24, 4, 37] and multistage stochastic problems [37]. The sample-complexity bound in [24] is a non-polynomial bound for general 2-stage stochastic problems, whereas [4, 37] both obtain polynomial bounds for structured problems. The proof in [37] applies also to structured multistage linear programs, and [4] show that even approximate solutions to the 2-stage SAA problem translate to approximate solutions to the original 2-stage problem. We build upon the SAA machinery of Charikar et al. [4]. Previously, Shmoys and Swamy [34] showed how to use the ellipsoid method to solve structured 2-stage linear programs in the black-box model, and how to round the resulting fractional solution. We utilize their machinery based on approximate subgradients to solve the polynomial-scenario central-distribution setting. Approximation algorithms for 2-stage problems have also been developed via combinatorial means. The prominent technique here is the boosted-sampling technique of Gupta et al. [18]; the survey [36] gives a detailed description of these and other approximation results for 2-stage optimization.
Two-stage robust optimization, where the uncertainty is reflected in the constraints and not the data, was proposed in [6], who devised approximation algorithms for various problems in the polynomial-scenario setting. Notice that it is not clear how to even specify problems with exponentially many scenarios in the robust model. Feige et al. [11] expanded the model of [6] by considering what we call the bounded setting, where every subset of at most k elements is a scenario. Subsequently, [23] and [17] expanded the collection of results known for 2-stage robust problems in the bounded setting. We utilize results for the closely related k-max-min problem encountered in this setting in our work.
We briefly discuss a few other works that consider intermediary approaches between stochastic and robust optimization. Swamy [38] considers a model for risk-averse 2-stage stochastic optimization that interpolates between the stochastic and robust optimization approaches. In the context of online algorithms, Mirrokni et al. [26] and Esfandiari et al. [10] give online algorithms for allocation problems that are simultaneously competitive in both a random input model and an adversarial input model. Finally, we note that our distributionally robust setting can be seen to be in a similar spirit to a recent focus in algorithmic mechanism design, where one does not assume precise knowledge of the underlying distribution; rather, one (implicitly) has a collection of distributions, and one seeks to design mechanisms that work for every distribution in this collection; see, e.g., [21].
2 Problem definitions, and our general class of DR 2stage problems
Recall that we consider settings where we have a ball of distributions (over the scenario collection) around a central distribution, under some metric on distributions, and we seek to minimize the maximum expected cost with respect to a distribution in this ball. As mentioned earlier, we make no assumptions about the central distribution, and only require the ability to draw samples from it. The metrics that we consider are the L∞ metric, the L1 metric, and Wasserstein metrics. We now define Wasserstein metrics precisely.
Definition 2.1 (Wasserstein (a.k.a. transportation or earth-mover) distance).
The Wasserstein distance between two probability distributions q and q′ over the scenario collection is defined with respect to an underlying metric ℓ on scenarios. A transportation plan, or flow, from q to q′ is a nonnegative vector (γ_{A,B}) indexed by pairs of scenarios such that: (i) Σ_B γ_{A,B} = q_A for all A; and (ii) Σ_A γ_{A,B} = q′_B for all B. The Wasserstein distance between q and q′ is the minimum value of Σ_{A,B} γ_{A,B} · ℓ(A, B) over all transportation plans γ from q to q′.
If the underlying scenario metric is an asymmetric metric, then so is the resulting Wasserstein distance; if it is a pseudometric—i.e., it satisfies the triangle inequality, but the distance between two distinct scenarios could be 0—then so is the Wasserstein distance.
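For intuition, the definition can be exercised on tiny instances. When both distributions are uniform over the same number of scenarios, the Birkhoff–von Neumann theorem implies that some optimal transportation plan is a permutation, so brute force over permutations suffices; the scenario metric below (absolute difference on numbers standing in for scenarios) is purely illustrative.

```python
from itertools import permutations

# Sketch: Wasserstein distance between two *uniform* distributions over n
# scenarios each. With uniform marginals, some optimal transportation plan
# is a permutation matrix scaled by 1/n, so tiny instances can be solved
# by brute force over permutations.

def wasserstein_uniform(atoms_p, atoms_q, ell):
    n = len(atoms_p)
    assert n == len(atoms_q)
    return min(sum(ell(atoms_p[i], atoms_q[pi[i]]) for i in range(n)) / n
               for pi in permutations(range(n)))

ell = lambda a, b: abs(a - b)   # toy scenario metric on numeric "scenarios"
print(wasserstein_uniform([0, 4], [1, 5], ell))   # move each atom by 1: 1.0
print(wasserstein_uniform([0, 0], [2, 2], ell))   # 2.0
```

General (non-uniform, or differing-support) marginals require solving the transportation LP instead, which is how the distance is handled algorithmically.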
In Section 3.3, we consider the DR versions of set cover (and some special cases), facility location, and Steiner tree. DR 2-stage facility location was defined in Section 1.1; we define the remaining problems below, and then discuss the general class of DR 2-stage problems to which our framework applies. Recall the notion of input size from Section 1.1.

DR 2-stage set cover. We have a collection of subsets of a ground set. A scenario is a subset of the ground set and specifies the set of elements to be covered in that scenario. We may buy a set in either stage, incurring its stage-I or stage-II cost respectively. The sets chosen in stage I and in each scenario must together cover that scenario’s elements. The goal is to choose some first-stage sets, and sets in each scenario, so as to minimize the first-stage cost plus the maximum, over distributions in the ball, of the expected second-stage cost.
The input size here is the encoding size of the set system and the costs. We consider the unrestricted and bounded settings. Different scenarios could be quite unrelated, so there does not seem to be a natural choice for a (non-discrete) scenario metric; we therefore consider (balls in) the L∞ or L1 metrics.

DR 2-stage Steiner tree. We have a complete graph with metric edge costs, a root node, and an inflation factor. A scenario is a subset of nodes (called terminals) specifying the nodes that need to be connected to the root. We may buy an edge in stage I or II, incurring its edge cost in stage I, or that cost scaled by the inflation factor in stage II. The union of the edges bought in stage I and those bought in a scenario must connect all of that scenario’s terminals to the root, and we want to minimize the first-stage cost plus the maximum, over distributions in the ball, of the expected second-stage cost. (With nonuniform inflation factors for different edges, even 2-stage stochastic Steiner tree becomes at least as hard as group Steiner tree [31].)
Here the input size is the encoding size of the graph and the edge costs. We obtain results in the unrestricted setting, and leave the bounded setting for future work. As with facility location, in addition to the L∞ and L1 metrics, we can consider scenario metrics defined using the edge-cost metric, and the resulting Wasserstein metrics.
A general class of DR 2stage problems.
Abstracting away the key properties of the problems above, we now define the generic DR 2-stage problem that we consider. As before, the discrete problem has a finite first-stage action set. It will be convenient to consider the natural fractional relaxation of the DR problem obtained by enlarging the discrete first- and second-stage action sets to suitable polytopes. Recall that the recourse cost of a scenario, given the first-stage decision, is the optimal second-stage cost when we allow fractional second-stage actions; the fractional first-stage decisions range over a polytope containing the discrete first-stage set. (For example, for DR set cover, the recourse cost is the optimal value of a set-cover LP in which we may buy sets fractionally in the second stage.) One benefit of moving to the fractional relaxation is that, for every scenario, the recourse cost is a convex function of the first-stage decision, whose value and subgradient can be computed exactly.
Definition 2.2.
Let g be a real-valued function on the first-stage polytope. We say that d is a subgradient of g at the point u if g(v) − g(u) ≥ d · (v − u) for all v in the polytope. Given ε > 0, we say that d̂ is an ε-subgradient of g at u if, for every v in the polytope, the subgradient inequality holds up to an additive error term controlled by ε, in the sense of [34].
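A quick numeric sanity check of the (exact) subgradient inequality, using a small piecewise-linear convex function of the kind that arises as an LP value function; the data is made up.

```python
# Numeric check of the subgradient inequality for a small piecewise-linear
# convex function g(u) = max_i (a_i . u + b_i) -- the shape an LP value
# function takes. A subgradient at u is the slope of any maximizing piece.

pieces = [((1.0, 0.0), 0.0), ((0.0, 2.0), -1.0), ((-1.0, -1.0), 3.0)]

def g(u):
    return max(a[0] * u[0] + a[1] * u[1] + b for a, b in pieces)

def subgradient(u):
    # slope of a piece attaining the max at u
    a, b = max(pieces, key=lambda p: p[0][0] * u[0] + p[0][1] * u[1] + p[1])
    return a

u = (2.0, 1.0)
d = subgradient(u)
# the subgradient inequality g(v) - g(u) >= d . (v - u) must hold for all v
for v in [(0.0, 0.0), (3.0, -2.0), (-1.0, 4.0)]:
    gap = g(v) - g(u) - (d[0] * (v[0] - u[0]) + d[1] * (v[1] - u[1]))
    assert gap >= -1e-12
print("subgradient inequality holds at", u)
```

An ε-subgradient relaxes this inequality by an ε-dependent error term, which is what makes it computable from samples in the black-box setting.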
Following [4, 34, 37], we consider the following generic DR 2-stage problem (Q) with discrete first-stage set , and its (further) fractional relaxation (Q), and require that they satisfy the properties listed below. Let denote the norm of .
(Q) 
(Q) 
In proving their SAA result for 2-stage stochastic problems, [4] define properties 1 and 2 below to capture the fact that every first-stage action has a corresponding recourse action that is more expensive by a bounded factor, and hence that it is always feasible to take no first-stage actions.

, , , and for all .

We know an inflation parameter such that for all .
Since we apply the ellipsoid-based machinery of [34] to solve the fractional problem with a polynomial-size central distribution, we need bounds on the feasible region in terms of enclosing and enclosed balls; this is captured by 1, which is lifted directly from [34]. Note that the vast majority of 2-stage problems (including , , ) involve decisions, with and so , so 1 is readily satisfied. As in [34], we need to be able to compute the value and subgradient of the recourse cost , which is a benign requirement since is the optimal value of a polytime-solvable LP in all our applications. Whereas [34] define a syntactic class of 2-stage stochastic LPs and show (implicitly) that they satisfy this requirement, we explicitly isolate this requirement in 2, 3.

We have positive bounds and such that and contains a ball of radius such that .

For every , is convex over , and can be efficiently computed for every .

For every , we can efficiently compute a subgradient of at with , where . Hence, the Lipschitz constant of is at most (due to Definition 2.2).
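As a concrete (toy) instance of the convexity and subgradient properties above, consider a single-resource recourse cost; all names and numbers here are ours, chosen only to illustrate the subgradient inequality and the norm bound.

```python
# Toy recourse cost (ours): one resource with per-scenario demand D; the
# first stage buys amount x at unit cost, the second stage completes the
# shortfall at inflated unit cost lam. Then g(x) = lam * max(0, D - x) is
# convex in x, with subgradient -lam for x < D and 0 for x > D (either
# value works at the kink x = D), so subgradient norms are bounded by lam.

def g(x, D, lam):
    return lam * max(0.0, D - x)

def g_subgrad(x, D, lam):
    return -lam if x < D else 0.0

D, lam = 5.0, 4.0
for u in [2.0, 7.0]:
    s = g_subgrad(u, D, lam)
    for x in [0.0, 3.0, 5.0, 9.0]:
        # subgradient inequality: g(x) >= g(u) + s * (x - u)
        assert g(x, D, lam) >= g(u, D, lam) + s * (x - u) - 1e-9
    assert abs(s) <= lam  # norm bound, as required by the property above
```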
Finally, we need the following additional mild condition.

When is the Wasserstein metric with respect to a scenario metric , we know with such that for all and all with .
As noted above, 1–3 are gathered from [4, 34], and hold for all the 2-stage problems considered in the CS literature (see [37, 6, 11, 23, 17]); 1 is a new requirement, but is also rather mild and holds for all the problems we consider. 1, 2 and 1 are used to prove that SAA works for the DR problem under the Wasserstein metric (Section 3.1). 1–3 pertain to the fractional relaxation, and are utilized to show that one can efficiently solve the SAA problem approximately (Section 3.2).
A solution to (Q) needs to be rounded to yield integral second-stage actions: any LP-relative approximation algorithm for the deterministic version of the problem can be used to obtain recourse actions for each scenario having cost at most . To round a fractional solution to (Q), we utilize a local approximation algorithm for the 2-stage problem: we say that is a local approximation algorithm for (Q) if, given any , it returns an integral solution and implicitly specifies integral recourse actions for every , such that and for all . An approximate solution to (Q), combined with a local approximation algorithm, yields an approximate solution to the discrete DR 2-stage problem. Local approximation algorithms exist for various 2-stage problems—e.g., set cover, vertex cover, facility location [34]—with approximation factors comparable to those known for their deterministic counterparts.
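To make the notion of a local (per-scenario) guarantee concrete, here is a hedged sketch of one such algorithm for 2-stage vertex cover; the threshold-rounding scheme and its factor of 4 are our illustrative choice, not the specific algorithm of [34].

```python
# Hedged sketch (ours): local threshold-rounding for 2-stage vertex cover.
# If the fractional first-stage x and per-scenario recourse y satisfy
# x[u] + y[u] + x[v] + y[v] >= 1 for every edge (u, v) of the scenario,
# then some variable is >= 1/4, so rounding both vectors at threshold 1/4
# covers every edge; each stage's integral cost is at most 4 times its own
# fractional cost, i.e., a per-scenario (local) 4-approximation.

def round_at(frac, thresh=0.25):
    return {v for v, val in frac.items() if val >= thresh}

def local_round(edges, x, y):
    S1, S2 = round_at(x), round_at(y)
    # every edge is covered by a stage-I or stage-II vertex
    assert all(u in S1 or u in S2 or v in S1 or v in S2 for u, v in edges)
    return S1, S2

x = {1: 0.3, 2: 0.0, 3: 0.26}   # fractional first-stage decisions
y = {1: 0.0, 2: 0.8, 3: 0.0}    # fractional recourse for one scenario
S1, S2 = local_round([(1, 2), (2, 3)], x, y)
```

The key point is that the rounding is applied independently to each stage's vector, so the cost guarantee holds scenario by scenario rather than only in expectation.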
3 Distributionally robust problems under the Wasserstein metric
We now focus on the DR 2-stage problem (Q) when is the Wasserstein metric with respect to a metric on scenarios. Plugging in the definition of (with respect to the scenario metric ), we can rewrite (Q) as follows.
(Q) 
(T)  
s.t.  (1)  
(2)  
(3) 
Let denote the optimal value of (Q). We note that the naive approach of ignoring the uncertainty in the underlying distribution and considering only the central distribution yields, as one would expect, poor bounds. Suppose is an approximate solution for the 2-stage problem . Given 1, one can show that (and is at least ), which implies , but this is too weak a guarantee since could be quite large compared to .
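A toy numeric illustration (all numbers ours) of why optimizing only for the central distribution can be badly suboptimal over the Wasserstein ball: with two scenarios whose second-stage costs differ widely, the adversary can move mass from the cheap scenario to the expensive one at low transport cost.

```python
# Illustration (ours): two scenarios at scenario-distance 1, with
# second-stage costs 1 and 100; the central distribution puts all its mass
# on the cheap scenario. A transport budget (ball radius) r lets the
# adversary move r mass to the expensive scenario, so the worst-case
# expected cost far exceeds the central expected cost of 1.

def worst_case_cost(costs, r):
    cheap, expensive = costs
    moved = min(r, 1.0)  # adversary moves as much mass as the radius allows
    return (1.0 - moved) * cheap + moved * expensive

central_cost = 1.0
assert worst_case_cost((1.0, 100.0), 0.5) == 50.5  # vs. central cost 1.0
```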
In Section 3.1, we work with (Q) and show that the SAA approach can be used to reduce to the case where the central distribution has polynomial-size support. In Section 3.2, we show how to approximately solve the polynomial-size-support case by applying the ellipsoid method to its (further) relaxation (Q), where we replace with . Here, we utilize a local approximation algorithm to move from to , and thereby interface with, and complement, the SAA result for (Q) proved in Section 3.1. This result applies more generally, even when is not a metric; we only require that for all . (If is not a metric, the Wasserstein distance with respect to need not yield a metric on distributions.)
In Section 3.3, we consider various combinatorial-optimization problems, and combine the above results to obtain the first approximation results for the DR versions of these problems.
3.1 A sample-average-approximation (SAA) result for distributionally robust problems
The SAA approach is the following simple, intuitive idea: draw some samples from , estimate by the empirical distribution induced by these samples, and solve the resulting SAA problem (Q). We prove the following SAA result. For any , if we construct SAA problems, each using independent samples, and if we have an approximation algorithm for computing the objective value of the SAA problem at any given point, then we can utilize approximate solutions to these SAA problems to obtain a solution satisfying with high probability; Theorem 3.5 gives the precise statement.
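The first step of the SAA approach — replacing the black-box central distribution by an empirical estimate — can be sketched as follows; the sampling oracle here is a hypothetical stand-in for the black box, and the scenario encoding (a frozenset of elements) is our illustrative choice.

```python
# Minimal SAA sketch (ours): draw N i.i.d. samples from the black-box
# central distribution and replace it by the empirical distribution, which
# has support of size at most N; the DR problem is then re-solved with
# this polynomial-support center.
import random
from collections import Counter

def empirical_distribution(sample_oracle, N, rng):
    counts = Counter(sample_oracle(rng) for _ in range(N))
    return {scenario: c / N for scenario, c in counts.items()}

rng = random.Random(0)

def oracle(rng):
    # hypothetical black box: a scenario is a random subset of 3 elements
    return frozenset(e for e in range(3) if rng.random() < 0.5)

p_hat = empirical_distribution(oracle, 2000, rng)
assert abs(sum(p_hat.values()) - 1.0) < 1e-9  # a valid distribution
```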
The proof has several ingredients. There are two main approaches [4, 37] for showing that the SAA method with a polynomial number of samples works for stochastic-optimization problems. Charikar et al. [4] prove the following SAA result for 2-stage problems.
Theorem 3.1 ([4]).
Consider a 2-stage problem (2StP): , with scenario set , where satisfy 1, 2 with inflation parameter . With probability at least , any optimal solution to the SAA problem constructed using samples is an approximate solution to (2StP). More generally, there is a way of using an approximation algorithm for the SAA problem, in conjunction with an approximate objective-value oracle for the SAA problem, to obtain an approximate solution to (2StP) with high probability.
Note that (Q) is not a standard 2-stage stochastic-optimization problem, because constraint (2) couples the various scenarios, which prevents us from applying Theorem 3.1 to (Q). The SAA result of Swamy and Shmoys [37] applies to the fractional relaxation of the problem, and works whenever the objective functions of the SAA and original problems satisfy a certain “closeness-in-subgradients” property. A subgradient of at a point is obtained from the optimal distribution for the inner maximization problem in (Q). This is, however, an exponential-size object, and utilizing it to prove closeness-in-subgradients seems quite daunting.
Our first insight is that we can decouple the scenarios by Lagrangifying constraint (2) using a dual variable . By standard duality arguments, this leads to the following reformulation of (Q).
(R) 
Recall that . Let . The chief benefit of the reformulation (R) is that we can view (R) as a 2-stage problem: the first-stage action set is , and the optimal second-stage cost of scenario under first-stage actions is given by . This makes (R) amenable to the SAA machinery developed for 2-stage problems. We can exploit 1 to show that we may limit to the range in (R), and use 2 to bound the inflation factor of (R).
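To make the decoupling concrete in generic notation (all symbols here are ours, since the displays are elided, and this is only a sketch of the standard Lagrangian-duality step for Wasserstein balls): writing $f(x;B)$ for the fractional second-stage cost of scenario $B$, $\hat{p}$ for the central distribution, $r$ for the ball radius, $\ell$ for the scenario metric, and $\lambda \ge 0$ for the multiplier of the transport-cost constraint (2), the reformulation (R) takes the shape

```latex
\min_{x,\;\lambda \ge 0} \;\; c \cdot x \;+\; \lambda r \;+\;
  \mathbb{E}_{A \sim \hat{p}}\!\left[\, g\bigl((x,\lambda);\, A\bigr) \right],
\qquad \text{where} \quad
g\bigl((x,\lambda);\, A\bigr) \;=\; \max_{B}\,\bigl( f(x;\, B) \;-\; \lambda\,\ell(A, B) \bigr).
```

For a fixed first-stage pair $(x,\lambda)$, the expectation is over independent scenarios $A \sim \hat{p}$, with $g$ playing the role of the second-stage cost — which is exactly the 2-stage view described in the text.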
Lemma 3.2.
Proof.
The second statement is immediate from the first one since (Q) and (R) have the same optimal values. So we focus on showing the first statement.
Consider any . There exists such that . If , then we are done. So suppose . We argue that . This completes the proof since we also have for all . Clearly, . If is such that , then it must be that . Otherwise, , where the last inequality follows from 1. This contradicts the choice of . Therefore, we have , completing the proof. ∎
Proof.
Given Lemmas 3.2 and 3.3, by suitably discretizing , one can use Theorem 3.1 to show the following: if we construct the SAA problem using samples, and can (approximately) compute the SAA objective value at any given point, then, with high probability, one can translate an approximate solution to the SAA problem into an approximate solution to (Q). But this result does not quite suit our purposes, for various reasons.
The term could be rather large, and is not , so this does not yield polynomial sample complexity. (The problem persists even if we apply the closeness-in-subgradients machinery of [37] to the fractional version of (R): this would involve estimating to within an term, where , which requires samples.) Moreover, it seems difficult to compute the SAA objective value , or even to approximate it. This difficulty arises because computing encompasses the NP-hard problem encountered in 2-stage robust optimization, and furthermore, the mixed-sign objective in makes it hard even to approximate (see Theorem 3.12).
We need various ideas to circumvent these issues. We show that we can eliminate the dependence on altogether, at the expense of a slight deterioration in the approximation ratio when moving from the SAA problem to the original problem. The term arises because might be attained by a scenario where (see the proof of Lemma 3.3). Our crucial second insight is that we can eliminate this and reduce the sample complexity to by explicitly imposing that we never encounter pairs with ; we call such pairs long edges, and the remaining pairs short edges. Any satisfying (2) can send at most flow on the long edges. Motivated by this, we “decompose” into and , which are (roughly speaking) the contributions from the short and long edges respectively. (This decomposition is akin to the division into low- and high-cost scenarios used by [4] to prove Theorem 3.1, but there are significant technical differences, which complicate things for us, as we discuss below.) We define and as follows.
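The claim that a feasible transport plan sends little flow on long edges is essentially Markov's inequality applied to edge lengths; a tiny numeric sketch (symbols and numbers ours, with r standing for the ball radius and T for the length threshold defining long edges):

```python
# Illustration (ours): if a transport plan z has total transport cost
# sum_e z[e] * len[e] <= r, then the mass it places on "long" edges
# (those with len[e] > T) is at most r / T, by Markov's inequality.

def long_edge_mass(plan, length, T):
    return sum(m for e, m in plan.items() if length[e] > T)

r, T = 1.0, 5.0
plan = {"e1": 0.94, "e2": 0.05, "e3": 0.01}     # mass moved per edge
length = {"e1": 0.5, "e2": 6.0, "e3": 20.0}     # scenario-metric lengths

assert sum(plan[e] * length[e] for e in plan) <= r  # plan is feasible
assert long_edge_mass(plan, length, T) <= r / T + 1e-9
```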
Lemma 3.4.
For every central distribution , and every , we have .
Proof.
We prove this by showing that: (i) ; and (ii) . Given these bounds, the upper bound on follows from the upper bounds on and in parts (i) and (ii) respectively. For the other direction, we have