
# Approximation Algorithms for Distributionally Robust Stochastic Optimization with Black-Box Distributions

André Linhares and Chaitanya Swamy. {alinhare,cswamy}@uwaterloo.ca. Dept. of Combinatorics and Optimization, University of Waterloo, Waterloo, ON N2L 3G1. Supported in part by NSERC grant 327620-09 and an NSERC Discovery Accelerator Supplement award.
###### Abstract

Two-stage stochastic optimization is a widely used framework for modeling uncertainty, where we have a probability distribution over possible realizations of the data, called scenarios, and decisions are taken in two stages: we make first-stage decisions knowing only the underlying distribution and before a scenario is realized, and may take additional second-stage recourse actions after a scenario is realized. The goal is typically to minimize the total expected cost. A common criticism levied at this model is that the underlying probability distribution is itself often imprecise! To address this, an approach that is quite versatile and has gained popularity in the stochastic-optimization literature is the distributionally robust 2-stage model: given a collection $\mathcal{D}$ of probability distributions, our goal now is to minimize the maximum expected total cost with respect to a distribution in $\mathcal{D}$.

There has been almost no prior work, however, on developing approximation algorithms for distributionally robust problems when the underlying scenario set is discrete, as is the case with discrete-optimization problems. We provide a framework for designing approximation algorithms in such settings when the collection $\mathcal{D}$ is a ball around a central distribution and the central distribution is accessed only via a sampling black box.

We first show that one can utilize the sample average approximation (SAA) method—solve the distributionally robust problem with an empirical estimate of the central distribution—to reduce the problem to the case where the central distribution has polynomial-size support. This follows because we argue that a distributionally robust problem can be reduced in a novel way to a standard 2-stage problem with bounded inflation factor, which enables one to use the SAA machinery developed for 2-stage problems. Complementing this, we show how to approximately solve a fractional relaxation of the SAA (i.e., polynomial-scenario central-distribution) problem. Unlike in 2-stage stochastic or robust optimization, this turns out to be quite challenging. We utilize the ellipsoid method in conjunction with several new ideas to show that this problem can be approximately solved provided that we have an (approximation) algorithm for a certain max-min problem that is akin to, and generalizes, the $k$-max-min problem—find the worst-case scenario consisting of at most $k$ elements—encountered in 2-stage robust optimization. We obtain such a procedure for various discrete-optimization problems; by complementing this via LP-rounding algorithms that provide local (i.e., per-scenario) approximation guarantees, we obtain the first approximation algorithms for the distributionally robust versions of a variety of discrete-optimization problems including set cover, vertex cover, edge cover, facility location, and Steiner tree, with guarantees that are, except for set cover, within $O(1)$-factors of the guarantees known for the deterministic version of the problem.

## 1 Introduction

Stochastic-optimization models capture uncertainty by modeling it via a probability distribution over a collection of possible realizations of the data, called scenarios. An important and widely used model is the 2-stage recourse model, where one seeks to take actions both before and after the data has been realized (stages I and II) so as to minimize the expected total cost incurred. Many applications come under this setting. An oft-cited prototypical example is 2-stage stochastic facility location, wherein one needs to decide where to set up facilities to serve clients. The client-demand pattern is uncertain, but one does have some statistical information about the demands. One can open some facilities initially, given only the distributional information about demands; after a specific demand pattern is realized (according to this distribution), one can take additional recourse actions such as opening more facilities incurring their recourse costs. The recourse costs are usually higher than the first-stage costs, as they may entail making decisions in rapid reaction to the observed scenario (e.g., deploying resources with smaller lead time).

An issue with the above 2-stage model, which is a common source of criticism, is that the distribution modeling the uncertainty is itself often imprecise! Usually, one models the distribution to be statistically consistent with some historical data, so we really have a collection of distributions, and a more robust approach is to hedge against the worst possible distribution. This gives rise to the distributionally robust 2-stage model: the setup is similar to that of the 2-stage model, but we now have a collection $\mathcal{D}$ of probability distributions; our goal is to minimize the maximum expected total cost with respect to a distribution in $\mathcal{D}$. Formally, if $X$ is the set of first-stage actions and the cost associated with $x\in X$ is $c^\top x$, we want to solve the following problem:

$$\min_{x\in X}\;\; c^\top x \;+\; \max_{q\in\mathcal{D}}\; \mathbb{E}_{A\sim q}\bigl[g(x,A)\bigr] \tag{DRO}$$

where $g(x,A)$ denotes the minimum total cost of the second-stage recourse actions needed to augment $x$ to a feasible solution for scenario $A$.
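For intuition, when $L$ is the $L_1$ metric and the scenario set is small and explicit, the inner maximization over $q\in\mathcal{D}$ is a linear program whose optimum has a simple greedy form: shift probability mass from the cheapest scenarios onto the single costliest scenario, spending the $L_1$ budget. The sketch below is illustrative only (the function name and toy numbers are ours, not from the paper):

```python
def worst_case_expectation(p, g, r):
    """Maximize sum_A q[A]*g[A] over distributions q with ||q - p||_1 <= r.

    Greedy: move up to r/2 probability mass from the cheapest scenarios
    onto the single most expensive scenario (each unit of mass moved
    consumes two units of L1 budget).  p = central distribution,
    g = per-scenario costs, r = ball radius.
    """
    n = len(p)
    q = list(p)
    best = max(range(n), key=lambda i: g[i])
    budget = min(r / 2.0, 1.0 - q[best])  # mass we may add to the costliest scenario
    # Drain mass from scenarios in increasing order of cost.
    for i in sorted(range(n), key=lambda i: g[i]):
        if budget <= 0 or i == best or g[i] >= g[best]:
            break
        take = min(q[i], budget)
        q[i] -= take
        q[best] += take
        budget -= take
    return q, sum(qa * ga for qa, ga in zip(q, g))
```

For instance, with central distribution `[0.2, 0.3, 0.5]`, costs `[10, 4, 1]`, and radius `0.4`, the worst-case distribution shifts 0.2 mass from the cheapest scenario onto the costliest one.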

Distributionally robust (DR) stochastic optimization is a versatile approach dating back to [33] that has (re)gained interest recently in the Operations Research literature, where it is sometimes called data-driven or ambiguous stochastic optimization (see, e.g., [13, 2, 28, 9] and their references). The DR 2-stage model also serves to nicely interpolate between the extremes of: (a) 2-stage stochastic optimization, which optimistically assumes that one knows the underlying distribution precisely (i.e., $\mathcal{D}$ is a singleton); and (b) 2-stage robust optimization, which abandons the distributional view and seeks to minimize the maximum cost incurred in a scenario, thereby adopting the overly cautious approach of being robust against every possible scenario regardless of how likely it is for a scenario to materialize; this can be captured by letting $\mathcal{D}$ be the set of all distributions over $\mathcal{A}$, where $\mathcal{A}$ is the scenario-collection in the 2-stage robust problem. Both extremes can lead to suboptimal decisions: with stochastic optimization, the optimal solution for a specific distribution could be quite suboptimal even for a “nearby” distribution (there are examples where two distributions are arbitrarily close, yet an optimal solution for one can be arbitrarily bad when evaluated under the other); with robust optimization, the presence of a single scenario, however unlikely, may force certain decisions that are undesirable for all other scenarios.

Despite its modeling benefits and popularity, to our knowledge, there has been almost no prior work on developing approximation algorithms for DR 2-stage discrete optimization and, more generally, for DR 2-stage problems with a discrete underlying scenario set (as is the case in discrete optimization). (The exception is [1], which we discuss in Section 1.2. Peripherally related is [39], who consider a version of DR facility location where the uncertainty only influences the costs and not the constraints, which yields a much simpler and more restrictive model.)

### 1.1 Our contributions

We initiate a systematic study of distributionally robust discrete 2-stage problems from the perspective of approximation algorithms. We develop a general framework for designing approximation algorithms for these problems when the collection $\mathcal{D}$ is a ball around a central distribution $\mathring{p}$ in the $L_\infty$ metric, the $L_1$ metric (total-variation distance), or a Wasserstein metric (defined below). (Note that this still allows interpolating between stochastic and robust optimization.) We make no assumptions about $\mathring{p}$; it could have exponential-size support, and our only means of accessing $\mathring{p}$ is via a sampling black box. (The DR problem remains challenging even if $\mathring{p}$ has polynomial-size support but the scenario collection is exponential.) We view sampling from the black box as an elementary operation, so our running-time bounds also imply sample-complexity bounds. Settings where $\mathcal{D}$ is a ball in some probability metric arise naturally when one tries to infer a scenario distribution from observed data (see, e.g., [8, 9, 40])—hence the moniker data-driven optimization—and it has been argued that defining $\mathcal{D}$ using the Wasserstein metric has various benefits [9, 40, 13, 28].

We view the frameworks that we develop for DR discrete 2-stage problems as our chief contribution, and the techniques that we devise for dealing with Wasserstein metrics as the main feature of our work (see Theorem 1 below). We demonstrate the utility of our frameworks by using them to obtain the first approximation guarantees for the distributionally robust versions of various discrete-optimization problems such as set cover, vertex cover, edge cover, facility location, and Steiner tree. The guarantees that we obtain are, in most cases, within $O(1)$-factors of the guarantees known for the deterministic (and 2-stage-{stochastic, robust}) counterpart of the problem (see Table 1).

##### Formal model description.

We study the following distributionally robust 2-stage model. We are given an underlying set $\mathcal{A}$ of scenarios, and a ball $\mathcal{D}$ of distributions around a central distribution $\mathring{p}$ over $\mathcal{A}$ under some metric $L$ on probability distributions. We can take first-stage actions $x\in X$ before a scenario is realized, incurring a first-stage cost $c^\top x$, and second-stage recourse actions after a scenario is realized; the combination of the first-stage actions and the second-stage actions for a scenario $A$ must yield a feasible solution, for each scenario $A\in\mathcal{A}$. Using $A\sim q$ to denote that scenario $A$ is drawn according to distribution $q$, we want to solve: $\min_{x\in X}\ c^\top x + \max_{q\in\mathcal{D}} \mathbb{E}_{A\sim q}[g(x,A)]$.

We use $I$ to denote the input size, which always measures the encoding size of the underlying deterministic problem, along with the first- and second-stage costs and the radius $r$ of the ball $\mathcal{D}$. It is standard in the study of 2-stage problems in the CS literature to assume that every first-stage action has a corresponding recourse action (e.g., facilities may be opened in either stage). We use $\lambda$ to denote an inflation parameter that measures the maximum factor by which the cost of a first-stage action increases in the second stage. We consider the cases where $L$ is the $L_\infty$ metric, $L_\infty(p,q) = \max_{A\in\mathcal{A}} |p_A - q_A|$; the $L_1$ metric, $L_1(p,q) = \sum_{A\in\mathcal{A}} |p_A - q_A|$, which is (twice) the total-variation distance; or a Wasserstein metric.

To motivate and define the rich class of Wasserstein metrics, note that while the choice of $\mathcal{D}$ is a problem-dependent modeling decision, we would like the ball to contain other “reasonably similar” distributions, and exclude completely unrelated distributions, as the latter could lead to overly conservative decisions, à la robust optimization. One way of measuring the similarity between two distributions is to see if they spread their probability mass on “similar” scenarios. Wasserstein metrics capture this viewpoint crisply, and lift an underlying scenario metric $\ell$ to a metric on distributions over scenarios. The Wasserstein distance between two distributions $p$ and $q$ is the minimal cost of moving probability mass to transform $p$ into $q$, where the cost of moving mass from scenario $A$ to scenario $A'$ is $\ell(A,A')$. (Observe that, up to a factor of 2, the $L_1$ metric is the Wasserstein metric with respect to the discrete scenario metric: $\ell(A,A') = 1$ if $A\neq A'$, and $0$ otherwise.)

Example: DR 2-stage facility location. As a concrete example, consider the DR version of 2-stage facility location. We have a metric space on $F\cup D$, where $F$ is a set of facilities and $D$ is a set of clients, with assignment costs $\{c_{ij}\}$. A scenario $A$ is a subset of $D$ indicating the set of clients that need to be served in that scenario. (We can model integer demands by creating co-located clients.) We may open a facility $i$ in stages I or II, incurring costs of $f_i^{\mathrm{I}}$ and $f_i^{\mathrm{II}}$ respectively. In scenario $A$, we need to assign every $j\in A$ to a facility opened in stage I or in scenario $A$; the second-stage cost of scenario $A$ is the total cost of the facilities opened in scenario $A$ plus the total client-assignment cost. The goal is to minimize the first-stage opening cost plus the maximum, over $q\in\mathcal{D}$, of the expected second-stage cost. Here $X = \{0,1\}^F$, and the input size is the encoding size of the facility-location instance.

We consider two common choices for $\mathcal{A}$: (a) the unrestricted setting: $\mathcal{A} = 2^D$, which is the usual setting in 2-stage stochastic optimization; and (b) the $k$-bounded setting: $\mathcal{A} = \{A\subseteq D : |A|\le k\}$, which is the usual setup in 2-stage robust optimization for modeling an exponential number of scenarios [11, 23, 17]. These two settings for $\mathcal{A}$ arise for other problems as well (with $D$ replaced by a suitable ground set).

In addition to $L$ being the $L_\infty$ or $L_1$ metric, we can consider various ways of defining a scenario metric $\ell$ in terms of the underlying assignment-cost metric, to capture that two scenarios involving demand locations in the same vicinity are deemed similar; lifting these scenario metrics to the Wasserstein metric over distributions yields a rich class of DR 2-stage facility location models. For instance, we can define the asymmetric metric $\ell(A,A') = \max_{j'\in A'} \min_{j\in A} c_{jj'}$, which measures the maximum separation between clients in $A'$ and locations in $A$ (the resulting Wasserstein metric will now be an asymmetric metric on distributions). There are other natural scenario metrics, such as the analogous asymmetric sum-metric, and the symmetrizations of these asymmetric metrics.

##### Our results.

Our main result pertains to Wasserstein metrics, which have a great deal of modeling power. Let $L$ be the Wasserstein metric with respect to a scenario metric $\ell$. To gain mathematical traction, it will be convenient to move to a relaxation of the DR 2-stage problem where we allow fractional second-stage decisions. Let $g(x,A)$ be the optimal second-stage cost of scenario $A$ given $x$ as the first-stage actions when we allow fractional second-stage actions. (We will obtain integral second-stage actions by rounding an optimal fractional solution using an LP-relative $\rho$-approximation algorithm for the deterministic problem.)

We relate the approximability of the DR problem to that of known tasks in 2-stage stochastic and deterministic optimization, and to the following deterministic problem:

$$g(x,y,A) \;:=\; \max_{A'\in\mathcal{A}}\; g(x,A') - y\cdot\ell(A,A'), \qquad \text{given a first-stage decision } x\in X,\ \text{scenario } A\in\mathcal{A},\ \text{and } y\ge 0.$$

Notice that $g(x,y,A)$ ties together three distinct sources of complexity in the DR 2-stage problem: the combinatorial complexity of the underlying optimization problem, captured by $g(x,A')$; the complexity of the scenario set $\mathcal{A}$; and the complexity of the scenario metric $\ell$, captured by the $y\cdot\ell(A,A')$ term.

###### Theorem 1 (Combination of Theorems 3.5 and 3.7).

Suppose that we have the following.

1. An approximation algorithm for computing $g(x,y,A)$: an algorithm that, given $(x,y,A)$, returns a scenario $A'\in\mathcal{A}$ for which the mixed-sign objective $g(x,A') - y\cdot\ell(A,A')$ approximates $g(x,y,A)$ in a suitable non-standard sense (made precise in Section 3);

2. A local $\alpha$-approximation algorithm for the underlying 2-stage problem, which is an algorithm that rounds a fractional first-stage solution to an integral one while incurring at most an $\alpha$-factor blowup in the first-stage cost, and in the cost of each scenario; and

3. An LP-relative $\rho$-approximation algorithm for the underlying deterministic problem.

Then, for any $\varepsilon>0$, we can obtain an approximation algorithm for the DR problem whose guarantee is an $O(1)$-factor combination of the above guarantees, running in time polynomial in the input size, $\lambda$, and $1/\varepsilon$.

Ingredients 2 and 3 can be obtained using known results for 2-stage stochastic and deterministic optimization; ingredient 1 is the new component we need to supply to instantiate Theorem 1 and obtain results for specific DR 2-stage problems. (The non-standard notion of approximation for $g(x,y,A)$ is necessary, as the mixed-sign objective precludes any guarantee under the standard notion of approximation; see Theorem 3.12.) In various settings, we show that an approximation for $g(x,y,A)$ can be obtained by utilizing results for the simpler max-min problem—$\max_{A'\in\mathcal{A}} g(x,A')$ (i.e., $y = 0$)—encountered in 2-stage robust optimization (see the proof of Theorem 3.14 in Section 3.3.6): in the $k$-bounded setting, where $\mathcal{A} = \{A : |A|\le k\}$, this is called the $k$-max-min problem [11, 23, 17]. In particular, this applies to the $L_1$ metric, since in this case the underlying scenario metric is the discrete metric.

###### Corollary 1.

Consider a DR 2-stage problem where the Wasserstein metric $L$ is the $L_1$ metric. Suppose that we have an approximation algorithm for the max-min problem $\max_{A\in\mathcal{A}} g(x,A)$ (with $x$ given as input), and we have ingredients 2 and 3 in Theorem 1. Then we can obtain a corresponding approximation algorithm for the DR problem, with running time as in Theorem 1.

Theorem 1 (to a partial extent) and Corollary 1 thus provide novel, useful reductions from DR 2-stage optimization to 2-stage {stochastic, robust} (and deterministic) optimization. (For instance, [15] devise approximations for the max-min problem in Corollary 1 for scenario sets defined by matroid-independence and/or knapsack constraints; Corollary 1 enables us to export these guarantees to the corresponding DR 2-stage problem with the $L_1$ metric.) In some cases, we can improve upon the guarantees in Theorem 1. For certain covering problems, [34] showed how to obtain the required ingredients via a decoupling idea; by incorporating this idea within our reduction, we can improve the guarantee in Theorem 1 and obtain a better approximation factor (see “Set cover” in Section 3.3).

We demonstrate the versatility of our framework by applying Theorem 1 and these refinements to obtain guarantees for the DR versions of set cover, vertex cover, edge cover, facility location, and Steiner tree (Section 3.3). These constitute the majority of problems investigated for 2-stage optimization. Our strongest results are for facility location, vertex cover, and edge cover; for Steiner tree, we obtain results in the unrestricted setting. Table 1 summarizes these results.

##### Technical takeaways for DR problems with Wasserstein metrics.

The reduction in Theorem 1 is obtained by supplementing tools from 2-stage {stochastic, robust} optimization with various additional ideas. Its proof consists of two main components, both of which are of independent interest.

Sample average approximation (SAA) for DR problems.

In Section 3.1, we prove that a simple and appealing approach in stochastic optimization, called the SAA method, can be applied to reduce the DR problem to the setting where the central distribution has polynomial-size support. In the SAA method, we draw some samples to estimate $\mathring{p}$ by its empirical distribution, and solve the distributionally robust problem for this empirical central distribution. We show that (roughly speaking) by taking polynomially many samples, we can ensure that an approximate oracle for the SAA objective value can be combined with an approximation algorithm for the SAA problem to obtain a comparable approximate solution to the original problem, with high probability (see Theorem 3.5). It is well known that a polynomial dependence of the sample size on the inflation parameter $\lambda$ is needed even for (standard) 2-stage stochastic problems in the black-box model [34]. Our SAA result substantially expands the scope of problems for which the SAA method is known to be effective (with polynomial sample size). Previously, such results were known for the special case of 2-stage stochastic problems [4, 37] (see also [24]), and for multi-stage stochastic problems with a constant number of stages [37].
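Concretely, the SAA reduction begins by replacing the black-box distribution with its empirical counterpart, along these lines (the black box and its scenario encoding below are hypothetical stand-ins):

```python
import random
from collections import Counter

def empirical_distribution(sample_black_box, num_samples):
    """SAA sketch: estimate the central distribution by the empirical
    distribution of i.i.d. black-box samples; the DR problem is then
    solved with this empirical distribution in place of the true one.
    Scenarios are assumed hashable (e.g., frozensets of clients)."""
    counts = Counter(sample_black_box() for _ in range(num_samples))
    return {scenario: c / num_samples for scenario, c in counts.items()}

# Illustrative black box (hypothetical): each of 3 clients shows up independently.
def sample_black_box():
    return frozenset(j for j in range(3) if random.random() < 0.5)

hat_p = empirical_distribution(sample_black_box, 2000)
# hat_p has support of size at most num_samples (here, at most 2^3 = 8).
```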

Proving our SAA result requires augmenting the SAA machinery for 2-stage stochastic problems [4, 37] with various new ingredients to deal with the challenges presented by DR problems. We elaborate in Section 3.1.

Solving the polynomial-size central-distribution case.

Complementing the above SAA result, we show how to approximately solve the DR 2-stage problem with a polynomial-size central distribution (Section 3.2). It is natural to move to a fractional relaxation of the problem, by replacing the first-stage set $X$ by a suitable polytope $P\supseteq X$. In stark contrast with 2-stage {stochastic, robust} optimization, where the fractional relaxation of the polynomial-scenario problem immediately gives a polynomial-size LP and is therefore straightforward to solve in polytime, it is substantially more challenging to even approximately solve the fractional DR problem with a polynomial-size central distribution. In fact, this is perhaps the technically more challenging part of the paper. The crux of the problem is that, while $\mathring{p}$ has polynomial-size support, there are (numerous) distributions in $\mathcal{D}$ that have exponential-size support, and one needs to optimize over such distributions. In particular, if we use duality to reformulate the problem as a minimization LP, this leads to an LP with an exponential number of both constraints and variables (see the discussion in Section 3.2). Thus, while we started with a polynomial-support central distribution, we have ended up in a situation similar to that in 2-stage stochastic or robust optimization with an exponential number of scenarios!

To surmount these obstacles, we work with the convex program $\min_{x\in P} h(\mathring{p};x)$, and solve this approximately by leveraging the ellipsoid-based machinery in [34] (see Theorem 3.7). Not surprisingly, this poses various fresh difficulties, chiefly because we are unable to compute approximate subgradients as required by [34]. We delve into these issues, and the ideas needed to overcome them, in Section 3.2.

##### Approximating g(x,y,A).

We use the following natural strategy: “guess” the value of $\ell(A,A')$ for the optimal scenario $A'$, possibly within a $(1+\varepsilon)$-factor, and solve the constrained problem (C): maximize $g(x,A')$ over scenarios $A'\in\mathcal{A}$ with $\ell(A,A')$ at most the guessed value. It is easy to show that an approximation to (C) yields a corresponding approximation for $g(x,y,A)$ (Lemma 3.25). In the unrestricted setting, we will usually be able to solve (C) exactly, exploiting the fact that our problems are covering problems. In the $k$-bounded setting, we cast (C) as a $k$-max-min problem (note that $\ell(A,A')$ is integral in this setting), and utilize known results for this problem.
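The “guess within a $(1+\varepsilon)$-factor” step is the standard geometric-grid enumeration; a generic sketch follows, where `solve_constrained` is a hypothetical stand-in for an oracle that solves the constrained problem under a given distance bound:

```python
def guess_and_solve(solve_constrained, lo, hi, eps):
    """Enumerate guesses lo, lo*(1+eps), lo*(1+eps)^2, ... up to hi for an
    unknown optimal parameter (here, the metric cost of the optimal
    scenario), solve the constrained problem for each guess, and return
    the best objective value found.  `solve_constrained(bound)` is an
    assumed oracle returning the objective value under that bound."""
    best = float('-inf')
    guess = lo
    while guess <= hi * (1 + eps):
        best = max(best, solve_constrained(guess))
        guess *= (1 + eps)
    return best
```

The number of oracle calls is $O(\log_{1+\varepsilon}(hi/lo))$, i.e., polynomial in the bit-size of the range and $1/\varepsilon$.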

For facility location, the result by [23] requires creating co-located clients, which does not work for us. We illuminate a novel connection between cost-sharing schemes and $k$-max-min problems by showing that a cost-sharing scheme for FL having certain properties can be leveraged to obtain an approximation algorithm for $k$-max-min {integral, fractional} FL (see the proof of Theorem 3.20). In doing so, we also end up improving the approximation factor for $k$-max-min FL over that of [23]. Whereas cost-sharing schemes have played a role in 2-stage stochastic optimization, in the context of the boosted-sampling approach of [18], they have not been used previously for $k$-max-min problems. (The approach in [17] has some similar elements, but there is no explicit use of cost shares.) Cost-sharing schemes offer a useful tool for designing algorithms for $k$-max-min problems, one that we believe will find further application.

##### DR problems with the L∞ metric.

For the $L_\infty$ metric (Section 4), we directly consider the fractional relaxation of the problem. As with the Wasserstein metric, even for a polynomial-scenario central distribution, solving the resulting problem is quite challenging, since it (again) leads to an LP with exponentially many variables and constraints. We move to a proxy objective that is pointwise close to the true objective, and show that an approximate subgradient of the proxy objective can be computed efficiently at any point, even for exponential-size scenario collections. This enables us to use the algorithm in [34] to solve the fractional problem; rounding this solution using a local approximation algorithm yields results for the DR discrete 2-stage problem. Table 1 lists the results we obtain for the $L_\infty$ metric as well.

### 1.2 Related work

Stochastic optimization is a field with a vast amount of literature (see, e.g., [3, 30, 32]), but its study from an approximation-algorithms perspective is relatively recent. Various approximation results have been obtained in the 2-stage recourse model over the last 15 years in the CS and Operations-Research (OR) literature (see, e.g., [36]), but more general models, such as distributionally robust stochastic optimization, have received little or no attention in this regard.

To the best of our knowledge, with the exception of [1], which we discuss below, there are no prior approximation algorithms for distributionally robust 2-stage discrete-optimization problems when the number of possible scenarios is (finite, but) exponentially large (even if the central distribution has polynomial-size support). Much of the work in the stochastic-optimization and OR literature on these problems has focused on proving suitable duality results that sometimes allow one to reformulate the DR problem more compactly. Moreover, in many cases, the results obtained are for continuous scenario spaces and under other assumptions about the recourse costs. For instance, [9, 13, 40, 20] all consider the setting where $\mathcal{D}$ is a ball in the Wasserstein metric, and provide a closed-form description of the worst-case distribution in $\mathcal{D}$, which is then used to reformulate the DR problem under further convexity assumptions on the scenario collection. DR problems have gained attention in recent years due to their usefulness in inferring decisions from observed data while avoiding the risk of overfitting: here $\mathcal{D}$ is used to model a class of distributions from which the observed data could arise (with high confidence). Various works have advocated the use of a Wasserstein ball around the empirical distribution for this purpose [9, 40, 13, 28], but there are no results proving polynomial bounds on the number of samples needed in order to produce provably good results. Note that these works, by definition, consider the setting where the central distribution has polynomial-size support. The distributionally robust setting has also been considered for chance-constrained problems; see, e.g., [8] and the references therein.

The work of [1] in the CS literature on correlation gap can be interpreted as studying distributionally robust discrete-optimization problems, but in a very different setting where $\mathcal{D}$ is not a ball. Instead, $\mathcal{D}$ is the collection of distributions that agree with some given expected values; the correlation gap quantifies the worst-case ratio of the DR objective when one chooses the optimal decisions with respect to the distribution in $\mathcal{D}$ that treats all random variables as independent, versus the optimum of the DR problem. Agrawal et al. [1] proved various bounds on the correlation gap for submodular functions and subadditive functions admitting suitable cost shares. Various other works (see, e.g., [5, 29] and the references therein) have considered such moment-based collections, but again under continuity and/or convexity assumptions about the scenario space and/or recourse costs.

We now briefly survey the work on approximation algorithms under the stochastic- and robust- optimization models, which the DR model generalizes. As noted above, various approximation results have been obtained for 2-stage, and even multistage problems. In the black-box model, a common approach is the SAA method, which simply consists of solving the stochastic-optimization problem for the empirical distribution obtained by sampling. The effectiveness of this method has been analyzed both for 2-stage stochastic problems [24, 4, 37] and multi-stage stochastic problems [37]. The sample-complexity bound in [24] is a non-polynomial bound for general 2-stage stochastic problems, whereas [4, 37] both obtain bounds for structured problems. The proof in [37] applies also to structured multistage linear programs, and [4] show that even approximate solutions to the 2-stage SAA problem translate to approximate solutions to the original 2-stage problem. We build upon the SAA machinery of Charikar et al. [4]. Previously, Shmoys and Swamy [34] showed how to use the ellipsoid method to solve structured 2-stage linear programs in the black-box model, and how to round the resulting fractional solution. We utilize their machinery based on approximate subgradients to solve the polynomial-scenario central-distribution setting. Approximation algorithms for 2-stage problems have also been developed via combinatorial means. The prominent technique here is the boosted sampling technique of Gupta et al. [18]; the survey [36] gives a detailed description of these and other approximation results for 2-stage optimization.

Two-stage robust optimization, where uncertainty is reflected in the constraints and not the data, was proposed in [6], who devised approximation algorithms for various problems in the polynomial-scenario setting. Notice that it is not clear how to even specify problems with exponentially many scenarios in the robust model. Feige et al. [11] expanded the model of [6] by considering what we call the $k$-bounded setting, where every subset of at most $k$ elements is a scenario. Subsequently, [23] and [17] expanded the collection of results known for 2-stage robust problems in the $k$-bounded setting. We utilize results for the closely related $k$-max-min problem encountered in this setting in our work.

We briefly discuss a few other snippets that consider intermediary approaches between stochastic and robust optimization. Swamy [38] considers a model for risk-averse 2-stage stochastic optimization that interpolates between the stochastic and robust optimization approaches. In the context of online algorithms, Mirrokni et al. [26] and Esfandiari et al. [10] give online algorithms for allocation problems that are simultaneously competitive both in a random input model and in an adversarial input model. Finally, we note that our distributionally robust setting can be seen to be in a similar spirit as a recent focus in algorithmic mechanism design, where one does not assume precise knowledge of the underlying distribution; rather one (implicitly) has a collection of distributions, and one seeks to design mechanisms that work for every distribution in this collection; see, e.g., [21].

## 2 Problem definitions, and our general class of DR 2-stage problems

Recall that we consider settings where we have a ball $\mathcal{D}$ of distributions (over the scenario-collection $\mathcal{A}$) around a central distribution $\mathring{p}$ under some metric $L$ on distributions, and we seek to minimize the maximum expected cost with respect to a distribution in $\mathcal{D}$. As mentioned earlier, we make no assumptions about $\mathring{p}$, and only require the ability to draw samples from $\mathring{p}$. The metrics that we consider for $L$ are the $L_\infty$ metric, the $L_1$ metric, and Wasserstein metrics. We now define Wasserstein metrics precisely.

###### Definition 2.1 (Wasserstein (a.k.a. transportation or earth-mover) distance).

The Wasserstein distance between two probability distributions $p$ and $q$ over $\mathcal{A}$ is defined with respect to an underlying metric $\ell$ on $\mathcal{A}$. A transportation plan or flow from $p$ to $q$ is a vector $z = (z_{A,A'})_{A,A'\in\mathcal{A}} \geq 0$ such that: (i) $\sum_{A'\in\mathcal{A}} z_{A,A'} = p_A$ for all $A\in\mathcal{A}$; and (ii) $\sum_{A\in\mathcal{A}} z_{A,A'} = q_{A'}$ for all $A'\in\mathcal{A}$. The Wasserstein distance between $p$ and $q$, denoted $L_W(p,q)$, is the minimum value of $\sum_{A,A'\in\mathcal{A}} \ell(A,A')\, z_{A,A'}$ over all transportation plans $z$ from $p$ to $q$.

If $\ell$ is an asymmetric metric, then $L_W$ is an asymmetric metric; if $\ell$ is a pseudometric—i.e., it satisfies the triangle inequality but $\ell(A,A')$ could be $0$ for $A\neq A'$—then so is $L_W$.
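The transportation LP in Definition 2.1 is easiest to see in the special case of distributions supported on points of the real line with $\ell(a,b) = |a-b|$, where the optimal plan is given by the classical CDF formula. The sketch below illustrates only this special case, not the paper's scenario metrics:

```python
def wasserstein_1d(points, p, q):
    """Wasserstein (earth-mover) distance for the special case of two
    distributions on sorted real points with ell(a, b) = |a - b|, where
    the optimal transportation plan matches CDFs:
    W(p, q) = integral of |CDF_p - CDF_q|."""
    dist, cdf_gap = 0.0, 0.0
    for i in range(len(points) - 1):
        cdf_gap += p[i] - q[i]                    # signed mass that must cross this gap
        dist += abs(cdf_gap) * (points[i + 1] - points[i])
    return dist
```

For example, moving all mass from point 0 to point 1 costs exactly 1, matching the intuition that the distance is the total mass moved times the distance moved.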

In Section 3.3, we consider the DR versions of set cover (and some special cases), facility location, and Steiner tree. DR 2-stage facility location was defined in Section 1.1; we define the remaining problems below, and then discuss the general class of DR 2-stage problems to which our framework applies. Recall that $I$ denotes the input size.

1. DR 2-stage set cover. We have a collection $\mathcal{S}$ of subsets of a ground set $U$. A scenario $A$ is a subset of $U$ and specifies the set of elements to be covered in that scenario. We may buy a set $S\in\mathcal{S}$ in either stage, incurring costs of $c_S^{\mathrm{I}}$ and $c_S^{\mathrm{II}}$ in stages I and II respectively. The sets chosen in stage I and in each scenario $A$ must together cover $A$. The goal is to choose some first-stage sets and sets in each scenario so as to minimize the first-stage cost plus the maximum, over $q\in\mathcal{D}$, of the expected second-stage cost.

We have $X = \{0,1\}^{\mathcal{S}}$, and the input size is the encoding size of the set system and the costs. We consider the unrestricted ($\mathcal{A} = 2^U$) and $k$-bounded ($\mathcal{A} = \{A\subseteq U : |A|\le k\}$) settings. Different scenarios could be quite unrelated, so there does not seem to be a natural choice for a (non-discrete) scenario metric; we therefore consider (balls in) the $L_\infty$ or $L_1$ metrics.

2. DR 2-stage Steiner tree. We have a complete graph $G = (V,E)$ with metric edge costs $\{c_e\}$, root $r\in V$, and inflation factor $\lambda$. A scenario $A$ is a subset of nodes (called terminals) specifying the nodes that need to be connected to $r$. We may buy an edge $e$ in stages I or II, incurring costs $c_e$ or $\lambda c_e$ respectively. The union of the edges bought in stage I and those bought in scenario $A$ must connect all nodes in $A$ to $r$, and we want to minimize the first-stage cost plus the maximum, over $q\in\mathcal{D}$, of the expected second-stage cost. (With non-uniform inflation factors for different edges, even 2-stage stochastic Steiner tree becomes at least as hard as group Steiner tree [31].)

Here the input size is the encoding size of the graph, the edge costs, and $\lambda$. We obtain results in the unrestricted setting, and leave the $k$-bounded setting for future work. As with facility location, in addition to the $L_\infty$ and $L_1$ metrics, we can consider scenario metrics defined using the edge costs and the resulting Wasserstein metrics.

##### A general class of DR 2-stage problems.

Abstracting away the key properties of DRSC, DRFL, and DRST, we now define the generic DR 2-stage problem that we consider. As before, X denotes the finite first-stage action set of the discrete problem. It will be convenient to consider the natural fractional relaxation of the DR problem obtained by enlarging the discrete second-stage action sets and X to suitable polytopes. Recall that g(x,A) is the optimal second-stage cost of scenario A given x as the first-stage decision, when we allow fractional second-stage actions. Let P ⊇ X denote the polytope specifying the fractional first-stage decisions, with P ⊆ ℝ^m_{≥0}. (For example, for DRSC, g(x,A) is the optimal value of a set-cover LP for covering A where we may buy sets fractionally in the second stage, and P = [0,1]^S.) One benefit of moving to the fractional relaxation is that, for every scenario A, g(x,A) is a convex function of x, whose value and subgradient can be exactly computed.

###### Definition 2.2.

Let h: ℝ^m → ℝ be a function. We say that d ∈ ℝ^m is a subgradient of h at u if we have h(v) − h(u) ≥ d⊺(v − u) for all v ∈ ℝ^m. Given ω > 0 and a convex set D ⊆ ℝ^m, we say that d̂ is an (ω,D)-subgradient of h at the point u ∈ D if for every v ∈ D, we have h(v) − h(u) ≥ d̂⊺(v − u) − ω·h(u). We abbreviate (ω,P)-subgradient to ω-subgradient.
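The subgradient inequality is easy to check numerically. The sketch below uses a hypothetical convex function (a pointwise maximum of affine pieces, the same shape g(·,A) takes when it is an LP value): the gradient of any piece attaining the maximum at u is a valid subgradient at u.

```python
import random

# Hypothetical convex h(u) = max_i (a_i . u + b_i), pieces ((a1, a2), b).
PIECES = [((1.0, 2.0), 0.5), ((-1.0, 1.0), 1.0), ((0.5, -0.5), 2.0)]

def h(u):
    return max(a[0] * u[0] + a[1] * u[1] + b for a, b in PIECES)

def subgrad(u):
    """Gradient of a piece attaining the max at u: a valid subgradient."""
    return max(PIECES, key=lambda ab: ab[0][0] * u[0] + ab[0][1] * u[1] + ab[1])[0]

rng = random.Random(0)
u = (0.3, -0.2)
d = subgrad(u)
for _ in range(200):
    v = (rng.uniform(-2, 2), rng.uniform(-2, 2))
    # Subgradient inequality: h(v) - h(u) >= d . (v - u).
    assert h(v) - h(u) >= d[0] * (v[0] - u[0]) + d[1] * (v[1] - u[1]) - 1e-12
```

An (ω,D)-subgradient only needs this inequality up to the slack term ω·h(u), so any exact subgradient is in particular an ω-subgradient whenever h(u) ≥ 0.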

Following [4, 34, 37], we consider the following generic DR 2-stage problem (Q_˚p) with discrete first-stage set X, and its (further) fractional relaxation (Q^fr_˚p), and require that they satisfy properties (A1), (A2), (B1)–(B3), and (C1) listed below. Let ∥u∥ denote the ℓ2-norm of u.

 min_{x∈X}  h(˚p;x) := c⊺x + max_{q: L(˚p,q) ≤ r} E_{A∼q}[g(x,A)]        (Q_˚p)
 min_{x∈P}  h(˚p;x)        (Q^fr_˚p)

In proving their SAA result for 2-stage stochastic problems, [4] define properties (A1), (A2) below to capture the fact that every first-stage action has a corresponding recourse action that is more expensive by at most a bounded factor, and hence, it is always feasible to not take any first-stage actions.

1. (A1): 0 ∈ X ⊆ P ⊆ ℝ^m_{≥0}, c ≥ 0, and g(x,A) ≥ 0 for all x ∈ P and A ∈ A.

2. (A2): We know an inflation parameter λ ≥ 1 such that g(0,A) ≤ g(x,A) + λ·c⊺x for all x ∈ P and A ∈ A.

Since we apply the ellipsoid-based machinery in [34] to solve the fractional problem with a polynomial-size central distribution, we need bounds on the feasible region in terms of enclosing and enclosed balls; this is captured by (B1), which is directly lifted from [34]. Note that the vast majority of 2-stage problems (including DRSC, DRFL, DRST) involve {0,1} decisions, with P ⊆ [0,1]^m and so R ≤ √m, so (B1) is readily satisfied. As in [34], we need to be able to compute the value and a subgradient of the recourse cost g(x,A), which is a benign requirement since g(x,A) is the optimal value of a polytime-solvable LP in all our applications. Whereas [34] define a syntactic class of 2-stage stochastic LPs and show (implicitly) that they satisfy this requirement, we explicitly isolate this requirement in (B2), (B3).

1. (B1): We have positive bounds R and V such that P ⊆ B(0,R) and P contains a ball of radius V, where ln(R/V) is polynomially bounded in I.

2. (B2): For every A ∈ A, g(x,A) is convex over P, and g(x,A) can be efficiently computed for every x ∈ P.

3. (B3): For every A ∈ A and x ∈ P, we can efficiently compute a subgradient d of g(·,A) at x with ∥d∥ ≤ K, where ln K is polynomially bounded in I. Hence, the Lipschitz constant of g(·,A) is at most K (due to Definition 2.2).

Finally, we need the following additional mild condition.

1. (C1): When L is the Wasserstein metric with respect to a scenario metric ℓ, we know ℓ_max with ln ℓ_max polynomially bounded in I such that ℓ(A,A′) ≤ ℓ_max for all A ∈ A and all A′ ∈ A with ℓ(A,A′) < ∞.

As noted above, (A1), (A2), (B1)–(B3) are gathered from [4, 34], and hold for all the 2-stage problems considered in the CS literature (see [37, 6, 11, 23, 17]); (C1) is a new requirement, but is also rather mild and holds for all the problems we consider. (A1), (A2) and (C1) are used to prove that SAA works for the DR problem under the Wasserstein metric (Section 3.1). (B1)–(B3) pertain to the fractional relaxation, and are utilized to show that one can efficiently solve the SAA problem approximately (Section 3.2).

A solution x to (Q^fr_˚p) needs to be rounded to yield integral second-stage actions: any LP-relative ρ-approximation algorithm for the deterministic version of the problem can be used to obtain recourse actions for each scenario A having cost at most ρ·g(x,A). To round a fractional solution to (Q^fr_˚p), we utilize a local approximation algorithm for the 2-stage problem: we say that an algorithm is a local ρ-approximation algorithm for (Q_˚p) if, given any x ∈ P, it returns an integral solution x̃ ∈ X and implicitly specifies integral recourse actions for every scenario A, such that c⊺x̃ ≤ ρ·c⊺x and the cost of the recourse actions for A is at most ρ·g(x,A) for all A ∈ A. An α-approximate solution to (Q^fr_˚p) combined with a local ρ-approximation algorithm yields an αρ-approximate solution to the discrete DR 2-stage problem. Local approximation algorithms exist for various 2-stage problems—e.g., set cover, vertex cover, facility location [34]—with approximation factors that are comparable to the approximation factors known for their deterministic counterparts.

## 3 Distributionally robust problems under the Wasserstein metric

We now focus on the DR 2-stage problem (Q_˚p) when L is the Wasserstein metric ℓ_W with respect to a metric ℓ on scenarios. Plugging in the definition of ℓ_W (with respect to scenario metric ℓ), we can rewrite (Q_˚p) as follows.

 min_{x∈X}  h(˚p;x) := c⊺x + z(˚p;x),   where  z(˚p;x) :=        (Q_˚p)
 max   ∑_{A,A′} γ_{A,A′} · g(x,A′)                                (T_{˚p,x})
 s.t.  ∑_{A′} γ_{A,A′} ≤ ˚p_A    ∀A ∈ A                           (1)
       ∑_{A,A′} ℓ(A,A′) · γ_{A,A′} ≤ r                            (2)
       γ ≥ 0.                                                     (3)
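The inner problem (T_{˚p,x}) is a transportation-style LP. The following sketch, on hypothetical two-scenario data (γ here is just one feasible transport plan, not the optimum), evaluates a plan's objective and checks constraints (1)–(3).

```python
# Hypothetical data: two scenarios with central probabilities p, scenario
# distances ell, and recourse costs gvals = g(x, .) for some fixed x.
p = {"A0": 0.6, "A1": 0.4}
ell = {("A0", "A0"): 0.0, ("A0", "A1"): 2.0,
       ("A1", "A0"): 2.0, ("A1", "A1"): 0.0}
gvals = {"A0": 1.0, "A1": 5.0}
r = 0.5  # Wasserstein-ball radius

# A transport plan gamma[(A, A')]: move 0.25 mass of A0 onto costlier A1.
gamma = {("A0", "A0"): 0.35, ("A0", "A1"): 0.25, ("A1", "A1"): 0.4}

def feasible(gamma):
    rowsum = {A: 0.0 for A in p}
    for (A, A2), m in gamma.items():
        if m < 0:                                   # constraint (3)
            return False
        rowsum[A] += m
    if any(rowsum[A] > p[A] + 1e-12 for A in p):    # constraint (1)
        return False
    return sum(ell[e] * m for e, m in gamma.items()) <= r + 1e-12  # (2)

def objective(gamma):
    return sum(m * gvals[A2] for (A, A2), m in gamma.items())
```

This feasible plan already certifies z(˚p;x) ≥ 3.6, strictly above E_{A∼˚p}[g(x,A)] = 2.6: the adversary exploits the transport budget r to shift mass onto the costlier scenario.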

Let OPT denote the optimal value of (Q_˚p). We note that a naive, simplistic approach that ignores the uncertainty in the underlying distribution, and only considers the central distribution ˚p, yields (expectedly) poor bounds. Suppose x̂ is an α-approximate solution for the 2-stage problem min_{x∈X} (c⊺x + E_{A∼˚p}[g(x,A)]), whose optimal value is at most OPT. Using (A2), one can show that h(˚p;x̂) = O(αλ)·OPT, but this is too weak a guarantee since λ could be quite large compared to the approximation factors we seek.

In Section 3.1, we work with (Q_˚p) and show that the SAA approach can be used to reduce to the case where the central distribution has polynomial-size support. In Section 3.2, we show how to approximately solve the polynomial-size-support case by applying the ellipsoid method to its (further) relaxation (Q^fr_˚p), where we replace X with P. Here, we utilize a local approximation algorithm to move from P back to X, and thereby interface with, and complement, the SAA result for (Q_˚p) proved in Section 3.1. This result applies more generally, even when ℓ is not a metric; we only require that ℓ(A,A) = 0 for all A ∈ A. (If ℓ is not a metric, the Wasserstein distance with respect to ℓ need not yield a metric on distributions.)

In Section 3.3, we consider various combinatorial-optimization problems, and utilize the above results in conjunction to obtain the first approximation results for the DR versions of these problems.

### 3.1 A sample-average-approximation (SAA) result for distributionally robust problems

The SAA approach is the following simple, intuitive idea: draw some samples from ˚p, estimate ˚p by the empirical distribution p̂ induced by these samples, and solve the SAA problem (Q_p̂). We prove the following SAA result. If we construct polynomially many SAA problems, each using polynomially many independent samples, and if we have an approximation algorithm for computing the objective value of the SAA problem at any given point, then we can utilize α-approximate solutions to these SAA problems to obtain, with high probability, a solution to (Q_˚p) whose approximation guarantee is close to α; Theorem 3.5 gives the precise statement.
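The sampling step itself is plain Monte Carlo estimation of ˚p. A minimal sketch (the scenario names and probabilities below are hypothetical stand-ins for the black-box distribution):

```python
import random
from collections import Counter

# Hypothetical central distribution, accessed only via sampling.
TRUE_P = {"A0": 0.7, "A1": 0.2, "A2": 0.1}

def sample_scenarios(n, seed=0):
    """Draw n independent scenarios from the black box."""
    rng = random.Random(seed)
    names, probs = zip(*TRUE_P.items())
    return rng.choices(names, weights=probs, k=n)

def empirical(samples):
    """Empirical distribution p-hat induced by the samples."""
    n = len(samples)
    return {a: cnt / n for a, cnt in Counter(samples).items()}

p_hat = empirical(sample_scenarios(4000))
```

The point of the SAA reduction is that p_hat has support of size at most the number of samples, so the resulting DR problem has a polynomial-size central distribution.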

The proof has several ingredients. There are two main approaches [4, 37] for showing that the SAA method with a polynomial number of samples works for stochastic-optimization problems. Charikar et al. [4] prove the following SAA result for 2-stage problems.

###### Theorem 3.1 ([4]).

Consider a 2-stage problem (2St-P): min_{x∈X} (c⊺x + E_{A∼˚p}[g(x,A)]), with scenario set A, where X and g satisfy (A1), (A2) with inflation parameter λ. With probability at least 1 − δ, any optimal solution to the SAA problem constructed using poly(I, λ, 1/ε, ln(1/δ)) samples is a (1+ε)-approximate solution to (2St-P). More generally, there is a way of using an α-approximation algorithm for the SAA problem, in conjunction with an approximate objective-value oracle for the SAA problem, to obtain an (α + O(ε))-approximate solution to (2St-P) with high probability.

Note that (Q_˚p) is not a standard 2-stage stochastic-optimization problem because constraint (2) couples the various scenarios, which prevents us from applying Theorem 3.1 to (Q_˚p) directly. The SAA result in Swamy and Shmoys [37] applies to the fractional relaxation of the problem, and works whenever the objective functions of the SAA and original problems satisfy a certain “closeness-in-subgradients” property. A subgradient of h(˚p;·) at a point x is obtained from an optimal solution to the inner maximization problem (T_{˚p,x}). This is however an exponential-size object, and utilizing it to prove closeness-in-subgradients seems quite daunting.

Our first insight is that we can decouple the scenarios by Lagrangifying constraint (2) using a dual variable y ≥ 0. By standard duality arguments, this leads to the following reformulation of (Q_˚p).

 min_{x∈X} [ c⊺x + min_{y≥0} ( ry + max { ∑_{A,A′} γ_{A,A′}·(g(x,A′) − y·ℓ(A,A′)) :  γ ≥ 0,  ∑_{A′} γ_{A,A′} ≤ ˚p_A  ∀A ∈ A } ) ],

 where the inner minimum over y equals z(˚p;x); this simplifies to

 min_{x∈X, y≥0}  h(˚p;x,y) := c⊺x + ry + E_{A∼˚p}[ max_{A′∈A} ( g(x,A′) − y·ℓ(A,A′) ) ].        (R_˚p)
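The weak-duality direction behind this reformulation is easy to verify numerically: for every feasible transport plan γ and every y ≥ 0, the (T)-objective is at most ry + E_{A∼˚p}[max_{A′}(g(x,A′) − y·ℓ(A,A′))]. A sketch on hypothetical data (same toy numbers as before; nothing here is paper-specific code):

```python
# Hypothetical instance: probabilities, scenario distances, recourse costs.
p = {"A0": 0.6, "A1": 0.4}
ell = {("A0", "A0"): 0.0, ("A0", "A1"): 2.0,
       ("A1", "A0"): 2.0, ("A1", "A1"): 0.0}
gvals = {"A0": 1.0, "A1": 5.0}
r = 0.5
scen = list(p)

def primal_value(gamma):
    """Objective of (T) for a transport plan gamma."""
    return sum(m * gvals[A2] for (_, A2), m in gamma.items())

def dual_value(y):
    """Lagrangian value r*y + E_p[max_{A'}(g(A') - y*ell(A,A'))]."""
    exp = sum(p[A] * max(gvals[A2] - y * ell[(A, A2)] for A2 in scen)
              for A in scen)
    return r * y + exp

# A feasible plan for (T) on this instance.
gamma = {("A0", "A0"): 0.35, ("A0", "A1"): 0.25, ("A1", "A1"): 0.4}
# Weak duality: every Lagrangian value upper-bounds every feasible plan.
assert all(primal_value(gamma) <= dual_value(y) + 1e-9
           for y in [0.0, 0.5, 1.0, 2.0, 5.0])
```

On this instance, min_y dual_value(y) = 3.6 is attained at y = 2 and matches primal_value(gamma), so this γ is in fact optimal; this is exactly the strong duality used to pass from (T_{˚p,x}) to (R_˚p).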

Recall that ℓ(A,A) = 0 for all A ∈ A. Let g(x,y,A) := max_{A′∈A} ( g(x,A′) − y·ℓ(A,A′) ); note that g(x,y,A) ≥ g(x,A) ≥ 0. The chief benefit of the reformulation (R_˚p) is that we can view (R_˚p) as a 2-stage problem: the first-stage action set is X × ℝ_{≥0}, with first-stage cost c⊺x + ry for action (x,y), and the optimal second-stage cost of scenario A under first-stage actions (x,y) is given by g(x,y,A). This makes it more amenable to utilize the SAA machinery developed for 2-stage problems. We can exploit (A1) to show that we may limit y to a bounded range in (R_˚p), and use (A2) to bound the inflation factor of (R_˚p).

###### Lemma 3.2.

For any x ∈ P, there exists y ≥ 0 such that h(˚p;x) = h(˚p;x,y). Hence, x is an α-approximate solution to (Q_˚p) iff there exists y ≥ 0 such that (x,y) is an α-approximate solution to (R_˚p).

###### Proof.

The second statement is immediate from the first one since (Q_˚p) and (R_˚p) have the same optimal values. So we focus on showing the first statement.

Consider any x ∈ P. The LP (T_{˚p,x}) is feasible (take γ = 0) and bounded, so by LP duality we have min_{y≥0} h(˚p;x,y) = c⊺x + z(˚p;x) = h(˚p;x), and the minimum is attained by some y* ≥ 0; moreover, h(˚p;x,y) ≥ h(˚p;x) for all y ≥ 0. Since ℓ(A,A) = 0, we have g(x,y*,A) ≥ g(x,A) ≥ 0 for every A ∈ A, and hence ry* ≤ h(˚p;x,y*) = h(˚p;x), so y* ≤ h(˚p;x)/r; thus y* lies in a bounded range, completing the proof. ∎

###### Lemma 3.3.

For the 2-stage problem (R_˚p), we can set the inflation parameter λ in Theorem 3.1 to be max{λ, ℓ_max/r}.

###### Proof.

Consider any x ∈ P, y ≥ 0, and A ∈ A. Let A′ ∈ A be such that g(0,0,A) = g(0,A′). Then

 g(0,0,A) − g(x,y,A) ≤ g(0,A′) − (g(x,A′) − y·ℓ(A,A′)) ≤ λ·c⊺x + y·ℓ_max ≤ max{λ, ℓ_max/r}·(c⊺x + ry).

The second inequality above follows from (A2) and (C1). ∎

Given Lemmas 3.2 and 3.3, by suitably discretizing the range of y, one can use Theorem 3.1 to show that: if we construct the SAA problem using a number of samples polynomial in I, max{λ, ℓ_max/r}, 1/ε, and ln(1/δ), and can compute (approximately) the SAA objective value at any given point, then, with high probability, one can translate an α-approximate solution to the SAA problem to an (α + O(ε))-approximate solution to (Q_˚p). But this result does not quite suit our purposes, for various reasons.

The term ℓ_max/r could be rather large and need not be polynomially bounded in I, so this does not yield polynomial sample complexity.⁴ Moreover, it seems difficult to compute the SAA objective value h(p̂;x,y), or even approximate it. This difficulty arises because computing g(x,y,A) encompasses the NP-hard k-max-min problem encountered in 2-stage robust optimization, and furthermore, the mixed-sign objective in the maximization defining g(x,y,A) makes it hard to even approximate (see Theorem 3.12).

⁴The problem persists even if we utilize the closeness-in-subgradients machinery in [37] on the fractional version of (R_˚p): the accuracy to which the relevant quantities must then be estimated again entails a sample complexity that grows with ℓ_max/r.

We need various ideas to circumvent these issues. We show that we can eliminate the dependence on ℓ_max/r altogether, at the expense of a slight deterioration in the approximation ratio when moving from the SAA problem to the original problem. The ℓ_max/r term arises because g(x,y,A) might be attained by a scenario A′ with ℓ(A,A′) large (see the proof of Lemma 3.3). Our crucial second insight is that we can eliminate this and reduce the sample complexity to poly(I, λ, 1/ε, ln(1/δ)) by specifically imposing that we never encounter pairs (A,A′) with ℓ(A,A′) > M, where M := λr; we call such pairs long edges, and the remaining pairs short edges. Any γ satisfying (2) can send at most r/M = 1/λ flow on the long edges. Motivated by this, we “decompose” z(˚p;x) into z^sh(˚p;x) and z^lg(˚p;x), which are (roughly speaking) the contributions from the short and long edges respectively. (This decomposition is akin to the division into low- and high-cost scenarios used by [4] to prove Theorem 3.1, but there are significant technical differences, which complicate things for us, as we discuss below.) We define z^sh and z^lg as follows.

 z^sh(˚p;x) := max { ∑_{A,A′} γ_{A,A′}·g(x,A′) : (1), (2), (3),  γ_{A,A′} = 0 if ℓ(A,A′) > M }
 z^lg(˚p;x) := max { ∑_{A,A′} γ_{A,A′}·g(x,A′) : (1), (2), (3),  ∑_{A,A′} γ_{A,A′} ≤ 1/λ }.
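The 1/λ bound on long-edge flow follows directly from constraint (2): each long edge has ℓ-cost more than M = λr, so at most r/M = 1/λ mass can traverse long edges. A quick sanity check on a hypothetical transport plan:

```python
LAM, R = 5.0, 0.5
M = LAM * R  # long edges are pairs with ell(A, A') > M

# Hypothetical scenario distances and a plan satisfying constraint (2).
ell = {("A0", "A0"): 0.0, ("A0", "A1"): 2.0, ("A0", "A2"): 4.0,
       ("A1", "A1"): 0.0}
gamma = {("A0", "A0"): 0.45, ("A0", "A2"): 0.1, ("A1", "A1"): 0.4}

# Constraint (2): total transport cost is within the budget R.
assert sum(ell[e] * m for e, m in gamma.items()) <= R + 1e-12

long_mass = sum(m for e, m in gamma.items() if ell[e] > M)
# Any plan satisfying (2) sends at most R / M = 1 / LAM mass on long edges.
assert long_mass <= 1.0 / LAM + 1e-12
```

This is exactly why the extra constraint ∑_{A,A′} γ_{A,A′} ≤ 1/λ in the definition of z^lg is harmless for the long-edge contribution.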
###### Lemma 3.4.

For every central distribution p, and every x ∈ P, we have h(p;x) ≤ c⊺x + z^sh(p;x) + z^lg(p;0) ≤ 2·h(p;x).

###### Proof.

We prove this by showing that: (i) the contribution to z(p;x) from the short edges is at most z^sh(p;x); and (ii) the contribution from the long edges is at most z^lg(p;0). Given these bounds, the upper bound h(p;x) ≤ c⊺x + z^sh(p;x) + z^lg(p;0) follows by adding the bounds in parts (i) and (ii). For the other direction, we have

 c⊺x + z^sh(p;x) + z^lg(p;0) ≤ c⊺x + z^sh(p;x) + z^lg(p;x) + c⊺x ≤ 2c⊺x + 2z(p;x) = 2h(p;x)