Data-Driven Distributionally Robust Optimization Using the Wasserstein Metric

# Data-Driven Distributionally Robust Optimization Using the Wasserstein Metric: Performance Guarantees and Tractable Reformulations

Peyman Mohajerin Esfahani and Daniel Kuhn
September 9, 2019
###### Abstract.

We consider stochastic programs where the distribution of the uncertain parameters is only observable through a finite training dataset. Using the Wasserstein metric, we construct a ball in the space of (multivariate and non-discrete) probability distributions centered at the uniform distribution on the training samples, and we seek decisions that perform best in view of the worst-case distribution within this Wasserstein ball. The state-of-the-art methods for solving the resulting distributionally robust optimization problems rely on global optimization techniques, which quickly become computationally excruciating. In this paper we demonstrate that, under mild assumptions, the distributionally robust optimization problems over Wasserstein balls can in fact be reformulated as finite convex programs—in many interesting cases even as tractable linear programs. Leveraging recent measure concentration results, we also show that their solutions enjoy powerful finite-sample performance guarantees. Our theoretical results are exemplified in mean-risk portfolio optimization as well as uncertainty quantification.

The authors are with the Delft Center for Systems and Control, TU Delft, The Netherlands (P.MohajerinEsfahani@tudelft.nl), and the Risk Analytics and Optimization Chair, EPFL, Switzerland (daniel.kuhn@epfl.ch).

## 1. Introduction

Stochastic programming is a powerful modeling paradigm for optimization under uncertainty. The goal of a generic single-stage stochastic program is to find a decision that minimizes an expected cost , where the expectation is taken with respect to the distribution of the continuous random vector . However, classical stochastic programming is challenged by the large-scale decision problems encountered in today’s increasingly interconnected world. First, the distribution is never observable but must be inferred from data. However, if we calibrate a stochastic program to a given dataset and evaluate its optimal decision on a different dataset, then the resulting out-of-sample performance is often disappointing—even if the two datasets are generated from the same distribution. This phenomenon is termed the optimizer’s curse and is reminiscent of overfitting effects in statistics [48]. Second, in order to evaluate the objective function of a stochastic program for a fixed decision , we need to compute a multivariate integral, which is #P-hard even if constitutes the positive part of an affine function, while is uniformly distributed on the unit hypercube [24, Corollary 1].

Distributionally robust optimization is an alternative modeling paradigm, where the objective is to find a decision that minimizes the worst-case expected cost . Here, the worst-case is taken over an ambiguity set , that is, a family of distributions characterized through certain known properties of the unknown data-generating distribution . Distributionally robust optimization problems have been studied since Scarf’s seminal treatise on the ambiguity-averse newsvendor problem in 1958 [43], but the field has gained thrust only with the advent of modern robust optimization techniques in the last decade [3, 9]. Distributionally robust optimization has the following striking benefits. First, adopting a worst-case approach regularizes the optimization problem and thereby mitigates the optimizer’s curse characteristic for stochastic programming. Second, distributionally robust models are often tractable even though the corresponding stochastic model with the true data-generating distribution (which is generically continuous) are -hard. So even if the data-generating distribution was known, the corresponding stochastic program could not be solved efficiently.

The ambiguity set is a key ingredient of any distributionally robust optimization model. A good ambiguity set should be rich enough to contain the true data-generating distribution with high confidence. On the other hand, the ambiguity set should be small enough to exclude pathological distributions, which would incentivize overly conservative decisions. The ambiguity set should also be easy to parameterize from data, and—ideally—it should facilitate a tractable reformulation of the distributionally robust optimization problem as a structured mathematical program that can be solved with off-the-shelf optimization software.

Distributionally robust optimization models where has finitely many realizations are reviewed in [2, 7, 39]. This paper focuses on situations where can have a continuum of realizations. In this setting, the existing literature has studied three types of ambiguity sets. Moment ambiguity sets contain all distributions that satisfy certain moment constraints, see for example [18, 22, 51] or the references therein. An attractive alternative is to define the ambiguity set as a ball in the space of probability distributions by using a probability distance function such as the Prohorov metric [20], the Kullback-Leibler divergence [27, 25], or the Wasserstein metric [38, 52] etc. Such metric-based ambiguity sets contain all distributions that are close to a nominal or most likely distribution with respect to the prescribed probability metric. By adjusting the radius of the ambiguity set, the modeler can thus control the degree of conservatism of the underlying optimization problem. If the radius drops to zero, then the ambiguity set shrinks to a singleton that contains only the nominal distribution, in which case the distributionally robust problem reduces to an ambiguity-free stochastic program. In addition, ambiguity sets can also be defined as confidence regions of goodness-of-fit tests [7].

In this paper we study distributionally robust optimization problems with a Wasserstein ambiguity set centered at the uniform distribution on independent and identically distributed training samples. The Wasserstein distance of two distributions and can be viewed as the minimum transportation cost for moving the probability mass from to , and the Wasserstein ambiguity set contains all (continuous or discrete) distributions that are sufficiently close to the (discrete) empirical distribution with respect to the Wasserstein metric. Modern measure concentration results from statistics guarantee that the unknown data-generating distribution belongs to the Wasserstein ambiguity set around with confidence if its radius is a sublinearly growing function of [11, 21]. The optimal value of the distributionally robust problem thus provides an upper confidence bound on the achievable out-of-sample cost.

While Wasserstein ambiguity sets offer powerful out-of-sample performance guarantees and enable the decision maker to control the model’s conservativeness, moment-based ambiguity sets appear to display better tractability properties. Specifically, there is growing evidence that distributionally robust models with moment ambiguity sets are more tractable than the corresponding stochastic models because the intractable high-dimensional integrals in the objective function are replaced with tractable (generalized) moment problems [18, 22, 51]. In contrast, distributionally robust models with Wasserstein ambiguity sets are believed to be harder than their stochastic counterparts [36]. Indeed, the state-of-the-art method for computing the worst-case expectation over a Wasserstein ambiguity set  relies on global optimization techniques. Exploiting the fact that the extreme points of are discrete distributions with a fixed number of atoms [52], one may reformulate the original worst-case expectation problem as a finite-dimensional non-convex program, which can be solved via “difference of convex programming” methods, see [52] or [36, Section 7.1]. However, the computational effort is reported to be considerable, and there is no guarantee to find the global optimum. Nevertheless, tractability results are available for special cases. Specifically, the worst case of a convex law-invariant risk measure with respect to a Wasserstein ambiguity set reduces to the sum of the nominal risk and a regularization term whenever is affine in and does not include any support constraints [53]. Moreover, while this paper was under review we became aware of the PhD thesis [54], which reformulates a distributionally robust two-stage unit commitment problem over a Wasserstein ambiguity set as a semi-infinite linear program, which is subsequently solved using a Benders decomposition algorithm.

The main contribution of this paper is to demonstrate that the worst-case expectation over a Wasserstein ambiguity set can in fact be computed efficiently via convex optimization techniques for numerous loss functions of practical interest. Furthermore, we propose an efficient procedure for constructing an extremal distribution that attains the worst-case expectation—provided that such a distribution exists. Otherwise, we construct a sequence of distributions that attain the worst-case expectation asymptotically. As a by-product, our analysis shows that many interesting distributionally robust optimization problems with Wasserstein ambiguity sets can be solved in polynomial time. We also investigate the out-of-sample performance of the resulting optimal decisions—both theoretically and experimentally—and analyze its dependence on the number of training samples. We highlight the following main contributions of this paper.

• We prove that the worst-case expectation of an uncertain loss over a Wasserstein ambiguity set coincides with the optimal value of a finite-dimensional convex program if constitutes a pointwise maximum of finitely many concave functions. Generalizations to convex functions or to sums of maxima of concave functions are also discussed. We conclude that worst-case expectations can be computed efficiently to high precision via modern convex optimization algorithms.

• We describe a supplementary finite-dimensional convex program whose optimal (near-optimal) solutions can be used to construct exact (approximate) extremal distributions for the infinite-dimensional worst-case expectation problem.

• We show that the worst-case expectation reduces to the optimal value of an explicit linear program if the -norm or the -norm is used in the definition of the Wasserstein metric and if belongs to any of the following function classes: (1) a pointwise maximum or minimum of affine functions; (2) the indicator function of a closed polytope or the indicator function of the complement of an open polytope; (3) the optimal value of a parametric linear program whose cost or right-hand side coefficients depend linearly on .

• Using recent measure concentration results from statistics, we demonstrate that the optimal value of a distributionally robust optimization problem over a Wasserstein ambiguity set provides an upper confidence bound on the out-of-sample cost of the worst-case optimal decision. We validate this theoretical performance guarantee in numerical tests.

If the uncertain parameter vector is confined to a fixed finite subset of , then the worst-case expectation problems over Wasserstein ambiguity sets simplify substantially and can often be reformulated as tractable conic programs by leveraging ideas from robust optimization. An elegant second-order conic reformulation has been discovered, for instance, in the context of distributionally robust regression analysis [32], and a comprehensive list of tractable reformulations of distributionally robust risk constraints for various risk measures is provided in [39]. Our paper extends these tractability results to the practically relevant case where has uncountably many possible realizations—without resorting to space tessellation or discretization techniques that are prone to the curse of dimensionality.

When is linear and the distribution of ranges over a Wasserstein ambiguity set without support constraints, one can derive a concise closed-form expression for the worst-case risk of for various convex risk measures [53]. However, these analytical solutions come at the expense of a loss of generality. We believe that the results of this paper may pave the way towards an efficient computational procedure for evaluating the worst-case risk of in more general settings where the loss function may be non-linear and may be subject to support constraints.

Among all metric-based ambiguity sets studied to date, the Kullback-Leibler ambiguity set has attracted most attention from the robust optimization community. It has first been used in financial portfolio optimization to capture the distributional uncertainty of asset returns with a Gaussian nominal distribution [19]. Subsequent work has focused on Kullback-Leibler ambiguity sets for discrete distributions with a fixed support, which offer additional modeling flexibility without sacrificing computational tractability [14, 2]. It is also known that distributionally robust chance constraints involving a generic Kullback-Leibler ambiguity set are equivalent to the respective classical chance constraints under the nominal distribution but with a rescaled violation probability [27, 26]. Moreover, closed-form counterparts of distributionally robust expectation constraints with Kullback-Leibler ambiguity sets have been derived in [25].

However, Kullback-Leibler ambiguity sets typically fail to represent confidence sets for the unknown distribution . To see this, assume that is absolutely continuous with respect to the Lebesgue measure and that the ambiguity set is centered at the discrete empirical distribution . Then, any distribution in a Kullback-Leibler ambiguity set around must assign positive probability mass to each training sample. As has a density function, it must therefore reside outside of the Kullback-Leibler ambiguity set irrespective of the training samples. Thus, Kullback-Leibler ambiguity sets around contain with probability 0. In contrast, Wasserstein ambiguity sets centered at contain discrete as well as continuous distributions and, if properly calibrated, represent meaningful confidence sets for . We will exploit this property in Section 3 to derive finite-sample guarantees. A comparison and critical assessment of various metric-based ambiguity sets is provided in [45]. Specifically, it is shown that worst-case expectations over Kullback-Leibler and other divergence-based ambiguity sets are law invariant. In contrast, worst-case expectations over Wasserstein ambiguity sets are not. The law invariance can be exploited to evaluate worst-case expectations via the sample average approximation.

The models proposed in this paper fall within the scope of data-driven distributionally robust optimization [20, 16, 7, 23]. Closest in spirit to our work is the robust sample average approximation [7], which seeks decisions that are robust with respect to the ambiguity set of all distributions that pass a prescribed statistical hypothesis test. Indeed, the distributions within the Wasserstein ambiguity set could be viewed as those that pass a multivariate goodness-of-fit test in light of the available training samples. This amounts to interpreting the Wasserstein distance between the empirical distribution and a given hypothesis as a test statistic and the radius of the Wasserstein ambiguity set as a threshold that needs to be chosen in view of the test’s desired significance level . The Wasserstein distance has already been used in tests for normality [17] and to devise nonparametric homogeneity tests [40].

The rest of the paper proceeds as follows. Section 2 sketches a generic framework for data-driven distributionally robust optimization, while Section 3 introduces our specific approach based on Wasserstein ambiguity sets and establishes its out-of-sample performance guarantees. In Section 4 we demonstrate that many worst-case expectation problems over Wasserstein ambiguity sets can be reduced to finite-dimensional convex programs, and we develop a systematic procedure for constructing worst-case distributions. Explicit linear programming reformulations of distributionally robust single and two-stage stochastic programs as well as uncertainty quantification problems are derived in Section 5. Section 6 extends the scope of the basic approach to broader classes of objective functions, and Section 7 reports on numerical results.

##### Notation

We denote by the non-negative and by the extended reals. Throughout this paper, we adopt the conventions of extended arithmetics, whereby and . The inner product of two vectors is denoted by . Given a norm on , the dual norm is defined through . A function is proper if for at least one and for every in . The conjugate of is defined as . Note that conjugacy preserves properness. For a set , the indicator function is defined through if ; otherwise. Similarly, the characteristic function is defined via if ; otherwise. The support function of is defined as . It coincides with the conjugate of . We denote by the Dirac distribution concentrating unit mass at . The product of two probability distributions and on and , respectively, is the distribution on . The -fold product of a distribution on is denoted by , which represents a distribution on the Cartesian product space . Finally, we set the expectation of under to , which is well-defined by the conventions of extended arithmetics.

## 2. Data-Driven Stochastic Programming

Consider the stochastic program

 J⋆\coloneqqinfx∈X{EP[h(x,ξ)]=∫Ξh(x,ξ)P(dξ)} (1)

with feasible set , uncertainty set and loss function . The loss function depends both on the decision vector and the random vector , whose distribution is supported on . Problem (1) can be viewed as the first-stage problem of a two-stage stochastic program, where represents the optimal value of a subordinate second-stage problem [46]. Alternatively, problem (1) may also be interpreted as a generic learning problem in the spirit of [49].

Unfortunately, in most situations of practical interest, the distribution is not precisely known, and therefore we miss essential information to solve problem (1) exactly. However, is often partially observable through a finite set of independent samples, e.g., past realizations of the random vector . We denote the training dataset comprising these samples by . We emphasize that—before its revelation—the dataset can be viewed as a random object governed by the distribution supported on .

A data-driven solution for problem (1) is a feasible decision that is constructed from the training dataset . Throughout this paper, we notationally suppress the dependence of on the training samples in order to avoid clutter. Instead, we reserve the superscript ‘  ’ for objects that depend on the training data and thus constitute random objects governed by the product distribution . The out-of-sample performance of is defined as and can thus be viewed as the expected cost of under a new sample that is independent of the training dataset. As is unknown, however, the exact out-of-sample performance cannot be evaluated in practice, and the best we can hope for is to establish performance guarantees in the form of tight bounds. The feasibility of in (1) implies , but this lower bound is again of limited use as is unknown and as our primary concern is to bound the costs from above. Thus, we seek data-driven solutions with performance guarantees of the type

 (2)

where constitutes an upper bound that may depend on the training dataset, and is a significance parameter with respect to the distribution , which governs both and . Hereafter we refer to as a certificate for the out-of-sample performance of and to the probability on the left-hand side of (2) as its reliability. Our ideal goal is to find a data-driven solution with the lowest possible out-of-sample performance. This is impossible, however, as is unknown, and the out-of-sample performance cannot be computed. We thus pursue the more modest but achievable goal to find a data-driven solution with a low certificate and a high reliability.

A natural approach to generate data-driven solutions is to approximate with the discrete empirical probability distribution

 ˆPN\coloneqq1NN∑i=1δˆξi, (3)

that is, the uniform distribution on . This amounts to approximating the original stochastic program (1) with the sample-average approximation (SAA) problem

 (4)

If the feasible set is compact and the loss function is uniformly continuous in across all , then the optimal value and optimal solutions of the SAA problem (4) converge almost surely to their counterparts of the true problem (1) as tends to infinity [46, Theorem 5.3]. Even though finite sample performance guarantees of the type (2) can be obtained under additional assumptions such as Lipschitz continuity of the loss function (see e.g.[47, Theorem 1]), the SAA problem has been conceived primarily for situations where the distribution is known and additional samples can be acquired cheaply via random number generation. However, the optimal solutions of the SAA problem tend to display a poor out-of-sample performance in situations where is small and where the acquisition of additional samples would be costly.

In this paper we address problem (1) with an alternative approach that explicitly accounts for our ignorance of the true data-generating distribution , and that offers attractive performance guarantees even when the acquisition of additional samples from is impossible or expensive. Specifically, we use to design an ambiguity set containing all distributions that could have generated the training samples with high confidence. This ambiguity set enables us to define the certificate as the optimal value of a distributionally robust optimization problem that minimize the worst-case expected cost.

 ˆJN\coloneqqinfx∈XsupQ∈ˆPNEQ[h(x,ξ)] (5)

Following [38], we construct as a ball around the empirical distribution (3) with respect to the Wasserstein metric. In the remainder of the paper we will demonstrate that the optimal value as well as any optimal solution (if it exists) of the distributionally robust problem (5) satisfy the following conditions.

1. [label=(), itemsep = 1mm, topsep = 1mm]

2. Finite sample guarantee: For a carefully chosen size of the ambiguity set, the certificate provides a confidence bound of the type (2) on the out-of-sample performance of .

3. Asymptotic consistency: As tends to infinity, the certificate and the data-driven solution converge—in a sense to be made precise below—to the optimal value and an optimizer of the stochastic program (1), respectively.

4. Tractability: For many loss functions and sets , the distributionally robust problem (5) is computationally tractable and admits a reformulation reminiscent of the SAA problem (4).

Conditions 13 have been identified in [7] as desirable properties of data-driven solutions for stochastic programs. Precise statements of these conditions will be provided in the remainder. In Section 3 we will use the Wasserstein metric to construct ambiguity sets of the type satisfying the conditions 1 and 2. In Section 4, we will demonstrate that these ambiguity sets also fulfill the tractability condition 3. We see this last result as the main contribution of this paper because the state-of-the-art method for solving distributionally robust problems over Wasserstein ambiguity sets relies on global optimization algorithms [36].

## 3. Wasserstein Metric and Measure Concentration

Probability metrics represent distance functions on the space of probability distributions. One of the most widely used examples is the Wasserstein metric, which is defined on the space of all probability distributions supported on with .

###### Definition 3.1 (Wasserstein metric [29]).

The Wasserstein metric is defined via

for all distributions , where represents an arbitrary norm on .

The decision variable can be viewed as a transportation plan for moving a mass distribution described by to another one described by . Thus, the Wasserstein distance between and represents the cost of an optimal mass transportation plan, where the norm encodes the transportation costs. We remark that a generalized -Wasserstein metric for is obtained by setting the transportation cost between and to . In this paper, however, we focus exclusively on the -Wasserstein metric of Definition 3.1, which is sometimes also referred to as the Kantorovich metric.

We will sometimes also need the following dual representation of the Wasserstein metric.

###### Theorem 3.2 (Kantorovich-Rubinstein [29]).

For any distributions we have

 dW(Q1,Q2)=supf∈L{∫Ξf(ξ)Q1(dξ)−∫Ξf(ξ)Q2(dξ)},

where denotes the space of all Lipschitz functions with for all .

Kantorovich and Rubinstein [29] originally established this result for distributions with bounded support. A modern proof for unbounded distributions is due to Villani [50, Remark 6.5, p. 107]. The optimization problems in Definition 3.1 and Theorem 3.2, which provide two equivalent characterizations of the Wasserstein metric, constitute a primal-dual pair of infinite-dimensional linear programs. The dual representation implies that two distributions and are close to each other with respect to the Wasserstein metric if and only if all functions with uniformly bounded slopes have similar integrals under and . Theorem 3.2 also demonstrates that the Wasserstein metric is a special instance of an integral probability metric (see e.g. [33]) and that its generating function class coincides with a family of Lipschitz continuous functions.

In the remainder we will examine the ambiguity set

 (6)

which can be viewed as the Wasserstein ball of radius centered at the empirical distribution . Under a common light tail assumption on the unknown data-generating distribution , this ambiguity set offers attractive performance guarantees in the spirit of Section 2.

###### Assumption 3.3 (Light-tailed distribution).

There exists an exponent such that

 A\coloneqqEP[exp(∥ξ∥a)]=∫Ξexp(∥ξ∥a)P(dξ)<∞.

Assumption 3.3 essentially requires the tail of the distribution to decay at an exponential rate. Note that this assumption trivially holds if is compact. Heavy-tailed distributions that fail to meet Assumption 3.3 are difficult to handle even in the context of the classical sample average approximation. Indeed, under a heavy-tailed distribution the sample average of the loss corresponding to any fixed decision may not even converge to the expected loss; see e.g. [13, 15]. The following modern measure concentration result provides the basis for establishing powerful finite sample guarantees.

###### Theorem 3.4 (Measure concentration [21, Theorem 2]).

If Assumption 3.3 holds, we have

 (7)

for all , , and , where are positive constants that only depend on , , and .111A similar but slightly more complicated inequality also holds for the special case ; see [21, Theorem 2] for details.

Theorem 3.4 provides an a priori estimate of the probability that the unknown data-generating distribution  resides outside of the Wasserstein ball . Thus, we can use Theorem 3.4 to estimate the radius of the smallest Wasserstein ball that contains with confidence for some prescribed . Indeed, equating the right-hand side of (7) to and solving for yields

 εN(β)\coloneqq⎧⎪ ⎪⎨⎪ ⎪⎩(log(c1β−1)c2N)1/max{m,2}if N≥log(c1β−1)c2,(log(c1β−1)c2N)1/aif N

Note that the Wasserstein ball with radius can thus be viewed as a confidence set for the unknown true distribution as in statistical testing; see also [7].

###### Theorem 3.5 (Finite sample guarantee).

Suppose that Assumption 3.3 holds and that . Assume also that and represent the optimal value and an optimizer of the distributionally robust program (5) with ambiguity set . Then, the finite sample guarantee (2) holds.

###### Proof.

The claim follows immediately from Theorem 3.4, which ensures via the definition of in (8) that . Thus, with probability . ∎

It is clear from (8) that for any fixed , the radius tends to as increases. Moreover, one can show that if converges to zero at a carefully chosen rate, then the solution of the distributionally robust optimization problem (5) with ambiguity set converges to the solution of the original stochastic program (1) as tends to infinity. The following theorem formalizes this statement.

###### Theorem 3.6 (Asymptotic consistency).

Suppose that Assumption 3.3 holds and that , , satisfies and .222A possible choice is . Assume also that and represent the optimal value and an optimizer of the distributionally robust program (5) with ambiguity set , .

1. [label=(), itemsep = 1mm, topsep = 1mm]

2. If is upper semicontinuous in and there exists with for all and , then -almost surely we have as where is the optimal value of (1).

3. If the assumptions of assertion 1 hold, is closed, and is lower semicontinuous in for every , then any accumulation point of is -almost surely an optimal solution for (1).

The proof of Theorem 3.6 will rely on the following technical lemma.

###### Lemma 3.7 (Convergence of distributions).

If Assumption 3.3 holds and , , satisfies and , then, any sequence , , where may depend on the training data, converges under the Wasserstein metric (and thus weakly) to almost surely with respect to , that is,

 P∞{limN→∞dW(P,ˆQN)=0}=1.
###### Proof.

As , the triangle inequality for the Wasserstein metric ensures that

 dW(P,ˆQN)≤dW(P,ˆPN)+dW(ˆPN,ˆQN)≤dW(P,ˆPN)+εN(βN).

Moreover, Theorem 3.4 implies that , and thus we have . As , the Borel-Cantelli Lemma [28, Theorem 2.18] further implies that

 P∞{dW(P,ˆQN)≤εN(βN) for all sufficiently large N}=1.

Finally, as , we conclude that almost surely. Note that convergence with respect to the Wasserstein metric implies weak convergence [10]. ∎

###### Proof of Theorem 3.6.

As , we have . Moreover, Theorem 3.5 implies that

 PN{J⋆≤EP[h(ˆxN,ξ)]≤ˆJN}≥PN{P∈BεN(βN)(ˆPN)}≥1−βN,

for all . As , the Borel-Cantelli Lemma further implies that

 P∞{J⋆≤EP[h(ˆxN,ξ)]≤ˆJN for all sufficiently large N}=1.

To prove assertion 1, it thus remains to be shown that with probability . As is upper semicontinuous and grows at most linearly in , there exists a non-increasing sequence of functions , , such that , and is Lipschitz continuous in for any fixed and with Lipschitz constant ; see Lemma A.1 in the appendix. Next, choose any , fix a -optimal decision for (1) with , and for every let be a -optimal distribution corresponding to with

 supQ∈ˆPNEQ[h(xδ,ξ)]≤EQN[h(xδ,ξ)]+δ.

Then, we have

 limsupN→∞ˆJN≤limsupN→∞supQ∈ˆPNEQ[h(xδ,ξ)] ≤limsupN→∞EˆQN[h(xδ,ξ)]+δ ≤limk→∞limsupN→∞(EP[hk(xδ,ξ)]+LkdW(P,ˆQN))+δ =limk→∞EP[hk(xδ,ξ)]+δ,P∞-almost surely =EP[h(xδ,ξ)]+δ≤J⋆+2δ,

where the second inequality holds because converges from above to , and the third inequality follows from Theorem 3.2. Moreover, the almost sure equality holds due to Lemma 3.7, and the last equality follows from the Monotone Convergence Theorem [30, Theorem 5.5], which applies because . Indeed, recall that has an exponentially decaying tail due to Assumption 3.3 and that is Lipschitz continuous in . As was chosen arbitrarily, we thus conclude that .

To prove assertion 2, fix an arbitrary realization of the stochastic process such that and for all sufficiently large . From the proof of assertion 1 we know that these two conditions are satisfied -almost surely. Using these assumptions, one easily verifies that

 liminfN→∞EP[h(ˆxN,ξ)]≤limN→∞ˆJN=J⋆. (9)

Next, let be an accumulation point of the sequence , and note that as is closed. By passing to a subsequence, if necessary, we may assume without loss of generality that . Thus,

 J⋆≤EP[h(x⋆,ξ)] ≤EP[liminfN→∞h(ˆxN,ξ)]≤liminfN→∞EP[h(ˆxN,ξ)]≤J⋆,

where the first inequality exploits that , the second inequality follows from the lower semicontinuity of in , the third inequality holds due to Fatou’s lemma (which applies because grows at most linearly in ), and the last inequality follows from (9). Therefore, we have . ∎

In the following we show that all assumptions of Theorem 3.6 are necessary for asymptotic convergence, that is, relaxing any of these conditions can invalidate the convergence result.

###### Example 1 (Necessity of regularity conditions).
1. [itemsep = 1mm, topsep = 1mm]

2. Upper semicontinuity of in Theorem 3.6 1:
Set , and , whereby . As concentrates unit mass at , we have irrespective of . For any , the Dirac distribution thus resides within the Wasserstein ball . Hence, fails to converge to for because

 ˆJN≥Eδε[h(x,ξ)]=h(x,ε)=1,∀ε>0.
3. Linear growth of in Theorem 3.6 1:
Set , and , which implies that . Note that for any , the two-point distribution is contained in the Wasserstein ball of radius . Hence, fails to converge to for because

4. Lower semicontinuity of in Theorem 3.6 2:
Set and , whereby irrespective of . As the objective is independent of , the distributionally robust optimization problem (5) is equivalent to (1). Then, is a sequence of minimizers for (5) whose accumulation point fails to be optimal in (1).

A convergence result akin to Theorem 3.6 for goodness-of-fit-based ambiguity sets is discussed in [7, Section 4]. This result is complementary to Theorem 3.6. Indeed, Theorem 3.6(i) requires to be upper semicontinuous in , which is a necessary condition in our setting (see Example 1) that is absent in [7]. Moreover, Theorem 3.6(ii) only requires to be lower semicontinuous in , while [7] asks for equicontinuity of this mapping. This stronger requirement provides a stronger result, that is, the almost sure convergence of to uniformly in on any compact subset of .

Theorems 3.5 and 3.6 indicate that a careful a priori design of the Wasserstein ball results in attractive finite sample and asymptotic guarantees for the distributionally robust solutions. In practice, however, setting the Wasserstein radius to yields over-conservative solutions for the following reasons:

• Even though the constants and in (8) can be computed based on the proof of [21, Theorem 2], the resulting Wasserstein ball is larger than necessary, i.e., with probability .

• Even if , the optimal value of (5) may still provide an upper bound on .

• The formula for in (8) is independent of the training data. Allowing for random Wasserstein radii, however, results in a more efficient use of the available training data.

While Theorems 3.5 and 3.6 provide strong theoretical justification for using Wasserstein ambiguity sets, in practice, it is prudent to calibrate the Wasserstein radius via bootstrapping or cross-validation instead of using the conservative a priori bound ; see Section 7.2 for further details. A similar approach has been advocated in [7] to determine the sizes of ambiguity sets that are constructed via goodness-of-fit tests.

So far we have seen that the Wasserstein metric allows us to construct ambiguity sets with favorable asymptotic and finite sample guarantees. In the remainder of the paper we will further demonstrate that the distributionally robust optimization problem (5) with a Wasserstein ambiguity set (6) is not significantly harder to solve than the corresponding SAA problem (4).

## 4. Solving Worst-Case Expectation Problems

We now demonstrate that the inner worst-case expectation problem in (5) over the Wasserstein ambiguity set (6) can be reformulated as a finite convex program for many loss functions of practical interest. For ease of notation, throughout this section we suppress the dependence on the decision variable . Thus, we examine a generic worst-case expectation problem

 supQ∈Bε(ˆPN)EQ[ℓ(ξ)] (10)

involving a decision-independent loss function , which is defined as the pointwise maximum of more elementary measurable functions , . The focus on loss functions representable as pointwise maxima is non-restrictive unless we impose some structure on the functions . Many tractability results in the remainder of this paper are predicated on the following convexity assumption.

###### Assumption 4.1 (Convexity).

The uncertainty set is convex and closed, and the negative constituent functions are proper, convex, and lower semicontinuous for all . Moreover, we assume that is not identically on for all .

Assumption 4.1 essentially stipulates that can be written as a maximum of concave functions. As we will showcase in Section 5, this mild restriction does not sacrifice much modeling power. Moreover, generalizations of this setting will be discussed in Section 6. We proceed as follows. Subsection 4.1 addresses the reduction of (10) to a finite convex program, while Subsection 4.2 describes a technique for constructing worst-case distributions.

### 4.1. Reduction to a Finite Convex Program

The worst-case expectation problem (10) constitutes an infinite-dimensional optimization problem over probability distributions and thus appears to be intractable. However, we will now demonstrate that (10) can be re-expressed as a finite-dimensional convex program by leveraging tools from robust optimization.

###### Theorem 4.2 (Convex reduction).

If the convexity Assumption 4.1 holds, then for any the worst-case expectation (10) equals the optimal value of the finite convex program

 (11)

Recall that denotes the conjugate of evaluated at and the dual norm of . Moreover, represents the characteristic function of and its conjugate, that is, the support function of .

###### Proof of Theorem 4.2.

By using Definition 3.1 we can re-express the worst-case expectation (10) as

 supQ∈Bε(ˆPN)EQ[ℓ(ξ)] =⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩supΠ,Q∫Ξℓ(ξ)Q(dξ)s.t.∫Ξ2∥ξ−ξ′∥Π(dξ,dξ′)≤ε{Π is a joint distribution of ξ and ξ′with marginals Q and ˆPN, respectively%

The second equality follows from the law of total probability, which asserts that any joint probability distribution of and can be constructed from the marginal distribution of and the conditional distributions of given , , that is, we may write . The resulting optimization problem represents a generalized moment problem in the distributions , . Using a standard duality argument, we obtain

 supQ∈Bε(ˆPN)EQ[ℓ(ξ)] =supQi∈M(Ξ)infλ≥01NN∑i=1∫Ξℓ(ξ)Qi(dξ)+λ(ε−1NN∑i=1∫Ξ∥ξ−ˆξi∥Qi(dξ)) ≤infλ≥0supQi∈M(Ξ)λε+1NN∑i=1∫Ξ(ℓ(ξ)−λ∥ξ−ˆξi∥)Qi(dξ) (12a) =infλ≥0λε+1NN∑i=1supξ∈Ξ(ℓ(ξ)−λ∥ξ−ˆξi∥), (12b) where (12a) follows from the max-min inequality, and (12b) follows from the fact that M(Ξ) contains all the Dirac distributions supported on Ξ. Introducing epigraphical auxiliary variables si, i≤N, allows us to reformulate (12b) as ⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩infλ,siλε+1NN∑i=1sis.t.supξ∈Ξ(ℓ(ξ)−λ∥ξ−ˆξi∥)≤si∀i≤Nλ≥0 (12c) = (12d) ≤ (12e) Equality (12d) exploits the definition of the dual norm and the decomposability of ℓ(ξ) into its constituents ℓk(ξ), k≤K. Interchanging the maximization over zik with the minus sign (thereby converting the maximization to a minimization) and then with the maximization over ξ leads to a restriction of the feasible set of (12d). The resulting upper bound (12e) can be re-expressed as ⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩infλ,si,zikλε+1NN∑i=1sis.t.supξ∈Ξ(ℓk(ξ)−⟨zik,ξ⟩)+⟨zik,ˆξi⟩≤si∀i≤N,∀k≤K∥zik∥∗≤λ∀i≤N,∀k≤K = ⎧⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪⎩infλ,si,zikλε+1NN∑i=1sis.t.[−ℓk+χΞ]∗(zik)−⟨zik,ˆξi⟩≤si∀i≤N,∀k≤K∥zik∥∗≤λ∀i≤N,∀k≤K, (12f)

where (12f) follows from the definition of conjugacy, our conventions of extended arithmetic, and the substitution of with . Note that (12f) is already a finite convex program.

Next, we show that Assumption 4.1 reduces the inequalities (12a) and (12e) to equalities. Under Assumption 4.1, the inequality (12a) is in fact an equality for any by virtue of an extended version of a well-known strong duality result for moment problems [44, Proposition 3.4]. One can show that (12a) continues to hold as an equality even for , in which case the Wasserstein ambiguity set (6) reduces to the singleton , while (10) reduces to the sample average . Indeed, for the variable in (12b) can be increased indefinitely at no penalty. As constitutes a pointwise maximum of upper semicontinuous concave functions, an elementary but tedious argument shows that (12b) converges to the sample average as tends to infinity.

The inequality (12e) also reduces to an equality under Assumption 4.1 thanks to the classical minimax theorem [4, Proposition 5.5.4], which applies because the set is compact for any finite . Thus, the optimal values of (10) and (12f) coincide.

Assumption 4.1 further implies that the function is proper, convex and lower semicontinuous. Properness holds because is not identically on . By [42, Theorem 11.23(a), p. 493], its conjugate essentially coincides with the epi-addition (also known as inf-convolution) of the conjugates of the functions and . Thus,

 [−ℓk+χΞ]∗(zik) =infνik([−ℓk]∗(zik−νik)+[χΞ]∗(νik)) =cl[infνik([−ℓk]∗(zik−νik)+σΞ(νik))],

where denotes the closure operator that maps any function to its largest lower semicontinuous minorant. As if and only if for any function , we may conclude that (12f) is indeed equivalent to (11) under Assumption 4.1. ∎

Note that the semi-infinite inequality in (12c) generalizes the nonlinear uncertain constraints studied in [1] because it involves an additional norm term and as the loss function is not necessarily concave under Assumption 4.1. As in [1], however, the semi-infinite constraint admits a robust counterpart that involves the conjugate of the loss function and the support function of the uncertainty set.

From the proof of Theorem 4.2 it is immediately clear that the worst-case expectation (10) is conservatively approximated by the optimal value of the finite convex program (12f) even if Assumption 4.1 fails to hold. In this case the sum in (12f) must be evaluated under our conventions of extended arithmetics, whereby . These observations are formalized in the following corollary.

###### Corollary 4.3 (Approximate convex reduction).

For any , the worst-case expectation (10) is smaller or equal to the optimal value of the finite convex program (12f).

### 4.2. Extremal Distributions

Stress test experiments are instrumental to assess the quality of candidate decisions in stochastic optimization. Meaningful stress tests require a good understanding of the extremal distributions from within the Wasserstein ball that achieve the worst-case expectation (10) for various loss functions. We now show that such extremal distributions can be constructed systematically from the solution of a convex program akin to (11).

###### Theorem 4.4 (Worst-case distributions).

If Assumption 4.1 holds, then the worst-case expectation (10) coincides with the optimal value of the finite convex program

 ⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩supαik,qik1NN∑i=1K∑k=1αikℓk(ˆξi−qikαik)s.t.1NN∑i=1K∑k=1∥qik∥≤εK∑k=1αik=1∀i≤Nαik≥0∀i≤N,∀k≤Kˆξi−qikαik∈Ξ∀i≤N,∀k≤K (13)

irrespective of . Let be a sequence of feasible decisions whose objective values converge to the supremum of (13). Then, the discrete probability distributions

 Qr\coloneqq1NN∑i=1K∑k=1αik(r)δξik(r)withξik(r)\coloneqqˆξi−qik(r)αik(r)

belong to the Wasserstein ball and attain the supremum of (10) asymptotically, i.e.,

 supQ∈Bε(ˆPN)EQ[ℓ(ξ)]=limr→∞EQr[ℓ(ξ)]=limk→∞1NN∑i=1<