Data-Driven Distributionally Robust Optimization Using the Wasserstein Metric: Performance Guarantees and Tractable Reformulations
Abstract.
We consider stochastic programs where the distribution of the uncertain parameters is only observable through a finite training dataset. Using the Wasserstein metric, we construct a ball in the space of (multivariate and non-discrete) probability distributions centered at the uniform distribution on the training samples, and we seek decisions that perform best in view of the worst-case distribution within this Wasserstein ball. The state-of-the-art methods for solving the resulting distributionally robust optimization problems rely on global optimization techniques, which quickly become computationally excruciating. In this paper we demonstrate that, under mild assumptions, the distributionally robust optimization problems over Wasserstein balls can in fact be reformulated as finite convex programs, in many interesting cases even as tractable linear programs. Leveraging recent measure concentration results, we also show that their solutions enjoy powerful finite-sample performance guarantees. Our theoretical results are exemplified in mean-risk portfolio optimization as well as uncertainty quantification.
1. Introduction
Stochastic programming is a powerful modeling paradigm for optimization under uncertainty. The goal of a generic single-stage stochastic program is to find a decision $x$ that minimizes an expected cost $\mathbb{E}^{\mathbb{P}}[h(x,\xi)]$, where the expectation is taken with respect to the distribution $\mathbb{P}$ of the continuous random vector $\xi$. However, classical stochastic programming is challenged by the large-scale decision problems encountered in today's increasingly interconnected world. First, the distribution $\mathbb{P}$ is never observable but must be inferred from data. However, if we calibrate a stochastic program to a given dataset and evaluate its optimal decision on a different dataset, then the resulting out-of-sample performance is often disappointing, even if the two datasets are generated from the same distribution. This phenomenon is termed the optimizer's curse and is reminiscent of overfitting effects in statistics [48]. Second, in order to evaluate the objective function of a stochastic program for a fixed decision $x$, we need to compute a multivariate integral, which is #P-hard even if $h(x,\xi)$ constitutes the positive part of an affine function, while $\xi$ is uniformly distributed on the unit hypercube [24, Corollary 1].
Distributionally robust optimization is an alternative modeling paradigm, where the objective is to find a decision $x$ that minimizes the worst-case expected cost $\sup_{\mathbb{Q}\in\mathcal{P}} \mathbb{E}^{\mathbb{Q}}[h(x,\xi)]$. Here, the worst case is taken over an ambiguity set $\mathcal{P}$, that is, a family of distributions characterized through certain known properties of the unknown data-generating distribution $\mathbb{P}$. Distributionally robust optimization problems have been studied since Scarf's seminal treatise on the ambiguity-averse newsvendor problem in 1958 [43], but the field has gained thrust only with the advent of modern robust optimization techniques in the last decade [3, 9]. Distributionally robust optimization has the following striking benefits. First, adopting a worst-case approach regularizes the optimization problem and thereby mitigates the optimizer's curse characteristic of stochastic programming. Second, distributionally robust models are often tractable even though the corresponding stochastic model with the true data-generating distribution (which is generically continuous) is hard. So even if the data-generating distribution were known, the corresponding stochastic program could not be solved efficiently.
The ambiguity set is a key ingredient of any distributionally robust optimization model. A good ambiguity set should be rich enough to contain the true data-generating distribution with high confidence. On the other hand, the ambiguity set should be small enough to exclude pathological distributions, which would incentivize overly conservative decisions. The ambiguity set should also be easy to parameterize from data, and, ideally, it should facilitate a tractable reformulation of the distributionally robust optimization problem as a structured mathematical program that can be solved with off-the-shelf optimization software.
Distributionally robust optimization models where $\xi$ has finitely many realizations are reviewed in [2, 7, 39]. This paper focuses on situations where $\xi$ can have a continuum of realizations. In this setting, the existing literature has studied three types of ambiguity sets. Moment ambiguity sets contain all distributions that satisfy certain moment constraints, see for example [18, 22, 51] or the references therein. An attractive alternative is to define the ambiguity set as a ball in the space of probability distributions by using a probability distance function such as the Prohorov metric [20], the Kullback-Leibler divergence [27, 25], or the Wasserstein metric [38, 52]. Such metric-based ambiguity sets contain all distributions that are close to a nominal or most likely distribution with respect to the prescribed probability metric. By adjusting the radius of the ambiguity set, the modeler can thus control the degree of conservatism of the underlying optimization problem. If the radius drops to zero, then the ambiguity set shrinks to a singleton that contains only the nominal distribution, in which case the distributionally robust problem reduces to an ambiguity-free stochastic program. In addition, ambiguity sets can also be defined as confidence regions of goodness-of-fit tests [7].
In this paper we study distributionally robust optimization problems with a Wasserstein ambiguity set centered at the uniform distribution on $N$ independent and identically distributed training samples. The Wasserstein distance of two distributions $\mathbb{Q}_1$ and $\mathbb{Q}_2$ can be viewed as the minimum transportation cost for moving the probability mass from $\mathbb{Q}_1$ to $\mathbb{Q}_2$, and the Wasserstein ambiguity set contains all (continuous or discrete) distributions that are sufficiently close to the (discrete) empirical distribution $\hat{\mathbb{P}}_N$ with respect to the Wasserstein metric. Modern measure concentration results from statistics guarantee that the unknown data-generating distribution belongs to the Wasserstein ambiguity set around $\hat{\mathbb{P}}_N$ with confidence $1-\beta$ if its radius is a sublinearly growing function of $\log(1/\beta)$ [11, 21]. The optimal value of the distributionally robust problem thus provides an upper confidence bound on the achievable out-of-sample cost.
While Wasserstein ambiguity sets offer powerful out-of-sample performance guarantees and enable the decision maker to control the model's conservativeness, moment-based ambiguity sets appear to display better tractability properties. Specifically, there is growing evidence that distributionally robust models with moment ambiguity sets are more tractable than the corresponding stochastic models because the intractable high-dimensional integrals in the objective function are replaced with tractable (generalized) moment problems [18, 22, 51]. In contrast, distributionally robust models with Wasserstein ambiguity sets are believed to be harder than their stochastic counterparts [36]. Indeed, the state-of-the-art method for computing the worst-case expectation over a Wasserstein ambiguity set relies on global optimization techniques. Exploiting the fact that the extreme points of the Wasserstein ball are discrete distributions with a fixed number of atoms [52], one may reformulate the original worst-case expectation problem as a finite-dimensional non-convex program, which can be solved via “difference of convex programming” methods, see [52] or [36, Section 7.1]. However, the computational effort is reported to be considerable, and there is no guarantee of finding the global optimum. Nevertheless, tractability results are available for special cases. Specifically, the worst case of a convex law-invariant risk measure with respect to a Wasserstein ambiguity set reduces to the sum of the nominal risk and a regularization term whenever the loss is affine in $\xi$ and the ambiguity set includes no support constraints [53]. Moreover, while this paper was under review we became aware of the PhD thesis [54], which reformulates a distributionally robust two-stage unit commitment problem over a Wasserstein ambiguity set as a semi-infinite linear program, which is subsequently solved using a Benders decomposition algorithm.
The main contribution of this paper is to demonstrate that the worst-case expectation over a Wasserstein ambiguity set can in fact be computed efficiently via convex optimization techniques for numerous loss functions of practical interest. Furthermore, we propose an efficient procedure for constructing an extremal distribution that attains the worst-case expectation, provided that such a distribution exists. Otherwise, we construct a sequence of distributions that attain the worst-case expectation asymptotically. As a byproduct, our analysis shows that many interesting distributionally robust optimization problems with Wasserstein ambiguity sets can be solved in polynomial time. We also investigate the out-of-sample performance of the resulting optimal decisions, both theoretically and experimentally, and analyze its dependence on the number of training samples. We highlight the following main contributions of this paper.

We prove that the worst-case expectation of an uncertain loss $\ell(\xi)$ over a Wasserstein ambiguity set coincides with the optimal value of a finite-dimensional convex program if $\ell$ constitutes a pointwise maximum of finitely many concave functions. Generalizations to convex functions or to sums of maxima of concave functions are also discussed. We conclude that worst-case expectations can be computed efficiently to high precision via modern convex optimization algorithms.

We describe a supplementary finite-dimensional convex program whose optimal (near-optimal) solutions can be used to construct exact (approximate) extremal distributions for the infinite-dimensional worst-case expectation problem.

We show that the worst-case expectation reduces to the optimal value of an explicit linear program if the $1$-norm or the $\infty$-norm is used in the definition of the Wasserstein metric and if $\ell$ belongs to any of the following function classes: (1) a pointwise maximum or minimum of affine functions; (2) the indicator function of a closed polytope or the indicator function of the complement of an open polytope; (3) the optimal value of a parametric linear program whose cost or right-hand side coefficients depend linearly on $\xi$.

Using recent measure concentration results from statistics, we demonstrate that the optimal value of a distributionally robust optimization problem over a Wasserstein ambiguity set provides an upper confidence bound on the out-of-sample cost of the worst-case optimal decision. We validate this theoretical performance guarantee in numerical tests.
If the uncertain parameter vector $\xi$ is confined to a fixed finite subset of $\mathbb{R}^m$, then the worst-case expectation problems over Wasserstein ambiguity sets simplify substantially and can often be reformulated as tractable conic programs by leveraging ideas from robust optimization. An elegant second-order conic reformulation has been discovered, for instance, in the context of distributionally robust regression analysis [32], and a comprehensive list of tractable reformulations of distributionally robust risk constraints for various risk measures is provided in [39]. Our paper extends these tractability results to the practically relevant case where $\xi$ has uncountably many possible realizations, without resorting to space tessellation or discretization techniques that are prone to the curse of dimensionality.
When the loss function is linear in $\xi$ and the distribution of $\xi$ ranges over a Wasserstein ambiguity set without support constraints, one can derive a concise closed-form expression for the worst-case values of various convex risk measures [53]. However, these analytical solutions come at the expense of a loss of generality. We believe that the results of this paper may pave the way towards an efficient computational procedure for evaluating worst-case risk in more general settings where the loss function may be nonlinear in $\xi$ and where $\xi$ may be subject to support constraints.
Among all metric-based ambiguity sets studied to date, the Kullback-Leibler ambiguity set has attracted the most attention from the robust optimization community. It was first used in financial portfolio optimization to capture the distributional uncertainty of asset returns with a Gaussian nominal distribution [19]. Subsequent work has focused on Kullback-Leibler ambiguity sets for discrete distributions with a fixed support, which offer additional modeling flexibility without sacrificing computational tractability [14, 2]. It is also known that distributionally robust chance constraints involving a generic Kullback-Leibler ambiguity set are equivalent to the respective classical chance constraints under the nominal distribution but with a rescaled violation probability [27, 26]. Moreover, closed-form counterparts of distributionally robust expectation constraints with Kullback-Leibler ambiguity sets have been derived in [25].
However, Kullback-Leibler ambiguity sets typically fail to represent confidence sets for the unknown distribution $\mathbb{P}$. To see this, assume that $\mathbb{P}$ is absolutely continuous with respect to the Lebesgue measure and that the ambiguity set is centered at the discrete empirical distribution $\hat{\mathbb{P}}_N$. Then, any distribution in a Kullback-Leibler ambiguity set around $\hat{\mathbb{P}}_N$ must assign positive probability mass to each training sample. As $\mathbb{P}$ has a density function, it must therefore reside outside of the Kullback-Leibler ambiguity set irrespective of the training samples. Thus, Kullback-Leibler ambiguity sets around $\hat{\mathbb{P}}_N$ contain $\mathbb{P}$ with probability 0. In contrast, Wasserstein ambiguity sets centered at $\hat{\mathbb{P}}_N$ contain discrete as well as continuous distributions and, if properly calibrated, represent meaningful confidence sets for $\mathbb{P}$. We will exploit this property in Section 3 to derive finite-sample guarantees. A comparison and critical assessment of various metric-based ambiguity sets is provided in [45]. Specifically, it is shown that worst-case expectations over Kullback-Leibler and other divergence-based ambiguity sets are law invariant. In contrast, worst-case expectations over Wasserstein ambiguity sets are not. The law invariance can be exploited to evaluate worst-case expectations via the sample average approximation.
The models proposed in this paper fall within the scope of data-driven distributionally robust optimization [20, 16, 7, 23]. Closest in spirit to our work is the robust sample average approximation [7], which seeks decisions that are robust with respect to the ambiguity set of all distributions that pass a prescribed statistical hypothesis test. Indeed, the distributions within the Wasserstein ambiguity set could be viewed as those that pass a multivariate goodness-of-fit test in light of the available training samples. This amounts to interpreting the Wasserstein distance between the empirical distribution and a given hypothesis as a test statistic, and the radius of the Wasserstein ambiguity set as a threshold that needs to be chosen in view of the test's desired significance level. The Wasserstein distance has already been used in tests for normality [17] and to devise nonparametric homogeneity tests [40].
The rest of the paper proceeds as follows. Section 2 sketches a generic framework for data-driven distributionally robust optimization, while Section 3 introduces our specific approach based on Wasserstein ambiguity sets and establishes its out-of-sample performance guarantees. In Section 4 we demonstrate that many worst-case expectation problems over Wasserstein ambiguity sets can be reduced to finite-dimensional convex programs, and we develop a systematic procedure for constructing worst-case distributions. Explicit linear programming reformulations of distributionally robust single- and two-stage stochastic programs as well as uncertainty quantification problems are derived in Section 5. Section 6 extends the scope of the basic approach to broader classes of objective functions, and Section 7 reports on numerical results.
Notation
We denote by $\mathbb{R}_+ := [0,\infty)$ the non-negative and by $\overline{\mathbb{R}} := \mathbb{R}\cup\{-\infty,\infty\}$ the extended reals. Throughout this paper, we adopt the conventions of extended arithmetics, whereby $\infty \cdot 0 = 0 \cdot \infty = 0$ and $\infty - \infty = -\infty + \infty = \infty$. The inner product of two vectors $a,b\in\mathbb{R}^m$ is denoted by $\langle a,b\rangle := a^\top b$. Given a norm $\|\cdot\|$ on $\mathbb{R}^m$, the dual norm is defined through $\|z\|_* := \sup_{\|\xi\|\le 1} \langle z,\xi\rangle$. A function $f:\mathbb{R}^m\to\overline{\mathbb{R}}$ is proper if $f(\xi) < +\infty$ for at least one $\xi$ and $f(\xi) > -\infty$ for every $\xi$ in $\mathbb{R}^m$. The conjugate of $f$ is defined as $f^*(z) := \sup_{\xi\in\mathbb{R}^m} \langle z,\xi\rangle - f(\xi)$. Note that conjugacy preserves properness. For a set $\Xi\subseteq\mathbb{R}^m$, the indicator function $\mathbb{1}_\Xi$ is defined through $\mathbb{1}_\Xi(\xi) = 1$ if $\xi\in\Xi$; $\mathbb{1}_\Xi(\xi) = 0$ otherwise. Similarly, the characteristic function $\chi_\Xi$ is defined via $\chi_\Xi(\xi) = 0$ if $\xi\in\Xi$; $\chi_\Xi(\xi) = \infty$ otherwise. The support function of $\Xi$ is defined as $\sigma_\Xi(z) := \sup_{\xi\in\Xi} \langle z,\xi\rangle$. It coincides with the conjugate of $\chi_\Xi$. We denote by $\delta_\xi$ the Dirac distribution concentrating unit mass at $\xi\in\mathbb{R}^m$. The product of two probability distributions $\mathbb{P}_1$ and $\mathbb{P}_2$ on $\Xi_1$ and $\Xi_2$, respectively, is the distribution $\mathbb{P}_1\otimes\mathbb{P}_2$ on $\Xi_1\times\Xi_2$. The $N$-fold product of a distribution $\mathbb{P}$ on $\Xi$ is denoted by $\mathbb{P}^N$, which represents a distribution on the Cartesian product space $\Xi^N$. Finally, we set $\mathbb{E}^{\mathbb{P}}[f(\xi)] := \mathbb{E}^{\mathbb{P}}[\max\{f(\xi),0\}] + \mathbb{E}^{\mathbb{P}}[\min\{f(\xi),0\}]$, which is well-defined by the conventions of extended arithmetics.
2. DataDriven Stochastic Programming
Consider the stochastic program
(1) $J^\star := \inf_{x\in\mathbb{X}} \Big\{ \mathbb{E}^{\mathbb{P}}\big[h(x,\xi)\big] = \int_\Xi h(x,\xi)\,\mathbb{P}(\mathrm{d}\xi) \Big\}$
with feasible set $\mathbb{X}\subseteq\mathbb{R}^n$, uncertainty set $\Xi\subseteq\mathbb{R}^m$ and loss function $h:\mathbb{R}^n\times\mathbb{R}^m\to\overline{\mathbb{R}}$. The loss function depends both on the decision vector $x\in\mathbb{R}^n$ and the random vector $\xi\in\mathbb{R}^m$, whose distribution $\mathbb{P}$ is supported on $\Xi$. Problem (1) can be viewed as the first-stage problem of a two-stage stochastic program, where $h(x,\xi)$ represents the optimal value of a subordinate second-stage problem [46]. Alternatively, problem (1) may also be interpreted as a generic learning problem in the spirit of [49].
Unfortunately, in most situations of practical interest, the distribution $\mathbb{P}$ is not precisely known, and therefore we miss essential information to solve problem (1) exactly. However, $\mathbb{P}$ is often partially observable through a finite set of $N$ independent samples, e.g., past realizations of the random vector $\xi$. We denote the training dataset comprising these samples by $\hat{\Xi}_N := \{\hat{\xi}_i\}_{i\le N} \subseteq \Xi$. We emphasize that, before its revelation, the dataset $\hat{\Xi}_N$ can be viewed as a random object governed by the distribution $\mathbb{P}^N$ supported on $\Xi^N$.
A data-driven solution for problem (1) is a feasible decision $\hat{x}_N\in\mathbb{X}$ that is constructed from the training dataset $\hat{\Xi}_N$. Throughout this paper, we notationally suppress the dependence of $\hat{x}_N$ on the training samples in order to avoid clutter. Instead, we reserve the superscript '$\hat{\ }$' for objects that depend on the training data and thus constitute random objects governed by the product distribution $\mathbb{P}^N$. The out-of-sample performance of $\hat{x}_N$ is defined as $\mathbb{E}^{\mathbb{P}}[h(\hat{x}_N,\xi)]$ and can thus be viewed as the expected cost of $\hat{x}_N$ under a new sample $\xi$ that is independent of the training dataset. As $\mathbb{P}$ is unknown, however, the exact out-of-sample performance cannot be evaluated in practice, and the best we can hope for is to establish performance guarantees in the form of tight bounds. The feasibility of $\hat{x}_N$ in (1) implies $J^\star \le \mathbb{E}^{\mathbb{P}}[h(\hat{x}_N,\xi)]$, but this lower bound is again of limited use as $J^\star$ is unknown and as our primary concern is to bound the costs from above. Thus, we seek data-driven solutions with performance guarantees of the type
(2) $\mathbb{P}^N\Big\{ \mathbb{E}^{\mathbb{P}}\big[h(\hat{x}_N,\xi)\big] \le \hat{J}_N \Big\} \ge 1-\beta,$
where $\hat{J}_N$ constitutes an upper bound that may depend on the training dataset, and $\beta\in(0,1)$ is a significance parameter with respect to the distribution $\mathbb{P}^N$, which governs both $\hat{x}_N$ and $\hat{J}_N$. Hereafter we refer to $\hat{J}_N$ as a certificate for the out-of-sample performance of $\hat{x}_N$ and to the probability on the left-hand side of (2) as its reliability. Our ideal goal is to find a data-driven solution with the lowest possible out-of-sample performance. This is impossible, however, as $\mathbb{P}$ is unknown, and the out-of-sample performance cannot be computed. We thus pursue the more modest but achievable goal to find a data-driven solution with a low certificate $\hat{J}_N$ and a high reliability.
A natural approach to generate data-driven solutions is to approximate $\mathbb{P}$ with the discrete empirical probability distribution
(3) $\hat{\mathbb{P}}_N := \frac{1}{N}\sum_{i=1}^N \delta_{\hat{\xi}_i},$
that is, the uniform distribution on the training samples $\hat{\Xi}_N$. This amounts to approximating the original stochastic program (1) with the sample-average approximation (SAA) problem
(4) $\hat{J}_{\mathrm{SAA}} := \inf_{x\in\mathbb{X}} \Big\{ \mathbb{E}^{\hat{\mathbb{P}}_N}\big[h(x,\xi)\big] = \frac{1}{N}\sum_{i=1}^N h(x,\hat{\xi}_i) \Big\}$
If the feasible set $\mathbb{X}$ is compact and the loss function is uniformly continuous in $x$ across all $\xi\in\Xi$, then the optimal value and the optimal solutions of the SAA problem (4) converge almost surely to their counterparts of the true problem (1) as $N$ tends to infinity [46, Theorem 5.3]. Even though finite sample performance guarantees of the type (2) can be obtained under additional assumptions such as Lipschitz continuity of the loss function (see e.g., [47, Theorem 1]), the SAA problem has been conceived primarily for situations where the distribution $\mathbb{P}$ is known and additional samples can be acquired cheaply via random number generation. However, the optimal solutions of the SAA problem tend to display a poor out-of-sample performance in situations where $N$ is small and where the acquisition of additional samples would be costly.
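The SAA scheme (4) admits a compact numerical sketch. The piecewise-linear loss, the sample size, and the grid search below are illustrative stand-ins and are not taken from this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical piecewise-linear loss h(x, xi); the coefficients are
# illustrative only, not part of the paper.
def loss(x, xi):
    return np.maximum(10.0 * x - 15.0 * xi, -5.0 * x + 7.0 * xi)

# Training samples standing in for draws from the unknown distribution P.
xi_train = rng.exponential(scale=1.0, size=50)

# SAA objective: replace E_P[h(x, xi)] by the average over the samples.
def saa_objective(x):
    return loss(x, xi_train).mean()

# Crude minimization over a grid of candidate decisions.
grid = np.linspace(0.0, 5.0, 501)
x_saa = min(grid, key=saa_objective)
print(x_saa, saa_objective(x_saa))
```

As the text notes, the decision `x_saa` is tuned to the particular training samples, so its out-of-sample cost under fresh draws is typically worse than `saa_objective(x_saa)` suggests, especially for small sample sizes.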
In this paper we address problem (1) with an alternative approach that explicitly accounts for our ignorance of the true data-generating distribution $\mathbb{P}$, and that offers attractive performance guarantees even when the acquisition of additional samples from $\mathbb{P}$ is impossible or expensive. Specifically, we use the training dataset $\hat{\Xi}_N$ to design an ambiguity set $\hat{\mathcal{P}}_N$ containing all distributions that could have generated the training samples with high confidence. This ambiguity set enables us to define the certificate $\hat{J}_N$ as the optimal value of a distributionally robust optimization problem that minimizes the worst-case expected cost.
(5) $\hat{J}_N := \inf_{x\in\mathbb{X}} \sup_{\mathbb{Q}\in\hat{\mathcal{P}}_N} \mathbb{E}^{\mathbb{Q}}\big[h(x,\xi)\big]$
Following [38], we construct $\hat{\mathcal{P}}_N$ as a ball around the empirical distribution (3) with respect to the Wasserstein metric. In the remainder of the paper we will demonstrate that the optimal value $\hat{J}_N$ as well as any optimal solution $\hat{x}_N$ (if it exists) of the distributionally robust problem (5) satisfy the following conditions.


Finite sample guarantee: For a carefully chosen size of the ambiguity set, the certificate $\hat{J}_N$ provides a $1-\beta$ confidence bound of the type (2) on the out-of-sample performance of $\hat{x}_N$.

Asymptotic consistency: As $N$ tends to infinity, the certificate $\hat{J}_N$ and the data-driven solution $\hat{x}_N$ converge, in a sense to be made precise below, to the optimal value $J^\star$ and an optimizer $x^\star$ of the stochastic program (1), respectively.

Tractability: For many loss functions $h$ and sets $\Xi$, the distributionally robust problem (5) is computationally tractable and not significantly harder to solve than the corresponding SAA problem (4).
Conditions 1–3 have been identified in [7] as desirable properties of data-driven solutions for stochastic programs. Precise statements of these conditions will be provided in the remainder. In Section 3 we will use the Wasserstein metric to construct ambiguity sets of the type $\hat{\mathcal{P}}_N$ satisfying conditions 1 and 2. In Section 4, we will demonstrate that these ambiguity sets also fulfill the tractability condition 3. We see this last result as the main contribution of this paper because the state-of-the-art method for solving distributionally robust problems over Wasserstein ambiguity sets relies on global optimization algorithms [36].
3. Wasserstein Metric and Measure Concentration
Probability metrics represent distance functions on the space of probability distributions. One of the most widely used examples is the Wasserstein metric, which is defined on the space $\mathcal{M}(\Xi)$ of all probability distributions $\mathbb{Q}$ supported on $\Xi$ with $\mathbb{E}^{\mathbb{Q}}[\|\xi\|] = \int_\Xi \|\xi\|\,\mathbb{Q}(\mathrm{d}\xi) < \infty$.
Definition 3.1 (Wasserstein metric [29]).
The Wasserstein metric $d_{\mathrm{W}}:\mathcal{M}(\Xi)\times\mathcal{M}(\Xi)\to\mathbb{R}_+$ is defined via
$d_{\mathrm{W}}(\mathbb{Q}_1,\mathbb{Q}_2) := \inf\Big\{ \int_{\Xi^2} \|\xi_1-\xi_2\|\,\Pi(\mathrm{d}\xi_1,\mathrm{d}\xi_2) \;:\; \Pi$ is a joint distribution of $\xi_1$ and $\xi_2$ with marginals $\mathbb{Q}_1$ and $\mathbb{Q}_2$, respectively$\Big\}$
for all distributions $\mathbb{Q}_1,\mathbb{Q}_2\in\mathcal{M}(\Xi)$, where $\|\cdot\|$ represents an arbitrary norm on $\mathbb{R}^m$.
The decision variable $\Pi$ can be viewed as a transportation plan for moving a mass distribution described by $\mathbb{Q}_1$ to another one described by $\mathbb{Q}_2$. Thus, the Wasserstein distance between $\mathbb{Q}_1$ and $\mathbb{Q}_2$ represents the cost of an optimal mass transportation plan, where the norm $\|\cdot\|$ encodes the transportation costs. We remark that a generalized $p$-Wasserstein metric for $p\ge1$ is obtained by setting the transportation cost between $\xi_1$ and $\xi_2$ to $\|\xi_1-\xi_2\|^p$. In this paper, however, we focus exclusively on the Wasserstein metric of Definition 3.1, which is sometimes also referred to as the Kantorovich metric.
We will sometimes also need the following dual representation of the Wasserstein metric.
Theorem 3.2 (Kantorovich-Rubinstein [29]).
For any distributions $\mathbb{Q}_1,\mathbb{Q}_2\in\mathcal{M}(\Xi)$ we have
$d_{\mathrm{W}}(\mathbb{Q}_1,\mathbb{Q}_2) = \sup_{f\in\mathcal{L}} \Big\{ \int_\Xi f(\xi)\,\mathbb{Q}_1(\mathrm{d}\xi) - \int_\Xi f(\xi)\,\mathbb{Q}_2(\mathrm{d}\xi) \Big\},$
where $\mathcal{L}$ denotes the space of all Lipschitz functions with $|f(\xi)-f(\xi')| \le \|\xi-\xi'\|$ for all $\xi,\xi'\in\Xi$.
Kantorovich and Rubinstein [29] originally established this result for distributions with bounded support. A modern proof for unbounded distributions is due to Villani [50, Remark 6.5, p. 107]. The optimization problems in Definition 3.1 and Theorem 3.2, which provide two equivalent characterizations of the Wasserstein metric, constitute a primal-dual pair of infinite-dimensional linear programs. The dual representation implies that two distributions $\mathbb{Q}_1$ and $\mathbb{Q}_2$ are close to each other with respect to the Wasserstein metric if and only if all functions with uniformly bounded slopes have similar integrals under $\mathbb{Q}_1$ and $\mathbb{Q}_2$. Theorem 3.2 also demonstrates that the Wasserstein metric is a special instance of an integral probability metric (see e.g. [33]) and that its generating function class coincides with a family of Lipschitz continuous functions.
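For two discrete distributions, the primal problem in Definition 3.1 is a finite transportation linear program and can be solved directly. The following sketch (using `scipy`) is an illustration for distributions on the real line with the absolute value as the norm; nothing here is specific to this paper's method:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_discrete(x, p, y, q):
    """Type-1 Wasserstein distance between the discrete distributions
    sum_i p_i * delta_{x_i} and sum_j q_j * delta_{y_j} on the real line,
    solved as the transportation LP from Definition 3.1."""
    n, m = len(x), len(y)
    # Transportation cost c_ij = |x_i - y_j| (the norm in Definition 3.1).
    c = np.abs(np.subtract.outer(np.asarray(x), np.asarray(y))).ravel()
    # Marginal constraints: rows of the plan sum to p, columns sum to q.
    A_eq = []
    for i in range(n):
        row = np.zeros((n, m)); row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(m):
        col = np.zeros((n, m)); col[:, j] = 1.0
        A_eq.append(col.ravel())
    b_eq = np.concatenate([p, q])
    res = linprog(c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
    return res.fun

# Moving unit mass from 0 to 1 costs exactly 1.
print(wasserstein_discrete([0.0], [1.0], [1.0], [1.0]))  # → 1.0
```

The dense constraint matrix makes this sketch suitable only for small instances; dedicated optimal transport solvers scale much better.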
In the remainder we will examine the ambiguity set
(6) $\hat{\mathcal{P}}_N := \mathbb{B}_\varepsilon(\hat{\mathbb{P}}_N) := \big\{ \mathbb{Q}\in\mathcal{M}(\Xi) : d_{\mathrm{W}}\big(\hat{\mathbb{P}}_N,\mathbb{Q}\big) \le \varepsilon \big\},$
which can be viewed as the Wasserstein ball of radius $\varepsilon$ centered at the empirical distribution $\hat{\mathbb{P}}_N$. Under a common light-tail assumption on the unknown data-generating distribution $\mathbb{P}$, this ambiguity set offers attractive performance guarantees in the spirit of Section 2.
Assumption 3.3 (Light-tailed distribution).
There exists an exponent $a>1$ such that $A := \mathbb{E}^{\mathbb{P}}\big[\exp(\|\xi\|^a)\big] = \int_\Xi \exp(\|\xi\|^a)\,\mathbb{P}(\mathrm{d}\xi) < \infty.$
Assumption 3.3 essentially requires the tail of the distribution $\mathbb{P}$ to decay at an exponential rate. Note that this assumption trivially holds if $\Xi$ is compact. Heavy-tailed distributions that fail to meet Assumption 3.3 are difficult to handle even in the context of the classical sample average approximation. Indeed, under a heavy-tailed distribution the sample average of the loss corresponding to any fixed decision $x$ may not even converge to the expected loss; see e.g. [13, 15]. The following modern measure concentration result provides the basis for establishing powerful finite sample guarantees.
Theorem 3.4 (Measure concentration [21, Theorem 2]).
If Assumption 3.3 holds, we have
(7) $\mathbb{P}^N\big\{ d_{\mathrm{W}}(\mathbb{P},\hat{\mathbb{P}}_N) \ge \varepsilon \big\} \le \begin{cases} c_1 \exp\big(-c_2 N \varepsilon^{\max\{m,2\}}\big) & \text{if } \varepsilon \le 1, \\ c_1 \exp\big(-c_2 N \varepsilon^{a}\big) & \text{if } \varepsilon > 1, \end{cases}$
for all $N\ge1$, $m\ne2$, and $\varepsilon>0$, where $c_1$ and $c_2$ are positive constants that only depend on $a$, $A$, and $m$.
Theorem 3.4 provides an a priori estimate of the probability that the unknown data-generating distribution $\mathbb{P}$ resides outside of the Wasserstein ball $\mathbb{B}_\varepsilon(\hat{\mathbb{P}}_N)$. Thus, we can use Theorem 3.4 to estimate the radius of the smallest Wasserstein ball that contains $\mathbb{P}$ with confidence $1-\beta$ for some prescribed $\beta\in(0,1)$. Indeed, equating the right-hand side of (7) to $\beta$ and solving for $\varepsilon$ yields
(8) $\varepsilon_N(\beta) := \begin{cases} \Big(\frac{\log(c_1\beta^{-1})}{c_2 N}\Big)^{1/\max\{m,2\}} & \text{if } N \ge \frac{\log(c_1\beta^{-1})}{c_2}, \\ \Big(\frac{\log(c_1\beta^{-1})}{c_2 N}\Big)^{1/a} & \text{if } N < \frac{\log(c_1\beta^{-1})}{c_2}. \end{cases}$
Note that the Wasserstein ball with radius $\varepsilon_N(\beta)$ can thus be viewed as a $1-\beta$ confidence set for the unknown true distribution $\mathbb{P}$ as in statistical testing; see also [7].
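The radius prescription (8) is straightforward to evaluate once the constants of Theorem 3.4 are fixed. The sketch below implements the two-regime formula; note that the constants $c_1$ and $c_2$ are problem-dependent and typically unknown in practice, so the default values used here are placeholders for illustration only:

```python
import math

def wasserstein_radius(N, beta, m, a, c1=1.0, c2=1.0):
    """A priori Wasserstein radius in the spirit of (8).

    N    : number of training samples
    beta : significance level (confidence 1 - beta)
    m    : dimension of the random vector
    a    : tail exponent from the light-tail assumption
    c1, c2: constants of the concentration inequality (placeholders here).
    """
    t = math.log(c1 / beta) / c2
    if N >= t:
        # Large-sample regime: radius shrinks like N^(-1/max(m, 2)).
        return (t / N) ** (1.0 / max(m, 2))
    # Small-sample regime: the tail exponent a governs the rate.
    return (t / N) ** (1.0 / a)

# The radius shrinks as N grows and grows as beta shrinks.
print(wasserstein_radius(100, 0.05, m=3, a=2))
```

Consistent with the discussion above, the radius decays only sublinearly in $N$ (the rate degrades with the dimension $m$), which is one reason the a priori choice tends to be conservative in practice.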
Theorem 3.5 (Finite sample guarantee).
Suppose that Assumption 3.3 holds and let $\beta\in(0,1)$. Then, the finite sample guarantee (2) holds for the certificate $\hat{J}_N$ defined as the optimal value of the distributionally robust program (5) with ambiguity set $\hat{\mathcal{P}}_N = \mathbb{B}_{\varepsilon_N(\beta)}(\hat{\mathbb{P}}_N)$.
Proof.
Theorem 3.4 and the choice of $\varepsilon_N(\beta)$ in (8) guarantee that $\mathbb{P}\in\mathbb{B}_{\varepsilon_N(\beta)}(\hat{\mathbb{P}}_N)$ with probability at least $1-\beta$. On this event we have $\mathbb{E}^{\mathbb{P}}[h(\hat{x}_N,\xi)] \le \sup_{\mathbb{Q}\in\hat{\mathcal{P}}_N} \mathbb{E}^{\mathbb{Q}}[h(\hat{x}_N,\xi)] \le \hat{J}_N$, and thus the claim follows. ∎
It is clear from (8) that for any fixed $\beta$, the radius $\varepsilon_N(\beta)$ tends to $0$ as $N$ increases. Moreover, one can show that if $\beta_N$ converges to zero at a carefully chosen rate, then the solution of the distributionally robust optimization problem (5) with ambiguity set $\mathbb{B}_{\varepsilon_N(\beta_N)}(\hat{\mathbb{P}}_N)$ converges to the solution of the original stochastic program (1) as $N$ tends to infinity. The following theorem formalizes this statement.
Theorem 3.6 (Asymptotic consistency).
Suppose that Assumption 3.3 holds and that $\beta_N\in(0,1)$, $N\in\mathbb{N}$, satisfies $\sum_{N=1}^\infty \beta_N < \infty$ and $\lim_{N\to\infty} \varepsilon_N(\beta_N) = 0$. (A possible choice is $\beta_N = \exp(-\sqrt{N})$.) Assume also that $\hat{J}_N$ and $\hat{x}_N$ represent the optimal value and an optimizer of the distributionally robust program (5) with ambiguity set $\hat{\mathcal{P}}_N = \mathbb{B}_{\varepsilon_N(\beta_N)}(\hat{\mathbb{P}}_N)$, $N\in\mathbb{N}$.


If $h(x,\xi)$ is upper semicontinuous in $\xi$ and there exists $L\ge0$ with $|h(x,\xi)| \le L(1+\|\xi\|)$ for all $x\in\mathbb{X}$ and $\xi\in\Xi$, then almost surely we have $\hat{J}_N \to J^\star$ as $N\to\infty$, where $J^\star$ is the optimal value of (1).
The proof of Theorem 3.6 will rely on the following technical lemma.
Lemma 3.7 (Convergence of distributions).
If Assumption 3.3 holds and $\beta_N\in(0,1)$, $N\in\mathbb{N}$, satisfies $\sum_{N=1}^\infty \beta_N < \infty$ and $\lim_{N\to\infty}\varepsilon_N(\beta_N) = 0$, then any sequence $\hat{\mathbb{Q}}_N\in\mathbb{B}_{\varepsilon_N(\beta_N)}(\hat{\mathbb{P}}_N)$, $N\in\mathbb{N}$, where $\hat{\mathbb{Q}}_N$ may depend on the training data, converges under the Wasserstein metric (and thus weakly) to $\mathbb{P}$ almost surely with respect to $\mathbb{P}^\infty$, that is, $\mathbb{P}^\infty\big\{ \lim_{N\to\infty} d_{\mathrm{W}}(\mathbb{P},\hat{\mathbb{Q}}_N) = 0 \big\} = 1.$
Proof.
As $\hat{\mathbb{Q}}_N\in\mathbb{B}_{\varepsilon_N(\beta_N)}(\hat{\mathbb{P}}_N)$, the triangle inequality for the Wasserstein metric ensures that $d_{\mathrm{W}}(\mathbb{P},\hat{\mathbb{Q}}_N) \le d_{\mathrm{W}}(\mathbb{P},\hat{\mathbb{P}}_N) + d_{\mathrm{W}}(\hat{\mathbb{P}}_N,\hat{\mathbb{Q}}_N) \le d_{\mathrm{W}}(\mathbb{P},\hat{\mathbb{P}}_N) + \varepsilon_N(\beta_N).$
Moreover, Theorem 3.4 implies that $\mathbb{P}^N\big\{ d_{\mathrm{W}}(\mathbb{P},\hat{\mathbb{P}}_N) > \varepsilon_N(\beta_N) \big\} \le \beta_N$. As $\sum_{N=1}^\infty \beta_N < \infty$, the Borel-Cantelli Lemma [28, Theorem 2.18] further implies that $\mathbb{P}^\infty\big\{ d_{\mathrm{W}}(\mathbb{P},\hat{\mathbb{P}}_N) \le \varepsilon_N(\beta_N) \text{ for all sufficiently large } N \big\} = 1.$
Finally, as $\lim_{N\to\infty}\varepsilon_N(\beta_N) = 0$, we conclude that $\lim_{N\to\infty} d_{\mathrm{W}}(\mathbb{P},\hat{\mathbb{Q}}_N) = 0$ almost surely. Note that convergence with respect to the Wasserstein metric implies weak convergence [10]. ∎
Proof of Theorem 3.6.
As $\hat{x}_N$ is feasible in (1), we have $J^\star \le \mathbb{E}^{\mathbb{P}}[h(\hat{x}_N,\xi)]$. Moreover, Theorem 3.5 implies that $\mathbb{P}^N\big\{ \mathbb{E}^{\mathbb{P}}[h(\hat{x}_N,\xi)] \le \hat{J}_N \big\} \ge 1-\beta_N$ for all $N\in\mathbb{N}$. As $\sum_{N=1}^\infty \beta_N < \infty$, the Borel-Cantelli Lemma further implies that $\mathbb{P}^\infty\big\{ J^\star \le \hat{J}_N \text{ for all sufficiently large } N \big\} = 1.$
To prove assertion 1, it thus remains to be shown that $\limsup_{N\to\infty} \hat{J}_N \le J^\star$ with probability 1. As $h(x,\xi)$ is upper semicontinuous and grows at most linearly in $\xi$, there exists a non-increasing sequence of functions $h_j(x,\xi)$, $j\in\mathbb{N}$, such that $\lim_{j\to\infty} h_j(x,\xi) = h(x,\xi)$, and $h_j(x,\xi)$ is Lipschitz continuous in $\xi$ for any fixed $x\in\mathbb{X}$ and $j\in\mathbb{N}$ with Lipschitz constant $L_j$; see Lemma A.1 in the appendix. Next, choose any $\delta>0$, fix a $\delta$-optimal decision $x_\delta\in\mathbb{X}$ for (1) with $\mathbb{E}^{\mathbb{P}}[h(x_\delta,\xi)] \le J^\star + \delta$, and for every $N$ let $\hat{\mathbb{Q}}_N\in\hat{\mathcal{P}}_N$ be a $\delta$-optimal distribution corresponding to $x_\delta$ with
Then, we have
where the second inequality holds because $h_j$ converges from above to $h$, and the third inequality follows from Theorem 3.2. Moreover, the almost sure equality holds due to Lemma 3.7, and the last equality follows from the Monotone Convergence Theorem [30, Theorem 5.5], which applies because $\mathbb{E}^{\mathbb{P}}[h_1(x_\delta,\xi)] < \infty$. Indeed, recall that $\mathbb{P}$ has an exponentially decaying tail due to Assumption 3.3 and that $h_1(x_\delta,\xi)$ is Lipschitz continuous in $\xi$. As $\delta$ was chosen arbitrarily, we thus conclude that $\limsup_{N\to\infty} \hat{J}_N \le J^\star$ almost surely.
To prove assertion 2, fix an arbitrary realization of the stochastic process $\{\hat{\xi}_i\}_{i\in\mathbb{N}}$ such that $\lim_{N\to\infty} \hat{J}_N = J^\star$ and $\mathbb{P}\in\hat{\mathcal{P}}_N$ for all sufficiently large $N$. From the proof of assertion 1 we know that these two conditions are satisfied almost surely. Using these assumptions, one easily verifies that
(9) $\limsup_{N\to\infty} \mathbb{E}^{\mathbb{P}}\big[h(\hat{x}_N,\xi)\big] \le \lim_{N\to\infty} \hat{J}_N = J^\star.$
Next, let $x^\star$ be an accumulation point of the sequence $\{\hat{x}_N\}_{N\in\mathbb{N}}$, and note that $x^\star\in\mathbb{X}$ as $\mathbb{X}$ is closed. By passing to a subsequence, if necessary, we may assume without loss of generality that $\lim_{N\to\infty} \hat{x}_N = x^\star$. Thus,
where the first inequality exploits that $x^\star$ is feasible in (1), the second inequality follows from the lower semicontinuity of $h(x,\xi)$ in $x$, the third inequality holds due to Fatou's lemma (which applies because $h(x,\xi)$ grows at most linearly in $\xi$), and the last inequality follows from (9). Therefore, we have $\mathbb{E}^{\mathbb{P}}[h(x^\star,\xi)] = J^\star$, that is, $x^\star$ is an optimizer of (1). ∎
In the following we show that all assumptions of Theorem 3.6 are necessary for asymptotic convergence, that is, relaxing any of these conditions can invalidate the convergence result.
Example 1 (Necessity of regularity conditions).

[itemsep = 1mm, topsep = 1mm]
A convergence result akin to Theorem 3.6 for goodness-of-fit-based ambiguity sets is discussed in [7, Section 4]. This result is complementary to Theorem 3.6. Indeed, Theorem 3.6(i) requires $h(x,\xi)$ to be upper semicontinuous in $\xi$, which is a necessary condition in our setting (see Example 1) that is absent in [7]. Moreover, Theorem 3.6(ii) only requires $h(x,\xi)$ to be lower semicontinuous in $x$, while [7] asks for equicontinuity of this mapping. This stronger requirement provides a stronger result, that is, the almost sure convergence of the objective function of (5) to that of (1) uniformly in $x$ on any compact subset of $\mathbb{X}$.
Theorems 3.5 and 3.6 indicate that a careful a priori design of the Wasserstein ball results in attractive finite sample and asymptotic guarantees for the distributionally robust solutions. In practice, however, setting the Wasserstein radius to $\varepsilon_N(\beta)$ yields over-conservative solutions.
While Theorems 3.5 and 3.6 provide strong theoretical justification for using Wasserstein ambiguity sets, in practice it is prudent to calibrate the Wasserstein radius via bootstrapping or cross-validation instead of using the conservative a priori bound $\varepsilon_N(\beta)$; see Section 7.2 for further details. A similar approach has been advocated in [7] to determine the sizes of ambiguity sets that are constructed via goodness-of-fit tests.
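The cross-validation idea can be sketched as follows. The one-dimensional "DRO solver" and the loss below are hypothetical stand-ins (a closed-form shift of the sample mean), chosen only so that the calibration loop is self-contained; the actual procedure used in Section 7.2 of the paper may differ:

```python
import numpy as np

rng = np.random.default_rng(1)

def dro_decision(samples, eps):
    # Hypothetical stand-in: shift the SAA decision by a conservatism
    # term of size eps. A real implementation would solve the DRO program.
    return samples.mean() + eps

def out_of_sample_cost(x, samples):
    # Hypothetical asymmetric newsvendor-style cost, illustrative only.
    return np.maximum(2.0 * (samples - x), x - samples).mean()

data = rng.normal(size=60)
candidate_radii = [0.0, 0.05, 0.1, 0.2, 0.5]
folds = np.array_split(rng.permutation(data), 5)

# 5-fold cross-validation: fit on four folds, score on the held-out fold.
scores = []
for eps in candidate_radii:
    fold_costs = []
    for k in range(5):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        fold_costs.append(out_of_sample_cost(dro_decision(train, eps), val))
    scores.append(np.mean(fold_costs))

best_eps = candidate_radii[int(np.argmin(scores))]
print(best_eps)
```

The selected radius balances the in-sample fit of the SAA solution ($\varepsilon=0$) against the robustness of heavily regularized decisions (large $\varepsilon$), mirroring the trade-off discussed above.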
So far we have seen that the Wasserstein metric allows us to construct ambiguity sets with favorable asymptotic and finite sample guarantees. In the remainder of the paper we will further demonstrate that the distributionally robust optimization problem (5) with a Wasserstein ambiguity set (6) is not significantly harder to solve than the corresponding SAA problem (4).
4. Solving WorstCase Expectation Problems
We now demonstrate that the inner worst-case expectation problem in (5) over the Wasserstein ambiguity set (6) can be reformulated as a finite convex program for many loss functions of practical interest. For ease of notation, throughout this section we suppress the dependence on the decision variable $x$. Thus, we examine a generic worst-case expectation problem
(10) $\sup_{\mathbb{Q}\in\mathbb{B}_\varepsilon(\hat{\mathbb{P}}_N)} \mathbb{E}^{\mathbb{Q}}\big[\ell(\xi)\big]$
involving a decision-independent loss function $\ell(\xi) := \max_{k\le K} \ell_k(\xi)$, which is defined as the pointwise maximum of more elementary measurable functions $\ell_k(\xi)$, $k\le K$. The focus on loss functions representable as pointwise maxima is nonrestrictive unless we impose some structure on the functions $\ell_k$. Many tractability results in the remainder of this paper are predicated on the following convexity assumption.
Assumption 4.1 (Convexity).
The uncertainty set $\Xi$ is convex and closed, and the negative constituent functions $-\ell_k$ are proper, convex, and lower semicontinuous for all $k\le K$. Moreover, we assume that $\ell_k$ is not identically $-\infty$ on $\Xi$ for all $k\le K$.
Assumption 4.1 essentially stipulates that $\ell(\xi)$ can be written as a pointwise maximum of concave functions. As we will showcase in Section 5, this mild restriction does not sacrifice much modeling power. Moreover, generalizations of this setting will be discussed in Section 6. We proceed as follows. Subsection 4.1 addresses the reduction of (10) to a finite convex program, while Subsection 4.2 describes a technique for constructing worst-case distributions.
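To make this structure concrete, the snippet below builds a loss satisfying Assumption 4.1 as a pointwise maximum of affine (hence concave) pieces. For the special case $\Xi = \mathbb{R}$ with the absolute-value transport cost, the worst-case expectation reduces to the sample average plus $\varepsilon$ times the Lipschitz modulus of the loss, a standard regularization identity for this special case; the numbers and the construction of an attaining distribution are illustrative assumptions.

```python
import numpy as np

# Toy piecewise-affine loss l(xi) = max_k (a[k]*xi + b[k]); each constituent
# l_k is affine, hence -l_k is convex, so Assumption 4.1 holds with Xi = R.
a = np.array([2.0, -1.0, 0.5])
b = np.array([0.0, 1.0, -0.5])

def loss(xi):
    return (np.multiply.outer(xi, a) + b).max(axis=-1)

samples = np.array([-1.0, 0.3, 2.0, 0.7])
N, eps = 4, 0.4

saa = loss(samples).mean()
lip = np.abs(a).max()            # Lipschitz modulus of the loss on R
worst_case = saa + eps * lip     # closed form for Xi = R (sketch)

# An attaining distribution: transport the largest sample (which already
# sits on the steepest piece) a distance N*eps to the right; the induced
# Wasserstein cost is (1/N) * N * eps = eps.
shifted = samples.copy()
shifted[np.argmax(samples)] += N * eps
print(loss(shifted).mean(), worst_case)   # both equal 2.825 here
```

The equality of the two printed values shows that for this instance the closed-form bound is tight and is achieved by shifting a single atom of the empirical distribution.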
4.1. Reduction to a Finite Convex Program
The worst-case expectation problem (10) constitutes an infinite-dimensional optimization problem over probability distributions and thus appears to be intractable. However, we will now demonstrate that (10) can be re-expressed as a finite-dimensional convex program by leveraging tools from robust optimization.
Theorem 4.2 (Convex reduction).
If Assumption 4.1 holds, then for any $\varepsilon \ge 0$ the worst-case expectation (10) equals the optimal value of the finite convex program
(11) $\left\{\begin{array}{cll} \inf\limits_{\lambda,\, s_i,\, z_{ik},\, \nu_{ik}} & \lambda\varepsilon + \dfrac{1}{N}\displaystyle\sum_{i=1}^N s_i & \\ \text{s.t.} & [-\ell_k]^*(z_{ik}-\nu_{ik}) + \sigma_\Xi(\nu_{ik}) - \langle z_{ik},\, \hat\xi_i\rangle \le s_i & \forall i\le N,\ \forall k\le K \\ & \|z_{ik}\|_* \le \lambda & \forall i\le N,\ \forall k\le K. \end{array}\right.$
Recall that $[-\ell_k]^*(z_{ik}-\nu_{ik})$ denotes the conjugate of $-\ell_k$ evaluated at $z_{ik}-\nu_{ik}$ and $\|z_{ik}\|_*$ the dual norm of $z_{ik}$. Moreover, $\chi_\Xi$ represents the characteristic function of $\Xi$ and $\sigma_\Xi$ its conjugate, that is, the support function of $\Xi$.
Proof of Theorem 4.2.
By using Definition 3.1 we can re-express the worst-case expectation (10) as
$\sup_{\mathbb{Q}\in\mathbb{B}_\varepsilon(\hat{\mathbb{P}}_N)} \mathbb{E}^{\mathbb{Q}}[\ell(\xi)] \;=\; \sup_{\Pi} \Big\{ \int_{\Xi^2} \ell(\xi)\,\Pi(\mathrm{d}\xi,\mathrm{d}\hat\xi) \;:\; \int_{\Xi^2} \|\xi-\hat\xi\|\,\Pi(\mathrm{d}\xi,\mathrm{d}\hat\xi) \le \varepsilon \Big\} \;=\; \left\{\begin{array}{cl} \sup\limits_{\mathbb{Q}_i\in\mathcal{M}(\Xi)} & \dfrac{1}{N}\displaystyle\sum_{i=1}^N \int_\Xi \ell(\xi)\,\mathbb{Q}_i(\mathrm{d}\xi) \\ \text{s.t.} & \dfrac{1}{N}\displaystyle\sum_{i=1}^N \int_\Xi \|\xi-\hat\xi_i\|\,\mathbb{Q}_i(\mathrm{d}\xi) \le \varepsilon, \end{array}\right.$
where $\Pi$ ranges over all joint distributions of $\xi$ and $\hat\xi$ under which $\hat\xi$ has marginal distribution $\hat{\mathbb{P}}_N$. The second equality follows from the law of total probability, which asserts that any such joint probability distribution $\Pi$ can be constructed from the marginal distribution $\hat{\mathbb{P}}_N$ of $\hat\xi$ and the conditional distributions $\mathbb{Q}_i$ of $\xi$ given $\hat\xi = \hat\xi_i$, $i\le N$, that is, we may write $\Pi = \frac{1}{N}\sum_{i=1}^N \delta_{\hat\xi_i} \otimes \mathbb{Q}_i$. The resulting optimization problem represents a generalized moment problem in the distributions $\mathbb{Q}_i$, $i\le N$. Using a standard duality argument, we obtain
(12a) $\;\le\; \inf_{\lambda\ge 0}\, \sup_{\mathbb{Q}_i\in\mathcal{M}(\Xi)}\ \lambda\varepsilon + \dfrac{1}{N}\displaystyle\sum_{i=1}^N \int_\Xi \big(\ell(\xi) - \lambda\|\xi-\hat\xi_i\|\big)\,\mathbb{Q}_i(\mathrm{d}\xi)$
(12b) $\;=\; \inf_{\lambda\ge 0}\ \lambda\varepsilon + \dfrac{1}{N}\displaystyle\sum_{i=1}^N \sup_{\xi\in\Xi}\big(\ell(\xi) - \lambda\|\xi-\hat\xi_i\|\big),$
where (12a) follows from the max-min inequality, and (12b) follows from the fact that $\mathcal{M}(\Xi)$ contains all the Dirac distributions supported on $\Xi$. Introducing epigraphical auxiliary variables $s_i$, $i\le N$, allows us to reformulate (12b) as
(12c) $\left\{\begin{array}{cll} \inf\limits_{\lambda\ge0,\, s_i} & \lambda\varepsilon + \dfrac{1}{N}\displaystyle\sum_{i=1}^N s_i & \\ \text{s.t.} & \sup_{\xi\in\Xi}\big(\ell(\xi) - \lambda\|\xi-\hat\xi_i\|\big) \le s_i & \forall i\le N \end{array}\right.$
(12d) $\left\{\begin{array}{cll} \inf\limits_{\lambda\ge0,\, s_i} & \lambda\varepsilon + \dfrac{1}{N}\displaystyle\sum_{i=1}^N s_i & \\ \text{s.t.} & \sup_{\xi\in\Xi}\ \min_{\|z_{ik}\|_*\le\lambda} \big(\ell_k(\xi) - \langle z_{ik},\, \xi - \hat\xi_i\rangle\big) \le s_i & \forall i\le N,\ \forall k\le K \end{array}\right.$
(12e) $\left\{\begin{array}{cll} \inf\limits_{\lambda\ge0,\, s_i} & \lambda\varepsilon + \dfrac{1}{N}\displaystyle\sum_{i=1}^N s_i & \\ \text{s.t.} & \min_{\|z_{ik}\|_*\le\lambda}\ \sup_{\xi\in\Xi} \big(\ell_k(\xi) - \langle z_{ik},\, \xi - \hat\xi_i\rangle\big) \le s_i & \forall i\le N,\ \forall k\le K \end{array}\right.$
Equality (12d) exploits the definition of the dual norm and the decomposability of $\ell$ into its constituents $\ell_k$, $k\le K$. Interchanging the maximization over $z_{ik}$ with the minus sign (thereby converting the maximization to a minimization) and then with the maximization over $\xi$ leads to a restriction of the feasible set of (12d). The resulting upper bound (12e) can be re-expressed as
(12f) $\left\{\begin{array}{cll} \inf\limits_{\lambda,\, s_i,\, z_{ik}} & \lambda\varepsilon + \dfrac{1}{N}\displaystyle\sum_{i=1}^N s_i & \\ \text{s.t.} & [-\ell_k + \chi_\Xi]^*(z_{ik}) - \langle z_{ik},\, \hat\xi_i\rangle \le s_i & \forall i\le N,\ \forall k\le K \\ & \|z_{ik}\|_* \le \lambda & \forall i\le N,\ \forall k\le K, \end{array}\right.$
where (12f) follows from the definition of conjugacy, our conventions of extended arithmetic, and the substitution of $z_{ik}$ with $-z_{ik}$. Note that (12f) is already a finite convex program.
Next, we show that Assumption 4.1 reduces the inequalities (12a) and (12e) to equalities. Under Assumption 4.1, the inequality (12a) is in fact an equality for any $\varepsilon > 0$ by virtue of an extended version of a well-known strong duality result for moment problems [44, Proposition 3.4]. One can show that (12a) continues to hold as an equality even for $\varepsilon = 0$, in which case the Wasserstein ambiguity set (6) reduces to the singleton $\{\hat{\mathbb{P}}_N\}$, while (10) reduces to the sample average $\frac{1}{N}\sum_{i=1}^N \ell(\hat\xi_i)$. Indeed, for $\varepsilon = 0$ the variable $\lambda$ in (12b) can be increased indefinitely at no penalty. As $\ell$ constitutes a pointwise maximum of upper semicontinuous concave functions, an elementary but tedious argument shows that (12b) converges to the sample average as $\lambda$ tends to infinity.
The inequality (12e) also reduces to an equality under Assumption 4.1 thanks to the classical minimax theorem [4, Proposition 5.5.4], which applies because the set $\{z_{ik} : \|z_{ik}\|_* \le \lambda\}$ is compact for any finite $\lambda \ge 0$. Thus, the optimal values of (10) and (12f) coincide.
Assumption 4.1 further implies that the function $-\ell_k + \chi_\Xi$ is proper, convex and lower semicontinuous. Properness holds because $\ell_k$ is not identically $-\infty$ on $\Xi$. By [42, Theorem 11.23(a), p. 493], its conjugate essentially coincides with the epi-addition (also known as inf-convolution) of the conjugates of the functions $-\ell_k$ and $\chi_\Xi$. Thus,
$[-\ell_k + \chi_\Xi]^*(z_{ik}) \;=\; \operatorname{cl}\Big[\inf_{\nu_{ik}} \big( [-\ell_k]^*(z_{ik} - \nu_{ik}) + \sigma_\Xi(\nu_{ik}) \big)\Big],$
where $\operatorname{cl}$ denotes the closure operator that maps any function to its largest lower semicontinuous minorant. As $\inf f = \inf \operatorname{cl} f$ for any function $f$, we may conclude that (12f) is indeed equivalent to (11) under Assumption 4.1. ∎
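The reduction established in the proof can be checked numerically on a small instance. The sketch below instantiates the finite reformulation for a piecewise-affine loss on an interval, where the semi-infinite constraint of (12c) is concave and piecewise linear in the uncertain parameter and can therefore be imposed at finitely many kink points, and compares its optimal value against a direct, grid-supported version of the primal problem. The instance, the kink-point reduction, and the grid choice are illustrative assumptions for this special case.

```python
import numpy as np
from scipy.optimize import linprog

# Piecewise-affine loss l(xi) = max_k (a[k]*xi + b[k]) on Xi = [lo, hi];
# all numbers below are illustrative choices.
a, b = np.array([2.0, -1.0]), np.array([0.0, 1.0])
lo, hi = -1.0, 2.0
samples = np.array([0.0, 0.5, 1.5])
eps, N, K = 0.1, 3, 2

def loss(x):
    return (np.multiply.outer(x, a) + b).max(axis=-1)

# --- Dual reformulation as a finite LP in x = (lam, s_1, ..., s_N).
# The constraint sup_{xi in [lo,hi]} (a_k*xi + b_k - lam*|xi - xi_i|) <= s_i
# is concave piecewise linear in xi, so imposing it at {lo, xi_i, hi} suffices.
A_ub, b_ub = [], []
for i, xi_i in enumerate(samples):
    for k in range(K):
        for xc in (lo, xi_i, hi):
            row = np.zeros(1 + N)
            row[0] = -abs(xc - xi_i)      # coefficient of lam
            row[1 + i] = -1.0             # coefficient of s_i
            A_ub.append(row)
            b_ub.append(-(a[k] * xc + b[k]))
c = np.concatenate(([eps], np.full(N, 1.0 / N)))
dual = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
               bounds=[(0, None)] + [(None, None)] * N, method="highs")

# --- Primal check: optimize over distributions supported on a grid that
# contains the samples and the endpoints (enough for exactness here).
grid = np.unique(np.concatenate(([lo, hi], samples)))
G = len(grid)
cost = np.abs(grid[None, :] - samples[:, None]).ravel()   # transport costs
A_eq = np.kron(np.eye(N), np.ones(G))                     # marginals = 1/N
primal = linprog(-np.tile(loss(grid), N),
                 A_ub=cost[None, :], b_ub=[eps],
                 A_eq=A_eq, b_eq=np.full(N, 1.0 / N),
                 bounds=(0, None), method="highs")

print(dual.fun, -primal.fun)   # both approx. 1.8667 for these numbers
```

The agreement of the two optimal values illustrates that the dualization in the proof is lossless for this instance.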
Note that the semi-infinite inequality in (12c) generalizes the nonlinear uncertain constraints studied in [1] because it involves an additional norm term and because the loss function $\ell$ is not necessarily concave under Assumption 4.1. As in [1], however, the semi-infinite constraint admits a robust counterpart that involves the conjugate of the loss function and the support function of the uncertainty set.
From the proof of Theorem 4.2 it is immediately clear that the worst-case expectation (10) is conservatively approximated by the optimal value of the finite convex program (12f) even if Assumption 4.1 fails to hold. In this case the sum in (12f) must be evaluated under our conventions of extended arithmetic, whereby $\infty - \infty = \infty$. These observations are formalized in the following corollary.
4.2. Extremal Distributions
Stress test experiments are instrumental in assessing the quality of candidate decisions in stochastic optimization. Meaningful stress tests require a good understanding of the extremal distributions from within the Wasserstein ball that achieve the worst-case expectation (10) for various loss functions. We now show that such extremal distributions can be constructed systematically from the solution of a convex program akin to (11).
Theorem 4.4 (Worst-case distributions).
If Assumption 4.1 holds, then the worst-case expectation (10) coincides with the optimal value of the finite convex program
(13) $\left\{\begin{array}{cll} \sup\limits_{\alpha_{ik},\, q_{ik}} & \dfrac{1}{N}\displaystyle\sum_{i=1}^N \sum_{k=1}^K \alpha_{ik}\, \ell_k\Big(\hat\xi_i - \dfrac{q_{ik}}{\alpha_{ik}}\Big) & \\ \text{s.t.} & \dfrac{1}{N}\displaystyle\sum_{i=1}^N \sum_{k=1}^K \|q_{ik}\| \le \varepsilon & \\ & \displaystyle\sum_{k=1}^K \alpha_{ik} = 1 & \forall i\le N \\ & \alpha_{ik} \ge 0,\quad \hat\xi_i - \dfrac{q_{ik}}{\alpha_{ik}} \in \Xi & \forall i\le N,\ \forall k\le K \end{array}\right.$
irrespective of whether its supremum is attained. Let $\{\alpha_{ik}(r), q_{ik}(r)\}_{r\in\mathbb{N}}$ be a sequence of feasible decisions whose objective values converge to the supremum of (13). Then, the discrete probability distributions
$\mathbb{Q}_r \;:=\; \dfrac{1}{N}\displaystyle\sum_{i=1}^N \sum_{k=1}^K \alpha_{ik}(r)\, \delta_{\xi_{ik}(r)}, \qquad \text{where } \xi_{ik}(r) := \hat\xi_i - \dfrac{q_{ik}(r)}{\alpha_{ik}(r)},$
belong to the Wasserstein ball $\mathbb{B}_\varepsilon(\hat{\mathbb{P}}_N)$ and attain the supremum of (10) asymptotically, i.e.,
$\lim_{r\to\infty}\; \mathbb{E}^{\mathbb{Q}_r}\big[\ell(\xi)\big] \;=\; \sup_{\mathbb{Q}\in\mathbb{B}_\varepsilon(\hat{\mathbb{P}}_N)} \mathbb{E}^{\mathbb{Q}}\big[\ell(\xi)\big].$
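For intuition, the following sketch instantiates the (α, q) parametrization of Theorem 4.4 on a small piecewise-affine example where the supremum happens to be attained without a limiting sequence: each atom is assigned to one constituent and at most one atom is transported. The particular values of α and q were derived by hand for this toy instance and are purely illustrative.

```python
import numpy as np

# Worst-case distribution in the spirit of Theorem 4.4 for the toy loss
# l(xi) = max(2*xi, 1 - xi) on Xi = [-1, 2] with three samples
# (illustrative numbers; the (alpha, q) values below were hand-derived).
a, b = np.array([2.0, -1.0]), np.array([0.0, 1.0])
samples = np.array([0.0, 0.5, 1.5])
N, eps = 3, 0.1

# alpha[i, k]: probability mass (relative to atom i) assigned to piece k;
# q[i, k]: mass-weighted displacement of that part of the atom.
alpha = np.array([[0.0, 1.0],    # atom at 0.0 stays put (piece 1 - xi active)
                  [1.0, 0.0],    # atom at 0.5 moves right along the 2*xi piece
                  [1.0, 0.0]])   # atom at 1.5 stays put
q = np.array([[0.0, 0.0],
              [-0.3, 0.0],       # new location: 0.5 - (-0.3)/1.0 = 0.8
              [0.0, 0.0]])

shift = np.divide(q, alpha, out=np.zeros_like(q), where=alpha > 0)
support = samples[:, None] - shift             # points xi_ik = xi_i - q_ik/alpha_ik
transport_cost = np.abs(q).sum() / N           # (1/N) sum_ik |q_ik|  ->  0.1 = eps
value = (alpha * (a * support + b)).sum() / N  # objective of (13)

saa = (np.multiply.outer(samples, a) + b).max(axis=-1).mean()
# Here the full budget travels along the slope-2 piece, so the worst-case
# value equals the sample average plus (Lipschitz modulus 2) * eps.
print(transport_cost, value, saa + 2 * eps)
```

The check confirms that this feasible (α, q) pair exhausts the transport budget and matches the worst-case value predicted by the reformulation for this instance.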