Distributionally Robust Optimization with Confidence Bands for Probability Density Functions

# Distributionally Robust Optimization with Confidence Bands for Probability Density Functions

Xi Chen        Qihang Lin        Guanglin Xu Stein School of Business, New York University, New York City, NY, 10012, USA. Email: xichen@nyu.eduTippie College of Business, University of Iowa, Iowa City, IA 52245, USA. Email: qihang-lin@uiowa.eduInstitute for Mathematics and its Applications, University of Minnesota, Minneapolis, MN, 55455, USA. Email: gxu@umn.edu.
###### Abstract

Distributionally robust optimization (DRO) has been introduced for solving stochastic programs where the distribution of the random parameters is unknown and must be estimated by samples from that distribution. A key element of DRO is the construction of the ambiguity set, which is a set of distributions that covers the true distribution with a high probability. Assuming that the true distribution has a probability density function, we propose a class of ambiguity sets based on confidence bands of the true density function. The use of the confidence band enables us to take the prior knowledge of the shape of the underlying density function into consideration (e.g., unimodality or monotonicity). Using the confidence band constructed by density estimation techniques as the ambiguity set, we establish the convergence of the optimal value of DRO to that of the stochastic program as the sample size increases. However, the resulting DRO problem is computationally intractable, as it involves functional decision variables as well as infinitely many constraints. To address this challenge, using the duality theory, we reformulate it into a finite-dimensional stochastic program, which is amenable to a stochastic subgradient scheme as a solution method. We compare our approach with existing state-of-the-art DRO methods on the newsvendor problem and the portfolio management problem, and the numerical results showcase the advantage of our approach.

Keywords: Distributionally robust optimization; first-order method; confidence band; data-driven ambiguity sets

## 1 Introduction

The goal of stochastic programming (SP) is to minimize the expectation of an objective function that depends on both decision variables and some random parameters. Assuming that the random parameters follow a distribution denoted by , a stochastic program can be formulated as follows:

 v⋆:=infx∈X{EP⋆[f(x,ξ)]:=∫Rmf(x,ξ)P⋆(dξ)}, (1)

where is the vector of decision variables, is the feasible set, is the vector of random variables taking values in with the distribution , and is the objective function. SP has been actively studied for several decades: see [5], [41] and references thereafter. Suppose is absolutely continuous and thus has a probability density function with respect to the Lebesgue measure. In other words, we have such that for any Borel set . We can further write (1) as follows:

 v⋆=infx∈X{Ep⋆[f(x,ξ)]:=∫Rmf(x,ξ)p⋆(ξ)dξ}. (2)

Despite of their popularity, (2) is often challenging to solve since the distribution or the density of is rarely known in real-life applications. When a set of historical observations of is collected, one may solve the approximation of (1) by replacing with an estimated distribution from the dataset, e.g., the empirical distribution. However, due to the approximation error, the decision obtained from the approximate problem may be of inferior quality and thus may have an undesirable out-of-sample performance; see [2, 13, 29]. An alternative approach for solving (2) with an unknown distribution is distributionally robust optimization (DRO), in which one constructs an ambiguity set consisting of all distributions that are likely to be and then minimizes the expectation of the objective function over the worst-case distribution from the ambiguity set. In particular, letting be the ambiguity set, we can formulate DRO as follows:

 infx∈XsupP∈D{EP[f(x,ξ)]:=∫Rmf(x,ξ)P(dξ)}. (3)

Scarf first proposed this model for a newsvendor problem [39], and it has been extensively studied over the past in operations research and operations management community since then.

The ambiguity sets constructed by most of the aforementioned approaches contain distributions that are not absolutely continuous. In fact, as shown in many of the works, the worst-case distribution in their ambiguity set corresponding to the optimal decision of (3) is discrete. However, in some applications, the vector of random parameter is known to be absolutely continuous (e.g., when models the price of electricity or the return of securities). In this case, by solving (3) with these ambiguity sets, one may obtain a solution that is hedging against a discrete distribution that will never be the true distribution. This phenomenon potentially leads to over-conservative decisions in the DRO problems.

In this paper, we consider the situation where is absolutely continuous and propose a family of ambiguity sets consisting of only absolutely continuous distributions, or equivalently, distributions with density functions. With such an ambiguity set, the DRO model corresponding to (2) is given as:

 infx∈Xsupp∈D{Ep[f(x,ξ)]:=∫Rmf(x,ξ)p(ξ)dξ}. (4)

Thus, the worst-case distribution corresponding to the optimal solution of (4) will be absolutely continuous.

We note that the ambiguity set of density functions has also been considered by [30] and [8]. While their ambiguity sets contain only polynomial density functions, our approach does not require for similar restrictions. We also note that the ambiguity set considered in [30] utilizes the kernel density estimation, as does one of our ambiguity sets. While their method must specify the Legendre polynomial series density estimator, our method allows for using a broader family of kernel density estimations. Moreover, our ambiguity set is constructed with data samples and fully utilizes shape information of the density function (e.g., unimodality or monotonicity), while the ambiguity set constructed in [8] is not based on a data-driven approach. Indeed, their ambiguity set is constructed using other prior knowledge on (e.g., moment information).

Other works closely related to our method include [25, 27]. Li and Jiang consider an ambiguity set with moment and generalized unimodal constraints [27]. Lam imposes convexity constraints on the tail of the density function [25]. The main difference between our approach and theirs is that our methods use the shape information on and a dataset generated by to construct the ambiguity set, while theirs does not use data samples and directly impose shape information as constraints in their optimization problems.

In the rest of the paper, we first propose the generic DRO in Section 2, followed by the construction of our data-driven ambiguity sets by using density estimation techniques from the statistics literature. In particular, we will present two classes of ambiguity sets and showcase their convergence to the true density function and further prove the convergence of the optimal value of (4) to the optimal objective value of the SP in (2) as the sample size increases to infinity; see detail in Section 3. The setting of our ambiguity set gives rise to a challenging problem to solve, as the resulting optimization problem (4) involves functional decision variables (i.e., the density ) and infinitely many constraints. In Section 4, taking the exploitation of the special structure of our ambiguity set, we show that (4) can be reformulated into a finite-dimensional convex stochastic program, which is amenable to an efficient stochastic first-order method approach as the solution method. Finally, we validate our approach with a newsvendor problem and a portfolio management problem in Section 5. The numerical results demonstrate that our approach can generate decisions with superior out-of-sample performances, especially when the number of observations is limited.

### 1.1 Literature review

In the existing literature, different approaches have been utilized to construct ambiguity sets. We briefly review some popular approaches as follows:

• A moment-based ambiguity set is often constructed to consist of all distributions that share common marginal or cross moments; see [4, 6, 7, 9, 8, 11, 12, 16, 17, 27, 31, 43, 46, 47]. DRO problems with moment-based ambiguity sets in the format of (4) can usually be reformulated into tractable conic programs, e.g., second-order cone programming problems or semidefinite programming problems. However, the constructed ambiguity sets are not guaranteed to converge to the true distribution , as the size of the historical observations increases to infinity, although the estimations of the moments of the random variables are guaranteed to converge to their true values.

• A distance-based ambiguity set is constructed by using some distance function to measure the distance between two distributions in the probability space. In fact, such an ambiguity set can be considered as a ball centered at a reference distribution, e.g., the empirical distribution, in the space of probability distributions. The distance functions considered in the literature include Kullback-Leibler divergence [20, 22], -divergence[1, 10, 24], Prohorov metric [12], empirical Burg-entropy divergence balls [25], and Wasserstein metric [34, 45, 13, 15, 14]. Many distance-based ambiguity sets have both asymptotic and finite-sample convergences. However, there is evidence that the resulting DRO problems have the tendency to be more challenging to solve compared to their counterparts with moment-based ambiguity sets; see [13].

• More recently, hypothesis-test-based ambiguity sets have been proposed; see [2, 3]. Based on a hypothesis test, e.g., -test, -test, etc., and a confidence level, these approaches construct ambiguity sets consisting of the distributions that pass the hypothesis test with a given set of historical data. The methods that we use in this paper belong to this category.

• Likelihood approaches are also considered to construct ambiguity sets in the literature; see, e.g.,  [10, 26, 35, 44]. The likelihood approaches construct ambiguity sets consisting of all distributions that make a set of observations to achieve a certain level of likelihood.

### 1.2 Notation and terminology

Let denote the Euclidean projection operator on to the set , i.e., Let be the indicator function that equals one when and zero when . Unless stated otherwise, the terms “almost every”, “measurable”, and “integrable” are defined with respect to the Lebesgue sense. For an extended-real valued function on , let be its epigraph, be its domain, be its subdifferential, and be its any subgradient. We call a multifunction if it maps a point in to a subset of . A multifunction is closed valued if is a closed subset of for every and is measurable if, for every closed set , the set is measurable. We define . A mapping is called a measurable (integrable) selection of if it is measurable (integrable) and for every . We use to represent the set .

## 2 Data-Driven Distributionally Robust Optimization

In this paper, we consider an ambiguity set that consists of the density functions whose value is between two known non-negative functions. To construct such a set, we assume that there exists a set of independent realizations of the random variable (i.e., samples from ) denoted by

 ˆΞN:={^ξ1,…,^ξN}⊆Ξ.

Then, for a given , we construct two functions and based on , , and some prior knowledge on (e.g., its shape property) such that

 P{lα(ξ)≤p∗(ξ)≤uα(ξ), ∀ξ∈[a,b]}≥1−α. (5)

We call the pair the confidence bands for the density functions at a confidence level of and is called the significance level. We will introduce two methods to construct such a ambiguity set in Section 3.

Using , we can construct an ambiguity set that contains with a confidence level of . More specifically, let be the space of all non-negative Lebesgue-measurable functions on . We consider the following ambiguity set:

 D(ˆΞN,α):={p∈L∣∣lα(ξ)≤p(ξ)≤uα(ξ), ∀ξ∈Ξ,∫Ξp(ξ)dξ=1}, (6)

which satisfies according to (5).

With , we can specify the distributionally robust optimization problem in (4) as

 v∗D(ˆΞN,α):=infx∈Xsupp∈D(ˆΞN,α)∫Ξf(x,ξ)p(ξ)dξ. (7)

Immediately, we have the following result for the optimal value of (7).

###### Proposition 1.

Suppose the minimal objective value of (7) is finite and achieved at . Let . We have

###### Proof.

Proof. Whenever , we have . By the construction of the ambiguity set , we have

## 3 Data-Driven Ambiguity Sets

In this section, we present two existing methods from the statistics literature to construct a confidence band for a density function based on observed data. The first method is applicable to only univariate distributions while the second one is applicable to multivariate distributions. Although we only described two confidence bands as the specific instances for constructing the ambiguity set in (6), the optimization method we propose in Section 4 can be applied to DRO with any ambiguity set in the form of (6) with other constructions of confidence bands.

### 3.1 Shape-restricted confidence bands

In this subsection, we present the method by [19] to construct a confidence band for with a confidence level of and use it to build an ambiguity set like (6). Although this method only applies to a univariate density function, it is able to incorporate some shape information about (e.g., unimodality and monotonicity) into the construction of the ambiguity set which improves the theoretical convergence rate of the set to the true density .

We need the following assumptions in this subsection.

###### Assumption 1 (For shape-restricted confidence bands.).

We assume:

• (the support of ) is connected and contained in with known and satisfying .

• is unimodal with a known mode , meaning that is monotonically increasing on and decreasing on .

• There exists a known constant such that for any .

In Assumption 1 [A1.], we assume that and are finite for the simplicity of the notations in the derivation below. In fact, our results can be generalized when and . Moreover, for most applications, a conservative estimation of the range of is usually available, which can be directly used as . We also note that Assumption 1 [A2.] also covers the case where is known to be monotonically increasing () or decreasing (). In some real applications, the unimodality or monotonicity of a random parameter is a well-known fact and thus should be incorporated into the problem formulation. Note that the method by [19] can also used to construct a confidence band even when the mode is only known to be in an interval .

First, let be the order statistics of constructed from satisfying . We choose a group size satisfying and define and . Then we partition the sorted sequence into groups with the first groups of size and the th group (if ) of size . We then define for and for if . Let be the cumulative density function of . It is well-known that (citation?) the random variable has the same distribution as

 ~Δi:=⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩Γ(K,1)∑M′j=1Γ(K,1)+Γ(M−M′+1,1) if i=1,2,…,M′Γ(N−KM′,1)∑M′j=1Γ(K,1)+Γ(M−M′+1,1) if i=M≠M′, (8)

where represents a gamma random variable with a shape parameter and a rate parameter . Let and be two constants that satisfy

Both and can be estimated to arbitrary precision by sampling according to (8). Let be the set of all density functions on with mode at , i.e.

 Lμ:={p∈L|p(ξ)≥0,∫Rp(ξ)dξ=1,the mode of p is μ.}

and

Then, by the definitions of and , we have

 P{p∗∈Dμ(ˆΞ,α)}≥1−α.

Given the property above, can be used as an uncertainty set in (7). However, this uncertainty set may be too large to ensure a good solution from solving (7). Therefore, we need to further refine it using the prior knowledge about the shape of given in Assumption 1. Next, we show how to construct a shape-constrained confidence band using the technique from in [19].

Given any , let () be the number of distinct elements in the set and be the smallest element in this set. Then, we consider the following two sets of non-negative step functions on defined as

 D−μ(ˆΞ,ξ) := ⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩p∈L∣∣ ∣ ∣ ∣ ∣∣p(⋅)=∑{j:zj+1<μ}βjI(zj,zj+1](⋅)+max{j:μ∈[zj,zj+1]}βjIμ(⋅)+∑{j:μ∈[zj,zj+1]}βjI(zj,zj+1)(⋅)+∑{j:zj>μ}βjI[zj,zj+1)(⋅),where βj∈[0,U] for j=1,2,…,^N−1.⎫⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎬⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎭ (9)
 D+μ(ˆΞ,ξ) := ⎧⎪⎨⎪⎩p∈L∣∣∣p(⋅)=∑{j:zj+1≤μ}βjI[zj,zj+1)(⋅)+∞⋅Iμ(ξ)+∑{j:zj>μ}βjI(zj,zj+1](⋅),where βj∈[0,U] for j=1,2,…,^N−1.⎫⎪⎬⎪⎭ (10)

where denotes the indicator function. Both and consist of piecewise constant density functions on with break points on but with different continuous properties (i.e., left-continuity or right-continuity) on these break points. Then, the confidence band for given by [19] is a pair of functions defined as

 lSRα(ξ):=infp∈D−μ(ˆΞ,ξ)∩Dμ(ˆΞ,α)p(ξ)%anduSRα(ξ):=supp∈D+μ(ˆΞ,ξ)∩Dμ(ˆΞ,α)p(ξ). (11)

Here, the superscript “SR” refers to “shape restricted”. By its definition, the value of function () equals the smallest (largest) value at of all density functions which have the form in (9) ((10)) and satisfy for all . Then, our shape-restricted uncertainty set are defined as

 DSR(ˆΞN,α):={p∈L∣∣lSRα(ξ)≤p(ξ)≤uSRα(ξ), ∀ξ∈Ξ,∫Ξp(ξ)dξ=1}. (12)

The following theorem is given by [19].

###### Theorem 1.

.

We then describe the method for compute and for any . Given , recall that denotes the smallest element in . Let and be the indexes in such that and . Using these indexes, we define a polyhedron

 H(ˆΞN,α)=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩β∈RˆN∣∣ ∣ ∣ ∣ ∣ ∣∣β1≤β2≤⋯≤β˜j−1,β˜j≥β˜j+1≥⋯≥βˆN−1c−(α)≤∑j:ξ(ki−1)≤zj<ξ(ki)βj(zj+1−zj)≤c+(α),∀i=2,…,M∑ˆN−1j=1βj(zj+1−zj)=1,0≤βj≤U,∀1≤j≤ˆN−1,⎫⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎬⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎭. (13)

According to (9), (10) and (11), for any , and can be computed respectively by solving the following linear programs

 lSRα(ξ)=minβ∈H(ˆΞN,α)βˆj and uSRα(ξ)=minβ∈H(ˆΞN,α)βˆj. (14)

Note that the first line of constraints in (13) is because is the mode of any in and . The second line of the constraints in (13) corresponds to the condition when is the step function in (9) and (10). The third line requires that must a density function while the last line requires is non-negative and no more than according the prior information of .

We summary this procedure by [19] for constructing a confidence band for a unimodal density function in Algorithm 1.

It is worthwhile to note that if the density is monotone, for example, non-decreasing, we have and the first line of the constraints in (13) becomes . Note also that, in the original work [19], no upper bound for is needed to construct this uncertainty set . In this paper, we require knowing so that we can include a constraint in (13), this modification does not change the statistical property of (i.e., Theorem 1) once we assume . We need the constraint only to ensure and which are needed in our theoretical analysis.

[19] established the convergence rate of the constructed confidence band in the following proposition.111The original Corollary 7.2 [19] was stated a little differently form from Proposition 2. In fact, the conclusion there replaces in (15) by and equality in (15) by . Their results were obtained by choosing the parameter appearing in the proof of Corollary 7.2 [19]. However, the same proof works for Proposition 2 and thus implies (15) if we choose .

###### Proposition 2 (Corollary 7.2 [19]).

Let but . Suppose is -Holder continuous for constants and , i.e., for any and . Suppose for a constant and sufficiently large . We have

 limN→∞P{∣∣uSRα(ξ)−lSRα(ξ)∣∣≤4p∗(ξ)(√3+2ρ1+2ρB−1/2+C(p∗(ξ))−(ρ+1)Bρ)(logNN)ρ/(1+2ρ)}=1. (15)

According to the lower bound in [23], the convergence rate in (15) attains the minimax rate and thus is optimal (upto a constant factor) for the confidence band for a unimodal density. It is also worthwhile to note that the rate of convergence from Kolmogorov-Smirnov distance is slower as compared to this approach. As shown in [18], the confidence band formed by the Kolmogorov-Smirnov distance is only , which is slower than the rate of in Proposition 2 when .

Proposition 2 shows the pointwise convergence of the confidence band to the true distribution . In the next theorem, we show that, under additional assumption, we can characterize the convergence in probability of define in (7) to in (2).

###### Theorem 2.

Suppose is -Holder continuous for constants and , i.e., for any and . Moreover, suppose . For any and , there exists an such that for any , we have

 P(supx∈X∣∣ ∣∣∫Ξf(x,ξ)p∗(ξ)dξ−supp∈DSR(ˆΞN,α)∫Ξf(x,ξ)p(ξ)dξ∣∣ ∣∣≤ϵ)≥1−θ

and

###### Proof.

Proof. See Appendix Proof of Theorem 2. ∎

Finally, illustrations of this method are given in Figure 1 for a beta distribution and in Figure 2 for a truncated exponential distribution. (What exact parameters?)

### 3.2 Kernel-desnsity-estimation confidence bands

The shape-restricted confidence bands described in the previous section can only be applied to univariate density function. In this subsection, we describea method to construct confidence bands for multivariate density function based on the classical kernel density estimation (KDE) [38, 33]. This method requires a kernel function which is a mapping satisfying . The commonly use kernel functions include uniform kernel and Guassian kernel . Let be a bandwidth parameter. Recall that is i.i.d. samples drawn from . The KDE of based on , and is

 ˆph(ξ):=1NN∑i=11hmK(ξ−^ξih). (16)

The convergence of to the true density have been studied for a long time (see e.g. [42]) with most of existing works focusing the asymptotic convergence property. Recently, the finite-sample non-asymptotic convergence property of KDE is characterized by [36] and [21]. The confidence band we construct based on KDE utilize the non-asymptotic convergence bound by [21].

We need the following assumptions in this subsection.

###### Assumption 2 (For KDE confidence bands.).

We assume:

• There exists a constant such that for any .

• There exists a non-increasing function such that .

• There exists , and such that for , .

A2 and A3 of Assumption 2 hold if is one of the popular kernel densities (up to scaling) in , including the two mentioned above as well as Exponential, Tricube, triangular, and Epanechnikov kernels. Under these assumptions, the following finite-sample convergence result is established by [21].

###### Proposition 3.

[Theorem 2. [21]] Suppose is -Holder continuous for constants and , i.e., for any and . Let and be the volume of the unit ball in . Suppose . We have

 P⎧⎨⎩supξ∈Ξ|ˆph(ξ)−p∗(ξ)|≤C1hρ+C2√log(N/α)Nhm⎫⎬⎭≥1−α.

where222Note that because of A4 of Assumption 2. and . As a consequence, if , we have

 P⎧⎨⎩supξ∈Ξ|ˆph(ξ)−p∗(ξ)|≤(C1+C2)(log(N/α)N)ρ/(2ρ+m)⎫⎬⎭≥1−α.

Based on the convergence property in Proposition 3, with , we construct the KDE confidence band of a significance level of for as follows

 lKDEα(ξ)=max{0, ˆph(ξ)−δ}, uKDEα(ξ)=ˆph(ξ)+δ (17)

where

 δ=C1hρ+C2√log(N/α)Nhm. (18)

The corresponding uncertainty set is

 DKDE(ˆΞN,α):={p∈L∣∣lKDEα(ξ)≤p(ξ)≤uKDEα(ξ), ∀ξ∈Ξ,∫Ξp(ξ)dξ=1}. (19)

The following property is a direct consequence of Proposition 3.

###### Theorem 3.

.

We summary this procedure for constructing a KDE confidence band in Algorithm 2.

The following theorem show that, under additional assumption, define in (7) converges to in (2).

###### Theorem 4.

Suppose is -Holder continuous for constants and , i.e., for any and . Moreover, suppose and . For any and , there exists an such that for any , we have

 P(supx∈X∣∣ ∣∣∫Ξf(x,ξ)p∗(ξ)dξ−supp∈DKDE(ˆΞN,α)∫Ξf(x,ξ)p(ξ)dξ∣∣ ∣∣≤ϵ)≥1−θ

and

###### Proof.

Proof. See Appendix Proof of Theorem 4. ∎

## 4 An Numerical Method for DRO

In this section, we discuss a numerical scheme for solving the DRO problem in (7) with an ambiguity set in the form of (6). Here, the ambiguity can be one of the two ambiguity sets and introduced in Section 3 as long as Assumption 1 or 2 is satisfied. However, our numerical methods can be applied to a general ambiguity set like (6) as long as the following assumptions are satisfied. We make the following assumption regarding to :

###### Assumption 3 (On ambiguity set D(ˆΞN,α)).
• .

• .

Note that, under Assumption 1, satisfies Assumption 3 because for an upper bound of according to its definition (14) and the constraints in (13). Under Assumption 2, with the appropriate bandwidth (See Theorem 3), we have