# Sample average approximation with heavier tails I: non-asymptotic bounds with weak assumptions and stochastic constraints

Submitted to the editors DATE.

Roberto I. Oliveira, Instituto de Matemática Pura e Aplicada (IMPA), Rio de Janeiro, RJ, Brazil (rimfo@impa.br). Roberto I. Oliveira’s work was supported by a Bolsa de Produtividade em Pesquisa from CNPq, Brazil. His work in this article is part of the activities of FAPESP Center for Neuromathematics (grant #2013/07699-0, FAPESP - S. Paulo Research Foundation).

Philip Thompson, Center for Mathematical Modeling (CMM), Santiago, Chile & CREST-ENSAE, Paris-Saclay, France (Philip.THOMPSON@ensae.fr).

###### Abstract

We give statistical guarantees for the sample average approximation (SAA) of stochastic optimization problems. Precisely, we derive exponential finite-sample non-asymptotic deviation inequalities for the SAA estimator’s near-optimal solution set and optimal value. We make three main contributions. First, our bounds do not require the sub-Gaussian assumptions used in the previous stochastic optimization (SO) literature. Instead, we just assume random Hölder continuity and a heavy-tailed distribution with finite 2nd moments, a framework better suited to risk-averse portfolio optimization. Second, we derive new deviation inequalities for SO problems with expected-valued stochastic constraints which guarantee joint near feasibility and optimality in terms of the original problem without requiring a metric regular solution set. Thus, unlike previous works, our bounds assume neither strong growth conditions on the objective function nor an indirect problem reformulation (such as constraint penalization or first order conditions, both of which are often necessary but not sufficient conditions). Instead, we just assume metric regularity of the feasible set, making our analysis general enough for many classes of problems. A finite-sample near feasibility and optimality deviation is established for metric regular sets which are nonconvex or which are convex but not strictly feasible. For strictly feasible convex sets, we give finite-sample exact feasibility and near optimality guarantees. The feasible set’s metric regularity constant is present in our inequalities as an additional condition number. For convex sets, we use concentration of measure localization, obtaining feasibility guarantees in terms of smaller metric entropies. Third, we obtain a general uniform concentration inequality for heavy-tailed Hölder continuous random functions using empirical process theory. This is the main tool in our analysis but it is also a result of independent interest.

## 1 Introduction

Consider the following set-up.

###### Set-up 1 (The exact problem).

We are given the optimization problem

$$f^*:=\min_{x\in Y} f_0(x)\quad\text{s.t.}\quad f_i(x)\le 0,\ \forall i\in\mathcal{I}, \tag{1}$$

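with a compact hard constraint $Y\subset\mathbb{R}^d$, a nonempty feasible set $X:=\{x\in Y: f_i(x)\le 0,\ \forall i\in\mathcal{I}\}$, the solution set $X^*$ and, for $\epsilon>0$, the $\epsilon$-near solution set $X^*_\epsilon:=\{x\in X: f_0(x)\le f^*+\epsilon\}$. In the above, $\mathcal{I}$ is a finite set of indexes (possibly very large) and $f_i:Y\to\mathbb{R}$ is a continuous function, for all $i\in\mathcal{I}_0$. We set $\mathcal{I}_0:=\{0\}\cup\mathcal{I}$.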

The central question of this work is:

###### Problem 1 (The approximate problem).

With respect to Set-up 1, suppose that $\{f_i\}_{i\in\mathcal{I}_0}$ is directly inaccessible, but we do have access to “randomly perturbed” real-valued continuous versions $\{\hat F_i\}_{i\in\mathcal{I}_0}$ of $\{f_i\}_{i\in\mathcal{I}_0}$ defined over $Y$ (to be made precise in the following). Based on this information, we choose real numbers $\{\hat\epsilon_i\}_{i\in\mathcal{I}}$ and consider the problem

$$\hat F^*:=\min_{x\in Y} \hat F_0(x)\quad\text{s.t.}\quad \hat F_i(x)\le \hat\epsilon_i,\ \forall i\in\mathcal{I}, \tag{2}$$

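with feasible set $\hat X:=\{x\in Y: \hat F_i(x)\le\hat\epsilon_i,\ \forall i\in\mathcal{I}\}$, solution set $\hat X^*$ and, for $\epsilon>0$, the $\epsilon$-near solution set $\hat X^*_\epsilon:=\{x\in\hat X: \hat F_0(x)\le\hat F^*+\epsilon\}$. We set .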

We then ask the following questions:

• Can we ensure that nearly optimal solutions of the accessible problem (2) are nearly optimal solutions of the original inaccessible problem (1)?

• Can we also bound $f^*$ in terms of $\hat F^*$?

We distinguish the hard constraint $Y$ from the soft constraints $f_i(x)\le 0$, $i\in\mathcal{I}$, which are allowed to be relaxed due to inevitable model perturbation. In this context, one should interpret $\{\hat\epsilon_i\}_{i\in\mathcal{I}}$ as “tuning” parameters. For this reason, we also consider, for given $\gamma$ (seen as a “small parameter”), the relaxed feasible set

$$X_\gamma:=\{x\in Y: f_i(x)\le\gamma,\ \forall i\in\mathcal{I}\}. \tag{3}$$

For given and , it will be convenient to define the set . We set .

In stochastic optimization (SO), Problem 1 is made precise by the Sample Average Approximation (SAA) methodology, which is explained as follows. We consider a distribution $\mathbf{P}$ over a sample space $\Xi$ and suppose the data of problem (1) satisfies, for any $i\in\mathcal{I}_0$,

$$f_i(x):=\mathbf{P}F_i(x,\cdot):=\int_\Xi F_i(x,\xi)\,\mathrm{d}\mathbf{P}(\xi),\quad(x\in Y), \tag{4}$$

where the measurable function $F_i:Y\times\Xi\to\mathbb{R}$ is such that the above integrals are well defined. It is then assumed that, although there is no access to $\mathbf{P}$, the decision maker can evaluate $\{F_i\}_{i\in\mathcal{I}_0}$ over an acquired independent identically distributed (i.i.d.) size-$N$ sample $\{\xi_j\}_{j=1}^N$ of $\mathbf{P}$. Within this framework, the SAA approach is to solve (2) with

$$\hat F_i(x):=\hat{\mathbf{P}}F_i(x,\cdot)=\frac{1}{N}\sum_{j=1}^N F_i(x,\xi_j),\quad(i\in\mathcal{I}_0,\ x\in Y), \tag{5}$$

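where $\hat{\mathbf{P}}:=\frac{1}{N}\sum_{j=1}^N\delta_{\xi_j}$ denotes the empirical distribution associated to $\{\xi_j\}_{j=1}^N$ and $\delta_\xi$ is the Dirac measure at the point $\xi$. For notational convenience, we will omit the dependence on $N$ and $\{\xi_j\}_{j=1}^N$. Also, we consider a common probability space $(\Omega,\mathcal{F},\mathbb{P})$ and a random variable $\xi:\Omega\to\Xi$ with distribution $\mathbf{P}$ such that $\mathbf{P}(A)=\mathbb{P}(\xi\in A)$ for any $A$ in the $\sigma$-algebra of $\Xi$ and $\mathbf{P}g=\mathbb{E}[g(\xi)]$ for any measurable $g$.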
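
As a minimal illustration of (4)-(5), the sketch below builds the sample averages and solves the SAA problem (2) by brute force. It assumes, purely for illustration, that the hard constraint is replaced by a finite list of candidate points; the names `saa`, `solve_saa`, `candidates` and `eps_hat`, as well as the toy data, are hypothetical and not part of the original text.

```python
def saa(F, samples):
    """Return the sample-average function x -> (1/N) * sum_j F(x, xi_j)."""
    n = len(samples)

    def F_hat(x):
        return sum(F(x, xi) for xi in samples) / n

    return F_hat


def solve_saa(F0, constraints, eps_hat, samples, candidates):
    """Brute-force SAA over a finite candidate list (an assumption standing in for Y).

    Returns (x_hat, F_hat_star), or (None, None) if no candidate is SAA-feasible.
    """
    F0_hat = saa(F0, samples)
    Fi_hats = [saa(Fi, samples) for Fi in constraints]
    # SAA feasibility: every sample-average constraint is below its tuning level.
    feasible = [
        x for x in candidates
        if all(Fi_hat(x) <= eps for Fi_hat, eps in zip(Fi_hats, eps_hat))
    ]
    if not feasible:
        return None, None
    x_hat = min(feasible, key=F0_hat)
    return x_hat, F0_hat(x_hat)


# Toy data: F0(x, xi) = (x - xi)^2 and one soft constraint F1(x, xi) = xi + 0.5 - x.
samples = [0.0, 1.0, 2.0]
candidates = [0.0, 0.5, 1.0, 1.5, 2.0]
F0 = lambda x, xi: (x - xi) ** 2
F1 = lambda x, xi: xi + 0.5 - x

x_tight, val_tight = solve_saa(F0, [F1], [0.0], samples, candidates)
x_relaxed, val_relaxed = solve_saa(F0, [F1], [0.5], samples, candidates)
```

In this toy instance, relaxing the tuning parameter from 0 to 0.5 enlarges the SAA feasible set and therefore can only lower the SAA optimal value.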

We should mention that there exists a competing methodology in SO called Stochastic Approximation (SA). In this case, a specific algorithm is chosen and samples are used in an online interior fashion to approximate the objective function along the iterations of the algorithm. The quality of such methods is given in terms of optimization and statistical estimation errors simultaneously. The first relates to the optimality gap while the second is related to the variance error induced by the use of random perturbations. See e.g. [59, 18, 19, 20]. In contrast, the quality of the SAA methodology is measured by the statistical estimation error present when solving Problem 1 under (4)-(5). This analysis is valid uniformly on the class of algorithms chosen to solve (2). The SAA methodology is adopted in many decision-making problems in SO (see, e.g., [30]) but also in the construction of the so-called $M$-estimators studied in Mathematical Statistics and Statistical Machine Learning (see, e.g., [33]). In the first setting, knowledge of data is limited, but one can resort to samples using Monte Carlo simulation. In the second setting, a limited number of samples is acquired from measurements and an empirical contrast estimator is built to fit the data to a certain risk criterion given by a loss function over a hypothesis class [33]. In the latter context, the SAA problem is often termed Empirical Risk Minimization. However, randomly perturbed constraints are not typical in this setting.

Referring to the general perturbation theory framework of Problem 1, central questions in the SAA methodology are to give conditions on the data and good guarantees on the estimation error such that computable (nearly) optimal solutions of (2) are (nearly) optimal solutions of the original problem (1)-(4). This can be cast in different forms. One important type of analysis is to guarantee that such methodology is asymptotically consistent as $N\to\infty$. This may include different kinds of convergence modes. Almost everywhere (a.e.) convergence, for instance, provides a Strong Law of Large Numbers (SLLN) for the SAA problem, while convergence in distribution provides a Central Limit Theorem (CLT) and may be used to give rates of convergence of the variance associated to the sample average approximation (typically of order $N^{-1/2}$, as dictated by the Central Limit Theorem), as well as the construction of asymptotic confidence intervals for the true solution. It should be remarked that a key aspect in the analysis of the SAA estimator is the need for uniform SLLN and CLT since we are dealing with random functions instead of random variables [1]. This kind of asymptotic analysis has been carried out in numerous works, e.g., [2, 9, 24, 25, 39, 40, 41, 46, 47, 48, 49]. See e.g. [49, 15, 23] for an extensive review.

However, a preferred mode of analysis, which will be the one pursued in this work, is exponentially non-asymptotic in nature. By this we mean the construction of deviation inequalities giving explicit non-asymptotic rates in $N$ which guarantee that, with exponentially high probability, the optimal value and (nearly) optimal solutions of (2) are close enough to the optimal value and (nearly) optimal solutions of (1) given a prescribed tolerance $\epsilon>0$. (Exponentially high probability means that the deviation depends polynomially on $\ln(1/p)$ for a chosen probability level $p\in(0,1)$. This is also termed exponential convergence in contrast to polynomial convergence obtained by mere use of Markov-type inequalities.) In such inequalities, the prescribed deviation error $\epsilon$ typically depends on the number of samples $N$ and on parameters depending on the problem’s structure. (These parameters are typically associated to intrinsic “condition numbers” of the problem, such as the feasible set’s diameter and dimension as well as parameters associated to growth, curvature or regularity properties of the data. These may include, e.g., Lipschitz and strong-convexity moduli.) As an example, in the case of a compact $X\subset\mathbb{R}^d$ with no stochastic constraints ($\mathcal{I}=\emptyset$), diameter $\mathcal{D}(X)$ and a random Lipschitz continuous function $F_0$ with bounded Lipschitz modulus $\mathsf{L}(\xi)$ satisfying $\mathsf{L}(\xi)\le L$ for a.e. $\xi\in\Xi$, the following non-asymptotic inequality for optimal value deviation can be obtained: for some constant $C>0$,

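$$\mathbb{P}\left\{\left|\hat F^*-f^*\right|\ge C\,\frac{\sqrt{d}\,\mathcal{D}(X)\,L}{\sqrt{N}}\,(1+t)\right\}\le e^{-t^2},\quad\forall t>0,\ \forall N\in\mathbb{N}.$$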

Inequalities of this type can be obtained for the near-optimal solution set deviation using different metrics than $|\hat F^*-f^*|$. Given $\epsilon>0$, examples of such metrics include the set deviation with a specific norm or bounds on [49]. The strong advantage of such inequalities in comparison to asymptotic results is that the exponential tail decay is not just valid in the limit but also valid uniformly for any $N$. It is thus true in the regime of a small finite number of samples. Results of this kind are a manifestation of the concentration of measure phenomenon [29].

In asymptotic convergence analysis, the mild assumption of a data distribution with finite second moment is usually sufficient. Moreover, smoothness is usually required to establish results of asymptotic convergence in distribution. To derive stronger non-asymptotic results, however, more demanding assumptions on the distribution are made in the existing SO literature. The typical assumption in this context is light-tailed data (see Section 1.1, item (i)). While this is satisfied by many problems (for instance, bounded or sub-Gaussian data), the much weaker assumption of data with finite second moment is desirable in many practical situations where information on the distribution is limited. This is particularly true in the context of risk-averse SO where heavy-tailed data is expected. One central aspect of this work is to provide a non-asymptotic analysis of the SAA problem with heavier-tailed data.

Another crucial aspect in Problem 1 is the presence of perturbations in the constraints. Already in the case of deterministic constraint perturbations, the feasible set must be sufficiently regular to ensure stability of solutions. Usual conditions to ensure this are, e.g., Slater or Mangasarian-Fromovitz constraint qualifications (CQ). In the specific case of the SAA methodology, most of the existing work has been done without stochastic constraints (i.e., $\mathcal{I}=\emptyset$). We refer the reader to the recent review [16], which points out this gap in the literature. In this respect, another central question in this work is to give non-asymptotic guarantees for the SAA problem in the presence of stochastic perturbation of the constraints. Importantly, we will present results of this type which ensure feasibility and optimality simultaneously. In this quest, we wish to avoid penalization or Lagrangian reformulations of the original problem. While these have the advantage of coping with the constraints in the objective, they implicitly require determining the associated penalization parameters or multipliers. The computation of penalization parameters and multipliers is another hard problem by itself and, in the stochastic case, they are data-dependent random variables. Moreover, these reformulations may express necessary but not sufficient conditions of the original problem. Our adopted framework is to establish guarantees for the SAA estimator in terms of the original optimization problem itself. In this quest, we explore metric regularity of the feasible set as a suitable general property to handle random perturbation of constraints (see, e.g., [38]).

### 1.1 Related work and contributions

We next summarize the main contributions of this paper and later compare them with previous works.

(i) Heavy-tailed and Hölder continuous data. The standard non-asymptotic exponential convergence results for SO problems were given for sub-Gaussian data. We say a random variable $q$ is a centered sub-Gaussian random variable with parameter $\sigma>0$ if

$$\mathbb{E}\left[e^{tq}\right]\le\exp\left\{\frac{\sigma^2t^2}{2}\right\},\quad\forall t\in\mathbb{R}.$$

This is equivalent to the tail of $q$ not exceeding the tail of a centered Gaussian random variable with variance $\sigma^2$ (whose tail decays exponentially fast). As an example, bounded random variables are sub-Gaussian by Hoeffding’s inequality [13]. To establish exponential non-asymptotic deviations, a standard requirement in SO used so far is the following uniform sub-Gaussian assumption: for all $i\in\mathcal{I}_0$, there exists $\sigma_i>0$ such that for all $x\in Y$, $\delta_i(x):=F_i(x,\cdot)-\mathbf{P}F_i(x,\cdot)$ is sub-Gaussian with parameter $\sigma_i$. Explicitly,

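$$\mathbf{P}\left\{e^{t\delta_i(x)}\right\}\le\exp\left\{\frac{\sigma_i^2t^2}{2}\right\},\quad\forall t\in\mathbb{R},\ \forall x\in Y. \tag{6}$$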

From now on we will assume the following condition.

###### Assumption 1.1 (Hölder continuous heavy-tailed data).

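Let $\|\cdot\|$ be a norm on $\mathbb{R}^d$. For all $i\in\mathcal{I}_0$, there exist a random variable $\mathsf{L}_i:\Xi\to\mathbb{R}_+$ with $\mathbf{P}\mathsf{L}_i(\cdot)^2<\infty$ and $\alpha_i\in(0,1]$, such that for a.e. $\xi\in\Xi$ and for all $x,y\in Y$,

$$|F_i(x,\xi)-F_i(y,\xi)|\le\mathsf{L}_i(\xi)\|x-y\|^{\alpha_i}.$$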

The above assumption is standard in SO [49], where typically the stronger condition $\alpha_i=1$ (i.e., Lipschitz continuity) is also assumed. In the current SO literature, the data is assumed to satisfy both condition (6) and Assumption 1.1. In that case, to establish uniform sub-Gaussianity of $\delta_i(\cdot)$ one requires the random Hölder modulus $\mathsf{L}_i$ to be sub-Gaussian. We give significant improvements by obtaining non-asymptotic exponential deviation inequalities of order $N^{-1/2}$ for the optimal value and nearly optimal solutions of the SAA estimator assuming just Assumption 1.1, that is, $\mathsf{L}_i$ is only required to have a finite 2nd moment. Note that in this setting, even for a compact $Y$, the multiplicative noise $\mathsf{L}_i$ may result in much heavier fluctuations of $F_i$ when compared to a bounded or Gaussian random variable. One motivation for obtaining this kind of new result in SO is the use of risk measures in decision-making under uncertainty (such as CV@R [43]). In this framework, the assumption of sub-Gaussian data (with a tail decreasing exponentially fast) may be too optimistic since risk measures are often used to hedge against the tail behavior of the random variable modeling the uncertainty. The price to pay for our assumptions is that the deviation bounds typically depend on empirical quantities such as the Hölder moduli of the empirical losses.

(ii) Stochastic optimization with expected-valued stochastic constraints. A remarkable difficulty in problem (1)-(4) is that constraints are randomly perturbed. In this generalized framework, besides the behaviour of the objective, the feasible set’s geometry plays an additional important role in ensuring simultaneous feasibility and optimality. We obtain exponential non-asymptotic deviations for the optimal value and nearly optimal solutions of SO problems with expected-valued stochastic constraints, also in the context of the heavy-tailed data of item (i). In our analysis, we do not use reformulations based on penalization or Lagrange multipliers. For the mentioned purposes, we explore metric regularity of the feasible set (MRF) as a sufficient condition.

In the following, for a given $a\in\mathbb{R}$, $[a]_+:=\max\{a,0\}$ and, for a given set $Z\subset\mathbb{R}^d$, $\mathsf{d}(x,Z):=\inf_{y\in Z}\|x-y\|$, where $\|\cdot\|$ is the norm in Assumption 1.1. We shall assume:

###### Assumption 1.2 (Metric regular set).

There exists $\mathfrak{c}>0$ such that, for all $x\in Y$,

$$\mathsf{d}(x,X)\le\mathfrak{c}\,\sup_{i\in\mathcal{I}}\,[f_i(x)]_+.$$

MRF is a fairly general property of sets used in the perturbation analysis and computation of problems in Optimization and Variational Analysis [38, 5, 18, 59, 35]. We can thus make our analysis of randomly perturbed constraints general enough for many classes of problems. This property is related to standard constraint qualifications, one example being the Slater constraint qualification (SCQ), which ensures that $X$ has a strictly feasible point.

###### Assumption 1.3 (Slater constraint qualification).

$\{f_i\}_{i\in\mathcal{I}}$ are continuous and convex on the closed and convex set $Y$ and, for some $\bar x\in Y$,

$$\sup_{i\in\mathcal{I}}f_i(\bar x)<0.$$

We shall also consider Assumption 1.3 as a particular case of Assumption 1.2. Indeed, SCQ implies MRF under compactness of the set:

###### Theorem 1.4 (Robinson [42]).

Suppose Assumption 1.3 holds and $Y$ is compact. Then Assumption 1.2 holds with .

It should be noted, however, that MRF still holds for a larger class of sets which are neither strictly feasible nor convex and may not satisfy standard CQs. One fundamental instance is that of a polyhedron, which is a convex metric regular set even without a strict feasibility assumption, as implied by the celebrated Hoffman’s Lemma [14]. We remark here that, in Assumption 1.2, we restrict our analysis to the case of “Lipschitzian” bounds. Our results can be easily extended to the case of “Hölderian” bounds, with the corresponding exponent acting as an additional condition number (see Section 4.2 in [38]). In the Hölderian case, another important case in which MRF holds true for a compact nonconvex $X$ is when the constraints are polynomial or real-analytic functions, a deep result implied by Łojasiewicz’s inequality [31]. We refer to Section 4.2 in [38] and references therein.

In our results, we do explore the benefits of convexity and strict feasibility in the Slater condition by showing that finite-sample near optimality and exact feasibility of nearly optimal solutions of the SAA problem (2) are jointly satisfied with high probability when SCQ holds and the error tolerance is smaller than a feasibility threshold (see Assumption 1.3 and Theorem LABEL:thm:SAA:exp:inner:SCQ in Section 3.1). Nevertheless, we also derive exponential finite-sample non-asymptotic deviations which guarantee simultaneous near optimality and near feasibility of solutions of the SAA problem (2) assuming just a metric regular feasible set (possibly nonconvex), without requiring the Slater condition and valid for any tolerance level (see Section 3.1, Theorem LABEL:thm:SAA:exp:outer:MR). In the particular case when $X$ satisfies the SCQ, Theorem LABEL:thm:SAA:exp:outer:MR(iv) and Theorem LABEL:thm:SAA:exp:inner:SCQ(i) describe a finite-sample “transition regime” for the feasibility property with respect to this threshold (we refer to the comments following Theorem LABEL:thm:SAA:exp:inner:SCQ in Section 3.1). To the best of our knowledge, these are new types of results.

We emphasize that, in all the above results, we do not impose metric regularity on the solution set map, which significantly restricts the problem in consideration (see e.g. [45] for a precise definition). To the best of our knowledge, this is also a new type of result. In the small existing literature on SAA with stochastic constraints, metric regularity of the solution set map has been assumed for establishing non-asymptotic deviation guarantees for optimal solutions (that is, guarantees which ensure feasibility and optimality simultaneously). We should mention that requiring a metric regular solution set typically imposes strong growth conditions on the objective function and sometimes uniqueness of solutions [45, 40]. By exploring metric regularity of the feasible set we do not require such regularity properties of the objective function.

(iii) A uniform concentration inequality for heavy-tailed Hölder continuous random functions. Since the statistical problem in consideration involves optimizing over a feasible set, there is the need to obtain uniform deviation inequalities for random functions. In the quest of items (i)-(ii) above, instead of Large Deviations Theory, as used, e.g., in [52, 51, 26, 49], we use techniques from Empirical Process Theory. In this approach, we use chaining arguments to obtain a general concentration inequality for the uniform empirical deviation of random Hölder continuous functions (see Section 3.2, Theorem LABEL:thm:concentration:ineq:function). Since we assume heavy-tailed data, we incorporate in our arguments self-normalization techniques [37, 7] instead of postulating boundedness assumptions on the data and then directly invoking Talagrand’s inequality for bounded empirical processes [53] as in [45, 41, 40], for example.

Before comparing our results in Section 3.1 with previous works, we make some important remarks regarding the parameters appearing in our deviation inequalities in the framework of items (i)-(ii). Typically, our deviations in feasibility and optimality depend on the metric entropy of $Y$ or $X$ and of the index set $\mathcal{I}$ (see Section 3.2 for a precise definition). Such a quantity is associated to the complexities of $Y$ or $X$ and $\mathcal{I}$. As an example, if $Y$ is the Euclidean ball in $\mathbb{R}^d$, its metric entropy is of order $d$. We now focus on the case of stochastic constraints ($\mathcal{I}\neq\emptyset$). For a finite (but possibly very large) number of constraints, our deviations depend on $\ln|\mathcal{I}|$. We also explore the convexity of $X$ as a useful localization property for using concentration of measure in feasibility estimation. We show that exact or near feasibility holds true depending on sets with smaller metric entropies than $Y$ or $X$. Precisely, if $X$ is convex and $\epsilon>0$ is a prescribed tolerance, our feasibility deviation inequalities depend, for given $i\in\mathcal{I}$, on the metric entropy of $\gamma$-active level sets of the form

$$X_{i,\gamma}:=\{x\in X_\gamma: f_i(x)=\gamma\}, \tag{7}$$

corresponding to constraint $i$ (see (3)).

We believe this is a new type of result (see Section 3.1, Theorem LABEL:thm:SAA:exp:outer:MR(iv) and Theorem LABEL:thm:SAA:exp:inner:SCQ(i) along with equations (LABEL:equation:variance:convex:feas) and (LABEL:equation:A:alpha)). Finally, our deviation bounds depend on the metric regularity constant in Assumption 1.2 as a condition number and on deterministic errors associated to set approximation due to inevitable constraint relaxation in the stochastic setting. (We thus provide deviation bounds of the form “variance error + approximation error”, often seen in statistics.) These approximation errors depend on the optimization problem and vanish as $N\to\infty$ at a prescribed rate. A very conservative upper bound of these rates is . We refer to (LABEL:equation:Gap), (LABEL:equation:gap) and Proposition LABEL:prop:gap:error in Section 3.1. To the best of our knowledge, exponential finite-sample error bounds of this kind are also new.

To conclude, we remark that we do not consider chance constrained problems. These are problems where the constraints are required to be satisfied within a confidence level, that is, problem (1) with for given , and . This problem is equivalently expressed in the form (1)-(4) with , where stands for the characteristic function of the measurable set . Hence, they are of a very distinct nature: the constraints are bounded but discontinuous (not satisfying Assumption 1.1) and usually not convex even if is convex.

We now compare our results with previous work. Related to our work we mention [40, 45, 41, 3, 26, 50, 51, 58, 48, 49, 55, 56, 57, 17, 36, 28, 44, 10, 8, 22, 21, 11, 4]. Except for [44, 21, 22], all other papers assume light-tailed data. However, the analysis in [44, 21, 22] is asymptotic, a regime where data with finite second moments may be expected. Moreover, in [44], asymptotic consistency is given for a necessary stationary reformulation based on optimality functions, while in [21, 22] the analysis is restricted to the optimal value with no report on consistency for optimal solutions. In these two papers, their asymptotic rate in terms of convergence in probability is with and, hence, sub-optimal since the optimal rate $N^{-1/2}$ is not achieved. Our results assuming heavy-tailed data are non-asymptotic, do not use reformulations of the optimization problem and achieve the optimal rate $N^{-1/2}$ with joint guarantees of feasibility and optimality.

Expected-valued stochastic constraints were analyzed in [44, 8, 17, 58, 55, 45, 3, 56, 57, 22, 21, 11]. As mentioned, an asymptotic analysis is given in [44] in terms of a necessary stationary reformulation based on optimality functions. In [8, 17, 58, 55], exponential non-asymptotic convergence is obtained only for the feasibility requirement, with no report on simultaneous feasibility and optimality guarantees. Moreover, they assume light-tailed data. In [45, 3, 56, 57], besides assuming light-tailed data, the exponential non-asymptotic guarantees for optimal solutions assume metric regularity of the solution set. We, on the other hand, obtain exponential non-asymptotic guarantees for nearly optimal solutions assuming heavy-tailed data and only metric regularity of the feasible set. This last assumption is much weaker since neither a qualification structure of the solution set nor growth conditions on the objective function need to be verified. In [22, 21, 11], analysis is only given for optimal values. Moreover, asymptotic sub-optimal rates are provided in [22, 21], while light-tailed data is assumed in [11]. For completeness, we also mention [27], where algorithms based on the different SA methodology are proposed for convex problems with stochastic constraints and light-tailed data.

Finally, we compare our approach with the mentioned works with respect to the methods used in obtaining concentration of measure, an intrinsic property of random perturbations. As explained before, this is the property required to derive exponential non-asymptotic convergence. Instead of using Large Deviations Theory as in [3, 26, 50, 51, 58, 48, 49, 17, 8, 4], we use methods from Empirical Process Theory. Ready-to-use inequalities from this theory were invoked in [40, 45, 41, 55, 56, 57, 10]. One essential point in all of these works is the assumption of light-tailed data, which we avoid. To achieve this, we do not postulate bounded data and directly use concentration inequalities of empirical processes as done in [40, 45, 41, 55, 56, 57]. Instead, we work with basic assumptions on the data (heavy-tailed Hölder continuous random functions) and derive a suitable concentration inequality on this class of random functions. The work in [10] also chooses this line of analysis. Using techniques based on Rademacher averages and McDiarmid’s inequality [34], they require Hölder continuous random functions with a constant Hölder modulus and, as a consequence, require bounded data. Moreover, besides not analyzing stochastic constraints, their rate of convergence is of for . Hence, even in the setting of bounded data, their rate is severely sub-optimal since as approaches the optimal exponent . Our deviations have the optimal rate of $N^{-1/2}$ for heavy-tailed data and include expected-valued stochastic constraints. To obtain these sharper results, our concentration inequalities are derived from different techniques using chaining and metric entropy arguments as well as self-normalization theory [37, 7].

We present some needed notation. For a set , we will denote its topological interior and frontier respectively by and and set and , where is the norm in Assumption 1.1. The excess or deviation between compact sets is defined as . Note that iff , where denotes the Minkowski sum and denotes the closed unit ball with respect to the norm . will denote the closed ball of radius and center . For a given set , we denote its cardinality by and its complement by . For , we write if there exists constant such that . Also, . For , we denote . For random variables , denotes the -algebra generated by . Given -algebra of sets in , the conditional expectation with respect to .

## 2 Motivating applications

Our theory is motivated by many stochastic optimization problems which can be cast in the framework given in the Introduction, see e.g. [49]. In this section we present two specific sets of problems. In subsequent papers, we apply the methodology developed here, specially tailored to these applications.

###### Example 2.1 (Risk averse portfolio optimization).

Let $\xi_1,\ldots,\xi_d$ be random return rates of assets $1,\ldots,d$. The objective of portfolio optimization is to invest some initial capital in these assets subject to a desirable feature of the total return rate on the investment. If $x_1,\ldots,x_d$ are the fractions of the initial capital invested in assets $1,\ldots,d$, then the total return rate is

$$R(x,\xi):=\xi^Tx,$$

where $T$ denotes the transpose operation, $\xi:=(\xi_1,\ldots,\xi_d)^T$ and $x:=(x_1,\ldots,x_d)^T$. An obvious hard constraint for the capital is the simplex

$$Y:=\left\{x\ge 0:\sum_{i=1}^d x_i=1\right\}.$$

An option to hedge against uncertain risks is to solve the problem

$$\min_{x\in Y}\ \mathbf{P}[-R(x,\cdot)]\quad\text{s.t.}\quad\mathsf{CV@R}_p[-R(x,\cdot)]\le\beta, \tag{8}$$

where, given a random variable $G:\Xi\to\mathbb{R}$, the Conditional Value-at-Risk of level $p\in(0,1]$ corresponding to the distribution $\mathbf{P}$ is defined as

$$\mathsf{CV@R}_p[G]:=\min_{t\in\mathbb{R}}\left\{t+\frac{1}{p}\mathbf{P}[G-t]_+\right\}.$$

Problem (8) minimizes the expected risk in losses subject to a constraint which hedges against more aggressive tail losses. The above problem can be equivalently solved by the problem with expected-value constraints given by

$$\min_{x\in Y,\ t\in\mathbb{R}}\ \mathbf{P}[-R(x,\cdot)]\quad\text{s.t.}\quad t+p^{-1}\mathbf{P}[-R(x,\cdot)-t]_+\le\beta.$$

We refer to [43].
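
To make the CV@R definition concrete, the following sketch evaluates its sample-average counterpart, that is, the definition above with the distribution replaced by the empirical distribution of a sample. The function names and the return data are hypothetical and only serve as an illustration; the search over sample points relies on the objective being convex and piecewise linear in $t$.

```python
def empirical_cvar(losses, p):
    """Sample version of CV@R_p[G] = min_t { t + (1/p) * mean([G - t]_+) }.

    The objective is convex and piecewise linear in t with kinks at the sample
    values, so for 0 < p <= 1 the minimum is attained at one of the samples.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    n = len(losses)

    def objective(t):
        return t + sum(max(g - t, 0.0) for g in losses) / (p * n)

    return min(objective(t) for t in losses)


def portfolio_losses(x, returns):
    """Losses -R(x, xi_j) = -<xi_j, x> for each sampled return vector xi_j."""
    return [-sum(r * w for r, w in zip(xi, x)) for xi in returns]


# Toy sample of return vectors for d = 2 assets (made-up numbers).
returns = [(0.10, 0.02), (-0.20, 0.01), (0.05, 0.03), (-0.05, 0.00)]
x = (0.5, 0.5)  # a point of the simplex Y
losses = portfolio_losses(x, returns)
expected_loss = sum(losses) / len(losses)
cvar_half = empirical_cvar(losses, 0.5)
```

With $p=1$ the empirical CV@R reduces to the sample mean of the losses, while smaller $p$ puts more weight on the worst outcomes, which is why it is never below the expected loss.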

###### Example 2.2 (The Lasso estimator).

In Least-Squares-type problems, the loss function to be minimized is

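$$F(x,\xi):=\left[y(\xi)-\langle\mathbf{x}(\xi),x\rangle\right]^2,\quad(x\in\mathbb{R}^d),$$

for $y(\xi)\in\mathbb{R}$ and $\mathbf{x}(\xi)\in\mathbb{R}^d$. If $\{\xi_j\}_{j=1}^N$ is a sample of $\mathbf{P}$, the usual ordinary least squares method minimizes $\hat F(x):=\hat{\mathbf{P}}F(x,\cdot)$ over $\mathbb{R}^d$. When $N\gg d$, this method typically produces a good approximation of the minimizer of $f(x):=\mathbf{P}F(x,\cdot)$.

The above is not true when $N\ll d$, where the least-squares estimator is not consistent. For this setting, Tibshirani [54] proposed the Lasso estimator given by the problem

$$\min_{x\in Y}\ \hat F(x),$$

with $Y:=\{x\in\mathbb{R}^d:\|x\|_1\le R\}$, where $R>0$ is a tuning parameter and $\|\cdot\|_1$ denotes the $\ell_1$-norm.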
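
The constrained least-squares formulation above can be sketched as follows, assuming the data are given as pairs of a covariate vector and a response and that the hard constraint is an $\ell_1$-ball of a given radius. The helper names, the finite candidate list and the data are hypothetical and stand in for a proper convex solver.

```python
def empirical_squared_loss(x, data):
    """F_hat(x) = (1/N) * sum_j (y_j - <a_j, x>)^2 for data = [(a_j, y_j), ...]."""
    total = 0.0
    for a, y in data:
        prediction = sum(ai * xi for ai, xi in zip(a, x))
        total += (y - prediction) ** 2
    return total / len(data)


def in_l1_ball(x, radius):
    """Check the hard constraint ||x||_1 <= radius."""
    return sum(abs(xi) for xi in x) <= radius


def constrained_least_squares(data, radius, candidates):
    """Pick the best candidate inside the l1-ball (brute force, illustration only)."""
    feasible = [x for x in candidates if in_l1_ball(x, radius)]
    return min(feasible, key=lambda x: empirical_squared_loss(x, data))


# Toy data with d = 2 and N = 2 (made-up numbers).
data = [((1.0, 0.0), 1.0), ((0.0, 1.0), 2.0)]
candidates = [(1.0, 2.0), (1.0, 1.0), (0.0, 1.5), (0.0, 0.0)]
x_hat = constrained_least_squares(data, 2.0, candidates)
```

Here the unconstrained least-squares solution `(1.0, 2.0)` has $\ell_1$-norm 3 and is excluded by the radius 2, so the constrained estimate is shrunk towards the origin.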

Bickel, Ritov and Tsybakov [6] analyze the penalized estimator given by the problem

 minx∈Rd ˆF(x)+λ∥ˆD2x∥1,

where is a data driven diagonal matrix whose -th entry is . Up to a penalization parameter, the above problem is equivalent, for some , to

 minx∈Rd ˆF(x) s.t. ∥ˆD2x∥1≤R.

## 3 Statement and discussion of main results

In this section we state the main results of this work. In Subsection 3.1, we present exponential nonasymptotic deviation inequalities for the near-optimal solution set and optimal value of the SAA estimator. In Subsection 3.2, we present a uniform concentration inequality for heavy-tailed random Hölder continuous functions used to derive the results of Subsection 3.1. The proofs are given in the subsequent section.

### 3.1 Nonasymptotic deviation inequalities for joint optimality and feasibility guarantees

The following exponential nonasymptotic deviation inequalities quantify the order of the size of the sample necessary so that and is close to in some sense for a given tolerance . The order of will be a function of , the confidence level and some variance which will depend on intrinsic condition numbers of the problem: the Hölder parameters of the objective and the diameter and metric entropies of , or other “smaller” subsets.

With respect to Assumption LABEL:assump:random:holder:continuity, we will need the following definitions: for , and , we set , and define the following “variance-type” quantities:

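$$
\begin{aligned}
\sigma_i(x)^2 &:= \mathbf{P}\left[F_i(x,\cdot) - \mathbf{P}F_i(x,\cdot)\right]^2, \\
\hat\sigma_i(x)^2 &:= \hat{\mathbf{P}}\left[F_i(x,\cdot) - \mathbf{P}F_i(x,\cdot)\right]^2, \\
\breve\sigma_0(x)^2 &:= \hat\sigma_0(x)^2 + \sigma_0(x)^2, \\
\hat\sigma_0(Z) &:= A_{\alpha_0}(Z)\sqrt{\hat L_0^2 + L_0^2}.
\end{aligned} \tag{9}
$$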

Above, $A_{\alpha_0}(Z)$ is defined in (13). It quantifies the “size” of $Z$ in terms of its diameter and metric entropy. We refer to Subsection 3.2 for details.

###### Theorem 3.1 (SAA with fixed feasible set).

Consider the Set-up LABEL:setup:optimization:problem and Problem LABEL:problem:perturbed:opt under (LABEL:equation:expected:data)-(LABEL:equation:empirical:data) and . Let and . Then

• (Optimality) For any and any , with probability :

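$$
N \ge O(1)\,\frac{\hat\sigma_0(X)^2}{\epsilon^2}\ln(1/p) \;\Longrightarrow\; \hat X^*_\epsilon \subset X^*_{2\epsilon}.
$$

• (Optimal value) For any and any , with probability :

$$
\begin{aligned}
N \ge O(1)\,\frac{\hat\sigma_0(X)^2 \vee \breve\sigma_0(z)^2}{\epsilon^2}\ln(1/p) &\;\Longrightarrow\; f^* - 2\epsilon \le \hat F^*, \\
N \ge O(1)\,\frac{\breve\sigma_0(x^*)^2}{\epsilon^2}\ln(1/p) &\;\Longrightarrow\; \hat F^* \le f^* + \epsilon.
\end{aligned}
$$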
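The sample-size conditions in Theorem 3.1 all have the form $N \ge O(1)\,\sigma^2\epsilon^{-2}\ln(1/p)$. The short Python sketch below evaluates the right-hand side for given inputs, to make the scaling in $\epsilon$ and $p$ concrete. The function name `sample_size` is hypothetical, and the argument `c` is only a placeholder for the unspecified $O(1)$ constant; the default value `1.0` is an assumption, not a value given by the theorem.

```python
import math

def sample_size(sigma, eps, p, c=1.0):
    """Smallest integer N with N >= c * sigma^2 / eps^2 * ln(1/p).

    `c` is a placeholder for the unspecified O(1) constant.
    """
    if not (sigma >= 0 and eps > 0 and 0 < p < 1):
        raise ValueError("need sigma >= 0, eps > 0 and 0 < p < 1")
    return math.ceil(c * (sigma / eps) ** 2 * math.log(1.0 / p))

n_base = sample_size(sigma=2.0, eps=0.1, p=0.01)
n_half_eps = sample_size(sigma=2.0, eps=0.05, p=0.01)   # tolerance halved
n_small_p = sample_size(sigma=2.0, eps=0.1, p=0.0001)   # ln(1/p) doubled
```

Halving the tolerance roughly quadruples the required sample size, whereas squaring the failure probability only doubles it, which is the logarithmic dependence on $1/p$ characteristic of exponential deviation inequalities.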

We now consider the case where the soft constraints are perturbed (). We will need the following definitions in addition to the ones in (9): for , and ,

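$$
\begin{aligned}
\breve\sigma_I(x) &:= \sup_{i\in I}\sqrt{\hat\sigma_i(x)^2 + \sigma_i(x)^2}, \\
\hat\sigma_I(Z) &:= \sup_{i\in I}\left\{A_{\alpha_i}(Z)\sqrt{\hat L_i^2 + L_i^2}\right\}, \\
\hat\sigma_I(\gamma) &:= \sup_{i\in I}\left\{A_{\alpha_i}(X_{i,\gamma})\sqrt{\hat L_i^2 + L_i^2}\right\},
\end{aligned} \tag{10}
$$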

where $X_{i,\gamma}$ was defined in (LABEL:equation:active:level:set:constraint) and $A_{\alpha_i}(Z)$ is defined in (13) for any set $Z$.

A fundamental point is that the concentration phenomenon is guaranteed only over a confidence band. As a consequence, when constraints are randomly perturbed, the optimality deviation is affected by the feasibility deviation in terms of optimality errors given in the following. For a metric regular feasible set (Assumption LABEL:def:metric:regularity:intro), we can obtain deviation guarantees using the exterior approximation of given a tolerance . In this case, we need to consider the following optimality error:

$$
\mathsf{Gap}(\gamma) := f^* - \min_{X_\gamma} f \ge 0. \tag{11}
$$

See the definition of in (LABEL:equation:set:constraint:relaxation).
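To make the optimality error (11) concrete, the following Python sketch computes it on a one-dimensional toy problem by grid search. It assumes that the exterior approximation is the relaxed set $X_\gamma = \{x\in Y : f_1(x)\le\gamma\}$, which is our reading of the referenced definition; the toy data ($Y=[-1,1]$, constraint $-x\le 0$) and the helper name `gap` are hypothetical.

```python
import numpy as np

def gap(gamma, f, constraint, grid):
    """Gap(gamma) = f* - min of f over {x in grid : constraint(x) <= gamma}."""
    values = f(grid)
    g = constraint(grid)
    f_star = values[g <= 0.0].min()        # optimal value over X
    f_relaxed = values[g <= gamma].min()   # optimal value over X_gamma
    return f_star - f_relaxed

grid = np.linspace(-1.0, 1.0, 2001)        # discretization of Y = [-1, 1]

def constraint(x):
    return -x                              # feasible set X = [0, 1]

def f_border(x):
    return x                               # minimizer x* = 0 lies on the border of X

def f_interior(x):
    return (x - 0.5) ** 2                  # minimizer x* = 0.5 lies in the interior of X

gammas = (0.0, 0.1, 0.2)
gaps_border = [gap(g, f_border, constraint, grid) for g in gammas]
gaps_interior = [gap(g, f_interior, constraint, grid) for g in gammas]
```

In the first case the error grows linearly in $\gamma$ (up to the grid resolution), while in the second case it vanishes because relaxing the constraint does not improve on an interior solution.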

###### Theorem 3.2 (SAA with exterior approximation of a metric regular feasible set).

Consider the Set-up LABEL:setup:optimization:problem and Problem LABEL:problem:perturbed:opt under (LABEL:equation:expected:data)-(LABEL:equation:empirical:data). Suppose satisfies Assumption LABEL:def:metric:regularity:intro, for some and let , and . Then

• (Feasibility) For any and any , with probability :

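$$
\hat\epsilon_i \equiv \epsilon \ \text{ and } \ N \ge O(1)\,\frac{\hat\sigma^2}{\epsilon^2}\left[\ln m + \ln(1/p)\right] \;\Longrightarrow\; \hat X \subset X_{2\epsilon} \subset \breve X_{2c\epsilon},
$$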

with .

• (Optimality and feasibility) For any and any , with probability :

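$$
\hat\epsilon_i \equiv \epsilon \ \text{ and } \ N \ge O(1)\,\frac{\hat\sigma^2}{\epsilon^2}\left[\ln m + \ln(1/p)\right] \;\Longrightarrow\;
\begin{cases}
D(\hat X, X) \le 2c\epsilon, \\
\forall \hat x \in \hat X^*_\epsilon,\ f(\hat x) \le f^* + 2\epsilon,
\end{cases}
$$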

with .

• (Optimal value) For any and any , with probability :

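$$
\hat\epsilon_i \equiv \epsilon \ \text{ and } \ N \ge O(1)\,\frac{\hat\sigma^2}{\epsilon^2}\left[\ln m + \ln(1/p)\right] \;\Longrightarrow\;
\begin{cases}
\hat F^* \le f^* + \epsilon, \\
f^* \le \hat F^* + \epsilon + \mathsf{Gap}(2\epsilon),
\end{cases}
$$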

with .

• (Localized feasibility for convex constraints) If, additionally, a.e. is a family of convex functions on , then the statements in items (i)-(iii) remain true with replaced by .

In item (iv), by “localized feasibility” we mean that the finite-sample feasibility guarantee depends on variances at the points and and the variance associated to the approximate active-level sets of the constraints.

For a convex feasible set satisfying the Slater constraint qualification, we can obtain deviations using the interior approximation of for a given tolerance . In this case we need to consider the following optimality error:

$$
\mathsf{gap}(\gamma) := \min_{X_{-\gamma}} f - f^* \ge 0. \tag{12}
$$

###### Theorem 3.3 (SAA with interior approximation of a strictly feasible convex set).

Consider the Set-up LABEL:setup:optimization:problem and Problem LABEL:problem:perturbed:opt under (LABEL:equation:expected:data)-(LABEL:equation:empirical:data). Suppose satisfies Assumption LABEL:assump:SCQ such that for all , for some and let . Then

• (Localized exact feasibility) For any , and , with probability :

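$$
\hat\epsilon_i \equiv -\epsilon \ \text{ and } \ N \ge O(1)\,\frac{\hat\sigma^2}{\epsilon^2}\left[\ln m + \ln(1/p)\right] \;\Longrightarrow\; \hat X \subset X,
$$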

with .

• (Optimality and feasibility) For any , , and , with probability :

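$$
\hat\epsilon_i \equiv -\epsilon \ \text{ and } \ N \ge O(1)\,\frac{\hat\sigma^2}{\epsilon^2}\left[\ln m + \ln(1/p)\right] \;\Longrightarrow\; \hat X^*_\epsilon \subset X^*_{2\epsilon + \mathsf{gap}(2\epsilon)},
$$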

with .

• (Optimal value) For any , , and , with probability :

with .

In item (i), by “localized feasibility” we mean that the finite-sample feasibility guarantee depends on point-wise variances and and the variance associated with the exact active-level sets of the constraints. In the case of satisfying the SCQ, Theorem 3.2(iv) and Theorem 3.3(i) describe a finite-sample “transition regime” with respect to feasibility: for all , a -near-feasibility is guaranteed in terms of set deviation, while for , we obtain exact feasibility. We note that and are related as described in Theorem LABEL:thm:robinson.

To complete the picture regarding the deviation inequalities in Theorems 3.2 and 3.3 in the case of perturbed constraints, we present the next straightforward proposition to state explicitly that, under Assumption LABEL:def:metric:regularity:intro, with rate at most . This rate depends only on the local Hölder modulus of around solutions at the border (to be made precise below). Note that, in many instances, . An analogous result holds for under Assumption LABEL:assump:SCQ for all sufficiently small . A more drastic manifestation of this local behavior is that these errors are exactly zero if “interior solutions near the border” exist in the sense given below.

###### Proposition 3.4 (Optimality errors).

Suppose Assumption LABEL:assump:random:holder:continuity holds. Given with and , set

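$$
L_{\gamma,c} := \min_{z\in (X_{\gamma_+})^*}\ \sup\left\{ \frac{|f(x)-f(y)|}{\|x-y\|^{\alpha_0}} : x\ne y \text{ and } x,y\in B[z,c|\gamma|]\cap Y \right\} \le L_0.
$$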

If, additionally, Assumption LABEL:def:metric:regularity:intro holds, then for all . Moreover, if .

If, additionally, Assumption LABEL:assump:SCQ holds, then for all where . Moreover, if .

A proof of Proposition 3.4 is given in the Appendix for completeness.

### 3.2 A uniform concentration inequality for heavy-tailed Hölder continuous functions

An important contribution of this work is the following Theorem 3.6. This is the cornerstone tool used in the analysis of the SAA estimator under Assumption LABEL:assump:random:holder:continuity (without the need of sub-Gaussianity). Its derivation relies on a branch of mathematical statistics called empirical process theory. Given an i.i.d. sample of a distribution over a space and a family of measurable functions , the empirical process (EP) is the stochastic process indexed over . An essential quantity in this theory is . If , then is simply a sum of independent random variables. Otherwise, it is a much more complicated object. In our specific context of Set-up LABEL:setup:optimization:problem, Problem LABEL:problem:perturbed:opt and (LABEL:equation:expected:data)-(LABEL:equation:empirical:data), we shall consider classes of the form , where is a compact subset of . Actually, our result is valid for general totally bounded metric spaces. In controlling the supremum , the “complexity” of is important. This is formalized in the next definition.

###### Definition 3.5 (Metric entropy).

Let be a totally bounded metric space. Given , a -net for is a finite set of maximal cardinality such that for all with , one has . The -entropy number is . The function is called the metric entropy of .
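For intuition on Definition 3.5, the Python sketch below builds a separated subset of a finite point cloud greedily and reports the logarithm of its cardinality. It rests on several assumptions: the metric is Euclidean, “separated” is read as pairwise distances strictly larger than the scale `theta`, and the greedy set is maximal with respect to inclusion but not necessarily of maximum cardinality, so in general its log-cardinality is only a lower bound on the entropy number. The name `greedy_separated_subset` is hypothetical.

```python
import math
import numpy as np

def greedy_separated_subset(points, theta):
    """Greedily pick points whose pairwise Euclidean distances all exceed theta."""
    chosen = []
    for x in points:
        if all(np.linalg.norm(x - c) > theta for c in chosen):
            chosen.append(x)
    return np.array(chosen)

# Point cloud: the integer grid {0, 1, ..., 10} on the real line.
points = np.arange(11, dtype=float).reshape(-1, 1)
net = greedy_separated_subset(points, theta=2.5)
entropy_lower_bound = math.log(len(net))
```

On this grid the greedy pass keeps the points 0, 3, 6 and 9; shrinking `theta` keeps more points, which is how the entropy grows as the scale decreases.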

###### Theorem 3.6 (sub-Gaussian uniform concentration inequality for heavy-tailed Hölder continuous functions).

Suppose that

• is a totally bounded metric space with diameter .

• is an i.i.d. sample of a distribution over and denotes the corresponding empirical distribution.

• is a measurable function such that for some . Moreover, there exist and non-negative random variable with such that a.e. for all ,

We also define the quantities: ,

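$$
A_\alpha(M) := \sum_{i=1}^{\infty} \frac{D(M)^\alpha}{2^{i\alpha}} \sqrt{ H\!\left(\frac{D(M)}{2^{i}}, M\right) + H\!\left(\frac{D(M)}{2^{i-1}}, M\right) + \ln[i(i+1)] }, \tag{13}
$$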

and, for all ,

$$
\hat\delta_G(x,y) := (\hat{\mathbf{P}} - \mathbf{P})\left[G(x,\cdot) - G(y,\cdot)\right].
$$

Then there exists a universal constant $C>0$ such that, for any and ,

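$$
\mathbb{P}\left\{ \sup_{x\in M} \left|\hat\delta_G(x,y)\right| \ge C A_\alpha(M) \sqrt{\frac{(1+t)}{N}\left[\hat L^2 + \mathbf{P}L^2(\cdot)\right]} \right\} \le e^{-t}.
$$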

The quantity measures the “complexity” of the class with respect to diameter and metric entropy of and the smoothness exponent of for the uniform concentration property to hold.
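The series (13) can be evaluated numerically once a metric entropy function is available. The Python sketch below truncates the series and plugs in an entropy bound of the form $H(\theta, M)\le d\ln(1+2D(M)/\theta)$, which is typical of bounded subsets of $\mathbb{R}^d$ but is an assumption of this example rather than a statement from the paper. The names `a_alpha` and `ball_entropy` and the truncation level are likewise hypothetical.

```python
import math

def a_alpha(diam, alpha, entropy, n_terms=200):
    """Truncated version of the series (13).

    `entropy(theta)` must return the metric entropy H(theta, M).
    """
    total = 0.0
    for i in range(1, n_terms + 1):
        h = (entropy(diam / 2**i)
             + entropy(diam / 2**(i - 1))
             + math.log(i * (i + 1)))
        total += diam**alpha / 2**(i * alpha) * math.sqrt(h)
    return total

def ball_entropy(dim, diam):
    """Assumed entropy bound: H(theta, M) <= dim * ln(1 + 2 * diam / theta)."""
    return lambda theta: dim * math.log(1.0 + 2.0 * diam / theta)

a_small = a_alpha(diam=1.0, alpha=1.0, entropy=ball_entropy(dim=5, diam=1.0))
a_large = a_alpha(diam=1.0, alpha=1.0, entropy=ball_entropy(dim=50, diam=1.0))
a_scaled = a_alpha(diam=2.0, alpha=1.0, entropy=ball_entropy(dim=5, diam=2.0))
```

Under this assumed entropy bound the terms decay geometrically, so the truncation error is negligible, the value increases with the dimension, and doubling the diameter multiplies it by $2^\alpha$ because the entropy depends on $\theta$ only through the ratio $\theta/D(M)$.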

To obtain Theorem 3.6, we will use the following inequality due to Panchenko (see Theorem 1 in [37] or Theorem 12.3 in [7]). It establishes a sub-Gaussian tail for the deviation of an EP around its mean after a proper normalization with respect to a random quantity .

###### Theorem 3.7 (Panchenko’s inequality).

Let be a finite family of measurable functions such that . Let also and be both i.i.d. samples drawn from a distribution over which are independent of each other. Define

 S:=supg∈FN∑j=1g(ξj),and%