
Generalization Bounds in the Predict-then-Optimize Framework

Othman El Balghiti

Department of Industrial Engineering and Operations Research, Columbia University, New York, NY, 10027, oe2161@columbia.edu

Adam N. Elmachtoub

Department of Industrial Engineering and Operations Research and Data Science Institute, Columbia University, New York, NY, 10027, adam@ieor.columbia.edu

Paul Grigas

Department of Industrial Engineering and Operations Research, UC Berkeley, Berkeley, CA, 94720, pgrigas@berkeley.edu

Ambuj Tewari

Department of Statistics, University of Michigan, Ann Arbor, MI, 48109, tewaria@umich.edu

The predict-then-optimize framework is fundamental in many practical settings: predict the unknown parameters of an optimization problem, and then solve the problem using the predicted values of the parameters. A natural loss function in this environment is to consider the cost of the decisions induced by the predicted parameters, in contrast to the prediction error of the parameters. This loss function was recently introduced in Elmachtoub and Grigas (2017), which called it the Smart Predict-then-Optimize (SPO) loss. Since the SPO loss is nonconvex and noncontinuous, standard results for deriving generalization bounds do not apply. In this work, we provide an assortment of generalization bounds for the SPO loss function. In particular, we derive bounds based on the Natarajan dimension that, in the case of a polyhedral feasible region, scale at most logarithmically in the number of extreme points, but, in the case of a general convex set, have poor dependence on the dimension. By exploiting the structure of the SPO loss function and an additional strong convexity assumption on the feasible region, we can dramatically improve the dependence on the dimension via an analysis and corresponding bounds that are akin to the margin guarantees in classification problems.

Key words: generalization bounds; prescriptive analytics; regression

A common application of machine learning is to predict-then-optimize, i.e., predict unknown parameters of an optimization problem and then solve the optimization problem using the predictions. For instance, consider a navigation task that requires solving a shortest path problem. The key inputs to this problem are the travel times on each edge, typically called edge costs. Although the exact costs are not known at the time the problem is solved, they are predicted using a machine learning model trained on historical data consisting of features (time of day, weather, etc.) and edge costs (collected from app data). Fundamentally, a good model induces the optimization problem to find good shortest paths, as measured by the true edge costs. In fact, recent work has considered how to solve problems in similar environments (Bertsimas and Kallus 2014, Kao et al. 2009, Donti et al. 2017). In particular, Elmachtoub and Grigas (2017) developed the Smart Predict-then-Optimize (SPO) loss function, which exactly measures the quality of a prediction by the decision error, in contrast to the prediction error as measured by standard loss functions such as squared error. In this work, we seek to provide an assortment of generalization bounds for the SPO loss function.

Specifically, we shall assume that our optimization task is to minimize a linear objective over a convex feasible region. In the shortest path example, the feasible region is a polyhedron. We assume the objective cost vector is not known at the time the optimization problem is solved, but rather predicted from data. A decision is made with respect to the predicted cost vector, and the SPO loss is computed by evaluating the decision on the true cost vector and then subtracting the optimal cost assuming knowledge of the true cost vector. Unfortunately, the SPO loss is nonconvex and non-Lipschitz, and therefore proving generalization bounds is not immediate.

Our results consider two cases, depending on whether the feasible region is a polyhedron or a strongly convex body. In all cases, we achieve a dependency of $1/\sqrt{n}$ up to logarithmic terms, where $n$ is the number of samples. In the polyhedral case, our generalization bound is formed by considering the Rademacher complexity of the class obtained by composing the SPO loss with our predict-then-optimize models. This in turn can be bounded by a term on the order of the square root of the Natarajan dimension times the logarithm of the number of extreme points in the feasible region. Since the number of extreme points is typically exponential in the dimension, this logarithm is essential so that the bound is at most linear in the dimension. When our cost vector prediction models are restricted to linear functions, we show that the Natarajan dimension of the predict-then-optimize hypothesis class is simply bounded by the product of the two relevant dimensions of the linear hypothesis class: the feature dimension and the cost vector dimension. Using this polyhedral approach, we show that a generalization bound is possible for any convex set by looking at a covering of the feasible region, although the dependency on the dimension is at least linear.

Fortunately, we show that when the feasible region is strongly convex, tighter generalization bounds can be obtained using margin-based methods. The proof relies on constructing an upper bound on the SPO loss function and showing that it is Lipschitz. Our margin-based bounds have no explicit dependence on the dimensions of the input features and of the cost vectors. They are expressed as a function of the multivariate Rademacher complexity of the vector-valued hypothesis class being used. We show that for suitably constrained linear hypothesis classes, we get a much improved dependence on problem dimensions. Since the SPO loss generalizes the 0-1 multiclass loss from multiclass classification (see Example 1), our work can be seen as extending classic Natarajan-dimension based (Shalev-Shwartz and Ben-David 2014, Ch. 29) and margin-based generalization bounds (Koltchinskii et al. 2002) to the predict-then-optimize framework.

One of the challenges in the multi-class classification literature is to provide generalization bounds that are not too large in the number of classes. For data-independent worst-case bounds, the dependency is at best square root in the number of classes (Guermeur 2007, Daniely et al. 2015). In contrast, we provide data-independent bounds that grow only logarithmically in the number of extreme points (labels). Using data-dependent (margin-based) approaches, Lei et al. (2015) and Li et al. (2018) successfully decreased this complexity to logarithm in the number of classes. In contrast, our margin-based approach removes the dependency on the number of classes by exploiting the structure of the SPO loss function.

Even though we construct a Lipschitz upper bound on the SPO loss in a general norm setting (Theorem 3), our margin bounds (Theorem 4) are stated in the $\ell_2$-norm setting. This is because the most general contraction-type lemma for vector-valued Lipschitz functions we know of only works for the $\ell_2$-norm (Maurer 2016). Bertsimas and Kallus (2014) derive the same type of bounds in an infinity norm setting, but our understanding of general norms appears limited at present. Our work will hopefully provide the motivation to develop contraction inequalities for vector-valued Lipschitz functions in a general norm setting.

We now describe the predict-then-optimize framework, which is central to many applications of optimization in practice. Specifically, we assume that there is a nominal optimization problem of interest which models our downstream decision-making task. Furthermore, we assume that the nominal problem has a linear objective and that the decision variable $w \in \mathbb{R}^d$ and feasible region $S \subseteq \mathbb{R}^d$ are well-defined and known with certainty. However, the cost vector of the objective, $c \in \mathbb{R}^d$, is not observed directly, and rather an associated feature vector $x \in \mathbb{R}^p$ is observed. Let $\mathcal{D}$ be the underlying joint distribution of $(x, c)$ and let $\mathcal{D}_x$ be the conditional distribution of $c$ given $x$. Then the goal for the decision maker is to solve

$$\min_{w \in S} \; \mathbb{E}_{c \sim \mathcal{D}_x}\left[c^T w \mid x\right] \;=\; \min_{w \in S} \; \mathbb{E}_{c \sim \mathcal{D}_x}\left[c \mid x\right]^T w. \tag{1}$$

The predict-then-optimize framework relies on using a prediction/estimate for $\mathbb{E}_{c \sim \mathcal{D}_x}[c \mid x]$, which we denote by $\hat{c}$, and solving the deterministic version of the optimization problem based on $\hat{c}$. We define $P(\hat{c})$ to be the optimization task with objective cost vector $\hat{c}$, namely

$$P(\hat{c}): \quad \min_{w} \; \hat{c}^T w \quad \text{s.t.} \quad w \in S. \tag{2}$$

We assume $S \subseteq \mathbb{R}^d$ is a nonempty, compact, and convex set representing the feasible region. We let $w^*(\cdot): \mathbb{R}^d \to S$ denote any oracle for solving $P(\cdot)$. That is, $w^*(\cdot)$ is a fixed deterministic mapping such that $w^*(c) \in \arg\min_{w \in S} \{c^T w\}$ for all $c \in \mathbb{R}^d$. For instance, if (2) corresponds to a linear, conic, or even a particular combinatorial or mixed-integer optimization problem (in which case $S$ can be implicitly described as a convex set), then a commercial optimization solver or a specialized algorithm suffices for $w^*(\cdot)$.

In this framework, we assume that predictions are made from a model that is learned on a training data set. Specifically, the sample training data $(x_1, c_1), \ldots, (x_n, c_n)$ is drawn i.i.d. from the joint distribution $\mathcal{D}$, where $x_i \in \mathcal{X}$ is a feature vector representing auxiliary information associated with the cost vector $c_i$. We denote by $\mathcal{H}$ our hypothesis class of cost vector prediction models, thus for a function $f \in \mathcal{H}$, we have that $f: \mathcal{X} \to \mathbb{R}^d$. Most approaches for learning a model from the training data are based on specifying a loss function that quantifies the error in making prediction $\hat{c}$ when the realized (true) cost vector is actually $c$. Herein, following Elmachtoub and Grigas (2017), our primary loss function of interest is the “smart predict-then-optimize” loss function that directly takes the nominal optimization problem into account when measuring errors in predictions. Namely, we consider the SPO loss function (relative to the optimization oracle $w^*(\cdot)$) defined by:

$$\ell_{\mathrm{SPO}}(\hat{c}, c) := c^T\left(w^*(\hat{c}) - w^*(c)\right),$$

where $\hat{c}$ is the predicted cost vector and $c$ is the true realized cost vector. Notice that $\ell_{\mathrm{SPO}}(\hat{c}, c)$ exactly measures the excess cost incurred when making a suboptimal decision due to an imprecise cost vector prediction. Also, note that we have for all and .

###### Example 1

In the shortest path problem, $c$ is the edge cost vector, $x$ is a feature vector (e.g., weather and time), and $S$ is a network flow polytope. Our setting also captures multi-class (and binary) classification by the following characterization: $S$ is the $d$-dimensional simplex where $d$ is the number of classes, and $\mathcal{C} = \{-e_i : i = 1, \ldots, d\}$ where $e_i$ is the $i$-th unit vector in $\mathbb{R}^d$. It is easy to see that each vertex of the simplex corresponds to a label, and correct/incorrect prediction has a loss of 0/1.
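As a small illustration of the multi-class case, the following Python sketch evaluates the SPO loss by enumerating the vertices of the simplex. The helper names (`oracle`, `spo_loss`), the tie-breaking rule, and the encoding of a true label $i$ as the cost vector $-e_i$ are assumptions made for this sketch rather than anything fixed by the text.

```python
import numpy as np

def oracle(c_hat, vertices):
    """Return a vertex minimizing c_hat^T w (ties broken by lowest index)."""
    values = vertices @ c_hat
    return vertices[int(np.argmin(values))]

def spo_loss(c_hat, c, vertices):
    """SPO loss c^T (w*(c_hat) - w*(c)) for a polytope given by its vertices."""
    return float(c @ (oracle(c_hat, vertices) - oracle(c, vertices)))

# Multi-class classification with d = 3 classes: the vertices of the simplex
# are the unit vectors, and the true label i is encoded as c = -e_i (assumed).
d = 3
simplex_vertices = np.eye(d)
true_label = 1
c = -simplex_vertices[true_label]

# The predicted label is the index of the smallest predicted cost.
correct = spo_loss(np.array([0.2, -1.0, 0.5]), c, simplex_vertices)
wrong = spo_loss(np.array([-2.0, -1.0, 0.5]), c, simplex_vertices)
print(correct, wrong)
```

Under this encoding, a prediction whose smallest entry sits at the true label incurs loss 0 and any other prediction incurs loss 1, which is the 0-1 multiclass loss.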

As pointed out in Elmachtoub and Grigas (2017), the SPO loss function is generally non-convex, may even be discontinuous, and is in fact a strict generalization of the 0-1 loss function in binary classification. Thus, optimizing the SPO loss via empirical risk minimization may be intractable even when $\mathcal{H}$ is a linear hypothesis class. To circumvent these difficulties, one approach is to optimize a convex surrogate loss as examined in Elmachtoub and Grigas (2017). Our focus is on deriving generalization bounds that hold uniformly over the class $\mathcal{H}$, and thus are valid for any training approach, including using a surrogate or other loss function within the framework of empirical risk minimization. Notice that a generalization bound for the SPO loss directly translates to an upper bound guarantee for problem (1) that holds “on average” over the distribution.

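We will make use of a generic given norm $\|\cdot\|$ on $w \in \mathbb{R}^d$, as well as the $\ell_q$-norm denoted by $\|\cdot\|_q$ for $q \in [1, \infty]$. For the given norm $\|\cdot\|$ on $w$, $\|\cdot\|_*$ denotes the dual norm defined by $\|c\|_* := \max_{w: \|w\| \leq 1} c^T w$. Let $B(\bar{w}, r) := \{w : \|w - \bar{w}\| \leq r\}$ denote the ball of radius $r$ centered at $\bar{w}$, and we analogously define $B_q(\bar{w}, r)$ for the $\ell_q$-norm and $B_*(\bar{c}, r)$ for the dual norm. For a set $S \subseteq \mathbb{R}^d$, we define the size of $S$ in the norm $\|\cdot\|$ by $\rho(S) := \sup_{w \in S} \|w\|$. We analogously define $\rho_q(\cdot)$ for the $\ell_q$-norm and $\rho_*(\cdot)$ for the dual norm. We define the “linear optimization gap” of $S$ with respect to $c$ by $\omega_S(c) := \max_{w \in S}\{c^T w\} - \min_{w \in S}\{c^T w\}$, and for a set $\mathcal{C} \subseteq \mathbb{R}^d$ we slightly abuse notation by defining $\omega_S(\mathcal{C}) := \sup_{c \in \mathcal{C}} \omega_S(c)$. Define $w^*(\mathcal{H}) := \{x \mapsto w^*(f(x)) : f \in \mathcal{H}\}$.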

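Let us now briefly review the notion of Rademacher complexity and its application in our framework. Recall that $\mathcal{H}$ is a hypothesis class of functions mapping from the feature space $\mathcal{X}$ to $\mathbb{R}^d$. Given a fixed sample $(x_1, c_1), \ldots, (x_n, c_n)$, we define the empirical Rademacher complexity of $\mathcal{H}$ with respect to the SPO loss, i.e., the empirical Rademacher complexity of the function class obtained by composing $\ell_{\mathrm{SPO}}$ with $\mathcal{H}$, by

$$\hat{\mathfrak{R}}^n_{\mathrm{SPO}}(\mathcal{H}) := \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i \, \ell_{\mathrm{SPO}}(f(x_i), c_i)\right],$$

where $\sigma_i$ are i.i.d. Rademacher random variables for $i = 1, \ldots, n$. The expected version of the Rademacher complexity is defined as $\mathfrak{R}^n_{\mathrm{SPO}}(\mathcal{H}) := \mathbb{E}\left[\hat{\mathfrak{R}}^n_{\mathrm{SPO}}(\mathcal{H})\right]$, where the expectation is w.r.t. an i.i.d. sample drawn from the underlying distribution $\mathcal{D}$. The following theorem is an application of the classical generalization bounds based on Rademacher complexity due to Bartlett and Mendelson (2002) to our setting.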

###### Theorem 1 (Bartlett and Mendelson (2002))

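Let $\mathcal{H}$ be a family of functions mapping from $\mathcal{X}$ to $\mathbb{R}^d$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over an i.i.d. sample drawn from the distribution $\mathcal{D}$, each of the following holds for all $f \in \mathcal{H}$:

$$R_{\mathrm{SPO}}(f) \leq \hat{R}_{\mathrm{SPO}}(f) + 2\mathfrak{R}^n_{\mathrm{SPO}}(\mathcal{H}) + \omega_S(\mathcal{C})\sqrt{\frac{\log(1/\delta)}{2n}}, \quad \text{and}$$

$$R_{\mathrm{SPO}}(f) \leq \hat{R}_{\mathrm{SPO}}(f) + 2\hat{\mathfrak{R}}^n_{\mathrm{SPO}}(\mathcal{H}) + 3\omega_S(\mathcal{C})\sqrt{\frac{\log(2/\delta)}{2n}}.$$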
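The empirical Rademacher complexity in the second bound is an expectation over random signs, so for a finite set of candidate models it can be approximated by Monte Carlo. The sketch below assumes the losses have already been tabulated in a matrix whose entry $(k, i)$ is the loss of hypothesis $k$ on sample $i$; the function name, the number of draws, and the synthetic loss values are all hypothetical choices for illustration.

```python
import numpy as np

def empirical_rademacher(loss_matrix, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[max_k (1/n) sum_i sigma_i * loss[k, i]]."""
    rng = np.random.default_rng(seed)
    n = loss_matrix.shape[1]
    # Each row of sigma is one draw of n i.i.d. Rademacher signs.
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))
    correlations = sigma @ loss_matrix.T / n  # shape (num_draws, num_hypotheses)
    return float(correlations.max(axis=1).mean())

# Synthetic losses in [0, 1] for 5 hypotheses on n = 50 samples (illustrative only).
rng = np.random.default_rng(1)
losses = rng.uniform(0.0, 1.0, size=(5, 50))
estimate_one = empirical_rademacher(losses[:1])  # a single hypothesis
estimate_all = empirical_rademacher(losses)      # the full finite class
print(estimate_one, estimate_all)
```

With a single hypothesis the supremum disappears and the estimate is close to zero; enlarging the class can only increase the estimate, which is the sense in which the quantity measures richness.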

In this section, we consider the case where $S$ is a polyhedron and derive generalization bounds based on bounding the Rademacher complexity of the SPO loss and applying Theorem 1. Since $S$ is polyhedral, the optimal solution of (2) can be found by considering only the finite set of extreme points of $S$, which we denote by the set $\mathfrak{S}$. Since the number of extreme points may be exponential in $d$, our goal is to provide bounds that are logarithmic in $|\mathfrak{S}|$. At the end of the section, we extend our analysis to any compact and convex feasible region by extending the polyhedral analysis with a covering number argument.

In order to derive a bound on the Rademacher complexity, we will critically rely on the notion of the Natarajan dimension (Natarajan 1989), which is an extension of the VC-dimension to the multiclass classification setting and is defined in our setting as follows.

###### Definition 1 (Natarajan dimension)

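Suppose that $S$ is a polyhedron and $\mathfrak{S}$ is the set of its extreme points. Let $\mathcal{F}$ be a hypothesis space of functions mapping from $\mathcal{X}$ to $\mathfrak{S}$, and let $X \subseteq \mathcal{X}$ be given. We say that $\mathcal{F}$ N-shatters $X$ if there exist $g_1, g_2 \in \mathcal{F}$ such that

• $g_1(x) \neq g_2(x)$ for all $x \in X$

• For all $T \subseteq X$, there exists $g \in \mathcal{F}$ such that (i) $g(x) = g_1(x)$ for all $x \in T$ and (ii) $g(x) = g_2(x)$ for all $x \in X \setminus T$.

The Natarajan dimension of $\mathcal{F}$, denoted $d_N(\mathcal{F})$, is the maximal cardinality of a set N-shattered by $\mathcal{F}$.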

The Natarajan dimension is a measure for the richness of a hypothesis class. In Theorem 2, we show that the Rademacher complexity for the SPO loss can be bounded as a function of the Natarajan dimension of $w^*(\mathcal{H})$. The proof follows a classical argument and makes strong use of Massart’s lemma and the Natarajan lemma.

###### Theorem 2

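Suppose that $S$ is a polyhedron and $\mathfrak{S}$ is the set of its extreme points. Let $\mathcal{H}$ be a family of functions mapping from $\mathcal{X}$ to $\mathbb{R}^d$. Then we have that

$$\mathfrak{R}^n_{\mathrm{SPO}}(\mathcal{H}) \leq \omega_S(\mathcal{C})\sqrt{\frac{2 d_N(w^*(\mathcal{H})) \log(n|\mathfrak{S}|^2)}{n}}.$$

Furthermore, for any $\delta > 0$, with probability at least $1 - \delta$ over an i.i.d. sample drawn from the distribution $\mathcal{D}$, for all $f \in \mathcal{H}$ we have

$$R_{\mathrm{SPO}}(f) \leq \hat{R}_{\mathrm{SPO}}(f) + 2\omega_S(\mathcal{C})\sqrt{\frac{2 d_N(w^*(\mathcal{H})) \log(n|\mathfrak{S}|^2)}{n}} + \omega_S(\mathcal{C})\sqrt{\frac{\log(1/\delta)}{2n}}.$$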

Next, we show that when $\mathcal{H}$ is restricted to the linear hypothesis class $\mathcal{H}_{\mathrm{lin}} = \{x \mapsto Bx : B \in \mathbb{R}^{d \times p}\}$, then the Natarajan dimension of $w^*(\mathcal{H}_{\mathrm{lin}})$ can be bounded by $dp$. The proof relies on translating our problem to an instance of a linear multiclass prediction problem and using a result of Daniely and Shalev-Shwartz (2014).

###### Corollary 1

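Suppose that $S$ is a polyhedron and $\mathfrak{S}$ is the set of its extreme points. Let $\mathcal{H}_{\mathrm{lin}}$ be the hypothesis class of all linear functions, i.e., $\mathcal{H}_{\mathrm{lin}} = \{x \mapsto Bx : B \in \mathbb{R}^{d \times p}\}$. Then we have

$$d_N(w^*(\mathcal{H}_{\mathrm{lin}})) \leq dp.$$

Furthermore, for any $\delta > 0$, with probability at least $1 - \delta$ over an i.i.d. sample drawn from the distribution $\mathcal{D}$, for all $f \in \mathcal{H}_{\mathrm{lin}}$ we have

$$R_{\mathrm{SPO}}(f) \leq \hat{R}_{\mathrm{SPO}}(f) + 2\omega_S(\mathcal{C})\sqrt{\frac{2 dp \log(n|\mathfrak{S}|^2)}{n}} + \omega_S(\mathcal{C})\sqrt{\frac{\log(1/\delta)}{2n}}.$$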
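To get a feel for how the complexity term in Corollary 1 scales, the short sketch below evaluates $2\omega_S(\mathcal{C})\sqrt{2dp\log(n|\mathfrak{S}|^2)/n}$ numerically. It takes the logarithm of the number of extreme points as input so that exponentially many extreme points do not overflow. The function name and the hypercube instance (with $2^d$ extreme points and a gap of 1.0) are hypothetical values chosen only for illustration.

```python
import math

def polyhedral_complexity_term(omega, d, p, n, log_num_extreme_points):
    """Evaluate 2 * omega * sqrt(2 * d * p * log(n * |S|^2) / n),
    using log(n * |S|^2) = log(n) + 2 * log|S|."""
    log_term = math.log(n) + 2.0 * log_num_extreme_points
    return 2.0 * omega * math.sqrt(2.0 * d * p * log_term / n)

# Hypothetical instance: a hypercube in R^d has 2^d extreme points,
# so log|S| = d * log(2) grows only linearly in d.
d, p, n = 20, 5, 10_000
term = polyhedral_complexity_term(
    omega=1.0, d=d, p=p, n=n, log_num_extreme_points=d * math.log(2.0)
)
print(round(term, 4))
```

Because only $\log|\mathfrak{S}|$ enters, doubling the number of extreme points adds just $2\log 2$ inside the square root.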

Next, we build on the previous results to prove generalization bounds in the case where $S$ is a general compact convex set. The arguments made earlier relied extensively on the extreme points of the polyhedron. Nevertheless, this combinatorial argument can be modified in order to derive similar results for general $S$. The approach is to approximate $S$ by a grid of points corresponding to the smallest cardinality $\epsilon$-covering of $S$. To optimize over this grid of points, we first find the optimal solution in $S$ and then round to the nearest point in the grid. Fortunately, both the grid representation and the rounding procedure can be handled by arguments similar to those made in Theorem 2 and Corollary 1, yielding the generalization bound below.

###### Corollary 2

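Let $S$ be any compact and convex set, and let $\mathcal{H}_{\mathrm{lin}}$ be the hypothesis class of all linear functions. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over an i.i.d. sample drawn from the distribution $\mathcal{D}$, for all $f \in \mathcal{H}_{\mathrm{lin}}$ we have

$$R_{\mathrm{SPO}}(f) \leq \hat{R}_{\mathrm{SPO}}(f) + 4d\,\omega_S(\mathcal{C})\sqrt{\frac{2p\log(2n\rho_2(S)d)}{n}} + 3\omega_S(\mathcal{C})\sqrt{\frac{\log(2/\delta)}{2n}} + O\!\left(\frac{1}{n}\right).$$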

Although the dependence on the sample size $n$ in the above bound is favorable, the dependence on the number of features $p$ and the dimension of the feasible region $d$ is relatively weak. Given that the proofs of Corollary 2 and Theorem 2 are purely combinatorial and hold for worst-case distributions, this is not surprising. In the next section, we demonstrate how to exploit the structure of the SPO loss function and additional convexity properties of $S$ in order to develop improved bounds.

In this section, we develop improved generalization bounds for the SPO loss function under the additional assumption that the feasible region $S$ is strongly convex. Our developments are akin to and in fact are a strict generalization of the margin guarantees for binary classification based on Rademacher complexity developed in Koltchinskii et al. (2002). We adopt the definition of strongly convex sets presented in Journée et al. (2010) and Garber and Hazan (2015), which is reviewed in Definition 2 below. Recall that $\|\cdot\|$ is a generic given norm on $\mathbb{R}^d$ and $B(\bar{w}, r)$ denotes the ball of radius $r$ centered at $\bar{w}$.

###### Definition 2

We say that a convex set $S \subseteq \mathbb{R}^d$ is $\mu$-strongly convex with respect to the norm $\|\cdot\|$ if, for any $w_1, w_2 \in S$ and for any $\lambda \in [0, 1]$, it holds that:

$$B\left(\lambda w_1 + (1 - \lambda) w_2, \; \left(\frac{\mu}{2}\right)\lambda(1 - \lambda)\|w_1 - w_2\|^2\right) \subseteq S.$$

Informally, Definition 2 says that, for every convex combination of points in $S$, a ball of appropriate radius also lies in $S$. Several examples of strongly convex sets are presented in Journée et al. (2010) and Garber and Hazan (2015), including $\ell_q$ and Schatten $\ell_q$ balls for $q \in (1, 2]$, certain group norm balls, and generally any level set of a smooth and strongly convex function.

Our analysis herein relies on the following Proposition, which strengthens the first-order general optimality condition for differentiable convex optimization problems under the additional assumption of strong convexity. Proposition 1 may be of independent interest and, to the best of our knowledge, has not appeared previously in the literature.

###### Proposition 1

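Let $S \subseteq \mathbb{R}^d$ be a non-empty $\mu$-strongly convex set and let $F(\cdot): \mathbb{R}^d \to \mathbb{R}$ be a convex and differentiable function. Consider the convex optimization problem:

$$\min_{w} \; F(w) \quad \text{s.t.} \quad w \in S. \tag{3}$$

Then, $\bar{w} \in S$ is an optimal solution of (3) if and only if:

$$\nabla F(\bar{w})^T (w - \bar{w}) \geq \left(\frac{\mu}{2}\right)\|\nabla F(\bar{w})\|_* \|w - \bar{w}\|^2 \quad \text{for all } w \in S. \tag{4}$$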

In fact, we prove a slightly more general version of the proposition where the function $F$ need only be defined on an open set containing $S$. In the case of linear optimization with $F(w) = \hat{c}^T w$, the inequality (4) implies that $w^*(\hat{c})$ is the unique optimal solution of $P(\hat{c})$ whenever $\hat{c} \neq 0$ and $\mu > 0$. Hence, in the context of the SPO loss function with a strongly convex feasible region, $\|\hat{c}\|_*$ provides a degree of “confidence” regarding the decision $w^*(\hat{c})$ implied by the cost vector prediction $\hat{c}$. This intuition motivates us to define the “$\gamma$-margin SPO loss”, which places a greater penalty on cost vector predictions near 0.

###### Definition 3

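For a fixed parameter $\gamma > 0$, given a cost vector prediction $\hat{c}$ and a realized cost vector $c$, the $\gamma$-margin SPO loss $\ell^{\gamma}_{\mathrm{SPO}}(\hat{c}, c)$ is defined as:

$$\ell^{\gamma}_{\mathrm{SPO}}(\hat{c}, c) := \begin{cases} \ell_{\mathrm{SPO}}(\hat{c}, c) & \text{if } \|\hat{c}\|_* > \gamma, \\ \left(\dfrac{\|\hat{c}\|_*}{\gamma}\right)\ell_{\mathrm{SPO}}(\hat{c}, c) + \left(1 - \dfrac{\|\hat{c}\|_*}{\gamma}\right)\omega_S(c) & \text{if } \|\hat{c}\|_* \leq \gamma. \end{cases}$$

Recall that, for any $\hat{c}$ and $c$, it holds that $\ell_{\mathrm{SPO}}(\hat{c}, c) \leq \omega_S(c)$. Hence, we also have that $\ell_{\mathrm{SPO}}(\hat{c}, c) \leq \ell^{\gamma}_{\mathrm{SPO}}(\hat{c}, c)$, that is, the $\gamma$-margin SPO loss provides an upper bound on the SPO loss. Notice that the $\gamma$-margin SPO loss interpolates between the SPO loss and the upper bound $\omega_S(c)$ whenever $\|\hat{c}\|_* \leq \gamma$. The $\gamma$-margin SPO loss also satisfies a simple monotonicity property whereby $\ell^{\gamma}_{\mathrm{SPO}}(\hat{c}, c) \leq \ell^{\gamma'}_{\mathrm{SPO}}(\hat{c}, c)$ for any $\hat{c}$, $c$, and $\gamma' \geq \gamma > 0$. We can also define a “hard $\gamma$-margin SPO loss” that simply returns the upper bound $\omega_S(c)$ whenever $\|\hat{c}\|_* \leq \gamma$.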

###### Definition 4

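For a fixed parameter $\gamma > 0$, given a cost vector prediction $\hat{c}$ and a realized cost vector $c$, the hard $\gamma$-margin SPO loss $\bar{\ell}^{\gamma}_{\mathrm{SPO}}(\hat{c}, c)$ is defined as:

$$\bar{\ell}^{\gamma}_{\mathrm{SPO}}(\hat{c}, c) := \begin{cases} \ell_{\mathrm{SPO}}(\hat{c}, c) & \text{if } \|\hat{c}\|_* > \gamma, \\ \omega_S(c) & \text{if } \|\hat{c}\|_* \leq \gamma. \end{cases}$$

It is simple to see that $\ell^{\gamma}_{\mathrm{SPO}}(\hat{c}, c) \leq \bar{\ell}^{\gamma}_{\mathrm{SPO}}(\hat{c}, c)$ for all $\hat{c}$ and $c$. Due to this additional upper bound, in all of the subsequent generalization bound results, the empirical $\gamma$-margin SPO loss can be replaced by its hard margin counterpart.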
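The two margin losses can be sketched concretely when the feasible region is a Euclidean ball, a case where the oracle has a closed form. The code below assumes $S$ is the $\ell_2$ ball of radius `RADIUS` (so the dual norm is again the $\ell_2$-norm), uses the oracle $w^*(\hat{c}) = -r\,\hat{c}/\|\hat{c}\|_2$, and takes $\omega_S(c)$ to be the gap between the maximum and minimum of $c^T w$ over the ball, which equals $2r\|c\|_2$. All function names and the tie-breaking choice at $\hat{c} = 0$ are hypothetical.

```python
import numpy as np

RADIUS = 1.0  # assumed radius of the Euclidean ball S

def oracle_ball(c_hat, radius=RADIUS):
    """Minimizer of c_hat^T w over the l2 ball; arbitrary fixed point if c_hat = 0."""
    norm = np.linalg.norm(c_hat)
    if norm == 0.0:
        w = np.zeros_like(c_hat, dtype=float)
        w[0] = radius
        return w
    return -radius * c_hat / norm

def spo_loss(c_hat, c, radius=RADIUS):
    return float(c @ (oracle_ball(c_hat, radius) - oracle_ball(c, radius)))

def omega(c, radius=RADIUS):
    """Linear optimization gap of the ball: max minus min of c^T w."""
    return 2.0 * radius * float(np.linalg.norm(c))

def margin_spo_loss(c_hat, c, gamma, radius=RADIUS):
    """gamma-margin SPO loss: interpolate toward omega(c) when ||c_hat|| <= gamma."""
    norm = float(np.linalg.norm(c_hat))
    base = spo_loss(c_hat, c, radius)
    if norm > gamma:
        return base
    weight = norm / gamma
    return weight * base + (1.0 - weight) * omega(c, radius)

def hard_margin_spo_loss(c_hat, c, gamma, radius=RADIUS):
    """Hard gamma-margin SPO loss: return omega(c) when ||c_hat|| <= gamma."""
    if float(np.linalg.norm(c_hat)) > gamma:
        return spo_loss(c_hat, c, radius)
    return omega(c, radius)

c = np.array([1.0, 0.0])
c_hat = np.array([0.5, 0.0])  # correct direction, small norm
print(spo_loss(c_hat, c),
      margin_spo_loss(c_hat, c, gamma=1.0),
      hard_margin_spo_loss(c_hat, c, gamma=1.0))
```

In this instance the prediction points in the right direction, so the SPO loss is zero, yet both margin losses penalize it because its norm falls below $\gamma = 1$.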

We are now ready to state a theorem concerning the Lipschitz properties of the optimization oracle $w^*(\cdot)$ and the $\gamma$-margin SPO loss, which will then be used to derive margin-based generalization bounds. Theorem 3 below first demonstrates that the optimization oracle $w^*(\cdot)$ satisfies a “Lipschitz-like” property away from zero. Subsequently, this Lipschitz-like property is a key ingredient in demonstrating that the $\gamma$-margin SPO loss is Lipschitz.

###### Theorem 3

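Suppose that the feasible region $S$ is $\mu$-strongly convex with $\mu > 0$. Then, the optimization oracle $w^*(\cdot)$ satisfies the following “Lipschitz-like” property: for any $\hat{c}_1, \hat{c}_2 \in \mathbb{R}^d$, it holds that:

$$\|w^*(\hat{c}_1) - w^*(\hat{c}_2)\| \leq \frac{1}{\mu \cdot \min\{\|\hat{c}_1\|_*, \|\hat{c}_2\|_*\}} \|\hat{c}_1 - \hat{c}_2\|_*. \tag{5}$$

Moreover, for any fixed $c \in \mathbb{R}^d$ and $\gamma > 0$, the $\gamma$-margin SPO loss is $(5\|c\|_*/(\gamma\mu))$-Lipschitz with respect to the dual norm $\|\cdot\|_*$, i.e., it holds that:

$$\left|\ell^{\gamma}_{\mathrm{SPO}}(\hat{c}_1, c) - \ell^{\gamma}_{\mathrm{SPO}}(\hat{c}_2, c)\right| \leq \frac{5\|c\|_*}{\gamma\mu} \|\hat{c}_1 - \hat{c}_2\|_* \quad \text{for all } \hat{c}_1, \hat{c}_2 \in \mathbb{R}^d. \tag{6}$$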

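Proof. We present here only the proof of (5) and defer the proof of (6), which relies crucially on (5), to the supplementary materials. Let $\tau := \min\{\|\hat{c}_1\|_*, \|\hat{c}_2\|_*\}$. We assume without loss of generality that $\tau > 0$ (otherwise the right-hand side of (5) is equal to $+\infty$ by convention). Applying Proposition 1 twice yields:

$$\hat{c}_1^T\left(w^*(\hat{c}_2) - w^*(\hat{c}_1)\right) \geq \left(\frac{\mu}{2}\right)\tau\,\|w^*(\hat{c}_1) - w^*(\hat{c}_2)\|^2,$$

and

$$\hat{c}_2^T\left(w^*(\hat{c}_1) - w^*(\hat{c}_2)\right) \geq \left(\frac{\mu}{2}\right)\tau\,\|w^*(\hat{c}_1) - w^*(\hat{c}_2)\|^2.$$

Adding the above two inequalities together yields:

$$\mu\tau\,\|w^*(\hat{c}_1) - w^*(\hat{c}_2)\|^2 \leq (\hat{c}_2 - \hat{c}_1)^T\left(w^*(\hat{c}_1) - w^*(\hat{c}_2)\right) \leq \|\hat{c}_1 - \hat{c}_2\|_* \, \|w^*(\hat{c}_1) - w^*(\hat{c}_2)\|,$$

where the second inequality is Hölder’s inequality. Dividing both sides of the above by $\mu\tau\,\|w^*(\hat{c}_1) - w^*(\hat{c}_2)\|$ yields the desired result (the case $w^*(\hat{c}_1) = w^*(\hat{c}_2)$ is trivial).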
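Inequality (5) can be sanity-checked numerically on a simple strongly convex set. The sketch below assumes $S$ is the unit Euclidean ball, for which the oracle is $w^*(\hat{c}) = -\hat{c}/\|\hat{c}\|_2$ and for which Definition 2 holds with $\mu = 1$ (this value is worked out for the sketch, not quoted from the text). It samples random pairs of cost vectors and records the worst observed ratio between the two sides of (5).

```python
import numpy as np

def oracle_unit_ball(c_hat):
    """Minimizer of c_hat^T w over the unit l2 ball (c_hat assumed nonzero)."""
    return -c_hat / np.linalg.norm(c_hat)

rng = np.random.default_rng(0)
mu = 1.0  # assumed strong convexity parameter of the unit l2 ball
worst_ratio = 0.0
for _ in range(1000):
    c1, c2 = rng.normal(size=(2, 5))
    lhs = np.linalg.norm(oracle_unit_ball(c1) - oracle_unit_ball(c2))
    tau = min(np.linalg.norm(c1), np.linalg.norm(c2))
    rhs = np.linalg.norm(c1 - c2) / (mu * tau)
    worst_ratio = max(worst_ratio, lhs / rhs)

# A ratio of at most 1 is consistent with (5) on every sampled pair.
print(worst_ratio)
```

A random check of this kind cannot prove the inequality, but a ratio above 1 would flag an error in the assumed oracle or in the value of $\mu$.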

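We are now ready to present our main generalization bounds of interest in the strongly convex case. Our results are based on combining Theorem 3 with the Lipschitz vector-contraction inequality for Rademacher complexities developed in Maurer (2016), as well as the results of Bartlett and Mendelson (2002). Following Bertsimas and Kallus (2014) and Maurer (2016), given a fixed sample $(x_1, c_1), \ldots, (x_n, c_n)$, we define the multivariate empirical Rademacher complexity of $\mathcal{H}$ as

$$\hat{\mathfrak{R}}^n(\mathcal{H}) := \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^d \sigma_{ij} f_j(x_i)\right] = \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i^T f(x_i)\right], \tag{7}$$

where $\sigma_{ij}$ are i.i.d. Rademacher random variables for $i = 1, \ldots, n$ and $j = 1, \ldots, d$, and $\sigma_i = (\sigma_{i1}, \ldots, \sigma_{id})^T$. The expected version of the multivariate Rademacher complexity is defined as $\mathfrak{R}^n(\mathcal{H}) := \mathbb{E}\left[\hat{\mathfrak{R}}^n(\mathcal{H})\right]$, where the expectation is w.r.t. the i.i.d. sample drawn from the underlying distribution $\mathcal{D}$.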

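Let us also define the empirical $\gamma$-margin SPO loss and the empirical Rademacher complexity of $\mathcal{H}$ with respect to the $\gamma$-margin SPO loss as follows:

$$\hat{R}^{\gamma}_{\mathrm{SPO}}(f) := \frac{1}{n} \sum_{i=1}^n \ell^{\gamma}_{\mathrm{SPO}}(f(x_i), c_i), \quad \text{and} \quad \hat{\mathfrak{R}}^n_{\gamma\mathrm{SPO}}(\mathcal{H}) := \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i \, \ell^{\gamma}_{\mathrm{SPO}}(f(x_i), c_i)\right],$$

where $f \in \mathcal{H}$ on the left side above and $\sigma_i$ are i.i.d. Rademacher random variables for $i = 1, \ldots, n$.

In the following two theorems, we focus only on the case of the $\ell_2$-norm set-up, i.e., the norm on the space of $w$ variables as well as the norm on the space of cost vectors $c$ are both the $\ell_2$-norm. To the best of our knowledge, extending the vector-contraction inequality of Maurer (2016) to an arbitrary norm setting (or even the case of general $\ell_q$-norms) remains an open question that would have interesting applications to our framework. Theorem 4 below presents our margin-based generalization bounds for a fixed $\gamma > 0$. Recall that $\mathcal{C}$ denotes the domain of the true cost vectors $c$, $\rho_2(\mathcal{C}) = \sup_{c \in \mathcal{C}} \|c\|_2$, and $\omega_S(\mathcal{C}) = \sup_{c \in \mathcal{C}} \omega_S(c)$.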

###### Theorem 4

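Suppose that the feasible region $S$ is $\mu$-strongly convex with respect to the $\ell_2$-norm with $\mu > 0$, and let $\gamma > 0$ be fixed. Let $\mathcal{H}$ be a family of functions mapping from $\mathcal{X}$ to $\mathbb{R}^d$. Then, for any fixed sample $(x_1, c_1), \ldots, (x_n, c_n)$ we have that

$$\hat{\mathfrak{R}}^n_{\gamma\mathrm{SPO}}(\mathcal{H}) \leq \frac{5\sqrt{2}\,\rho_2(\mathcal{C})\,\hat{\mathfrak{R}}^n(\mathcal{H})}{\gamma\mu}.$$

Furthermore, for any $\delta > 0$, with probability at least $1 - \delta$ over an i.i.d. sample drawn from the distribution $\mathcal{D}$, each of the following holds for all $f \in \mathcal{H}$:

$$R_{\mathrm{SPO}}(f) \leq \hat{R}^{\gamma}_{\mathrm{SPO}}(f) + \frac{10\sqrt{2}\,\rho_2(\mathcal{C})\,\mathfrak{R}^n(\mathcal{H})}{\gamma\mu} + \omega_S(\mathcal{C})\sqrt{\frac{\log(1/\delta)}{2n}}, \quad \text{and}$$

$$R_{\mathrm{SPO}}(f) \leq \hat{R}^{\gamma}_{\mathrm{SPO}}(f) + \frac{10\sqrt{2}\,\rho_2(\mathcal{C})\,\hat{\mathfrak{R}}^n(\mathcal{H})}{\gamma\mu} + 3\omega_S(\mathcal{C})\sqrt{\frac{\log(2/\delta)}{2n}}.$$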

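Proof. The bound on $\hat{\mathfrak{R}}^n_{\gamma\mathrm{SPO}}(\mathcal{H})$ follows simply by combining Theorem 3, particularly (6), with equation (1) of Maurer (2016). The subsequent generalization bounds then simply follow since $\ell_{\mathrm{SPO}}(\hat{c}, c) \leq \ell^{\gamma}_{\mathrm{SPO}}(\hat{c}, c)$ for all $\hat{c}$ and $c$, and by applying the version of Theorem 1 for the $\gamma$-margin SPO loss.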

It is often the case that the structure of the hypothesis class $\mathcal{H}$ naturally leads to a bound on $\mathfrak{R}^n(\mathcal{H})$ that can have mild, even logarithmic, dependence on dimensions $p$ and $d$. For example, let us consider the general setting of a constrained linear function class, namely $\mathcal{H}_{\mathcal{B}} = \{x \mapsto Bx : B \in \mathcal{B}\}$, where $\mathcal{B} \subseteq \mathbb{R}^{d \times p}$. In the supplementary materials, we derive a result that extends Theorem 3 of Kakade et al. (2009) to multivariate Rademacher complexity and provides a convenient way to bound $\mathfrak{R}^n(\mathcal{H}_{\mathcal{B}})$ in the case when $\mathcal{B}$ corresponds to the level set of a strongly convex function. When (where $\|B\|_F$ denotes the Frobenius norm of $B$) this result implies that , and when (where $\|B\|_1$ denotes the $\ell_1$-norm of the vectorized matrix $B$) this result implies that . Note the absence of any explicit dependence on in the first bound and only logarithmic dependence on in the second. We discuss the details of these and additional examples, including the “group-lasso” norm, in the supplementary materials.
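For a Frobenius-norm constraint the supremum in (7) has a closed form, which makes the multivariate empirical Rademacher complexity easy to estimate. Assuming the class $\{x \mapsto Bx : \|B\|_F \leq \beta\}$, where the radius $\beta$ is a hypothetical parameter of this sketch, the Cauchy-Schwarz inequality gives $\sup_{\|B\|_F \leq \beta} \frac{1}{n}\sum_i \sigma_i^T B x_i = \frac{\beta}{n}\left\|\sum_i \sigma_i x_i^T\right\|_F$. The code averages this quantity over random sign draws and compares it with the bound $\beta\sqrt{d\sum_i \|x_i\|_2^2}/n$ obtained from Jensen's inequality; the function name and the synthetic features are illustrative assumptions.

```python
import numpy as np

def frobenius_class_rademacher(X, d, beta, num_draws=2000, seed=0):
    """Monte Carlo estimate of the multivariate empirical Rademacher complexity
    of {x -> Bx : ||B||_F <= beta} on the sample rows of X (shape n x p)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=(n, d))
        # sigma.T @ X equals sum_i sigma_i x_i^T, a d x p matrix;
        # np.linalg.norm of a 2-D array is the Frobenius norm by default.
        total += np.linalg.norm(sigma.T @ X)
    return beta * total / (num_draws * n)

rng = np.random.default_rng(1)
n, p, d, beta = 50, 4, 3, 1.0
X = rng.normal(size=(n, p))  # synthetic features, for illustration only
estimate = frobenius_class_rademacher(X, d=d, beta=beta)
jensen_bound = beta * np.sqrt(d * np.sum(X ** 2)) / n
print(estimate, jensen_bound)
```

The Jensen bound depends on the features only through their norms, so the feature dimension $p$ does not enter it explicitly.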

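Theorem 4 may also be extended to bounds that hold uniformly over all values of $\gamma \in (0, \bar{\gamma}]$, where $\bar{\gamma} > 0$ is a fixed parameter. This extension is presented below in Theorem 5.

###### Theorem 5

Suppose that the feasible region $S$ is $\mu$-strongly convex with respect to the $\ell_2$-norm with $\mu > 0$, and let $\bar{\gamma} > 0$ be fixed. Let $\mathcal{H}$ be a family of functions mapping from $\mathcal{X}$ to $\mathbb{R}^d$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over an i.i.d. sample drawn from the distribution $\mathcal{D}$, each of the following holds for all $f \in \mathcal{H}$ and for all $\gamma \in (0, \bar{\gamma}]$:

$$R_{\mathrm{SPO}}(f) \leq \hat{R}^{\gamma}_{\mathrm{SPO}}(f) + \frac{20\sqrt{2}\,\rho_2(\mathcal{C})\,\mathfrak{R}^n(\mathcal{H})}{\gamma\mu} + \omega_S(\mathcal{C})\left(\sqrt{\frac{\log(\log_2(2\bar{\gamma}/\gamma))}{n}} + \sqrt{\frac{\log(2/\delta)}{2n}}\right), \quad \text{and}$$

$$R_{\mathrm{SPO}}(f) \leq \hat{R}^{\gamma}_{\mathrm{SPO}}(f) + \frac{20\sqrt{2}\,\rho_2(\mathcal{C})\,\hat{\mathfrak{R}}^n(\mathcal{H})}{\gamma\mu} + \omega_S(\mathcal{C})\left(\sqrt{\frac{\log(\log_2(2\bar{\gamma}/\gamma))}{n}} + 3\sqrt{\frac{\log(4/\delta)}{2n}}\right).$$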

Note that a natural choice for in Theorem 5 is , presuming that one can bound this quantity based on the properties of and . Example 2 below discusses how Theorems 4 and 5 relate to known results in binary classification.

###### Example 2

In Elmachtoub and Grigas (2017), it is shown that the SPO loss corresponds exactly to the 0-1 loss in binary classification when $d = 1$, $S = [-1/2, +1/2]$, and $\mathcal{C} = \{-1, +1\}$. In this case, using our notation, the margin value of a prediction $\hat{c}$ is $c\hat{c}$. It is also easily seen that $\omega_S(\mathcal{C}) = 1$, the $\gamma$-margin SPO loss corresponds exactly to the margin loss (or ramp loss) that interpolates between 1 and 0 when $c\hat{c} \in [0, \gamma]$, and the hard $\gamma$-margin SPO loss corresponds exactly to the margin loss that returns 1 when $c\hat{c} \leq \gamma$ and 0 otherwise. Furthermore, note that the interval $S = [-1/2, +1/2]$ is 2-strongly convex (Garber and Hazan 2015). Thus, except for some worse absolute constants, Theorems 4 and 5 exactly generalize the well-known results on margin guarantees based on Rademacher complexity for binary classification (Koltchinskii et al. 2002).

As in the case of binary classification, the utility of Theorems 4 and 5 is strengthened when the underlying distribution $\mathcal{D}$ has a “favorable margin property.” Namely, the bounds in Theorems 4 and 5 can be much stronger than those of Corollary 2 when the distribution $\mathcal{D}$ and the sample are such that there exists a relatively large value of $\gamma$ such that the empirical $\gamma$-margin SPO loss is small. One is thus motivated to choose the value of $\gamma$ in a data-driven way so that, given a prediction function $\hat{f}$ trained on the data $(x_1, c_1), \ldots, (x_n, c_n)$, the upper bound on $R_{\mathrm{SPO}}(\hat{f})$ is minimized. Since Theorem 5 is a uniform result over $\gamma \in (0, \bar{\gamma}]$, this data-driven procedure for choosing $\gamma$ is indeed valid.

Our work extends learning theory, as developed for binary and multiclass classification, to predict-then-optimize problems in two very significant directions: (i) obtaining worst-case generalization bounds using combinatorial parameters that measure the capacity of function classes, and (ii) exploiting special structure in data by deriving margin-based generalization bounds that scale more gracefully w.r.t. problem dimensions. It also motivates several interesting avenues for future work. Beyond the margin theory, other aspects of the problem that lead to improvements over worst case rates should be studied. In this respect, developing a theory of local Rademacher complexity for predict-then-optimize problems would be a promising approach. It will be good to use minimax constructions to provide matching lower bounds for our upper bounds. Extending the margin theory for strongly convex sets, where the SPO loss is ill-behaved only near 0, to polyhedral sets, where it can be much more ill-behaved, is a challenging but fascinating direction. Developing a theory of surrogate losses, especially convex ones, that are calibrated w.r.t. the non-convex SPO loss will also be extremely important. Finally, the assumption that the optimization objective is linear could be relaxed.

## References

• Bartlett and Mendelson (2002) Bartlett, Peter L, Shahar Mendelson. 2002. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research 3(Nov) 463–482.
• Bertsekas (2015) Bertsekas, Dimitri P. 2015. Convex optimization algorithms. Athena Scientific, Belmont.
• Bertsimas and Kallus (2014) Bertsimas, Dimitris, Nathan Kallus. 2014. From predictive to prescriptive analytics. arXiv preprint arXiv:1402.5481 .
• Daniely et al. (2015) Daniely, Amit, Sivan Sabato, Shai Ben-David, Shai Shalev-Shwartz. 2015. Multiclass learnability and the erm principle. The Journal of Machine Learning Research 16(1) 2377–2404.
• Daniely and Shalev-Shwartz (2014) Daniely, Amit, Shai Shalev-Shwartz. 2014. Optimal learners for multiclass problems. Conference on Learning Theory. 287–316.
• Donti et al. (2017) Donti, Priya, Brandon Amos, J Zico Kolter. 2017. Task-based end-to-end model learning in stochastic optimization. Advances in Neural Information Processing Systems. 5484–5494.
• Elmachtoub and Grigas (2017) Elmachtoub, Adam N, Paul Grigas. 2017. Smart “predict, then optimize”. arXiv preprint arXiv:1710.08005 .
• Garber and Hazan (2015) Garber, Dan, Elad Hazan. 2015. Faster rates for the frank-wolfe method over strongly-convex sets. 32nd International Conference on Machine Learning, ICML 2015.
• Guermeur (2007) Guermeur, Yann. 2007. Vc theory of large margin multi-category classifiers. Journal of Machine Learning Research 8(Nov) 2551–2594.
• Journée et al. (2010) Journée, Michel, Yurii Nesterov, Peter Richtárik, Rodolphe Sepulchre. 2010. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research 11(Feb) 517–553.
• Kakade et al. (2012) Kakade, Sham M., Shai Shalev-Shwartz, Ambuj Tewari. 2012. Regularization techniques for learning with matrices. Journal of Machine Learning Research 13 1865–1890.
• Kakade et al. (2009) Kakade, Sham M, Karthik Sridharan, Ambuj Tewari. 2009. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. Advances in neural information processing systems. 793–800.
• Kao et al. (2009) Kao, Yi-hao, Benjamin V Roy, Xiang Yan. 2009. Directed regression. Advances in Neural Information Processing Systems. 889–897.
• Koltchinskii et al. (2002) Koltchinskii, Vladimir, Dmitry Panchenko. 2002. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics 30(1) 1–50.
• Lei et al. (2015) Lei, Yunwen, Urun Dogan, Alexander Binder, Marius Kloft. 2015. Multi-class svms: From tighter data-dependent generalization bounds to novel algorithms. Advances in Neural Information Processing Systems. 2035–2043.
• Li et al. (2018) Li, Jian, Yong Liu, Rong Yin, Hua Zhang, Lizhong Ding, Weiping Wang. 2018. Multi-class learning: From theory to algorithm. S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett, eds., Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 1586–1595.
• Maurer (2016) Maurer, Andreas. 2016. A vector-contraction inequality for rademacher complexities. International Conference on Algorithmic Learning Theory. Springer, 3–17.
• Mohri et al. (2018) Mohri, Mehryar, Afshin Rostamizadeh, Ameet Talwalkar. 2018. Foundations of machine learning. MIT press.
• Natarajan (1989) Natarajan, Balas K. 1989. On learning sets and functions. Machine Learning 4(1) 67–97.
• Shalev-Shwartz and Ben-David (2014) Shalev-Shwartz, Shai, Shai Ben-David. 2014. Understanding machine learning: From theory to algorithms. Cambridge university press.
• Tibshirani et al. (2015) Tibshirani, Robert, Martin Wainwright, Trevor Hastie. 2015. Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC.

Appendix

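Proof of Theorem 2. The proof is along the lines of Corollary 3.8 in Mohri et al. (2018). Fix a sample of data $(x_1, c_1), \ldots, (x_n, c_n)$, where $x_i \in \mathcal{X}$ and $c_i \in \mathcal{C}$. Let $\mathcal{F}_{|X} := \{(w^*(f(x_1)), \ldots, w^*(f(x_n))) : f \in \mathcal{H}\}$. From the definition of empirical Rademacher complexity, we have that

$$\begin{aligned}
\hat{\mathfrak{R}}^n_{\mathrm{SPO}}(\mathcal{H}) &= \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i \, \ell_{\mathrm{SPO}}(f(x_i), c_i)\right] \\
&= \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i \, c_i^T\left(w^*(f(x_i)) - w^*(c_i)\right)\right] \\
&= \mathbb{E}_{\sigma}\left[\sup_{(w_1, \ldots, w_n) \in \mathcal{F}_{|X}} \frac{1}{n} \sum_{i=1}^n \sigma_i \, c_i^T\left(w_i - w^*(c_i)\right)\right] \\
&\leq \omega_S(\mathcal{C})\sqrt{\frac{2\log|\mathcal{F}_{|X}|}{n}} \\
&\leq \omega_S(\mathcal{C})\sqrt{\frac{2 d_N(w^*(\mathcal{H})) \log(n|\mathfrak{S}|^2)}{n}},
\end{aligned}$$

where the first inequality is directly due to Massart’s lemma and the definition of $\omega_S(\mathcal{C})$, and the second inequality follows from the Natarajan Lemma (see Lemma 29.4 in Shalev-Shwartz and Ben-David (2014)). The bound for the expected version of the Rademacher complexity follows immediately from the bound on the empirical Rademacher complexity. Applying this bound with Theorem 1 concludes the proof.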

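Proof of Corollary 1. We will prove that $w^*(\mathcal{H}_{\mathrm{lin}})$ is an instance of a linear multiclass predictor for a particular class-sensitive feature mapping $\Psi$. Recall that $|\mathfrak{S}|$ is the number of extreme points of $S$, which we index as $w_1, \ldots, w_{|\mathfrak{S}|}$. In our application of linear multiclass predictors, let $\Psi: \mathcal{X} \times \{1, \ldots, |\mathfrak{S}|\} \to \mathbb{R}^{d \times p}$ be a function that takes a feature vector and an extreme point and maps it to a matrix, and let

$$\mathcal{H}_{\Psi} = \left\{x \mapsto \arg\max_{i \in \{1, \ldots, |\mathfrak{S}|\}} \langle B, \Psi(x, i) \rangle : B \in \mathbb{R}^{d \times p}\right\}.$$

We will show that, for $\Psi(x, i) = w_i x^T$, we have that $w^*(\mathcal{H}_{\mathrm{lin}}) \subseteq \mathcal{H}_{\Psi}$. Consider any $f \in \mathcal{H}_{\mathrm{lin}}$ and the associated matrix $B_f$. Then

$$\begin{aligned}
w^*(B_f x) &\in \arg\min_{w \in S} \; (B_f x)^T w \\
&= \arg\max_{i \in \{1, \ldots, |\mathfrak{S}|\}} \; -(B_f x)^T w_i \\
&= \arg\max_{i \in \{1, \ldots, |\mathfrak{S}|\}} \; -\mathrm{Tr}\left((B_f x)^T w_i\right) \\
&= \arg\max_{i \in \{1, \ldots, |\mathfrak{S}|\}} \; -\mathrm{Tr}\left(B_f^T w_i x^T\right) \\
&= \arg\max_{i \in \{1, \ldots, |\mathfrak{S}|\}} \; \langle -B_f, w_i x^T \rangle.
\end{aligned}$$

Thus, it is clear that for $\Psi(x, i) = w_i x^T$, choosing the function in $\mathcal{H}_{\Psi}$ corresponding to $-B_f$ yields exactly the function $w^*(f(\cdot))$. Therefore $w^*(\mathcal{H}_{\mathrm{lin}}) \subseteq \mathcal{H}_{\Psi}$. Theorem 7 in Daniely and Shalev-Shwartz (2014) shows that $d_N(\mathcal{H}_{\Psi}) \leq dp$. Since $w^*(\mathcal{H}_{\mathrm{lin}}) \subseteq \mathcal{H}_{\Psi}$, then $d_N(w^*(\mathcal{H}_{\mathrm{lin}})) \leq dp$. Combining this bound on the Natarajan dimension with Theorem 2 concludes the proof.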

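Proof of Corollary 2. Consider the smallest cardinality $\epsilon$-covering of the feasible region $S$ by Euclidean balls of radius $\epsilon$. From Example 27.1 in Shalev-Shwartz and Ben-David (2014), the number of balls needed is at most $\left(2\rho_2(S)\sqrt{d}/\epsilon\right)^d$. Let the set $\tilde{S}$ denote the centers of the balls from the smallest cardinality covering. Then it immediately follows that

$$|\tilde{S}| \leq \left(\frac{2\rho_2(S)\sqrt{d}}{\epsilon}\right)^d. \tag{8}$$

Finally, let the function $\tilde{w}(\cdot)$ be the function that takes a feasible solution in $S$ and maps it to the closest point in $\tilde{S}$ in the $\ell_2$-norm.

We can bound the empirical Rademacher complexity by

$$\begin{aligned}
\hat{\mathfrak{R}}^n_{\mathrm{SPO}}(\mathcal{H}) &= \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i \, \ell_{\mathrm{SPO}}(f(x_i), c_i)\right] \\
&= \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i \, c_i^T\left(w^*(f(x_i)) - w^*(c_i)\right)\right] \\
&= \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i \, c_i^T\left[w^*(f(x_i)) - \tilde{w}(w^*(f(x_i))) + \tilde{w}(w^*(f(x_i))) - w^*(c_i)\right]\right] \\
&\leq \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i \, c_i^T\left[w^*(f(x_i)) - \tilde{w}(w^*(f(x_i)))\right]\right] + \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i \, c_i^T\left[\tilde{w}(w^*(f(x_i))) - w^*(c_i)\right]\right] \\
&\leq 2\epsilon\rho_2(\mathcal{C}) + \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i \, c_i^T\left[\tilde{w}(w^*(f(x_i))) - w^*(c_i)\right]\right] \\
&\leq 2\epsilon\rho_2(\mathcal{C}) + \left(\omega_S(\mathcal{C}) + 2\epsilon\rho_2(\mathcal{C})\right)\sqrt{\frac{2 d_N(\tilde{w}(w^*(\mathcal{H}))) \log(n|\tilde{S}|^2)}{n}}.
\end{aligned} \tag{9}$$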

The first inequality follows from the triangle inequality. The second inequality follows from the fact that and are at most away by the definition of . In the worst case, the difference is in the direction of , and