Smart “Predict, then Optimize”
Adam N. Elmachtoub
Department of Industrial Engineering and Operations Research, Columbia University and Data Science Institute, New York, NY 10037, firstname.lastname@example.org
Department of Industrial Engineering and Operations Research, University of California, Berkeley, CA 94720, email@example.com
Many real-world analytics problems involve two significant challenges: prediction and optimization. Due to the typically complex nature of each challenge, the standard paradigm is to predict, then optimize. By and large, machine learning tools are intended to minimize prediction error and do not account for how the predictions will be used in a downstream optimization problem. In contrast, we propose a new and very general framework, called Smart “Predict, then Optimize” (SPO), which directly leverages the optimization problem structure, i.e., its objective and constraints, for designing successful analytics tools. A key component of our framework is the SPO loss function, which measures the quality of a prediction by comparing the objective values of the solutions generated using the predicted and observed parameters, respectively. Training a model with respect to the SPO loss is computationally challenging, and therefore we also develop a surrogate loss function, called the SPO+ loss, which upper bounds the SPO loss, has desirable convexity properties, and is statistically consistent under mild conditions. We also propose a stochastic gradient descent algorithm which allows for situations in which the number of training samples is large, model regularization is desired, and/or the optimization problem of interest is nonlinear or integer. Finally, we perform computational experiments to empirically verify the success of our SPO framework in comparison to the standard predict-then-optimize approach.
Key words: prescriptive analytics; data-driven optimization; machine learning; linear regression
In real-world analytics applications, machine learning (ML) is used to leverage historical, contextual, and often high-dimensional data in order to predict key uncertain input parameters. At the same time, even if one is equipped with an excellent prediction model, the optimization task may be difficult due to problem-specific constraints and objectives. In the context of vehicle routing, for example, a practical approach to deal with both of these challenges is to apply the predict-then-optimize paradigm: first, a previously trained machine learning model provides predictions for the travel time on all edges of a road network, and then an optimization solver provides near-optimal routes in a reasonable time frame. We emphasize that most solution systems for real-world operations problems involve some component of both prediction and optimization (see Angalakudati et al. (2014), Chan et al. (2012), Deo et al. (2015), Gallien et al. (2015), Cohen et al. (2017), Besbes et al. (2015), Mehrotra et al. (2011), Chan et al. (2013) for recent examples). Indeed, advances in statistics and machine learning offer very powerful, general purpose tools for prediction while advances in mathematical optimization offer very powerful modeling paradigms and solvers. However, except for a few limited options, machine learning tools for parameter prediction do not effectively account for the structure of the nominal optimization problem, i.e., its constraints and objective. In contrast, we provide a new framework for designing machine learning tools that better predict input parameters of optimization problems by leveraging the nominal problem structure.
Our approach, which we call Smart “Predict, then Optimize” (SPO), fundamentally maintains the paradigm of sequentially predicting and then optimizing. The key difference is that our approach is not fully decoupled because the prediction models that we train are specifically designed to lead to good quality solutions when solving the nominal optimization problem. In particular, we design a new framework for parameter prediction that explicitly incorporates the objective and constraints of the nominal optimization problem at hand. The quality of a prediction is not measured based on prediction error such as least squares loss, i.e., the squared distance between the predicted and true values. In the SPO framework, the quality of a prediction is measured using the objective cost associated with the underlying optimization problem. That is, when training an ML model using historical feature data $(x_1, \ldots, x_n)$ and parameter data $(c_1, \ldots, c_n)$, we evaluate the quality of a prediction $\hat{c}_i$ by measuring the cost of the decision (induced by $\hat{c}_i$) with respect to the true parameter value $c_i$. In contrast, the least squares approach measures error by evaluating $\|\hat{c}_i - c_i\|_2^2$, completely ignoring the decisions induced by the predictions.
In this paper, we focus on predicting unknown parameters that appear linearly in the objective, e.g., the cost vector of a linear, convex, or integer optimization problem. We provide a new loss function for training prediction models that is at the core of our SPO framework. Since this loss function is difficult to work with, we also provide a surrogate loss function alongside an optimization procedure that is computationally fast and performs well numerically compared to standard predict-then-optimize approaches. We also motivate our results by proving convexity and consistency properties of our loss functions. We believe our SPO framework provides a clear path for designing machine learning tools whose performance is measured by the quality of the decisions they induce.
Next, we explicitly state our contributions in detail.
First, we formally define a new loss function, which we call the SPO loss, to measure the error in predicting the cost vector of a nominal optimization problem. The loss corresponds to the loss in the objective value due to making an incorrect decision with respect to the true cost vector. Specifically, the SPO loss is the true cost of the solution generated minus the true cost of an optimal solution that has full knowledge of the true cost vector. Unfortunately, the SPO loss function can be nonconvex and discontinuous in the predictions, implying that training ML models under the SPO loss may be challenging.
Given the intractability of the SPO loss function, we provide a surrogate loss function which we call the SPO+ loss. This loss function is derived using a sequence of upper bounds motivated by duality, a data scaling approximation, and a first-order approximation of the objective cost of decisions made by SPO. The SPO+ loss function is convex in the predictions. Moreover, when training a linear model to predict the objective coefficients of a linear program, only a linear optimization problem needs to be solved to minimize the SPO+ loss.
We provide a general algorithm for minimizing the SPO+ loss which is based on stochastic gradient descent. This method easily allows the number of training samples to be large, and also allows for regularization on the machine learning model. Moreover, this method allows one to handle nominal optimization problems where the constraints are convex or the decisions must be integral, as a blackbox solver for the nominal problem may be directly integrated in the algorithm.
We prove a key consistency result of the SPO+ loss function, which further motivates its use. Namely, under full distributional knowledge of the parameters, minimizing the SPO+ loss function is in fact equivalent to minimizing the SPO loss if two mild conditions hold: the distribution must be continuous and symmetric around its mean, an assumption easily satisfied when the noise corresponds to a Gaussian random variable.
Finally, we validate our framework through numerical experiments on the shortest path problem and assignment problem. In each setting, we test our SPO framework against standard predict-then-optimize approaches, and evaluate the out of sample performance with respect to the SPO loss. The value of our SPO framework increases as the degree of model misspecification increases.
Settings where the input parameters of an optimization problem need to be predicted from contextual (feature) data are numerous, and we now provide more details on potential applications. As a first example, in inventory planning problems such as the economic lot sizing problem or the joint replenishment problem, the demand is the key input into the optimization model. In practical settings, demand is highly nonstationary and can depend on historical and contextual data such as weather, seasonality, and competitor sales. The decisions of when to order inventory are captured by a linear or integer optimization model, depending on the complexity of the problem. Under a common formulation, the demand appears linearly in the objective, which is convenient for the SPO framework. The goal is to design a prediction model that maps feature data to demand predictions, which in turn lead to good inventory plans. Simply minimizing prediction error is a reasonable approach, but it does not leverage any problem structure. Indeed, as a basic example, in all nontrivial instances an order is always placed in the first period; thus a prediction of the demand in the first period is non-informative, which the SPO framework would capture naturally. Moreover, understanding which days have heavy demand is more critical than understanding those with low demand, since the corresponding costs are much higher when demand is higher, which the SPO framework would also capture.
Another set of natural applications arises in the area of vehicle routing, where the cost of each edge of a graph needs to be predicted before solving the nominal optimization problem. The cost of an edge typically corresponds to the expected length of time a vehicle would need to traverse the corresponding edge. For clarity, let us focus on one important example, the shortest path problem. In the shortest path problem, one is given a weighted directed graph, along with an origin node and destination node, and the goal is to find a sequence of edges from the origin to the destination at minimum possible cost. A well-known fact is that the shortest path problem can be formulated as a linear optimization problem, but there are also alternative specialized algorithms such as the famous Dijkstra’s algorithm. The data used to predict the cost of the edges may incorporate the length, speed limit, weather, season, day, and real-time data from mobile applications such as Google Maps and Waze. Simply minimizing prediction error may be neither sufficient nor appropriate, as understanding the cost on some edges may be more critical than others. The SPO framework would ensure that the predicted weights lead to shortest paths, and would naturally emphasize the estimation of edges that are critical to this decision.
Finally, another set of examples arises in portfolio optimization problems. Here the mean values of each potential investment need to be predicted from data, and can depend on many features which typically include historical returns, news, economic factors, social media, and others. The goal is to find a portfolio with the highest return subject to a constraint on the total risk, or variance, of the portfolio. The standard deviations of the investments are typically more stable, and predicting them is neither as difficult nor as sensitive. Our SPO framework would result in predictions that lead to high performance investments that satisfy the desired level of risk. A least squares loss approach places higher emphasis on estimating higher valued investments, even if the corresponding risk may not be ideal. In contrast, the SPO framework directly accounts for the risk of each investment when training the prediction model.
We now describe previous works that are most related to our SPO framework and problems of interest, followed by a brief discussion of related areas of research. To the best of our knowledge, Kao et al. (2009) is the only previous work that seeks to train a machine learning model that minimizes loss with respect to a nominal optimization problem. In their framework, the nominal problem is an unconstrained quadratic optimization problem, where the unknown parameters appear in the linear portion of the objective. They show that minimizing a combination of prediction error and SPO loss is optimal when training a linear model (linear regression) under a very specific generative model of the data and a specific set of structural constraints on the linear model. Their work does not extend to settings where the nominal optimization problem has constraints, which our framework does. Their work also does not address how to deal with non-uniqueness of solutions to the nominal problem (since their problem is strongly convex), which must be addressed in our setting to avoid degenerate prediction models.
In Rudin and Vahn (2014), ML models are trained to directly predict the optimal solution of a newsvendor problem from data. Tractability and statistical properties of the method are shown, as well as its effectiveness in practice. However, it is not clear how this approach can be used when there are constraints in the nominal optimization problem, since feasibility issues may arise.
The general approach in Bertsimas and Kallus (2014) considers the problem of accurately estimating an unknown optimization objective using machine learning models, specifically ML models where the predictions can be described as a weighted combination of training samples, e.g., nearest neighbors and decision trees. In their approach, they estimate the objective of an instance by applying the same weights generated by the ML model to the corresponding objective functions of those samples. This approach differs from standard predict-then-optimize only when the objective function is nonlinear in the unknown parameter. Moreover, the training of the ML models does not rely on the structure of the nominal optimization problem. In contrast, the unknown parameters of all the applications mentioned above appear linearly in the objective and our SPO framework directly incorporates the problem structure when training the ML model.
The approach in Tulabandhula and Rudin (2013) relies on minimizing a loss function that combines the prediction error with the operational cost of the model on an unlabeled dataset. However, the operational cost is with respect to the predicted parameters, and not the true parameters. Other approaches for finding near-optimal solutions from data include operational statistics (Liyanage and Shanthikumar (2005), Chu et al. (2008)), sample average approximation (Kleywegt et al. (2002), Schütz et al. (2009), Bertsimas et al. (2014)), and robust optimization (Bertsimas and Thiele (2006), Bertsimas et al. (2013), Wang et al. (2016)). These approaches typically do not have a clear way of using feature data, nor do they directly consider how to train a machine learning model to predict optimization parameters. Another related stream of work is in data-driven inverse optimization, where feasible or optimal solutions to an optimization problem are observed and the objective function has to be learned (Aswani et al. (2015), Keshavarz et al. (2011), Chan et al. (2014), Bertsimas et al. (2015), Esfahani et al. (2015)). In these problems, there is typically a single unknown objective, and no previous samples of the objective are provided.
Finally, we note that our framework is related to the general setting of structured prediction (see, e.g., Osokin et al. (2017), Goh and Jaillet (2016) and the references therein). Motivated by problems in computer vision and natural language processing, structured prediction is a version of multiclass classification that is concerned with predicting structured objects, such as sequences or graphs, from feature data. Our proposal presents a new paradigm for structured prediction where the structured objects are decision variables associated with a cost objective.
We now describe the “Predict, then Optimize” framework which is central to many applications of optimization in practice. Specifically, we assume that there is a nominal optimization problem of interest with a linear objective where the decision variables and constraints are well-defined and known with certainty. However, for any instance of the problem, the objective function cost vector is not known, but can be predicted from known feature data associated with the instance. Specifically, a prediction (machine learning) model is used that maps the feature vector to a cost vector prediction. The prediction model itself is chosen from a hypothesis class in order to minimize a given notion of loss, which is a function that quantifies the error in making incorrect predictions. Since the distribution of the data is not known, historical data consisting of pairs of feature vectors and cost vectors are used to train the prediction model, i.e., the prediction model chosen is the one that minimizes the empirical loss on the training data. Our primary interests in this paper concern defining suitable loss functions for the “Predict, then Optimize” framework, examining their properties, as well as developing algorithms for training prediction models using these loss functions.
We now formally list the key ingredients of our framework:
Nominal optimization problem, which is of the form
$$P(c): \quad z^*(c) := \min_{w} \; c^T w \quad \text{s.t.} \quad w \in S, \tag{1}$$
where $w \in \mathbb{R}^d$ are the decision variables, $c \in \mathbb{R}^d$ is the problem data describing the linear objective function, and $S \subseteq \mathbb{R}^d$ is a nonempty, compact (i.e., closed and bounded), and convex set representing the feasible region. Since $S$ is assumed to be fixed and known with certainty, every problem instance can be described by the corresponding cost vector, hence the dependence on $c$ in (1). When solving a particular instance where $c$ is unknown, a prediction for $c$ is used instead. We assume access to a practically efficient optimization oracle, $w^*(c)$, that returns a solution of $P(c)$ for any input cost vector. For instance, if (1) corresponds to a linear, conic, or even a particular combinatorial or mixed-integer optimization problem (in which case $S$ can be implicitly described as a convex set), then a commercial optimization solver or a specialized algorithm suffices for $w^*(c)$.
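Training data of the form $(x_1, c_1), (x_2, c_2), \ldots, (x_n, c_n)$, where $x_i \in \mathcal{X}$ is a feature vector representing auxiliary information associated with $c_i$.
A hypothesis class $\mathcal{H}$ of cost vector prediction models $f : \mathcal{X} \to \mathbb{R}^d$, where $\hat{c} := f(x)$ is interpreted as the predicted cost vector associated with feature vector $x$.
A loss function $\ell(\cdot, \cdot) : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$, whereby $\ell(\hat{c}, c)$ quantifies the error in making prediction $\hat{c}$ when the realized (true) cost vector is actually $c$.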
Given the loss function $\ell(\cdot, \cdot)$ and the training data $(x_1, c_1), \ldots, (x_n, c_n)$, the empirical risk minimization (ERM) principle states that we should determine a prediction model $f^* \in \mathcal{H}$ by solving the optimization problem
$$\min_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), c_i). \tag{2}$$
Provided with the prediction model $f^*$, a natural decision rule is induced for the nominal optimization problem when presented with a feature vector $x$, namely the optimal solution with respect to the predicted cost vector, $w^*(f^*(x))$, is chosen. Example 1 illustrates our framework in the context of a network optimization problem.
Example 1 (Network Flow)
An example of the nominal optimization problem is a minimum cost network flow problem, where the decisions are how much flow to send on each edge of the network. We assume that the underlying graph is provided to us, e.g., the road network of a city. The feasible region $S$ represents flow conservation, capacity, and required flow constraints on the underlying network. The cost vector $c$ is not known with certainty, but can be estimated from data $x$ which can include features of time, day, edge lengths, most recent observed cost, and so on. An example of the hypothesis class is the set of linear prediction models given by $\mathcal{H} = \{f : f(x) = Bx \text{ for some } B \in \mathbb{R}^{d \times p}\}$. The linear model can be trained, for example, according to the mean squared error loss function, i.e., $\ell(\hat{c}, c) = \frac{1}{2}\|\hat{c} - c\|_2^2$. The corresponding empirical risk minimization problem to find the best linear model $B^*$ then becomes
$$\min_{B \in \mathbb{R}^{d \times p}} \; \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2}\|B x_i - c_i\|_2^2.$$
Note that one can also include regularization terms to prevent the prediction model from overfitting, which is equivalent to restricting the hypothesis class even further. The decision rule to find the optimal network flow given a feature vector $x$ is $w^*(B^* x)$.
In standard applications of the “Predict, then Optimize” framework, as in Example 1, the loss function that is used is completely independent of the nominal optimization problem. In other words, the underlying structure of the optimization problem does not factor into the loss function and therefore the training of the prediction model. For example, when $\ell(\hat{c}, c) = \frac{1}{2}\|\hat{c} - c\|_2^2$, this corresponds to the mean squared error. Moreover, if $\mathcal{H}$ is a set of linear predictors, then (2) reduces to a standard least squares linear regression problem. In contrast, our focus in what follows is on the construction of loss functions that intelligently measure errors in predicting cost vectors and leverage problem structure when training the prediction model.
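The standard pipeline of Example 1 can be sketched in a few lines. This is a minimal illustration under simplifying assumptions that are not from the paper: a single scalar feature with no intercept (so least squares has a closed form), and a feasible region whose extreme points are two hypothetical routes, so the oracle is a brute-force minimum over vertices. The names `fit_least_squares`, `oracle`, and `ROUTES` are illustrative only.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def fit_least_squares(xs, cs):
    """Fit c ~ B * x for a scalar feature x by minimizing squared error.

    Returns B as a list of d slopes (one per cost component).
    """
    sxx = sum(x * x for x in xs)
    d = len(cs[0])
    return [sum(x * c[j] for x, c in zip(xs, cs)) / sxx for j in range(d)]


def oracle(cost, vertices):
    """Brute-force oracle w*(cost): a vertex of S with minimum linear cost."""
    return min(vertices, key=lambda w: dot(cost, w))


# Hypothetical instance: choose one of two routes (extreme points of S).
ROUTES = [(1.0, 0.0), (0.0, 1.0)]
train_x = [1.0, 2.0, 3.0]
train_c = [(2.0, 3.0), (4.0, 5.0), (6.0, 9.0)]

# Step 1 (predict): train on prediction error only.
B = fit_least_squares(train_x, train_c)
# Step 2 (optimize): plug the predicted cost vector into the oracle.
new_x = 2.0
c_hat = [b * new_x for b in B]
decision = oracle(c_hat, ROUTES)
print(B, c_hat, decision)
```

Note that the training step never looks at `ROUTES`: the loss is independent of the feasible region, which is exactly the decoupling the SPO framework is designed to remove.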
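Let $p$ be the dimension of a feature vector, $d$ be the dimension of a decision vector, and $n$ be the number of training samples. Let $W^*(c) := \arg\min_{w \in S}\{c^T w\}$ denote the set of optimal solutions of $P(c)$, and let $w^*(\cdot) : \mathbb{R}^d \to S$ denote a particular oracle for solving $P(\cdot)$. That is, $w^*(\cdot)$ is a fixed deterministic mapping such that $w^*(c) \in W^*(c)$. Note that nothing special is assumed about the mapping $w^*(\cdot)$, hence $w^*(c)$ may be regarded as an arbitrary element of $W^*(c)$. Let $\xi_S(\cdot) : \mathbb{R}^d \to \mathbb{R}$ denote the support function of $S$, which is defined by $\xi_S(c) := \max_{w \in S}\{c^T w\}$. Since $S$ is compact, $\xi_S(\cdot)$ is finite everywhere, the maximum in the definition is attained for every $c \in \mathbb{R}^d$, and note that $\xi_S(c) = -z^*(-c)$ for all $c \in \mathbb{R}^d$. Recall also that $\xi_S(\cdot)$ is a convex function. For a given convex function $h(\cdot) : \mathbb{R}^d \to \mathbb{R}$, recall that $g$ is a subgradient of $h(\cdot)$ at $c$ if $h(c') \geq h(c) + g^T(c' - c)$ for all $c' \in \mathbb{R}^d$, and the set of subgradients of $h(\cdot)$ at $c$ is denoted by $\partial h(c)$. For two matrices $B_1, B_2 \in \mathbb{R}^{d \times p}$, the trace inner product is denoted by $B_1 \bullet B_2 := \mathrm{trace}(B_1^T B_2)$.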
Herein, we introduce several loss functions that fall into the predict-then-optimize paradigm, but that are also smart in that they take the nominal optimization problem into account when measuring errors in predictions. We refer to these loss functions as Smart “Predict, then Optimize” (SPO) loss functions. As a starting point, let us consider a true SPO loss function that exactly measures the excess cost incurred when making a suboptimal decision due to an imprecise cost vector prediction. Following the PO paradigm, given a cost vector prediction $\hat{c}$, a decision $w^*(\hat{c})$ is implemented based on solving $P(\hat{c})$. After the decision $w^*(\hat{c})$ is implemented, the cost incurred is $c^T w^*(\hat{c})$, measured with respect to the cost vector $c$ that is actually realized. The excess cost due to the fact that $w^*(\hat{c})$ may be suboptimal with respect to $c$ is then $c^T w^*(\hat{c}) - z^*(c)$. Definition 1 formalizes this true SPO loss associated with making the prediction $\hat{c}$ when the actual cost vector is $c$, given a particular oracle $w^*(\cdot)$ for $P(\cdot)$.
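Definition 1. Given a cost vector prediction $\hat{c}$ and a realized cost vector $c$, the true SPO loss $\ell_{\mathrm{SPO}}^{w^*}(\hat{c}, c)$ with respect to the optimization oracle $w^*(\cdot)$ is defined as:
$$\ell_{\mathrm{SPO}}^{w^*}(\hat{c}, c) := c^T w^*(\hat{c}) - z^*(c).$$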
Note that there is an unfortunate deficiency in Definition 1, which is the dependence on the particular oracle $w^*(\cdot)$ used to solve (1). Practically speaking, this deficiency is not a major issue since we should usually expect $w^*(\hat{c})$ to be a unique optimal solution, i.e., we should expect $W^*(\hat{c})$ to be a singleton. (Note that if any solution from $W^*(\hat{c})$ may be used by the loss function, then the loss function essentially becomes $\min_{w \in W^*(\hat{c})}\{c^T w\} - z^*(c)$. Thus, a prediction model would then be incentivized to always predict $\hat{c} = 0$ since $W^*(0) = S$, and therefore the loss would be trivially 0.)
In any case, if one wishes to address the dependence on the particular oracle $w^*(\cdot)$ in Definition 1, then it is most natural to “break ties” by presuming that the implemented decision has worst-case behavior with respect to $c$. Definition 2 is an alternative SPO loss function that does not depend on the particular choice of the optimization oracle $w^*(\cdot)$.
Definition 2. Given a cost vector prediction $\hat{c}$ and a realized cost vector $c$, the (unambiguous) true SPO loss $\ell_{\mathrm{SPO}}(\hat{c}, c)$ is defined as:
$$\ell_{\mathrm{SPO}}(\hat{c}, c) := \max_{w \in W^*(\hat{c})}\{c^T w\} - z^*(c).$$
Note that Definition 2 presents a version of the true SPO loss that upper bounds the version from Definition 1, i.e., it holds that $\ell_{\mathrm{SPO}}^{w^*}(\hat{c}, c) \leq \ell_{\mathrm{SPO}}(\hat{c}, c)$ for all $\hat{c}, c \in \mathbb{R}^d$. As mentioned previously, the distinction between Definitions 1 and 2 is only relevant in degenerate cases. In the results and discussion herein, we work with the unambiguous true SPO loss given by Definition 2. Related results may often be inferred for the version of the true SPO loss given by Definition 1 by recalling that Definition 2 upper bounds Definition 1 and that the two loss functions are equal except in degenerate cases where $P(\hat{c})$ has multiple optimal solutions.
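The two definitions can be compared on a small instance. The sketch below assumes, for illustration only, that $S$ is a polytope given by an explicit list of vertices, so that both the oracle and the worst-case tie-breaking of Definition 2 reduce to enumeration; the function names and the three-route instance are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def optimal_vertices(cost, vertices, tol=1e-12):
    """Vertices of S that minimize the linear cost (the vertices of W*(cost))."""
    best = min(dot(cost, w) for w in vertices)
    return [w for w in vertices if dot(cost, w) <= best + tol]


def spo_loss_oracle(c_hat, c, vertices):
    """Definition 1: depends on which minimizer the oracle happens to return.

    Here the oracle is `min`, which returns the first minimizer in list order.
    """
    w_hat = min(vertices, key=lambda w: dot(c_hat, w))
    return dot(c, w_hat) - min(dot(c, w) for w in vertices)


def spo_loss(c_hat, c, vertices):
    """Definition 2: ties in W*(c_hat) are broken in the worst case for c."""
    worst = max(dot(c, w) for w in optimal_vertices(c_hat, vertices))
    return worst - min(dot(c, w) for w in vertices)


# Hypothetical instance: pick exactly one of three routes.
V = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
c = (1.0, 2.0, 3.0)

print(spo_loss((2.0, 1.0, 3.0), c, V))         # wrong route chosen
print(spo_loss_oracle((0.0, 0.0, 0.0), c, V))  # all routes tie, lucky oracle
print(spo_loss((0.0, 0.0, 0.0), c, V))         # all routes tie, worst case
```

On this instance the degenerate prediction $\hat{c} = 0$ receives zero loss under the oracle-dependent definition but a strictly positive loss under the unambiguous one, which is the behavior that motivates Definition 2.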
Notice that $\ell_{\mathrm{SPO}}(\hat{c}, c)$ is impervious to the scaling of $\hat{c}$, in other words it holds that $\ell_{\mathrm{SPO}}(\alpha \hat{c}, c) = \ell_{\mathrm{SPO}}(\hat{c}, c)$ for all $\alpha > 0$. This property is intuitive since the true loss associated with prediction $\hat{c}$ should only depend on the optimal solution of $P(\hat{c})$, which does not depend on the scaling of $\hat{c}$. Moreover, this property is also shared by the 0-1 loss function in binary classification problems. Namely, labels can take values of $\pm 1$ and the prediction model predicts values in $\mathbb{R}$. If the predicted value has the same sign as the true value, the loss is 0, and otherwise the loss is 1. Therefore, the 0-1 loss function is also independent of the scale on the predictions. This similarity is not a coincidence; in fact, Proposition 1 illustrates that binary classification is a special case of the SPO framework.
Proposition 1. The 0-1 loss function associated with binary classification is a special case of the SPO loss function.
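Proof. Let $d = 1$ and the feasible region be the interval $S = [-1/2, +1/2]$. Here the “cost vector” $c$ corresponds to a binary class label, i.e., $c$ can take one of two possible values, $-1$ or $+1$. (However, the predicted cost vector $\hat{c}$ is allowed to be any real number.) Notice that, for both possible values of $c$, it holds that $z^*(c) = -1/2$. There are three cases to consider for the prediction $\hat{c}$: (i) if $\hat{c} > 0$ then $W^*(\hat{c}) = \{-1/2\}$ and $\ell_{\mathrm{SPO}}(\hat{c}, c) = (1 - c)/2$, (ii) if $\hat{c} < 0$ then $W^*(\hat{c}) = \{+1/2\}$ and $\ell_{\mathrm{SPO}}(\hat{c}, c) = (1 + c)/2$, and (iii) if $\hat{c} = 0$ then $W^*(\hat{c}) = S = [-1/2, +1/2]$ and $\ell_{\mathrm{SPO}}(\hat{c}, c) = (1 + |c|)/2 = 1$. Thus, we have $\ell_{\mathrm{SPO}}(\hat{c}, c) = 0$ when $\hat{c}$ and $c$ share the same sign, and $\ell_{\mathrm{SPO}}(\hat{c}, c) = 1$ otherwise. Therefore, $\ell_{\mathrm{SPO}}$ is exactly the 0-1 loss function.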
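The three cases in the proof can be checked numerically. The sketch below assumes the one-dimensional setup $S = [-1/2, +1/2]$ with labels in $\{-1, +1\}$ and represents $S$ by its two endpoints; a zero prediction is counted as an error, matching case (iii). The function names are illustrative.

```python
ENDPOINTS = (-0.5, 0.5)  # extreme points of S = [-1/2, +1/2]


def spo_loss_interval(c_hat, c):
    """Unambiguous SPO loss on S = [-1/2, +1/2] with worst-case tie-breaking."""
    best = min(c_hat * w for w in ENDPOINTS)
    optimal = [w for w in ENDPOINTS if c_hat * w == best]
    z_star = min(c * w for w in ENDPOINTS)
    return max(c * w for w in optimal) - z_star


def zero_one_loss(c_hat, c):
    """0-1 loss: no error only when prediction and label share the same sign."""
    return 0.0 if c_hat * c > 0 else 1.0


for c in (-1.0, 1.0):
    for c_hat in (-3.0, -0.2, 0.0, 0.7, 5.0):
        print(c, c_hat, spo_loss_interval(c_hat, c), zero_one_loss(c_hat, c))
```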
Now, given the training data, we are interested in determining a cost vector prediction model with minimal true SPO loss. Therefore, given the previous definition of the true SPO loss $\ell_{\mathrm{SPO}}(\cdot, \cdot)$, the prediction model would be determined by following the empirical risk minimization principle as in (2), which leads to the following optimization problem:
$$\min_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} \ell_{\mathrm{SPO}}(f(x_i), c_i). \tag{3}$$
Unfortunately, the above optimization problem is difficult to solve, both in theory and in practice. Indeed, for a fixed $c$, $\ell_{\mathrm{SPO}}(\cdot, c)$ may not even be continuous in $\hat{c}$ since $w^*(\hat{c})$ (and the entire set $W^*(\hat{c})$) may not be continuous in $\hat{c}$. Moreover, since Proposition 1 demonstrates that our framework captures binary classification, solving (3) is at least as difficult as optimizing the 0-1 loss function, which may be NP-hard in certain cases (Ben-David et al. 2003). We are therefore motivated to develop approaches for producing “reasonable” approximate solutions to (3) that (i) outperform standard PO approaches, and (ii) are applicable to large-scale problems where the number of training samples $n$ and/or the dimension of the hypothesis class $\mathcal{H}$ may be very large.
In this section, we focus on deriving a tractable surrogate loss function that reasonably approximates $\ell_{\mathrm{SPO}}(\cdot, \cdot)$. In fact, our surrogate function $\ell_{\mathrm{SPO+}}(\cdot, \cdot)$, which we call the SPO+ loss function, can be derived from a series of three upper bounds on $\ell_{\mathrm{SPO}}(\cdot, \cdot)$. We shall first derive the SPO+ loss function, and then motivate why each upper bound is a reasonable approximation. Ideally, when finding the prediction model that minimizes the empirical risk using the SPO+ loss, this prediction model will also approximately minimize (3), the empirical risk using the SPO loss.
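To begin the derivation of the SPO+ loss, we first observe that for any $\alpha \in \mathbb{R}$, the SPO loss can be written as
$$\ell_{\mathrm{SPO}}(\hat{c}, c) = \max_{w \in W^*(\hat{c})}\{c^T w - \alpha \hat{c}^T w\} + \alpha z^*(\hat{c}) - z^*(c), \tag{4}$$
since $z^*(\hat{c}) = \hat{c}^T w$ for all $w \in W^*(\hat{c})$. Clearly, replacing the constraint $w \in W^*(\hat{c})$ with $w \in S$ in (4) results in an upper bound. Since this is true for all values of $\alpha$, the first upper bound is
$$\ell_{\mathrm{SPO}}(\hat{c}, c) \leq \inf_{\alpha}\left\{\max_{w \in S}\{c^T w - \alpha \hat{c}^T w\} + \alpha z^*(\hat{c})\right\} - z^*(c). \tag{5}$$
Now simply setting $\alpha = 2$ in the right-hand side of (5) yields an even greater upper bound:
$$\ell_{\mathrm{SPO}}(\hat{c}, c) \leq \max_{w \in S}\{c^T w - 2\hat{c}^T w\} + 2 z^*(\hat{c}) - z^*(c). \tag{6}$$
Finally, since $w^*(c)$ is a feasible solution of $P(\hat{c})$, so that $z^*(\hat{c}) \leq \hat{c}^T w^*(c)$, an even greater upper bound is
$$\ell_{\mathrm{SPO}}(\hat{c}, c) \leq \max_{w \in S}\{c^T w - 2\hat{c}^T w\} + 2\hat{c}^T w^*(c) - z^*(c). \tag{7}$$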
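Definition 3. Given a cost vector prediction $\hat{c}$ and a realized cost vector $c$, the surrogate SPO+ loss is defined as
$$\ell_{\mathrm{SPO+}}(\hat{c}, c) := \max_{w \in S}\{c^T w - 2\hat{c}^T w\} + 2\hat{c}^T w^*(c) - z^*(c) = \xi_S(c - 2\hat{c}) + 2\hat{c}^T w^*(c) - z^*(c).$$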
Next, we state the following proposition which formally shows that, in addition to the SPO+ loss being an upper bound on the SPO loss, the SPO+ loss function is convex in $\hat{c}$. (This follows immediately from the convexity of the support function $\xi_S(\cdot)$.) Note that while the SPO+ loss is convex in $\hat{c}$, in general it is not differentiable since $\xi_S(\cdot)$ is not generally differentiable. However, we can easily compute subgradients of the SPO+ loss by utilizing the oracle $w^*(\cdot)$, namely $2(w^*(c) - w^*(2\hat{c} - c)) \in \partial \ell_{\mathrm{SPO+}}(\hat{c}, c)$, since $w^*(2\hat{c} - c) \in \partial \xi_S(c - 2\hat{c})$ and the chain rule then implies that $-2 w^*(2\hat{c} - c)$ is a subgradient of $\hat{c} \mapsto \xi_S(c - 2\hat{c})$. We exploit this fact in developing computational approaches in Section 4.
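Proposition 2. Given a fixed realized cost vector $c$, it holds that $\ell_{\mathrm{SPO+}}(\hat{c}, c)$ is a convex function of the cost vector prediction $\hat{c}$. Moreover, $\ell_{\mathrm{SPO+}}(\hat{c}, c)$ upper bounds $\ell_{\mathrm{SPO}}(\hat{c}, c)$, i.e.,
$$\ell_{\mathrm{SPO}}(\hat{c}, c) \leq \ell_{\mathrm{SPO+}}(\hat{c}, c) \quad \text{for all } \hat{c} \in \mathbb{R}^d.$$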
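Both claims can be spot-checked numerically. The sketch below assumes the SPO+ loss takes the form $\max_{w \in S}\{c^T w - 2\hat{c}^T w\} + 2\hat{c}^T w^*(c) - z^*(c)$ with subgradient $2(w^*(c) - w^*(2\hat{c} - c))$, and that $S$ is the unit square represented by its four vertices (a hypothetical instance). It tests the upper bound and the subgradient inequality at random points; this is a sanity check, not a proof.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def oracle(cost, vertices):
    """w*(cost): a vertex of S minimizing the linear cost."""
    return min(vertices, key=lambda w: dot(cost, w))


def spo_loss(c_hat, c, vertices, tol=1e-12):
    """Unambiguous SPO loss (worst case over the optimal vertices for c_hat)."""
    best = min(dot(c_hat, w) for w in vertices)
    ties = [w for w in vertices if dot(c_hat, w) <= best + tol]
    return max(dot(c, w) for w in ties) - min(dot(c, w) for w in vertices)


def spo_plus_loss(c_hat, c, vertices):
    """SPO+ loss: xi_S(c - 2 c_hat) + 2 c_hat' w*(c) - z*(c)."""
    shifted = [ci - 2.0 * hi for ci, hi in zip(c, c_hat)]
    support = max(dot(shifted, w) for w in vertices)
    w_c = oracle(c, vertices)
    return support + 2.0 * dot(c_hat, w_c) - dot(c, w_c)


def spo_plus_subgradient(c_hat, c, vertices):
    """One subgradient of the SPO+ loss in c_hat: 2 (w*(c) - w*(2 c_hat - c))."""
    w_c = oracle(c, vertices)
    w_s = oracle([2.0 * hi - ci for hi, ci in zip(c_hat, c)], vertices)
    return [2.0 * (a - b) for a, b in zip(w_c, w_s)]


SQUARE = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
rng = random.Random(0)
violations = 0
for _ in range(1000):
    c = [rng.gauss(0, 1) for _ in range(2)]
    a = [rng.gauss(0, 1) for _ in range(2)]   # prediction c_hat
    b = [rng.gauss(0, 1) for _ in range(2)]   # another prediction
    g = spo_plus_subgradient(a, c, SQUARE)
    upper_ok = spo_plus_loss(a, c, SQUARE) >= spo_loss(a, c, SQUARE) - 1e-9
    linear = spo_plus_loss(a, c, SQUARE) + dot(g, [bi - ai for ai, bi in zip(a, b)])
    subgrad_ok = spo_plus_loss(b, c, SQUARE) >= linear - 1e-9
    violations += (not upper_ok) + (not subgrad_ok)
print("violations:", violations)
```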
The convexity of this SPO+ loss function is also shared by the hinge loss function, which is a convex upper bound for the 0-1 loss function. Recall that the hinge loss given a prediction $\hat{c}$ is $\max\{0, 1 - \hat{c}\}$ if the true label is $1$ and $\max\{0, 1 + \hat{c}\}$ if the true label is $-1$. More concisely, the hinge loss can be written as $\max\{0, 1 - \hat{c} c\}$ where $c \in \{-1, +1\}$ is the true label. The hinge loss is central to the support vector machine (SVM) method, where it is used as a convex surrogate to minimize 0-1 loss. In fact, in the same setting as Proposition 1 where the SPO loss captures the 0-1 loss, Proposition 3 shows that the SPO+ loss function exactly matches the hinge loss. Moreover, the hinge loss satisfies a key consistency property with respect to the 0-1 loss (Steinwart (2002)), which justifies its use in practice. Later in the paper, we show a similar consistency result for the SPO+ loss with respect to the SPO loss under some mild conditions.
Proposition 3. The hinge loss is equivalent to the SPO+ loss, under the same conditions where the 0-1 loss is equivalent to the SPO loss.
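Proof. In the same setup as Proposition 1, we have that $S = [-1/2, +1/2]$ and $c \in \{-1, +1\}$ corresponds to the true label. Note that $w^*(c) = -c/2$ and $z^*(c) = -1/2$ for $c \in \{-1, +1\}$. Therefore,
$$\ell_{\mathrm{SPO+}}(\hat{c}, c) = \tfrac{1}{2}|c - 2\hat{c}| - \hat{c} c + \tfrac{1}{2} = \tfrac{1}{2}|1 - 2\hat{c} c| + \tfrac{1}{2}(1 - 2\hat{c} c) = \max\{0, 1 - 2\hat{c} c\},$$
where the second equality follows since $c \in \{-1, +1\}$. Thus, in this setting, $\ell_{\mathrm{SPO+}}(\hat{c}, c)$ is precisely the hinge loss evaluated at the prediction $2\hat{c}$.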
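The identity can be verified directly on a grid of predictions. The sketch below assumes the SPO+ loss with multiplier 2, i.e., $\max_{w \in S}\{(c - 2\hat{c}) w\} + 2\hat{c} w^*(c) - z^*(c)$, on $S = [-1/2, +1/2]$; under that assumption the value coincides with the hinge loss at margin $2\hat{c}c$. The function names are illustrative.

```python
ENDPOINTS = (-0.5, 0.5)  # extreme points of S = [-1/2, +1/2]


def spo_plus_interval(c_hat, c):
    """SPO+ loss on S = [-1/2, +1/2] for a label c in {-1, +1}."""
    w_c = min(ENDPOINTS, key=lambda w: c * w)            # w*(c) = -c/2
    support = max((c - 2.0 * c_hat) * w for w in ENDPOINTS)
    return support + 2.0 * c_hat * w_c - c * w_c


def hinge(margin):
    """Standard hinge loss as a function of the margin."""
    return max(0.0, 1.0 - margin)


for c in (-1.0, 1.0):
    for c_hat in (-1.5, -0.5, -0.25, 0.0, 0.25, 0.5, 1.5):
        print(c, c_hat, spo_plus_interval(c_hat, c), hinge(2.0 * c_hat * c))
```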
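Applying the ERM principle as in (3) to the SPO+ loss yields the following optimization problem for selecting the prediction model:
$$\min_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} \ell_{\mathrm{SPO+}}(f(x_i), c_i). \tag{9}$$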
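For a linear hypothesis class $f(x) = Bx$, problem (9) can be attacked with a stochastic subgradient method, since a subgradient of the SPO+ loss with respect to $B$ is the outer product of $2(w^*(c) - w^*(2Bx - c))$ with $x$. The following is a minimal sketch under stated assumptions, not the algorithm of Section 4: the step size rule $\gamma/\sqrt{t+1}$, the brute-force vertex oracle, the tiny data set, and all names are illustrative choices, and the full empirical risk is re-evaluated every step only because the example is small.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def oracle(cost, vertices):
    """w*(cost): a vertex of S minimizing the linear cost (black-box solver)."""
    return min(vertices, key=lambda w: dot(cost, w))


def predict(B, x):
    """Linear model: c_hat = B x, with B stored as a list of rows."""
    return [dot(row, x) for row in B]


def spo_plus_loss(c_hat, c, vertices):
    shifted = [ci - 2.0 * hi for ci, hi in zip(c, c_hat)]
    w_c = oracle(c, vertices)
    return max(dot(shifted, w) for w in vertices) + 2.0 * dot(c_hat, w_c) - dot(c, w_c)


def empirical_risk(B, data, vertices):
    return sum(spo_plus_loss(predict(B, x), c, vertices) for x, c in data) / len(data)


def train_spo_plus(data, vertices, d, p, steps=300, gamma=0.5, seed=0):
    """Stochastic subgradient descent on the SPO+ empirical risk."""
    rng = random.Random(seed)
    B = [[0.0] * p for _ in range(d)]
    best_B, best_risk = [row[:] for row in B], empirical_risk(B, data, vertices)
    for t in range(steps):
        x, c = rng.choice(data)                        # sample one observation
        c_hat = predict(B, x)
        w_c = oracle(c, vertices)
        w_s = oracle([2.0 * hi - ci for hi, ci in zip(c_hat, c)], vertices)
        g = [2.0 * (a - b) for a, b in zip(w_c, w_s)]  # subgradient in c_hat
        step = gamma / (t + 1) ** 0.5
        for j in range(d):
            for k in range(p):
                B[j][k] -= step * g[j] * x[k]          # subgradient in B is g x^T
        risk = empirical_risk(B, data, vertices)
        if risk < best_risk:                           # keep the best iterate seen
            best_B, best_risk = [row[:] for row in B], risk
    return best_B, best_risk


ROUTES = [(1.0, 0.0), (0.0, 1.0)]
# Features are (1, x) so that the model has an intercept; costs are per route.
DATA = [((1.0, 0.0), (1.0, 2.0)), ((1.0, 2.0), (3.0, 2.0)),
        ((1.0, 1.0), (2.5, 2.0)), ((1.0, 0.5), (1.5, 2.0))]
initial_risk = empirical_risk([[0.0, 0.0], [0.0, 0.0]], DATA, ROUTES)
B_best, risk_best = train_spo_plus(DATA, ROUTES, d=2, p=2)
print(initial_risk, risk_best)
```

Because the oracle is only called as a black box, the same loop applies unchanged when `oracle` wraps a linear or integer programming solver; regularization could be added as an extra term in the update.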
Much of the remainder of the paper describes results concerning problem (9). We later demonstrate the aforementioned Fisher consistency result, in Section 4 we describe several computational approaches for solving problem (9), and in our computational experiments we demonstrate that (9) often offers superior practical performance over standard PO approaches. Next, we provide a theoretically motivated justification for using the SPO+ loss.
In the following, we provide intuitive and theoretical justification for each of the upper bounds, (5)-(7), that were used in the derivation of the surrogate SPO+ loss function. Our reasoning is confirmed by the computational results reported later in the paper, which evaluate the performance of the SPO+ loss function in various problem instances.
Since the first upper bound in (5) involves optimizing over $\alpha$, in a sense one may view (5) as a “weak duality” bound. It turns out that this intuition can be made precise by applying Lagrangian duality within the definition of the SPO loss function, and moreover strong duality actually holds in this case. Thus there is no loss in approximation from this step. Proposition 4 formalizes this result.
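Proposition 4. For any cost vector prediction $\hat{c}$ and realized cost vector $c$, the true SPO loss function may be expressed as
$$\ell_{\mathrm{SPO}}(\hat{c}, c) = \inf_{\alpha}\left\{\max_{w \in S}\{c^T w - \alpha \hat{c}^T w\} + \alpha z^*(\hat{c})\right\} - z^*(c). \tag{10}$$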
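Proof. We will actually prove a stronger result, namely that
$$\ell_{\mathrm{SPO}}(\hat{c}, c) = \inf_{\alpha \geq 0}\left\{\max_{w \in S}\{c^T w - \alpha \hat{c}^T w\} + \alpha z^*(\hat{c})\right\} - z^*(c), \tag{11}$$
where notice that the $\inf$ in (11) is over $\alpha \geq 0$ as opposed to $\alpha \in \mathbb{R}$ in (10). Then, given (5), it is clear that (11) also implies that (10) holds. The proof of (11) employs Lagrangian duality (see, e.g., Bertsekas (1999) and the references therein). First, note that the set of optimal solutions with respect to $\hat{c}$ may be expressed as $W^*(\hat{c}) = \{w \in S : \hat{c}^T w \leq z^*(\hat{c})\}$. Therefore, it holds that:
$$\max_{w \in W^*(\hat{c})}\{c^T w\} = \max_{w \in S}\{c^T w \;\; \text{s.t.} \;\; \hat{c}^T w \leq z^*(\hat{c})\}. \tag{12}$$
Let us introduce a scalar Lagrange multiplier $\alpha \in \mathbb{R}_+$ associated with the inequality constraint “$\hat{c}^T w \leq z^*(\hat{c})$” on the right side of (12) and then form the Lagrangian:
$$L(w, \alpha) := c^T w + \alpha\left(z^*(\hat{c}) - \hat{c}^T w\right).$$
The dual function $q(\cdot) : \mathbb{R}_+ \to \mathbb{R}$ is defined in the standard way and satisfies:
$$q(\alpha) := \max_{w \in S} L(w, \alpha) = \max_{w \in S}\{c^T w - \alpha \hat{c}^T w\} + \alpha z^*(\hat{c}).$$
Weak duality then implies that $\max_{w \in W^*(\hat{c})}\{c^T w\} \leq \inf_{\alpha \geq 0} q(\alpha)$ and hence:
$$\ell_{\mathrm{SPO}}(\hat{c}, c) = \max_{w \in W^*(\hat{c})}\{c^T w\} - z^*(c) \leq \inf_{\alpha \geq 0}\left\{\max_{w \in S}\{c^T w - \alpha \hat{c}^T w\} + \alpha z^*(\hat{c})\right\} - z^*(c).$$
To prove (11), we demonstrate that strong duality holds by applying Theorem 4.3.8 of Borwein and Lewis (2010). In our setting, the primal problem is the problem on the right-hand side of (12). This problem corresponds to the primal minimization problem in Borwein and Lewis (2010) by considering the objective function given by $-c^T w + I_S(w)$ and the constraint function $\hat{c}^T w - z^*(\hat{c})$. (Note that $I_S(\cdot)$ is the convex indicator function equal to $0$ when $w \in S$ and $+\infty$ otherwise.) Since $S$ is a compact and convex set, we satisfy all of the assumptions of Theorem 4.3.8 of Borwein and Lewis (2010) and hence strong duality holds.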
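As a worked illustration of the duality argument, consider a hypothetical instance (not from the paper) in which $S$ is the convex hull of the three unit vectors in $\mathbb{R}^3$, $c = (1, 2, 3)$, and $\hat{c} = (2, 1, 3)$. Because the Lagrangian is linear in $w$, the maximum over $S$ is attained at a vertex, and the dual function can be written out explicitly:

```latex
\begin{align*}
z^*(\hat{c}) &= 1, \qquad W^*(\hat{c}) = \{e_2\}, \qquad z^*(c) = 1,
  \qquad \ell_{\mathrm{SPO}}(\hat{c}, c) = c^T e_2 - z^*(c) = 2 - 1 = 1, \\
q(\alpha) &= \max_{j \in \{1,2,3\}} \left\{ c_j - \alpha\,(\hat{c}_j - z^*(\hat{c})) \right\}
           = \max\{\, 1 - \alpha,\; 2,\; 3 - 2\alpha \,\}, \\
\inf_{\alpha \geq 0} q(\alpha) &= 2 \quad \text{(attained for every } \alpha \geq 1/2\text{)}, \\
\inf_{\alpha \geq 0} q(\alpha) - z^*(c) &= 2 - 1 = 1 = \ell_{\mathrm{SPO}}(\hat{c}, c).
\end{align*}
```

For $\alpha < 0$ the term $3 - 2\alpha$ exceeds $3$, so extending the infimum to all of $\mathbb{R}$ does not change its value, in agreement with the equivalence of (10) and (11).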
The justification for the upper bound in (6) can be seen by first reformulating the true SPO loss ERM problem (3) using the alternate characterization of the SPO loss from Proposition 4. The SPO loss ERM can thus be formulated as
$$\min_{f \in \mathcal{H}} \ \frac{1}{n} \sum_{i=1}^{n} \left[ \inf_{\alpha_i} \left\{ \max_{w \in S} \left\{ c_i^T w - \alpha_i f(x_i)^T w \right\} + \alpha_i z^*(f(x_i)) \right\} - z^*(c_i) \right]. \tag{13}$$
We note that the scalar $\alpha_i$ is specific to observation $i$, and serves as a multiplier of the prediction $f(x_i)$. Now consider forcing each scalar $\alpha_i$ in (13) to be the same, i.e., $\alpha_i = \alpha$ for all $i \in \{1, \ldots, n\}$. Although this assumption is an approximation (and upper bounds (13)), Proposition 5 implies that it is a reasonable one if we take the common multiplier to be large enough. In fact, Proposition 5 shows that the optimal value of each $\alpha_i$ tends to $\infty$, and is achieved by a large constant when $S$ is polyhedral. Thus we see that $\alpha$ now plays the role of a parameter that uniformly controls the size of the predictions. Instead, we can assume that the size of the predictions is directly controlled by the size of $f$ by setting $\alpha$ as a fixed value, say $\alpha = 2$, which yields the upper bound in (6). Note that setting $\alpha = 2$ may also be interpreted as a change of variables that rescales the predictions by $\alpha / 2$. (We note that the choice of $2$ is not yet apparent but is somewhat analogous to including the constant $1/2$ in the least squares loss function. This will be made clear later in the paper.)
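Proposition 5. For any $\hat{c}$ and $c$, the SPO loss function may be expressed as
$$\ell_{\mathrm{SPO}}(\hat{c}, c) = \lim_{\alpha \to \infty} \left\{ \max_{w \in S} \left\{ c^T w - \alpha \hat{c}^T w \right\} + \alpha z^*(\hat{c}) \right\} - z^*(c). \tag{14}$$
Moreover, if $S$ is polyhedral then the minimum in (10) is attained, i.e., there exists $\alpha^* \geq 0$ such that for all $\alpha \geq \alpha^*$ the SPO loss function may be expressed as
$$\ell_{\mathrm{SPO}}(\hat{c}, c) = \max_{w \in S} \left\{ c^T w - \alpha \hat{c}^T w \right\} + \alpha z^*(\hat{c}) - z^*(c). \tag{15}$$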
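Proof. It suffices to show that the function $q(\alpha) := \max_{w \in S} \{c^T w - \alpha \hat{c}^T w\} + \alpha z^*(\hat{c})$ is monotone decreasing on $\mathbb{R}$, from which (14) follows from the basic monotone convergence theorem. Clearly $q(\cdot)$ is a convex function and moreover a subgradient of $q(\cdot)$ for any $\alpha$ is given by $g(\alpha) := z^*(\hat{c}) - \hat{c}^T \bar{w}(\alpha)$, where $\bar{w}(\alpha) \in \arg\max_{w \in S} \{c^T w - \alpha \hat{c}^T w\}$. Since $\bar{w}(\alpha) \in S$, we have that $g(\alpha) \leq 0$ for any $\alpha$. Now, for any $\alpha' \geq \alpha$, the subgradient inequality implies that:
$$q(\alpha) \geq q(\alpha') + g(\alpha')(\alpha - \alpha') \geq q(\alpha'),$$
since $g(\alpha')$ and $\alpha - \alpha'$ are both nonpositive. Thus, $q(\cdot)$ is monotone decreasing.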
The representation of (15) follows from a proof similar to that of Proposition 4, but uses a direct application of a different strong duality result (see, for example, Proposition 5.2.1 of Bertsekas (1999)) that exploits the assumption that $S$ is polyhedral to additionally obtain both primal and dual attainment.
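The duality representation and its monotone behavior in the multiplier can be checked numerically on a toy instance. The sketch below is illustrative only: it assumes the feasible region is the unit square in two dimensions (so linear optimization over the region reduces to enumerating its four vertices), and the names `VERTICES`, `z_star`, `spo_loss`, and `q` are hypothetical helpers, not part of the paper.

```python
import numpy as np

# Assumption: S is the unit square, represented by its four vertices.
VERTICES = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def z_star(c):
    """Optimal value of the nominal problem min_{w in S} c^T w."""
    return float(np.min(VERTICES @ c))

def spo_loss(c_hat, c, tol=1e-9):
    """SPO loss: worst-case optimal solution under c_hat, costed under c."""
    vals = VERTICES @ c_hat
    in_w_star = vals <= vals.min() + tol  # vertices that are optimal for c_hat
    return float(np.max(VERTICES[in_w_star] @ c)) - z_star(c)

def q(alpha, c_hat, c):
    """Dual function: max_{w in S} {c^T w - alpha c_hat^T w} + alpha z*(c_hat)."""
    return float(np.max(VERTICES @ (c - alpha * c_hat))) + alpha * z_star(c_hat)

c_hat = np.array([1.0, 1.0])    # predicted cost vector (made-up values)
c = np.array([1.0, -2.0])       # realized cost vector (made-up values)
alphas = np.linspace(0.0, 3.0, 13)
gaps = np.array([q(a, c_hat, c) - z_star(c) for a in alphas])

print("SPO loss:", spo_loss(c_hat, c))
print("dual bound along the grid:", gaps)
```

On this instance the dual bound works out to $\max\{1 - \alpha, 0\} + 2$ for $\alpha \geq 0$, so it decreases until $\alpha = 1$ and coincides with the SPO loss of $2$ from there on, which is the finite attainment expected for a polyhedral feasible region.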
The final step in the derivation of our convex surrogate SPO+ loss function involves approximating the concave (therefore nonconvex) function $z^*(\cdot)$ with a first-order expansion. Namely, we apply the bound $z^*(\hat{c}) \leq \hat{c}^T w^*(c)$, which can be viewed as a first-order approximation of $z^*(\hat{c})$ based on a supergradient computed at $c$ (i.e., it holds that $w^*(c) \in \partial z^*(c)$). The first-order approximation of $z^*(\hat{c})$ can be intuitively justified since one might expect $w^*(c)$, the optimal decision for the true cost vector, to be a near-optimal solution to $P(\hat{c})$, the nominal problem under the predicted cost vector. If the prediction is reasonably close to the true cost vector, then this upper bound is indeed reasonable. In fact, the consistency analysis that follows provides a property suggesting that the predictions are indeed reasonably close to the true value if the prediction model is trained on a sufficiently large dataset.
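As a sanity check on the chain of bounds, one can compare the two losses on random inputs. The sketch below is a minimal illustration under stated assumptions: the feasible region is taken to be the unit square, the SPO+ loss is taken to be the expression obtained after applying (5)-(7), namely $\max_{w \in S}\{c^T w - 2\hat{c}^T w\} + 2\hat{c}^T w^*(c) - z^*(c)$, and all function names are hypothetical.

```python
import numpy as np

# Assumption: S is the unit square, represented by its four vertices.
VERTICES = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def w_star(c):
    """One optimal vertex of min_{w in S} c^T w."""
    return VERTICES[np.argmin(VERTICES @ c)]

def spo_loss(c_hat, c, tol=1e-9):
    """SPO loss with worst-case tie-breaking among solutions optimal for c_hat."""
    vals = VERTICES @ c_hat
    worst = np.max(VERTICES[vals <= vals.min() + tol] @ c)
    return float(worst - c @ w_star(c))

def spo_plus_loss(c_hat, c):
    """Surrogate: max_{w in S} (c - 2 c_hat)^T w + 2 c_hat^T w*(c) - z*(c)."""
    support = np.max(VERTICES @ (c - 2.0 * c_hat))
    return float(support + 2.0 * (c_hat @ w_star(c)) - c @ w_star(c))

rng = np.random.default_rng(0)
pairs = rng.normal(size=(1000, 2, 2))  # each entry holds one (c_hat, c) pair

# Slack of the surrogate over the true loss; should never be negative.
slack = np.array([spo_plus_loss(ch, c) - spo_loss(ch, c) for ch, c in pairs])
# Slack of the first-order bound z*(c_hat) <= c_hat^T w*(c).
first_order = np.array([ch @ w_star(c) - ch @ w_star(ch) for ch, c in pairs])

print("smallest SPO+ minus SPO:", slack.min())
print("smallest first-order slack:", first_order.min())
```

Both slacks are nonnegative by construction of the bounds; the check only confirms that on sampled inputs and says nothing about how tight the surrogate is in general.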
In this section, we now assume full knowledge of the true underlying distribution of $(x, c)$, and prove a key consistency condition to describe when minimizing the SPO+ loss is equivalent to minimizing the SPO loss. This result is analogous to the well-known consistency results of the hinge loss and logistic loss functions with respect to the 0-1 loss (minimizing the hinge and logistic losses under full knowledge also minimizes the 0-1 loss), and provides theoretical motivation for their success in practice.
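We let $\mathcal{D}$ denote the distribution of $(x, c)$, i.e., $(x, c) \sim \mathcal{D}$, and consider the population version of the true SPO risk minimization problem:
$$\min_{f} \ \mathbb{E}_{(x, c) \sim \mathcal{D}} \left[ \ell_{\mathrm{SPO}}(f(x), c) \right], \tag{16}$$
and the population version of the SPO+ risk minimization problem:
$$\min_{f} \ \mathbb{E}_{(x, c) \sim \mathcal{D}} \left[ \ell_{\mathrm{SPO+}}(f(x), c) \right]. \tag{17}$$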
Note here that we place no restrictions on $\mathcal{H}$, meaning $\mathcal{H}$ consists of any function mapping features to cost vectors. We let $f^*_{\mathrm{SPO}}$ denote any optimal solution of (16) and let $f^*_{\mathrm{SPO+}}$ denote any optimal solution of (17). The goal of this section is to demonstrate that the SPO+ loss function is Fisher consistent with respect to the SPO loss, i.e., $f^*_{\mathrm{SPO+}}$ is also an optimal solution of the population version of the true SPO risk minimization problem (16).
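Throughout this section, we consider a non-parametric setup where the dependence on the features $x$ is dropped without loss of generality. To see this, first observe that $\mathbb{E}_{(x, c) \sim \mathcal{D}}[\ell_{\mathrm{SPO}}(f(x), c)] = \mathbb{E}_{x}\left[\mathbb{E}_{c}[\ell_{\mathrm{SPO}}(f(x), c) \mid x]\right]$ and likewise for the SPO+ risk. Since there is no constraint on $f(\cdot)$ (the hypothesis class consists of all prediction models), solving problems (16) and (17) is equivalent to optimizing each $f(x)$ individually for all $x$. Therefore, for the remainder of the section unless otherwise noted, we drop the dependence on $x$. Thus, we now assume that the distribution $\mathcal{D}$ is only over $c$, and the SPO and SPO+ risks are defined as $R_{\mathrm{SPO}}(\hat{c}) := \mathbb{E}_{c}[\ell_{\mathrm{SPO}}(\hat{c}, c)]$ and $R_{\mathrm{SPO+}}(\hat{c}) := \mathbb{E}_{c}[\ell_{\mathrm{SPO+}}(\hat{c}, c)]$, respectively. Moreover, $f^*_{\mathrm{SPO}}$ and $f^*_{\mathrm{SPO+}}$ can now each be described as a single vector.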
Next, we fully characterize the minimizers of the true SPO risk problem (16) in this non-parametric setting. For convenience, let us define $\bar{c} := \mathbb{E}_{c}[c]$ (note that we are implicitly assuming that $\bar{c}$ is finite). Proposition 6 demonstrates that for any minimizer $c^*$ of $R_{\mathrm{SPO}}(\cdot)$, all of its corresponding solutions with respect to the nominal problem, $W^*(c^*)$, are also optimal solutions for $P(\bar{c})$. In other words, minimizing the true SPO risk also optimizes for the expected cost in the nominal problem (since the objective function is linear). Proposition 6 also demonstrates that the converse is true: any cost vector prediction with a unique optimal solution that also optimizes for the expected cost is also a minimizer of the true SPO risk.
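Proposition 6. If a cost vector $c^*$ is a minimizer of $R_{\mathrm{SPO}}(\cdot)$, then $W^*(c^*) \subseteq W^*(\bar{c})$. Conversely, if $c^*$ is a cost vector such that $W^*(c^*)$ is a singleton and $W^*(c^*) \subseteq W^*(\bar{c})$, then $c^*$ is a minimizer of $R_{\mathrm{SPO}}(\cdot)$.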
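Proof. Consider a cost vector $c^*$ that is a minimizer of $R_{\mathrm{SPO}}(\cdot)$. Let $\bar{w}$ be an optimal solution of $P(\bar{c})$, i.e., $\bar{w} \in W^*(\bar{c})$, and let $\tilde{c}$ be chosen such that $\bar{w}$ is the unique optimal solution of $P(\tilde{c})$, i.e., $W^*(\tilde{c}) = \{\bar{w}\}$. (Note that if $\bar{w}$ is the unique optimal solution of $P(\bar{c})$ then it suffices to select $\tilde{c} \leftarrow \bar{c}$, otherwise we may take $\tilde{c}$ as a slight perturbation of $\bar{c}$.) Then it holds that:
$$0 \leq R_{\mathrm{SPO}}(\tilde{c}) - R_{\mathrm{SPO}}(c^*) = \bar{c}^T \bar{w} - \mathbb{E}_{c}\left[ \max_{w \in W^*(c^*)} c^T w \right] \leq \bar{c}^T \bar{w} - \max_{w \in W^*(c^*)} \bar{c}^T w,$$
where the equality uses $W^*(\tilde{c}) = \{\bar{w}\}$ and the cancellation of the $z^*(c)$ terms, and the last inequality holds because the expectation of a maximum is at least the maximum of the expectations.
Finally, we conclude that, for any $w \in W^*(c^*)$, it holds that $\bar{c}^T w \leq \bar{c}^T \bar{w} = z^*(\bar{c})$. Therefore, $W^*(c^*) \subseteq W^*(\bar{c})$.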
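To prove the other direction, consider a cost vector $c^*$ such that $W^*(c^*)$ is a singleton and $W^*(c^*) \subseteq W^*(\bar{c})$, i.e., the unique element $w^*(c^*)$ of $W^*(c^*)$ is an optimal solution of $P(\bar{c})$. Let $c'$ be an arbitrary minimizer of $R_{\mathrm{SPO}}(\cdot)$. Then,
$$R_{\mathrm{SPO}}(c^*) - R_{\mathrm{SPO}}(c') = \bar{c}^T w^*(c^*) - \mathbb{E}_{c}\left[ \max_{w \in W^*(c')} c^T w \right] \leq \bar{c}^T w^*(c^*) - \max_{w \in W^*(c')} \bar{c}^T w \leq 0.$$
Finally, we conclude that since $c'$ is a minimizer of $R_{\mathrm{SPO}}(\cdot)$ and $c^*$ has at most the same risk, then $c^*$ is also a minimizer of $R_{\mathrm{SPO}}(\cdot)$.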
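The converse direction of Proposition 6 can be illustrated on a small discrete example in which the risk is computed exactly. The sketch below is only an illustration: the feasible region (the unit square), the three equally likely cost scenarios, and the helper names are all assumptions made for this example.

```python
import numpy as np

# Assumption: S is the unit square, represented by its four vertices.
VERTICES = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def spo_loss(c_hat, c, tol=1e-9):
    """SPO loss with worst-case tie-breaking among solutions optimal for c_hat."""
    vals = VERTICES @ c_hat
    worst = np.max(VERTICES[vals <= vals.min() + tol] @ c)
    return float(worst - np.min(VERTICES @ c))

# Hypothetical distribution of c: three equally likely scenarios.
scenarios = np.array([[2.0, -1.0], [-1.0, 3.0], [1.0, -4.0]])
probs = np.full(3, 1.0 / 3.0)
c_bar = probs @ scenarios  # mean cost vector, here (2/3, -2/3)

def spo_risk(c_hat):
    """Exact SPO risk under the discrete distribution above."""
    return float(sum(p * spo_loss(c_hat, c) for p, c in zip(probs, scenarios)))

# Compare the mean prediction against many alternative predictions.
rng = np.random.default_rng(1)
candidates = rng.normal(size=(500, 2))
risks = np.array([spo_risk(ch) for ch in candidates])

print("risk of the mean prediction:", spo_risk(c_bar))
print("best risk among random candidates:", risks.min())
```

Here the mean has the single optimal solution $(0, 1)$, so it satisfies the singleton condition, and its risk of $4/3$ is not beaten by any of the sampled candidates.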
Example 2 below demonstrates that, in order to ensure that $c^*$ is a minimizer of $R_{\mathrm{SPO}}(\cdot)$, it is not sufficient to allow $c^*$ to be any cost vector such that $W^*(c^*) \subseteq W^*(\bar{c})$. In fact, it may not be sufficient for $c^*$ to be $\bar{c}$. This follows from the unambiguity of the SPO loss function, which chooses a worst-case optimal solution in the event that the prediction allows for more than one optimal solution.
Suppose that , , and is normally distributed with mean and variance 1. Then and for all . Clearly though, . Moreover, strictly dominates in the sense that for all and for . Therefore, and hence is not a minimizer of .
We now examine Fisher consistency of the SPO+ loss function, which implies that minimizing the SPO+ risk (17) also minimizes the SPO risk (16). Recall that the expected risk for the SPO+ loss is defined as $R_{\mathrm{SPO+}}(\hat{c}) := \mathbb{E}_{c}[\ell_{\mathrm{SPO+}}(\hat{c}, c)]$. It turns out that the SPO+ loss function is not always consistent with respect to the SPO loss, i.e., it is possible that a minimizer of $R_{\mathrm{SPO+}}(\cdot)$ may be strictly suboptimal for the problem of minimizing the true risk $R_{\mathrm{SPO}}(\cdot)$. Assumption 1 presents a set of natural sufficient conditions on the distribution of $c$ that lead to Fisher consistency of the SPO+ estimator.
Assumption 1. The following are conditions that imply Fisher consistency of the SPO+ loss function.
The distribution of $c$ is continuous on all of $\mathbb{R}^d$.
The distribution of $c$ is centrally symmetric about its mean $\bar{c}$.
The mean $\bar{c}$ has a unique optimal solution, i.e., $W^*(\bar{c})$ is a singleton.
The feasible region $S$ is not a singleton.
More formally, “continuous on all of $\mathbb{R}^d$” means that $c$ has a probability density function that is strictly greater than 0 at every point in $\mathbb{R}^d$ and “centrally symmetric about its mean” means that $c - \bar{c}$ is equal in distribution to $\bar{c} - c$. The Gaussian distribution with any mean and covariance matrix is an example which satisfies Assumption 1. (Note that this is a typical assumption when the true model is assumed to have the form $c = f(x) + \epsilon$, where $\epsilon$ is a Gaussian random variable.) Requiring $\bar{c}$ to have a unique optimizer with respect to the nominal problem is a minimal assumption, as the measure of the space of vectors with multiple optimizers is minimal. Finally, the case where $S$ is a singleton is actually trivial since every function is a minimizer of the true SPO risk.
Under these conditions, Theorem 1 shows that $\bar{c}$ is the unique minimizer of $R_{\mathrm{SPO+}}(\cdot)$, which, by Proposition 6, implies that minimizing $R_{\mathrm{SPO+}}(\cdot)$ also minimizes $R_{\mathrm{SPO}}(\cdot)$. Thus, under Assumption 1, the SPO+ loss function is Fisher consistent with respect to the SPO loss function.
Theorem 1. Suppose that Assumption 1 holds. Then there is a unique global minimizer $c^*$ of $R_{\mathrm{SPO+}}(\cdot)$. Moreover, $c^* = \bar{c}$ and thus $c^*$ also minimizes $R_{\mathrm{SPO}}(\cdot)$.
Proof. The proof works in two steps. First, we show that $\bar{c}$ is an optimal solution of $\min_{\hat{c}} R_{\mathrm{SPO+}}(\hat{c})$ by considering the optimality conditions of this problem. Second, we directly show that $\bar{c}$ is the unique such minimizer.
Step 1: For technical reasons, let us first verify that is finite valued for all . Since is compact, there exists a ball such that . Therefore, for any fixed , it holds that:
Therefore, since is finite and hence is finite, the above inequality implies that is finite. Moreover, by Proposition 2, it is clear that is convex on . In particular, for any point the subdifferential is nonempty and, since is finite, we have that (see Strassen (1965)). By linearity of expectation, note that , where the second equality follows since the distribution of is continuous on all of which implies that is a singleton with probability 1 (see, e.g., the introductory discussion in Drusvyatskiy and Lewis (2011)).
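Now, the optimality conditions for the convex problem $\min_{\hat{c}} R_{\mathrm{SPO+}}(\hat{c})$ state that $\hat{c}$ is a global minimizer if and only if $0 \in \partial R_{\mathrm{SPO+}}(\hat{c})$. By the discussion in the previous paragraph, the optimality conditions may be equivalently written as $\mathbb{E}_{c}[w^*(c)] = \mathbb{E}_{c}[w^*(2\hat{c} - c)]$. Finally, since $c$ is centrally symmetric around its mean, we have that $2\bar{c} - c$ is equal in distribution to $c$; hence $\mathbb{E}_{c}[w^*(2\bar{c} - c)] = \mathbb{E}_{c}[w^*(c)]$. Therefore $\bar{c}$ satisfies $0 \in \partial R_{\mathrm{SPO+}}(\bar{c})$ and is an optimal solution of $\min_{\hat{c}} R_{\mathrm{SPO+}}(\hat{c})$.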
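The symmetry argument in Step 1 can be mimicked with samples that are symmetric about their mean by construction. The sketch below is a hedged illustration, not the paper's procedure: it assumes the feasible region is the unit square, assumes that a subgradient of the SPO+ loss at a prediction has the form $2(w^*(c) - w^*(2\hat{c} - c))$, and uses hypothetical helper names and made-up parameter values.

```python
import numpy as np

# Assumption: S is the unit square, represented by its four vertices.
VERTICES = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def w_star(c):
    """One optimal vertex of min_{w in S} c^T w."""
    return VERTICES[np.argmin(VERTICES @ c)]

def mean_subgradient(c_hat, samples):
    """Sample mean of 2 (w*(c) - w*(2 c_hat - c)), the assumed SPO+ subgradient."""
    terms = [2.0 * (w_star(c) - w_star(2.0 * c_hat - c)) for c in samples]
    return np.mean(terms, axis=0)

rng = np.random.default_rng(0)
c_bar = np.array([1.0, -1.0])   # mean cost vector with a unique optimal vertex
eps = rng.normal(size=(1000, 2))
# Pair each draw with its reflection so the sample is symmetric about c_bar.
samples = np.vstack([c_bar + eps, c_bar - eps])

grad_at_mean = mean_subgradient(c_bar, samples)
grad_elsewhere = mean_subgradient(np.array([-10.0, 10.0]), samples)

print("mean subgradient at c_bar:", grad_at_mean)
print("mean subgradient at another prediction:", grad_elsewhere)
```

Because reflecting a sample through the mean maps the sample set onto itself, the two averaged terms cancel at the mean, while a prediction far from the mean leaves a clearly nonzero average subgradient.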
Step 2: Consider a vector , and let us rewrite the difference as follows: