Dynamic Optimization with Side Information
Dimitris Bertsimas, Christopher McCord, Bradley Sturt
Operations Research Center, Massachusetts Institute of Technology,
dbertsim@mit.edu, mccord@mit.edu, bsturt@mit.edu
We present a datadriven framework for incorporating side information in dynamic optimization under uncertainty. Specifically, our approach uses predictive machine learning methods (such as nearest neighbors, kernel regression, and random forests) to weight the relative importance of various datadriven uncertainty sets in a robust optimization formulation. Through a novel measure concentration result for local machine learning methods, we prove that the proposed framework is asymptotically optimal for stochastic dynamic optimization with covariates. We also describe a generalpurpose approximation for the proposed framework, based on overlapping linear decision rules, which is computationally tractable and produces highquality solutions for dynamic problems with many stages. Across a variety of examples in shipment planning, inventory management, and finance, our method achieves improvements of up to 15% over alternatives and requires less than one minute of computation time on problems with twelve stages.
Key words: Distributionally robust optimization; machine learning; dynamic optimization.
History: This paper was first submitted in May 2019.
Dynamic decision making under uncertainty forms the foundation for numerous fundamental problems in operations research and management science. In these problems, a decision maker attempts to minimize an uncertain objective over time, as information incrementally becomes available. For example, consider a retailer with the goal of managing the inventory of a new short life cycle product. Each week, the retailer must decide an ordering quantity to replenish its inventory. Future demand for the product is unknown, but the retailer can base its ordering decisions on the remaining inventory level, which depends on the realized demands in previous weeks. A riskaverse investor faces a similar problem when constructing and adjusting a portfolio of assets in order to achieve a desirable riskreturn tradeoff over a horizon of many months. Additional examples abound in energy planning, airline routing, and ride sharing, as well as in many other areas.
To make high quality decisions in dynamic environments, the decision maker must accurately model future uncertainty. Often, practitioners have access to side information or auxiliary covariates, which can help predict that uncertainty. For a retailer, although the future demand for a newly introduced clothing item is unknown, data on the brand, style, and color of the item, as well as data on market trends and social media, can help predict it. For a riskaverse investor, while the returns of the assets in future stages are uncertain, recent asset returns and prices of relevant options can provide crucial insight into upcoming volatility. Consequently, organizations across many industries are continuing to prioritize the use of predictive analytics in order to leverage vast quantities of data to understand future uncertainty and make better operational decisions.
A recent body of work has aimed to leverage predictive analytics in decision making under uncertainty. For example, Hannah et al. (2010), Ban and Rudin (2018), Bertsimas and Kallus (2014) and Ho and Hanasusanto (2019) investigate prescriptive approaches, based on sample average approximation, that use local machine learning to assign weights to the historical data based on covariates. Bertsimas and Van Parys (2017) propose adding robustness to those weights to achieve optimal asymptotic budget guarantees. Elmachtoub and Grigas (2017) develop an approach for linear optimization problems in which a machine learning model is trained to minimize the decision cost. All of these approaches are specialized for singlestage or twostage optimization problems, and do not readily generalize to problems with many stages. For a class of dynamic inventory problems, Ban et al. (2018) propose a datadriven approach by fitting the stochastic process and covariates to a parametric regression model, which is asymptotically optimal when the model is correctly specified. Bertsimas and McCord (2019) propose a different approach based on dynamic programming that uses nonparametric machine learning methods to handle auxiliary covariates. However, these dynamic approaches require scenario tree enumeration and suffer from the curse of dimensionality. To the best of our knowledge, no previous work leverages machine learning in a computationally tractable, datadriven framework for decision making in dynamic environments with covariates.
Recently, Bertsimas et al. (2018a) developed a datadriven approach for dynamic optimization under uncertainty that they call sample robust optimization (SRO). Their SRO framework solves a robust optimization problem in which an uncertainty set is constructed around each historical sample path. They show this datadriven framework enjoys nonparametric outofsample performance guarantees for a class of dynamic linear optimization problems without covariates and show that this framework can be approximated using decision rule techniques from robust optimization.
In this paper, we present a new framework for leveraging side information in dynamic optimization. Specifically, we propose combining local machine learning methods with the sample robust optimization framework. Through a new measure concentration result, we show that the proposed sample robust optimization with covariates framework is asymptotically optimal, providing the assurance that the resulting decisions are nearly optimal in the presence of big data. We also demonstrate the tractability of the approach via an approximation algorithm based on overlapping linear decision rules. To the best of our knowledge, our method is the first nonparametric approach for tractably solving dynamic optimization problems with covariates, offering practitioners a generalpurpose tool for better decision making with predictive analytics. We summarize our main contributions as follows:

We present a generalpurpose framework for leveraging machine learning in datadriven dynamic optimization with covariates. Our approach extends the sample robust optimization framework by assigning weights to the uncertainty sets based on covariates. The weights are computed using machine learning methods such as nearest neighbor regression, kernel regression, and random forest regression.

We provide theoretical justification for the proposed framework in the big data setting. First, we develop a new measure concentration result for local machine learning methods (Theorem 2), which shows that the weighted empirical distribution produced by local predictors converges quickly to the true conditional distribution. To the best of our knowledge, such a result for local machine learning is the first of its kind. We use Theorem 2 to establish that the proposed framework is asymptotically optimal for dynamic optimization with covariates without any parametric assumptions (Theorem 1).

To find high quality solutions for problems with many stages in practical computation times, we present an approximation scheme based on overlapping linear decision rules. Specifically, we propose using separate linear decision rules for each uncertainty set to approximate the costs incurred in each stage. We show that the approximation is computationally tractable, both with respect to the number of stages and size of the historical dataset.

By using all available data, we show that our method produces decisions that achieve improved outofsample performance. Specifically, in a variety of examples (shipment planning, inventory management, and finance), across a variety of time horizons, our proposed method outperforms alternatives, in a statistically significant manner, achieving up to 15% improvement in average outofsample cost. Moreover, our algorithm is practical and scalable, requiring less than one minute on examples with up to twelve stages.
The paper is organized as follows. Section id1 introduces the problem setting and notation. Section id1 proposes the new framework for incorporating machine learning into dynamic optimization. Section id1 develops theoretical guarantees on the proposed framework. Section id1 presents the general multipolicy approximation scheme for dynamic optimization with covariates. Section id1 presents a detailed investigation and computational simulations of the proposed methodology in shipment planning, inventory management, and finance. We conclude in Section id1.
This paper follows a recent body of literature on datadriven optimization under uncertainty in operations research and management science. Much of this work has focused on the paradigm of distributionally robust optimization, in which the optimal solution is that which performs best in expectation over a worstcase probability distribution from an ambiguity set. Motivated by probabilistic guarantees, distributionally robust optimization has found particular applicability in datadriven settings in which the ambiguity set is constructed using historical data, such as Delage and Ye (2010), Xu et al. (2012), Esfahani and Kuhn (2018), Van Parys et al. (2017). In particular, the final steps in our convergence result (Section id1) draw heavily from similar techniques from Esfahani and Kuhn (2018) and Bertsimas et al. (2018a). In contrast to previous work, this paper develops a new measure concentration result for the weighted empirical distribution (Section id1) which enables machine learning and covariates to be incorporated into sample robust optimization and Wassersteinbased distributionally robust optimization for the first time.
Several recent papers have focused on tractable approximations of two and multistage distributionally and sample robust optimization. Many approaches are based around policy approximation schemes, including lifted linear decision rules (Bertsimas et al. 2018b), adaptivity (Hanasusanto et al. 2016), and finite adaptability (Bertsimas et al. 2018a). Alternative approaches include tractable approximations of copositive formulations (Hanasusanto and Kuhn 2018). Closest related to the approximation scheme in this paper are Chen et al. (2019) and Bertsimas et al. (2019), which address twostage problems via overlapping decision rules. Chen et al. (2019) propose a scenariowise modeling approach that leads to novel approximations of various distributionally robust applications, including twostage distributionally robust optimization using Wasserstein ambiguity sets and expectations of piecewise convex objective functions in singlestage problems. Independently, Bertsimas et al. (2019) investigate a multipolicy approximation of twostage sample robust optimization by optimizing a separate linear decision rule for each uncertainty set and prove that this approximation gap converges to zero as the amount of data goes to infinity. In Section id1 of this paper, we show how to extend similar techniques to dynamic problems with many stages for the first time.
As discussed previously, the methodology in this paper also follows recent work on incorporating covariates in optimization under uncertainty using local predictive methods (such as nearest neighbor regression, kernel regression, and random forests). In particular, the asymptotic optimality justification of Bertsimas and Kallus (2014) in singlestage settings relies on the strong universal consistency for local predictive models (e.g., Walk (2010)). Our proof of asymptotic optimality instead relies on convergence guarantees rooted in distributionally robust optimization. The reason we use a different approach is that the arguments for the convergence for local predictive models from Bertsimas and Kallus (2014) require finite dimensional decision variables. In contrast, the convergence guarantees in this paper apply for dynamic optimization over general spaces of policies.
We consider finitehorizon discretetime stochastic dynamic optimization problems. The uncertain quantities observed in each stage are denoted by random variables . The decisions made in each stage are denoted by . Given realizations of the uncertain quantities and decisions, we incur a cost of
A decision rule is a collection of measurable functions which specify what decision to make in stage based of the information observed up to that point. Given realizations of the uncertain quantities and choice of decision rules, the resulting cost is
Before selecting the decision rules, we observe auxiliary covariates . For example, in the aforementioned fashion setting, the auxiliary covariates may information on the brand, style, and color of a new clothing item and the remaining uncertainties representing the demand for the product in each week of the lifecycle. Given a realization of the covariates , our goal is to find decision rules which minimize the conditional expected cost:
(1) 
We refer to (1) as dynamic optimization with covariates. The optimization takes place over a collection which is any subset of the space of all nonanticipative decision rules.
In this paper, we assume that the joint distribution of the covariates and uncertain quantities is unknown, and our knowledge consists of historical data of the form
where each of these tuples consists of a realization of the auxiliary covariates and the following realization of the random variables over the stages. For example, in the aforementioned fashion setting, each tuple corresponds to the covariates of a past fashion item as well as its demand over its lifecycle. We will not assume any parametric structure on the relationship between the covariates and future uncertainty.
The goal of this paper is a generalpurpose, computationally tractable, datadriven approach for approximately solving dynamic optimization with covariates. In the following sections, we propose and analyze a new framework which leverages nonparametric machine learning, trained from historical data, to predict future uncertainty from covariates in a way that leads to nearoptimal decision rules to (1).
The joint probability distribution of the covariates and uncertain quantities is denoted by . For the purpose of proving theorems, we assume throughout this paper that the historical data are independent and identically distributed (i.i.d.) samples from this distribution . In other words, we assume that the historical data satisfies
where is the product measure. The set of all probability distributions supported on is denoted by . For each of the covariates , we assume that its conditional probability distribution satisfies , where is shorthand for . We sometimes use subscript notation for expectations to specify the underlying probability distribution; for example, the following two expressions are equivalent:
Finally, we say that the cost function resulting from a policy is upper semicontinuous if
for all .
In this section, we present our approach for incorporating machine learning in dynamic optimization. We first review sample robust optimization, and then we introduce our new sample robust optimization with covariates framework.
Consider a stochastic dynamic optimization problem of the form (1) in which there are no auxiliary covariates. The underlying joint distribution of the random variables is unknown, but we have data consisting of sample paths, . For this setting, sample robust optimization can be used to find approximate solutions in stochastic dynamic optimization. To apply the framework, one constructs an uncertainty set around each sample path in the training data and then chooses the decision rules that optimize the average of the worstcase realizations of the cost. Formally, this framework results in the following robust optimization problem:
(2) 
where is an uncertainty set around . Intuitively speaking, (2) chooses the decision rules by averaging over the historical sample paths which are adversarially perturbed. Under mild probabilistic assumptions on the underlying joint distribution and appropriately constructed uncertainty sets, Bertsimas et al. (2018a) show that sample robust optimization converges asymptotically to the underlying stochastic problem and that (2) is amenable to approximations similar to dynamic robust optimization.
We now present our new framework, based on sample robust optimization, for solving dynamic optimization with covariates. In the proposed framework, we first train a machine learning algorithm on the historical data to predict future uncertainty as a function of the covariates. From the trained learner, we obtain weight functions , for , each of which captures the relevance of the th training sample to the new covariates, . We incorporate the weights into sample robust optimization by multiplying the cost associated with each training example by the corresponding weight function. The resulting sample robust optimization with covariates framework is as follows:
(3) 
where the uncertainty sets are defined
and is some norm with .
The above framework provides the flexibility for the practitioner to construct weights from a variety of machine learning algorithms. We focus in this paper on weight functions which come from nonparametric machine learning methods. Examples of viable predictive models include nearest neighbors (kNN), kernel regression, classification and regression trees (CART), and random forests (RF). We describe these four classes of weight functions.
Definition 1
The nearest neighbor weight functions are given by:
Formally, is a nearest neighbor of if . For more technical details, we refer the reader to Biau and Devroye (2015).
Definition 2
The kernel regression weight functions are given by:
where is the kernel function and is the bandwidth parameter. Examples of kernel functions include the Gaussian kernel, , the triangular kernel, , and the Epanechnikov kernel, . For more information on kernel regression, see Friedman et al. (2001, Chapter 6).
The next two types of weight functions we present are based on classification and regression trees (Breiman et al. 1984) and random forests (Breiman 2001). We refer the reader to Bertsimas and Kallus (2014) for technical implementation details.
Definition 3
The classification and regression tree weight functions are given by:
where is the set of indices such that is contained in the same leaf of the tree as .
Definition 4
The random forest weight functions are given by:
where is the number of trees in the ensemble, and refers to the weight function of the th tree in the ensemble.
All of the above weight functions come from nonparametric machine learning methods. They are highly effective as predictive methods because they can learn complex relationships between the covariates and the response variable without requiring the practitioner to state an explicit parametric form. Similarly, as we prove in Section id1, solutions to (3) with these weight functions are asymptotically optimal for (1) without any parametric restrictions on the relationship between and . In other words, incorporating covariates into sample robust optimization via (3) leads to better decisions asymptotically, even without specific knowledge of how the covariates affect the uncertainty.
In this section, we establish asymptotic optimality guarantees for sample robust optimization with auxiliary covariates. We prove that, under mild conditions, (3) converges to (1) as the number of training samples goes to infinity. Thus, as the amount of data grows, sample robust optimization with covariates becomes an optimal approximation of the underlying stochastic dynamic optimization problem. Crucially, our convergence guarantee does not require parametric restrictions on the space of decision rules (e.g., linearity) or parametric restrictions on the joint distribution of the covariates and uncertain quantities. These theoretical results are consistent with empirical experiments in Section id1.
We begin by presenting our main result. The proof of the result depends on some technical assumptions and concepts from distributionally robust optimization. For simplicity, we defer the statement and discussion of technical assumptions regarding the underlying probability distribution and cost until Sections id1 and id1, and first discuss what is needed to apply the method in practice. The practitioner needs to select a weight function, parameters associated with that weight function, and the radius, , of the uncertainty sets. While these may be selected by cross validation, we show that the method will in general converge if the parameters are selected to satisfy the following:
Assumption 1
The weight functions and uncertainty set radius satisfy one of the following:

are nearest neighbor weight functions with for constants and , and for constants and .

are kernel regression weight functions with the Gaussian, triangular, or Epanechnikov kernel function and for constants and , and for constants and .
Given Assumption 1, our main result is the following.
Theorem 1
The theorem says that objective value of (3) converge almost surely to the optimal value of the fullinformation problem, (1), as goes to infinity. The assumptions of the theorem require that the joint distribution and the feasible decision rules are well behaved. We will discuss these technical assumptions in more detail in the following sections.
In order to prove the asymptotic optimality of sample robust optimization with covariates, we view (3) through the more general lens of Wassersteinbased distributionally robust optimization. We first review some properties of the Wasserstein metric and then prove a key intermediary result, from which our main result follows.
The Wasserstein metric provides a distance function between probability distributions. In particular, given two probability distributions , the type1 Wasserstein distance is defined as the optimal objective value of a minimization problem:
The Wasserstein metric is particularly appealing because a distribution with finite support can have a finite distance to a continuous distribution. This allows us to construct a Wasserstein ball around an empirical distribution that includes continuous distributions, which cannot be done with other popular measures such as the KullbackLeilbler divergence (Kullback and Leibler 1951). We remark that the 1Wasserstein metric satisfies the axioms of a metric, including the triangle inequality (Clement and Desch 2008):
Important to this paper, the 1Wasserstein metric admits a dual form, as shown by Kantorovich and Rubinstein (1958),
where the supremum is taken over all 1Lipschitz functions. Note that the absolute value is optional in the dual form of the metric, and the space of Lipschitz functions can be restricted to those which satisfy without loss of generality. Finally, we remark that Fournier and Guillin (2015) prove under a lighttailed assumption that the 1Wasserstein distance between the empirical distribution and its underlying distribution concentrates around zero with high probability. Theorem 2 in the following section extends this concentration result to the setting with auxiliary covariates.
Given a local predictive method, let the corresponding weighted empirical measure be defined as
where denotes the Dirac probability distribution which places point mass at . In this section, we prove under mild assumptions that the weighted empirical measure concentrations quickly to with respect to the Wasserstein metric. We introduce the following assumptions on the underlying joint probability distribution:
Assumption 2 (Conditional Subgaussianity)
There exists a parameter such that
Assumption 3 (Lipschitz Continuity)
There exists such that
Assumption 4 (Smoothness of Auxiliary Covariates)
The set is compact, and there exists such that
With these assumptions, we are ready to prove the concentration result, which is proved using a novel technique that relies on the dual form of the Wasserstein metric and a discrete approximation of the space of 1Lipschitz functions.
Theorem 2
Proof.
Proof. Without loss of generality, we assume throughout the proof that all norms refer to the norm.^{1}^{1}1To see why this is without loss of generality, consider any other norm where . In this case, By the definition of the 1Wasserstein metric, this implies where refers to the 1Wasserstein metric with the norm. If satisfies Assumption 1, also satisfies Assumption 1, so the result for all other choices of norms follows from the result with the norm. Fix any . It follows from Assumption 1 that
(4)  
(5)  
(6) 
for constants . Moreover, Assumption 1 also implies that there exists constants and such that
(7)  
(8) 
The proof of the the above statements under Assumption 1 is found in Appendix id1. Now, choose any fixed , and let
Finally, we define the following intermediary probability distributions:
where is shorthand for .
Applying the triangle inequality for the Wasserstein metric and the union bound,
We now proceed to bound each of the above terms.
By the dual form of the Wasserstein metric,
where the supremum is taken over all 1Lipschitz functions. By (5) and Jensen’s inequality, we can upper bound this by
where the final inequality follows from Assumption 3. Therefore, it follows from (7) that
(9) 
Consider any Lipschitz function for which , and let satisfy (which is finite because of Assumption 4). Then, for all , and all ,
The first inequality follows because for all and otherwise. For the second inequality, we used the Gaussian tail inequality for (Vershynin 2018) along with Assumption 2. Because this bound holds uniformly over all , and all , it follows that
for all . It is easy to see that the right hand side above divided by goes to 0 as goes to infinity, so
By the law of total probability,
We now show that each of the above terms have finite summations. First,
The first inequality follows from the union bound, the second inequality follows from Assumption 2, and the final inequality follows because and the definition of .
Second, for each , we define several quantities. Let be the partitioning of into translations of . Let be the set of piecewise constant functions which are constant on each region of the partition , taking values on . Note that . Then, we observe that for all Lipschitz functions which satisfy , there exists a such that
Indeed, within each region of the partition, can vary by no more than . The possible function values for are separated by . Because is bounded by , this implies the existence of such that has a value within of everywhere within that region. The identical reasoning holds for all other regions of the partition.
Therefore, for every ,
where the final inequality follows from the union bound. We choose , in which case
Furthermore, for all sufficiently large ,
Applying Hoeffding’s inequality, and noting is bounded by when , we have the following for all :
for sufficiently large that and . Note that (8) was used for the final inequality. Combining these results, we have