A Distributionally Robust Boosting Algorithm
Abstract.
Distributionally Robust Optimization (DRO) has been shown to provide a flexible framework for decision making under uncertainty and statistical estimation. For example, recent works in DRO have shown that popular statistical estimators can be interpreted as the solutions of suitably formulated data-driven DRO problems. In turn, this connection is used to optimally select tuning parameters through a principled approach informed by robustness considerations. This paper contributes to this growing literature, connecting DRO and statistics, by showing how boosting algorithms can be studied via DRO. We propose a boosting-type algorithm, named DROBoosting, as a procedure to solve our DRO formulation. Our DROBoosting algorithm recovers Adaptive Boosting (AdaBoost) as a particular case, thus showing that AdaBoost is effectively solving a DRO problem. We apply our algorithm to a financial dataset on credit card default payment prediction. We find that our approach compares favorably to alternative boosting methods which are widely used in practice.
July 2, 2019
1. Introduction
Distributional robustness in decision making under uncertainty is an important topic with a long and successful history in the Operations Research literature [2, 22, 3, 13]. In recent years, this topic has been further fueled by distributionally robust optimization (DRO) formulations applied to machine learning and statistical analysis. These formulations have been shown to produce powerful insights in terms of interpretability, parameter tuning, implementation of computational procedures and the ability to enforce performance guarantees.
For instance, the work of [28] studied connections between the statistical theory of Empirical Likelihood and distributionally robust decision-making formulations. These connections have been helpful for obtaining statistical guarantees in a large and important class of data-driven DRO problems.
The work of [15] showed how DRO can be used to establish a connection to regularized logistic regression, thereby explaining the role of regularization in improving out-of-sample performance and data-driven decision making.
The works of [6, 4] further show that square-root Lasso, support vector machines, and other estimators such as group Lasso can be recovered exactly from DRO formulations. In turn, in [6], it is shown that exploiting the DRO formulation leads to an optimality criterion for choosing the associated regularization parameter in regularized estimators which is informed both by robustness and by statistical principles. Importantly, this criterion is closely aligned with a well-developed theory of high-dimensional statistics, but the criterion can be used even in the context of nonlinear decision-making problems.
The work of [12] shows that variance-regularized estimation can be cast in terms of a DRO formulation which renders the implementation of such an estimator both practical and amenable to rigorous statistical and computational guarantees.
The list of recent papers that exploit DRO for the design of statistical estimators, revisiting classical ideas to improve parameter tuning, interpretation, or implementation, is rapidly growing (see [6, 4, 5, 8, 7, 14, 37, 21, 12]).
This paper contributes to this rapidly expanding research activity by showing that DRO can naturally be used in the context of boosting statistical procedures. In particular, we are able to show that Adaptive Boosting (AdaBoost) can be viewed as a particular example of DROBoosting with a suitable loss function and size of uncertainty, see Corollary 2.
These connections, as indicated earlier, are useful to enhance interpretability. Further, as we explain, DRO provides a systematic and disciplined approach for designing boosting methods. Also, the DRO formulation can be naturally used to fit tuning parameters (such as the strength used to weight one model versus another in the face of statistical evidence) in a statistically principled way.
Finally, we provide easy-to-implement algorithms which can be used to apply our boosting methods, which we refer to as DROBoosting procedures.
As an application of our proposed family of procedures, we consider a classification problem in the context of a credit card default payment prediction problem. We find that our approach compares favorably to alternative boosting methods which are widely used in practice.
The rest of this paper is organized as follows. In Section 2 we provide a general discussion of boosting algorithms and introduce notation that is useful to formulate our problem. Section 3 contains a precise description of the algorithm and the results validating its correctness; this section also discusses a connection to AdaBoost. Section 4 contains the statistical analysis for the optimal selection of tuning parameters (such as the weight assigned to different models in the presence of available evidence). In Section 5, we show the results of our algorithms in an application to credit card default payments. In Section 6, we collect the proofs of the technical results.
2. General Background and Problem Formulation
We first start by describing a wide class of boosting algorithms. For concreteness, the reader may focus on a standard supervised classification framework. The ultimate goal is to predict the label associated with a predictive variable . We then discuss our DRO problem formulation.
2.1. General Notions for Boosting
Suppose that we have at our disposal a set of different classifiers. Typically, this set will contain finitely many elements, but the DROBoosting algorithm can be implemented, in principle, when the set is infinite dimensional, assuming a suitable Hilbert-space structure, as we shall discuss once we present our algorithms. We will keep this in mind, but to simplify the exposition, let us assume that the set contains finitely many elements. However, we will point out the ways in which the infinite-dimensional case can be considered.
The different elements of this set are called classifiers, also known as predicting functions or “learners”. (Throughout our discussion we shall use learner to refer to a given predicting function or a classifier.) For simplicity, let us assume that the class of learners has been trained with independent data sets.
So, based on the learner , we may have a decision rule to decide on the predicted label given the observation or predictor . For example, such a rule may simply take the form (where ) or, instead, the classifier may estimate the probability that the label is based on a probabilistic model which takes as input , in the context of logistic regression, which attaches probability proportional to to the label being and to the label being .
The idea of boosting is to combine the power of the available learners to create a stronger learner. [36] showed that the idea of combining relatively weak predictors or classifiers into a strong one having desirable probably approximately correct (PAC) guarantees [41] is feasible. Due to these appealing theoretical guarantees, boosting algorithms have attracted substantial attention in the community, first by considering static learning algorithms (e.g. [36, 17]) and, also, inspired by Littlestone and Warmuth’s reweighting algorithm, by considering adaptive algorithms, such as AdaBoost, which will be reviewed in the sequel. AdaBoost was proposed by Freund and Schapire [19] as an ensemble learning approach to classification problems, which adaptively combines multiple weak learners.
In the context of classification problems, many boosting algorithms can be viewed as a weighted majority vote of all the weak learners (where the weights are computed relative to the information carried by each of the weak learners). However, boosting-type algorithms are not only applicable to classification problems; they can also be used to combine the predictions of simple regressors to improve the predictive power in regression of non-categorical data, see for example [11].
We will now focus our discussion on finding boosting learners of the form where is the linear span generated by the class . In particular, if ,
More generally, we may assume that contains elements in a subspace of a Hilbert space and then is the closure (under the Hilbert norm) of finitely many linear combinations of the elements in .
As a consequence of our results, we will show that a suitable DRO formulation applied to the family of boosting learners can be used to systematically produce adaptive algorithms which can be connected to well-known procedures such as AdaBoost.
Now, let us assume that we are given a data set . The variable corresponds to the predictor and the outcome corresponds to the classification output or the label. As indicated earlier, let us assume that for concreteness. We use to denote the empirical measure associated with the data set . In particular, where is a point measure (or delta measure) centred at . We use to denote the expectation operator associated to the measure . So, in particular, for example, for any continuous and bounded , we have that
We introduce a loss function which can be used to measure the quality of our learner, namely, . Let us now discuss examples of loss functions of interest.
Given a learner , and an observation , the margin corresponding to the observation is defined as . A loss function is said to be a margin-type cost function if there exists a function such that . In the context of regression problems, a popular loss function is the squared loss, namely, .
A natural approach to search for a boosting learner is to solve the optimization problem
(1) 
The choice of the function to define the loss function is relevant to endow the boosting procedure with robustness characteristics (see, for example, [18, 35]). While we focus on robustness as well, our approach is different. We take the loss function as a given modeling input and, as we shall explain, robustify the boosting training procedure by quantifying the impact of perturbations of the data distribution. Our procedure could be applied in combination with a function chosen to mitigate outliers, for example, as is typically the case in standard robustness studies in statistics (see [26]). This combination may introduce a double layer of robustification and may be an interesting topic for future research, but we do not pursue it here.
The search for a best linear combination of functions from a family of weak learners that minimizes the training loss, as in (1), can be understood in the vein of functional gradient descent [20]. For example, the AdaBoost algorithm is an instance of the functional gradient descent algorithm with an exponential margin cost function.
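To make the training objective in (1) concrete, the following minimal sketch evaluates the empirical risk of a linear combination of weak learners under the exponential margin cost. The names `H`, `coef`, `exp_margin_loss` and `empirical_risk` are illustrative assumptions, labels are assumed to be encoded in {-1, +1}, and each weak learner is represented only through its predictions on the data.

```python
import numpy as np

def exp_margin_loss(margins):
    """Exponential margin cost phi(z) = exp(-z), applied elementwise."""
    return np.exp(-margins)

def empirical_risk(H, coef, y, phi=exp_margin_loss):
    """Average loss of the combined learner F = H @ coef on the sample.

    H    : (n, T) array, column t holds weak learner t's predictions
    coef : (T,) array of combination weights (hypothetical name)
    y    : (n,) array of labels in {-1, +1} (assumed encoding)
    """
    F = H @ coef          # combined prediction at each observation
    margins = y * F       # margin of each observation
    return float(np.mean(phi(margins)))

# Toy data: 4 observations, 2 weak learners with +/-1 predictions.
y = np.array([1.0, 1.0, -1.0, -1.0])
H = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, -1.0]])
risk_zero = empirical_risk(H, np.zeros(2), y)             # F = 0
risk_first = empirical_risk(H, np.array([0.5, 0.0]), y)   # weight on learner 0
```

In this toy example, putting positive weight on the first learner, which classifies three of the four points correctly, lowers the empirical risk relative to the zero function.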
The connection between functional gradient descent and AdaBoost will be further discussed in Section 3.
2.2. DROBoosting Formulation
The DRO formulation associated to (1) takes the form
(2) 
where is the distributional uncertainty set, centred around and the size of uncertainty is .
The motivation for introducing the inner maximization relative to (1) is to mitigate potential overfitting problems inherent to training directly via empirical risk minimization (i.e. formulation (1)). The data-driven DRO approach (2) tries to address overfitting by avoiding an exclusive focus on the observed evidence. Rather than performing empirical risk minimization, our DRO procedure finds the optimal decision that performs uniformly well around the empirical measure, where the uniform performance is evaluated within the distributional uncertainty set.
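In notation assumed here purely for illustration ($\mathrm{Lin}(\mathcal{F})$ for the linear span of the learners, $l$ for the loss, $P_n$ for the empirical measure of the $n$ observations $(X_i, Y_i)$, and $\mathcal{U}_{\delta}(P_n)$ for the uncertainty set of size $\delta$), the empirical risk minimization problem (1) and its robust counterpart (2) can be sketched as follows.

```latex
\begin{align*}
\text{(1)}\quad & \min_{f \in \mathrm{Lin}(\mathcal{F})} \; E_{P_n}\big[ l(f(X), Y) \big]
   \;=\; \min_{f \in \mathrm{Lin}(\mathcal{F})} \; \frac{1}{n} \sum_{i=1}^{n} l\big(f(X_i), Y_i\big), \\
\text{(2)}\quad & \min_{f \in \mathrm{Lin}(\mathcal{F})} \; \max_{P \in \mathcal{U}_{\delta}(P_n)} \; E_{P}\big[ l(f(X), Y) \big].
\end{align*}
```

Whenever the empirical measure itself belongs to the uncertainty set, the objective in (2) is never smaller than the objective in (1) for the same learner, which is the sense in which the inner maximization guards against an overly optimistic training loss.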
The data-driven DRO framework has been shown to be a valid procedure to improve the generalization performance of many machine learning algorithms. In the context of linear models, some of the popular machine learning algorithms with good generalization performance, for instance, regularized logistic regression, support vector machines (SVM), square-root LASSO, and group LASSO, among others, can be recovered exactly in terms of a DRO formulation such as (2) (see [5, 14, 4, 7]). In these applications, the uncertainty set is defined in terms of the Wasserstein distance (see also [9, 15, 21, 42]).
Other application areas in which data-driven DRO has shown promise, because of good empirical performance and theoretical guarantees, include reinforcement learning, graphical models, and deep learning (see [39, 16, 40]).
Finally, we also mention formulations of the distributional uncertainty set based on moment constraints, for instance [10, 24].
Our focus in this paper is on the use of the Kullback-Leibler (KL) divergence to describe the uncertainty set. We choose the KL divergence because we want to establish the connection to well-known boosting algorithms, and we wish to further enhance their interpretability in terms of robustness. But we emphasize, as mentioned earlier, that a wide range of DRO formulations can be applied. Formulations of problems such as (2) based on the KL and related divergence notions have been studied in recent years, see [31, 38, 1, 23, 27, 28].
Let denote the set of all probability distributions with support on . (Any element is a random measure because is a random set, but this issue is not relevant to implement our DROBoosting algorithm, we just consider as given. For statistical guarantees, this issue is relevant, and it will be dealt with in the sequel.)
For any distribution , we define the weight vector such that for . In particular, for the empirical distribution we have . Note that the set is isomorphic to the standard simplex
For any distribution , the KL divergence between and , denoted by (also known as the Relative Entropy or the Empirical Likelihood Ratio (ELR)) is defined as
It is a well-known consequence of Jensen’s inequality that this divergence is nonnegative, and that it equals zero if and only if the two distributions coincide. The distributional uncertainty set that we consider is defined via
By substituting the definition of the uncertainty set into (2), the DROBoosting model based on KL uncertainty sets is well defined given the loss function, the class of learners, and the choice of the uncertainty size, which will be discussed momentarily.
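As an illustration of the KL uncertainty set, the sketch below computes the divergence between a reweighting of the sample and the empirical distribution. It assumes the convention that the divergence of a weight vector $w$ on the $n$ observations from the uniform weights is $\sum_i w_i \log(n w_i)$; the function names and tolerances are assumptions made for this example.

```python
import numpy as np

def kl_to_empirical(w):
    """KL divergence of weights w from the uniform weights 1/n.

    Uses sum_i w_i * log(n * w_i), reading 0 * log(0) as 0.
    """
    w = np.asarray(w, dtype=float)
    n = w.size
    pos = w > 0
    return float(np.sum(w[pos] * np.log(n * w[pos])))

def in_uncertainty_set(w, delta, tol=1e-12):
    """Check that w is a probability vector within KL radius delta."""
    w = np.asarray(w, dtype=float)
    on_simplex = np.all(w >= 0) and abs(w.sum() - 1.0) <= 1e-9
    return bool(on_simplex and kl_to_empirical(w) <= delta + tol)

uniform = np.full(4, 0.25)                    # the empirical weights
tilted = np.array([0.4, 0.3, 0.2, 0.1])       # a mild reweighting
point_mass = np.array([1.0, 0.0, 0.0, 0.0])   # all mass on one observation

kl_uniform = kl_to_empirical(uniform)         # zero: same distribution
kl_tilted = kl_to_empirical(tilted)
kl_point = kl_to_empirical(point_mass)        # log(n), the largest value
```

Under this convention the divergence ranges from zero (uniform weights) to $\log n$ (a point mass), so an uncertainty size at or above $\log n$ places no restriction on the reweighting.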
3. Main Results
In this section, we propose an algorithm based on functional gradient descent to solve the DROBoosting problem (2). As we shall see, similar to the AdaBoost algorithm as in [20], the DROBoosting procedure alternates between reweighting the worst-case probability distribution and updating the predicting function. Therefore, the connection between DROBoosting and AdaBoost is naturally established. Moreover, since choosing different types of distributional uncertainty sets, potentially based on different notions of discrepancy between the empirical measure and alternative distributions, results in different reweighting regimes, it can be expected that our proposed DROBoosting framework is more flexible than AdaBoost.
For ease of notation and introduction of our functional gradient descent method, in the rest of this paper, we define two functionals. For any index , the empirical loss functional is defined as The robust loss functional is defined as
Due to the fact that any finitely supported distribution, , is characterized by its associated weights, , we can introduce the weight uncertainty set defined as
as an isomorphic counterpart of the distributional uncertainty set .
Thus, using the right hand side weight vector and the weight uncertainty set , the functional can be rewritten as
(3) 
We denote by the set of all maximizers to the above optimization problem, i.e.
Using these definitions, the DROBoosting formulation (2) admits an alternative expression, namely,
(4) 
consequently, it suffices to find a best predicting function that minimizes the robust loss functional. In order to guarantee that the optimization problem (4) has an optimal solution, we impose the following assumption.
Assumption 1 (Convex Loss Functional).
The empirical loss functional is assumed to be convex and continuous. In addition, there exist , such that the level set is compact.
Next, to simplify the exposition, we impose an assumption which guarantees that the data is rich enough relative to the class of learners. We will discuss how one can proceed if this assumption is violated.
Assumption 2 (Separation).
For any two different predicting functions , there exists at least one such that . In other words, the observed data is rich enough to distinguish or separate two different learners in .
Observation 1: There are natural situations in which the Separation Assumption may fail to hold. For example, if is a regression tree, namely, where the ’s are disjoint sets forming a partition of the domain of , then the Separation Assumption may not hold when considering . Nevertheless, we can define an equivalence relation “” via if and only if for all , and then work with equivalence classes instead. Moreover, since the dimension of the quotient space is at most , we may assume that is a finite-dimensional space without loss of generality. To ease the exposition, we next state the assumption that is finite dimensional.
Assumption 3 (Finite Representation).
We assume that a finite dimensional basis exists for (or has been extracted for ). So, we can write (or ) where form linearly independent functions .
If the dimension of is infinite, then will grow with and this will have consequences for the rate of convergence of the algorithm and the statistical analysis of the uncertainty set selection. But this is not important to implement and run the algorithms that we will present below.
We are now ready to summarize some properties of the robust loss functional in Lemma 1 and thereafter develop a subgradient descent algorithm to find the optimal robust decision rule by solving (2).
Lemma 1.
If Assumption 1 is imposed, then the robust loss functional is convex. In addition, the set of optimizers is guaranteed to be nonempty.
The predicting function controls the value of solely via . Thus, to derive the functional gradient descent algorithm, one needs to construct a metric on where the structure is determined only by as well. To this end, we consider the inner product , where
Due to Assumption 2, is a well-defined inner product on the space , and we denote by the induced norm of on . As the dimension of the space can be assumed to be finite due to Observation 1, the space endowed with the inner product is a Hilbert space. Note that the topology induced by is isomorphic to the Euclidean topology in .
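A minimal sketch of an inner product that depends on the learners only through their values at the observed predictors is given below. It assumes the empirical form $\langle f, g \rangle = \frac{1}{n}\sum_{i} f(X_i) g(X_i)$, which is a common choice in functional gradient boosting but is an assumption here, and it uses hypothetical names throughout. Under this choice, checking that the basis learners remain linearly independent on the data reduces to a rank check on their Gram matrix.

```python
import numpy as np

def empirical_inner(f_vals, g_vals):
    """Empirical inner product (1/n) * sum_i f(X_i) g(X_i)."""
    return float(np.mean(f_vals * g_vals))

def gram_matrix(H):
    """Gram matrix of the learners; H[:, t] holds learner t's values."""
    n = H.shape[0]
    return H.T @ H / n

def separates(H, tol=1e-10):
    """True if the learners are linearly independent on the data."""
    rank = np.linalg.matrix_rank(gram_matrix(H), tol=tol)
    return bool(rank == H.shape[1])

# Two learners that differ on the data, and a duplicated learner.
H_good = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, -1.0]])
H_bad = np.column_stack([H_good[:, 0], H_good[:, 0]])

ip = empirical_inner(H_good[:, 0], H_good[:, 1])
```

When the rank check fails, as for `H_bad`, two different coefficient vectors produce the same predictions on the sample, which is the situation handled by passing to equivalence classes.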
In order to formalize the functional subgradient algorithm, we first introduce the definition of functional subgradient and functional subdifferential.
Definition 1 (Functional SubGradient (Gradient)).
For a convex functional and , a linear functional , (i.e. ) is called the functional subgradient of at if
The functional subdifferential of at , denoted by , is defined as
If the functional subdifferential set is a singleton set, then the functional is said to be differentiable at , in which case the only element in is denoted by , called the functional gradient of at . Intuitively, one may regard the functional gradient as the response of an infinitesimal change in the predicting function .
Proposition 1.
Corollary 1.
Using Proposition 1 and Corollary 1, we can make sense of the functional subdifferential or even the functional gradient . However, in order to ensure that the trajectory of functional gradient descent lies in the space , one wants to find a weak learner in to approximate the functional gradient . This simply means finding a best approximation in such that
Using these observations, the functional subgradient descent algorithm for solving the DROBoosting problem (2) is given in Algorithm 1.
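One iteration of the alternation between reweighting and updating the predicting function can be sketched as follows for the exponential loss. This is an illustration under stated assumptions and not a transcription of Algorithm 1: the worst-case weights `q` are taken as given, weak learners are assumed to output values in {-1, +1}, the class is assumed to be closed under negation, and the step size is chosen by the exact line search that is available in closed form for the exponential loss. All names are hypothetical.

```python
import numpy as np

def boosting_step(H, y, F, q):
    """One reweight-then-fit step for the exponential loss.

    H : (n, T) array of +/-1 predictions of the weak learners
    y : (n,) labels in {-1, +1}
    F : (n,) current combined prediction at the observations
    q : (n,) worst-case probability weights (assumed given)
    Returns the chosen learner index, its signed step, and the new F.
    """
    # With q fixed, the negative subgradient of sum_i q_i exp(-y_i F_i)
    # with respect to F_i equals d_i * y_i, where d_i = q_i exp(-y_i F_i).
    d = q * np.exp(-y * F)
    d = d / d.sum()
    # Weighted edge of each learner: sum_i d_i y_i h(X_i).
    edges = H.T @ (d * y)
    j = int(np.argmax(np.abs(edges)))
    sign = 1.0 if edges[j] >= 0 else -1.0
    # Weighted error of the signed learner, then the closed-form step.
    eps = np.clip((1.0 - abs(edges[j])) / 2.0, 1e-12, 1.0 - 1e-12)
    alpha = 0.5 * np.log((1.0 - eps) / eps)
    return j, sign * alpha, F + sign * alpha * H[:, j]

y = np.array([1.0, 1.0, -1.0, -1.0])
H = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, -1.0]])
q = np.full(4, 0.25)
F0 = np.zeros(4)
j, step, F1 = boosting_step(H, y, F0, q)
loss_before = float(np.sum(q * np.exp(-y * F0)))
loss_after = float(np.sum(q * np.exp(-y * F1)))
```

With uniform `q`, the sample weights `d` and the step size coincide with the classical AdaBoost reweighting and update; a non-uniform `q` shifts additional weight toward observations with large loss before the weak learner is selected.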
Note that in Algorithm 1, we have to compute a worst-case probability weight for some functional . If , we can simply pick . Otherwise, the worst-case probability can be computed using Lemma 2.
Lemma 2.
For and , define
(6) 
Suppose that there exist such that , then is strictly increasing in . Furthermore, if we set , where is the unique root of , then .
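The root-finding step can be sketched numerically as follows. The sketch assumes that the worst-case weights take the exponentially tilted form that is standard for KL-constrained maximization, with weights proportional to $\exp(\beta \cdot \text{loss}_i)$ and the tilt $\beta \ge 0$ chosen so that the divergence equals the uncertainty size; the exact parametrization used in (6) may differ. The function names, the bisection scheme, and the handling of boundary cases are assumptions made for this example.

```python
import numpy as np

def tilted_weights(losses, beta):
    """Exponentially tilted weights, proportional to exp(beta * loss_i)."""
    z = beta * (losses - losses.max())    # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

def kl_to_uniform(w):
    """KL divergence sum_i w_i log(n w_i), reading 0 * log(0) as 0."""
    pos = w > 0
    return float(np.sum(w[pos] * np.log(w.size * w[pos])))

def worst_case_weights(losses, delta, tol=1e-12, max_iter=200):
    """Weights maximizing sum_i w_i * loss_i subject to KL <= delta."""
    losses = np.asarray(losses, dtype=float)
    n = losses.size
    if delta <= 0 or losses.max() == losses.min():
        return np.full(n, 1.0 / n)        # nothing to gain from tilting
    top = losses == losses.max()
    if delta >= np.log(n / top.sum()):
        return top / top.sum()            # all mass on the largest losses
    lo, hi = 0.0, 1.0
    while kl_to_uniform(tilted_weights(losses, hi)) < delta:
        hi *= 2.0                         # bracket the root
    for _ in range(max_iter):             # bisection on the tilt
        mid = 0.5 * (lo + hi)
        if kl_to_uniform(tilted_weights(losses, mid)) < delta:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return tilted_weights(losses, 0.5 * (lo + hi))

losses = np.array([0.1, 0.5, 2.0, 1.0])
w = worst_case_weights(losses, delta=0.1)
w_large = worst_case_weights(losses, delta=5.0)
```

Bisection is applicable because the divergence of the tilted weights is strictly increasing in the tilt whenever the losses are not all equal, which mirrors the monotonicity statement of the lemma.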
Corollary 2 shows that Algorithm 1 exactly recovers the AdaBoost algorithm proposed in [19]. The proof of Corollary 2 is elementary.
Corollary 2 (Connection to AdaBoost).
Remark 1 (Convergence Analysis).
Under the technical assumption that has Lipschitz continuous gradient, the convergence of Algorithm 1 follows from Theorem 2 of [30]. This assumption is typically violated in our setting. However, by introducing a soft maximum approximation to , as in Theorem 2 of [8], one may apply the results of [30] directly, at the expense of a small (user-controlled) error. We tested the convergence of the algorithm empirically, with success. The smoothing analysis will appear elsewhere.
4. Optimal Selection of the Distributional Uncertainty Size
We now explain how to choose the uncertainty size using statistical principles, avoiding cross-validation. The strategy is to invoke the theory of Empirical Likelihood, following the approach in [28, 5], as we shall explain. We assume that the data is i.i.d. from an underlying distribution , which is unknown. We use to denote the product measure governing the distribution of the whole sequence . Moreover, in this section we will assume, in Assumption 3, that where is fixed and independent of .
For each probability measure , there is an associated optimal predicting function as the minimizer to the risk minimization problem We assume the convexity and smoothness for the loss function as discussed previously in Section 3. Following the criterion discussed in [5], we define The set can be interpreted as a confidence region for the functional parameter and we are interested in minimizing the size of the confidence region, , while guaranteeing a desired level of coverage for the confidence region . In other words, we want to choose as the solution to the problem
(7) 
where is a desired confidence level, say 95%, which implies choosing . This problem is challenging to solve, but we will provide an asymptotic approximation to it as .
We make the following assumptions to proceed with the discussion.
Assumption 4 (First Order Optimality Condition).
The optimal choice is characterized via the corresponding firstorder optimality condition, which yields
By the definition of the functional gradient in Definition 1, we know .
The derivation of the asymptotic results relies on the central limit theorem, and we assume the existence of the second moment of the functional gradient.
Assumption 5 (Square Integrability).
We assume that
In order to compute according to (7) we will provide a more convenient representation for the set . For any , we can define a set of probability measures for which is an optimal boosting learner, that is, if . Consequently,
Define the smallest KullbackLeibler discrepancy between and any element of via
(8) 
It is immediate to see that if and only if there exists an element in such that . Consequently, we have
(9) 
The asymptotic analysis of the object , known as the Empirical Profile Likelihood (EPL), has been studied extensively. From (9) we have that the solution to (7) is precisely the quantile of . Therefore, to approximate , it suffices to estimate the asymptotic distribution of .
We apply the techniques in [28] to derive the asymptotic distribution of as , and pick the quantile of the asymptotic distribution as our uncertainty size.
Figure 1 gives an illustration of the equivalence in (9). The emphasized line corresponds to the KullbackLeibler divergence between the empirical distribution and the manifold .
The asymptotic result for (8) is similar to those in the empirical likelihood literature, as in [32, 34, 33]. These works show that the asymptotic distribution is chi-square and that the degrees of freedom depend on the estimating equation. We state the result in Theorem 1 below.
Theorem 1 (Generalized Empirical Likelihood Theorem).
We assume that our loss function is smooth, the functional gradient is well defined, and Assumption 4 and Assumption 5 hold. Then the EPL function defined in (8) has an asymptotic chi-square distribution, i.e.
(10)
where is the chi-square distribution with degrees of freedom equal to and stands for weak convergence.
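To illustrate how a chi-square limit translates into an uncertainty size, the sketch below assumes that the limit takes the standard empirical likelihood form, in which $2n$ times the profile function converges to a chi-square distribution; the scaling in (10) should be checked against this assumption. To stay within the standard library, the quantile is computed with the Wilson-Hilferty approximation rather than exactly, and all function names are hypothetical.

```python
from statistics import NormalDist

def chi2_quantile_wh(p, dof):
    """Wilson-Hilferty approximation to the chi-square quantile."""
    z = NormalDist().inv_cdf(p)           # standard normal quantile
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3

def uncertainty_size(n, dof, confidence=0.95):
    """Uncertainty size from the assumed limit 2 * n * R_n => chi-square(dof)."""
    return chi2_quantile_wh(confidence, dof) / (2.0 * n)

# Illustrative values: 3000 training points, about 30 degrees of freedom.
q95 = chi2_quantile_wh(0.95, 30)
delta = uncertainty_size(n=3000, dof=30)
```

An exact quantile function, such as the one provided by SciPy's chi-square distribution, can replace the approximation when that library is available. The resulting size shrinks at rate $1/n$ for a fixed number of degrees of freedom, so the robustification vanishes as the sample grows.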
5. Numerical Experiments
In this section, we apply our DROBoosting algorithm to predict default in credit card payment. We use the Taiwan credit-card default payment data from the UCI machine learning repository [29]. The data has 23 predictors, and the binary response indicates non-default or default. There are 30000 observations in the data set, with 6636 defaults and 23364 non-defaults.
We take binary classification trees as the weak learners (basis functions), and we use the exponential loss function for model fitting. We compare our model with the AdaBoost algorithm, which is the state-of-the-art boosting algorithm in practice.
In each run, we randomly split the data into a training set, with 3000 observations, and a testing set, with the remaining 27000 data points. We train the model on the training set as illustrated in Algorithm 1. We use a 5-layer classification tree as the basis function, and we assume the dimension (or effective degrees of freedom) of the functional space is roughly 30. To pick the uncertainty set, we apply the method introduced in Section 4, where we pick the level of uncertainty to be . We choose this relatively complex weak learner mainly because we want to explore a case where the AdaBoost model is more likely to overfit the data.
We report the accuracy, true positive rate, false negative rate, and the exponential loss in Table 1, where we can observe superior performance for the DROBoosting model on the testing set.

[Table 1: performance metrics, including the Average Exponential Loss, for AdaBoost and DROBoosting on the Training Set and on the Testing Set; the numerical entries are missing from the source.]
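The reported metrics can be computed from the real-valued output of a fitted boosting learner as in the following sketch. It assumes labels encoded in {-1, +1} with +1 taken as the positive class, classification by the sign of the score, and hypothetical names; the toy labels and scores are made up for illustration and are not results from the experiment.

```python
import numpy as np

def classification_report(y_true, scores):
    """Accuracy, true positive rate, false negative rate, exponential loss.

    y_true : (n,) labels in {-1, +1}, with +1 the positive class (assumed)
    scores : (n,) real-valued output of the combined learner
    """
    y_pred = np.where(scores >= 0, 1.0, -1.0)
    pos = y_true == 1
    n_pos = max(int(pos.sum()), 1)        # avoid division by zero
    tp = int(np.sum(pos & (y_pred == 1)))
    fn = int(np.sum(pos & (y_pred == -1)))
    return {
        "accuracy": float(np.mean(y_pred == y_true)),
        "true_positive_rate": tp / n_pos,
        "false_negative_rate": fn / n_pos,
        "avg_exponential_loss": float(np.mean(np.exp(-y_true * scores))),
    }

# Made-up labels and scores, for illustration only.
y_true = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
scores = np.array([0.8, -0.2, -1.5, 0.3, 2.0])
report = classification_report(y_true, scores)
```

The true positive rate and the false negative rate sum to one by construction, so the two columns carry the same information from opposite directions.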
We can observe from our numerical experiment that the DRO formulation helps improve the performance on the testing set. The worst-case expected loss function tries to mimic the testing error and avoids overemphasis on the training set. Thus we observe a higher training error using DRO, while the advantage is reflected in the testing error.
6. Proofs of the Technical Results
The proofs are presented in the order in which the results appear in the paper.
Proof of Lemma 1.
For the first argument, each is a convex functional due to Assumption 1, so is their convex combination . After taking maximization over , the functional is still convex. For the second argument, the objective function on the right hand side of (3) is continuous in and the feasible region is compact, so the optimizer set is guaranteed to be nonempty. ∎
Proof of Proposition 1.
According to [25] Chapter D, Theorem 4.1.1, we have
for all . In addition, note that each is convex and upper semicontinuous according to Assumption 1, so , as a convex combination of , is also convex and upper semicontinuous. Furthermore, the set is a compact set. Consequently, using [25], Chapter D, Theorem 4.4.2, the desired result (5) is proved. ∎
Proof of Lemma 2.
The strict monotonicity of the function can be shown by taking the derivative and applying the Cauchy-Schwarz inequality. Using the fact that is the root of , it follows that . The optimality of is proved by verifying the Karush-Kuhn-Tucker conditions of the convex optimization problem . ∎
Proof of Theorem 1.
The EPL function (8) is a constrained convex optimization problem. We can write the optimization problem in Lagrangian form as where is the Lagrange multiplier. The optimization problem can be solved via its first-order optimality condition, which gives and We denote , , and . Then, applying Lemma 11.1 and Lemma 11.2 in [33], we have
(11)  
(12) 
Therefore, for the EPL function, we have
The first two equalities hold by definition, the fourth, fifth and sixth follow by applying (11) and (12), and the final one follows from the central limit theorem for the estimating equation. ∎
References
[1] Bayraksan, G. and Love, D. K. (2015). Data-driven stochastic programming using phi-divergences. In The Operations Research Revolution, pages 1–19. INFORMS.
[2] Ben-Tal, A. and Nemirovski, A. (2000). Robust solutions of linear programming problems contaminated with uncertain data. Mathematical Programming, 88(3):411–424.
[3] Bertsimas, D. and Sim, M. (2004). The price of robustness. Operations Research, 52(1):35–53.
[4] Blanchet, J. and Kang, Y. (2017a). Distributionally robust groupwise regularization estimator. Manuscript.
[5] Blanchet, J. and Kang, Y. (2017b). Distributionally robust semi-supervised learning. arXiv preprint arXiv:1702.08848.
[6] Blanchet, J., Kang, Y., and Murthy, K. (2016). Robust Wasserstein profile inference and applications to machine learning. arXiv preprint arXiv:1610.05627.
[7] Blanchet, J., Kang, Y., Zhang, F., He, F., and Hu, Z. (2017a). Doubly robust data-driven distributionally robust optimization. arXiv preprint arXiv:1705.07168.
[8] Blanchet, J., Kang, Y., Zhang, F., and Murthy, K. (2017b). Data-driven optimal transport cost selection for distributionally robust optimization. arXiv preprint arXiv:1705.07152.
[9] Blanchet, J. and Murthy, K. (2016). Quantifying distributional model risk via optimal transport.
[10] Delage, E. and Ye, Y. (2010). Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612.
[11] Drucker, H. (1997). Improving regressors using boosting techniques. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 107–115. Morgan Kaufmann Publishers Inc.
[12] Duchi, J., Glynn, P., and Namkoong, H. (2016). Statistics of robust optimization: A generalized empirical likelihood approach. arXiv preprint arXiv:1610.03425.
[13] Erdoğan, E. and Iyengar, G. (2006). Ambiguous chance constrained problems and robust optimization. Mathematical Programming, 107(1-2):37–61.
[14] Esfahani, P. M. and Kuhn, D. (2015). Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. arXiv preprint arXiv:1505.05116.
[15] Esfahani, P. M. and Kuhn, D. (2018). Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1-2):115–166.
[16] Fathony, R., Rezaei, A., Bashiri, M. A., Zhang, X., and Ziebart, B. (2018). Distributionally robust graphical models. In Advances in Neural Information Processing Systems, pages 8354–8365.
[17] Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256–285.
[18] Freund, Y. (2009). A more robust boosting algorithm. arXiv preprint arXiv:0905.2138.
[19] Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139.
[20] Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232.
[21] Gao, R. and Kleywegt, A. J. (2016). Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199.
[22] Ghaoui, L. E., Oks, M., and Oustry, F. (2003). Worst-case value-at-risk and robust portfolio optimization: A conic programming approach. Operations Research, 51(4):543–556.
[23] Ghosh, S. and Lam, H. (2019). Robust analysis in stochastic simulation: Computation and performance guarantees. Operations Research.
[24] Goh, J. and Sim, M. (2010). Distributionally robust optimization and its tractable approximations. Operations Research, 58(4-part-1):902–917.
[25] Hiriart-Urruty, J.-B. and Lemaréchal, C. (2012). Fundamentals of convex analysis. Springer Science & Business Media.
[26] Huber, P. J. (1971). Robust statistics. Technical report, PRINCETON UNIV NJ.
[27] Lam, H. (2016). Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization. arXiv preprint arXiv:1605.09349.
[28] Lam, H. and Zhou, E. (2017). The empirical likelihood approach to quantifying uncertainty in sample average approximation. Operations Research Letters, 45(4):301–307.
[29] Lichman, M. (2013). UCI machine learning repository.
[30] Mason, L., Baxter, J., Bartlett, P. L., and Frean, M. R. (2000). Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems, pages 512–518.
[31] Namkoong, H. and Duchi, J. C. (2016). Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems, pages 2208–2216.
[32] Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75(2):237–249.
[33] Owen, A. B. (2001). Empirical likelihood. Chapman and Hall/CRC.
[34] Qin, J. and Lawless, J. (1994). Empirical likelihood and general estimating equations. The Annals of Statistics, pages 300–325.
[35] Rosset, S. (2005). Robust boosting and its relation to bagging. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 249–255. ACM.
[36] Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2):197–227.
[37] Shafieezadeh-Abadeh, S., Esfahani, P. M., and Kuhn, D. (2015). Distributionally robust logistic regression. In Advances in Neural Information Processing Systems, pages 1576–1584.
[38] Shapiro, A. (2017). Distributionally robust stochastic programming. SIAM Journal on Optimization, 27(4):2258–2275.
[39] Sinha, A., Namkoong, H., and Duchi, J. (2017). Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571.
[40] Smirnova, E., Dohmatob, E., and Mary, J. (2019). Distributionally robust reinforcement learning. arXiv preprint arXiv:1902.08708.
[41] Valiant, L. G. (1984). A theory of the learnable. In Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, pages 436–445. ACM.
[42] Zhao, C. and Guan, Y. (2018). Data-driven risk-averse stochastic optimization with Wasserstein metric. Operations Research Letters, 46(2):262–267.