Optimization with Gradient-Boosted Trees and Risk Control

Optimization with Gradient-Boosted Trees and Risk Control


Decision trees effectively represent the sparse, high dimensional and noisy nature of chemical data from experiments. Having learned a function from this data, we may want to thereafter optimize the function, e.g., picking the best chemical process catalyst. In this way, we may repurpose legacy predictive models. This work studies a large-scale, industrially-relevant mixed-integer quadratic optimization problem involving: (i) gradient-boosted pre-trained regression trees modeling catalyst behavior, (ii) penalty functions mitigating risk, and (iii) penalties enforcing composition constraints. We develop heuristic methods and an exact, branch-and-bound algorithm leveraging structural properties of gradient-boosted trees and penalty functions. We numerically test our methods on an industrial instance.


1 Introduction

Machine learning traditionally focusses on model predictivity and the computational expense of training. Optimization in the machine learning literature usually refers to the training procedure, e.g., model accuracy maximization \citepsra2012,NIPS2012_4522. Here, we investigate optimization problems where the pretrained model prediction appears in the objective function. Modeling of physical and chemical processes is a major industrial example, where not only we require tools to predict the outcome of certain operating conditions, but also guidance towards a better set of operating conditions with respect to some performance criteria.

One option is using smooth and continuous machine learning models, e.g., neural networks and support vector machines with radial basis function kernels, to represent these physical and chemical processes. The resulting optimization models may be addressed using local nonlinear optimization methods \citepNocedal2006. But feasible solutions come without a quantifiable guarantee of globality, even when using multi-start algorithms to escape local minima. The value of global optimization is known in the chemical processing literature \citepBOUKOUVALA2016701, e.g., local minima can lead to infeasible parameter estimation \citepdoi:10.1021/jp0548873 or misinterpreted chemical data \citepBOLLAS20091768. For applications where global optimization is less relevant, we still wish to develop optimization methods for discrete and non-smooth machine learning models, e.g., regression trees. Discrete optimization methods allow repurposing a legacy model, originally built for prediction, into an optimization framework.

Also, although many machine learning techniques deal with feature correlation in training data, careful formulation of the subsequent optimization problem is essential to avoid implicit and unintentional feature causal roles. Consider, for instance, using historical data from a manufacturing process for quality maximization. The data may exhibit correlation between two process parameters: the temperature and the concentration of a chemical additive. A machine learning model of the system assigns weights to the parameters for future predictions. Lacking additional information, numerical optimization may produce candidate solutions with temperature and concentration combinations that are drastically different from past observations. For instance, the machine learning model may incorrectly associate temperature as responsible for an observed effect. A remedy to control the optimizer’s adventurousness is to include a term penalizing deviation from the training data subspace. Large values of this risk control parameter generates conservative solutions. Smaller values of the penalty term explores regions with greater possible rewards at the cost of additional uncertainty.

This work elaborates on optimization methods for problems whose objective includes gradient-boosted tree (GBT) model predictions \citep10.2307/2699986,elements and risk terms to penalize deviation from the covariance structure of training data. Advantages of GBTs are myriad and justify their prominence among the winners in machine learning competitions \citepChen:2016:XST:2939672.2939785,NIPS2017_6907. GBTs are robust to scale differences in the training data features, handle easily both categorical and numerical variables, and can minimize arbitrary differentiable loss functions. In addition, much work has been done to accelerate the training of GBTs using graphical processing units and distributed computing resources \citep2017arXiv170608359Z.

Our approach formulates a mixed-integer quadratic programming (MIQP) problem whose objective consists of a discrete GBT trained function and a continuous convex penalty function. Our goal is to design exact optimization methods computing either globally optimal solutions, or solutions within a quantified distance from the global optimum. We formulate the problem as an MIQP and solve a large industrial instance using commercial approaches. Motivated by weak numerical findings, we develop a novel branch-and-bound method that exploits the tree ensemble combinatorial structure and the penalty function convexity in an integrated setting. Numerical analysis substantiates the strength of our approach.

The paper proceeds as follows. \Crefsec:problem_definition formally defines the optimization problem. \Crefsec:modelling formulates the problem as an MIQP and performs worst cases analysis. \Crefsec:feasibility develops heuristics. \Crefsec:branch_and_bound presents our branch-and-bound method. \Crefsec:numerical_analysis presents a numerical analysis on a large-scale industrial instance. Finally, \crefsec:conclusion concludes.

2 Optimization Problem

We analyze an optimization problem that consists of penalty functions and gradient-boosted tree (GBT) trained functions \citep10.2307/2699986,FRIEDMAN2002367. GBTs are a subclass of boosting methods \citepFREUND1995256. Boosting methods convert a collection of weak learners into a strong learner, where a weak learner is at least better than random guessing. For GBTs, the weak learners are classification and regression trees [Breiman et al.(1984)Breiman, Friedman, Olshen, and Stone].

In this work, our analysis is restricted to GBTs that only consist of regression trees, i.e., no categorical variables. Each level of a GBT splits variable into and where is some constant defined at training. A GBT trained function is a collection of binary trees, each of which provides its own independent contribution when evaluating at . The overall output is the sum of all tree evaluations. For a given , a tree evaluation follows a root-to-leaf path by querying or and following the left or right child, respectively, in each tree level. The leaf that corresponds to contains the tree’s contribution.

Symbol Description
, Lower and upper bound of variable
Continuous variable,
For convex part:
PCA penalty parameter
identity matrix
PCA projection matrix
Vector of means
diagonal matrix of standard deviations
Indices that sum to 100%
For gradient boosted part:
Indices of GBTs
Indices of leaves for tree
Indices of split nodes for tree
Variable ’s -th breakpoint
Value of leaf
Binary variable indicating whether variable
Nonnegative variable that activates leaf
Table 1: Model sets, parameters and variables.

Assume that we want to optimize a GBT trained function. Since the GBT function approximates an unknown function, we may trust an optimal solution close to regions with many training points. To better approximate the remaining regions, we add a convex penalty term whose parameters are obtained by principal component analysis (PCA) on the training data. We consider the following optimization problem:


and is the variable vector. is the GBT trained function value at . Penalty parameter , identity matrix , projection matrix , mean , and diagonal matrix of standard deviations define the convex penalty term. Variable subset contains variables that should sum close to 100%. Parameters enforce box constraints that lower and upper bound variable . \Creftab:sets_and_params defines the model sets, parameters and variables.

The additional term, which encourages the subset of variables to sum close to 100%, is an artifact of the chemical processing application. We expect that the mass fraction of chemical compositions sums to 100 and, in the application, this term in the objective is equivalent to a constraint as there are several inert chemicals, i.e., variables that participate in the mass fraction sum but do not appear in the GBTs. But, more generally, our method can include any convex penalty term.

The problem instances that this paper addresses consist of a sum of independently trained GBT functions. However, without loss of generality, we equivalently optimize a single GBT function which is the union of all original GBTs.

3 Mixed-Integer Quadratic Formulation


eq:full_formulation consists of a continuous penalty function and a discrete GBT function. The discrete nature of the GBT function arises from the left/right decisions at the split nodes. So we consider a mixed-integer quadratic programming (MIQP) formulation. The main ingredient of the MIQP model is a mixed-integer linear programming (MILP) formulation of the GBT part which merges with the convex part via a linking constraint. The high level MIQP is:

s.t. (2b)

MILP approaches for machine learning show competitive performance for some important applications \citepbertsimas2014,doi:10.1287/opre.2015.1436,bertsimas2016,Bertsimas2017,MIYASHIRO2015721. The performance of MILP methods arises from the significant improvement of mixed-integer algorithms \citepbixby2012brief and the availability of commercial codebases.

3.1 GBT MILP Formulation

Figure 1: Gradient boosted tree trained in two dimensions. Left: gradient boosted tree. Right: domain partition.

We form the GBT MILP using the \citet2017arXiv170510883M approach. \Creffig:twod shows how a GBT partitions the domain of . Optimizing a GBT function reduces to optimizing the leaf selection, i.e., finding an optimal interval, opposed to a specific value. Aggregating over all GBT split nodes produces a vector of ordered breakpoints for each variable: . Consecutive pairs of breakpoints define a set of intervals where the GBT function is constant. Each point is either on a breakpoint or in the interior of an interval. Binary variable models whether for and . Binary variable is 1 if tree evaluates at node and 0 otherwise. Denote by the set of split nodes for tree . Moreover, let and be the sets of leaf nodes in the subtrees rooted in the left and right children of split node , respectively.

The GBT problem is formulated by \crefeq:ilp_sub. \Crefeq:ob_ilp minimizes the total value of the active leaves. \Crefeq:sosz_ilp selects exactly one leaf per tree. \Crefeq:ilp_left_select,eq:ilp_right_select enforce that a leaf is activated only if all corresponding splits occur. \Crefeq:ilp_y_order ensures that if , then . Without loss of generality, we may drop variable integrality constraint because any feasible assignment of specifies a single leaf, i.e., a single region in \creffig:twod.

s.t. (3b)

3.2 Linking Constraints


eq:link1,eq:link2 relate the continuous variables to the binary variables as follows:


for all . Note that we express the linking constraints using non-strict inequalities to avoid computational issues when optimizing with strict inequalities. Combining \crefeq:high_level,eq:ilp_sub,eq:link defines the mixed-integer quadratic programming (MIQP) formulation of \crefeq:full_formulation.

3.3 Worst Case Analysis

The difficulty in solving the optimization problem, i.e., \crefeq:full_formulation, is primarily justified by the fact that optimising a GBT trained function, i.e., \crefeq:ilp_sub, is an NP-hard problem \citep2017arXiv170510883M. In what follows, we provide theoretical justification that the number of continuous variable splits and tree depth affect the performance of an exact method based on complete enumeration. These parameters motivate the branching scheme in our branch-and-bound algorithm.

In a GBT ensemble, each continuous variable is associated with intervals (splits). Picking one interval for each sums to a total of distinct combinations. A GBT trained function evaluation selects a leaf from each tree. However, not all leaf combinations of leafs are valid evaluations. In a consistent leaf combination where one leaf enforces and another enforces , it must be the case that . Let be maximum depth of a tree in . Then, the number of leaf combinations is upper bounded by . Since the number of feasibility checks for a single combination is , an upper bound on the total number of feasibility checks is . Among others, this observation implies that the worst case performance of an exact method improves as the number of trees decreases.

4 Heuristics

We propose heuristic methods that generate good feasible solutions to our optimization problem based on two approaches: (i) mixed-integer quadratic programming (MIQP), and (ii) particle swarm optimization (PSO) \citep494215,488968. The former approach is motivated by the decomposability of GBT ensembles, while the latter one exploits trade-offs between the convex part and the GBT part in the objective function.

MIQP-based heuristic. While commercial MIQP solvers provide weak feasible solutions for large-scale \crefeq:full_formulation problem instances, they may efficiently solve moderate instances to global optimality. This observation motivates the computation of heuristic solutions by decomposing an MIQP instance into smaller MIQP sub-instances. A sub-instance is restricted to a subset of GBTs. Our MIQP-based heuristic solves sub-instances iteratively. Let be the subset of trees when the -th heuristic iteration begins. Initially, . At each iteration, subset is computed by the union of and additional trees in , i.e., . Denote by the sub-instance optimal solution for the subset of trees, e.g., is the optimal solution computed by solely optimizing the convex part.

We propose two approaches for picking the subsequent trees. Our first approach selects the trees in according to the order in which the trees are generated during the training process. The trees are constructed iteratively one by one and each new tree aims in minimizing the GBT ensemble error with respect to the training data. Therefore, a subset of early generated trees is expected to provide a better approximation of the GBT trained function compared to the one computed by a subset among the latest generated trees. In our second approach, the -th iteration selects the trees with the maximum contribution when evaluating at . These trees are expected tighten the sub-instance approximation the most.

PSO-based heuristic. PSO computes a good heuristic solution by triggering particles that collaboratively search the feasibility space. We pick the initial particle positions randomly. The search occurs in a sequence of rounds. In each round, every particle chooses its next position following the direction specified by a weighted sum of (i) the globally best found solution, (ii) the particle’s best found solution, (iii) the direction of its current trajectory, and moving by a fixed step size. Termination occurs either when all particles are close, or within a specified time limit. A key observation is that we should avoid initializing the particle positions in feasible regions strictly dominated by the convex term in \crefeq:full_formulation. Furthermore, we should ensure sufficiently large initial distances between the particles. We improve the PSO performance by selecting random points and projecting them relatively close to regions where the convex term does not strictly dominate the GBT term. The projected points are the initial particle positions.

5 Branch-and-Bound Algorithm

Branch-and-bound is an exact method used for solving mixed-integer nonlinear optimization problems. Using a divide-and-conquer principle, branch-and-bound forms a tree of subproblems to search the domain of feasible solutions. Key aspects of branch-and-bound are: (i) rigorous lower (upper) bounding methods for the minimization (maximization) subproblems, (ii) choosing the next branch and (iii) generating good feasible solutions. In the worst case, branch-and-bound enumerates all solutions, but generally it avoids complete enumeration by pruning subproblems, i.e., removing infeasible subproblems or nodes with lower bound larger than the best found feasible solution \citepMORRISON201679. This section exploits spatial branching that splits on continuous variables and is primarily used for handling nonconvexities, e.g., see \citetbelotti_kirches_leyffer_linderoth_luedtke_mahajan_2013.

5.1 Overview

The branch-and-bound algorithm spatially branches over the domain. It selects a variable , a point and splits interval into intervals and . Each of these intervals correspond to an independent subproblem and a new branch-and-bound node. At a given node, denote the reduced domain by . The spatial branching choice is a critical issue and arises from the relationship of the continuous variables and \crefsec:modelling binary variables (any feasible is an interval of ). A subproblem over immediately fixes some of the . To avoid redundant branches, all GBT splits define the branch-and-bound branching points.

The remainder of this section is structured as follows. \Crefsubsec:bb_lb discusses a lower bound for the \crefeq:full_formulation optimization problem. \Crefsubsec:preprocess generates an ordering to the GBT node splits to aid subproblem lower bounding. \Crefsubsec:strong_branch explains how the branch-and-bound algorithm leverages strong branching for relatively cheap node pruning.

5.2 Lower Bounding

Effective lower bounding is fundamental to any branch-and-bound algorithm. The MIQP consists of a convex (penalty) part and an mixed-integer linear (GBT) part, handling these parts independently forms the basis of our lower bounds.

Consider assessing over domain . \Crefeq:full_formulation restricted to results in optimal objective value , the tightest relaxation, i.e., lower bound. But finding is difficult, so we lower bound with:


eq:lb treats the two \crefeq:full_formulation objective terms as independent, i.e.,  separates the convex part from the GBT part. A mixed-integer model for \crefeq:lb consists of \crefeq:high_level,eq:ilp_sub, i.e., without the \crefeq:link linking constraints. From a computational perspective, the \crefeq:lb separation leverages the easy-to-solve convex part and the availability of commercial codes for the MILP GBT part. In general, , but if only corresponds to a single leaf for each GBT, . Tractability is an important when lower bounding. For , finding is easy, but finding is NP-hard \citep2017arXiv170510883M. With the aim of tractability, we calculate a relaxation on , at the expense of tightness.

Bound for GBTs

When evaluating the GBT functions at a given , each tree provides its own independent contribution, i.e., a single leaf. A feasible selection of leaves has to be consistent with respect to the GBT node splits, i.e., if one leaf splits on and another splits on then . Relaxing this consistency requirement derives lower bounds on . The most naïve bound that this approach achieves is , i.e., ignore all splits. Lower bounds and of the GBT part represent two extremes: the former is generally intractable and tight whereas the latter is tractable and typically weak. We improve on while aiming to retain tractability by considering a partition, , of the set of trees, i.e., where , for , and . Let be the minimum GBT objective value restricted to the subset of trees. For each , this term can be computed using \crefeq:ilp_sub formulation and we set where the term denotes the resulting lower bound using partition . We have for any partition of .

At a given node, since we are dealing with the reduced domain , we may improve on by only considering reachable leaves. Moreover, we may improve on by setting or for any that corresponds to or , respectively.

The branch-and-bound algorithm chooses an initial partition at the root node with subsets of size . Our choice of for the instance in \crefsec:numerical_analysis has been decided numerically. The important factors to consider are the tree depth, the number of continuous variable splits and their relation with the number of binary variables for a subset of size .

Local Lower Bound Improvement

Calculating the updated bound in a branch-and-bound node is computationally easy. Furthermore, at least one among two sibling nodes has the same convex bound with the parent node. Therefore, a single computation suffices to determine the convex bounds of two sibling nodes.

On the GBT side, the branch-and-bound algorithm initially calculates a global lower bound with partition . When the algorithm is at some non-root branch-and-bound node with domain , some of the variables are fixed. Since the number of valid GBT nodes decreases as the algorithm descends lower in the branch-and-bound tree, a tighter GBT bound may be computed by reducing the number of subsets in the partition, without very significant overhead. However, this reduction should occur with moderation because, while fixing binary variables simplifies the local problem, a complete recalculation may still carry too much of an expense when considering the cumulative use of time across all subproblems.

Consider a node with a known valid bound for the parent node. This bound as well as the corresponding individual lower bounds are also valid for node . In general, for two partitions and we do not know a priori which partition results in a superior GBT lower bound. However, if and are such that every is contained in some , then . Therefore, given partition for the parent node, constructing for the child node by unifying subsets of will not result in inferior lower bounds.

To improve upon at node , we first sort the subsets of in non-decreasing order with respect to the number of non-fixed variables corresponding to . Let this ordering be . Then, we iteratively take the union of consecutive pairs and calculate the associated lower bound, i.e., the first calculation is for and the second is for and so forth. The iterations terminate at a user defined time limit resulting in two sets of bounds: those that are combined and recalculated, and those that remain unchanged. Assuming that the final subset that is updated has index , the new partition of the trees at node is with GBT bound . The second sum is a result of having the time limit on updating the GBT lower bound. This time limit is necessary to maintain a balance between searching and bounding.

Node Pruning

In the branch-and-bound algorithm, each node has access to: (i) the current best found feasible objective , (ii) a lower bound on the convex penalties , and (iii) a lower bound on the GBT part . The algorithm prunes node if . This condition is valid since the left hand side tells us that all feasible solutions in have objective inferior to .

5.3 Branch Ordering

The performance of the branch-and-bound algorithm depends on its ability to prune nodes. The pruning condition depends on the quality of the best found feasible solution and the tightness of . Bound is tightly computed. A feasible solution may be improved by local search at a branch-and-bound node. We tighten using the \crefsubsec:bb_lb lower bounding method whose performance depends on how many binary variables are fixed at node . Branching decisions should be effective at fixing binary variables in order to improve computation.

Since all branches of the branch-and-bound algorithm are splits from the GBTs, the GBT lower bounding step benefits from branching choices that can fix a larger number of binary variables. We order the splits by preprocessing the GBTs. Each split node contains a split pair and has depth . Root node has . As \creffig:tree_weight_2 shows, each split node assigns to its pair a weight . A given pair may repeat in a single tree and over different trees. Hence, we sum up all its weights. Sorting the split pairs in non-increasing weight order defines the branch ordering.

The choice of favors split pairs that are higher in their respective GBTs. Branching on such a split pair should be more effective at fixing binary variables. Furthermore, since we sum the weights, the branch ordering also prioritizes pairs occuring more often. This weighting also incorporates a fairness property. Assume a GBT where the root splits on , while both children split on . Then, these split pairs are interchangable, i.e.,  could be the root and the two children. Therefore, they should have the same weight. This idea extends locally to a non-root node. Choosing a local weight of covers this case since the weights are summed up and the GBTs are binary trees. \Creffig:tree_weight_2 shows this interchangability property for an unbalanced tree, where the bottom tree gives a more balanced structure than the top tree.

Figure 2: GBT node weighting computed at preprocessing step. Pairs and are interchangeable. So, their summed weights are equal to ensure the fairness property.

5.4 Strong Branching

Branch selection is fundamental to any branch-and-bound algorithm. Strong branching selects a branch that enables pruning with low effort computations and achieves a non-negligible speed-up in the algorithm’s performance \citepMORRISON201679. Strong branching is known to have increased the size of efficiently solvable large-scale mixed-integer problems and is a major component of commercial solvers \citepKlabjan2001,Anstreicher2002,Anstreicher2003,10.1007/978-3-540-45157-0_6,doi:10.1080/10556780903087124,Kılınç2014. We use strong branching to leverage the easy-to-solve convex penalty term for pruning.

At a branch-and-bound node , branching leads in two children and . One node among and inherits the convex bound from the parent, while the other requires a new computation. We employ strong branching by investigating the first branches in the list generated according to \crefsubsec:preprocess. Without loss of generality, suppose that does not inherit . If one of the possible branches results in that satisfies the pruning condition without GBT bound improvement, then it is immediately selected as the strong branch and we proceed with node . \Creffig:strong_branch illustrates strong branching. If a strong branch cannot be found, the algorithm performs a local GBT lower bound improvement (\crefsubsubsec:local_lb_improve). If the new local GBT lower bound is not large enough to prune the current node, the algorithm branches on the first item of the branch ordering, it adds the children to the list of unexplored nodes and continues with the next iteration. The above cases ensure the selection of some branch.

Strong branching allows efficient pruning in regions where the contribution of the convex part in the objective is significant. Such a branch selection allows early node pruning in the branch-and-bound tree and avoids a large number of useless node repetitions. Furthermore, it reduces the computational overhead incurred by bound recalculation. Finally, it results in a sharper bound tightening at a local level.




Strong Branch

Figure 3: Strong branching for selecting the next spatial branch. A strong branch leads to a node that is immediately pruned, based on a convex bound computation.

6 Numerical Analysis

This section compares the \crefsec:feasibility,sec:branch_and_bound algorithms against black-box solvers. The algorithms are implemented in Python 3.5.3 using Pyomo 5.2 [Hart et al.(2011)Hart, Watson, and Woodruff, Hart et al.(2017)Hart, Laird, Watson, Woodruff, Hackebeil, Nicholson, and Siirola] for mixed-integer modeling and solver interfacing. We use CPLEX 12.7 and Gurobi 6.0.3. Experiments are run on an Ubuntu 16.04 HP EliteDesk 800 G1 TWR with 16GB RAM and an Intel Core i7-4770@3.40GHz CPU.

CPLEX 12.7 Gurobi 6.0.3
0 * -157,678 -84 -787 -168
1 * -1,109 -63 -736 -118
10 952 -773 6 -769 -81
100 1,040 -835 -3 -678 -83
1000 18,579 -846 0 -537 -78
Table 2: Best feasible (BF) solution and lower bound (LB) found by MIQP solvers CPLEX 12.7 and Gurobi 6.0.3 compared with the GenSA simulated annealing solution (SA) with one hour timeout for the industrial instance of \crefeq:full_formulation.

We test our algorithms on an industrial instance. The convex part has continuous variables, and . The GBT part contains 8800 trees where 4100 trees have max depth 16, the remaining trees have max depth 4, the total number of leaves is 93,200 and the corresponding \crefeq:ilp_sub model has 2061 binary variables. This instance was originally tackled using GenSA simulated annealing package \citepGenSA

6.1 Mixed-Integer Quadratic Pragramming

We feed the instance to MIQP solvers CPLEX 12.7, Gurobi 6.0.3 and GenSA simulated annealing package with 1 hour timeout. \Creftab:black_box_compare lists the obtained results. SA finds better heuristic solutions than the MIQP solvers. In two cases, CPLEX does not even report a feasible solution. The large optimality gap returned by the MIQP solvers implies that (i) either there exist significantly better solutions than the ones computed by SA, or (ii) the solver lower bounds are extremely weak. Determining a better optimality gap motivates the design of our branch-and-bound algorithm.


tab:black_box_compare shows that the optimization problem is difficult for state-of-the-art commercial MIQP solvers. Moreover, the problem becomes harder as parameter decreases and GBTs’ contribution to the objective becomes more significant. This finding motivates the design of optimization methods that aim in bounding efficiently the GBT trained function and exploit bounds on the convex penalty function when its contribution becomes significant. In this last case, we are equipped with stronger bounds.

6.2 Heuristics

0 -168.2 -144.0 -150.7
1 -130.7 -104.0 -111.9
10 -102.7 -89.6 -92.0
100 -84.2 -83.2 -85.3
1000 -80.2 -80.4 -76.9
Table 3: Mixed-integer quadratic programming (MIQP), particle swarm optimization (PSO) and simulated annealing (SA) results with a 1 hour timeout.
Figure 4: Mixed-integer quadratic programming (MIQP), particle swarm optimization (PSO) and simulated annealing (SA) results. Top: . Bottom: .

We compare the \crefsec:feasibility heuristics to simulated annealing (SA). For MIQP, each iteration introduces new trees. \Creftab:feas_results compares the heuristics for increasing . SA outperforms the other heuristics for smaller ’s and is competitive for larger values. Since both MIQP and PSO exploit the convex part, they both perform better for larger values of . \Creffig:feas_results shows the heuristic improvement over time for and . MIQP objective improvements are not regular and drop sharply when they improve. Particle swarm optimization (PSO) converges to an inferior solution for both ’s. Even though particles are initially projected towards low penalty regions, they suffer from bias, e.g., if a non-optimal solution remains the best found for many iterations, the other particles become biased towards it. For large ’s, the \crefsec:feasibility heuristics do not significantly improve on the SA solutions. This suggests that the \creftab:black_box_compare lower bounds may be poor motivating the need for an improved bounding approach.

6.3 Branch-and-Bound

The \crefsec:branch_and_bound branch-and-bound algorithm initializes with a feasible solution and a GBT part global lower bound. We initialize with the best feasible solutions from \crefsubsec:feas_results. For the GBT lower bound, we use the \crefsubsec:bb_lb approach. We first investigate how different subset sizes affect the calculated bound and runtime for this algorithm.

Figure 5: GBT part bounding using the partition based algorithm (\crefsubsec:bb_lb). The left axis corresponds to the lower bound and the right axis to the cumulative runtime of CPLEX 12.7/Gurobi 6.0.3 to solve the individual MILPs. The horizontal lines are the CPLEX 12.7 and Gurobi 6.0.3 lower bounds after solving the entire instance with one hour time limit.

fig:test compares the GBT bounding algorithm for various partition subset sizes to MILP solvers CPLEX 12.7 and Gurobi 6.0.3. The bounding algorithm calculates tighter bounds as the subset size increases. For larger sizes, the runtime shows exponential increase. Small subset sizes have a larger computational cost due to the non-negligible modeling overhead of solving many small MILPs. Comparing with a black-box approach, we see that a well chosen subset size achieves superior time-to-bound performance. A subset size of 140 solves in 4 minutes and improves on the Gurobi 6.0.3 bound. Choosing a subset size of 360 improves on the CPLEX 12.7 bound and takes 8 minutes.

Figure 6: Branch-and-bound lower bound improvement for (top) and (bottom) compared to Gurobi 6.0.3 given a one hour timeout. We test the branch-and-bound algorithm with (BB-BO) and without (BB-WBO) branch ordering. BF is the best known feasible solution.

Given the \creffig:test results, we initialize the branch-and-bound algorithm using a subset size of 150 for as it is in the trough of runtime plot. \Creffig:bound_evo compares the branch-and-bound algorithm with (BB-BO) and without (BB-WBO) the branch ordering to Gurobi 6.0.3 (solving the entire MIQP problem) and the best known feasible solution for and . BB-BO outperforms all of the other approaches, providing a significant decrease in the optimality gap for . Comparing with Gurobi 6.0.3, the case produces a much better optimality gap. For , all solvers have a fairly large optimality gap, this is due to the problem being closer to the NP-hard \crefeq:ilp_sub model Also strong branching is less likely to be invoked in early nodes of the branch-and-bound tree resulting more regular GBT bounds updates and therefore a large increase in cumulative bounding time. Both ’s show the importance of the branch ordering. BB-BO is able to produce better bounds earlier and for strong branching coupled with the branch ordering results in significant bounds improvement.

7 Conclusion

As machine learning methods mature and their predictive modeling power becomes well-respected industrially, decision makers want to move from solely making predictions on model inputs to deciding algorithmically what is the best model input. In other words, we must move towards optimizing pre-trained machine learning models. This paper effectively addresses a large-scale, industrially-relevant gradient-boosted tree model by directly exploiting: (i) advanced mixed-integer programming technology with strong optimization formulations, (ii) GBT tree structure with priority towards searching on commonly-occurring variable splits, and (iii) convex penalty terms with enabling fewer mixed-integer optimization updates. The particular model application is in chemical catalysis, but the general form of the optimization problem will appear whenever we wish to optimize a pre-trained gradient-boosted tree with convex penalty terms. It would have been alternatively possible to train and then optimize a smooth and continuous machine learning model, but applications with legacy code may start with a GBT. Additionally, it is well known in catalysis that the global solution to an optimization problem is often particularly useful. The methods in this paper not only generate good feasible solutions to the optimization problem, but they also converge towards proving the exact solution.


The support of: BASF SE, the EPSRC Centere for Doctoral Training in High Performance Embedded and Distributed Systems to M.M. (EP/L016796/1), and an EPSRC Research Fellowship to R.M. (EP/P016871/1) is gratefully acknowledged.


  1. Anstreicher, K., Brixius, N., Goux, J.-P., and Linderoth, J. Solving large quadratic assignment problems on computational grids. Mathematical Programming, 91(3):563–588, Feb 2002. ISSN 1436-4646.
  2. Anstreicher, K. M. Recent advances in the solution of quadratic assignment problems. Mathematical Programming, 97(1):27–42, Jul 2003. ISSN 1436-4646.
  3. Belotti, P., Lee, J., Liberti, L., Margot, F., and Wächter, A. Branching and bounds tightening techniques for non-convex MINLP. Optimization Methods and Software, 24(4-5):597–634, 2009.
  4. Belotti, P., Kirches, C., Leyffer, S., Linderoth, J., Luedtke, J., and Mahajan, A. Mixed-integer nonlinear optimization. Acta Numerica, 22:1–131, 2013.
  5. Bertsimas, D. and Dunn, J. Optimal classification trees. Machine Learning, 106(7):1039–1082, Jul 2017. ISSN 1573-0565.
  6. Bertsimas, D. and King, A. OR Forum – An algorithmic approach to linear regression. Operations Research, 64(1):2–16, 2016.
  7. Bertsimas, D. and Mazumder, R. Least quantile regression via modern optimization. The Annals of Statistics, 42(6):2494–2525, 12 2014.
  8. Bertsimas, D., King, A., and Mazumder, R. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852, 04 2016.
  9. Bixby, R. E. A brief history of linear and mixed-integer programming computation. Documenta Mathematica, pp. 107–121, 2012.
  10. Bollas, G. M., Barton, P. I., and Mitsos, A. Bilevel optimization formulation for parameter estimation in vaporâliquid(âliquid) phase equilibrium problems. Chem Eng Sci, 64(8):1768 – 1783, 2009.
  11. Boukouvala, F., Misener, R., and Floudas, C. A. Global optimization advances in mixed-integer nonlinear programming, MINLP, and constrained derivative-free optimization, CDFO. Eur J Oper Res, 252(3):701 – 727, 2016.
  12. Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification and Regression Trees. Wadsworth, Inc., 1984.
  13. Chen, T. and Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, 2016.
  14. Easton, K., Nemhauser, G., and Trick, M. Solving the travelling tournament problem: A combined integer programming and constraint programming approach. In Burke, E. and De Causmaecker, P. (eds.), Practice and Theory of Automated Timetabling IV, pp. 100–109, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg. ISBN 978-3-540-45157-0.
  15. Eberhart, R. and Kennedy, J. A new optimizer using particle swarm theory. In Proceedings of the Sixth International Symposium on Micro Machine and Human Science, pp. 39–43, Oct 1995.
  16. Freund, Y. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256 – 285, 1995.
  17. Friedman, J. H. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001.
  18. Friedman, J. H. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367 – 378, 2002.
  19. Hart, W. E., Watson, J.-P., and Woodruff, D. L. Pyomo: modeling and solving mathematical programs in Python. Mathematical Programming Computation, 3(3):219–260, 2011.
  20. Hart, W. E., Laird, C. D., Watson, J.-P., Woodruff, D. L., Hackebeil, G. A., Nicholson, B. L., and Siirola, J. D. Pyomo–optimization modeling in Python, volume 67. Springer Science & Business Media, second edition, 2017.
  21. Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Springer-Verlag New York, second edition, 2009.
  22. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 3149–3157. Curran Associates, Inc., 2017.
  23. Kennedy, J. and Eberhart, R. Particle swarm optimization. In Proceedings of the IEEE International Conference on Neural Networks, volume 4, pp. 1942–1948 vol.4, Nov 1995.
  24. Kılınç, M., Linderoth, J., Luedtke, J., and Miller, A. Strong-branching inequalities for convex mixed integer nonlinear programs. Computational Optimization and Applications, 59(3):639–665, Dec 2014.
  25. Klabjan, D., Johnson, E. L., Nemhauser, G. L., Gelman, E., and Ramaswamy, S. Solving large airline crew scheduling problems: Random pairing generation and strong branching. Computational Optimization and Applications, 20(1):73–91, Oct 2001.
  26. Mišić, V. V. Optimization of Tree Ensembles. ArXiv e-prints, 2017. arXiv:1705.10883.
  27. Miyashiro, R. and Takano, Y. Mixed integer second-order cone programming formulations for variable selection in linear regression. European Journal of Operational Research, 247(3):721 – 731, 2015.
  28. Morrison, D. R., Jacobson, S. H., Sauppe, J. J., and Sewell, E. C. Branch-and-bound algorithms: A survey of recent advances in searching, branching, and pruning. Discrete Optimization, 19:79 – 102, 2016.
  29. Nocedal, J. and Wright, S. J. Sequential Quadratic Programming, pp. 529–562. Springer New York, 2006. ISBN 978-0-387-40065-5.
  30. Singer, A. B., Taylor, J. W., Barton, P. I., and Green, W. H. Global dynamic optimization for parameter estimation in chemical kinetics. J Phy Chem A, 110(3):971–976, 2006.
  31. Snoek, J., Larochelle, H., and Adams, R. P. Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pp. 2951–2959. Curran Associates, Inc., 2012.
  32. Sra, S., Nowozin, S., and Wright, S. J. Optimization for Machine Learning. MIT Press, 2012.
  33. Xiang, Y., Gubian, S., Suomela, B., and Hoeng, J. Generalized simulated annealing for efficient global optimization: the GenSA package for R. The R Journal Volume 5/1, Jun 2013.
  34. Zhang, H., Si, S., and Hsieh, C.-J. GPU-acceleration for Large-scale Tree Boosting. ArXiv e-prints, 2017. arXiv:1706.08359.
This is a comment super asjknd jkasnjk adsnkj
The feedback cannot be empty
Comments 0
The feedback cannot be empty
Add comment

You’re adding your first comment!
How to quickly get a good reply:
  • Offer a constructive comment on the author work.
  • Add helpful links to code implementation or project page.