Parameterizing Branch-and-Bound Search Trees to Learn Branching Policies
Abstract
Branch and Bound (B&B) is the exact tree search method typically used to solve Mixed-Integer Linear Programming problems (MILPs). Learning branching policies for MILP has become an active research area, with most works proposing to imitate the strong branching rule and specialize it to distinct classes of problems. We aim instead at learning a policy that generalizes across heterogeneous MILPs: our main hypothesis is that parameterizing the state of the B&B search tree can significantly aid this type of generalization. We propose a novel imitation learning framework, and introduce new input features and architectures to represent branching. Experiments on MILP benchmark instances clearly show the advantages of incorporating into a baseline model an explicit parameterization of the state of the search tree to modulate the branching decisions. The resulting policy reaches higher accuracy than the baseline, and on average explores smaller B&B trees, while effectively allowing generalization to generic unseen instances.
1 Introduction
Many problems arising from transportation, healthcare, energy and logistics can be formulated as Mixed-Integer Linear Programming (MILP) problems, i.e., optimization problems in which some decision variables represent discrete or indivisible choices. A MILP is written as
$$\min \{ c^T x : Ax \geq b,\ x \geq 0,\ x_i \in \mathbb{Z}\ \ \forall i \in \mathcal{I} \}, \tag{1}$$
where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, $c, x \in \mathbb{R}^n$, and $\mathcal{I} \subseteq \{1, \dots, n\}$ is the set of indices of variables that are required to be integral, while the other ones can be real-valued. Note that one can consider a MILP as defined by $(c, A, b, \mathcal{I})$; we do not assume any special combinatorial structure on the parameters $c, A, b$. While MILPs are in general hard, MILP solvers underwent dramatic improvements over the last decades [35, 3] and now achieve high performance on a wide range of problems. The fundamental component of any modern MILP solver is Branch and Bound (B&B) [27], an exact tree search method. Following a divide-and-conquer approach, B&B partitions the search space by branching on variables’ values and smartly uses bounds from problem relaxations to prune unpromising regions from the tree. The B&B algorithm actually relies on expertly crafted heuristic rules for its two most fundamental decisions: branching variable selection (BVS) and node selection. In particular, BVS has been shown to be a crucial factor for B&B’s success [3], and will be the main focus of the present article.
Understanding why B&B works has been called “one of the mysteries of computational complexity theory” [31], and there currently is no mathematical theory of branching; to the best of our knowledge, the only attempt at formalizing BVS is the recent work of [28]. One central reason why B&B is difficult to formalize resides in its inherent exponential nature: millions of BVS decisions could be needed to solve a MILP, and a single bad one could result in a doubled tree size and no improvement in the search. Such a complex and data-rich setting, paired with a lack of formal understanding, makes B&B an appealing ground for machine learning (ML) techniques, which have lately been thriving in discrete optimization [8]. In particular, there has been substantial effort towards “learning to branch”, i.e., using ML methods to learn BVS policies [34].
Up to now, most works in this area of research focused on learning branching policies by supervision or imitation of strong branching (SB), a valid but expensive heuristic scheme (see Sections 2 and 5 for more details). While [5] propose to explicitly learn SB scores by regression, [24] formulate BVS as a ranking problem and learn instance-specific proxies of SB. In a different vein, [7] suggest leveraging existing scoring rules by learning weights to combine them, and perform experiments on special classes of synthetic problems. The latest contribution we know of to “learning to branch” [18] frames BVS as a classification problem on SB expert decisions, and employs a graph-convolutional neural network to represent MILPs. The resulting branching policies improve on the solver by successfully specializing SB to different classes of generated combinatorial optimization problems. Specifically, the attained generalization ability is to similar MILP instances (within the same class), possibly larger in formulation size.
The present work seeks a different (somewhat complementary) type of generalization for a branching policy, namely across heterogeneous MILPs, i.e., across problems not belonging to the same combinatorial class, without any restriction on the formulation’s structure and size. To achieve this goal, we parameterize BVS in terms of B&B search trees. On the one hand, information about the state of the B&B tree (abundant yet mostly unexploited by MILP solvers) was already shown to be useful to learn resolution patterns shared across general MILPs [15]. On the other hand, the state of the search tree ought to have a central role in BVS, which ultimately decides how the tree is expanded and hence how the search itself proceeds. In practice, B&B continually interacts with other algorithmic components of the solver to effectively search the decision tree. In a highly integrated framework, a branching variable should thus be selected among the candidates based on its role in the search and its various components. Indeed, state-of-the-art heuristic branching schemes employ properties of the tree to make BVS decisions, and the B&B method equipped with such branching rules has proven to be successful across widely heterogeneous instances.
Motivated by these considerations, our main hypothesis is that MILPs share a higher-order structure in the space of B&B search trees, and parameterized BVS policies should learn in this representational space. We set up a novel learning framework to investigate this idea. First of all, there is no natural input representation of this underlying space. Our first contribution is to craft input features of the variables that are candidates for branching: we aim at representing their broad roles in the search and their dynamic evolution. The dimensionality of such descriptions naturally changes with the number of candidate variables at every BVS step. The deep neural network (DNN) architecture that we propose learns a baseline branching policy (NoTree) from the candidate variables’ representations and effectively deals with varying input dimensions. Taking this idea further, we suggest that an explicit representation of the state of the search tree should condition the branching criteria, in order for it to flexibly adapt to the tree evolution. We contribute such a tree-state parameterization, and incorporate it into the baseline architecture to provide context over the candidate variables at each given branching step. In the resulting policy (TreeGate) the tree state acts as a control mechanism to drive a top-down modulation (specifically, feature gating) of the highly mutable space of candidate variables’ representations. In this sense, we learn branching from parameterizations of B&B search trees that are shared among general MILPs. To the best of our knowledge, the present work is the first attempt in the “learning to branch” literature to represent B&B search trees for branching, and to establish such a broad generalization paradigm covering many classes of MILPs. We envision a future combination of our framework on generic MILPs with more structure-based ones, such as that of [18], to leverage the strengths of both approaches.
We perform imitation learning (IL) experiments on a curated dataset of heterogeneous instances from standard MILP benchmarks: the selected problems belong to various special classes, are different in structure and size, and give rise to diverse search trees. We employ as expert rule the default branching scheme of the optimization solver SCIP [19], into which our framework is integrated. Machine learning experimental results clearly show the advantage of the policy employing the tree state (TreeGate) over the baseline one (NoTree), the former achieving a 19% improvement in test accuracy. The evaluation of the trained policies in the solver also supports our idea that representing B&B search trees enables learning to branch across generic MILP instances: the best TreeGate policy explores on average trees with 14.9% fewer nodes than the best NoTree one; measured over test instances only, this gap increases to 27%. In addition, when plugged into the solver both learned policies compare well with state-of-the-art branching rules.
2 Background
Simply put, the B&B algorithm iteratively partitions the solution space of a MILP (1) into subproblems, which are mapped to nodes of a binary decision tree. At each node, integrality requirements for variables in the integer index set $\mathcal{I}$ are dropped, and a linear programming (LP) (continuous) relaxation of the problem is solved to provide a valid lower bound to the optimal value of (1). When the solution $\hat{x}$ of a node LP relaxation violates the integrality of some variables in $\mathcal{I}$, that node is further partitioned into two children by branching on a fractional variable. Formally, $\mathcal{C} = \{ i \in \mathcal{I} : \hat{x}_i \notin \mathbb{Z} \}$ defines the index set of candidate variables for branching at that node. The BVS problem consists in selecting a variable $j \in \mathcal{C}$ in order to branch on it, i.e., create child nodes according to the split
$$x_j \leq \lfloor \hat{x}_j \rfloor \ \vee\ x_j \geq \lceil \hat{x}_j \rceil. \tag{2}$$
Child nodes inherit a lower bound estimate from their parent, while (2) ensures $\hat{x}$ is removed from their solution spaces. After extending the tree, the algorithm moves on to select a new open node, i.e., a leaf yet to be explored (node selection): a new relaxation is solved, and new branchings happen. When $\hat{x}$ satisfies integrality requirements, then it is actually feasible for (1), and its value provides a valid upper bound to the optimal one. Maintaining global upper and lower bounds allows one to prune large portions of the search space. During the search, final leaf nodes are created in three possible ways: by integrality, when the relaxed solution is feasible for (1); by infeasibility of the subproblem; by bounds, when the comparison of the node’s lower bound to the global upper one proves that its subtree is not worth exploring. An optimality certificate is reached when the global bounds converge. See [46, 35] for details on B&B and its combination with other components of a MILP solver.
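The candidate set and the branching split described above can be illustrated with a minimal Python sketch. The function names, the tolerance `tol` and the sample LP solution are hypothetical and chosen for illustration only; a real solver obtains the relaxed solution from its LP engine.

```python
import math

def branching_candidates(lp_solution, integer_indices, tol=1e-6):
    """Indices of integer-constrained variables whose LP value is fractional."""
    return [i for i in sorted(integer_indices)
            if abs(lp_solution[i] - round(lp_solution[i])) > tol]

def branch(lp_solution, j):
    """Bound changes defining the two children created by branching on j."""
    value = lp_solution[j]
    down = ("<=", math.floor(value))  # first child:  x_j <= floor(value)
    up = (">=", math.ceil(value))     # second child: x_j >= ceil(value)
    return down, up

# Hypothetical relaxed solution; variables 0, 1, 2 must be integral, 3 is continuous.
x_hat = [2.0, 0.5, 3.7, 1.25]
integer_indices = {0, 1, 2}

candidates = branching_candidates(x_hat, integer_indices)
children = branch(x_hat, 2)
print(candidates)  # [1, 2]
print(children)    # (('<=', 3), ('>=', 4))
```

Both children exclude the fractional value 3.7 of the selected variable, which is how the split removes the current relaxed solution from the children's solution spaces.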
Branching Rules
Usually, candidates are evaluated with respect to some scoring function, and the branching variable is chosen as the (or a) score-maximizing candidate. The most used criterion in BVS measures variables depending on the improvement of the lower bound in their (prospective) child nodes. The strong branching (SB) rule [6] explicitly computes bound gains for the candidate variables. The procedure is expensive, but experimentally realizes trees with the least number of nodes. Instead, pseudocost (PC) [9] maintains a history of variables’ branchings, averaging past improvements to get a proxy for the expected gain. Fast in evaluation, PC can behave badly while its scores are still uninitialized, so combinations of SB with PC have been developed. In reliability branching, SB is performed until PC scores for a variable are deemed reliable proxies of bound improvements. In hybrid branching [2], PC scores are combined via a weighted sum with other criteria borrowed from the CSP and SAT communities (on inference and conflict clauses). Extensive literature has been produced on BVS schemes [1], and many other scoring criteria have been proposed; some of them are surveyed by [34] from a machine learning perspective.
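To make the pseudocost idea concrete, the following is a minimal sketch of a history of per-unit bound gains and of a score derived from it. The class name, the product-style score with a small `eps`, and the sample numbers are assumptions made for illustration; they are not claimed to reproduce the exact formulas used in SCIP.

```python
import math

class Pseudocosts:
    """Running averages of per-unit bound gains observed after branching."""

    def __init__(self):
        self.sums = {}    # (variable, direction) -> sum of per-unit gains
        self.counts = {}  # (variable, direction) -> number of observations

    def update(self, var, direction, bound_gain, fractional_distance):
        """Record the gain observed in one child, normalized by the distance moved."""
        key = (var, direction)
        self.sums[key] = self.sums.get(key, 0.0) + bound_gain / fractional_distance
        self.counts[key] = self.counts.get(key, 0) + 1

    def average(self, var, direction):
        key = (var, direction)
        if self.counts.get(key, 0) == 0:
            return 0.0  # uninitialized: no branching history for this variable yet
        return self.sums[key] / self.counts[key]

    def score(self, var, lp_value, eps=1e-6):
        """Product of the estimated gains in the down and up children."""
        frac = lp_value - math.floor(lp_value)
        down_gain = self.average(var, "down") * frac
        up_gain = self.average(var, "up") * (1.0 - frac)
        return max(down_gain, eps) * max(up_gain, eps)

pc = Pseudocosts()
pc.update("x1", "down", bound_gain=2.0, fractional_distance=0.5)  # per-unit gain 4.0
pc.update("x1", "up", bound_gain=3.0, fractional_distance=0.5)    # per-unit gain 6.0

score_x1 = pc.score("x1", lp_value=2.25)  # (4.0 * 0.25) * (6.0 * 0.75)
score_x2 = pc.score("x2", lp_value=0.5)   # no history: falls back to eps * eps
print(score_x1, score_x2)
```

The second score shows the uninitialization issue: a variable without history cannot be distinguished from a genuinely poor candidate, which is the gap that reliability branching fills with SB evaluations.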
State-of-the-art branching rules can in fact be interpreted as mechanisms to score variables based on their effectiveness in different search components. While hybrid branching explicitly combines five scores reflecting variables’ behaviors in different search tasks, the evaluation performed by SB and PC can also be seen as a measure of how effective a variable is in the single task of improving the bound from one parent node to its children. Besides, one can assume that the importance of different search functionalities should change dynamically during the tree exploration.
3 Parameterizing B&B Search Trees
The central idea of our framework is to learn BVS by means of parameterizing the underlying space of B&B search trees. We believe this space can represent the complexity and the dynamism of branching in a way that is shared across heterogeneous problems. However, there are no natural parameterizations of BVS or B&B search trees. To this end, our contribution is twofold: 1) we propose handcrafted input features to describe candidate variables in terms of their roles in the B&B process, and explicitly encode a “tree state” to provide a richer context to variable selection; 2) we design novel DNN architectures to integrate these inputs and learn BVS policies.
3.1 Handcrafted Input Features
At each branching step $t$, we represent the set of variables that are candidates for branching by an input matrix holding one feature vector per candidate. To capture the multiple roles of a variable throughout the search, we describe each candidate in terms of its bounds and solution value in the current subproblem. We also feature statistics of a variable’s participation in various search components (e.g., inference, conflicts, implications) and in past branchings. In particular, the scores that are used in the SCIP default hybrid-branching formula are part of these features.
Additionally, we create a separate parameterization to describe the state of the search tree. We record information of the current local node in terms of its depth and the quality of its bound. We also consider the growth rate and composition of the tree (explored, open, final leaf nodes), the evolution of global bounds, statistics on feasible solutions and multiple other scores, aggregated over variables. Statistics on bound estimates and depths of the open nodes complete the parameterization of the tree state.
All features are designed to capture the dynamics of the B&B process linked to BVS decisions, and are efficiently gathered through a customized version of PySCIPOpt [40]. Note that both parameterizations (of the candidate variables and of the tree state) are defined in a way that does not explicitly depend on the parameters of each instance. Even though the candidate-variable input naturally changes its dimensionality at each BVS step depending on the highly variable number of candidates, the fixed lengths of the per-variable feature vectors enable training among branching sets of different sizes (see Section 3.2). The representations evolve with the search: t-SNE plots [45] in Figures 1(a) and 1(b) synthesize the evolution of the tree state representation throughout the B&B search, for two different MILP instances. The pictures clearly show the high heterogeneity of the branching data across different search stages. A detailed description of the handcrafted input features is reported in Appendix B.
3.2 Architectures to Model Branching
We use the candidate-variable parameterizations as inputs for a baseline DNN architecture (NoTree). Referring to Figure 2, the 25-feature input of a candidate variable is first embedded into a representation with hidden size $h$; subsequently, multiple layers reduce the dimensionality from $h$ to an infimum size by halving it at each step. The vector of infimum length is then compressed by global average pooling into a single scalar. The candidate-set dimension is conceived (and implemented) as a “batch dimension”: this makes it possible to handle branching sets of varying sizes, still allowing the parameters of the nets to be shared across problems. Ultimately, a softmax layer yields a probability distribution over the candidate set, according to which a variable is selected for branching.
We incorporate the tree-state input into the baseline architecture in order to provide a search-based context over the mutable branching sets. Practically, the tree-state input is embedded in a series of subsequent layers with hidden size $h$. The output of a final sigmoid activation is a gating vector $g \in [0, 1]^H$, where $H$ denotes the total number of units of the NoTree layers. Separate chunks of $g$ are used to modulate by feature gating the representations of NoTree: the first $h$ entries control features at the first embedding, the next $h/2$ act at the second layer, and so on, until exhausting $g$ with the last layer prior to the average pooling. In other words, the tree state is used as a control mechanism on the variables’ parameterization, gating their features via a learned tree-based signal. The resulting network (TreeGate) models the high-level idea that a branching scheme should adapt to the tree evolution, with variables’ selection criteria dynamically changing throughout the tree search.
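The bookkeeping behind the gating mechanism can be sketched in plain Python: the widths of the halving layers, the split of the gating vector into one chunk per layer, and the softmax over a candidate set of arbitrary size. This is a simplified illustration and not the authors' implementation: the learned layers are omitted, the infimum value 8 and the constant gate are assumptions, and the "embedded" candidate features are made up.

```python
import math

def layer_widths(hidden_size, infimum):
    """Widths of the halving layers: hidden_size, hidden_size / 2, ..., infimum."""
    widths = [hidden_size]
    while widths[-1] > infimum:
        widths.append(widths[-1] // 2)
    return widths

def split_gate(gate, widths):
    """Split the gating vector into one chunk per layer."""
    chunks, start = [], 0
    for width in widths:
        chunks.append(gate[start:start + width])
        start += width
    return chunks

def gate_features(features, chunk):
    """Element-wise gating of one candidate's features by values in [0, 1]."""
    return [f * g for f, g in zip(features, chunk)]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

widths = layer_widths(hidden_size=32, infimum=8)  # assumed infimum: [32, 16, 8]
total_units = sum(widths)                         # length of the gating vector
gate = [0.5] * total_units                        # stand-in for the sigmoid output
chunks = split_gate(gate, widths)

# Three made-up candidates, already embedded to width 32. The same chunk gates
# every candidate, so the number of candidates acts as a batch dimension.
candidates = [[1.0] * 32, [2.0] * 32, [3.0] * 32]
gated = [gate_features(c, chunks[0]) for c in candidates]
scores = [sum(row) / len(row) for row in gated]   # average pooling (layers omitted)
probs = softmax(scores)
print(widths, total_units, probs)
```

Because the gate depends only on the tree state, it has a fixed length regardless of how many candidates are present, which is what allows the same parameters to serve branching sets of any size.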
4 Experiments
MILP Dataset and Solver Setting
Despite MILP libraries containing hundreds of instances, not all of them appear viable for our setting, and a careful dataset curation is needed. On the one hand, comparing the behavior of different branching policies becomes easier (and results are clearer) when the explored trees are manageable in size and the problems can be consistently solved to optimality. On the other hand, standard MILP collections comprise very challenging instances, and are compiled to be the ongoing benchmark for advances in MILP research. In our ML context, it does not seem necessary to introduce extra challenges on the MILP side. We hence curate a heterogeneous collection of 27 problems from different real-world MILP benchmark libraries [10, 26, 37, 38], focusing on instances whose tree exploration is on average relatively contained (in the tens or hundreds of thousands of nodes, at most) and whose optimal value is known. We partition our selection into 19 train and 8 test problems. A complete list of instances is reported in Table 1, while we refer to Appendix A for other details.
We use SCIP 6.0.1. Modifying the solver configuration is common practice in BVS literature [30], especially in a proof-of-concept setting such as the one in which our work is positioned. To reduce the effects of the other solver components on BVS, we work with a configuration specifically designed to fairly compare the performance of different branching rules [17]. In particular, we allow presolve and cut separation routines, and disable all primal heuristics. For each problem we provide the known optimal solution value as cutoff, and disable a few other parameters associated with SB side effects. We also enforce a limit of one hour on the resolution time. The same setting is used for both data collection and policies’ evaluation. Further details on the solver parameters and hardware settings are reported in Appendix C.
Table 1. MILP instances in the train and test sets.

| Set | Instances |
|---|---|
| Train | air04, air05, dcmulti, eil332, istanbulnocutoff, l152lav, lseu, misc03, neos20, neos21, neos476283, neos648910, pp08aCUTS, rmatr100p10, rmatr100p5, sp150x300d, stein27, swath1, vpm2 |
| Test | map18, mine1665, neos11, neos18, ns1830653, nu25pr12, rail507, seymour1 |
Data Collection and Split
We collect IL training data from SCIP rollouts, gathering inputs (the candidate-variable and tree-state parameterizations) and the corresponding branching decisions (labels). Our expert branching scheme is the default one of SCIP, relpscost, i.e., a reliability version of hybrid branching. Given that each branching decision gives rise to a single datapoint, and that the search trees of the selected MILP instances are not extremely big, one needs to augment the data. We proceed in two ways.

(i) First, we naturally exploit the so-called performance variability of MILPs [33]. To obtain perturbed tree searches of the same instance, we set five different random seeds in the solver, to control variables’ permutations.

(ii) Second, we diversify B&B explorations by letting a random branching scheme run for the first $k$ nodes, before switching to the SCIP default branching rule and starting data collection. The motivation behind this type of augmentation is to gather input states that are unlikely to be observed by an expert rule [21]: inputs collected from a default SCIP run will differ from those confronted by a trained IL policy at test time, given that a single divergent BVS may cascade and result in qualitatively very different B&B trees. We use several values of $k$, where the case $k = 0$ corresponds to a run in which no random branching is performed (i.e., relpscost is used from the beginning of the search). We apply this second type of augmentation to train instances only.
One can quantify MILP variability by computing the coefficient of variation of the performance measurements [26]; we report such variability scores for all our instances in Appendix D, using the total number of nodes as performance measure, across the five runs of SCIP performed on different seeds, as in (i). The majority of the instances present a variability of at least 0.20, confirming (i) as an effective way of diversifying our dataset. The effect of initial random branchings is also analyzed and reported in Appendix D. Generally, the size of the explored trees grows with the number $k$ of initial random branchings, i.e., initial random branchings worsen the node count, though the opposite can also happen in a few cases. The coefficients of variation of the nodes’ shifted geometric means across different values of $k$, computed on the training set, suggest that (ii) is also effective for data augmentation.
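As an illustration of the variability score, the coefficient of variation of a set of node counts is the ratio of their standard deviation to their mean. The sketch below uses the population standard deviation, which is an assumption (the exact estimator is not specified here), and the five node counts are invented for the example.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population estimator assumed)."""
    return statistics.pstdev(values) / statistics.mean(values)

# Hypothetical node counts of one instance over five random seeds.
nodes_per_seed = [100, 120, 80, 110, 90]
cv = coefficient_of_variation(nodes_per_seed)
print(round(cv, 4))  # 0.1414
```

A score of 0.20 therefore means that the node counts of an instance typically deviate from their mean by about a fifth of its value when only the random seed changes.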
The final composition of train, validation and test splits is summarized in Table 2. In particular, train and validation data come from the same subset of 19 instances, with validation being performed on branchings from a different random seed. Instead, the test set contains datapoints from 8 separate MILPs, using augmentation of type (i) only.
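Table 2. Number of datapoints (input-label pairs) in each split.

| Split | Total pairs |
|---|---|
| Train | 85,533 |
| Validation | 14,413 |
| Test | 28,307 |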
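Table 3. Selected NoTree and TreeGate policies: hyperparameters, top-1 (top-5) test and validation accuracy in percent, and shifted geometric means of the number of explored nodes over All, Train and Test instances.

| Policy | Hidden size / depth / learning rate | Test acc@1 (@5) | Val acc@1 (@5) | Nodes (All) | Nodes (Train) | Nodes (Test) |
|---|---|---|---|---|---|---|
| NoTree | 32 / – / 0.0001 | 68.37 (91.43) | 75.40 (95.23) | 1341.72 | 859.17 | 3695.04 |
| NoTree | 64 / – / 0.0001 | 67.05 (89.18) | 76.45 (95.11) | 1363.73 | 847.63 | 4010.65 |
| NoTree | 128 / – / 0.0001 | 65.44 (90.21) | 76.77 (95.66) | 1454.20 | 875.19 | 4601.72 |
| NoTree | 128 / – / 0.001 | 64.02 (88.51) | 77.69 (95.88) | 1241.79 | 834.40 | 3068.96 |
| NoTree | 256 / – / 0.0001 | 64.59 (90.13) | 77.29 (96.08) | 1279.18 | 731.16 | 4491.64 |
| TreeGate | 64 / 5 / 0.01 | 83.70 (95.83) | 84.33 (96.60) | 1056.79 | 759.94 | 2239.47 |
| TreeGate | 256 / 2 / 0.001 | 83.69 (95.18) | 84.10 (96.42) | 1135.28 | 822.80 | 2369.35 |
| TreeGate | 32 / 3 / 0.01 | 83.31 (95.72) | 84.02 (96.50) | 1188.48 | 809.18 | 2849.28 |
| TreeGate | 128 / 5 / 0.001 | 81.61 (95.81) | 84.96 (96.74) | 1127.31 | 771.60 | 2666.73 |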
An important measure to analyze the dataset is given by the size of the candidate sets (i.e., the varying dimensionality of the inputs) contained in each split. Figure 1(c) shows histograms of the candidate set sizes in each subset. While in train and validation the candidate set sizes are mostly concentrated within a limited range, the test set has a very different distribution, and in particular one with a longer tail (over 300). In this sense, the test instances present never-seen branching data gathered from heterogeneous MILPs, and we test the generalization of our policies to entirely unknown and larger branching sets.
IL Optimization
We train both IL policies using ADAM [25] with default $\beta_1 = 0.9$, $\beta_2 = 0.999$, and with weight decay. Our hyperparameter search spans the learning rate, the hidden size $h$, and the depth $d$. The factor by which units of NoTree are reduced is 2, and the infimum size is kept fixed. We use PyTorch [39] to train the models for 40 epochs, reducing the learning rate by a factor of 10 at epochs 20 and 30.
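The training schedule can be written as a small step-decay function. This is a sketch under two assumptions: the quantity reduced at epochs 20 and 30 is the learning rate, and epochs are counted from 0; the base value 0.01 is just one plausible starting point. In PyTorch, `torch.optim.lr_scheduler.MultiStepLR` with `milestones=[20, 30]` and `gamma=0.1` expresses the same kind of schedule.

```python
def learning_rate(epoch, base_lr, milestones=(20, 30), factor=0.1):
    """Learning rate in effect at a 0-indexed epoch under step decay."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * factor ** drops

# Learning rate at each of the 40 training epochs for an assumed base value.
schedule = [learning_rate(epoch, base_lr=0.01) for epoch in range(40)]
print(schedule[0], schedule[20], schedule[30])
```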
4.1 Results
In our context, standard IL metrics are informative yet incomplete measures of performance for evaluating a learned BVS model, and one also cares about assessing the policies’ behaviors when plugged into the solver environment. This is why, in order to determine the best NoTree and TreeGate policies, we take into account both types of evaluations. We first select a few policies based on their test accuracy score; next, we specify them as custom branching rules in SCIP and perform full rollouts on the entire MILP dataset, over five random seeds (i.e., 135 evaluations each). To summarize the policies’ performance in the solver, we compute the shifted geometric mean (with a shift of 100) of the total number of explored nodes, over the 135 B&B executions (All), and restricted to Train and Test instances.
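For reference, the shifted geometric mean of values $v_1, \dots, v_n$ with shift $s$ is commonly defined as $\exp\big(\tfrac{1}{n}\sum_i \log(v_i + s)\big) - s$. The sketch below implements this usual definition with the shift of 100 mentioned above; the sample node counts are invented.

```python
import math

def shifted_geometric_mean(values, shift=100.0):
    """Geometric mean of the shifted values, shifted back afterwards."""
    logs = [math.log(v + shift) for v in values]
    return math.exp(sum(logs) / len(logs)) - shift

# Two hypothetical runs: one trivial tree and one of 300 nodes.
sgm = shifted_geometric_mean([0, 300])  # sqrt(100 * 400) - 100
print(round(sgm, 6))  # 100.0
```

Compared with the arithmetic mean of the same two runs (150), the shifted geometric mean is less dominated by the largest tree, while the shift keeps very small trees from having an outsized effect.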
Table 4. Number of explored nodes for each policy: per-instance means over five runs, and aggregated measures over All, Train and Test instances. For relpscost, the fair number of nodes is given in parentheses. % diff is the relative node reduction of TreeGate with respect to NoTree (negative when TreeGate explores more nodes). * marks cases in which the time limit was hit.

| Instance | Set | NoTree | TreeGate | % diff | random | pscost | relpscost (fair) |
|---|---|---|---|---|---|---|---|
| All | | 1241.79 | 1056.79 | 14.90 | 6580.79 | 1471.61 | 286.15 (719.20) |
| Train | | 834.40 | 759.94 | 8.92 | 2516.04 | 884.37 | 182.27 (558.34) |
| Test | | 3068.96 | 2239.47 | 27.03 | 61828.29 | 4674.34 | 712.77 (1276.76) |
| air04 | train | 645.99 | 536.07 | 17.02 | 6677.96 | 777.65 | 8.19 (114.39) |
| air05 | train | 789.70 | 516.06 | 34.65 | 12685.83 | 1158.89 | 60.25 (277.22) |
| dcmulti | train | 203.53 | 187.49 | 7.88 | 599.12 | 122.39 | 9.38 (68.30) |
| eil332 | train | 7780.85 | 8767.27 | -12.68 | 12502.02 | 8337.63 | 583.34 (9668.71) |
| istanbulnocutoff | train | 447.26 | 543.71 | -21.56 | 1085.16 | 613.68 | 242.39 (328.25) |
| l152lav | train | 621.82 | 687.91 | -10.63 | 6800.06 | 964.53 | 10.14 (250.04) |
| lseu | train | 372.67 | 396.71 | -6.45 | 396.73 | 375.31 | 148.99 (389.88) |
| misc03 | train | 241.40 | 158.39 | 34.39 | 118.37 | 151.07 | 12.11 (294.11) |
| neos20 | train | 2062.23 | 1962.95 | 4.81 | 10049.15 | 2730.01 | 200.26 (612.75) |
| neos21 | train | 1401.84 | 1319.73 | 5.86 | 7016.55 | 1501.54 | 668.44 (1455.29) |
| neos648910 | train | 140.05 | 175.82 | -25.54 | 1763.05 | 1519.01 | 39.83 (166.53) |
| neos476283 | train | 13759.59 | 6356.81 | 53.80 | *94411.77 | 2072.84 | 204.88 (744.65) |
| pp08aCUTS | train | 267.86 | 293.74 | -9.66 | 337.76 | 271.92 | 69.66 (350.21) |
| rmatr100p5 | train | 443.35 | 460.48 | -3.86 | 1802.38 | 451.71 | 411.93 (785.15) |
| rmatr100p10 | train | 908.27 | 906.04 | 0.25 | 4950.77 | 894.65 | 806.35 (1214.76) |
| sp150x300d | train | 868.60 | 785.27 | 9.59 | 1413.64 | 991.52 | 182.22 (300.42) |
| stein27 | train | 1371.44 | 1146.79 | 16.38 | 1378.91 | 1322.36 | 926.82 (1111.25) |
| swath1 | train | 1173.14 | 1165.39 | 0.66 | 1429.21 | 1107.52 | 298.58 (2485.63) |
| vpm2 | train | 589.03 | 440.74 | 25.18 | 594.62 | 546.45 | 199.46 (463.12) |
| map18 | test | 457.89 | 575.92 | -25.78 | 11655.33 | 1025.74 | 270.25 (441.18) |
| mine1665 | test | 3438.44 | 4996.48 | -45.31 | *389437.62 | 4190.41 | 175.10 (600.22) |
| neos11 | test | 3326.32 | 3223.46 | 3.09 | 29949.69 | 4728.49 | 2618.27 (5468.05) |
| neos18 | test | 15611.63 | 10373.80 | 33.55 | 228715.62 | *133437.40 | 2439.29 (5774.36) |
| ns1830653 | test | 6422.37 | 5812.03 | 9.50 | 288489.30 | 12307.90 | 3489.07 (4311.84) |
| nu25pr12 | test | 357.00 | 86.80 | 75.69 | 1658.41 | 342.47 | 21.39 (105.61) |
| rail507 | test | 9623.05 | 3779.05 | 60.73 | *80575.84 | 4259.98 | 543.39 (859.37) |
| seymour1 | test | 3202.20 | 1646.82 | 48.57 | *167725.65 | 3521.47 | 866.32 (1096.67) |
Both types of metrics are reported in Table 3, together with the policies’ hyperparameters. Incorporating an explicit parameterization of the state of the search tree to modulate BVS clearly aids generalization: the advantage of TreeGate over NoTree is evident in all metrics, and across multiple trained policies. In particular, the top-1 test accuracy averages 65.90% for the NoTree models, while the TreeGate ones average 83.08%; the gap in validation accuracy is also significant. In terms of B&B rollouts, NoTree models explore on average about 1336 nodes, against about 1127 for the TreeGate ones (means of the All column in Table 3). What we observe is that the best test accuracy does not necessarily translate into the best solver performance. The NoTree policy with the best solver performance exhibits an approximately 4% gap from the best top-1 test accuracy model, but an improvement of over 7% in solver performance. We select as best policies those yielding the best nodes average over the entire dataset. In the case of TreeGate, the best model corresponds to that realizing the best top-1 test accuracy (83.70%), and brings a 19% (resp. 7%) improvement over the NoTree policy, in top-1 (resp. top-5) test accuracy. Learning curves and further details can be found in Appendix E.
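The acc@1 and acc@5 metrics count how often the expert's choice falls among the k highest-scored candidates. A minimal sketch follows, with invented scores and expert labels; note that each datapoint may have a different number of candidates.

```python
def top_k_accuracy(score_lists, expert_choices, k):
    """Percentage of datapoints whose expert choice is among the k best-scored candidates."""
    hits = 0
    for scores, expert in zip(score_lists, expert_choices):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if expert in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(expert_choices)

# Hypothetical policy outputs for four branching steps with varying candidate sets.
score_lists = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.15, 0.05],
    [0.6, 0.4],
    [0.1, 0.2, 0.3, 0.4],
]
expert_choices = [1, 1, 1, 0]  # index of the expert's variable at each step

acc1 = top_k_accuracy(score_lists, expert_choices, k=1)
acc2 = top_k_accuracy(score_lists, expert_choices, k=2)
print(acc1, acc2)  # 25.0 75.0
```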
In solver evaluations, NoTree and TreeGate are also compared to the SCIP default branching scheme relpscost, PC branching pscost and a random one. Additionally, we compute the fair number of nodes [17]. This measure accounts for those nodes that are processed as side effects of SB-like explorations, specifically looking at domain reduction and cutoff counts. In other words, the fair number of nodes distinguishes tree-size reductions due to better branching from those obtained by SB side effects. Note that for rules that do not involve any SB, the fair number of nodes and the usual nodes’ count coincide, so we only report it for the relpscost policy. The selected solver parametric setting (the same as the one used for data collection) allows a meaningful computation of the fair number of nodes, and an honest comparison of branching schemes.
Both NoTree and TreeGate policies are able to solve all instances within the 1h time limit, like relpscost. In contrast, random hits the limit on 4 instances (17 times in total) while pscost does so on one instance only (neos18), a single time. Table 4 reports the nodes’ means for every MILP instance (over five runs), as well as measures aggregated over train and test sets, and the entire dataset. In aggregation, TreeGate is always better than NoTree, the former exploring on average trees with 14.9% fewer nodes. This gap becomes more pronounced when measured over test instances only (27%), indicating the advantage of TreeGate over NoTree when exploring unseen data. Results are less clear-cut from an instance-wise perspective, with neither policy emerging as an absolute winner. Nonetheless, TreeGate is at least 10% (resp. 25%) better than NoTree on 10 (resp. 8) instances, while the opposite only happens 6 (resp. 3) times. In this sense, the reductions in tree sizes achieved by TreeGate are overall more pronounced.
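The instance-wise comparison can be reproduced from the NoTree and TreeGate columns of Table 4. The sketch below assumes that the percentage difference is the node reduction of TreeGate relative to NoTree, which matches the reported aggregate figures, and uses four rows copied from the table; the helper name is hypothetical.

```python
def percent_diff(notree_nodes, treegate_nodes):
    """Node reduction of TreeGate relative to NoTree, in percent."""
    return 100.0 * (notree_nodes - treegate_nodes) / notree_nodes

# (NoTree, TreeGate) node means for a few instances, copied from Table 4.
nodes = {
    "air04": (645.99, 536.07),
    "eil332": (7780.85, 8767.27),
    "neos18": (15611.63, 10373.80),
    "map18": (457.89, 575.92),
}

diffs = {name: percent_diff(a, b) for name, (a, b) in nodes.items()}
treegate_wins = sorted(name for name, d in diffs.items() if d >= 10.0)
notree_wins = sorted(name for name, d in diffs.items() if d <= -10.0)

for name, d in diffs.items():
    print(f"{name}: {d:.2f}")
print(treegate_wins, notree_wins)
```

A negative value means that TreeGate explored more nodes than NoTree on that instance, as happens for eil332 and map18 in this subset.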
In addition, learned policies compare well to other branching rules: both NoTree and TreeGate are substantially better than random across all instances, and always better than pscost in aggregated measures. Only on one instance are both policies much worse than pscost (neos476283). As expected, relpscost still realizes the smallest trees, but comparisons in terms of fair number of nodes are nonetheless positive: on 11 instances, at least one among NoTree and TreeGate explores fewer nodes than the relpscost fair number. In general, the policies realize tree sizes that are comparable to the SCIP default ones, when SB side effects are taken into account.
5 Related Work
Among the first attempts in “learning to branch”, [5] perform regression to learn proxies of SB scores. Instead, [24] propose to learn the ranking associated with such scores, and train instance-specific models (that are not end-to-end policies) via $\text{SVM}^{\text{rank}}$. Also [20] treat BVS as a ranking problem, and specialize their models to the combinatorial class of time-dependent traveling salesman problems. More recently, the work of [7] learns mixtures of existing branching schemes for different classes of synthetic problems, focusing on sample complexity guarantees. Similarly to us, the latest contribution to “learning to branch” [18] frames BVS as classification of expert branching decisions and employs IL to learn a branching policy. However, their expert of choice is SB, and MILPs are represented via a graph-convolutional neural network (GCNN) that models the variable-constraint structure expressed by the constraint matrix $A$. The resulting policies are specializations of SB that appear to effectively capture structural characteristics of some classes of combinatorial optimization problems, and are able to generalize to larger formulations from the same distribution (i.e., within the same combinatorial class). It is not obvious that the GCNN could effectively generalize across heterogeneous problems, given that the policy in [18] was only tested on bigger instances for which small analogs were available during training.
Still concerning the B&B framework, [21] employ IL to learn a heuristic and class-specific node selection policy, and categorize B&B input features. An RL approach for node selection can be found in [42], where a Multi-Armed Bandit is used to model the tree search; some complexity and scaling issues of B&B are also presented.
Feature gating has a long and successful history in machine learning, ranging from LSTMs [22] to GRUs [13], and we refer to [36] for a survey. The idea of using a tree state to drive a feature gating of the branching variables is an example of top-down modulation, which has been shown to perform well in other deep learning applications [43, 29, 44]. With respect to learning across non-static action spaces, the most similar to our work is [12], which is in the continual learning setting. Unlike the traditional Markov Decision Process formulation of reinforcement learning (RL), the input to our policies is not a generic state but rather includes a parameterized handcrafted input representation of the available actions, thus continual learning is not a relevant concern for our framework. Other related works from the RL setting learn action space representations [14, 11], but they both assume that the action space is static across RL episodes, while in contrast the action space of BVS changes dynamically with the candidate set at each branching step.
6 Conclusions and Future Directions
Branching variable selection is a crucial factor in B&B success, and we set up a novel imitation learning framework to address it. In particular, we seek to learn branching policies that generalize across heterogeneous MILPs, regardless of the instances’ structure and formulation size. In doing so, we undertake a step towards a broader type of generalization. The novelty of our approach is relevant for both the ML and the MILP worlds. On the one hand, we develop parameterizations of the candidate variables and of the search trees, and design a DNN architecture that handles candidate sets of varying size. On the other hand, the data encoded in our parameterization is not currently exploited by state-of-the-art MILP solvers, but we show that this type of information could indeed help in adapting the branching criteria to different search dynamics. Our results on MILP benchmark instances clearly demonstrate the advantage of incorporating a search-tree context to modulate BVS and aid generalization to heterogeneous problems, in terms of both better test accuracy and smaller explored B&B trees.
There surely are additional improvements to be gained by continuing to explore IL methods for branching. [18] have shown a correlation between the structure of MILPs (captured by the GCNN) and at least one of the main ingredients of the expert we use, namely the bound improvement due to BVS. In this work, we exploit instead representations of general B&B trees, and both priors may be required to fully match the expert performance.
However, quantifying the goodness of branching policies and B&B search trees remains hard due to the complexity and exponentiality of the B&B system. In the IL setting this translates into not being able to assess the impact of a misclassified BVS on the subsequent tree exploration. In fact, MILP domain expertise suggests that at any given branching step there is no such thing as a single best branching decision, but rather groups of variables on which one should branch [16]. In other words, there is no branching ground truth, and the quality of branching resides in effective BVS sequences. For these reasons, barring a mathematical breakthrough in the theory of branching, we believe there can be much more innovation in future explorations of RL approaches for BVS. Within the RL paradigm the focus would shift to learning branching sequences and partial tree explorations, by means of heterogeneous reward signals that could allow one to better approach the diverse performance goals one practically aims at when solving MILPs. These are important factors in “learning to branch” which cannot be expressed in IL terms. Indeed, the idea and the benefits of using an explicit parameterization of the state of B&B search trees, which we demonstrated in the IL setting, could be expanded even more in the RL one, for both state representations and the design of branching rewards.
Acknowledgements
We would like to thank Ambros Gleixner, Gerald Gamrath, Laurent Charlin, Didier Chételat, Maxime Gasse, Antoine Prouvost, Leo Henri and Sébastien Lachapelle for helpful discussions on the branching framework. We also thank Compute Canada for compute resources. This work was supported by CIFAR and IVADO.
Appendix A Dataset Curation
To curate a dataset of heterogeneous MILP instances, we consider the standard benchmark libraries MIPLIB 3, 2010 and 2017 [10, 26, 37], together with the collection of [38]. We assess the problems by analyzing B&B rollouts of SCIP with its default branching rule (relpscost) and a random one, enforcing a time limit of 1h in the same solver setting used for our experiments (see Appendix C). We focus on instances whose tree exploration is on average relatively contained (in the tens or hundreds of thousands of nodes, at most) and whose optimal value is known. This choice is primarily motivated by the need to ensure a fair comparison among branching policies in terms of tree size, which is more easily achieved when rollouts do not hit the time limit. We also remove problems that are solved at the root node (i.e., those for which no branching was performed).
Final training and test sets comprise 19 and 8 instances, respectively, for a total of 27 problems. They are summarized in Table 5, where we report their size, the number of binary/integer/continuous variables, the number of constraints, their membership in the train/test split and their library of origin. The constraints of each problem are of different types and give rise to various structures.
Name  Vars  Bin  Int  Cont  Conss  Set  Library 
air04  8904  8904  0  0  823  train  MIPLIB 3 
air05  7195  7195  0  0  426  train  MIPLIB 3 
dcmulti  548  75  0  473  290  train  MIPLIB 3 
eil332  4516  4516  0  0  32  train  MIPLIB 2010 
istanbulnocutoff  5282  30  0  5252  20346  train  MIPLIB 2017 
l152lav  1989  1989  0  0  97  train  MIPLIB 3 
lseu  89  89  0  0  28  train  MIPLIB 3 
misc03  160  159  0  1  96  train  MIPLIB 3 
neos20  1165  937  30  198  2446  train  MILPLib 
neos21  614  613  0  1  1085  train  MILPLib 
neos476283  11915  5588  0  6327  10015  train  MIPLIB 2010 
neos648910  814  748  0  66  1491  train  MILPLib 
pp08aCUTS  240  64  0  176  246  train  MIPLIB 3 
rmatr100p10  7359  100  0  7259  7260  train  MIPLIB 2010 
rmatr100p5  8784  100  0  8684  8685  train  MIPLIB 2010 
sp150x300d  600  300  0  300  450  train  MIPLIB 2017 
stein27  27  27  0  0  118  train  MIPLIB 3 
swath1  6805  2306  0  4499  884  train  MIPLIB 2017 
vpm2  378  168  0  210  234  train  MIPLIB 3 
map18  164547  146  0  164401  328818  test  MIPLIB 2010 
mine1665  830  830  0  0  8429  test  MIPLIB 2010 
neos11  1220  900  0  320  2706  test  MILPLib 
neos18  3312  3312  0  0  11402  test  MIPLIB 2010 
ns1830653  1629  1458  0  171  2932  test  MIPLIB 2010 
nu25pr12  5868  5832  36  0  2313  test  MIPLIB 2017 
rail507  63019  63009  0  10  509  test  MIPLIB 2010 
seymour1  1372  451  0  921  4944  test  MIPLIB 2017 
Appendix B Handcrafted Input Features
Handcrafted input features for candidate variables () and tree state () are reported in Table 8. To ease their reading, we present them subdivided in groups, and concisely describe them by the SCIP API functions with which they are computed. We make use of different functions to normalize and compare the solver inputs.
To compute the branching scores of a candidate variable , with respect to an average score , we use the formula implemented in SCIP relpscost [41]:
As in [4], we normalize inputs that naturally span different ranges by the following:
To compare commensurable quantities (e.g., upper and lower bounds), we compute measures of relative distance and relative position:
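The helper names varScore, gNormMax, relDist and relPos appear in Table 8, but their formulas are not legible in the text above. The sketch below is therefore illustrative only. The form of `var_score` follows the score normalization in SCIP's relpscost code, where each score is mapped to 1 - 1/(1 + s/avg) and the average is floored at 0.1. The other three definitions, including the floor of 0.1 and the tolerance `eps`, are assumptions chosen to match the verbal descriptions (normalization of quantities with different ranges, relative distance, relative position) and may differ from the definitions actually used.

```python
def var_score(score: float, avg_score: float) -> float:
    """Normalize a variable score against an average score.

    Follows the relpscost pattern 1 - 1 / (1 + s / max(avg, 0.1)).
    """
    return 1.0 - 1.0 / (1.0 + score / max(avg_score, 0.1))


def g_norm_max(x: float, floor: float = 0.1) -> float:
    """ASSUMED form: squash x >= 0 into [0, 1) via x / (x + 1), floored."""
    return max(x / (x + 1.0), floor)


def rel_dist(x: float, y: float, eps: float = 1e-10) -> float:
    """ASSUMED form: distance between x and y relative to the larger magnitude."""
    return abs(x - y) / max(abs(x), abs(y), eps)


def rel_pos(z: float, x: float, y: float, eps: float = 1e-10) -> float:
    """ASSUMED form: distance of z from x, relative to the width |x - y|."""
    return abs(x - z) / max(abs(x - y), eps)


# Toy usage with made-up numbers (not solver output).
upper, lower, lp_obj = 10.0, 8.0, 9.0
gap_feature = rel_dist(upper, lower)
position_feature = rel_pos(lp_obj, upper, lower)
```

All four helpers return dimensionless values, which is what allows inputs from instances of very different scale to be fed to the same model.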
We also make use of usual statistical functions such as min, max, mean, standard deviation std and the 25% and 75% quantile values (denoted in Table 8 as q1 and q3, respectively).
Further information on each feature can be gathered by searching the SCIP online documentation at https://scip.zib.de/doc6.0.1/html/.
Appendix C Solver Setting and Hardware
Regarding the MILP solver parameter settings, we use SCIP 6.0.1 and set a time limit of 1h on all B&B evaluations. We leave on presolve routines and cut separation (as in default mode), while disabling all primal heuristics and reoptimization (also off by default). To control SB side-effects and properly compute the fair number of nodes [17], we additionally turn off SB conflict analysis and the use of probing bounds identified during SB evaluations. We also disable feasibility checking of LP solutions found during SB with propagation, and always trigger the reevaluation of SB values. Finally, the known optimal solution value is provided as cutoff to each model, and a random seed determines variables’ permutations. Parameters are summarized in Table 6.
limits/time = 3600 
presolving/maxrounds = -1 
separating/maxrounds = -1 
separating/maxroundsroot = -1 
heuristics/*/freq = -1 
reoptimization/enable = False 
conflict/usesb = False 
branching/fullstrong/probingbounds = False 
branching/relpscost/probingbounds = False 
branching/checksol = False 
branching/fullstrong/reevalage = 0 
model.setObjlimit(cutoff_value) 
randomization/permutevars = True 
randomization/permutationseed = scip_seed 
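As a rough illustration of how these settings could be applied through a Python interface, the sketch below collects the explicit parameters in a dictionary and pushes them to a model object. It assumes only that the model exposes `setParam(name, value)` and `setObjlimit(value)`, as a PySCIPOpt `Model` does; the function names `scip_settings` and `apply_settings` are hypothetical. The value -1 is used for the round limits and heuristic frequencies because in SCIP it means "unlimited rounds" and "heuristic disabled", respectively, which matches the description in the text. The wildcard entry `heuristics/*/freq` is shorthand and not a literal parameter name, so it is left out of the dictionary.

```python
def scip_settings(scip_seed: int) -> dict:
    """Explicit (non-wildcard) solver parameters from the list above."""
    return {
        "limits/time": 3600,
        "presolving/maxrounds": -1,       # -1: unlimited presolving rounds
        "separating/maxrounds": -1,       # -1: unlimited separation rounds
        "separating/maxroundsroot": -1,
        "reoptimization/enable": False,
        "conflict/usesb": False,
        "branching/fullstrong/probingbounds": False,
        "branching/relpscost/probingbounds": False,
        "branching/checksol": False,
        "branching/fullstrong/reevalage": 0,
        "randomization/permutevars": True,
        "randomization/permutationseed": scip_seed,
    }


def apply_settings(model, cutoff_value: float, scip_seed: int) -> None:
    """Apply the settings to any object with setParam and setObjlimit."""
    for name, value in scip_settings(scip_seed).items():
        model.setParam(name, value)
    # Provide the known optimal value as a cutoff.
    model.setObjlimit(cutoff_value)
    # Primal heuristics must be disabled separately, one heuristic at a
    # time or, in PySCIPOpt, via model.setHeuristics(SCIP_PARAMSETTING.OFF).
```

Keeping the settings in a plain dictionary makes it easy to log the exact configuration alongside each rollout.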
For the IL experiments, we used the following hardware: two Intel Core(TM) i7-6850K CPU @ 3.60GHz, 16GB RAM and an NVIDIA TITAN Xp 12GB GPU. Evaluations of SCIP branching rules ran on dual Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz, equipped with 512GB of RAM.
Appendix D Data Augmentation
Instance  Set  Variability (seeds)  Shifted geometric means of nodes (five columns)  Variability (means) 
air04  train  0.20  8.19  12.02  46.57  85.35  119.49  0.79 
air05  train  0.26  60.25  61.07  115.94  196.27  274.44  0.59 
dcmulti  train  0.21  9.38  13.75  27.53  34.99  45.00  0.50 
eil332  train  0.69  583.34  648.47  492.95  531.37  441.24  0.13 
istanbulnocutoff  train  0.11  242.39  234.01  271.35  279.38  311.39  0.10 
l152lav  train  0.36  10.14  16.54  29.31  55.51  61.14  0.59 
lseu  train  0.43  148.99  152.65  154.16  182.55  177.35  0.09 
misc03  train  0.38  12.11  10.59  13.80  22.59  31.80  0.44 
neos20  train  1.22  200.26  282.68  557.15  434.03  944.75  0.54 
neos21  train  0.15  668.44  771.77  898.79  1110.82  1158.07  0.21 
neos648910  train  0.60  39.83  48.16  65.84  41.05  59.72  0.20 
neos476283  train  0.48  204.88  219.58  384.86  480.37  715.78  0.47 
pp08aCUTS  train  0.31  69.66  80.39  92.60  69.94  76.43  0.11 
rmatr100p5  train  0.04  411.93  419.21  451.01  461.83  494.09  0.07 
rmatr100p10  train  0.03  806.35  799.24  860.60  933.80  965.07  0.08 
sp150x300d  train  1.70  182.22  462.45  484.55  483.89  439.69  0.28 
stein27  train  0.42  926.82  1062.69  1098.41  1162.57  1154.01  0.08 
swath1  train  0.53  298.58  280.49  230.12  256.84  267.55  0.09 
vpm2  train  0.19  199.46  180.93  275.57  273.33  316.82  0.20 
map18  test  0.09  270.25  309.77  401.79  447.34  489.85  0.21 
mine1665  test  0.82  175.10  70.77  642.33  942.63  1619.75  0.81 
neos11  test  0.30  2618.27  3114.62  3488.40  2898.41  2659.96  0.11 
neos18  test  0.53  2439.29  2747.77  4061.40  4655.59  5714.05  0.31 
ns1830653  test  0.09  3489.07  3913.58  4091.59  4839.39  4772.73  0.12 
nu25pr12  test  1.18  21.39  16.97  56.04  101.34  119.05  0.66 
rail507  test  0.08  543.39  562.09  854.76  1207.15  1196.33  0.33 
seymour1  test  0.07  866.32  1174.18  1825.04  2739.45  3313.87  0.47 
To augment our dataset, we (i) run MILP instances with different random seeds to exploit performance variability [32], and (ii) perform random branchings at the top of the tree, before switching to the default SCIP branching rule and collecting data. To quantify the effects of such operations in diversifying the search trees, we compute coefficients of variation of performance measurements [26]. In particular, assuming performance measurements are available, we compute the variability score as
(3) 
Table 7 reports such coefficients for all instances, using as performance measures the number of nodes explored in the five runs from (i). Similarly, we report the shifted geometric means of the number of nodes over the five runs for each setting of the random branchings in (ii), and additionally compute the variability of those means across the different settings.
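The two kinds of summary statistics mentioned here can be sketched as follows. This is an illustration under stated assumptions rather than the exact computation behind Table 7: the coefficient of variation is taken in its textbook form (population standard deviation divided by the mean), which may differ from Eq. (3) by a normalization constant, and the `shift` of the shifted geometric mean is a free parameter whose value is not given in the text above. The node counts in the usage lines are made up.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (textbook form)."""
    return statistics.pstdev(values) / statistics.fmean(values)


def shifted_geometric_mean(values, shift=1.0):
    """exp(mean(log(v + shift))) - shift; the shift damps very small values."""
    logs = [math.log(v + shift) for v in values]
    return math.exp(statistics.fmean(logs)) - shift


# Hypothetical node counts from five seeded runs of one instance.
nodes = [120, 95, 210, 160, 140]
variability = coefficient_of_variation(nodes)
typical_size = shifted_geometric_mean(nodes, shift=100.0)
```

A shifted geometric mean is less sensitive to a single very large tree than the arithmetic mean, which is why it is commonly preferred for aggregating node counts.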
Appendix E IL Optimization Dynamics
Best Policies
We present additional plots of the optimization dynamics for the best selected NoTree and TreeGate policies. Figure 3 shows the training loss curves, as well as top-1 and top-5 validation accuracy curves. In general, we see that the TreeGate policy enjoys a better-conditioned optimization. Note however that for top-5 validation accuracy the two policies are quite close.
Instability of Batchnorm
As observed in Figure 3, optimization dynamics for NoTree seem to be much slower than those of TreeGate. One common option to speed up training is to use batch normalization (BN) [23]. In our architectures for branching, one may view the cardinality of the candidate sets as a batch dimension. When learning to branch across heterogeneous MILPs, such a batch dimension can (and will) vary by orders of magnitude. Practically, our dataset has varying from candidates to over 300. However, BN has been shown to struggle in the small-batch setting [47], and in general we were unsure of the reliability of BN with such variable batch sizes.
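The small-batch concern can be illustrated with a toy computation. The sketch below normalizes one feature over the candidate axis using batch statistics, as training-mode BN would, without learned scale or shift. The function name and the numbers are hypothetical, and this is not the model code used in the experiments.

```python
import statistics


def normalize_over_candidates(column, eps=1e-5):
    """Standardize one feature using the mean and variance of the candidate set."""
    mean = statistics.fmean(column)
    var = statistics.pvariance(column)
    return [(x - mean) / (var + eps) ** 0.5 for x in column]


# A single candidate: the feature value is erased entirely.
one = normalize_over_candidates([0.7])

# Two nearly identical candidates: a difference of 0.01 is blown up
# to values of large and opposite sign.
two = normalize_over_candidates([0.70, 0.71])
```

With very few candidates the batch statistics are dominated by noise, so the normalized inputs seen by later layers can change abruptly from one branching step to the next.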
Indeed, in our initial trials with BN we observed highly unreliable performance. Two troubling outcomes emerge when using BN in our NoTree policies: 1) the validation accuracy varies wildly, as shown in Figure 4, or 2) the NoTree+BN policy exhibits a stable validation accuracy curve, but hits the time limit on train instances, i.e., performs poorly in terms of solver performance. In particular, case 2) happened for a NoTree+BN policy with hidden size and , reaching the 1h time limit on train instance neos476283, over all five runs (on different seeds); the geometric mean of explored nodes was . We remark that in our non-BN experiments, all of our trained policies (both TreeGate and NoTree) managed to solve all the train instances without even coming close to the time limit. Moreover, none of our training and validation curves ever remotely resemble those in Figure 4(b).
For these reasons we opted for a more streamlined presentation of our results, without BN in the current framework. We leave it for future work to analyze the relationship between the nature of local minima in the IL optimization landscape and solver performance.
Group description (#)  Feature formula (SCIP API) 

Candidate state  
General solution (2)  SCIPvarGetLPSol 
SCIPvarGetAvgSol  
Branchings depth (2)  1 - (SCIPvarGetAvgBranchdepthCurrentRun / SCIPgetMaxDepth) [x2] 
Branching scores (5)  varScore(SCIPgetVarConflictScore, SCIPgetAvgConflictScore) 
varScore(SCIPgetVarConflictlengthScore, SCIPgetAvgConflictlengthScore)  
varScore(SCIPgetVarAvgInferenceScore, SCIPgetAvgInferenceScore)  
varScore(SCIPgetVarAvgCutoffScore, SCIPgetAvgCutoffScore)  
varScore(SCIPgetVarPseudocostScore, SCIPgetAvgPseudocostScore)  
PC stats (6)  SCIPgetVarPseudocostCountCurrentRun / SCIPgetPseudocostCount [x2] 
SCIPgetVarPseudocostCountCurrentRun / SCIPvarGetNBranchingsCurrentRun [x2]  
SCIPgetVarPseudocostCountCurrentRun / branch_count [x2]  
Implications (2)  SCIPvarGetNImpls [x2] 
Cliques (2)  SCIPvarGetNCliques / SCIPgetNCliques [x2] 
Cutoffs (2)  gNormMax(SCIPgetVarAvgCutoffsCurrentRun) [x2] 
Conflict length (2)  gNormMax(SCIPgetVarAvgConflictlengthCurrentRun) [x2] 
Inferences (2)  gNormMax(SCIPgetVarAvgInferencesCurrentRun) [x2] 
Search tree state  
Current node (8)  SCIPnodeGetDepth / SCIPgetMaxDepth 
SCIPgetPlungeDepth / SCIPnodeGetDepth  
relDist(SCIPgetLowerbound, SCIPgetLPObjval)  
relDist(SCIPgetLowerboundRoot, SCIPgetLPObjval)  
relDist(SCIPgetUpperbound, SCIPgetLPObjval)  
relPos(SCIPgetLPObjval, SCIPgetUpperbound, SCIPgetLowerbound)  
len(getLPBranchCands) / getNDiscreteVars  
nboundchgs / SCIPgetNVars  
Nodes and leaves (8)  SCIPgetNObjlimLeaves / nleaves 
SCIPgetNInfeasibleLeaves / nleaves  
SCIPgetNFeasibleLeaves / nleaves  
(SCIPgetNInfeasibleLeaves + 1) / (SCIPgetNObjlimLeaves + 1)  
SCIPgetNNodesLeft / SCIPgetNNodes  
nleaves / SCIPgetNNodes  
ninternalnodes / SCIPgetNNodes  
SCIPgetNNodes / ncreatednodes  
Depth and backtracks (4)  nactivatednodes / SCIPgetNNodes 
ndeactivatednodes / SCIPgetNNodes  
SCIPgetPlungeDepth / SCIPgetMaxDepth  
SCIPgetNBacktracks / SCIPgetNNodes  
LP iterations (4)  log(SCIPgetNLPIterations / SCIPgetNNodes) 
log(SCIPgetNLPs / SCIPgetNNodes)  
SCIPgetNNodes / SCIPgetNLPs  
SCIPgetNNodeLPs / SCIPgetNLPs  
Gap (4)  log(primaldualintegral) 
SCIPgetGap / lastsolgap  
SCIPgetGap / firstsolgap  
lastsolgap / firstsolgap  
Bounds and solutions (5)  relDist(SCIPgetLowerboundRoot, SCIPgetLowerbound) 
relDist(SCIPgetLowerboundRoot, SCIPgetAvgLowerbound)  
relDist(SCIPgetUpperbound, SCIPgetLowerbound)  
SCIPisPrimalboundSol  
nnodesbeforefirst / SCIPgetNNodes  
Average scores (12)  gNormMax(SCIPgetAvgConflictScore) 
gNormMax(SCIPgetAvgConflictlengthScore)  
gNormMax(SCIPgetAvgInferenceScore)  
gNormMax(SCIPgetAvgCutoffScore)  
gNormMax(SCIPgetAvgPseudocostScore)  
gNormMax(SCIPgetAvgCutoffs) [x2]  
gNormMax(SCIPgetAvgInferences) [x2]  
gNormMax(SCIPgetPseudocostVariance) [x2]  
gNormMax(SCIPgetNConflictConssApplied)  
Open nodes bounds (12)  len(open_lbs at {min, max}) / nopen [x2] 
relDist(SCIPgetLowerbound, max(open_lbs))  
relDist(min(open_lbs), max(open_lbs))  
relDist(min(open_lbs), SCIPgetUpperbound)  
relDist(max(open_lbs), SCIPgetUpperbound)  
relPos(mean(open_lbs), SCIPgetUpperbound, SCIPgetLowerbound)  
relPos(min(open_lbs), SCIPgetUpperbound, SCIPgetLowerbound)  
relPos(max(open_lbs), SCIPgetUpperbound, SCIPgetLowerbound)  
relDist(q1(open_lbs), q3(open_lbs))  
std(open_lbs) / mean(open_lbs)  
(q3(open_lbs) - q1(open_lbs)) / (q3(open_lbs) + q1(open_lbs))  
Open nodes depths (4)  mean(open_ds) / SCIPgetMaxDepth 
relDist(q1(open_ds), q3(open_ds))  
std(open_ds) / mean(open_ds)  
(q3(open_ds) - q1(open_ds)) / (q3(open_ds) + q1(open_ds)) 
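As an illustration of the dispersion features listed for open nodes, the following sketch computes the ratio std / mean and the quartile ratio (q3 - q1) / (q3 + q1) for a list of values such as open-node depths. The choice of population standard deviation and of the "inclusive" quantile method are assumptions, since the table does not specify them, and the function name and depths are hypothetical. The input needs at least two values, a non-zero mean, and a non-zero q1 + q3.

```python
import statistics


def dispersion_features(values):
    """Return std/mean and (q3 - q1)/(q3 + q1) for a list of numbers."""
    mean = statistics.fmean(values)
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "std_over_mean": statistics.pstdev(values) / mean,
        "quartile_dispersion": (q3 - q1) / (q3 + q1),
    }


# Hypothetical depths of the currently open nodes.
open_ds = [1, 2, 3, 4, 5]
features = dispersion_features(open_ds)
```

Both ratios are scale-free, so they can be compared across trees of very different depth.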
Footnotes
 Indeed, a “dynamic factor” takes care of adjusting hybrid weights in the default branching scheme of SCIP [41].
References
[1] (2005) Branching rules revisited. Oper Res Lett 33 (1), pp. 42–54.
[2] (2009) Hybrid Branching. In Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems: 6th International Conference, CPAIOR 2009 Pittsburgh, PA, USA, May 27-31, 2009 Proceedings, pp. 309–311.
[3] (2013) Mixed integer programming: analyzing 12 years of progress. In Facets of Combinatorial Optimization: Festschrift for Martin Grötschel, M. Jünger and G. Reinelt (Eds.), pp. 449–481.
[4] (2007) Constraint integer programming. Ph.D. Thesis.
[5] (2017) A machine learning-based approximation of strong branching. INFORMS Journal on Computing 29 (1), pp. 185–195.
[6] (1995) Finding cuts in the TSP (a preliminary report). Technical report, Center for Discrete Mathematics & Theoretical Computer Science.
[7] (2018) Learning to branch. In International Conference on Machine Learning, pp. 344–353.
[8] (2018) Machine learning for combinatorial optimization: a methodological tour d’horizon. arXiv:1811.06128.
[9] (1971) Experiments in mixed-integer programming. Math Program 1, pp. 76–94.
[10] (1998) An updated mixed-integer programming library: MIPLIB 3. Technical Report TR98-03.
[11] (2019) Learning action representations for reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 941–950.
[12] (2019) Lifelong learning with a changing action set. CoRR abs/1906.01770.
[13] (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555.
[14] (2015) Reinforcement learning in large discrete action spaces. CoRR abs/1512.07679.
[15] (2019) Learning MILP resolution outcomes before reaching time-limit. In Integration of Constraint Programming, Artificial Intelligence, and Operations Research, L. Rousseau and K. Stergiou (Eds.), Cham, pp. 275–291.
[16] (2012) Backdoor branching. INFORMS J Comput 25 (4), pp. 693–700.
[17] (2018) Measuring the impact of branching rules for mixed-integer programming. In Operations Research Proceedings 2017, N. Kliewer, J. F. Ehmke and R. Borndörfer (Eds.), Cham, pp. 165–170.
[18] (2019) Exact combinatorial optimization with graph convolutional neural networks. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox and R. Garnett (Eds.), pp. 15554–15566.
[19] (2018-07) The SCIP Optimization Suite 6.0. ZIB-Report Technical Report 18-26, Zuse Institute Berlin.
[20] (2018) Cuts, primal heuristics, and learning to branch for the time-dependent traveling salesman problem. arXiv:1805.01415.
[21] (2014) Learning to search in branch and bound algorithms. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence and K. Q. Weinberger (Eds.), pp. 3293–3301.
[22] (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780.
[23] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, pp. 448–456.
[24] (2016) Learning to branch in mixed integer programming. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pp. 724–731.
[25] (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR).
[26] (2011) MIPLIB 2010. Mathematical Programming Computation 3 (2), pp. 103.
[27] (1960) An automatic method of solving discrete programming problems. Econometrica 28 (3), pp. 497–520.
[28] (2017) An abstract model for branching and its application to mixed integer programming. Math Program, pp. 1–37.
[29] (2016) Feature pyramid networks for object detection. CoRR abs/1612.03144.
[30] (1999) A computational study of search strategies for mixed integer programming. INFORMS Journal on Computing 11 (2), pp. 173–187.
[31] (2012) Branch and bound—why does it work?. https://rjlipton.wordpress.com/2012/12/19/branchandboundwhydoesitwork/ (accessed 2019).
[32] (2013) Performance variability in mixed-integer programming. In Theory Driven by Influential Applications, pp. 1–12.
[33] (2013) Performance variability in mixed-integer programming. In Theory Driven by Influential Applications, pp. 1–12.
[34] (2017-07-01) On learning and branching: a survey. TOP 25 (2), pp. 207–236.
[35] (2009) Mixed integer programming computation. In 50 Years of Integer Programming 1958-2008, M. Jünger, T.M. Liebling, D. Naddef, G.L. Nemhauser, W.R. Pulleyblank, G. Reinelt, G. Rinaldi and L.A. Wolsey (Eds.), pp. 619–645.
[36] (2019) Learning in gated neural networks. CoRR abs/1906.02777.
[37] (2018) MIPLIB 2017. http://miplib.zib.de
[38] (2020) MILPlib. http://plato.asu.edu/ftp/milp/ (accessed 2019).
[39] (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox and R. Garnett (Eds.), pp. 8024–8035.
[40] (2019) PySCIPOpt. GitHub. https://github.com/SCIPInterfaces/PySCIPOpt
[41] Code for the relpscost branching rule in SCIP. https://scip.zib.de/doc6.0.0/html/branch__relpscost_8c_source.php#l00524 (accessed 2019).
[42] (2012) Guiding combinatorial optimization with UCT. In Integration of AI and OR Techniques in Contraint Programming for Combinatorial Optimzation Problems: 9th International Conference, CPAIOR 2012, Nantes, France, May 28 – June 1, 2012. Proceedings, N. Beldiceanu, N. Jussien and É. Pinson (Eds.), Lecture Notes in Computer Science, Berlin, Heidelberg, pp. 356–361.
[43] (2016) Beyond skip connections: top-down modulation for object detection. CoRR abs/1612.06851.
[44] (2018) ExGate: externally controlled gating for feature-based attention in artificial neural networks. CoRR abs/1811.03403.
[45] (2008) Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
[46] (1998) Integer programming. Wiley-Interscience, New York, NY, USA.
[47] (2018-09) Group normalization. In The European Conference on Computer Vision (ECCV).