Optimal Resampling for Learning Small Models
Abstract
Models often need to be constrained to a certain size for them to be considered interpretable, for e.g., a decision tree of depth 5 is much easier to make sense of than one of depth 30. This suggests a tradeoff between interpretability and accuracy. Our work tries to minimize this tradeoff by suggesting the optimal distribution of the data to learn from, that surprisingly, may be different from the original distribution. We use an Infinite Beta Mixture Model (IBMM) to represent a specific set of sampling schemes. The parameters of the IBMM are learned using a Bayesian Optimizer (BO). While even under simplistic assumptions a distribution in the original dimensional space would need to optimize for variables  cumbersome for most realworld data  our technique lowers this number significantly to a fixed set of 8 variables at the cost of some additional preprocessing. The proposed technique is modelagnostic; it can be applied to any classifier. It also admits a general notion of model size. We demonstrate its effectiveness using multiple realworld datasets to construct decision trees, linear probability models and gradient boosted models.
Keywords:
Interpretable Machine Learning∎
1 Introduction
As Machine Learning (ML) becomes pervasive in our daily lives, the need to know how models reach specific decisions has significantly increased. In certain contexts this might not be as important, as long as the ML model itself works well, e.g., in product or movie recommendations. But for certain others, such as medicine and healthcare (Caruana et al., 2015; Ustun and Rudin, 2016), banking (Castellanos and Nash, 2018), defence applications (Gunning, 2016) and law enforcement (Angwin et al., 2016; Larson et al., 2016) model transparency is an important concern.Very soon, regulations governing digital interactions might necessitate interpretability (Goodman and Flaxman, 2017).
All these factors have generated a lot of interest in the area of Model Interpretability or Explainable AI (XAI). Approaches in the area can be broadly divided into two categories:

Build interpretable models, e.g., rule lists (Letham et al., 2013; Angelino et al., 2017), decision trees (Breiman et al., 1984; Quinlan, 1993, 2004), sparse linear models (Ustun and Rudin, 2016), decision sets (Lakkaraju et al., 2016), pairwise interaction models that may be linear (Lim and Hastie, 2015) or additive (Lou et al., 2013).
Our interest is in the first category of techniques since this enables continued use of the existing vast pool of learning algorithms in applications requiring interpretability.
Interpretable models are preferably small in size: this is referred to as low explanation complexity in Herman (2017), is seen as a form of simulability in Lipton (2018) and is often listed as a desirable property (Ribeiro et al., 2016; Lakkaraju et al., 2016; Angelino et al., 2017) . For instance, a decision tree of is easier to understand than one of . Similarly, a linear model with nonzero terms might be easier to comprehend than one with nonzero terms. This indicates an obvious problem: an interpretable model is often small in its size, and since the size is usually inversely proportional to the bias, a model often sacrifices accuracy for interpretability.
We propose a novel adaptive sampling technique to minimize the tradeoff between size and accuracy. Our approach is model agnostic: it may be applied to minimize the size/accuracy tradeoff for any model that a user decides is interpretable for her problem. This is made possible by the fact that we ignore the model specification completely and rely solely on modifying the training data to improve accuracy.
We would like to point out that even though there is no standard notion of size across model families, or even within a model family, we assume the term informally denotes some model attribute(s) that influence its bias. In practice, the same model family may have multiple definitions of size depending upon the modeler, e.g., depth or number of leaves for a tree  or the size might even be decided by multiple attributes in conjunction, e.g., maximum depth of each tree and number of boosting rounds in the case of a gradient boosted model (GBM). It’s also possible that while users of a model might agree on a definition of size they might disagree on the value for the size up to which the model stays interpretable, e.g., are decision trees interpretable up to a depth of or ? Clearly, the definition of size and its admissible values may vary across individuals and applications. However, irrespective of this subjectivity, as long as the notion of size is inversely proportional to the bias, the discussion in this paper remains valid. With this general notion in mind, we say that interpretable models are typically small.
Let’s begin with a quick demonstration of the effectiveness of using a modified distribution for learning. We have the binary class data, shown in Figure 1, that we wish to classify with decision trees with . We refer to the data distribution provided as the original distribution.
Our training data is a subset of this data (not shown). The data is approximately uniformly distributed in the input space  see the 2D histogram in the topleft panel in Figure 2. The bottomleft panel in the figure shows the regions a decision tree of depth 5 learns. The topright panel shows a modified distribution of the data, and the regions of the corresponding decision tree of depth 5 is shown in the bottomright panel. Both decision trees used the same learning algorithm  CART (Breiman et al., 1984)  and both trees have a . As we can see, the accuracies are vastly different.
Where does this additional accuracy come from?
All classification algorithms use some heuristic to make learning tractable, e.g.:

Decision Trees  one step lookahead (note that the CART tree has a significantly smaller number of leaves than the possible , in our example)

Logistic Regression  local search techniques, e.g., Stochastic Gradient Descent (SGD)

Artificial Neural Networks (ANN)  local search based optimizer e.g. SGD, Adam
Increasing the size works around the shortcomings of the heuristic by overparameterizing the model till it is satisfactorily accurate: increasing depth, terms, hidden layers or nodes per layer. In being restricted to use a model of a specific size, the possible gap between the effective and representational capacities affects model accuracy. Modifying the density to better focus the model on regions of the input space that are most valuable in terms of accuracy, enhances the effective capacity, thus decreasing this gap.
Figure 3 depicts the modified model building workflow indicating how the sampling technique is used. In the standard workflow, we feed the data into a learning algorithm to obtain a model. In our setup, the data is presented to a component that subsumes both the learning algorithm and our sampling algorithm; this component  represented by the dashed box  produces the final model.
Using a distribution different from the one provided to learn from, is not unprecedented. We see this commonly in the case of learning on data with class imbalance where over/undersampling is used (Japkowicz and Stephen, 2002). Seen from this perspective, we are positing that modifying the original distribution is helpful in a wider set of circumstances.
Our key contributions in this work are:

Proposing and demonstrating the effect that the ideal distribution for learning a small model may be different from the original distribution. This challenges the general wisdom that the training data must come from the original distribution.

Proposing a sampling based technique that utilizes this effect to learn accurate small models. Although we believe this effect exists for all ML models, it finds utility in applications requiring interpretability, since preferred models tend to be small. The effectiveness of this technique is shown on different learning algorithms and multiple real world datasets.
While the focus of this paper is our technique, we hope that the effect itself prompts a larger discourse.
We are aware of no prior work that studies the relationship between data density and accuracy in the small model regime. In terms of the larger view of modifying density to influence learning, parallels may be drawn to active learning (Settles, 2009), and the identification of coresets (Bachem et al., 2017; Munteanu and Schwiegelshohn, 2018).
The rest of the paper is organized as follows: in Section 2 we describe in detail two formulations of the problem of learning the optimal density. Section 3 reports experiments we have conducted to evaluate our technique. It also presents our analysis of the results. We conclude with Section 4 where we discuss possible extensions and additional applications of our technique.
2 Methodology
In this section we describe our sampling technique. We start with a simple formulation to motivate discussion around the challenges of our broader approach. Subsequently, we detail the actual technique we use.
2.1 A Naive Formulation
We phrase the problem of finding the ideal density (for the learning algorithm) as an optimization problem. We represent the density over the input space with the pdf , where is a parameter vector. Our optimization algorithm runs for a budget of time steps. We loosely express sampling from as . Algorithm 1 lists the execution steps.
In Algorithm 1:

Our objective function is denoted by . This may specifically measure F1score, AUC, lift etc as needed. It is evaluated on the validation set . The result of a call to is represented by .

denotes the learning algorithm we are interested in, e.g., a decision tree or a sparse linear model learner.

is a call to the optimizer that accepts all past values of the objective function and the density parameter . Note that not all optimizers need this information, but we refer to a generic version of optimization in our pseudo code.

Although the training happens on a sample drawn from the current density , the validation dataset isn’t modified by the algorithm and always reflects the original distribution. Hence, represents quality of a model on the original distribution.

For a sample , where do the come from? For now, we assume that we have kept aside a training set , from which we draw samples based on .
Algorithm 1 represents a general framework to discover the optimal density within a time budget . The key aspects to consider in an implementation are :

The optimizer to use for .

The precise representation of the .
We look at these next.
2.1.1 Optimization
Having a good optimizer is a crucial requirement for Algorithm 1 to be useful, especially since we work within a fixed time budget of . Here are some characteristics we need our optimizer to possess:

Requirement 1: it should be able to work with a blackbox objective function. Our objective function is , which depends on a model produced by the learning algorithm . Since our interest is to propose a technique that is universally applicable across learning algorithms, we want the optimizer to make no assumptions about the form of . This effectively makes a blackbox function.

Requirement 2: should be robust against noise. Results of may be noisy. There are multiple sources of noise, e.g.:

the model itself is learned on a sample

the classifier might use a local search method like SGD whose final value for a given training dataset depends on various factors like initialization, order of points etc.


Requirement 3: minimizes calls to the objective function. Since calls to per iteration are expensive, we want the optimizer to avoid them as much as possible, instead shifting the burden of computation to the optimization strategy.
The class of optimization algorithms referred to as Bayesian Optimization (BO) satisfy all of the above criteria (Brochu et al., 2010). They build their own model of the response surface over multiple evaluations of the objective function; this model serves as a surrogate for the actual, possibly blackbox, objective function, and the BO only relies on this (requirement 1 above). The model gets better with successive iterations since it has an increasing number of evaluations to generalize from. It is also probabilistic; this helps it to quantify uncertainties in the evaluations, leading it to account for reasonable amounts of noise (requirement 2). Since every call to is informed by this model, it focuses on only the most promising regions in the input space in a principled manner, thus minimizing trial and error (requirement 3) ^{1}^{1}1This makes BO an attractive candidate for hyperparameter tuning..
For more details refer Brochu et al. (2010).
The family of BO algorithms is fairly large (Rana et al., 2017; Snoek et al., 2015; Hutter et al., 2011; Snoek et al., 2012; Bergstra et al., 2011; Wang et al., 2013; Li et al., 2017; Letham et al., 2017; Malkomes and Garnett, 2018). We use the Tree Structured Parzen Estimator (TPE) algorithm (Bergstra et al., 2011) since it scales linearly with the number of evaluations and has a mature library: Hyperopt (Bergstra et al., 2013).
2.1.2 Density Representation
Our , , must possess the following characteristics:

It must be able to represent an arbitrary density function.

Have a fixed set of parameters. This is required since most optimizers cannot handle a setup where the number of variables to optimize for depends on the specific values of certain variables. For instance, this happens to be the case for the popular Gaussian Mixture Model (GMM); assuming components in dimensions, and allowing the covariance matrix to be an identity matrix, we have density parameters; thus, the total number of parameters depends on the value of .
The Infinite Gaussian Mixture Model (IGMM) (Rasmussen, 1999), which is a nonparametric Bayesian extension to the standard GMM, satisfies these criteria. It sidesteps the problem of explicitly denoting the number of components by representing it using a Dirichlet Process (DP). The DP is characterized by a concentration parameter , which determines both the number of components (also known as partitions or clusters) and association of a data point to a specific component. The parameters for these components are drawn from prior distributions; the parameters of these prior distributions comprises our (fixed set of) variables.
Since our data is limited to a finite region or “bounding box” within (this region is easily found by determining the min and max values across points, for each dimension, ignoring outliers if needed), we replace the Gaussian mixture components with a multivariate generalization of the distribution over this region and refer to the mixture model as an Infinite Beta Mixture Model (IBMM) ^{2}^{2}2We justify this name by noting that there is no one multivariate generalization of the : the Dirichlet distribution is a popular one, but there are others, e.g., Olkin and Trikalinos (2014).
This is how we sample points from the IBMM:

Determine partitioning over the points induced by the . Let’s assume this step produces partitions and quantities where . Here, denotes the number of points that belong to partition .

For point associated with partition , we make the simplifying assumption of independence across dimensions^{3}^{3}3We do this to minimize the number of parameters; this is similar to using a diagonal covariance matrix in GMMs., and let , where is the dimension of . We assume the priors for these are also represented by distributions: and , where . Thus we have two prior distributions associated with a dimension.

For a partition , we sample based on . The sampled values are scaled to be commensurate with the bounding box previously determined.
For dimensional data, we have . This is a total of parameters for representing our pdf.
Note that we use the IBMM purely for representational convenience. Here, all parameters are learned by the optimizer, so we ignore the standard associated machinery for estimation or inference.
For further details about this form of the pdf, the interested reader is referred to Rasmussen (1999).
2.2 Challenges
Our primary challenge with this formulation is the size of the parameter search space. For most real world datasets, optimizing over variables leads to an impractically high runtime even using a fast optimizer like TPE. We successfully tried out Algorithm 1 on small toy datasets as proofofconcept.
One could also question the independence assumption for dimensions, but that doesn’t address the larger problem that learning the density directly would need at least variables for optimization.
2.3 An Efficient Approach using Decision Trees
The key question to address is improving the efficiency of the density determination process. Recall that we want to eventually solve a classification problem on our validation set. An interesting possibility to consider is, given we already have the data, can we restrict the search space of the parameters of the ?
One form of information important to the classifier is the location of the class boundaries. We need to also account for the fact that the learning algorithm might focus on a certain subset of the boundary region. Hence, we consider speeding up the optimization by:

Explicitly representing boundary locations. Compare this to Algorithm 1, where the optimizer needs to discover these.

Parameterizing the problem in a way that a subset of these locations may be ignored.
How do we represent boundary locations?
We consider an interesting property of decision trees (DT) here. A DT fragments its input space into axisparallel rectangles. Figure 4 shows what this looks like when we learn a CART tree on the dataset from Figure 1.
Note how regions with relatively small areas almost always occur near boundaries. This happens here since none of the class boundaries are axisparallel and the DT, in being constrained in representation to axisparallel rectangles, must use multiple small rectangles to approximate the curvature of the boundary for regions adjacent to it. Figure 5 shows a magnified view of the interaction of leaf edges with boundaries. The first panel shows how trapezoid leaves might closely approximate boundary curvature. However, since the DT may only use axisparallel rectangles, we are led to multiple small rectangles as an approximation, as shown in the second panel.
We exploit this geometrical feature; in general, leaf regions with relatively small areas (volumes, in higher dimensions) produced by a DT, represent regions close to the boundary. Instead of determining an optimal pdf on the input space, we now do the following:

Learn a DT on the data . Assume the tree produces leaves, where the region in the input space encompassed by a leaf is denoted by .

Define a pmf over the leaves, that assigns mass to a leaf in inverse proportion to its volume. Let be a random variable denoting a leaf. Our pmf is .

To sample a point, sample a leaf based on the pmf first, and then sample a point from within this leaf assuming a uniform distribution:

Sample a leaf, .

Sample a point within this leaf, .
The probability of sampling outside any is set to . Since leaves are characterized by low entropy over the label distribution, we can assign the majority label for a leaf , to points sampled from it. Let . Then,
where, 
We refer to such a DT, along with a suitably defined pmf over its leaves, as a density tree, since it helps us define a pdf over the input space .
While this simple scheme represents our approach at a highlevel, it raises a few questions; we discuss these below:

What function must we use to construct our pmf? One could just use where is the normalization constant . However, this quantity changes rapidly with volume. Consider a hypercube with edgelength in dimensions; the difference in the (nonnormalized) mass between this and another hypercube with edgelength is . Not only is this change drastic, but it also has potential for numeric underflow.
An alternative is to use a “milder” function like the inverse of the length of the diagonal, . Since DT leaves are axisparallel hyperrectangles, is always well defined. In our hypercube example, the the probability mass is when the edgelength is . The difference in the nonnormalized masses between the two cubes is now .
We use the inverse of the diagonal length in our experiments.

The previous point begs the question: is there a better pmf we could use? Instead of finding the optimal pmf, we propose a simpler solution: given a pmf, allow the sampling scheme to modify it to some extent. In our case, we allow for Laplace smoothing, with the smoothing coefficient . This modifies our pmf thus:
Here, is the normalization constant. Our algorithm discovers the optimal value for .
We pick Laplace smoothing because it is fast. Our framework, however, is general enough to admit a wide variety of options (discussed in Section 4).

An obvious flaw in our geometric view is if a boundary is axisaligned, there are no leaf regions of small volume along this boundary. An easy way to address this problem is to transform the data by rotating or shearing it, and then construct a decision tree. See Figure 6. The image on the left shows a DT with two leaves constructed on the data that has an axisparallel boundary. The image on the right shows multiple leaves around the boundary region, after the data is slightly rotated.
The idea of transforming data by rotation is not new (Rodriguez et al., 2006; Blaser and Fryzlewicz, 2016). However, a couple of significant differences in our setup are:

We don’t require a specific transformation; any transformation that produces small leaf regions near the boundary works for us. As mentioned, this needn’t only be rotation.

Since we are sampling points to learn a classifier that we eventually want to interpret in the original input space, we need to transform back our sample. This would not be required, say, if our only goal is increase classification accuracy.
The need to transform back introduces an additional challenge: we cannot drastically transform the data since sampled points in the transformed space might become outliers in the original space. Figure 7 illustrates this idea. The first panel shows data in the transformed space. The solid rectangle bounds the region at the root node; since this must have axisparallel sides, it is defined by the extremities of the region occupied by the transformed data. Any point within this region becomes part of some leaf when the DT is grown. Consider point  it is valid for our sampler to pick this. The second panel shows what the training data and leafregions look like when they are transformed back to the original space. Clearly, the leaves may extend well beyond the region occupied by the data, and here, becomes an outlier.
An efficient solution to this problem is to restrict the extent of transformation using a “near identity” matrix : this has as its diagonal elements, and the other elements are close to ; we sample the offdiagonal elements from , where and small. This only slightly transforms the data, so that we obtain small volume leaves at class boundaries, but also, all valid samples are within the original data region or close to it. The tree is constructed on , where is the original data, and samples from the tree, , are transformed back with . Figure 6 is actually an example of such a nearidentity transformation.
How do we know when to transform our data, i.e., when do we know we have axisaligned boundaries? Since this is hard to determine, we create multiple trees, each on a transformed version of the data (with different transformation matrices ), and randomly pick a tree to sample from its leaves. It is highly unlikely that all trees from this bagging step simultaneously see axisaligned boundaries in the respective transformed space. We denote this bag of trees and their corresponding transformations by . Bagging also provides the additional benefit of low variance. Algorithm 2 details how is created. We set for our experiments.


Since we rely on geometric properties alone to define our pmf, all boundary regions receive a high probability mass irrespective of how much they contribute to classifier accuracy. This is not desirable when the classifier is small and must focus on a few high impact patterns. In other words, the problem is that while the pmf reflects how close a region is to a boundary, it completely ignores its impact on learning.
Figure 8 suggests a solution. It shows two trees of different depths learned on the same data. The data has a small green region, shown with a dashed blue circle in the first panel. A smaller tree, shown on the left, automatically ignores this, while a larger tree, on the right, identifies a leaf around it.
One way to utilize this property would be to construct density trees of different depths, and then learn a distribution over these trees to sample from. But since learning trees of different depths (actually we require multiple trees at a given depth, for the bagging effect discussed in the previous point) can be time consuming, we instead learn a “depth distribution” over fully grown density trees.
A sample from a depthdistribution tells us the depth in the density tree at which we should consider nodes to sample points from. We treat these nodes as our leaves, constructing our smoothed pmf over them as before. This distribution is represented as a IBMM on one dimension  the depth. Figure 9(a) illustrates this idea. Similar to the representation described in Section 2.1.2, the depthdistribution has a parameter for the DP and parameters for its Beta priors: .
Since we use CART to learn our density trees, we have binary trees that are always full, but not necessarily complete i.e. the nodes at a certain depth alone might not represent the entire input space. To sample at such depths, we “back up” to the nodes at the closest depth. Figure 9(b) shows this: at and , we can construct our pmf with only nodes available at these depths, and respectively, and still cover the whole input space. But for and , we consider nodes and respectively. The dotted red line connects the nodes that contribute to the pmf for a certain depth.
Note that when we were sampling only from the leaves of a density tree, we decided that we could just assign the majority label to the samples given low label entropy. This is not true at nodes at intermediate levels which may have high label entropies. We deal with this problem by defining an entropy threshold . If the label distribution at a node has , we sample uniformly within the region encompassed by the node (which may be a leaf or an internal node) and use the majority label. This potentially generates new points. However, if the , we sample from the training data points that the node covers. can be set to a reasonably low value like and need not be learned.
This completes the description of our sampling technique. Summarizing the parameters mentioned above, we have:

, regularization coefficient for Laplace smoothing.

, the DP parameter.

, the parameters of the Beta priors for the IBMM depth distribution. A cluster/partition is described by , where , .
Additionally, we propose the following parameters:

, sample size. The sample size can have a significant effect on model performance. We let the optimizer determine the best sample size to learn from. We constrain to be larger than the minimum number of points needed for statistically significant results.

 proportion of the sample from the original distribution. Given a value for , we sample points from the density tree(s) and points from our training data . The sample from is stratified based on . Recall that our hypothesis is that learning a distribution might help when the model is small; but at larger sizes we expect learning to be optimal on the original distribution. This parameter helps the optimizer to gracefully transition to the original distribution, by setting , as we run our algorithm on increasing model sizes. In fact, observing a gradual transition would empirically validate our hypothesis.
See section A.1 for additional details.
We have a total of 8 parameters in this technique. The parameters relevant to picking the right depth are collectively denoted by . The complete set of parameters is denoted by .
Although our bag of density trees, , has more than one tree, the parameter is shared; we sample a value , and scale and discretize it to reflect a valid value for depth for the density tree being considered. For instance, if we want to sample from tree where , implies we must sample at i.e. at the root, and implies we must sample at . Algorithm 3 shows how sampling from works.
This is a welcome departure from our naive solution: the number of optimization variables does not depend on the dimensionality at all! We have a fixed set of 8 variables for any data. Of course, there is a hidden cost we incur in learning the bag of density trees, but the total run time is still much lower than the previous solution.
As before, we discover the optimal using TPE. We begin by splitting our dataset into train, validation and test sets  respectively. We then construct our bag of density trees, , on transformed versions of , as described in Algorithm 2. At each iteration in the optimization, based on the current value of , we sample data from and , train our model on this sample, and evaluate it on . Algorithm 4 lists these steps.
3 Experiments
This section discusses our experiments. In Section 3.1, we describe our setup: datasets, model families, parameter settings, etc. Section 3.2 presents the results and our analysis.
3.1 Setup
We evaluate Algorithm 4 using different models on multiple real world datasets. The datasets were obtained from the LIBSVM website (Chang and Lin, 2011). These are listed in Table I.
dataset  dimensions  # classes 

codrna  8  2 
ijcnn1  22  2 
higgs  28  2 
covtype.binary  54  2 
phishing  68  2 
a1a  123  2 
pendigits  16  10 
letter  16  26 
Sensorless  48  11 
senseit_aco  50  3 
senseit_sei  50  3 
covtype  54  7 
connect4  126  3 
We use the following models:

DT: We use the implementation of CART in the scikitlearn library (Pedregosa et al., 2011). Our notion of size here is the depth of the tree.

Linear Probability Model (LPM): This is a linear classifier. Our notion of size is the number of terms in the model, i.e., features from the original data with nonzero coefficients. We use our own implementation that heavily uses scikitlearn. Since LPMs inherently handle only binary class data, for a multiclass problem, we construct a onevsrest (ovr) model, comprising of as many binary classifiers as there are distinct labels. The given size is enforced for each binary classifier. For instance, if we have a 3class problem, and we specify a size of , then we construct binary classifiers, each with terms. We did not use the more common Logistic Regression classifier because: (1) from the perspective of interpretability, LPMs provide a better sense of variable importance (Mood, 2010) (2) we believe our effect is equally well illustrated by either linear classifier.
We use the Least Angle Regression (Efron et al., 2004) algorithm, that grows the model one term at a time, to enforce the size constraint.

Gradient Boosted Model (GBM): We use decision trees as our base classifier in the boosting. Our notion of size is the number of trees in the boosted forest for a fixed maximum depth of the base classifiers. We make use of the implementation provided by the LightGBM library (Ke et al., 2017).
We run two sets of experiments with the GBM, with maximum depths fixed at and . This helps us compare the impact of our technique across models ( in Algorithm 4) with inherently different expressive powers, e.g., we would expect a GBM with trees and a maximum depth of to be more accurate than a GBM with trees and a maximum depth of .
The density trees themselves are built using the CART implementation from scikitlearn. We draw a sample from the IBMM using BlackwellMacQueen sampling (Blackwell and MacQueen, 1973).
We list the various settings for our experiments below:

Since TPE performs optimization with box constraints, we need to specify our search space for the various parameters in Algorithm 4:

: this is varied in the logspace such that .

: We want to allow the algorithm to arbitrarily mix samples from and . Hence, we set .

: We set . The lower bound ensures that we have statistically significant results. The upper bound is set to a reasonably large value.

: For a DP, . We use a lower bound of .
We rely on the general properties of a DP to estimate an upper bound. For points, the expected number of components of a has an upper bound of , where is the harmonic sum (Teh, 2010). Since our distribution is over the depth of a density tree, we already know the maximum number of components possible, . Thus the upper bound for must be . We estimate this upper bound with , where . We use this value of since we assume that the we would have enough points (lower bound of the parameter) to at least sample once from each depth; also, typically , giving us a liberal upper bound.
Thus, we set , where . We use .

: Each of these parameters are allowed a range to admit various shapes for the distributions.


Iterations: We need to provide a budget of iterations for the TPE to run. In the case of DT, GBM and binary class problems for LPM, . Since multiclass problems in LPM require learning multiple classifiers, leading to high running times, we use a lower value of .

Model sizes:

DT: For a dataset, we first learn an optimal tree based on the F1score, without any size constraints. Denote the depth of this tree by . We then try our algorithm for the these settings of CART’s parameter: , i.e., we experiment only up to a model size of , stopping early if we encounter the optimal model size. Stopping early makes sense since the model has attained the size needed to capture all patterns in the data; changing the input distribution is not going to help much/at all beyond this point.
Note that since the actual tree depth is only guaranteed to be , we might not see all sizes in , e.g., might give us a tree with , might also result in a tree with , but might give us a tree with .

LPM: For a dataset with dimensionality , we construct models of sizes: . Here, the early stopping for LPM happens only for the dataset codrna, which has . All other datasets have (see Table I).

GBM: If the optimal number of boosting rounds for a dataset is , we explore the model size range: . We run two sets of experiments with GBM  one using base classification trees with , and another with . Both experiments use the same range for size/boosting rounds.

At each model size, we record the percentage relative improvement in the (macro) score on compared to the baseline of training with the original distribution. This score is calculated as:
Since the original distribution is part of the optimization search space, e.g., , the lowest improvement we report is .
3.2 Observations
The DT results are shown in Table II. A series of unavailable scores, denoted by “”, toward the right end of the table for a dataset denotes we have already reached its optimal size. For ex in Table II, codrna has an optimal size of . A missing value before the optimal size denotes that the corresponding actual size was not obtained for any setting of the parameter. An example is for codrna.
For each dataset, the best improvement across different sizes is shown in bold. The horizontal line separates binary datasets from multiclass datasets.
depth =  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15 

datasets  
codrna  0.67    0.34  1.84  0.00  0.00  0.70  0.00  0.46  0.00           
ijcnn1  4.88  19.38  12.01  16.12  3.42  0.91  1.15  0.57  0.00  0.00  0.19  0.00  0.00  0.00   
higgs  7.11  1.91  0.00  0.00  0.00  0.94                   
covtype.binary  0.00  0.00  0.01  0.68  0.47  0.00  0.00  0.55  1.89  1.44           
phishing  0.00  0.71  0.00  0.00  0.46  0.00  0.08  0.00  0.00  0.00  0.00  0.00  0.00    0.00 
a1a  0.00  2.50  7.23    4.91  3.06  4.75  0.88  0.87    0.85    2.35    0.92 
pendigits  10.14  3.46  3.04  6.71  5.97  4.21  1.61  0.00  0.00  0.00  0.00  0.00  0.00     
letter  0.18  9.71  31.91  36.26  30.03  17.73  6.76  3.65  1.90  1.95  0.00  0.00  0.00  0.00  0.00 
Sensorless  0.00  41.80  85.34  65.99  27.63  14.84  7.40  5.29  2.31  1.43  0.77  0.47    0.84   
senseit_aco  18.69  0.51  5.29  1.98  1.44  0.56                   
senseit_sei  1.97  0.68  1.56  0.09  1.30  0.10                   
covtype  16.89  125.49  16.73  5.29  10.83  7.31    5.91  8.18  8.15  11.49  0.00  0.00  1.34  2.06 
connect4  181.76  22.50  18.55  25.13    23.79    4.20  0.00  3.08  2.21  1.92  0.00  0.00  0.68 
This data is also visualized in Figure 10. The xaxis shows a “normalized size” for easy comparison; actual model sizes for a dataset are divided by the maximum actual model size explored. This allows us to compare a dataset like codrna, which only has models up to a size of , with , where model sizes go all the way up to .
We observe that aside from the datasets codrna, covtype.binary, phishing, we see significant improvement in the F1score for at least one model size. More so, these improvements seem to happen at small sizes: only one best score  for covtype.binary  shows up on the right half of Table II. The best improvements themselves vary a lot, ranging from for phishing to for connect4.
It also seems that we do much better with multiclass data than with binary classes. Because of the large variance in improvements, this is hard to observe in Figure 10. However, if we separate out the binary and multiclass results, as in Figure 11, we note that there are improvements in both the binary and multiclass cases, and the magnitude in the latter are typically higher (note the yaxes). We surmise this happens because, in general, DTs of a fixed depth have a harder problem to solve when the data is multiclass, providing our algorithm with an easier baseline to beat.
Figure 12 shows the behavior of , only for the datasets where our models have grown to the optimal size (the last column in Table II for these datasets are empty). Thus, we exclude phishing, a1a, letter, covtype, connect4. We observe that indeed as our model grows to the optimal size. This empirically validates our hypothesis that smaller models prefer a distribution different from the original distribution to learn from, but the latter is optimal for larger models. And we gradually transition to it as model size increases.
Demonstrating this effect is a key contribution of our work.
We are also interested in knowing what the depthdistribution IBMM looks like. This is challenging to visualize for multiple datasets in one plot, since we have an optimal IBMM learned by our optimizer, for each size setting. We summarize this information for a dataset in the following manner:

We identify a sample size of points to use.

We allocate points to sample for the IBMM for a particular model size in proportion of the relative improvement in the F1 score. For instance, if we have experimented with 3 model sizes, where the improvements seen in the F1score are and , we sample and points respectively from the IBMMs of these models.

We combine the points into one sample of size , fit a Kernel Density Estimate (KDE) to it, and plot the KDE curve. This plot represents the IBMM across model sizes for a dataset weighted by the improvement seen for a size.
We use . The value for the KDE bandwidth parameter was arrived at by trial and error.
Figure 13 shows such a plot for DTs. The xaxis represents the depth of the density tree normalized to . The smoothing by the KDE causes some spillover beyond these bounds.
We observe that, in general, the depth distribution is concentrated either near the root of a density tree, where we have little or no information about class boundaries and the distribution is nearly identical to the original distribution, or at the leaves, where we have complete information of the class boundaries. An intermediate depth is not used as much. Our best guess as to why this is so is that the information provided at an intermediate depth  where we have moved away from the original distribution, but have not yet completely discovered the class boundaries  might be relatively noisy to be useful. This pattern in the depth distribution is surprisingly consistent across all the models we have experimented with.
# terms =  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15 

datasets  
codrna  13.95  29.93  66.56  76.41  67.93  15.30  2.44  3.76               
ijcnn1  19.68  7.09  0.00  2.06  0.20  1.15  0.20  0.41  1.03  0.87  0.41  1.61  2.20  0.15  3.45 
higgs  0.00  0.11  0.21  0.11  1.08  1.29  1.12  2.87  1.17  1.39  4.11  2.92  4.37  3.36  4.96 
covtype.binary  0.40  3.68  4.03  6.49  9.24  11.61  11.27  4.76  3.71  8.70  9.48  8.33  10.43  13.76  12.82 
phishing  0.00  0.00  0.00  0.00  0.00  0.15  0.94  0.00  0.00  0.01  0.00  0.00  0.00  0.00  0.33 
a1a  0.00  41.77  62.21  0.00  0.00  1.55  0.38  1.25  1.81  0.10  1.89  2.68  2.95  0.69  2.89 
pendigits  11.36  9.94  10.38  4.82  9.06  2.36  3.10  0.95  1.40  0.96  0.58  0.43  0.82  0.00  0.35 
letter  8.94  16.12  23.44  11.49  8.55  3.42  0.00  1.44  1.27  5.18  0.00  1.88  5.55  6.70  6.88 
Sensorless  59.59  60.09  19.07  26.65  26.38  34.96  38.89  47.82  37.05  49.47  34.26  44.64  46.68  40.58  44.13 
senseit_aco  19.27  79.89  68.00  32.05  13.35  16.77  8.49  6.00  1.87  3.64  1.30  1.67  2.25  0.34  0.13 
senseit_sei  137.91  44.84  26.91  7.67  2.25  0.59  0.00  2.13  0.95  0.89  0.71  1.70  0.08  0.40  0.48 
covtype  43.36  19.21  4.96  6.01  3.08  0.72  4.83  7.60  6.46  5.39  4.29  1.76  2.20  5.07  3.10 
connect4  0.00  38.59  50.94  30.86  8.71  6.57  5.31  13.10  10.80  4.98  1.68  1.98  0.00  0.00  2.33 
The results for LPM are shown in Table III. The improvements look different from what we observed for DT, which is to be expected across different model families. Notably, compared to DTs, there is no prominent disparity in the improvements between binary class and multiclass datasets. Since the LPM builds ovr binary classifiers in the multiclass case, and the size restriction  number of terms  applies to each one of them, this intuitively makes sense.
Figure 14 shows the plots for improvement in the F1score and the weighted depth distribution. Note that unlike the case of the DT, we haven’t determined how many terms the optimal model for a dataset has; we go up to where possible, i.e., . The depth distribution plot displays concentration near the root and the leaves, similar to the case of the DT in Figure 13.
An interesting question to ask is how, if at all, the strength of the model family of in Algorithm 4, influences the improvements in accuracy. We cannot directly compare DTs with LPMs since we don’t know how to order models from different families: a DT of what depth is best compared to a LPM with, say, nonzero terms?
To answer this question we look at GBMs where we identify two levers to control the model size. We constrain the of the base classifier trees to and , and vary the number of boosting rounds in .
We recognize that qualitatively there are two factors at play:

A weak model family implies it might not learn sufficiently well from the samples our technique produces. Hence, we expect to see smaller improvements than when using a stronger model family.

A weak model family implies there is a lower baseline to beat. Hence, we expect to see larger improvements.
We present an abridged version of the GBM results in Table IV. The complete data is made available in Table V in the Appendix. We present both the improvement in the score, , and its new value, .
boosting rounds =  1  2  3  4  5  6  7  8  9  10  

datasets  max depth  score type  
Sensorless  2  0.78  0.79  0.80  0.82  0.82  0.82  0.82  0.82  0.82  0.83  
5.21  6.61  5.34  6.24  3.89  3.21  4.32  3.67  3.61  4.83  
5  0.90  0.91  0.92  0.93  0.93  0.93  0.94  0.94  0.94  0.94  
0.00  0.00  0.46  1.13  0.00  0.00  1.55  0.71  0.00  0.74  
senseit_aco  2  0.22  0.30  0.40  0.41  0.54  0.61  0.62  0.63  0.64  0.64  
0.00  34.07  81.50  86.58  141.87  9.49  4.16  6.97  3.77  1.49  
5  0.22  0.29  0.45  0.54  0.63  0.63  0.66  0.68  0.68  0.69  
0.00  30.06  101.82  54.52  16.52  1.99  0.00  0.00  0.00  0.00  
senseit_sei  2  0.61  0.59  0.61  0.62  0.61  0.62  0.62  0.62  0.62  0.61  
173.64  167.01  174.07  176.79  176.20  157.16  178.51  62.46  32.67  17.33  
5  0.64  0.65  0.66  0.66  0.67  0.67  0.67  0.67  0.66  0.66  
189.62  194.02  194.71  188.23  73.50  37.07  16.51  8.44  1.89  0.00 
See Table V for complete data, Fig 15 and Fig 16 for plots.
Figure 15 and Figure 16 show the improvement and depth distribution plots for the GBMs with and respectively.
The cells highlighted in blue in Table IV are where the GBM with showed a larger improvement than a GBM with for the same number of boosting rounds. The cells highlighted in red exhibit the opposite case. Clearly, both factors manifest themselves. Looking at the relative improvement plots in Figure 15 and Figure 16 it would seem that a weaker model typically shows larger improvements.
Observe that in Table IV (and Table V), irrespective of the improvements, , the new scores for the weaker base classifier with is up to as large (within some margin of error) as the score for the GBMs with for the same number of rounds. This is to be expected since our sampling technique diminishes the gap between representational and effective capacities (when such a gap exists); it does not improve the representational capacity. However, this opens up an interesting line for future enquiry: is the effectiveness of diminishing the gap same/similar across different model families? In the case of our GBM experiments, where the only difference between the two experiments is the size of the base classifiers, the effectiveness might be reasonably assumed to be the same. But how about disparate model families like LPM and DTs?
The depth distribution plots for the GBMs show a familiar pattern: high concentration at the root or the leaves.
In Figure 17, we compare the depth distributions of all our models. The yaxes are scaled to have the same range.
It appears that stronger model families, in general, tend to prefer a concentration near leaves compared to weaker model families. Compare DT and LPM, where DT is the stronger model family. There is a clear higher concentration near the leaves for the DT, and a clear higher concentration near the root for the LPM. Between the two categories of GBM models, we see a milder version of this phenomena: the models with prefer the root, and while the models with do not entirely prefer the leaves, there is a distinct shift away from the root. We believe this happens because low bias classifiers are relatively better able to assimilate the richer information at the leaves.
4 Discussion and Conclusion
The previous section exclusively looked at experiments that validate our hypothesis; however, given that our algorithm is reasonably abstracted from low level details, there are various useful extensions and applications possible, some of which we list below:

We had mentioned in Section 2.3 that there are alternatives to Laplace smoothing that may be used. We discuss one possibility here. Let denote a similarity matrix for our density tree with nodes. Here, is the similarity score between nodes and . Let denote the initial (i.e. before smoothing) probability masses for the nodes. Appropriately normalizing gives us a smoothed pmf that is entirely determined by our understanding of similarity, , between nodes. For instance, might be determined based on the WuPalmer distance (Wu and Palmer, 1994).
This view of characterizing the smoothing step with a node similarity matrix opens up a wide range of possibilities.

Our sampling operates within a bounding box around the data in , where is the dimensionality. There is no reason that the bounding box must contain all the data: we may arbitrarily shrink it to a region that we want to focus on, and construct our interpretable model in this region. This can help in, say, segment analysis of purchase behavior, where we want to learn about people only in the age group years and having salaries greater than per annum. We construct a bounding box around data satisfying these criteria.
In the limit, shrinking the bounding box makes our algorithm functionally similar to the LIME technique (Ribeiro et al., 2016), which constructs an interpretable model for a single test point by sampling points in its neighborhood.

We have not explicitly discussed the case of categorical features. There are a couple of ways to handle data with such features:

The density tree can directly deal with categorical variables. When uniformly sampling from a node that is described by conditions on both continuous and categorical variables, we need to use both a continuous uniform sampler (which we use now) and a discrete uniform sampler (i.e. multinomial with equal masses) for the respective features.

We could create a version of the data with onehot encoded categorical features to construct the density trees and sample from them. To train the small model in each optimizer iteration, we transform back the sampled data by identifying values for the categorical features to be the maximums in their corresponding subvectors. Since the optimizer already assumes a blackbox function, this transformation would be modeled as a part of it.


The notion of size need not be a scalar. Our GBM experiments touch upon this possibility. The definition of size only influences how the call to in Algorithm 4 internally executes. This is makes our technique fairly flexible. For ex it is easy in our setup to vary both and number of boosting rounds for GBMs, thus defining a bivariate size, and assess the effect of our algorithm.

Since the range of the sample size parameter is fixed by the user, the possibility of oversampling is subsumed by the optimization. For instance, if our dataset has points, and we believe that oversampling till up to times might help, we need not separately explore this possibility and instead set .

An interesting possible usecase is model compression. Consider the column for boosting round for the senseit_sei dataset in Table IV. Assuming the base classifiers have grown to their , the “storage size” in terms of nodes for the GBMs with and are and respectively.
Going from the second model to the first, the decrease in score is , but the reduction in model size is ! And this is after the accuracy for the second model has already been improved by our algorithm; if we look at its original score , our score improves by (comparing the original score of the second model with the improved score of the first) for the same model size decrease of , and we make an even stronger case for using our algorithm to enhance small models and use them inlieu of larger ones in memory sensitive applications.

Since our parameter search space has no special structure, the workings of the optimizer is decoupled from the larger sampling framework. This makes it easy to use a different optimizer. However, we still insist on using BO because of the reasons presented in Section 2.1.1.
The optimizer we use, TPE, in fact supports conditional parameter spaces, but we have purposefully used a distribution with a fixed number of parameters so we don’t need to rely on optimizerspecific features.
In summary, we have demonstrated how changing the density in the small model regime can dramatically influence model accuracy. In doing so, we have additionally shown that this phenomenon exists for multiple model families, and is therefore general, and have proposed a practical algorithm for finding the optimal density for learning. Our algorithm also has the surprising property that beyond a preprocessing step, the number of optimization variables does not depend on the dimensionality of the data. We have also discussed how our algorithm may be extended in some useful ways.
We believe that the results presented here are valuable in terms of their utility in building interpretable models, and hope they would prompt a larger discussion around the effect itself.
=0mu plus 1mu
References
 Angelino et al. (2017) Angelino, E., LarusStone, N., Alabi, D., Seltzer, M., and Rudin, C. Learning certifiably optimal rule lists. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, pages 35–44, New York, NY, USA, 2017. ACM. ISBN 9781450348874. doi: 10.1145/3097983.3098047. URL http://doi.acm.org/10.1145/3097983.3098047.
 Angwin et al. (2016) Angwin, J., Larson, J., Mattu, S., and Kirchner, L. Machine Bias. https://www.propublica.org/article/machinebiasriskassessmentsincriminalsentencing, May 2016.
 Bachem et al. (2017) Bachem, O., Lucic, M., and Krause, A. Practical Coreset Constructions for Machine Learning. arXiv eprints, art. arXiv:1703.06476, Mar 2017.
 Bergstra et al. (2013) Bergstra, J., Yamins, D., and Cox, D. D. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on International Conference on Machine Learning  Volume 28, ICML’13, pages I–115–I–123. JMLR.org, 2013. URL http://dl.acm.org/citation.cfm?id=3042817.3042832.
 Bergstra et al. (2011) Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyperparameter optimization. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, pages 2546–2554, USA, 2011. Curran Associates Inc. ISBN 9781618395993. URL http://dl.acm.org/citation.cfm?id=2986459.2986743.
 Blackwell and MacQueen (1973) Blackwell, D. and MacQueen, J. B. Ferguson distributions via polya urn schemes. Ann. Statist., 1(2):353–355, 03 1973. doi: 10.1214/aos/1176342372. URL https://doi.org/10.1214/aos/1176342372.
 Blaser and Fryzlewicz (2016) Blaser, R. and Fryzlewicz, P. Random rotation ensembles. Journal of Machine Learning Research, 17(4):1–26, 2016. URL http://jmlr.org/papers/v17/blaser16a.html.
 Breiman et al. (1984) Breiman, L. et al. Classification and Regression Trees. Chapman & Hall, New York, 1984. ISBN 0412048418.
 Brochu et al. (2010) Brochu, E., Cora, V. M., and de Freitas, N. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. CoRR, abs/1012.2599, 2010.
 Caruana et al. (2015) Caruana, R., Lou, Y., Gehrke, J., Koch, P., Sturm, M., and Elhadad, N. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pages 1721–1730, New York, NY, USA, 2015. ACM. ISBN 9781450336642. doi: 10.1145/2783258.2788613. URL http://doi.acm.org/10.1145/2783258.2788613.
 Castellanos and Nash (2018) Castellanos, S. and Nash, K. S. Bank of America Confronts AI’s ‘Black Box’ With Fraud Detection Effort. https://blogs.wsj.com/cio/2018/05/11/bankofamericaconfrontsaisblackboxwithfrauddetectioneffort/, May 2018.
 Chang and Lin (2011) Chang, C.C. and Lin, C.J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
 Efron et al. (2004) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. Least angle regression. Ann. Statist., 32(2):407–499, 04 2004. doi: 10.1214/009053604000000067. URL https://doi.org/10.1214/009053604000000067.
 Goodman and Flaxman (2017) Goodman, B. and Flaxman, S. European union regulations on algorithmic decisionmaking and a ”right to explanation”. AI Magazine, 38:50–57, 2017.
 Gunning (2016) Gunning, D. Explainable Artificial Intelligence. https://www.darpa.mil/program/explainableartificialintelligence, July 2016.
 Herman (2017) Herman, B. The promise and peril of human evaluation for model interpretability. CoRR, abs/1711.07414, 2017. URL http://arxiv.org/abs/1711.07414.
 Hutter et al. (2011) Hutter, F., Hoos, H. H., and LeytonBrown, K. Sequential modelbased optimization for general algorithm configuration. In Proceedings of the 5th International Conference on Learning and Intelligent Optimization, LION’05, pages 507–523, Berlin, Heidelberg, 2011. SpringerVerlag. ISBN 9783642255656. doi: 10.1007/9783642255663˙40. URL http://dx.doi.org/10.1007/9783642255663_40.
 Japkowicz and Stephen (2002) Japkowicz, N. and Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal., 6(5):429–449, October 2002. ISSN 1088467X. URL http://dl.acm.org/citation.cfm?id=1293951.1293954.
 Ke et al. (2017) Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.Y. Lightgbm: A highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 3149–3157, USA, 2017. Curran Associates Inc. ISBN 9781510860964. URL http://dl.acm.org/citation.cfm?id=3294996.3295074.
 Koh and Liang (2017) Koh, P. W. and Liang, P. Understanding blackbox predictions via influence functions. In Precup, D. and Teh, Y. W., editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1885–1894, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/koh17a.html.
 Lakkaraju et al. (2016) Lakkaraju, H., Bach, S. H., and Leskovec, J. Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 1675–1684, New York, NY, USA, 2016. ACM. ISBN 9781450342322. doi: 10.1145/2939672.2939874. URL http://doi.acm.org/10.1145/2939672.2939874.
 Larson et al. (2016) Larson, J., Mattu, S., Kirchner, L., and Angwin, J. How We Analyzed the COMPAS Recidivism Algorithm. https://www.propublica.org/article/howweanalyzedthecompasrecidivismalgorithm, May 2016.
 Letham et al. (2013) Letham, B., Rudin, C., McCormick, T. H., and Madigan, D. Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model. CoRR, abs/1511.01644, 2013.
 Letham et al. (2017) Letham, B., Karrer, B., Ottoni, G., and Bakshy, E. Constrained bayesian optimization with noisy experiments. Bayesian Analysis, 06 2017. doi: 10.1214/18BA1110.
 Li et al. (2017) Li, C., Gupta, S., Rana, S., Nguyen, V., Venkatesh, S., and Shilton, A. High dimensional bayesian optimization using dropout. In Proceedings of the TwentySixth International Joint Conference on Artificial Intelligence, IJCAI17, pages 2096–2102, 2017. doi: 10.24963/ijcai.2017/291. URL https://doi.org/10.24963/ijcai.2017/291.
 Lim and Hastie (2015) Lim, M. and Hastie, T. Learning interactions via hierarchical grouplasso regularization. J Comput Graph Stat, 24(3):627–654, 2015. ISSN 10618600. doi: 10.1080/10618600.2014.938812. URL https://www.ncbi.nlm.nih.gov/pubmed/26759522. 26759522[pmid].
 Lipton (2018) Lipton, Z. C. The mythos of model interpretability. Queue, 16(3):30:31–30:57, June 2018. ISSN 15427730. doi: 10.1145/3236386.3241340. URL http://doi.acm.org/10.1145/3236386.3241340.
 Lou et al. (2013) Lou, Y., Caruana, R., Gehrke, J., and Hooker, G. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13, pages 623–631, New York, NY, USA, 2013. ACM. ISBN 9781450321747. doi: 10.1145/2487575.2487579. URL http://doi.acm.org/10.1145/2487575.2487579.
 Malkomes and Garnett (2018) Malkomes, G. and Garnett, R. Automating bayesian optimization with bayesian optimization. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., CesaBianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems 31, pages 5984–5994. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7838automatingbayesianoptimizationwithbayesianoptimization.pdf.
 Mood (2010) Mood, C. Logistic regression : Why we cannot do what we think we can do, and what we can do about it. European Sociological Review, 26(1):67–82, 2010. doi: 10.1093/esr/jcp006.
 Munteanu and Schwiegelshohn (2018) Munteanu, A. and Schwiegelshohn, C. Coresetsmethods and history: A theoreticians design pattern for approximation and streaming algorithms. KI  Künstliche Intelligenz, 32(1):37–53, Feb 2018. ISSN 16101987. doi: 10.1007/s1321801705193. URL https://doi.org/10.1007/s1321801705193.
 Olkin and Trikalinos (2014) Olkin, I. and Trikalinos, T. Constructions for a bivariate beta distribution. Statistics & Probability Letters, 96, 06 2014. doi: 10.1016/j.spl.2014.09.013.
 Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikitlearn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 Quinlan (1993) Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993. ISBN 1558602380.
 Quinlan (2004) Quinlan, J. R. C5.0. https://rulequest.com/, 2004.
 Rana et al. (2017) Rana, S., Li, C., Gupta, S., Nguyen, V., and Venkatesh, S. High dimensional Bayesian optimization with elastic Gaussian process. In Precup, D. and Teh, Y. W., editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2883–2891, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/rana17a.html.
 Rasmussen (1999) Rasmussen, C. E. The infinite gaussian mixture model. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, pages 554–560, Cambridge, MA, USA, 1999. MIT Press. URL http://dl.acm.org/citation.cfm?id=3009657.3009736.
 Ribeiro et al. (2016) Ribeiro, M. T., Singh, S., and Guestrin, C. ”why should I trust you?”: Explaining the predictions of any classifier. CoRR, abs/1602.04938, 2016. URL http://arxiv.org/abs/1602.04938.
 Ribeiro et al. (2018) Ribeiro, M. T., Singh, S., and Guestrin, C. Anchors: Highprecision modelagnostic explanations, 2018. URL https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16982.
 Rodriguez et al. (2006) Rodriguez, J. J., Kuncheva, L. I., and Alonso, C. J. Rotation forest: A new classifier ensemble method. IEEE Trans. Pattern Anal. Mach. Intell., 28(10):1619–1630, October 2006. ISSN 01628828. doi: 10.1109/TPAMI.2006.211. URL https://doi.org/10.1109/TPAMI.2006.211.
 Selvaraju et al. (2017) Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Gradcam: Visual explanations from deep networks via gradientbased localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 618–626, Oct 2017. doi: 10.1109/ICCV.2017.74.
 Settles (2009) Settles, B. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009. URL http://axon.cs.byu.edu/~martinez/classes/778/Papers/settles.activelearning.pdf.
 Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. Practical bayesian optimization of machine learning algorithms. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 25, pages 2951–2959. Curran Associates, Inc., 2012. URL http://papers.nips.cc/paper/4522practicalbayesianoptimizationofmachinelearningalgorithms.pdf.
 Snoek et al. (2015) Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M. M. A., Prabhat, P., and Adams, R. P. Scalable bayesian optimization using deep neural networks. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning  Volume 37, ICML’15, pages 2171–2180. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045349.
 Teh (2010) Teh, Y. W. Dirichlet Process, pages 280–287. Springer US, Boston, MA, 2010. ISBN 9780387301648. doi: 10.1007/9780387301648˙219. URL https://doi.org/10.1007/9780387301648_219.
 Ustun and Rudin (2016) Ustun, B. and Rudin, C. Supersparse linear integer models for optimized medical scoring systems. Machine Learning, 102(3):349–391, Mar 2016. ISSN 15730565. doi: 10.1007/s1099401555286. URL https://doi.org/10.1007/s1099401555286.
 Wang et al. (2013) Wang, Z., Zoghi, M., Hutter, F., Matheson, D., and De Freitas, N. Bayesian optimization in high dimensions via random embeddings. In Proceedings of the TwentyThird International Joint Conference on Artificial Intelligence, IJCAI ’13, pages 1778–1784. AAAI Press, 2013. ISBN 9781577356332. URL http://dl.acm.org/citation.cfm?id=2540128.2540383.
 Wu and Palmer (1994) Wu, Z. and Palmer, M. Verbs semantics and lexical selection. In Proceedings of the 32Nd Annual Meeting on Association for Computational Linguistics, ACL ’94, pages 133–138, Stroudsburg, PA, USA, 1994. Association for Computational Linguistics. doi: 10.3115/981732.981751. URL https://doi.org/10.3115/981732.981751.
Appendix A Appendix
a.1 Implementation Details
Setting the lower bound of the parameter (see Section 2.3) to ensure statistical significance is not sufficient in itself. Since our sample comes from both the density trees and the original training data, we must ensure these samples lead to statistically significant results individually.
In order to do so the our implementation internally adjusts the quantity . Recall that . A low value of can result in a small sample of size from the original training data, while a high value of might result in a small sample of size from the density trees. Interestingly however, and should be allowed as valid values, since the sample is then drawn from only one of the sources and is therefore not small!
The adjustment we make is shown in Figure 18. The xaxis shows the current value of , the yaxis shows what the sampler sees.
Below a user specified threshold for , it is adjusted to . Beyond a certain user specified threshold, it is adjusted to .
An additional consideration is the possible variance in model performance due to training on a sample from the current in Algorithm 4. We address this issue by taking multiple samples at each iteration, training one model per sample and then reporting the average accuracy across the models. For our experiments, we sampled times per iteration.
a.2 GBM Results
Table V represents the improvements seen using GBMs where we have or for the base classifier trees. This is an expanded version of data presented in Table IV.
boosting rounds =  1  2  3  4  5  6  7  8  9  10  

datasets  max depth  score type  
codrna  2  0.40  0.57  0.70  0.70  0.71  0.71  0.76  0.76  0.85  0.86  
0.00  42.67  74.63  3.16  3.38  3.21  6.66  8.40  19.72  19.51  
5  0.46  0.83  0.85  0.85  0.87  0.88  0.88  0.89  0.89  0.90  
15.46  106.29  29.37  0.63  0.00  0.00  0.00  0.00  0.00  0.53  
ijcnn1  2  0.71  0.72  0.72  0.72  0.72  0.73  0.73  0.73  0.73  0.73  
5.42  7.38  7.43  5.64  7.60  6.73  6.54  8.66  6.87  7.78  
5  0.77  0.77  0.77  0.77  0.78  0.79  0.79  0.80  0.80  0.80  
7.95  4.62  3.66  2.02  4.06  5.58  4.41  4.51  2.76  3.58  
higgs  2  0.62  0.61  0.62  0.63  0.62  0.63  0.62  0.63  0.64  0.63  
42.29  39.00  21.85  11.11  6.64  3.83  2.22  1.22  2.50  5.19  
5  0.62  0.62  0.64  0.64  0.65  0.65  0.67  0.65  0.67  0.67  
30.09  13.80  3.53  0.65  0.57  0.00  0.00  0.75  0.61  0.00  
covtype.binary  2  0.72  0.72  0.72  0.72  0.72  0.72  0.73  0.73  0.72  0.72  
0.00  0.40  0.53  0.96  0.91  1.14  1.54  1.19  0.43  1.14  
5  0.74  0.75  0.76  0.75  0.76  0.76  0.76  0.77  0.77  0.77  
0.00  0.00  0.00  0.00  0.00  0.87  0.48  0.00  0.51  0.00  
phishing  2  0.91  0.91  0.91  0.91  0.91  0.91  0.91  0.91  0.91  0.91  
153.69  9.67  0.18  0.09  0.00  0.14  0.27  0.37  0.28  0.16  
5  0.92  0.92  0.92  0.92  0.92  0.93  0.94  0.93  0.93  0.93  
156.10  1.65  0.48  1.16  1.16  0.20  0.00  0.00  0.95  0.41  
a1a  2  0.71  0.72  0.72  0.73  0.73  0.73  0.73  0.72  0.73  0.73  
5.75  5.59  6.58  6.90  7.33  7.43  2.12  6.97  5.98  7.56  
5  0.75  0.76  0.75  0.75  0.76  0.75  0.77  0.76  0.77  0.77  
1.76  2.56  2.04  2.49  1.69  1.73  1.83  1.79  4.54  4.46  
pendigits  2  0.76  0.80  0.81  0.81  0.82  0.83  0.82  0.83  0.83  0.83  
0.78  1.58  1.42  0.00  1.39  1.42  0.00  0.00  1.34  0.89  
5  0.92  0.94  0.94  0.95  0.96  0.95  0.95  0.96  0.96  0.96  
0.00  0.00  0.00  0.00  0.00  0.00  0.20  0.00  0.00  0.00  
letter  2  0.52  0.58  0.60  0.62  0.62  0.62  0.63  0.63  0.64  0.64  
0.00  0.00  0.00  0.00  1.16  0.00  0.00  0.00  1.73  0.91  
5  0.71  0.76  0.78  0.77  0.78  0.80  0.80  0.80  0.82  0.81  
2.51  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  
Sensorless  2  0.78  0.79  0.80  0.82  0.82  0.82  0.82  0.82  0.82  0.83  
5.21  6.61  5.34  6.24  3.89  3.21  4.32  3.67  3.61  4.83  
5  0.90  0.91  0.92  0.93  0.93  0.93  0.94  0.94  0.94  0.94  
0.00  0.00  0.46  1.13  0.00  0.00  1.55  0.71  0.00  0.74  
senseit_aco  2  0.22  0.30  0.40  0.41  0.54  0.61  0.62  0.63  0.64  0.64  
0.00  34.07  81.50  86.58  141.87  9.49  4.16  6.97  3.77  1.49  
5  0.22  0.29  0.45  0.54  0.63  0.63  0.66  0.68  0.68  0.69  
0.00  30.06  101.82  54.52  16.52  1.99  0.00  0.00  0.00  0.00  
senseit_sei  2  0.61  0.59  0.61  0.62  0.61  0.62  0.62  0.62  0.62  0.61  
173.64  167.01  174.07  176.79  176.20  157.16  178.51  62.46  32.67  17.33  
5  0.64  0.65  0.66  0.66  0.67  0.67  0.67  0.67  0.66  0.66  
189.62  194.02  194.71  188.23  73.50  37.07  16.51  8.44  1.89  0.00  
covtype  2  0.40  0.41  0.42  0.40  0.39  0.40  0.40  0.40  0.41  0.40  
34.86  36.52  39.49  20.04  32.86  32.39  16.78  21.76  21.05  17.97  
5  0.46  0.48  0.49  0.48  0.49  0.49  0.51  0.49  0.50  0.51  
3.41  10.79  0.00  4.51  3.30  0.00  5.87  3.61  0.00  0.00  
connect4  2  0.44  0.45  0.46  0.47  0.47  0.48  0.48  0.49  0.48  0.48  
19.33  20.09  28.08  34.54  22.94  20.19  12.55  20.14  12.07  13.53  
5  0.47  0.48  0.49  0.51  0.51  0.53  0.52  0.52  0.53  0.53  
3.33  7.32  3.72  5.57  1.96  9.38  9.23  4.05  7.51  4.31 