Neural Rule Ensembles: Encoding Sparse Feature Interactions into Neural Networks
Abstract
Artificial neural networks form the basis of very powerful learning methods. It has been observed that a naive application of fully connected neural networks to data with many irrelevant variables often leads to overfitting. To circumvent this issue, prior knowledge about which features are relevant and how they interact can be encoded into these networks. In this work, we use decision trees to capture such relevant features and their interactions and define a mapping to encode the extracted relationships into a neural network. This addresses the initialization-related concerns of fully connected neural networks. At the same time, through feature selection, it enables learning of compact representations compared to state-of-the-art tree-based approaches. Empirical evaluations and simulation studies show the superiority of such an approach over fully connected neural networks and tree-based approaches.
I Introduction
Tree-based ensemble methods have emerged as some of the most powerful learning methods [10, 4], owing to the simplicity and transparency of trees combined with an ability to explain complex data sets.
Predictive models based on rules have gained momentum in recent years [17], [3], [5], [16], [7]. One of the simplest rule-based approaches was proposed in [19], where a single decision tree is decomposed into a set of rules. Each such rule is pruned by removing nodes whose removal improves its estimated accuracy. This is followed by sorting the pruned rules in the ascending order of their accuracy. The prediction for a new example is obtained using the single activated rule that is highest in the sorted list. RuleFit [9] is another popular rule-based predictive model. It involves generating a large pool of rules using existing fast tree-growing procedures. The coefficients of these rules are fit through a regularized regression. [1] replaces the hard rules in RuleFit with soft rules using a logistic transformation. [6] employs gradient boosting using rules as base classifiers, and rules are added iteratively to an ensemble by greedily minimizing the negative log-likelihood function. The major concern in all of these approaches is that the activated region of a rule is fixed and does not allow for any training. Since the support is aligned with the feature axes, a large number of rules is required to approximate oblique decision boundaries, which results in a loose representation of the prediction function.
Another line of work focuses on restructuring a decision tree into a multilayered neural network with sparse connections and fewer restrictions on the inclination of decision boundaries. One such mapping was explored in [23] and later used by [25] for every tree in a random forest. The mapping in [23] replaces the Heaviside unit step function with a hyperbolic tangent activation, which is known to suffer from the vanishing gradients problem. Also, it is not clear how to choose the hyperparameters of the hyperbolic tangent activation, which heavily dictate the initialization and the magnitude of gradients.
Some works at the intersection of decision trees and neural networks replace every decision node with a neural network. One such study is [15], which learns differentiable split functions to guide inputs through a tree. The conditional networks from [13] also use trainable routing functions to perform conditional transformations on the inputs, which enables them to transfer the computational efficiency benefits of decision trees into the domain of convolutional networks. This resembles an ensemble of neural networks structured in a hierarchical fashion.
In this paper, we present a novel method called Neural Rule Ensembles (NRE) for encoding the feature interactions captured by a single decision tree into a neural network. We discuss some training aspects of the algorithm and perform empirical evaluations on binary classification datasets from the Penn Machine Learning Benchmark (PMLB) [18]. To evaluate the statistical significance of the results, we use two statistical tests, the Wilcoxon signed-rank test and the sign test, and individually compare NRE with Random Forests (RF), Gradient Boosted Trees (GB) and Artificial Neural Networks (ANN).
II Preliminaries and Notations
We will work on regression and binary classification problems, where we are given training examples $(x_i, y_i)$, $i = 1, \dots, N$, and we need to find a prediction function $f_w$ parameterized by a vector $w$, such that $f_w(x_i)$ agrees with $y_i$ as much as possible. For example, for linear models, the prediction function is $f_w(x) = w^\top x$ and $w \in \mathbb{R}^p$. The agreement between $f_w(x_i)$ and $y_i$ on the training examples is measured by a loss function
(1)  $L(w) = \sum_{i=1}^{N} \ell(f_w(x_i), y_i) + s(w)$
that should be minimized, where $s(w)$ is a penalty such as the shrinkage $s(w) = \rho \|w\|^2$, which helps with generalization.
The loss $\ell$ depends on the problem. For regression, it could be the square loss $\ell(u, y) = (u - y)^2$. For binary classification (when $y \in \{-1, 1\}$), it could be the logistic loss $\ell(u, y) = \ln(1 + \exp(-yu))$, the hinge loss $\ell(u, y) = \max(0, 1 - yu)$, or other loss functions.
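As a minimal sketch, the three losses named above and the penalized objective can be written in a few lines of Python. The names `u` (the model output), `rho` (the shrinkage strength) and `penalized_loss` are assumptions made for this illustration, as is the squared-norm form of the shrinkage penalty.

```python
import math

def square_loss(u, y):
    """Squared error between prediction u and target y (regression)."""
    return (u - y) ** 2

def logistic_loss(u, y):
    """Logistic loss for a label y in {-1, +1} and a real-valued score u."""
    return math.log1p(math.exp(-y * u))

def hinge_loss(u, y):
    """Hinge loss for a label y in {-1, +1}; zero once the margin y*u reaches 1."""
    return max(0.0, 1.0 - y * u)

def penalized_loss(w, X, Y, loss, rho=0.0):
    """Sum of per-example losses of the linear model f_w(x) = w.x,
    plus an assumed shrinkage penalty rho * ||w||^2."""
    data_term = sum(
        loss(sum(wj * xj for wj, xj in zip(w, x)), y) for x, y in zip(X, Y)
    )
    return data_term + rho * sum(wj * wj for wj in w)
```

For a correctly classified example with margin above one, the hinge loss vanishes, so only the penalty contributes to the objective.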
II-A Rule Generation
For an input $x$ with real-valued continuous attributes $x_1, \dots, x_p$, one can express a conjunctive rule in the following mathematical form:
(2)  $r(x) = a \prod_{j=1}^{p} I(x_j \in s_j)$
where $I(\cdot)$ is an indicator function, $a$ is the activation value and $s_j$ is some subset of possible values for an attribute $x_j$.
A decision tree can be regarded as a collection of conjunctive rules where each path from the root to a terminal node defines one such rule. Specifically, a regression tree with $M$ terminal nodes can be represented as
(3)  $T(x) = \sum_{m=1}^{M} r_m(x)$
Nonzero values of $r_m(x)$ correspond to a hyperrectangle in the input feature space. These hyperrectangles do not overlap and collectively define a partitioning of the feature space.
In cases where all the features involved in a decision-tree-induced rule are continuous variables, the rule in (2) can be reduced to a much simpler form. Let $S$ be the set of features involved in a rule $r$; one can now rewrite the expression for a conjunctive rule in (2) as follows.
(4)  $r(x) = a \prod_{j \in S} H(w_j x_j + b_j)$
where $H$ is a Heaviside step function with $H(0) = 0$, $a$ is the value at the associated terminal node, $w_j$ is either $+1$ or $-1$, and $b_j$ is the split threshold for feature $x_j$ if $w_j = -1$, and is the negative of the split threshold otherwise.
In order to create an initial pool consisting of a large number of rules, we first invoke existing fast algorithms such as Random Forests and Gradient Boosted Trees to generate tree ensembles. In the subsequent step, each tree thus produced is decomposed into a set of conjunctive rules as described above.
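The decomposition of a tree into conjunctive rules can be sketched as a walk over root-to-leaf paths. The nested-dictionary tree format, the helper names `extract_rules` and `rule_output`, and the convention that the left branch means `x[j] <= t` are all assumptions made for this illustration; they are not taken from the paper or from any particular library.

```python
def extract_rules(node, path=()):
    """Return one (conditions, value) pair per root-to-leaf path.

    A leaf is {"value": a}; an internal node is
    {"feature": j, "threshold": t, "left": ..., "right": ...},
    where the left branch means x[j] <= t and the right branch x[j] > t.
    """
    if "value" in node:
        return [(list(path), node["value"])]
    j, t = node["feature"], node["threshold"]
    left = extract_rules(node["left"], path + ((j, "<=", t),))
    right = extract_rules(node["right"], path + ((j, ">", t),))
    return left + right

def rule_output(rule, x):
    """Evaluate a conjunctive rule: its value if every condition holds, else 0."""
    conditions, value = rule
    for j, op, t in conditions:
        holds = x[j] <= t if op == "<=" else x[j] > t
        if not holds:
            return 0.0
    return value

# Hypothetical depth-2 tree with three terminal nodes.
tree = {
    "feature": 0, "threshold": 0.5,
    "left": {"value": -1.0},
    "right": {
        "feature": 1, "threshold": 2.0,
        "left": {"value": 1.0},
        "right": {"value": -1.0},
    },
}
rules = extract_rules(tree)
```

Because the paths partition the feature space, exactly one rule is active for any input, and summing `rule_output` over all rules recovers the tree's prediction.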
II-B Margin Maximizing Rules
In order to perform either an implicit or explicit selection of rules, we would need a metric to quantify their importance, which can then be used to rank them in the order of their relevance. One such metric that can be employed is the hinge loss, also referred to as the maximum margin loss, where rules with lower values of the loss function will have higher relevance. Mathematically, the expression for this loss function is given as
(5) 
where . The quantity , known as the margin or confidence in the prediction, is positive for the correct prediction and is negative in case of the wrong one. Notice that, in the absence of rule scaling , the scale of could be artificially chosen to make the confidence arbitrarily large which in that case would render the definition of margin useless.
Let be the number of training examples that activate the rule where of them belong to the positive class and the remaining come from the negative class. Using (4), the euclidean norm of the rule can thus be computed as
For any training input x, the activation value of a rule is either or . Consequently, the normalized value is either or is given by
which implies that the margin is unconditionally bounded from above by or more precisely,
This allows us to simplify the expression for the hinge loss given in (5) to
Since both and are potentially valid rules, the quantifying metric becomes
(6) 
where
(7) 
This means that the rules can simply be sorted in decreasing order of their scores to obtain a ranking by relevance. In other words, higher scores correspond to higher relevance. Let us further simplify the expression given in (7) to
(8) 
III Rule Generation Limitation of Conventional Decision Trees
The rule generation procedure of a conventional decision tree uses only one feature per node for recursive binary partitioning, so every slice or partition of the feature space is either perpendicular or parallel to the feature axes. In other words, such models have difficulty approximating oblique decision boundaries in the feature space.
Let us consider a linearly separable dataset as shown in Figure 1 (a), where the decision boundary is inclined at an angle of $45^\circ$ with either of the feature axes. Owing to the aforementioned rigidity, it can be seen from Figures 1 (b) and (c) that the only way to improve performance is to keep adding more rules to the ensemble. The inability to evolve the shapes of the rules is highly restrictive and works against the goal of achieving compact representations. Hence, we do not want to restrict the activated region of a rule to just a hyperrectangle. This motivates the definition of a neural rule with trainable support.
Figure 1: (a) Linearly separable data; (b) ensemble of 5 rules; (c) ensemble of 20 rules.
IV Neural Rule Ensembles
In this part, we introduce a new form of the conjunctive rules (4) called Neural Rules. In order to see how a decision-tree-based rule inspires the design of a neural rule, let us revisit the expression given in (4),
$r(x) = a \prod_{j \in S} H(w_j x_j + b_j)$
where $S$ is the set of features involved in a rule $r$, $H$ is a Heaviside step function with $H(0) = 0$, $a$ is the value at the associated terminal node, $w_j$ is either $+1$ or $-1$, and $b_j$ is the split threshold for feature $x_j$ if $w_j = -1$, and is the negative of the split threshold otherwise.
Rules extracted from a decision tree involve only one feature for every node. In a neural rule, we modify (4) to connect each node of a given rule with all the features used in the corresponding decision tree. Let $x_S$ be a vector of all features used in the decision tree without repetitions, and let $\mathbf{w}_j$ be the weight vector that replaces the scalar $w_j$. With this modification, we can now have oblique decision boundaries in the feature subspace spanned by $x_S$. The updated expression of the rule looks as follows.
(9)  $r(x) = a \prod_{j \in S} H(\mathbf{w}_j^\top x_S + b_j)$
Denote the ReLU operation as $\rho(z) = \max(0, z)$. Next, we observe that a Heaviside step function with $H(0) = 0$ is invariant to the ReLU transformation of an input $z$, that is, $H(\rho(z)) = H(z)$. Also note that the product of several Heaviside step functions can be represented using a single Heaviside step function and the minimum pooling operation, $\prod_j H(z_j) = H(\min_j z_j)$.
Using these identities, the rule in (9) becomes,
(10)  $r(x) = a \prod_{j \in S} H(\rho(\mathbf{w}_j^\top x_S + b_j))$
(11)  $r(x) = a \, H(\min_{j \in S} \rho(\mathbf{w}_j^\top x_S + b_j))$
Since the derivative of a step function is zero, the gradients of all the learnable parameters will stay zero unless some modification is made. In order to jointly train all the weights and splitting thresholds of all the nodes in a rule, we replace the Heaviside step function in the previous equations with an identity function. This gives us the neural rule in its final form as
(12)  $r(x) = a \min_{j \in S} \rho(\mathbf{w}_j^\top x_S + b_j)$
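The step from the product-of-steps rule to the min-pooled ReLU rule can be checked numerically. The sketch below assumes the final form is the terminal value multiplied by the minimum of the ReLU-activated affine units; the function names and the example rule (`x0 > 0.5` and `x1 <= 2.0`) are hypothetical.

```python
def relu(z):
    return max(0.0, z)

def heaviside(z):
    """Heaviside step with H(0) = 0."""
    return 1.0 if z > 0 else 0.0

def affine(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def hard_rule(a, W, B, x):
    """Product-of-steps form: a if every w_j.x + b_j is positive, else 0."""
    out = a
    for w, b in zip(W, B):
        out *= heaviside(affine(w, b, x))
    return out

def neural_rule(a, W, B, x):
    """Min-pooled ReLU form: a * min_j relu(w_j.x + b_j)."""
    return a * min(relu(affine(w, b, x)) for w, b in zip(W, B))

# Hypothetical rule over two features: x0 > 0.5 and x1 <= 2.0.
a = 1.0
W = [[1.0, 0.0], [0.0, -1.0]]
B = [-0.5, 2.0]
```

Both forms are nonzero on exactly the same inputs; the neural form additionally grows with the distance of the input from the nearest boundary, which is what makes it trainable by gradient descent.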
IV-A Initialization
We now discuss an initialization of a neural rule corresponding to any given rule obtained from a decision tree.
First, we make a list of all the features involved in that tree. All those features, along with a bias unit, form the input layer of a neural rule. The number of hidden units in the first hidden layer of a neural rule equals the number of decision nodes of the tree-induced rule, with a one-to-one correspondence between them. The connection weight between an input feature and a hidden unit is $0$ if the corresponding decision node of that hidden unit does not involve the feature under consideration. It is $-1$ if the corresponding decision node utilizes that feature and the rule path traverses its left child, and $+1$ otherwise.
The magnitude of the bias for every hidden unit is given by the absolute value of the splitting threshold utilized in the corresponding decision node. The sign of the bias for a hidden unit is positive if the corresponding decision node traverses its left child along the rule path, otherwise it is negative.
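The mapping from a tree path to the first hidden layer can be sketched as follows. The path format `(feature, went_left, threshold)` and the function name `init_neural_rule` are assumptions for this illustration, as is the convention that the left child corresponds to `x[feature] <= threshold`. The bias is written as `+t` for a left turn and `-t` for a right turn, which agrees with the magnitude-and-sign description above whenever the threshold is non-negative.

```python
def init_neural_rule(path, tree_features):
    """Map a root-to-leaf path to first-layer weights and biases.

    path: list of (feature, went_left, threshold), one entry per decision node.
    tree_features: list of all features used anywhere in the tree, no repeats.
    """
    column = {f: k for k, f in enumerate(tree_features)}
    W, B = [], []
    for feature, went_left, threshold in path:
        w = [0.0] * len(tree_features)   # features unused by this node start at zero
        if went_left:                    # x[feature] <= t  becomes  -x + t
            w[column[feature]] = -1.0
            b = threshold
        else:                            # x[feature] > t   becomes   x - t
            w[column[feature]] = 1.0
            b = -threshold
        W.append(w)
        B.append(b)
    return W, B
```

For example, a path that goes right at feature 3 (threshold 0.5) and then left at feature 7 (threshold 2.0), in a tree that also uses feature 1, yields two hidden units over three inputs, with the column for feature 1 initialized to zero in both.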
We use Figure 2 to illustrate one such mapping. Figure 2 (a) shows a decision tree with four rules. Figure 2 (b) shows a neural rule corresponding to the rule with terminal label (red colored branch) of the decision tree. All the bold lines in a neural rule represent trainable parameters with their initial values displayed alongside in Figure 2 (b). The red bold lines in a neural rule carry nonzero initial weights and have their counterparts in the decision tree whereas black bold lines represent new connections with zero initialization.
Figure 2: (a) A decision tree; (b) a neural rule.
We observed that the support of the proposed neural rule is a convex set. In order to allow the rules to assume complicated nonconvex shapes in the feature space, we extend the definition of a neural rule by stacking a new hidden layer with the same number of hidden units as the previous one. We refer to such a modification of the neural rule as a deep neural rule. Since we need to preserve the support of a tree-induced rule while mapping it into a corresponding deep neural rule, we use an identity transformation for initializing the parameters of this new hidden layer, as depicted in Figure 3.
IV-B Characteristics
Trainable Support. Each $\mathbf{w}_j^\top x_S + b_j = 0$ in equation (12) represents a hyperplane in the feature subspace, with the corresponding upper half space given by $\mathbf{w}_j^\top x_S + b_j > 0$. The application of the $\min$ operation evaluates the intersection of these upper half spaces and thus defines the activated region of a rule. For an input x that lies in the upper half space of the hyperplane $\mathbf{w}_j^\top x_S + b_j = 0$, the value $\mathbf{w}_j^\top x_S + b_j$ is proportional to the shortest Euclidean distance of the input to that hyperplane. This quantifies the margin or level of confidence in the prediction: the further the input lies from the hyperplane in its upper half space, the more confident the rule is in its prediction for that input. During training using backpropagation, the hyperplane is rotated, shifted and scaled in order to maximize the expected margin of the inputs. Because of the pooling operation, each input contributes to updating the parameters of only the hyperplane that predicts the least margin for it at that training step among all the hyperplanes involved in a rule. The rationale here is to maximize the margin of an input only from the least confident hyperplane.
Figure 3: (a) A decision tree; (b) a deep neural rule.
Restricted Gradients. For an input x that does not belong to the activated region of a rule, $\min_j (\mathbf{w}_j^\top x_S + b_j)$ is less than or equal to zero, which implies zero gradients for all the trainable parameters, as the derivative of a ReLU activation for negative inputs is zero. This suggests that only the training examples lying inside the activated region are responsible for modifying the shape of this region. In order to maximize their margins, activated examples try to push or pull the rule boundaries depending on the sign of their class membership and the sign of the rule weight $a$. If these signs agree, the corresponding training examples push the rule boundary outwards, which expands the region and, as a result, brings in more training examples. Otherwise, if the signs do not match, those contradictory examples pull in the rule boundary to get themselves out of it and thereby shrink the region.
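The two properties above, that only the least-margin hyperplane is updated and that inputs outside the activated region contribute nothing, can be seen in a hand-written gradient. This is a sketch assuming the neural rule has the form `a * min_j relu(w_j.x + b_j)`; the function name `rule_gradients` is hypothetical.

```python
def rule_gradients(a, W, B, x):
    """Gradients of r(x) = a * min_j relu(w_j.x + b_j) w.r.t. each (w_j, b_j)."""
    z = [sum(wi * xi for wi, xi in zip(w, x)) + b for w, b in zip(W, B)]
    k = min(range(len(z)), key=lambda j: z[j])   # least-margin hyperplane
    grad_W = [[0.0] * len(x) for _ in W]
    grad_B = [0.0] * len(B)
    if z[k] > 0:                                 # x lies inside the activated region
        grad_W[k] = [a * xi for xi in x]
        grad_B[k] = a
    return grad_W, grad_B
```

Every hyperplane other than the one selected by the min pooling receives a zero gradient for that input, and an input with a non-positive minimum leaves all parameters untouched.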
Compact Convex Support. Let denote the support of a neural rule given by equation (12) and defined as . We show that is a compactly supported convex set.
Proposition 1.
For any , the convex combination of x and z satisfies
where with
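The convexity part of this claim follows from a short concavity argument, sketched here under assumed notation: write $g(x) = \min_j (\mathbf{w}_j^\top x + b_j)$ for the pooled pre-activation, so that the support is $\{x : g(x) > 0\}$. The symbols $g$ and $\lambda$ are introduced for this sketch only. For any $x, z$ in the support and any $\lambda \in [0, 1]$:

```latex
\begin{align*}
g(\lambda x + (1-\lambda) z)
  &= \min_j \big[ \lambda (\mathbf{w}_j^\top x + b_j) + (1-\lambda)(\mathbf{w}_j^\top z + b_j) \big] \\
  &\ge \lambda \min_j (\mathbf{w}_j^\top x + b_j) + (1-\lambda) \min_j (\mathbf{w}_j^\top z + b_j) \\
  &= \lambda\, g(x) + (1-\lambda)\, g(z) \;>\; 0 .
\end{align*}
```

The inequality holds because the minimum of a sum is at least the sum of the minima when the coefficients are non-negative, so the convex combination also lies in the support.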
IV-C Training
Conventional procedures for generating decision trees on binary classification tasks employ either the Gini index or the cross-entropy measure. We use a new splitting criterion for growing a decision tree, based on the margin maximizing rules discussed in Section II-B. The criterion is written in terms of example counts, where subscripts distinguish the left child, the right child and the parent node, and superscripts distinguish positive and negative examples.
Assuming binary partitioning of a decision node, each split defines two simple rules, one for each child node. Using the maximum margin metric for a rule given in (8), the node splitting criterion can be written as follows
(15) 
We decompose a single decision tree into a set of conjunctive rules to obtain a pool of diverse feature interactions. Each of these rules is used to initialize its neural counterpart using the mapping discussed in Section IV-A. Such an ensemble of neural rules, collectively referred to as Neural Rule Ensembles (NRE), is essentially a two-layer artificial neural network with a min pooling operation and thus a universal approximator [12]. However, in the proposed approach, feature interactions extracted from a decision tree are explicitly encoded into the network through its initialization, thus performing feature selection and leading to better generalization. Another characteristic of such an initialization is that the activations of any two pooled hidden units are orthogonal to each other.
V Experiments
V-A Simulation Result
In this section, we perform a simulation to illustrate the ability of a neural rule to evolve and expand its activated region. We consider a rotated XOR dataset, which is not linearly separable, since no single hyperplane can separate the positive training examples (shown in blue) from the negative ones (shown in red). Additionally, because the XOR dataset has been rotated, tree-based approaches such as Gradient Boosted Trees have a hard time approximating the oblique decision boundaries and require a large number of trees and/or rules.
Figure 4 (a) shows a single neural rule just after its initialization from a corresponding tree-induced rule. Figure 4 (b) shows an intermediate state after training for 150 iterations. It can be seen that the rule evolves its activated region to include more examples of the same type into its support. After training for a long time, the neural rule settles into an equilibrium state consisting of only positive examples, as shown in Figure 4 (c).
Figure 4: Evolution of a neural rule: (a) initialization; (b) after 150 iterations; (c) after 3k iterations.
It is evident from Figure 4 (c) that there are still many positive training examples that do not belong to the support of the neural rule. In order to include them, a neural rule would have to assume a nonconvex shape, which is not possible. This limitation motivates the extension to a deep neural rule. Figure 5 shows the evolution of the activated region in the case of a deep neural rule, which can now achieve a nonconvex shape and hence contain all the positive training examples in its support, as shown in Figure 5 (c).
Figure 5: Evolution of a deep neural rule: (a) initialization; (b) after 150 iterations; (c) after 3k iterations.
V-B Real Data Analysis
Datasets. In order to compare the performance of the proposed algorithm with state-of-the-art approaches, we perform an empirical evaluation on simulated and multiple real datasets, which ensures a wide variety of targets in terms of their dependence on the input features.
For simulation, we use a highly nonlinear and multivariate artificial dataset, MADELON, featured in the NIPS 2003 feature selection challenge [11]. It is a generalization of the classic XOR dataset to five dimensions. Each vertex of a five-dimensional hypercube contains a cluster of data points randomly labeled as $+1$ or $-1$. The five dimensions constitute informative features, and linear combinations of those features were added to form a set of 20 redundant but informative features. Additionally, a number of distractor features with no predictive power were added and the order of the features was randomized.
For benchmarking on real datasets, we will use Penn Machine Learning Benchmark (PMLB) [18] which includes datasets from a wide range of sources such as UCI ML repository [8], Kaggle, KEEL [2] and the metalearning benchmark [20]. Since we are limiting our focus to binary classification tasks, we only consider datasets having two classes. Additionally, we removed all the datasets with fewer than training examples. This leaves us with a total of datasets for our investigation.
Statistical Tests. We use a statistical framework for hypothesis testing to investigate whether Neural Rule Ensembles (NRE) is significantly better than state-of-the-art classifiers, namely Random Forests (RF), Gradient Boosted Trees (GB) and Artificial Neural Networks (ANN). A hypothesis test is a decision between two complementary hypotheses, the null hypothesis $H_0$ and the alternate hypothesis $H_1$. We are trying to reject the null hypothesis, which states that there is no difference in the classification performance of the algorithms, that is, both of them perform equally well. We use the following statistical tests designed to compare two classifiers on multiple data sets.
Wilcoxon Signed-Rank Test. For the Wilcoxon signed-rank test [26], the results are sorted by the magnitude of the absolute difference in the performance scores of the two classifiers. This is followed by assigning ranks from the lowest to the highest absolute difference. In case of ties, average ranks are assigned. Finally, a test statistic is formed based on the ranks of the positive and negative differences.
Table I: Test errors of GB and NRE on each dataset, with the absolute difference and its rank.
Dataset  GB  NRE  difference  rank  
wilt  18.60  10.40  8.20  19.0 
madelon  14.50  10.30  4.20  18.0 
adult  12.91  14.22  1.31  17.0 
phoneme  9.25  8.14  1.11  16.0 
dis  0.71  1.77  1.06  15.0 
titanic  27.49  26.89  0.60  14.0 
churn  3.60  4.13  0.53  13.0 
banana  9.31  8.93  0.38  12.0 
ring  3.15  3.51  0.36  11.0 
spambase  4.34  4.63  0.29  10.0 
krvskp  0.42  0.20  0.22  9.0 
chess  0.21  0.42  0.21  7.5 
coil2000  6.04  5.83  0.21  7.5 
twonorm  2.34  2.25  0.09  6.0 
clean2  0.00  0.00  0.00  3.0 
hypothyroid  1.47  1.47  0.00  3.0 
agaricuslepiota  0.00  0.00  0.00  3.0 
magic  11.67  11.67  0.00  3.0 
mushroom  0.00  0.00  0.00  3.0 
wins  6  8  
ties  5  5 
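Let $d_i$ be the difference between the performance scores of the two classifiers on the $i$-th data set, taken so that $d_i > 0$ when NRE has the lower test error. Let $R^+$ be the sum of ranks for the data sets on which NRE outperforms the other classifier, and $R^-$ the sum of ranks on data sets where NRE gets defeated. Ranks corresponding to zero difference are split evenly between $R^+$ and $R^-$; if there is an odd number of them, one is ignored. The test statistic $T$ is given by
(16)  $T = \min(R^+, R^-)$
where
(17)  $R^+ = \sum_{d_i > 0} \mathrm{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \mathrm{rank}(d_i)$
(18)  $R^- = \sum_{d_i < 0} \mathrm{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \mathrm{rank}(d_i)$
For a two-tailed test at the $\alpha = 0.05$ significance level, the critical value of the test statistic corresponding to $N = 19$ data sets is $46$. In other words, if $T$ is less than or equal to $46$, NRE can be considered significantly better than the other classifier with $p < 0.05$ and we can reject the null hypothesis in favor of the alternate one.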
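The rank sums can be recomputed directly from the GB and NRE test errors in the comparison table. The sketch below is a plain-Python illustration, not the authors' code; the helper names are hypothetical. It splits the ranks of all zero differences evenly between the two sums without discarding one, which is an assumption about how the reported statistic was obtained, made because it reproduces the reported value of 81 for NRE vs GB.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upward, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def wilcoxon_T(other_errors, nre_errors):
    """Return (R_plus, R_minus, T); R_plus collects data sets where NRE has lower error."""
    # Round to two decimals so that equal differences compare equal as floats.
    d = [round(o - n, 2) for o, n in zip(other_errors, nre_errors)]
    ranks = average_ranks([abs(x) for x in d])
    zero_share = 0.5 * sum(r for r, x in zip(ranks, d) if x == 0)
    r_plus = sum(r for r, x in zip(ranks, d) if x > 0) + zero_share
    r_minus = sum(r for r, x in zip(ranks, d) if x < 0) + zero_share
    return r_plus, r_minus, min(r_plus, r_minus)

# Test errors of GB and NRE copied from the GB vs NRE comparison table.
gb = [18.60, 14.50, 12.91, 9.25, 0.71, 27.49, 3.60, 9.31, 3.15, 4.34,
      0.42, 0.21, 6.04, 2.34, 0.00, 1.47, 0.00, 11.67, 0.00]
nre = [10.40, 10.30, 14.22, 8.14, 1.77, 26.89, 4.13, 8.93, 3.51, 4.63,
       0.20, 0.42, 5.83, 2.25, 0.00, 1.47, 0.00, 11.67, 0.00]
r_plus, r_minus, T = wilcoxon_T(gb, nre)
```

Since the resulting statistic exceeds the critical value of 46, the null hypothesis is not rejected for this pair.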
Sign Test: Wins, Losses & Ties Counts. The sign test [22, 24] is much weaker than the Wilcoxon signed-rank test and will not reject the null hypothesis unless one algorithm almost always outperforms the other. In the sign test, we compare the generalization performance of classifiers by counting the number of data sets on which a classifier outperforms the other.
Under the assumption that the null hypothesis is correct, that is, both classifiers perform equally well, one would expect each of them to win on approximately $N/2$ out of $N$ data sets. This tells us that the number of wins is distributed according to the binomial distribution.
For $N = 19$ datasets, the critical number of wins needed to reject the null hypothesis for a two-tailed sign test at the $\alpha = 0.05$ significance level is $14$. This implies that NRE can be considered significantly better than the other classifier with $p < 0.05$ if it is the overall winner on $14$ out of $19$ datasets. Since the null hypothesis is true for ties, instead of discarding them, we distribute them evenly between the two classifiers, ignoring one of the ties if there is an odd number of them.
Table II: Test errors of RF and NRE on each dataset, with the absolute difference and its rank.
Dataset  RF  NRE  difference  rank  
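The tie handling and the critical number of wins can be sketched as below. The threshold uses the normal approximation to the binomial, N/2 + 1.96 sqrt(N)/2 rounded up, which is an assumption: the text only states the resulting critical number (14 for the 19 datasets used in the GB comparison). The function names are hypothetical.

```python
import math

def sign_test_wins(wins, ties):
    """Wins after splitting ties evenly; one tie is dropped when their count is odd."""
    return wins + (ties - ties % 2) // 2

def critical_wins(n_datasets, z=1.96):
    """Normal-approximation threshold: smallest integer >= N/2 + z * sqrt(N)/2."""
    return math.ceil(n_datasets / 2 + z * math.sqrt(n_datasets) / 2)
```

With 8 wins and 5 ties against GB, `sign_test_wins(8, 5)` gives 10, below `critical_wins(19)`, whereas 13 wins and 5 ties against RF give 15, which is above it.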
madelon  26.40  10.30  16.10  19.0 
wilt  21.60  10.40  11.20  18.0 
coil2000  7.06  5.83  1.23  17.0 
phoneme  9.00  8.14  0.86  16.0 
banana  9.56  8.93  0.63  15.0 
titanic  27.49  26.89  0.60  14.0 
spambase  4.92  4.63  0.29  13.0 
twonorm  2.52  2.25  0.27  12.0 
adult  14.47  14.22  0.25  11.0 
krvskp  0.42  0.20  0.22  10.0 
magic  11.88  11.67  0.21  9.0 
hypothyroid  1.68  1.47  0.21  8.0 
chess  0.62  0.42  0.20  7.0 
ring  3.33  3.51  0.18  6.0 
agaricuslepiota  0.00  0.00  0.00  3.0 
mushroom  0.00  0.00  0.00  3.0 
dis  1.77  1.77  0.00  3.0 
clean2  0.00  0.00  0.00  3.0 
churn  4.13  4.13  0.00  3.0 
wins  1  13  
ties  5  5 
Test Error Evaluation. In this section, we compare NRE with Gradient Boosted Trees (GB), Random Forests (RF) and Artificial Neural Networks (ANN) on 19 datasets. The test errors for the datasets without a test set are obtained using five-fold cross-validation.
For each classifier, the operating settings and the tuned hyperparameters are the following:

Random Forests: The number of trees used in the forest are tuned from the set .

Gradient Boosted Trees: We use boosting iterations with the maximum tree depth selected from the range .

Artificial Neural Networks: Fully connected networks with a single hidden layer (since NRE contains one hidden layer) and rectified linear (ReLU) activation. The number of hidden units is selected for optimal performance.

Neural Rule Ensembles: Maximum depth of the tree used for initializing the network is searched over the set .
The hyperparameters for the methods being evaluated have been obtained by internal five-fold cross-validation on the training set. We use the scikit-learn implementation for evaluating the existing algorithms.
Table III: Test errors of ANN and NRE on each dataset, with the absolute difference and its rank.
Dataset  ANN  NRE  difference  rank  
madelon  45.50  10.30  35.20  19.0 
phoneme  14.18  8.14  6.04  18.0 
wilt  14.20  10.40  3.80  17.0 
churn  6.27  4.13  2.14  16.0 
coil2000  7.46  5.83  1.63  15.0 
spambase  3.47  4.63  1.16  14.0 
ring  2.52  3.51  0.99  13.0 
magic  12.44  11.67  0.77  12.0 
adult  14.79  14.22  0.57  11.0 
hypothyroid  1.89  1.47  0.42  9.5 
krvskp  0.62  0.20  0.42  9.5 
banana  9.31  8.93  0.38  8.0 
twonorm  2.43  2.25  0.18  7.0 
dis  1.94  1.77  0.17  6.0 
agaricuslepiota  0.00  0.00  0.00  3.0 
mushroom  0.00  0.00  0.00  3.0 
clean2  0.00  0.00  0.00  3.0 
chess  0.42  0.42  0.00  3.0 
titanic  26.89  26.89  0.00  3.0 
wins  2  12  
ties  5  5 
NRE vs Gradient Boosted Trees. From Table I, it can be seen that NRE wins on 8 data sets, GB wins on 6 data sets and there are 5 ties. Ignoring one tie and splitting the remaining ones evenly, we find that NRE is better on 10 out of 19 datasets. Since the critical number of wins needed under the sign test is 14, we fail to reject the null hypothesis. Similarly, we fail to reject the null hypothesis under the Wilcoxon signed-rank test because the test statistic of 81 is greater than 46. This implies that we do not have enough statistical evidence to establish that NRE outperforms GB. However, NRE initialized from a single tree is competitive with boosted trees and is a more compactly represented model.
NRE vs Random Forest. We find from Table II that NRE outperforms RF on all the data sets except for the ring data set and the 5 ties. Splitting the ties evenly, NRE is better on 15 out of 19 data sets, which is greater than the critical number of wins needed under the sign test, that is, 14. We can therefore reject the null hypothesis. For the Wilcoxon signed-rank test, the statistic of 13.5 is less than the critical value of 46, which allows us to reject the null hypothesis as well. This implies that NRE is significantly better than Random Forest and, given that it utilizes only one tree compared with the ensemble of trees in RF, it is more compact too.
NRE vs Artificial Neural Network. It is evident from Table III that NRE outperforms ANN on 12 data sets, loses on 2 data sets and there are 5 ties. NRE passes the sign test since it is better on 14 data sets (splitting the ties evenly), which matches the critical number of wins needed. Since the test statistic of 34.5 for the Wilcoxon signed-rank test is less than the critical value of 46, we reject the null hypothesis in favor of the alternate one. Both statistical tests agree that NRE is significantly better than the Artificial Neural Networks.
Table IV: Classification test errors of all the methods, with the number of observations and the number of features of each dataset.
Dataset  Observations  Features  GB  RF  ANN  NRE  
wilt  4839  6  18.60  21.60  14.20  10.40 
madelon  2600  500  14.50  26.40  45.50  10.30 
phoneme  5404  6  9.25  9.00  14.18  8.14 
krvskp  3197  37  0.42  0.42  0.62  0.20 
coil2000  9822  86  6.04  7.06  7.46  5.83 
banana  5300  3  9.31  9.56  9.31  8.93 
twonorm  7400  21  2.34  2.52  2.43  2.25 
adult  48842  15  12.91  14.47  14.79  14.22 
dis  3772  30  0.71  1.77  1.94  1.77 
churn  5000  21  3.60  4.13  6.27  4.13 
ring  7400  21  3.15  3.33  2.52  3.51 
spambase  4601  58  4.34  4.92  3.47  4.63 
chess  3196  37  0.21  0.62  0.42  0.42 
titanic  2201  4  27.49  27.49  26.89  26.89 
hypothyroid  3163  26  1.47  1.68  1.89  1.47 
magic  19020  11  11.67  11.88  12.44  11.67 
mushroom  8124  23  0.00  0.00  0.00  0.00 
clean2  6598  169  0.00  0.00  0.00  0.00 
agaricuslepiota  8145  23  0.00  0.00  0.00  0.00 
wins  11  3  4  13 
Overall comparison. Table V summarizes the Wilcoxon T statistics and the number of NRE wins against the other methods, with significant results in bold. Table IV collects the classification test errors of all the methods, together with the number of observations and the number of features of each dataset.
Table V: Wilcoxon T statistics and number of NRE wins against each method.
Statistic  NRE vs GB  NRE vs RF  NRE vs ANN  
Wilcoxon T Statistic  81  13.5  34.5 
Number of NRE wins  10  15  14 
VI Conclusion
In this work, we presented a novel method called Neural Rule Ensembles (NRE) for encoding into a neural network and refining the feature interactions captured by a decision tree. This was achieved by defining a neural transformation of a tree-induced rule using ReLU units and the min pooling operation. Such a mapping addresses the initialization-related concerns of fully connected neural networks as well as the feature selection problem, and enables learning of compact representations compared to conventional tree-based approaches.
Empirical evaluations on binary classification datasets from the Penn Machine Learning Benchmark (PMLB) [18] were performed to compare the generalization performance of Neural Rule Ensembles (NRE) with state-of-the-art approaches such as Random Forests (RF), Gradient Boosted Trees (GB) and Artificial Neural Networks (ANN). We used two statistical tests, the Wilcoxon signed-rank test and the sign test, to evaluate the statistical significance of these results. Both of these statistical tests found NRE to be significantly better than Random Forests and Artificial Neural Networks with $p < 0.05$. When NRE was compared to Gradient Boosted Trees, we could not find enough statistical evidence to reject the null hypothesis stating that both of them perform equally well. However, NRE only utilizes one tree, so it obtains a more compact and interpretable representation.
References
[1] (2013) Soft rule ensembles for supervised learning. pp. 78–83.
[2] (2011) KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Multiple-Valued Logic and Soft Computing 17 (2-3), pp. 255–287.
[3] (2016) Multi-objective search for comprehensible rule ensembles. In International Joint Conference on Rough Sets, pp. 503–513.
[4] (2001) Random forests. Machine Learning 45 (1), pp. 5–32.
[5] (2017) The best of two worlds: balancing model strength and comprehensibility in business failure prediction using spline-rule ensembles. Expert Systems with Applications 90, pp. 23–39.
[6] (2008) Maximum likelihood rule ensembles. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, New York, NY, USA, pp. 224–231. ISBN 9781605582054.
[7] (2019) Interpreting tree ensembles with inTrees. International Journal of Data Science and Analytics 7 (4), pp. 277–287.
[8] (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
[9] (2008) Predictive learning via rule ensembles. The Annals of Applied Statistics, pp. 916–954.
[10] (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189–1232.
[11] (2004) Result analysis of the NIPS 2003 feature selection challenge. In NIPS, pp. 545–552.
[12] (1991) Approximation capabilities of multilayer feedforward networks. Neural Networks 4 (2), pp. 251–257.
[13] (2016) Decision forests, convolutional networks and the models in-between. CoRR abs/1603.01250.
[14] (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980.
[15] (2015) Deep neural decision forests. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1467–1475.
[16] (2017) Tree ensembles with rule structured horseshoe regularization. The Annals of Applied Statistics 12.
[17] (2016) Horseshoe RuleFit: learning rule ensembles via Bayesian regularization.
[18] (2017) PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining 10 (1), pp. 36. ISSN 1756-0381.
[19] (1993) C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. ISBN 1558602380.
[20] (2012) A comprehensive dataset for evaluating approaches of various meta-learning tasks. In ICPRAM (1), P. L. Carmona, J. S. Sánchez and A. L. N. Fred (Eds.), pp. 273–276.
[21] (1988) Neurocomputing: foundations of research. J. A. Anderson and E. Rosenfeld (Eds.), pp. 696–699. ISBN 0262010976.
[22] (1997) On comparing classifiers: pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery 1 (3), pp. 317–328. ISSN 1573-756X.
[23] (1991) Decision tree performance enhancement using an artificial neural network implementation. In Artificial Neural Networks and Statistical Pattern Recognition, I. K. Sethi and A. K. Jain (Eds.), Machine Intelligence and Pattern Recognition, Vol. 11, pp. 71–88. ISSN 0923-0459.
[24] (2007) Handbook of parametric and nonparametric statistical procedures. 4th edition, Chapman & Hall/CRC. ISBN 1584888148, 9781584888147.
[25] (2014) Casting random forests as artificial neural networks (and profiting from it). In Pattern Recognition, X. Jiang, J. Hornegger and R. Koch (Eds.), Cham, pp. 765–771. ISBN 9783319117522.
[26] (1945) Individual comparisons by ranking methods. Biometrics Bulletin 1 (6), pp. 80–83. ISSN 0099-4987.