Neural Rule Ensembles: Encoding Sparse Feature Interactions into Neural Networks



Artificial Neural Networks form the basis of very powerful learning methods. It has been observed that a naive application of fully connected neural networks to data with many irrelevant variables often leads to overfitting. In an attempt to circumvent this issue, prior knowledge about which features are relevant and how they may interact can be encoded into these networks. In this work, we use decision trees to capture such relevant features and their interactions and define a mapping to encode the extracted relationships into a neural network. This addresses the initialization-related concerns of fully connected neural networks. At the same time, through feature selection, it enables learning of compact representations compared to state-of-the-art tree-based approaches. Empirical evaluations and simulation studies show the superiority of such an approach over fully connected neural networks and tree-based approaches.

I Introduction

Tree-based ensemble methods have emerged as one of the most powerful families of learning methods [10, 4], owing to the simplicity and transparency of trees combined with an ability to explain complex data sets.

Predictive models based on rules have gained momentum in recent years [17],[3],[5],[16],[7]. One of the simplest rule-based approaches was proposed in [19], where a single decision tree is decomposed into a set of rules. Each such rule is pruned by removing the nodes whose removal improves its estimated accuracy. This is followed by sorting the pruned rules in the ascending order of their accuracy. The prediction for any new example is obtained using the single activated rule that is highest in the sorted list. RuleFit [9] is another popular rule-based predictive model. It involves generating a large pool of rules using existing fast tree growing procedures. The coefficients of these rules are fit through a regularized regression. [1] replaces the hard rules in RuleFit with soft rules using a logistic transformation. [6] employs gradient boosting using rules as base classifiers, where rules are added iteratively to an ensemble by greedily minimizing the negative log-likelihood function. The major concern in all of these approaches is that the activated region of a rule is fixed and does not allow for any training. Since the support is aligned along the feature axes, a large number of rules would be required to approximate oblique decision boundaries, which would result in a loose representation of the prediction function.

Another line of work focuses on restructuring a decision tree into a multi-layered neural network with sparse connections and fewer restrictions on the inclination of decision boundaries. One such mapping was explored in [23] and later used by [25] for every tree in a random forest. The mapping in [23] replaces the Heaviside unit step function with a hyperbolic tangent activation, which is known to suffer from the vanishing gradient problem. Also, it is not clear how to choose the hyperparameters of the hyperbolic tangent activation, which heavily dictate the initialization and the magnitude of gradients.

Some works at the intersection of decision trees and neural networks replace every decision node with a neural network. One such study is [15], which learns differentiable split functions to guide inputs through a tree. The conditional networks from [13] also use trainable routing functions to perform conditional transformations on the inputs, which enables them to transfer the computational efficiency benefits of decision trees into the domain of convolutional networks. This resembles an ensemble of neural networks structured in a hierarchical fashion.

In this paper, we present a novel method called Neural Rule Ensembles (NRE) for encoding the feature interactions captured by a single decision tree into a neural network. We discuss some training aspects of the algorithm and perform empirical evaluations on binary classification datasets from the Penn Machine Learning Benchmark (PMLB) [18]. To evaluate the statistical significance of the results, we use two statistical tests: Wilcoxon signed-rank test and the sign test, and individually compare NRE with Random Forests (RF), Gradient Boosted Trees (GB) and Artificial Neural Networks (ANN).

II Preliminaries and Notations

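We will work on regression and binary classification problems, where we are given training examples $(\mathbf{x}_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$, $i = 1, \dots, N$, and we need to find a prediction function $f_{\boldsymbol{\beta}}(\mathbf{x})$ parameterized by a vector $\boldsymbol{\beta}$, such that $f_{\boldsymbol{\beta}}(\mathbf{x}_i)$ agrees with $y_i$ as much as possible. For example, for linear models, the prediction function is $f_{\boldsymbol{\beta}}(\mathbf{x}) = \boldsymbol{\beta}^T \mathbf{x}$ and $\boldsymbol{\beta} \in \mathbb{R}^d$. The agreement between $f_{\boldsymbol{\beta}}(\mathbf{x}_i)$ and $y_i$ on the training examples is measured by a loss function

$$L(\boldsymbol{\beta}) = \sum_{i=1}^{N} \ell\left(f_{\boldsymbol{\beta}}(\mathbf{x}_i), y_i\right) + \lambda\, s(\boldsymbol{\beta}) \qquad (1)$$

that should be minimized, where $s(\boldsymbol{\beta})$ is a penalty such as the shrinkage $s(\boldsymbol{\beta}) = \|\boldsymbol{\beta}\|^2$, which helps with generalization.

The loss $\ell(u, y)$ depends on the problem. For regression, it could be the square loss $\ell(u, y) = (u - y)^2$. For binary classification (when $y \in \{-1, +1\}$), it could be the logistic loss $\ell(u, y) = \log(1 + \exp(-yu))$, the hinge loss $\ell(u, y) = \max(0, 1 - yu)$, or other loss functions.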
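As an illustration of the objective described above, the following minimal Python sketch evaluates a penalized empirical loss for a linear model. The function names (`objective`, `linear_predict`), the use of a sum rather than an average over the examples, and the squared-norm penalty weighted by `lam` are assumptions made for this sketch and are not taken from the paper.

```python
import math

def square_loss(u, y):
    # Regression loss: squared difference between prediction u and target y.
    return (u - y) ** 2

def logistic_loss(u, y):
    # Binary classification loss, with labels y in {-1, +1}.
    return math.log1p(math.exp(-y * u))

def hinge_loss(u, y):
    # Maximum margin loss, with labels y in {-1, +1}.
    return max(0.0, 1.0 - y * u)

def linear_predict(beta, x):
    # Linear model: inner product of the parameter vector and the input.
    return sum(b * xj for b, xj in zip(beta, x))

def objective(beta, X, Y, loss, lam=0.0):
    # Data term summed over the training examples plus a shrinkage penalty
    # (squared Euclidean norm of beta, an assumption of this sketch).
    data_term = sum(loss(linear_predict(beta, x), y) for x, y in zip(X, Y))
    penalty = lam * sum(b * b for b in beta)
    return data_term + penalty

X = [[1.0, 2.0], [-1.0, 0.5]]
Y = [1, -1]
beta = [1.0, 0.0]
hinge_value = objective(beta, X, Y, hinge_loss, lam=0.5)
```

Here both toy examples are classified with margin exactly one, so the hinge data term vanishes and only the penalty contributes to `hinge_value`.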

II-A Rule Generation

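For an input $\mathbf{x}$ with real-valued continuous attributes $x_1, \dots, x_d$, one can express a conjunctive rule in the following mathematical form:

$$r(\mathbf{x}) = a \prod_{j=1}^{d} I(x_j \in s_j) \qquad (2)$$

where $I(\cdot)$ is an indicator function, $a$ is the activation value and $s_j$ is some subset of possible values for an attribute $x_j$.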

A decision tree can be regarded as a collection of conjunctive rules where each path from the root to a terminal node defines one such rule. Specifically, a regression tree with $M$ terminal nodes can be represented as

$$T(\mathbf{x}) = \sum_{m=1}^{M} r_m(\mathbf{x}) \qquad (3)$$

Non-zero values of $r_m(\mathbf{x})$ correspond to a hyper-rectangle in the input feature space. These hyper-rectangles are non-overlapping with each other and collectively define a partitioning of the feature space.

In cases where all the features involved in a decision tree-induced rule are continuous variables, the rule in (2) can be reduced to a much simpler form. Let $S$ be the set of features involved in a rule $r$. One can now rewrite the expression for a conjunctive rule in (2) as follows.

$$r(\mathbf{x}) = a \prod_{j \in S} H(w_j x_j + b_j) \qquad (4)$$

where $H$ is a Heaviside step function with $H(0) = 0$, $a$ is the value at an associated terminal node, $w_j$ is either $+1$ or $-1$, and $b_j$ is the split threshold for feature $j$ if $w_j = -1$, and is the negative of the split threshold otherwise.

In order to create an initial pool consisting of a large number of rules, we first invoke existing fast algorithms such as Random Forests and Gradient Boosted Trees to generate tree ensembles. In the subsequent step, each tree thus produced is decomposed into a set of conjunctive rules as described above.
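The decomposition of a tree into conjunctive rules can be sketched in a few lines of Python. The nested-dictionary tree format, the toy tree itself and the helper names (`extract_rules`, `evaluate_rule`) are hypothetical and chosen only for illustration. Each condition is stored as a triple (feature index, sign, offset), with sign $-1$ and offset equal to the threshold for a left branch, and sign $+1$ and offset equal to the negated threshold for a right branch, evaluated through a step function that is zero at zero.

```python
def heaviside(u):
    # Step function with H(0) = 0, as assumed in the text.
    return 1.0 if u > 0 else 0.0

def extract_rules(node, path=()):
    """Return one (conditions, value) pair per root-to-leaf path."""
    if "value" in node:                      # terminal node: close the rule
        return [(list(path), node["value"])]
    j, t = node["feature"], node["threshold"]
    left = extract_rules(node["left"], path + ((j, -1.0, t),))     # t - x_j > 0
    right = extract_rules(node["right"], path + ((j, 1.0, -t),))   # x_j - t > 0
    return left + right

def evaluate_rule(rule, x):
    # Product of step functions along the path, scaled by the leaf value.
    conditions, value = rule
    out = value
    for j, w, b in conditions:
        out *= heaviside(w * x[j] + b)
    return out

# Hypothetical toy tree with three terminal nodes.
toy_tree = {
    "feature": 0, "threshold": 0.5,
    "left": {"value": 1.0},
    "right": {
        "feature": 1, "threshold": -0.2,
        "left": {"value": 2.0},
        "right": {"value": 3.0},
    },
}

rules = extract_rules(toy_tree)
activations = [evaluate_rule(r, [1.0, -1.0]) for r in rules]
```

For the input `[1.0, -1.0]` only the second rule is active, which reflects the fact that the rules of a single tree have non-overlapping supports.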

II-B Margin Maximizing Rules

In order to perform either an implicit or explicit selection of rules, we need a metric that quantifies their importance and can then be used to rank them in the order of their relevance. One such metric is the hinge loss, also referred to as the maximum margin loss, where rules with lower values of the loss function have higher relevance. Mathematically, the expression for this loss function is given as

$$L(r) = \sum_{i=1}^{N} \max\left(0, 1 - y_i\, \hat{r}(\mathbf{x}_i)\right) \qquad (5)$$

where $\hat{r}(\mathbf{x}) = r(\mathbf{x}) / \|r\|$. The quantity $y_i\, \hat{r}(\mathbf{x}_i)$, known as the margin or confidence in the prediction, is positive for a correct prediction and negative for a wrong one. Notice that, in the absence of the rule scaling $\|r\|$, the scale of $a$ could be artificially chosen to make the confidence arbitrarily large, which would render the definition of margin useless.

Let $n$ be the number of training examples that activate the rule, where $n^+$ of them belong to the positive class and the remaining $n^-$ come from the negative class. Using (4), the Euclidean norm of the rule can thus be computed as

$$\|r\| = \sqrt{\sum_{i=1}^{N} r(\mathbf{x}_i)^2} = |a| \sqrt{n}$$

For any training input $\mathbf{x}$, the activation value of a rule is either $0$ or $a$. Consequently, the normalized value $\hat{r}(\mathbf{x})$ is either $0$ or is given by

$$\hat{r}(\mathbf{x}) = \frac{a}{|a| \sqrt{n}} = \frac{\operatorname{sign}(a)}{\sqrt{n}}$$

which implies that the margin is unconditionally bounded from above by $1$ or, more precisely,

$$y_i\, \hat{r}(\mathbf{x}_i) \le \frac{1}{\sqrt{n}} \le 1$$

This allows us to simplify the expression for the hinge loss given in (5) to

$$L(r) = \sum_{i=1}^{N} \left(1 - y_i\, \hat{r}(\mathbf{x}_i)\right) = N - \operatorname{sign}(a)\, \frac{n^+ - n^-}{\sqrt{n}}$$

Since both $r$ and $-r$ are potentially valid rules, the quantifying metric becomes

$$\max\left(\frac{n^+ - n^-}{\sqrt{n}},\; \frac{n^- - n^+}{\sqrt{n}}\right) \qquad (7)$$

This means that the rules can simply be sorted in the decreasing order of the scores to obtain a ranking in the order of their relevance. In other words, higher values of the score in (7) correspond to higher relevance. Let us further simplify the expression given in (7) to

$$\frac{|n^+ - n^-|}{\sqrt{n}} \qquad (8)$$
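The ranking argument above can be checked numerically with a short Python sketch. It assumes the score takes the form $|n^+ - n^-| / \sqrt{n}$, where $n^+$ and $n^-$ count the activated positive and negative examples and $n = n^+ + n^-$. This form follows from the derivation in this section but is a reconstruction, and the helper names are hypothetical.

```python
import math

def margin_score(n_pos, n_neg):
    # Relevance score of a rule from the class counts inside its support.
    n = n_pos + n_neg
    if n == 0:
        return 0.0
    return abs(n_pos - n_neg) / math.sqrt(n)

def hinge_loss_of_rule(n_pos, n_neg, n_total, sign=1.0):
    # Hinge loss of the normalized rule, summed over all n_total examples.
    n = n_pos + n_neg
    if n == 0:
        return float(n_total)
    value = sign / math.sqrt(n)           # normalized activation inside the support
    loss_pos = max(0.0, 1.0 - value)      # activated examples with y = +1
    loss_neg = max(0.0, 1.0 + value)      # activated examples with y = -1
    inactive = n_total - n                # margin 0, hence hinge loss 1 each
    return n_pos * loss_pos + n_neg * loss_neg + inactive

def rank_rules(counts):
    # counts maps a rule name to (n_pos, n_neg); most relevant rule first.
    return sorted(counts, key=lambda name: margin_score(*counts[name]), reverse=True)

counts = {"A": (9, 0), "B": (8, 8), "C": (0, 16)}
ranking = rank_rules(counts)
```

A pure rule covering many examples (`C`) ranks above a smaller pure rule (`A`), and a rule with balanced classes (`B`) ranks last. For each rule, the smaller of the two hinge losses over the signs equals the number of examples minus the score.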


III Rule Generation Limitation of Conventional Decision Trees

The rule generation procedure of a conventional decision tree uses only one feature per node for recursive binary partitioning. As a result, every partition boundary in the feature space is axis-aligned: it is perpendicular to one feature axis and parallel to the others. In other words, such models have difficulty approximating oblique decision boundaries in the feature space.

Let us consider a linearly separable dataset as shown in Figure 1 (a), where the decision boundary is inclined at an angle of $45^\circ$ with either of the feature axes. Owing to the aforementioned rigidity, it can be seen from Figures 1 (b) and (c) that the only way to improve performance is to keep adding more rules to the ensemble. The inability to evolve the shapes of the rules is highly restrictive and works against the goal of achieving compact representations. Hence, we do not want to restrict the activated region of a rule to just a hyper-rectangle. This motivates the definition of a neural rule with trainable support.

[Figure 1 panels: (a) Linearly Separable, (b) Ensemble, 5 Rules, (c) Ensemble, 20 Rules]
Fig. 1: Limitations of Conventional Ensemble of Rules in approximating linearly separable datasets

IV Neural Rule Ensembles

In this part, we introduce a new form of the conjunctive rules (4) called Neural Rules. In order to see how a decision tree based rule inspires the design of a neural rule, let us revisit the expression of a conjunctive rule for a decision tree based rule given in (4),

$$r(\mathbf{x}) = a \prod_{j \in S} H(w_j x_j + b_j)$$

where $S$ is the set of features involved in a rule $r$, $H$ is a Heaviside step function with $H(0) = 0$, $a$ is the value at an associated terminal node, $w_j$ is either $+1$ or $-1$, and $b_j$ is the split threshold for feature $j$ if $w_j = -1$, and is the negative of the split threshold otherwise.

Rules extracted from a decision tree involve only one feature for every node. In a neural rule, we modify (4) to connect each node of a given rule with all the features used in the corresponding decision tree. Let $\tilde{\mathbf{x}}$ be a vector of all features used in the decision tree without repetitions. With this modification, we can now have oblique decision boundaries in the feature subspace spanned by $\tilde{\mathbf{x}}$. The updated expression of the rule, with one weight vector $\mathbf{w}_j$ and bias $b_j$ per decision node $j$ of the rule, looks as follows.

$$r(\mathbf{x}) = a \prod_{j} H\left(\mathbf{w}_j^T \tilde{\mathbf{x}} + b_j\right) \qquad (9)$$


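Denote the ReLU operation as $\rho(u) = \max(0, u)$. Next, we observe that a Heaviside step function with $H(0) = 0$ is invariant to the ReLU transformation of an input, that is, $H(\rho(u)) = H(u)$. Also note that the product of several Heaviside step functions can be represented using a single Heaviside step function and the minimum pooling operation, $\prod_j H(u_j) = H(\min_j u_j)$.

Using these identities, the rule in (9), with weight vector $\mathbf{w}_j$ and bias $b_j$ for decision node $j$ and feature vector $\tilde{\mathbf{x}}$, becomes

$$r(\mathbf{x}) = a\, H\left(\min_{j} \rho\left(\mathbf{w}_j^T \tilde{\mathbf{x}} + b_j\right)\right)$$

Since the derivative of a step function is zero, the gradients of all the learnable parameters will stay zero unless some modification is made. In order to be able to jointly train all the weights and splitting thresholds of all the nodes in a rule, we replace the Heaviside step function in the previous equations with an identity function. This gives us the neural rule in its final form as

$$r(\mathbf{x}) = a \min_{j} \rho\left(\mathbf{w}_j^T \tilde{\mathbf{x}} + b_j\right) \qquad (12)$$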
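A minimal Python sketch of the forward pass of a single neural rule follows, under the assumption that the final form is a terminal value multiplied by the minimum over ReLU units, one unit per decision node. The function names and the toy weights are hypothetical. The sketch also evaluates the corresponding hard rule, so that one can check that both are non-zero on exactly the same inputs.

```python
def relu(u):
    return u if u > 0 else 0.0

def pre_activations(x, weights, biases):
    # One affine unit per decision node of the rule: w_j . x + b_j.
    return [sum(wk * xk for wk, xk in zip(w, x)) + b
            for w, b in zip(weights, biases)]

def neural_rule(x, weights, biases, a):
    # Terminal value times the min-pooled ReLU activations.
    return a * min(relu(u) for u in pre_activations(x, weights, biases))

def hard_rule(x, weights, biases, a):
    # Product of step functions with H(0) = 0: active only if every unit is positive.
    return a if all(u > 0 for u in pre_activations(x, weights, biases)) else 0.0

# Toy rule with two decision nodes on two features: x_0 > 0.5 and x_1 < 2.
weights = [[1.0, 0.0], [0.0, -1.0]]
biases = [-0.5, 2.0]
a = 2.0

inside = neural_rule([1.5, 0.5], weights, biases, a)
outside = neural_rule([0.0, 0.5], weights, biases, a)
```

Inside the support, the output grows with the smallest pre-activation, so it reflects how far the input is from the nearest boundary of the rule. The hard rule only reports the constant terminal value.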


IV-A Initialization

We now discuss an initialization of a neural rule corresponding to any given rule obtained from a decision tree.

First, we make a list of all the features involved in that tree. All those features along with a bias unit are the input layer of a neural rule. The number of hidden units in the first hidden layer of a neural rule equals the number of decision nodes of a tree-induced rule, with one-to-one correspondence between them. The connection weight between the input feature and the hidden unit is $0$ if the corresponding decision node of that hidden unit does not involve the feature under consideration. It is $-1$ if the corresponding decision node utilizes that feature and traverses its left child along the rule path, and is $+1$ otherwise.

The magnitude of the bias for every hidden unit is given by the absolute value of the splitting threshold utilized in the corresponding decision node. The sign of the bias for a hidden unit is positive if the corresponding decision node traverses its left child along the rule path, otherwise it is negative.
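The initialization described above can be sketched as follows. The path encoding (feature, threshold, went_left) and the function name `init_neural_rule` are hypothetical. The sketch sets the bias to the threshold for a left branch and to the negated threshold for a right branch, which coincides with the sign rule stated above whenever the threshold is non-negative. How negative thresholds are handled is an assumption of this sketch.

```python
def init_neural_rule(path, tree_features):
    """Build initial weights and biases of a neural rule from one tree path.

    path: list of (feature, threshold, went_left) for each decision node
          on the root-to-leaf path.
    tree_features: ordered list of the distinct features used in the tree.
    """
    column = {f: k for k, f in enumerate(tree_features)}
    weights, biases = [], []
    for feature, threshold, went_left in path:
        w = [0.0] * len(tree_features)          # features not used by this node
        w[column[feature]] = -1.0 if went_left else 1.0
        weights.append(w)
        biases.append(threshold if went_left else -threshold)
    return weights, biases

# Toy path: feature 2 goes right at 0.5, then feature 0 goes left at 1.5.
weights, biases = init_neural_rule(
    path=[(2, 0.5, False), (0, 1.5, True)],
    tree_features=[0, 2],
)
```

Each hidden unit starts with a single non-zero weight, so the initial support of the neural rule is the hyper-rectangle of the tree-induced rule. The zero-initialized connections are the ones that can later tilt the boundaries during training.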

We use Figure 2 to illustrate one such mapping. Figure 2 (a) shows a decision tree with four rules. Figure 2 (b) shows a neural rule corresponding to the rule with terminal label 2 (red colored branch) of the decision tree. All the bold lines in a neural rule represent trainable parameters, with their initial values displayed alongside in Figure 2 (b). The red bold lines in a neural rule carry non-zero initial weights and have their counterparts in the decision tree, whereas black bold lines represent new connections with zero initialization.

[Figure 2 panels: (a) A Decision Tree, (b) A Neural Rule]
Fig. 2: Mapping a Tree-induced Rule, with terminal label 2, into a Neural Rule

We observed that the support of a proposed neural rule is a convex set. In order to allow for the rules to assume complicated non-convex shapes in the feature space, we extend the definition of a neural rule by stacking a new hidden layer with the same number of hidden units as the previous one. We refer to such a modification of the neural rule as a deep neural rule. Since we need to preserve the support of a tree-induced rule while mapping it into a corresponding deep neural rule, we use an identity transformation for initializing the parameters of this new hidden layer as depicted in Figure 3.

IV-B Characteristics

Trainable Support. Each affine term $\mathbf{w}_j^T \tilde{\mathbf{x}} + b_j$ in equation (12), with $\mathbf{w}_j$ and $b_j$ the weights and bias of the $j$-th hidden unit, represents a hyperplane in the feature subspace with the corresponding upper half space given by $\mathbf{w}_j^T \tilde{\mathbf{x}} + b_j > 0$. The application of the $\min$ operation evaluates the intersection of these upper half spaces and thus defines the activated region of a rule. For an input x that lies in the upper half space of the hyperplane given by $(\mathbf{w}_j, b_j)$, the value $\mathbf{w}_j^T \tilde{\mathbf{x}} + b_j$ is proportional to the shortest Euclidean distance of the input to that hyperplane. This quantifies the margin or level of confidence in the prediction: the further the input lies from the hyperplane in its upper half space, the more confident the rule is in its prediction for that input. During training using backpropagation, the hyperplane is rotated, shifted and scaled in order to maximize the expected margin of the inputs. Because of the pooling operation, each input contributes to updating the parameters of only the hyperplane that predicts the least margin for it at that training step among all the hyperplanes involved in a rule. The rationale here is to maximize the margin of an input only from the least confident hyperplane.

[Figure 3 panels: (a) A Decision Tree, (b) A Deep Neural Rule]
Fig. 3: Mapping a Tree-induced Rule, with terminal label 2, into a Deep Neural Rule

Restricted Gradients. For an input x that does not belong to the activated region of a rule, the smallest pre-activation $\min_j (\mathbf{w}_j^T \tilde{\mathbf{x}} + b_j)$ would be less than or equal to zero, which implies zero gradients of all the trainable parameters as the derivative of a ReLU activation for negative inputs is zero. This suggests that only the training examples lying inside the activated region are responsible for modifying the shape of this region. In order to maximize their margins, activated examples try to push or pull the rule boundaries depending on the sign of their class membership and the sign of the rule weight, $a$. If both of these signs agree, then the corresponding training examples push the rule boundary outwards, which expands the region and, as a result, brings in more training examples. Otherwise, if the signs do not match, then those contradictory examples pull in the rule boundary to get themselves out of it and thereby shrink the region.
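The gradient behaviour described in the two paragraphs above can be made concrete with a hand-written (sub)gradient of a single neural rule, assuming the rule output is a terminal value multiplied by the minimum over ReLU units. The function name and the toy parameters are hypothetical. Only the hidden unit with the smallest pre-activation receives a gradient, and no unit receives one when the input lies outside the support.

```python
def neural_rule_grads(x, weights, biases, a):
    """Gradients of a * min_j ReLU(w_j . x + b_j) w.r.t. all weights and biases."""
    pre = [sum(wk * xk for wk, xk in zip(w, x)) + b
           for w, b in zip(weights, biases)]
    j_star = min(range(len(pre)), key=pre.__getitem__)   # least confident unit
    grad_w = [[0.0] * len(x) for _ in weights]
    grad_b = [0.0] * len(biases)
    if pre[j_star] > 0:                                  # input inside the support
        grad_w[j_star] = [a * xk for xk in x]
        grad_b[j_star] = a
    return grad_w, grad_b

weights = [[1.0, 0.0], [0.0, -1.0]]
biases = [-0.5, 2.0]

# Input inside the support: only unit 0 (smallest pre-activation) is updated.
gw_in, gb_in = neural_rule_grads([1.5, 0.5], weights, biases, a=2.0)

# Input outside the support: every gradient is zero.
gw_out, gb_out = neural_rule_grads([0.0, 0.5], weights, biases, a=2.0)
```

When the smallest pre-activation is exactly zero the function returns zero gradients, which is one valid subgradient choice among several.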

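Compact Convex Support. Let $\Omega$ denote the support of a neural rule given by equation (12) and defined as $\Omega = \{\mathbf{x} : r(\mathbf{x}) \neq 0\}$. We show that $\Omega$ is a compactly supported convex set.

Proposition 1.

For any $\mathbf{x}, \mathbf{z} \in \Omega$, the convex combination of x and z satisfies

$$\theta_1 \mathbf{x} + \theta_2 \mathbf{z} \in \Omega$$

where $\theta_1, \theta_2 \ge 0$ with $\theta_1 + \theta_2 = 1$.

Given any $\mathbf{x} \in \Omega$, we have from equation (12), with $\mathbf{w}_j$ and $b_j$ the weights and bias of the $j$-th hidden unit,

$$\mathbf{w}_j^T \mathbf{x} + b_j > 0 \quad \text{for all } j$$

By multiplying both sides by $\theta_1$, we get

$$\theta_1 \mathbf{w}_j^T \mathbf{x} + \theta_1 b_j \ge 0 \qquad (13)$$

Similarly, we have for any $\mathbf{z} \in \Omega$,

$$\mathbf{w}_j^T \mathbf{z} + b_j > 0 \quad \text{for all } j$$

Multiplying both sides with $\theta_2$,

$$\theta_2 \mathbf{w}_j^T \mathbf{z} + \theta_2 b_j \ge 0 \qquad (14)$$

Adding equations (13) and (14), and noting that at least one of the two inequalities is strict because $\theta_1$ and $\theta_2$ cannot both be zero, we obtain

$$\mathbf{w}_j^T (\theta_1 \mathbf{x} + \theta_2 \mathbf{z}) + b_j > 0 \quad \text{for all } j$$

which implies

$$\theta_1 \mathbf{x} + \theta_2 \mathbf{z} \in \Omega$$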

IV-C Training

Conventional procedures for generating decision trees on binary classification tasks employ either the Gini index or the cross entropy measure. We use a new splitting criterion for growing a decision tree, based on the margin maximizing rules discussed in Section II-B. Let $n$ denote the number of examples. We use subscript $l$ to refer to the left child, $r$ for the right child and $p$ for the parent node. Additionally, superscripts $+$ and $-$ refer to positive and negative examples, respectively.

Assuming binary partitioning of a decision node, each split defines two simple rules and . Using the maximum margin metric for a rule given in (8), the node splitting criterion can be written as follows


We decompose a single decision tree into a set of conjunctive rules to obtain a pool of diverse feature interactions. Each of these rules is used to initialize its neural counterpart using the mapping discussed in Section IV-A. Such an ensemble of neural rules, collectively referred to as Neural Rule Ensembles (NRE), is essentially a 2-layered artificial neural network with a min pooling operation and thus a universal approximator [12]. However, in the proposed approach, feature interactions extracted from a decision tree are explicitly encoded into the network through its initialization, thus performing feature selection and leading to better generalization. Another characteristic of such an initialization is that the activations of any two pooled hidden units are orthogonal to each other.

After initializing the network, all the parameters are trained using Backpropagation [21]. We use the Adam optimization method [14] with learning rate to calculate the weight updates.

Input: Training data
Parameters: Learning rate , number of training epochs
Output: Trained Neural Rules Ensemble

1:  Standardize the training data: mean centering and scaling to unit standard deviation
2:  Build a decision tree using the splitting criterion from Eq. (15)
3:  Decompose the resulting tree into a set of conjunctive rules
4:  Map each tree-induced rule into a corresponding neural rule
5:  Initialize each neural rule as detailed in Section IV-A
6:  Train an ensemble of neural rules simultaneously using backpropagation
7:  return Ensemble of Neural Rules
Algorithm 1 Neural Rule Ensemble (NRE) Training

V Experiments

V-A Simulation Result

In this section, we perform a simulation to illustrate the ability of a neural rule to evolve and expand its activated region. We consider a rotated XOR dataset, which is a non-linearly separable dataset since there does not exist any single hyperplane that can separate the positive training examples (shown in blue) from the negative ones (shown in red). Additionally, since we have rotated the XOR dataset by , tree-based approaches such as Gradient boosted trees would have a hard time approximating the oblique decision boundaries and would require a large number of trees and/or rules.


Figure 4 (a) shows a single neural rule just after its initialization from a corresponding tree-induced rule. Figure 4 (b) shows an intermediary state after training for 150 iterations. It can be seen that the rule evolves its activated region to include more examples of the same type into its support. After training for a long time, the neural rule settles into an equilibrium state consisting of only positive examples as shown in Figure 4 (c).

[Figure 4 panels: (a) Initialization, (b) After 150 Iterations, (c) After 3k Iterations]
Fig. 4: Neural Rule: Evolution of the trainable support of a single rule with time

It is evident from Figure 4 (c) that there are still many positive training examples that do not belong to the support of a neural rule. In order to include them, a neural rule would have to assume a non-convex shape, which is not possible. This limitation has motivated the extension to a deep neural rule. Figure 5 shows the evolution of the activated region in the case of a deep neural rule, which can now achieve a non-convex shape and hence contain all the positive training examples in its support as shown in Figure 5 (c).

[Figure 5 panels: (a) Initialization, (b) After 150 Iterations, (c) After 3k Iterations]
Fig. 5: Deep Neural Rule: Evolution of the trainable support of a single rule with time

V-B Real Data Analysis

Datasets. In order to compare the performance of the proposed algorithm with state of the art approaches, we perform an empirical evaluation on simulated and multiple real datasets, which ensures a wide variety of different targets in terms of their dependence on the input features.

For simulation, we use a highly non-linear and multivariate artificial dataset, MADELON, featured in the NIPS 2003 feature selection challenge [11]. It is a generalization of the classic XOR dataset to five dimensions. Each vertex of a five dimensional hypercube contains a cluster of data points randomly labeled as $+1$ or $-1$. The five dimensions constitute 5 informative features, and 15 linear combinations of those features were added to form a set of 20 redundant but informative features. Additionally, a number of distractor features with no predictive power were added and the order of the features was randomized.

For benchmarking on real datasets, we will use Penn Machine Learning Benchmark (PMLB) [18] which includes datasets from a wide range of sources such as UCI ML repository [8], Kaggle, KEEL [2] and the meta-learning benchmark [20]. Since we are limiting our focus to binary classification tasks, we only consider datasets having two classes. Additionally, we removed all the datasets with fewer than training examples. This leaves us with a total of datasets for our investigation.

Statistical Tests. We use a statistical framework for hypothesis testing to investigate whether Neural Rule Ensembles (NRE) is significantly better than state-of-the-art classifiers, namely Random Forests (RF), Gradient Boosted Trees (GB) and Artificial Neural Networks (ANN). A hypothesis test is a decision between two complementary hypotheses, the null hypothesis $H_0$ and the alternate hypothesis $H_1$. We are trying to reject the null hypothesis, which states that there is no difference in the classification performance of the algorithms, that is, both of them perform equally well. We use the following statistical tests designed to compare two classifiers on multiple data sets.

Wilcoxon Signed-Rank Test. For the Wilcoxon signed-rank test [26], the results are sorted by the magnitude of absolute difference in the performance scores of the two classifiers. This is followed by assigning ranks from the lowest to the highest absolute difference. In case of ties, average ranks are assigned. Finally, a test statistic is formed based on the ranks of the positive and negative differences.

Dataset GB NRE difference rank
wilt 18.60 10.40 8.20 19.0
madelon 14.50 10.30 4.20 18.0
adult 12.91 14.22 -1.31 17.0
phoneme 9.25 8.14 1.11 16.0
dis 0.71 1.77 -1.06 15.0
titanic 27.49 26.89 0.60 14.0
churn 3.60 4.13 -0.53 13.0
banana 9.31 8.93 0.38 12.0
ring 3.15 3.51 -0.36 11.0
spambase 4.34 4.63 -0.29 10.0
kr-vs-kp 0.42 0.20 0.22 9.0
chess 0.21 0.42 -0.21 7.5
coil2000 6.04 5.83 0.21 7.5
twonorm 2.34 2.25 0.09 6.0
clean2 0.00 0.00 0.00 3.0
hypothyroid 1.47 1.47 0.00 3.0
agaricus-lepiota 0.00 0.00 0.00 3.0
magic 11.67 11.67 0.00 3.0
mushroom 0.00 0.00 0.00 3.0
wins 6 8
ties 5 5
TABLE I: Comparison of the test error performance of NRE with Gradient Boosted trees (GB) on binary classification tasks

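Let $d_i$ be the difference between the performance scores of two classifiers on the $i$-th data set, taken to be positive when NRE has the lower test error, as in the difference columns of the tables. Let $R^+$ be the sum of ranks for the data sets on which NRE outperforms the other classifier, and $R^-$ the sum of ranks on data sets where NRE gets defeated. Ranks corresponding to zero difference are split evenly between $R^+$ and $R^-$; if there is an odd number of them, one is ignored:

$$R^+ = \sum_{d_i > 0} \operatorname{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \operatorname{rank}(d_i), \qquad R^- = \sum_{d_i < 0} \operatorname{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \operatorname{rank}(d_i)$$

The test statistic $T$ is given by

$$T = \min(R^+, R^-)$$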




For a two-tailed test at the $\alpha = 0.05$ significance level, the critical value of the test statistic corresponding to $N = 19$ data sets is $46$. In other words, if $T$ is less than or equal to $46$, NRE can be considered significantly better than the other classifier with $p < 0.05$ and we can reject the null hypothesis in favor of the alternate one.
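The computation of the Wilcoxon statistic can be sketched in Python as follows. The input is the difference column of Table I (GB error minus NRE error). The helper names are hypothetical, and the sketch splits the ranks of all zero differences evenly between the two rank sums, which is the convention that reproduces the value reported for GB in Table V.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) to largest, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def wilcoxon_T(differences):
    # Ranks are assigned on absolute differences; zero differences are shared.
    ranks = average_ranks([abs(d) for d in differences])
    zero = sum(r for d, r in zip(differences, ranks) if d == 0)
    r_plus = sum(r for d, r in zip(differences, ranks) if d > 0) + zero / 2
    r_minus = sum(r for d, r in zip(differences, ranks) if d < 0) + zero / 2
    return min(r_plus, r_minus), r_plus, r_minus

# Difference column of Table I, in the order the data sets are listed.
gb_minus_nre = [8.20, 4.20, -1.31, 1.11, -1.06, 0.60, -0.53, 0.38, -0.36, -0.29,
                0.22, -0.21, 0.21, 0.09, 0.0, 0.0, 0.0, 0.0, 0.0]

T, r_plus, r_minus = wilcoxon_T(gb_minus_nre)
```

The two rank sums always add up to $N(N+1)/2$, which is $190$ for $19$ data sets and provides a quick sanity check on the ranks.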

Sign Test: Wins, Losses & Ties Counts. The sign test [22, 24] is much weaker than the Wilcoxon signed-rank test and will not reject the null-hypothesis unless one algorithm almost always outperforms the other. In the sign test, we compare the generalization performance of classifiers by counting the number of data sets on which a classifier outperforms others.

Under the assumption that the null hypothesis is correct, that is, both classifiers perform equally well, one would expect each of them to win on approximately $N/2$ out of $N$ data sets. This tells us that the number of wins is distributed according to the binomial distribution.

For $N = 19$ datasets, the critical number of wins needed to reject the null hypothesis for a two-tailed sign test at the $\alpha = 0.05$ significance level is $14$. This implies that NRE can be considered significantly better than the other classifier with $p < 0.05$ if it is the overall winner on at least $14$ out of $19$ datasets. Since the null hypothesis is true for ties, instead of discarding them, we distribute them evenly between the two classifiers, ignoring one of the ties if there is an odd number of them.
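The win counting used by the sign test can be sketched as below. The tie handling follows the description above. The threshold is computed with the normal approximation to the binomial distribution, $N/2 + z\sqrt{N}/2$ with $z = 1.96$ for a two-tailed test at the $0.05$ level. Whether the critical value quoted in the text was obtained this way is an assumption of this sketch, and the function names are hypothetical.

```python
import math

def effective_wins(wins, ties):
    # Split ties evenly between the two classifiers; an odd tie is ignored.
    return wins + ties // 2

def critical_wins_normal(n_datasets, z=1.96):
    # Normal approximation to Binomial(n, 1/2): mean n/2, std sqrt(n)/2.
    return math.ceil(n_datasets / 2 + z * math.sqrt(n_datasets) / 2)

n_datasets = 19
threshold = critical_wins_normal(n_datasets)

# (wins, ties) of NRE taken from the last rows of Tables I, II and III.
nre_vs = {"GB": (8, 5), "RF": (13, 5), "ANN": (12, 5)}
nre_wins = {name: effective_wins(w, t) for name, (w, t) in nre_vs.items()}
significant = {name: w >= threshold for name, w in nre_wins.items()}
```

With these counts, the comparison against RF and ANN reaches the threshold while the comparison against GB does not.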

Dataset RF NRE difference rank
madelon 26.40 10.30 16.10 19.0
wilt 21.60 10.40 11.20 18.0
coil2000 7.06 5.83 1.23 17.0
phoneme 9.00 8.14 0.86 16.0
banana 9.56 8.93 0.63 15.0
titanic 27.49 26.89 0.60 14.0
spambase 4.92 4.63 0.29 13.0
twonorm 2.52 2.25 0.27 12.0
adult 14.47 14.22 0.25 11.0
kr-vs-kp 0.42 0.20 0.22 10.0
magic 11.88 11.67 0.21 9.0
hypothyroid 1.68 1.47 0.21 8.0
chess 0.62 0.42 0.20 7.0
ring 3.33 3.51 -0.18 6.0
agaricus-lepiota 0.00 0.00 0.00 3.0
mushroom 0.00 0.00 0.00 3.0
dis 1.77 1.77 0.00 3.0
clean2 0.00 0.00 0.00 3.0
churn 4.13 4.13 0.00 3.0
wins 1 13
ties 5 5
TABLE II: Comparison of the test error performance of NRE with Random Forests (RF) on binary classification tasks

Test Error Evaluation. In this section, we compare NRE with Gradient Boosted Trees (GB), Random Forests (RF) and Artificial Neural Networks (ANN) on 19 datasets. The test errors for the datasets without a test set are obtained using five-fold cross-validation.
For each classifier, the operating settings and the tuned hyperparameters are the following:

  • Random Forests: The number of trees used in the forest are tuned from the set .

  • Gradient Boosted Trees: We use boosting iterations with the maximum tree depth selected from the range .

  • Artificial Neural Networks: Fully connected networks with a single hidden layer (since NRE contains one hidden layer) and rectified linear (ReLU) activation. The number of hidden units is selected for optimal performance.

  • Neural Rule Ensembles: Maximum depth of the tree used for initializing the network is searched over the set .

The hyperparameters for the methods being evaluated have been obtained by internal five-fold cross-validation on the training set. We use the scikit-learn implementation for evaluating the existing algorithms.

Dataset ANN NRE difference rank
madelon 45.50 10.30 35.20 19.0
phoneme 14.18 8.14 6.04 18.0
wilt 14.20 10.40 3.80 17.0
churn 6.27 4.13 2.14 16.0
coil2000 7.46 5.83 1.63 15.0
spambase 3.47 4.63 -1.16 14.0
ring 2.52 3.51 -0.99 13.0
magic 12.44 11.67 0.77 12.0
adult 14.79 14.22 0.57 11.0
hypothyroid 1.89 1.47 0.42 9.5
kr-vs-kp 0.62 0.20 0.42 9.5
banana 9.31 8.93 0.38 8.0
twonorm 2.43 2.25 0.18 7.0
dis 1.94 1.77 0.17 6.0
agaricus-lepiota 0.00 0.00 0.00 3.0
mushroom 0.00 0.00 0.00 3.0
clean2 0.00 0.00 0.00 3.0
chess 0.42 0.42 0.00 3.0
titanic 26.89 26.89 0.00 3.0
wins 2 12
ties 5 5
TABLE III: Comparison of the test error performance of NRE with Artificial Neural Networks (ANN) on binary classification tasks

NRE vs Gradient Boosted Trees. From Table I, it can be seen that NRE wins on 8 data sets, GB wins on 6 data sets and there are 5 ties. Ignoring one tie and splitting the remaining ones evenly, we find that NRE is better on 10 out of 19 datasets. Since the critical number of wins needed under the sign test is 14, we fail to reject the null hypothesis. Similarly, we fail to reject the null hypothesis under the Wilcoxon signed-rank test because the test statistic of 81 is greater than 46. This implies that we do not have enough statistical evidence to establish that NRE outperforms GB. However, NRE initialized from a single tree is competitive with boosted trees and is a more compactly represented model.

NRE vs Random Forest. We find from Table II that NRE outperforms RF on all the data sets except for the ring data set and the 5 ties. Splitting the ties evenly, NRE is better on 15 out of 19 data sets, which is greater than the critical number of wins needed, that is 14, under the sign test. We can therefore reject the null hypothesis. For the Wilcoxon signed-rank test, the statistic of 13.5 is less than the critical value of 46, which allows us to reject the null hypothesis as well. This implies that NRE is significantly better than Random Forest and, given that it utilizes only one tree compared to the ensemble of trees in RF, it is more compact too.

NRE vs Artificial Neural Network. It is evident from Table III that NRE outperforms ANN on 12 data sets, loses on 2 data sets and there are 5 ties. NRE passes the sign test since it is better on 14 data sets (splitting the ties evenly), which matches the critical number of wins needed. Since the test statistic for the Wilcoxon signed-rank test, 34.5, is less than the critical value of 46, we reject the null hypothesis in favor of the alternate one. Both of the statistical tests agree that NRE is significantly better than the Artificial Neural Networks.

Dataset Observations Features GB RF ANN NRE
wilt 4839 6 18.60 21.60 14.20 10.40
madelon 2600 500 14.50 26.40 45.50 10.30
phoneme 5404 6 9.25 9.00 14.18 8.14
kr-vs-kp 3197 37 0.42 0.42 0.62 0.20
coil2000 9822 86 6.04 7.06 7.46 5.83
banana 5300 3 9.31 9.56 9.31 8.93
twonorm 7400 21 2.34 2.52 2.43 2.25
adult 48842 15 12.91 14.47 14.79 14.22
dis 3772 30 0.71 1.77 1.94 1.77
churn 5000 21 3.60 4.13 6.27 4.13
ring 7400 21 3.15 3.33 2.52 3.51
spambase 4601 58 4.34 4.92 3.47 4.63
chess 3196 37 0.21 0.62 0.42 0.42
titanic 2201 4 27.49 27.49 26.89 26.89
hypothyroid 3163 26 1.47 1.68 1.89 1.47
magic 19020 11 11.67 11.88 12.44 11.67
mushroom 8124 23 0.00 0.00 0.00 0.00
clean2 6598 169 0.00 0.00 0.00 0.00
agaricus-lepiota 8145 23 0.00 0.00 0.00 0.00
wins 11 3 4 13
TABLE IV: Comparison of the test error performance of NRE with Gradient Boosted trees (GB), Random Forests (RF) and Artificial Neural Networks (ANN) on binary classification tasks

Overall comparison. Table V summarizes the Wilcoxon signed-rank $T$ statistics and the number of NRE wins versus the other methods, with significant results shown in bold. Table IV lists the classification test errors of all the methods, together with the number of observations and the number of features of each dataset.

Statistic GB RF ANN
Wilcoxon T Statistic 81 13.5 34.5
Number of NRE wins 10 15 14
TABLE V: Summary results comparing the NRE with GB, RF and ANN.

VI Conclusion

In this work, we presented a novel method called Neural Rule Ensembles (NRE) for encoding into a neural network and refining the feature interactions captured by a decision tree. This was achieved by defining a neural transformation of a tree-induced rule using ReLU units and the min pooling operation. Such a mapping addresses the initialization related concerns of fully connected neural networks as well as the feature selection problem, and enables learning of compact representations compared to conventional tree-based approaches.

Empirical evaluations on binary classification datasets from the Penn Machine Learning Benchmark (PMLB) [18] were performed to compare the generalization performance of Neural Rule Ensembles (NRE) with state-of-the-art approaches such as Random Forests (RF), Gradient Boosted Trees (GB) and Artificial Neural Networks (ANN). We used two statistical tests, the Wilcoxon signed-rank test and the sign test, to evaluate the statistical significance of these results. Both of these statistical tests found NRE to be significantly better than Random Forests and Artificial Neural Networks with $p < 0.05$. When NRE was compared to Gradient Boosted Trees, we could not find enough statistical evidence to reject the null hypothesis stating that both of them perform equally well. However, NRE only utilizes one tree, so it obtains a more compact and interpretable representation.


References

  1. D. Akdemir, N. Heslot and J. Jannink (2013-01) Soft rule ensembles for supervised learning. pp. 78–83.
  2. J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac and S. García (2011) KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Multiple-Valued Logic and Soft Computing 17 (2-3), pp. 255–287.
  3. J. Błaszczyński, B. Prusak and R. Słowiński (2016) Multi-objective search for comprehensible rule ensembles. In International Joint Conference on Rough Sets, pp. 503–513.
  4. L. Breiman (2001) Random forests. Machine Learning 45 (1), pp. 5–32.
  5. K. W. De Bock (2017) The best of two worlds: balancing model strength and comprehensibility in business failure prediction using spline-rule ensembles. Expert Systems with Applications 90, pp. 23–39.
  6. K. Dembczyński, W. Kotlowski and R. Slowiński (2008) Maximum likelihood rule ensembles. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, New York, NY, USA, pp. 224–231. ISBN 978-1-60558-205-4.
  7. H. Deng (2019) Interpreting tree ensembles with inTrees. International Journal of Data Science and Analytics 7 (4), pp. 277–287.
  8. D. Dheeru and E. Karra Taniskidou (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
  9. J. H. Friedman and B. E. Popescu (2008) Predictive learning via rule ensembles. The Annals of Applied Statistics, pp. 916–954.
  10. J. H. Friedman (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189–1232.
  11. I. Guyon, S. Gunn, A. Ben-Hur and G. Dror (2004) Result analysis of the NIPS 2003 feature selection challenge. In NIPS, pp. 545–552.
  12. K. Hornik (1991) Approximation capabilities of multilayer feedforward networks. Neural Networks 4 (2), pp. 251–257.
  13. Y. Ioannou, D. P. Robertson, D. Zikic, P. Kontschieder, J. Shotton, M. Brown and A. Criminisi (2016) Decision forests, convolutional networks and the models in-between. CoRR abs/1603.01250.
  14. D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980.
  15. P. Kontschieder, M. Fiterau, A. Criminisi and S. R. Bulò (2015-12) Deep neural decision forests. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1467–1475.
  16. M. Nalenz and M. Villani (2017-02) Tree ensembles with rule structured horseshoe regularization. The Annals of Applied Statistics 12.
  17. M. Nalenz (2016) Horseshoe RuleFit: learning rule ensembles via Bayesian regularization.
  18. R. S. Olson, W. La Cava, P. Orzechowski, R. J. Urbanowicz and J. H. Moore (2017-12-11) PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining 10 (1), pp. 36. ISSN 1756-0381.
  19. J. R. Quinlan (1993) C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. ISBN 1-55860-238-0.
  20. M. Reif (2012) A comprehensive dataset for evaluating approaches of various meta-learning tasks. In ICPRAM (1), P. L. Carmona, J. S. Sánchez and A. L. N. Fred (Eds.), pp. 273–276.
  21. D. E. Rumelhart, G. E. Hinton and R. J. Williams (1988) Neurocomputing: foundations of research. J. A. Anderson and E. Rosenfeld (Eds.), pp. 696–699. ISBN 0-262-01097-6.
  22. S. L. Salzberg (1997-09-01) On comparing classifiers: pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery 1 (3), pp. 317–328. ISSN 1573-756X.
  23. I. K. Sethi (1991) Decision tree performance enhancement using an artificial neural network implementation. In Artificial Neural Networks and Statistical Pattern Recognition, I. K. Sethi and A. K. Jain (Eds.), Machine Intelligence and Pattern Recognition, Vol. 11, pp. 71–88. ISSN 0923-0459.
  24. D. J. Sheskin (2007) Handbook of parametric and nonparametric statistical procedures. 4th edition, Chapman & Hall/CRC. ISBN 1584888148, 9781584888147.
  25. J. Welbl (2014) Casting random forests as artificial neural networks (and profiting from it). In Pattern Recognition, X. Jiang, J. Hornegger and R. Koch (Eds.), Cham, pp. 765–771. ISBN 978-3-319-11752-2.
  26. F. Wilcoxon (1945) Individual comparisons by ranking methods. Biometrics Bulletin 1 (6), pp. 80–83. ISSN 00994987.