Decision Stream: Cultivating Deep Decision Trees
Abstract
Various modifications of decision trees have been used extensively in recent years due to their high efficiency and interpretability. Tree node splitting based on relevant feature selection is a key step of decision tree learning and, at the same time, its major shortcoming: recursive node partitioning leads to a geometric reduction of the amount of data in the leaf nodes, which causes excessive model complexity and overfitting. In this paper, we present a novel architecture, a Decision Stream, aimed at overcoming this problem. Instead of building a tree structure during the learning process, we propose merging nodes from different branches based on their similarity, estimated with two-sample test statistics, which generates a deep directed acyclic graph of decision rules that can consist of hundreds of levels. To evaluate the proposed solution, we test it on several common machine learning problems: credit scoring, Twitter sentiment analysis, aircraft flight control, MNIST and CIFAR image classification, and synthetic data classification and regression. Our experimental results reveal that the proposed approach significantly outperforms the standard decision tree learning methods on both regression and classification tasks, yielding a prediction error decrease of up to 35 %.
1 Introduction
With the recent growth in the amount of data available for analysis and exploration, there is an inevitable need for comprehensive and automated methods of intelligent data processing. The decision tree (DT) is one of the most popular techniques in this area, and thanks to its robustness and efficiency this prediction model has become a standard tool for machine learning and big data problems. The idea behind the method is to separate one complex decision rule into a union of primitive rules, which leads to another crucial property: compared to many other machine learning techniques, a DT can be easily interpreted by humans.
The DT construction is performed by recursive data partitioning. At each stage the best splitting rule is determined, and data from the current node is divided into child nodes according to the selected criterion. The same procedure is recursively applied to all new nodes in the generated tree until the stopping condition is met. While being a fast and clear way of data splitting, the geometrical reduction of data quantity in the nodes leads to their exhaustion and causes poor generalization ability and data overfitting. Since multiple partitioning generates many nodes with the same or similar label distribution (especially in the lower layers), it looks quite natural to merge such nodes to diminish the problem of data exhaustion and to continually increase the purity of the separated samples.
In this paper, we propose a novel method for regression and classification tasks, a Decision Stream (DS), where decision branches are loosely split and merged like the natural streams of a waterfall (Figure 1). In contrast to the classical decision tree algorithm, the proposed method builds a deep directed acyclic graph with a higher degree of connectivity by merging statistically indistinguishable nodes, which reduces the model width and improves generalization thanks to more representative data samples. The split and merge operations are combined in this approach and repeated at each step of the iterative learning process. The experiments demonstrate that the proposed method achieves notably better results than the standard decision tree approach, while showing high computational performance during training in distributed systems. The data and software related to this paper are available on GitHub^{1}.
The rest of the paper is organized as follows. Section 2 gives an overview of related work. Section 3 presents the proposed approach in detail, and Section 4 provides the experimental results obtained on real-world problems as well as on synthetic data. Section 5 summarizes our conclusions.
2 Related Work
Decision trees have been extensively studied, and a large number of their modifications were developed during the past years. The proposed methods include the Iterative Dichotomiser 3 and its successor — C4.5 [1], Classification and Regression Tree (CART) [2], Chi-squared Automatic Interaction Detection (CHAID) [3], Quick, Unbiased, Efficient, Statistical Tree (QUEST) [4] and various modifications of these algorithms [4]–[8]. Despite the essential difference in the training procedure, they usually tend to show similar performance on many real-world regression and classification tasks [9]–[15].
The majority of these algorithms consider only node partitioning for decision tree construction, or use node merging as an auxiliary procedure that has no significant effect on the tree structure. For instance, the C4.5 and CART algorithms as well as their modifications [4]–[8] perform only node splitting based on the selected features, without any merging or fusion operations. The QUEST algorithm merges several classes into two superclasses to produce one binary split [16]. In [17], the number of terminal nodes is reduced by fusing the leaves with similar predictions after the training is finished. The CHAID algorithm merges data samples within a node, which is equivalent to using a modified splitting criterion. Data samples are fused based on the significance of their similarity estimated by test statistics: the $\chi^2$ test [3] for categorical labels and the F-test [18] for continuous ones.
A fundamentally different approach to decision tree size reduction, based on the concept of Occam's razor, was proposed in [19]. There, a decision graph is constructed with a hill-climbing heuristic by merging nodes from adjacent levels according to the minimum message length principle, with the goal of producing a model of minimum size while preserving or increasing its accuracy. This technique has demonstrated an advantage over standard decision trees in experiments [20]–[22].
In this work, we present a Decision Stream algorithm that combines the classical decision tree learning method with a new procedure: statistically based merging of nodes from the same and/or different levels of the DS. The predictive model grows until no further improvement is achievable, considering different data recombinations, which results in a deep directed acyclic graph architecture and a statistically significant data partition.
3 Decision Stream
In this section, we describe the proposed Decision Stream algorithm. The main concept of this method is to merge similar nodes after each splitting iteration. The similarity is estimated using two-sample test statistics that are applied to compare the distributions of labels in each pair of nodes. The nodes are merged if the difference is statistically insignificant. This procedure eliminates the classical problem of decision trees, the progressive decrease of the amount of data in the leaf nodes, and produces a more general structure: a directed acyclic graph (Figure 1), which can be extremely deep. A more detailed explanation of the algorithm is provided below.
3.1 Node Merging with Two-Sample Test Statistics
An overview of the merging operation is shown in Fig. ?. After the classical decision tree branching, the merging algorithm takes as input the leaf nodes generated at the current stage (Fig. ?(a)) as well as previously obtained unsplit leaves from the upper levels of the model, and fuses statistically similar nodes (Fig. ?(b-c)) using an input parameter, the significance threshold. Since the nodes are merged based on the similarity of their label distributions, the merging procedure can be regarded as statistically based label clustering.
Merging Algorithm ? consists of an outer and inner loop. In the outer loop the leaves are sorted in ascending order according to the number of associated samples. The inner loop consists of the following three steps:
The leaf with the fewest samples is picked from the head of the sorted collection.
For each pair formed by this leaf and one of the remaining leaves, we compute the similarity of the two nodes and then take the leaf that corresponds to the highest value. The similarity is calculated by a function based on two-sample test statistics (Section 3.3). This function returns the significance level representing the probability that the mean values of the labels associated with these two nodes are identical.
If the obtained significance level is above the threshold, the two leaves are merged into a new leaf whose parents are obtained by uniting the parents of the merged nodes.
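The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a leaf is represented only by its list of label values, the bookkeeping of parent nodes is omitted, the outer loop is assumed to repeat until a full pass produces no merge, and `toy_similarity` is a hypothetical stand-in for the two-sample test of Section 3.3.

```python
from statistics import mean
from typing import Callable, List, Sequence

Leaf = List[float]  # assumption: a leaf is just the list of its label values


def toy_similarity(a: Leaf, b: Leaf) -> float:
    """Hypothetical stand-in for a two-sample test: 1 for equal means, lower otherwise."""
    return max(0.0, 1.0 - abs(mean(a) - mean(b)))


def merge_leaves(leaves: Sequence[Leaf],
                 similarity: Callable[[Leaf, Leaf], float],
                 threshold: float) -> List[Leaf]:
    """Repeatedly fuse the most similar leaf pairs whose similarity exceeds the threshold."""
    pool = [list(leaf) for leaf in leaves]
    merged_any = True
    while merged_any:                    # outer loop: stop when a pass merges nothing
        merged_any = False
        queue = sorted(pool, key=len)    # ascending by number of samples
        done: List[Leaf] = []
        while queue:                     # inner loop over the sorted collection
            head = queue.pop(0)          # step 1: take the leaf at the head
            if queue:
                # step 2: find the remaining leaf most similar to the head
                best = max(range(len(queue)),
                           key=lambda i: similarity(head, queue[i]))
                # step 3: merge when the similarity is above the threshold
                if similarity(head, queue[best]) > threshold:
                    done.append(head + queue.pop(best))
                    merged_any = True
                    continue
            done.append(head)
        pool = done
    return pool
```

For example, `merge_leaves([[0, 0], [0, 0, 0], [1, 1, 1, 1]], toy_similarity, 0.5)` fuses the two all-zero leaves and keeps the all-one leaf separate.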
3.2 Decision Stream Training
The whole DS training procedure is described in Algorithm ?, where each learning iteration consists of two steps. At the first step, DS grows using the classical decision tree branching operation — new nodes are created by splitting all current non-terminal leaves [2]. At the second step, the leaves are merged using the procedure described in Algorithm ?. A leaf is marked as terminal if it cannot be split into statistically different child nodes. The pair of splitting and merging steps is iteratively performed till the stopping criterion is met. If all leaves are terminal or the prediction accuracy is not improved, the DS training is finished and Algorithm ? returns the reference to the root node of the generated DS. To estimate the prediction accuracy, we use a cross-node Gini impurity measure calculated for leaf nodes and classes:
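$$G = \sum_{l=1}^{L} \frac{N_l}{N} \sum_{k=1}^{K} p_{lk}\,(1 - p_{lk}),$$
where $L$ and $K$ are the numbers of leaf nodes and classes, $N$ and $N_l$ are the numbers of samples in all leaves and in leaf node $l$, respectively, and $p_{lk}$ is the fraction of samples of class $k$ in leaf $l$.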
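As an illustration, the measure can be computed as below. This sketch assumes that the cross-node Gini impurity is the sample-weighted sum of the per-leaf Gini impurities, which is what the definitions of the sample counts and class fractions suggest; the function name and the representation of a leaf as a list of class labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cross_node_gini(leaves: Sequence[Sequence[Hashable]]) -> float:
    """Sample-weighted Gini impurity over a collection of leaves (assumed form)."""
    total = sum(len(leaf) for leaf in leaves)
    if total == 0:
        raise ValueError("at least one leaf must contain samples")
    impurity = 0.0
    for leaf in leaves:
        n = len(leaf)
        if n == 0:
            continue  # an empty leaf contributes nothing
        # per-leaf Gini impurity: sum over classes of p * (1 - p)
        leaf_gini = sum((c / n) * (1.0 - c / n) for c in Counter(leaf).values())
        impurity += (n / total) * leaf_gini  # weight by the leaf's share of samples
    return impurity
```

A set of pure leaves gives 0, while `cross_node_gini([[0, 1], [0, 0]])` gives 0.25: the mixed leaf has impurity 0.5 and holds half of the samples.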
3.3 Splitting/Merging Criteria
The splitting and merging operations are performed according to the significance threshold. We take as the null hypothesis that the labels of two nodes come from the same distribution and have the same mean value. The null hypothesis is rejected at the significance level given by this threshold, and in case of rejection we consider the nodes to be statistically different. The similarity is estimated by a function that uses a pair of two-sample test statistics. We use the Z-test/Student's t-test for labels with a presumably normal distribution. The choice between the two is determined by the rule in [23]: the Z-test is applied if the size of both data samples is greater than 30, and Student's t-test otherwise. For labels with a non-normal distribution we use the Kolmogorov-Smirnov/Mann-Whitney U tests: the first one is applied if the size of the data samples is greater than 2, and the second otherwise. We prefer the Kolmogorov-Smirnov test over the Mann-Whitney U test since it is more sensitive to variations of both the location and the shape of the empirical cumulative distribution function [24], and provides better prediction accuracy in our experiments.
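The selection rule and one of the tests can be sketched with the Python standard library alone. The function names are hypothetical, the size condition for the Kolmogorov-Smirnov test is assumed to apply to both samples, and the two-sided unpaired Z-test below uses sample variances in place of known population variances, which is a common large-sample approximation rather than something the text specifies.

```python
import math
from statistics import NormalDist, mean, variance
from typing import Sequence


def choose_test(n_a: int, n_b: int, normal_labels: bool) -> str:
    """Pick the two-sample test by label distribution and sample sizes."""
    if normal_labels:
        return "z-test" if n_a > 30 and n_b > 30 else "t-test"
    # assumption: both samples must have more than 2 elements
    return "kolmogorov-smirnov" if n_a > 2 and n_b > 2 else "mann-whitney-u"


def z_test_p_value(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided p-value of an unpaired two-sample Z-test on the means."""
    std_err = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if std_err == 0.0:
        # both samples are constant: identical means or clearly different ones
        return 1.0 if mean(a) == mean(b) else 0.0
    z = (mean(a) - mean(b)) / std_err
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))
```

With a significance threshold of 0.01, for instance, two nodes would be treated as statistically different whenever `z_test_p_value` returns a value below 0.01.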
We propose two different versions of the split function: one for relatively small datasets, where a precise selection of the split is crucial, and one for large-scale datasets, where a trade-off between accuracy and running time is important due to the large number of training samples.
3.4 Node Splitting for Non-Distributed Data
For non-distributed datasets the splitting is performed according to Algorithm ?, which takes as input the significance threshold and a particular node. First, binary splits of the data within the node are generated for each unique value of every feature. Then the similarity function is calculated for each split, and the one with the lowest significance of similarity is selected. If this significance is smaller than the input threshold, the selected best split is returned; otherwise, splitting is rejected and the node becomes terminal. Though this method is rather computationally expensive, it provides the best split quality and is reasonable for compact datasets.
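A minimal sketch of this exhaustive search is given below. It assumes numeric features split with a "less than or equal" rule, represents the node as parallel lists of feature rows and labels, and uses a hypothetical `toy_similarity` in place of the two-sample tests of Section 3.3; none of these details are taken from the paper.

```python
from statistics import mean
from typing import Callable, List, Optional, Sequence, Tuple


def toy_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Hypothetical stand-in for a two-sample test p-value."""
    return max(0.0, 1.0 - abs(mean(a) - mean(b)))


def best_binary_split(rows: Sequence[Sequence[float]],
                      labels: Sequence[float],
                      similarity: Callable[[List[float], List[float]], float],
                      threshold: float) -> Optional[Tuple[int, float]]:
    """Return (feature index, split value) of the least similar split, or None."""
    best: Optional[Tuple[float, int, float]] = None
    for j in range(len(rows[0])):                    # every feature
        for value in sorted({row[j] for row in rows}):   # every unique value
            left = [y for row, y in zip(rows, labels) if row[j] <= value]
            right = [y for row, y in zip(rows, labels) if row[j] > value]
            if not left or not right:
                continue                             # not a real split
            p = similarity(left, right)
            if best is None or p < best[0]:          # keep the lowest similarity
                best = (p, j, value)
    if best is None or best[0] >= threshold:
        return None                                  # rejected: node becomes terminal
    return best[1], best[2]
```

The nested loops evaluate one candidate per unique value of each feature, and every evaluation scans all samples in the node, which is why the search is only practical for compact datasets.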
3.5 Node Splitting for Distributed Data
Using the above algorithm for large-scale datasets is infeasible in most cases, so we propose a different way of split selection designed for big data solutions. Instead of the greedy search, we perform data splitting based on the feature that is most correlated with the label within a particular node [25]. Another difference of the proposed method is that it attempts to produce multiple leaves for each node, as shown in Figure 2, since the large number of samples makes such a split robust.
Algorithm ? shows the body of the method. The procedure starts with a function that selects the feature most correlated with the label. The obtained feature is then used to split the samples in the current node. If the feature is categorical, the samples are split by its categories, each one forming a leaf node. If the feature is continuous, all samples are first sorted according to the values of the feature and then divided into a number of ranges that depends on the number of samples in the node. Samples from the same range are then associated with one leaf node (Figure 2(a)). At the next step, the adjacent leaves are merged using Algorithm ? with the significance threshold until all neighboring nodes are statistically distinguishable (Figure 2(b-c)). Finally, as soon as splitting with regard to the categorical or continuous feature is finished, the obtained leaf nodes are merged again (this time not only adjacent ones) and the leaves providing statistically different predictions are returned.
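The strength of correlation between the feature and the label is estimated as described in Algorithm ?: if the feature and the label are both continuous, the correlation strength is calculated as the coefficient of determination:
$$R^2 = \frac{\left(\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\right)^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2},$$
otherwise it is computed as the correlation ratio:
$$\eta^2 = \frac{\sum_{c} n_c\,(\bar{y}_c - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2},$$
where $x_i$ and $y_i$ are the values of the two variables for sample $i$, $\bar{x}$ and $\bar{y}$ are their means, and $n_c$ and $\bar{y}_c$ are the number of samples in category $c$ and the mean of the continuous variable within that category.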
Since both coefficients measure the same characteristics in discrete and continuous cases, we can compare the values obtained for different types of features to select the best one.
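Both measures can be sketched in a few lines of Python. The forms used here are the textbook ones, assumed rather than taken from the paper: the coefficient of determination as the squared Pearson correlation, and the correlation ratio in its squared form, so that both lie in [0, 1] and can be compared directly. The function names are hypothetical.

```python
from statistics import mean
from typing import Dict, Hashable, List, Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two continuous variables."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable carries no correlation
    return sxy * sxy / (sxx * syy)


def correlation_ratio(categories: Sequence[Hashable], y: Sequence[float]) -> float:
    """Between-category share of the variance of y (squared correlation ratio)."""
    my = mean(y)
    total = sum((b - my) ** 2 for b in y)
    if total == 0:
        return 0.0
    groups: Dict[Hashable, List[float]] = {}
    for c, b in zip(categories, y):
        groups.setdefault(c, []).append(b)
    between = sum(len(g) * (mean(g) - my) ** 2 for g in groups.values())
    return between / total
```

A split feature would then be chosen by evaluating the appropriate function for every feature in the node and keeping the one with the largest value.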
4 Experiments
In this section, we describe the experiments conducted to evaluate the performance of the proposed Decision Stream algorithm. The solution was tested on five common machine learning problems, and on large-scale synthetic classification/regression data.
4.1 Datasets
- Credit scoring^{2}: classification problem, 2 classes, 10 features, 100K training and 20K test samples.
- Twitter sentiment analysis^{3}: classification problem, 3 classes (positive, negative, neutral), 500 features, 6500 training and 824 test samples. Features were generated using the bag-of-words model.
- F16 aircraft control problem (Ailerons)^{4}: regression problem, 40 features, 7154 training and 6596 test samples.
- MNIST handwritten digits classification^{5}: 10 classes, 784 features, 60K training and 10K test samples.
- CIFAR-10 image classification^{6}: 10 classes, 1024 features, 50K training and 10K test samples. Features were extracted from the last convolutional layer of the pre-trained ResNet-18 [26] CNN.
To tune model parameters, the training data for each problem was split into training (90 %) and validation (10 %) subsets. The same data was used for training and testing both decision tree and Decision Stream algorithms.
To get the baseline accuracy, we used the Scikit-learn
Additionally, DS and DT algorithms were tested on large-scale synthetic classification and regression data generated on the fly by Spark Performance Tests Suite
4.2 Tuning the Significance Threshold
The significance threshold is the key parameter of the DS algorithm, and in the first experiment our goal was to estimate its optimal value for each problem. The threshold was tuned as follows: for each dataset we varied it over a range of values up to 0.5 and for each value estimated the accuracy of DS on the validation set. For synthetic data the similarity of labels was estimated by the unpaired two-sample Z-test and Student's t-test, and for all other datasets by the Kolmogorov-Smirnov and Mann-Whitney U nonparametric tests. For classification problems we use the standard accuracy metric, and for regression tasks the weighted absolute percentage error:
$$\mathrm{WAPE}(X, Y) = \frac{\sum_{i} |y_i - \hat{y}_i|}{\sum_{i} |y_i|} \cdot 100\,\%, \tag{1}$$
where $X$ and $Y$ are the validation samples and their corresponding labels, and $y_i$ and $\hat{y}_i$ are the label and the prediction for sample $x_i$, respectively.
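For reference, a short sketch of the regression metric follows. It assumes the common definition of the weighted absolute percentage error, the sum of absolute errors divided by the sum of absolute label values and expressed in percent; the function name is hypothetical.

```python
from typing import Sequence


def wape(labels: Sequence[float], predictions: Sequence[float]) -> float:
    """Weighted absolute percentage error in percent (assumed standard form)."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    denominator = sum(abs(y) for y in labels)
    if denominator == 0:
        raise ValueError("WAPE is undefined when all labels are zero")
    absolute_error = sum(abs(y - p) for y, p in zip(labels, predictions))
    return 100.0 * absolute_error / denominator
```

For labels `[10, 10]` and predictions `[9, 12]`, the absolute errors sum to 3 and the labels to 20, so the function returns 15.0.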
The results of the experiment are presented in Figure 3. The best accuracy was achieved at the significance threshold that is equal to 0.005 for credit scoring, 0.05 for tweets, 0.02 for aileron control, 0.005 for MNIST, 0.01 for CIFAR-10 and 0.001 for synthetic data. The obtained values were used for DS training in the following experiments.
4.3 Classification and Regression Results for Non-Distributed Data
This section presents the results obtained using a Decision Stream implementation for non-distributed data. Along with the single DS and DT models, we train their ensembles generated using five methods: random forest [27], extremely randomized trees [28], gradient boosting [29] and bagging [30]. Table 1 shows the results for the single DT and DS models and for a DS with the merging phase disabled.
Model | Credit scoring | Tweet sentiments | Aileron control | MNIST | CIFAR-10 |
DT | 9.73 | 45.2 | 25.5 | 12.5 | 13.9 |
DS (merging disabled) | 6.33 | 39.9 | 18.2 | 25.0 | 19.7 |
DS | 6.36 | 38.8 | 16.4 | 10.3 | 13.8 |
We should note that DS with the merging phase disabled is not equivalent to DT, since in this version node splitting is performed only if the resulting child nodes are statistically distinguishable. The results demonstrate that disabling the merging phase leads to substantially different accuracy: while on the first dataset, with relatively low complexity (2 classes, 10 features), it prevents minor overfitting, for the other datasets, with higher complexity (3–10 classes or a continuous label, 40–1024 features), it results in an oversimplified tree model. Enabling the merging operation changes the situation: the growth does not stop at the stage of a simple predictive model that has many similar leaf nodes. The merging operation fuses them and thus forces the training procedure to continue, which can result in very deep decision graphs. Figure 4 illustrates this oscillating behavior: the merge operation is performed until no more statistically distinguishable nodes can be produced. Table 1 demonstrates that this leads to significantly higher accuracy compared to the standard decision tree architecture: the error on the first four datasets is reduced by 34 %, 14 %, 35 % and 17 %, respectively.
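The quoted reductions can be checked directly from Table 1. The short script below takes the DT row and the row of the DS with merging enabled (6.36, 38.8, 16.4, 10.3) and computes the relative error decrease; truncating to whole percent, which is an assumption about how the figures were rounded, reproduces the numbers in the text.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative decrease of the error with respect to the baseline, in percent."""
    return 100.0 * (baseline - improved) / baseline


# Errors on credit scoring, tweet sentiments, aileron control and MNIST (Table 1)
dt_errors = [9.73, 45.2, 25.5, 12.5]
ds_errors = [6.36, 38.8, 16.4, 10.3]

reductions = [relative_reduction(dt, ds) for dt, ds in zip(dt_errors, ds_errors)]
truncated = [int(r) for r in reductions]  # drop the fractional part
print(truncated)
```

Without truncation the values are roughly 34.6 %, 14.2 %, 35.7 % and 17.6 %.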
Figure 5 illustrates the dependency between the size and the predictive error of ensembles constructed from decision trees and Decision Streams. The best results for all datasets are summarized in Table 2.
Model | | Credit scoring | Tweet sentiments | CIFAR-10 | Aileron control | MNIST |
DT | Method | | | | | |
DT | Error, % | 7.62 | 30.4 | | 23.9 | 2.91 |
DS | Method | | | | | |
DS | Error, % | 6.31 | 38.8 | 13.0 | 15.0 | 2.66 |
As one can see, in all cases the best performance of Decision Stream ensemble was obtained when using the extremely randomized trees algorithm. The explanation of this effect is the following: in contrast to decision trees, the construction of Decision Streams involves a large number of recombinations caused by continuously repeating splitting and merging operations. The chances that DS will find the optimal solution are therefore higher compared to DT, but at the same time the resulting Decision Streams tend to provide less diverse results. The power of ensemble significantly depends on the diversity of predictors, which is thus lower in case of Decision Streams. Extremely randomized trees method partially solves this problem by using random features for training the DS, and therefore it tends to provide better final results compared to other methods.
In almost all cases the best results for Decision Stream are achieved by ensembles of size 500, with the only exception for twitter sentiment analysis problem. The greatest advantage of DS over the DT is obtained on the credit scoring and aileron control tasks: a single DS outperforms all DT ensembles. Overall, the Decision Stream based methods have shown the best results on four out of five datasets with an average advantage of 16 %.
4.4 Classification and Regression Results for Large-Scale Data
The next set of experiments is conducted using Apache Spark-based
Figure 6 shows the classification error, the regression weighted absolute percentage error (Equation 1) and the training time for DT with a depth ranging from 3 to 15 levels, and for DS, whose depth is regulated automatically. According to the results, decision trees trained with the variance reduction metric and a depth restriction of 5 levels demonstrate the best accuracy in both classification and regression tasks and so are used in our further experiments.
The prediction error of the Decision Stream algorithm is 9 to 48 times lower than the error obtained by DT. One explanation of this significant difference is that the generated synthetic data had a distribution close to normal, so the Z-test/t-test pair was especially effective in this case. Another reason is that the better accuracy was also obtained at the expense of a higher running time of the DS algorithm.
To find the time required for DS and DT to provide the same accuracy, and to compare the accuracy after corresponding training periods, we carried out experiments with different amounts of training data and different numbers of models in the ensembles (Figure 7). According to the empirical results presented in Table 3, DS needs significantly less data and less training time to provide the same quality of prediction as DT in both classification and regression tasks; for comparable training time, Decision Stream demonstrates significantly better accuracy.
Gradient boosting and random forest ensembles improve DT performance, though the minimal error of ensembles with 30 decision trees is still higher than the corresponding error of 30 Decision Streams: the difference reaches 46 to 48 times for classification and 5.9 to 8.3 times for regression tasks. Thus, the proposed modification of Decision Stream for large-scale data demonstrates faster training and better accuracy on both regression and classification tasks compared to the DT algorithm.
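Task | Condition | Model | Samples | Time, s | Error, % |
Classification | Same number of samples | DT | | 39.2 ± 2.31 | 18.2 ± 2.33 |
Classification | Same number of samples | DS | | 62.4 ± 3.14 | 0.38 ± 0.08 |
Classification | Same number of samples | Ratio | 1 | 0.63 | 48 |
Classification | Similar time | DT | | 22.7 ± 1.62 | 19.9 ± 2.43 |
Classification | Similar time | DS | 5 10 | 16.7 ± 1.12 | 6.36 ± 0.83 |
Classification | Similar time | Ratio | 5 | 1.36 | 3 |
Classification | Same accuracy | DT | | 22.7 ± 1.62 | 19.8 ± 2.40 |
Classification | Same accuracy | DS | 10 | 3.82 ± 0.23 | 19.6 ± 2.28 |
Classification | Same accuracy | Ratio | 25 | 6 | 1 |
Regression | Same number of samples | DT | | 28.3 ± 3.43 | 12.2 ± 1.28 |
Regression | Same number of samples | DS | | 60.1 ± 2.99 | 1.32 ± 0.12 |
Regression | Same number of samples | Ratio | 1 | 0.47 | 9 |
Regression | Similar time | DT | | 28.3 ± 3.43 | 12.2 ± 1.28 |
Regression | Similar time | DS | 5 | 19.9 ± 1.29 | 6.68 ± 0.72 |
Regression | Similar time | Ratio | 20 | 1.42 | 1.83 |
Regression | Same accuracy | DT | | 17.1 ± 2.15 | 13.2 ± 0.13 |
Regression | Same accuracy | DS | 10 | 5.6 ± 0.31 | 13.1 ± 0.11 |
Regression | Same accuracy | Ratio | 25 | 3 | 1 |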
5 Conclusion
In this paper we presented a novel decision tree based algorithm — a Decision Stream, which avoids the problems of data exhaustion and formation of unrepresentative data samples in decision tree nodes by merging the leaves from the same and/or different levels of the predictive model structure. By increasing the number of samples in each node and reducing the tree width, the proposed algorithm preserves statistically representative data and allows extremely deep graph architecture that can consist of hundreds of levels. The main parameter of the algorithm — significance threshold, determines the results of each split/merge operation and automatically defines the depth of the Decision Stream model.
The experiments demonstrated that Decision Stream algorithm shows a strong advantage over the standard decision tree learning methods on both regression and classification tasks in both versions: non-distributed for relatively small datasets, where a precise selection of the best data splits is crucial; and distributed, where a balance between the accuracy and computational performance should be maintained.
Footnotes
- https://github.com/aiff22/Decision-Stream
- https://www.kaggle.com/c/GiveMeSomeCredit/data/
- http://alt.qcri.org/semeval2015/task10/
- http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html
- http://yann.lecun.com/exdb/mnist/
- https://www.cs.toronto.edu/~kriz/cifar.html
- http://scikit-learn.org (v. 0.18.1)
- https://github.com/databricks/spark-perf/ (v. 1.6)
- http://spark.apache.org (v. 1.6)
References
- J. R. Quinlan, “C4.5: Programs for machine learning,” Mach. Learn., vol. 16, pp. 235–240, 1994.
- L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and regression trees. Belmont, Wadsworth, 1984.
- G. V. Kass, “An exploratory technique for investigating large quantities of categorical data,” Appl. Stat., vol. 29, pp. 119–127, 1980.
- W.-Y. Loh, “Fifty years of classification and regression trees,” Intern. Stat. Review, vol. 82, pp. 329–348, 2014.
- A. Panhalkar and D. Doye, “An outlook in some aspects of hybrid decision tree classification approach: a survey,” in ICDECT, 2016, pp. 85–95.
- K. Kyoungok, “A hybrid classification algorithm by subspace partitioning through semi-supervised decision tree,” Pattern Recogn., vol. 60, pp. 157–163, 2016.
- H. Zhao and X. Li, “A cost sensitive decision tree algorithm based on weighted class distribution with batch deleting attribute mechanism,” Inform. Sciences, vol. 378, pp. 303–316, 2017.
- J. Sanz, J. Fernandez, H. Bustince, C. Gradin, M. Fortún and T. Belzunegui, “A decision tree based approach with sampling techniques to predict the survival status of poly-trauma patients,” IJCIS, vol. 10, pp. 440–455, 2017.
- M. Ture, F. Tokatli and I. Kurt, “Using Kaplan-Meier analysis together with decision tree methods (C&RT, CHAID, QUEST, C4.5 and ID3) in determining recurrence-free survival of breast cancer patients,” Expert Syst. Appl., vol. 36, pp. 2017–2026, 2009.
- D. Delen, C. Kuzey and A. Uyar, “Measuring firm performance using financial ratios: a decision tree approach,” Expert Syst. Appl., vol. 40, pp. 3970–3983, 2013.
- P. C. Pendharkar and H. Khurana, “Machine learning techniques for predicting hospital length of stay in Pennsylvania Federal and Specialty hospitals,” IJACSA, vol. 11, pp. 45–56, 2014.
- I. R. Glăvan, D. Petcu and E. Simion, “CART versus CHAID behavioral biometric parameter segmentation analysis,” in SECITC, 2016, pp. 59–68.
- K. C. Chu, H. J. Huang and Y. S. Huang, “Machine learning approach for distinction of ADHD and OSA,” in ASONAM, 2016, pp. 1044–1049.
- S. Jhajharia, S. Verma and R. Kumar, “A cross-platform evaluation of various decision tree algorithms for prognostic analysis of breast cancer data,” in ICICT, 2016, pp. 1–7.
- C. S. Pitombo, A. D. Souza and A. Lindner, “Comparing decision tree algorithms to estimate intercity trip distribution,” Transport. Res. C-EMER, vol. 77, pp. 16–32, 2017.
- W.-Y. Loh and Y.-S. Shih, “Split selection methods for classification trees,” Stat. Sin., vol. 7, pp. 815–840, 1997.
- A. Ciampi, J. Thiffault and U. Sagman, “RECPAM: a computer program for recursive partition amalgamation for censored survival data and other situations frequently occurring in biostatistics,” Comput. Meth. Prog. Bio., vol. 30, pp. 283–296, 1989.
- R. Nisbet, J. Elder and G. Miner, Handbook of statistical analysis and data mining applications. Canada, Academic Press, 2009.
- J. J. Oliver and C. S. Wallace, “Inferring decision graphs,” in IJCAI, 1991, pp. 593–603.
- R. Kohavi, “Bottom-up induction of oblivious read-once decision graphs: strengths and limitations,” in AAAI, 1994, pp. 613–618.
- P. J. Tan and D. L. Dowe, “Decision Forests with Oblique Decision Trees,” in MICAI, 2006, pp. 593–603.
- J. Shotton, T. Sharp, P. Kohli, S. Nowozin, J. Winn and A. Criminisi, “Decision Jungles: Compact and Rich Models for Classification,” in NIPS, 2013, pp. 234–242.
- R. C. Sprinthall, Basic statistical analysis. Boston, Pearson Education, 2011.
- G. W. Corder and D. I. Foreman, Nonparametric statistics: a step-by-step approach. New York, Wiley, 2014.
- N. Salehi, H. Yazdi and H. Poostchi, “Correlation based splitting criterion in multi-branch decision tree,” Cent. Eur. J. Comp. Sci., vol. 2, pp. 205–220, 2011.
- K. He, X. Zhang, S. Ren and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
- T. K. Ho, “Random decision forests,” in ICDAR, 1995, pp. 278–282.
- P. Geurts, D. Ernst and L. Wehenkel, “Extremely randomized trees,” Mach. Learn., vol. 63, pp. 3–42, 2006.
- J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Ann. Stat., vol. 29, pp. 1189–1232, 2001.
- L. Breiman, “Bagging predictors,” Mach. Learn., vol. 24, pp. 123–140, 1996.