Multinomial Random Forest: Toward Consistency and Privacy-Preservation
Despite the impressive performance of standard random forests (RF), their theoretical properties are not yet thoroughly understood. In this paper, we propose a novel RF framework, dubbed multinomial random forest (MRF), to study consistency and privacy-preservation. Instead of a deterministic greedy split rule, the MRF adopts two impurity-based multinomial distributions to randomly select a split feature and a split value, respectively. Theoretically, we prove the consistency of the proposed MRF and analyze its privacy-preservation within the framework of differential privacy. We also demonstrate on multiple datasets that its performance is on par with the standard RF. To the best of our knowledge, MRF is the first consistent RF variant that has comparable performance to the standard RF.
Random forest (RF) Breiman (2001) is a popular type of ensemble learning method. Because of its excellent performance and its fast, efficient training process, the standard RF and several of its variants have been widely used in many fields, such as computer vision Cootes et al. (2012); Kontschieder et al. (2015) and data mining Bifet et al. (2009); Xiong et al. (2012). However, due to the inherent bootstrap randomization and the highly greedy, data-dependent construction process, it is very difficult to analyze the theoretical properties of random forests Biau (2012), especially consistency. Since consistency ensures that the model converges to the optimum given a sufficient amount of data, this property is especially critical in this big data era.
To address this issue, several RF variants Breiman (2004); Biau et al. (2008); Genuer (2012); Biau (2012); Denil et al. (2014); Wang et al. (2017) were proposed. Unfortunately, all these consistent RF variants suffer from relatively poor performance compared with the standard RF due to two mechanisms introduced for consistency. On the one hand, the data partition process allows only half of the training samples to be used for the construction of the tree structure, which significantly reduces the performance of consistent RF variants. On the other hand, extra randomness (e.g., a Poisson or Bernoulli distribution) is introduced, which further hinders the performance. Accordingly, the mechanisms introduced for theoretical analysis make it difficult to eliminate the performance gap between consistent RF and standard RF.
Is this gap really impossible to close? In this paper, we propose a new consistent RF framework, dubbed multinomial random forest (MRF), by introducing randomness more reasonably, as shown in Figure 1. In the MRF, two impurity-based multinomial distributions are used as the basis for randomly selecting a split feature and a specific split value, respectively. Accordingly, the “best” split point has the highest probability of being chosen, while other candidate split points that are nearly as good as the “best” one also have a good chance of being selected. This randomized splitting process is more reasonable and makes up for the accuracy drop with almost no extra computational complexity. Besides, the introduced impurity-based randomness is essentially an exponential mechanism satisfying differential privacy, and the randomized prediction of each tree proposed in this paper also adopts the exponential mechanism. Accordingly, we can also analyze the privacy-preservation of MRF under the differential privacy framework. To the best of our knowledge, this privacy-preservation property, which is important since the training data may well contain sensitive information, has never been analyzed for previous consistent RF variants.
The main contributions of this work are three-fold: 1) we propose a multinomial-based method to improve the greedy split process; 2) we propose a new random forest variant, dubbed multinomial random forest (MRF), and analyze its consistency and privacy-preservation; 3) extensive experiments demonstrate that the performance of MRF is on par with Breiman’s original RF and better than that of all existing consistent RF variants. MRF is the first consistent RF variant whose performance is comparable to that of the standard RF.
2 Related Work
2.1 Consistent Random Forests
Random forest Breiman (2001) is a distinguished ensemble learning algorithm inspired by the random subspace method Ho (1998) and random split selection Dietterich (2000). In the original method, decision trees are built upon bootstrap datasets from the training set using the CART methodology Breiman (2017). Various variants, such as quantile regression forests Meinshausen (2006) and deep forests Zhou and Feng (2017), were proposed and used in a wide range of applications Cootes et al. (2012); Kontschieder et al. (2015); Bifet et al. (2009); Xiong et al. (2012) for their effective training process and great performance. Despite the widespread use of random forests in practice, a theoretical analysis of their success has yet to be fully established. Breiman (2001) showed the first theoretical result, indicating that the generalization error is bounded by the performance of the individual trees and the diversity of the whole forest. After that, the relationship between random forests and a type of nearest neighbor-based estimator was studied by Lin and Jeon (2006).
One important property, consistency, has yet to be established for random forests. Consistency ensures that the result of RF converges to the optimum as the sample size increases; it was first discussed in Breiman’s mathematical heuristics report Breiman (2004). As an important milestone, Biau et al. (2008) proved the consistency of two directly simplified random forests. Subsequently, several consistent RF variants were proposed for various purposes, for example, random survival forests Ishwaran and Kogalur (2010), an online random forest variant Denil et al. (2013), and generalized regression forests Athey et al. (2019). Recently, Haghiri et al. (2018) proposed CompRF, whose split process relies on triplet comparisons rather than information gain. To ensure consistency, Biau (2012) suggested that an independent dataset is needed to fit the leaves. This approach is called data partition. Under this framework, Denil et al. (2014) developed a consistent RF variant (called Denil14 in this paper), narrowing the gap between theory and practice. Following Denil14, Wang et al. (2017) introduced Bernoulli random forests (BRF), which reached state-of-the-art performance. The comparison of the MRF with the standard RF and with these two most advanced consistent RFs (i.e., Denil14 and BRF) is shown in Figure 2; a more comprehensive comparison is given in the Appendix.
Although several consistent RF variants have been proposed, their performance remains relatively poor compared with the standard RF, so how to close the gap between theoretical consistency and practical performance is still an important open problem.
In addition to the exploration of consistency, some schemes Mohammed et al. (2011); Patil and Singh (2014) were also presented to address privacy concerns. Among those schemes, differential privacy Dwork (2006), as a new and promising privacy-preservation model, has been widely adopted in recent years. In what follows, we outline the basic content of differential privacy.
Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ denote a dataset consisting of $n$ observations, where $x_i \in \mathbb{R}^{D}$ represents $D$-dimensional features and $y_i$ is the corresponding label of the $i$-th observation. Let $\mathcal{A} = \{A_1, \dots, A_D\}$ represent the feature set. The formal definition of differential privacy is as follows.
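Definition 1 ($\epsilon$-Differential Privacy).
A randomized mechanism $\mathcal{M}$ gives $\epsilon$-differential privacy if, for every set of outputs $S$ and any neighboring datasets $\mathcal{D}$ and $\mathcal{D}'$ differing in one record, $\mathcal{M}$ satisfies
$$\Pr[\mathcal{M}(\mathcal{D}) \in S] \le \exp(\epsilon) \cdot \Pr[\mathcal{M}(\mathcal{D}') \in S],$$
where $\epsilon$ denotes the privacy budget that restricts the privacy guarantee level of $\mathcal{M}$. A smaller $\epsilon$ represents a stronger privacy level.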
Currently, two basic mechanisms are widely used to guarantee differential privacy: the Laplace mechanism Dwork et al. (2006) and the exponential mechanism McSherry and Talwar (2007), where the former is suitable for numeric queries and the latter is suitable for non-numeric queries. Since the MRF mainly involves selection operations, we adopt the exponential mechanism to preserve privacy.
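Definition 2 (Exponential Mechanism).
Let $q(\mathcal{D}, r)$ be a score function of dataset $\mathcal{D}$ that measures the quality of output $r \in \mathcal{R}$. The exponential mechanism $\mathcal{M}(\mathcal{D})$ satisfies $\epsilon$-differential privacy if it outputs $r$ with probability proportional to $\exp\left(\frac{\epsilon\, q(\mathcal{D}, r)}{2 \Delta q}\right)$, i.e.,
$$\Pr[\mathcal{M}(\mathcal{D}) = r] = \frac{\exp\left(\frac{\epsilon\, q(\mathcal{D}, r)}{2 \Delta q}\right)}{\sum_{r' \in \mathcal{R}} \exp\left(\frac{\epsilon\, q(\mathcal{D}, r')}{2 \Delta q}\right)},$$
where $\Delta q$ is the sensitivity of the quality function, defined as
$$\Delta q = \max_{r \in \mathcal{R}} \; \max_{\text{neighboring } \mathcal{D}, \mathcal{D}'} \left| q(\mathcal{D}, r) - q(\mathcal{D}', r) \right|.$$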
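To make the definition concrete, here is a minimal Python sketch of sampling with the exponential mechanism, assuming the standard form in which a candidate is drawn with probability proportional to exp(epsilon * score / (2 * sensitivity)). The function name `exponential_mechanism` and its parameters are hypothetical and are not taken from the paper.

```python
import math
import random


def exponential_mechanism(candidates, scores, epsilon, sensitivity, rng=random):
    """Draw one candidate with probability proportional to
    exp(epsilon * score / (2 * sensitivity)).

    Returns the chosen candidate and the full probability vector.
    """
    if not candidates or len(candidates) != len(scores):
        raise ValueError("candidates and scores must be non-empty and aligned")
    if epsilon < 0 or sensitivity <= 0:
        raise ValueError("epsilon must be >= 0 and sensitivity must be > 0")
    # Subtracting the maximum score does not change the normalized
    # probabilities but prevents overflow inside exp().
    top = max(scores)
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    total = sum(weights)
    probabilities = [w / total for w in weights]
    chosen = rng.choices(candidates, weights=probabilities, k=1)[0]
    return chosen, probabilities


if __name__ == "__main__":
    choice, probs = exponential_mechanism(
        ["a", "b", "c"], [0.0, 0.5, 1.0], epsilon=2.0, sensitivity=1.0
    )
    print(choice, [round(p, 3) for p in probs])
```

A larger `epsilon` concentrates the probability mass on the highest-scoring candidate, while `epsilon = 0` yields a uniform draw that ignores the data entirely.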
3 Multinomial Random Forests
3.1 Training Set Partition
Compared to the standard RF, the MRF replaces the bootstrap technique with a partition of the training set, which is necessary for consistency, as suggested in Biau (2012). Specifically, to build a tree, the training set $\mathcal{D}$ is divided randomly into two non-overlapping subsets $\mathcal{D}^S$ and $\mathcal{D}^E$, which play different roles. One subset, $\mathcal{D}^S$, is used to build the structure of a tree; we call the observations in this subset the structure points. Once a tree is built, the labels of its leaves are re-determined on the basis of the other subset, $\mathcal{D}^E$; we call the observations in this second subset the estimation points. An illustration of this process is shown in Fig. 3.
The ratio of two subsets is parameterized by partition rate = . To build another tree, the training set is re-partitioned randomly and independently.
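As a rough illustration of the partition described above, the following Python sketch shuffles a training set and cuts it into structure points and estimation points. The function name `partition_training_set` and the parameter `structure_fraction` (the share of samples used as structure points) are hypothetical stand-ins and do not reproduce the paper's partition rate. The sketch only assumes that the two subsets are disjoint and that each tree draws a fresh, independent partition.

```python
import random


def partition_training_set(samples, structure_fraction=0.5, rng=random):
    """Randomly split samples into (structure_points, estimation_points).

    structure_fraction is an assumed parameter: the share of samples that
    is used to build the tree structure.
    """
    if not 0.0 < structure_fraction < 1.0:
        raise ValueError("structure_fraction must lie strictly between 0 and 1")
    shuffled = list(samples)   # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = int(round(structure_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    data = list(range(10))
    # Every tree gets its own independent partition of the same training set.
    for tree_id in range(3):
        structure, estimation = partition_training_set(data)
        print(tree_id, sorted(structure), sorted(estimation))
```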
3.2 Tree Construction
The construction of a tree relies on a recursive partitioning algorithm, which is shown in Algorithm 1. Specifically, to split a node, we introduce two impurity-based multinomial distributions: one for split feature selection and another for split value selection. A specific split point is a pair consisting of a split feature and a split value. In the classification problem, the impurity decrease at a node $u$ caused by a split point $v$ is defined as
$$I(\mathcal{D}^S_u, v) = T(\mathcal{D}^S_u) - \frac{|\mathcal{D}^{S_l}_u|}{|\mathcal{D}^S_u|}\, T(\mathcal{D}^{S_l}_u) - \frac{|\mathcal{D}^{S_r}_u|}{|\mathcal{D}^S_u|}\, T(\mathcal{D}^{S_r}_u),$$
where $\mathcal{D}^S_u$ is the subset of the structure points $\mathcal{D}^S$ at node $u$; $\mathcal{D}^{S_l}_u$ and $\mathcal{D}^{S_r}_u$, generated by splitting $\mathcal{D}^S_u$ with $v$, are the two subsets in the left child and right child of node $u$, respectively; and $T(\cdot)$ is the impurity criterion (e.g., Shannon entropy or Gini index). Unless otherwise specified, we omit the subscript $u$ of each symbol and write $I$ for $I(\mathcal{D}^S, v)$ as shorthand in the rest of this paper.
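The impurity decrease can be sketched in a few lines of Python: it is the impurity of the parent node minus the size-weighted impurities of the two children. The sketch below uses the Gini index as the criterion; the function names, the `(feature_vector, label)` data layout, and the convention that samples with a feature value less than or equal to the threshold go left are all assumptions made for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def impurity_decrease(points, feature, threshold, impurity=gini):
    """Impurity decrease of splitting points on x[feature] <= threshold.

    points is a list of (feature_vector, label) pairs.
    """
    parent = [y for _, y in points]
    n = len(parent)
    if n == 0:
        return 0.0
    left = [y for x, y in points if x[feature] <= threshold]
    right = [y for x, y in points if x[feature] > threshold]
    return (
        impurity(parent)
        - len(left) / n * impurity(left)
        - len(right) / n * impurity(right)
    )


if __name__ == "__main__":
    toy = [((0.1,), 0), ((0.2,), 0), ((0.8,), 1), ((0.9,), 1)]
    print(impurity_decrease(toy, feature=0, threshold=0.5))   # perfect split
    print(impurity_decrease(toy, feature=0, threshold=0.15))  # weaker split
```

On the toy data, the threshold 0.5 separates the two classes completely and therefore removes all of the parent's impurity, whereas the threshold 0.15 removes only part of it.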
Let denote the set of all possible split points for the node and is the corresponding impurity decrease, where is -th value on the -th feature. In what follows, we first introduce the feature selection mechanism for a node, and then describe the split value selection mechanism corresponding to the selected feature.
-based split feature selection. At first, we obtain a vector based on each , where , , is the largest possible impurity decrease of the feature . Then, the following three steps need to be performed:
Normalize : ;
Compute the probabilities , where is a hyper-parameter related to privacy budget;
Randomly select a feature according to the multinomial distribution .
-based split value selection. After selecting the feature for a node, we need to determine the corresponding split value to construct two children. Suppose has possible split values, we need to perform the following steps:
Normalize as , where identifies the feature ;
Compute the probabilities , where is another hyper-parameter related to privacy budget;
Randomly select a split value according to the multinomial distribution .
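The two selection stages above can be sketched as follows. This is a minimal illustration rather than the paper's exact procedure: it assumes min-max normalization of the impurity decreases and a softmax whose scale is the privacy-related hyper-parameter, which matches the exponential-mechanism view taken in the paper but may differ from it in constant factors. The names `select_split`, `b1`, and `b2` are hypothetical.

```python
import math
import random


def min_max_normalize(values):
    """Rescale values to [0, 1]; a constant vector maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def softmax(scores, scale):
    """Probabilities proportional to exp(scale * score), computed stably."""
    top = max(scores)
    weights = [math.exp(scale * (s - top)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def select_split(decreases, b1, b2, rng=random):
    """Pick a (feature index, value index) pair in two multinomial draws.

    decreases[j][i] is the impurity decrease of the i-th candidate split
    value on feature j.
    """
    # Stage 1: each feature is scored by its largest impurity decrease.
    best_per_feature = [max(row) for row in decreases]
    feature_probs = softmax(min_max_normalize(best_per_feature), b1)
    j = rng.choices(range(len(decreases)), weights=feature_probs, k=1)[0]
    # Stage 2: the split value is drawn among candidates of that feature.
    value_probs = softmax(min_max_normalize(decreases[j]), b2)
    i = rng.choices(range(len(decreases[j])), weights=value_probs, k=1)[0]
    return j, i, feature_probs, value_probs


if __name__ == "__main__":
    toy = [[0.10, 0.30, 0.20], [0.05, 0.02], [0.25, 0.28, 0.01]]
    print(select_split(toy, b1=5.0, b2=5.0))
```

With a scale of zero both draws are uniform, and as the scale grows the draws concentrate on the candidate with the largest impurity decrease, so the greedy split is recovered in the limit.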
We repeat the above processes to split nodes until the stopping criterion is met. Similar to the standard RF, the MRF’s stopping criterion relates to the minimum leaf size, i.e., every leaf is required to contain at least that many estimation points. The specific training process of trees in MRF is shown in Algorithm 1.
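Once a tree has been grown based on the structure points $\mathcal{D}^S$, we re-determine the predicted values for its leaves according to the estimation points $\mathcal{D}^E$. Similar to Breiman (2001), given an unlabeled sample $x$, we can easily determine which leaf it falls into, and the empirical probability that sample $x$ has label $k$ ($k \in \{1, \dots, K\}$) is estimated to be
$$\eta^{(k)}(x) = \frac{1}{|\mathcal{N}^E(x)|} \sum_{(X, Y) \in \mathcal{N}^E(x)} \mathbb{I}(Y = k),$$
where $\mathcal{N}^E(x)$ is the set of estimation points in the leaf containing $x$, and $\mathbb{I}(\cdot)$ is the indicator function.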
In contrast to the standard RF and consistent RF variants, the predicted label of is randomly selected with a probability proportional to , where is also related to the privacy budget.
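The final prediction of the MRF is the majority vote over all the trees, which is the same as the one used in Breiman (2001):
$$\hat{y} = \arg\max_{k} \sum_{i} \mathbb{I}\left(h^{(i)}(x) = k\right),$$
where the sum runs over all trees in the forest and $h^{(i)}(x)$ denotes the label predicted for $x$ by the $i$-th tree.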
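The prediction stage can be sketched in three small Python functions: estimating class frequencies from the estimation points in a leaf, drawing a tree's label at random, and taking the majority vote over trees. The weight exp(b3 * frequency) used for the random draw is an assumption in the spirit of the exponential mechanism, and `b3`, the function names, and the tie-breaking rule (smallest label wins) are hypothetical choices that the paper does not specify here.

```python
import math
import random
from collections import Counter


def leaf_label_probabilities(leaf_labels, num_classes):
    """Empirical class frequencies among the estimation points of a leaf."""
    n = len(leaf_labels)
    if n == 0:
        raise ValueError("a leaf must contain at least one estimation point")
    counts = Counter(leaf_labels)
    return [counts.get(k, 0) / n for k in range(num_classes)]


def randomized_tree_prediction(leaf_labels, num_classes, b3, rng=random):
    """Draw a label with probability proportional to exp(b3 * frequency)."""
    eta = leaf_label_probabilities(leaf_labels, num_classes)
    top = max(eta)
    # Shifting by the maximum keeps exp() from overflowing for large b3.
    weights = [math.exp(b3 * (p - top)) for p in eta]
    return rng.choices(range(num_classes), weights=weights, k=1)[0]


def forest_prediction(tree_predictions):
    """Majority vote over trees; ties go to the smallest label (assumption)."""
    counts = Counter(tree_predictions)
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)


if __name__ == "__main__":
    leaf = [0, 0, 1, 2]  # labels of the estimation points in one leaf
    votes = [randomized_tree_prediction(leaf, 3, b3=4.0) for _ in range(25)]
    print(leaf_label_probabilities(leaf, 3), forest_prediction(votes))
```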
3.4 The Discussion of Parameter Settings
Firstly, we discuss the parameters settings from the aspect of privacy-preservation by setting , and to ensure that MRF satisfies -differential privacy.
Suppose the number of trees and the depth of each tree are and , respectively. To fulfill the -differential privacy, we can evenly allocate the total privacy budget to each tree, , the privacy budget of each tree is . For each tree, the upper bound of depth is approximately , , . Accordingly, we can directly set and evenly allocate the total privacy budget to each layer, , the privacy budget of each layer is . As such, we can set , and to and to ensure that MRF satisfies -differential privacy. The specific privacy-preservation is theoretically discussed in Section 4.2.
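The even allocation described above amounts to two divisions: the total budget is divided by the number of trees, and each tree's share is divided by the number of layers. A minimal Python sketch follows; the function name and the example values are hypothetical. How each per-layer share is then mapped onto the individual hyper-parameters, and the paper's upper bound on the depth, are not reproduced here.

```python
def allocate_privacy_budget(total_budget, num_trees, depth):
    """Evenly split a total privacy budget across trees, then across layers.

    Returns (budget_per_tree, budget_per_layer).
    """
    if total_budget <= 0 or num_trees < 1 or depth < 1:
        raise ValueError("budget must be positive; trees and depth at least 1")
    per_tree = total_budget / num_trees
    per_layer = per_tree / depth
    return per_tree, per_layer


if __name__ == "__main__":
    # Example values chosen only for illustration.
    per_tree, per_layer = allocate_privacy_budget(1.0, num_trees=100, depth=10)
    print(per_tree, per_layer)
```

The example makes the cost of privacy visible: with many trees and deep trees, the share available to any single random selection becomes very small, which pushes each selection toward a uniform draw.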
From another perspective, the hyper-parameters play a role in regulating the relative probabilities under which the ’best’ candidate is selected. Specifically, the larger , the smaller noise will be added in , and thus the split feature with the largest impurity decrease () has a higher probability of being selected. Similarly, and control the noises added to the split value selection and the label selection, respectively. In addition, if , regardless of the training set partitioning, when , all features and split values have the same probability to be selected, and therefore the MRF would become a completely random forest Liu et al. (2008); when , there is no added noise. In this case, the MRF would become the Breiman’s original RF, whose set of candidate features always contains all the features.
4 Consistency and Privacy-Preservation
In this section, we theoretically analyze the consistency and privacy-preservation of the MRF.
We first describe the definition of consistency and two previously proven lemmas. We then state two new lemmas and the consistency theorem for the MRF.
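Definition 3.
Given the dataset $\mathcal{D}_n$ of $n$ observations, for a certain distribution of $(X, Y)$, a sequence of classifiers $\{g_n\}$ is consistent if the error probability $L$ satisfies
$$\mathbb{E}[L] = \Pr\left(g_n(X, C, \mathcal{D}_n) \neq Y\right) \to L^{*} \quad \text{as } n \to \infty,$$
where $L^{*}$ denotes the Bayes risk and $C$ denotes the randomness involved in the construction of the tree, such as the selection of candidate features.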
Lemma 1.
The voting classifier that takes the majority vote over multiple copies of a classifier with different randomizing variables is consistent if each of those classifiers is consistent.
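Lemma 2.
Consider a partitioning classification rule that builds a prediction by a majority vote in each leaf node. If the labels of the voting data have no effect on the structure of the classification rule, then the expected error probability $\mathbb{E}[L]$ converges to the Bayes risk $L^{*}$ as $n \to \infty$, provided that
the diameter of $\mathcal{N}(X) \to 0$ as $n \to \infty$ in probability,
$|\mathcal{N}^E(X)| \to \infty$ as $n \to \infty$ in probability,
where $\mathcal{N}(X)$ is the leaf containing $X$ and $|\mathcal{N}^E(X)|$ denotes the number of estimation points in $\mathcal{N}(X)$.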
Lemma 1 Biau et al. (2008) states that the consistency of individual trees leads to the consistency of the forest. Lemma 2 Devroye et al. (2013) implies that the consistency of a tree can be ensured if, as the sample size grows to infinity, every hypercube at a leaf is sufficiently small but still contains an infinite number of estimation points.
Sketch Proof of the Consistency
In general, the proof of consistency has three main steps: (1) each feature has a non-zero probability of being selected, (2) each split reduces the expected size of the split feature, and (3) the split process can go on indefinitely. We first state two lemmas for steps (1) and (2), respectively, and then the consistency theorem of the MRF. All omitted proofs are given in the Appendix.
In the MRF, the probability that any given feature is selected to split at each node has lower bound .
Suppose that features are all supported on . In the MRF, once a split feature is selected, if this feature is divided into equal partitions from small to large (, ), for any split point ,
Lemma 3 states that the MRF fulfills the first aforementioned requirement. Lemma 4 states that the second condition is also met, by showing that with large probability the selected split value is not near either endpoint of the feature interval.
Suppose that is supported on and have non-zero density almost everywhere, the cumulative distribution function of the split points is right-continuous at 0 and left-continuous at 1. If , MRF is consistent when and as .
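In this part, we prove that the MRF satisfies $\epsilon$-differential privacy based on two composition properties McSherry (2010). Suppose we have a set of privacy mechanisms $\{\mathcal{M}_1, \dots, \mathcal{M}_m\}$ and each $\mathcal{M}_i$ provides an $\epsilon_i$-differential privacy guarantee; then the sequential composition and parallel composition are described as follows:
Property 1 (Sequential Composition).
Suppose $\mathcal{M}_1, \dots, \mathcal{M}_m$ are sequentially performed on a dataset $\mathcal{D}$; then the combined mechanism will provide $\left(\sum_{i=1}^{m} \epsilon_i\right)$-differential privacy.
Property 2 (Parallel Composition).
Suppose $\mathcal{M}_1, \dots, \mathcal{M}_m$ are performed on disjoint subsets of the entire dataset, i.e., $\mathcal{D}_1, \dots, \mathcal{D}_m$, respectively; then the combined mechanism will provide $\max\{\epsilon_1, \dots, \epsilon_m\}$-differential privacy.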
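The two composition properties translate directly into code: sequential composition adds the individual budgets, and parallel composition takes their maximum. The Python sketch below also composes a whole forest under the assumption, made here only for illustration, that the nodes within one layer of a tree operate on disjoint subsets of the data (parallel composition), while layers and trees reuse the same data (sequential composition). All names are hypothetical.

```python
def sequential_composition(budgets):
    """Mechanisms applied one after another to the same data: budgets add."""
    return sum(budgets)


def parallel_composition(budgets):
    """Mechanisms applied to disjoint subsets: the largest budget dominates."""
    return max(budgets)


def forest_budget(per_layer_budget, depth, num_trees, nodes_per_layer=4):
    """Total budget of a forest under the assumptions stated above.

    nodes_per_layer is illustrative; it does not change the result because
    parallel composition over equal budgets returns that same budget.
    """
    layer = parallel_composition([per_layer_budget] * nodes_per_layer)
    tree = sequential_composition([layer] * depth)
    return sequential_composition([tree] * num_trees)


if __name__ == "__main__":
    print(sequential_composition([0.1, 0.2, 0.3]))
    print(parallel_composition([0.1, 0.2, 0.3]))
    print(forest_budget(0.001, depth=10, num_trees=100))
```

Under these assumptions, a per-layer budget of one thousandth over ten layers and one hundred trees composes to a total budget of approximately one.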
The impurity-based multinomial distribution of feature selection is essentially the exponential mechanism of differential privacy, and satisfies -differential privacy.
The impurity-based multinomial distribution of split value selection is essentially the exponential mechanism of differential privacy, and satisfies -differential privacy.
The label selection of each leaf in a tree satisfies -differential privacy.
Based on the above properties, lemmas and the parameter settings in Section 3.4, we can obtain the following theorem:
The proposed MRF satisfies -differential privacy when the hyper-parameters , and satisfy and , where is the number of trees and is the depth of a tree such that .
The omitted proof is shown in the Appendix.
Dataset Selection. We conduct experiments on multiple datasets used in previous work on consistent RF Denil et al. (2014); Wang et al. (2017); Haghiri et al. (2018). Although all datasets are from the UCI repository Asuncion and Newman (2007), they cover a wide range of sample sizes and feature dimensions, and are therefore representative for evaluating the performance of different algorithms. A description of the datasets used is given in Table 1.
Baselines. We select Denil14 Denil et al. (2014), BRF Wang et al. (2017) and CompRF Haghiri et al. (2018) as the baseline methods in the following evaluations. These methods are the state-of-the-art consistent random forest variants. Specifically, we evaluate the two CompRF variants proposed in Haghiri et al. (2018): consistent CompRF (CompRF-C) and inconsistent CompRF (CompRF-I). In addition, we provide the results of the standard RF (Breiman) Breiman (2001) as another important baseline for comparison.
Training Setup. We carry out 10 times 10-fold cross validation to generate 100 forests for each method. All forests have trees, minimum leaf size . Gini index is used as the impurity measure except for CompRF. In Denil14, BRF, CompRF, and RF, we set the size of the set of candidate features . The partition rate of all consistent RF variants is set to . All settings stated above are based on those used in Denil et al. (2014); Wang et al. (2017). In MRF, we set and in all datasets, and the hyper-parameters of baseline methods are set according to their paper.
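The evaluation protocol of 10 repetitions of 10-fold cross-validation yields 100 train/test splits, and therefore 100 forests per method. A minimal Python sketch of generating such splits is given below; the function name, the fixed seed, and the stride-based fold assignment are assumptions, and stratification by class is not addressed.

```python
import random


def repeated_k_fold(num_samples, num_folds=10, num_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold validation."""
    if num_folds < 2 or num_samples < num_folds:
        raise ValueError("need at least 2 folds and one sample per fold")
    rng = random.Random(seed)
    for _ in range(num_repeats):
        indices = list(range(num_samples))
        rng.shuffle(indices)  # a fresh random ordering for every repetition
        # Taking every num_folds-th index gives folds of near-equal size.
        folds = [indices[f::num_folds] for f in range(num_folds)]
        for f in range(num_folds):
            test = folds[f]
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield train, test


if __name__ == "__main__":
    splits = list(repeated_k_fold(num_samples=150))
    print(len(splits))  # one forest would be trained per split
```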
5.2 Performance Analysis
Table 2 shows the average test accuracy. Among the four consistent RF variants, the one with the highest accuracy is indicated in boldface. In addition, we carry out Wilcoxon’s signed-rank test Demšar (2006) to test for the difference between the results from the MRF and the standard RF at significance level 0.05. Those for which the MRF is significantly better than the standard RF are marked with ””. Conversely, those for which RF is significantly better are marked with ””. Moreover, the last line shows the average rank of different methods across all datasets.
As shown in Table 2, MRF significantly exceeds all existing consistent RF variants. For example, MRF achieves more than improvement in most cases, compared with the current state-of-the-art method. Besides, the performance of the MRF even surpasses Breiman’s original random forest in twelve of the datasets, and the advantage of the MRF is statistically significant in ten of them. To the best of our knowledge, this has never been achieved by any other consistent random forest methods. Note that we have not fine-tuned the hyper-parameters such as , and . The performance of the MRF might be further improved with the tuning of these parameters, which would bring additional computational complexity.
5.3 The Effect of Hyper-parameters
In this part, we evaluate the performance of the consistent MRF under different hyper-parameters and . Specifically, we consider a range of for both and , and other hyper-parameters are the same as those stated in Section 5.1. Besides, the performance of each tree in MFR with respect to the privacy budget is shown in the Appendix.
Figure 4 displays the results for six datasets representing small, medium and large datasets. It shows that the performance of the MRF is significantly improved as increases from zero, and it further becomes relatively stable when . Similarly, the performance also improves as increases from zero, but the effect is not obvious. When is too small, the resulting multinomial distributions would allow too much randomness, leading to the poor performance of the MRF. Besides, as shown in the figure, although the optimal values of and may depend on the specific characteristics of a dataset, such as the outcome scale and the dimension of the impurity decrease vector, at our default setting (), the MRF achieves competitive performance in all datasets.
5.4 Performance on Individual Trees
We further investigated why the MRF achieves such good performance by studying the performance of MRF on individual trees. In our 10 times 10-fold cross-validation, 10,000 trees were generated for each method. We compared the distribution of prediction accuracy over those trees between the MRF and the standard RF. Figure 5 displays the distributions for six datasets.
The tree-level performance of the MRF is generally better than that of the standard RF, which verifies the superiority of the multinomial-based random split process. However, we also have to notice that a good performance over individual trees does not necessarily lead to a good performance of a forest, since the performance of a forest may also be affected by the diversity of the trees. For example, the MRF has a significantly better tree-level performance on the CAR dataset, whereas its forest-level performance is not significantly different from the standard RF.
Although we have not been able to make a direct connection between the overall performance and the performance on individual trees, understanding the complexity of this relationship is still meaningful. The specific connection will be explored in future work.
In this paper, we propose a new random forest framework, dubbed multinomial random forest (MRF), and analyze its consistency and privacy-preservation. In the MRF, we propose two impurity-based multinomial distributions for the selection of the split feature and the split value. Accordingly, the best split point has the highest probability of being chosen, while other candidates that are nearly as good as the best one also have a good chance of being selected. This split process is more reasonable than the greedy split criterion used in existing methods. Besides, we also introduce the exponential mechanism of differential privacy for selecting the label of a leaf, which allows us to discuss the privacy-preservation of MRF. Experiments and comparisons demonstrate that the MRF remarkably surpasses existing consistent random forest variants, and that its performance is on par with Breiman’s random forest. To the best of our knowledge, it is the first random forest variant that is consistent and has comparable performance to the standard random forest.
References
- Asuncion and Newman (2007). UCI machine learning repository.
- Athey et al. (2019). Generalized random forests. The Annals of Statistics 47 (2), pp. 1148–1178.
- Biau et al. (2008). Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research 9 (Sep), pp. 2015–2033.
- Biau (2012). Analysis of a random forests model. Journal of Machine Learning Research 13 (Apr), pp. 1063–1095.
- Bifet et al. (2009). New ensemble methods for evolving data streams. In SIGKDD.
- Breiman (2001). Random forests. Machine Learning 45 (1), pp. 5–32.
- Breiman (2004). Consistency for a simple model of random forests. Technical Report 670, Statistical Department, University of California at Berkeley.
- Breiman (2017). Classification and regression trees. Routledge.
- Cootes et al. (2012). Robust and accurate shape model fitting using random forest regression voting. In ECCV.
- Demšar (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7 (Jan), pp. 1–30.
- Denil et al. (2014). Narrowing the gap: random forests in theory and in practice. In ICML.
- Denil et al. (2013). Consistency of online random forests. In ICML.
- Devroye et al. (2013). A probabilistic theory of pattern recognition. Vol. 31, Springer Science & Business Media.
- Dietterich (2000). An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning 40 (2), pp. 139–157.
- Dwork et al. (2006). Calibrating noise to sensitivity in private data analysis. In TCC.
- Dwork (2006). Differential privacy. In ICALP.
- Genuer (2012). Variance reduction in purely random forests. Journal of Nonparametric Statistics 24 (3), pp. 543–562.
- Haghiri et al. (2018). Comparison-based random forests. In ICML.
- Ho (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (8), pp. 832–844.
- Ishwaran and Kogalur (2010). Consistency of random survival forests. Statistics & Probability Letters 80 (13-14), pp. 1056–1064.
- Kontschieder et al. (2015). Deep neural decision forests. In ICCV.
- Lin and Jeon (2006). Random forests and adaptive nearest neighbors. Journal of the American Statistical Association 101 (474), pp. 578–590.
- Liu et al. (2008). Spectrum of variable-random trees. Journal of Artificial Intelligence Research 32, pp. 355–384.
- McSherry and Talwar (2007). Mechanism design via differential privacy. In FOCS.
- McSherry (2010). Privacy integrated queries: an extensible platform for privacy-preserving data analysis. Communications of the ACM 53 (9), pp. 89–97.
- Meinshausen (2006). Quantile regression forests. Journal of Machine Learning Research 7 (Jun), pp. 983–999.
- Mohammed et al. (2011). Differentially private data release for data mining. In SIGKDD.
- Patil and Singh (2014). Differential private random forest. In ICACCI.
- Wang et al. (2017). A novel consistent random forest framework: Bernoulli random forests. IEEE Transactions on Neural Networks and Learning Systems 29 (8), pp. 3510–3523.
- Xiong et al. (2012). Random forests for metric learning with implicit pairwise position dependence. In SIGKDD.
- Zhou and Feng (2017). Deep forest: towards an alternative to deep neural networks. In AAAI.