Exact Distributed Training: Random Forest with Billions of Examples

Abstract

We introduce an exact distributed algorithm to train Random Forest models, as well as other decision forest models, without relying on approximate best-split search. We explain the proposed algorithm and compare it to related approaches on various complexity measures (time, RAM, disk, and network). We report its running performance on artificial and real-world datasets of up to 18 billion examples, several orders of magnitude larger than the datasets tackled in the existing literature. Finally, we empirically show that Random Forest benefits from being trained on more data, even in the case of already gigantic datasets. Given a dataset with 17.3B examples and 82 features (3 numerical, the others categorical with high arity), our implementation trains a tree in 22h.

Keywords: Machine Learning, Random Forest, Big Data, Distributed Algorithms

Mathieu Guillame-Bert and Olivier Teytaud

Google, Zurich, Switzerland

{gbm,oteytaud}@google.com


1 Introduction

Two families of approaches have been studied, and sometimes combined, to tackle the problem of training Decision Trees (DT) and Decision Forests (DF) on large datasets: (i) approximating the building of the tree by using a subset of the dataset and/or approximating the computation of the optimal splits with a cheaper or more easily distributable computation, and (ii) using different but exact algorithms (building the same models) that allow distributing the dataset and the computation. Various works (Chan & Stolfo, 1993; Mehta et al., 1996) have shown that (i) typically leads to bigger forests and lower precision. We focus on the latter family of approaches: we propose a distributed method which is exactly equivalent to the original DT algorithm. We compare our work to the two existing methods that fall in this category: Sprint (Shafer et al., 1996) and distributed versions of Sliq (Mehta et al., 1996). Our proposed method, inspired by Sliq, aims to achieve: (1) removal of the random access memory requirement; (2) distributed training (distribution even of a single tree); (3) distribution of the training dataset (i.e. no worker requires access to the entire dataset); (4) a minimal number of passes in terms of disk reading/writing and network communication; (5) distributed computation of feature importance. While this paper mainly focuses on Random Forests, the proposed algorithm can be applied to other DF models, notably Gradient Boosted Trees (Ye et al., 2009). Our contributions in this work are as follows: (1) a distributed and exact implementation of Random Forest able to train on datasets larger than in any such past work; (2) a theoretical and numerical complexity comparison (CPU, RAM, network, disk access, disk reading, disk writing) of Sliq, Sprint, RF and our distributed version of Random Forest.

2 Distributed Random Forest

In this section, we describe the proposed Distributed Random Forest algorithm (DRF). The structure of this algorithm differs from the classical recursive Random Forest algorithm; nonetheless, like Sliq (Mehta et al., 1996) and Sprint (Shafer et al., 1996), it is guaranteed to produce the same model as RF. DRF computation is distributed among computing units called “workers” and coordinated by a “manager”. The manager and the workers communicate through a network. DRF is relatively insensitive to communication latency (Section 3). DRF also distributes the dataset between workers: each worker is assigned a subset of the columns of the dataset. Each worker only needs to read its assigned part of the dataset sequentially, i.e. no random access and no writing are needed. Workers can be configured to load the dataset in memory, or to access the dataset on drive/through network access. Finally, each worker can host a certain number of threads (details of multithreading in the supplementary material (SM)). Several types of workers are responsible for different operations. The splitter workers look for optimal candidate splits; each splitter has access to a subset of dataset columns. The tree builder workers hold the structure of one DT being trained (one DT per tree builder) and coordinate the work of the splitters; tree builders do not have access to the dataset. One tree builder can control several splitters, and one splitter can be controlled by several tree builders. The manager manages the tree builders and is responsible for the fully trained trees; it does not have access to the dataset. Like Sliq, and unlike the generic DT learning algorithm, DRF builds DTs “depth level by depth level”, i.e. all the nodes at a given depth are split together. The training of a single tree is distributed among the workers. Additionally, as the trees of a Random Forest are independent, DRF trains all the trees in parallel. DRF can also be used to train co-dependent sets of trees (e.g. Boosted Decision Trees); in this case, while trees cannot be trained in parallel, the training of each individual tree is still distributed.
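For concreteness, here is a minimal sketch (in Python; all names are of our own choosing, the production implementation is not described at this level of detail in the paper) of the kind of messages a tree builder and its splitters could exchange:

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional

# Illustrative message types for the manager / tree builder / splitter protocol
# described above; the names are ours, not those of the actual system.

@dataclass
class Split:
    column: str
    threshold: Optional[float] = None          # numerical condition: value >= threshold
    categories: Optional[FrozenSet] = None     # categorical condition: value in categories
    score: float = 0.0

@dataclass
class SupersplitRequest:                       # tree builder -> splitter
    tree_index: int                            # also used for seeding (Section 2.2)
    open_leaves: List[int]

@dataclass
class PartialSupersplit:                       # splitter -> tree builder: best split per
    splits: Dict[int, Split] = field(default_factory=dict)  # open leaf, over its own columns
```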

2.1 Dataset Preparation

Consistent with existing work (Mehta et al., 1996; Shafer et al., 1996), we use presorting for numerical attributes. In the present work we only consider categorical and numerical attributes. The most expensive operation when preparing the dataset is the sorting of the numerical attributes; for large datasets, this operation is done using external sorting. In this phase, the manager distributes the dataset among the splitters: each splitter is assigned a subset of the dataset columns. When several DTs are trained in parallel (e.g. RF), DRF benefits from replicated workers, i.e. several workers own the same part of the dataset and are able to perform the same computation.
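As a toy illustration of this preparation step (our own simplification; the real system resorts to external sorting when a column does not fit in memory):

```python
import numpy as np

def assign_columns(num_columns: int, num_splitters: int) -> dict:
    """One possible (round-robin) assignment of dataset columns to splitters."""
    return {c: c % num_splitters for c in range(num_columns)}

def presort_numerical_column(values: np.ndarray):
    """Presorting of one numerical column: sorted values together with the original
    sample indices, the layout consumed later by the split search (Alg. 1)."""
    order = np.argsort(values, kind="stable")
    return values[order], order
```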

2.2 Seeding

RF “bags” the samples used to build each tree, i.e. it samples, with replacement, as many records as the dataset contains. Instead of sending indices over the network, DRF uses a deterministic pseudo-random generator so that all workers agree on the set of bagged examples without network communication. With this method, all workers are aware of the selected samples, without the cost of transmitting or storing this information.
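A minimal sketch of such seeded bagging (illustrative only; the paper does not specify the exact generator or seeding scheme):

```python
import numpy as np

def bag_counts(tree_index: int, num_samples: int, base_seed: int = 1234) -> np.ndarray:
    """Number of times each sample is selected when bagging num_samples records
    with replacement. Every worker calling this with the same tree_index and
    base_seed obtains exactly the same counts, so no list of indices has to be
    transmitted or stored."""
    rng = np.random.default_rng((base_seed, tree_index))
    draws = rng.integers(0, num_samples, size=num_samples)   # sampling with replacement
    return np.bincount(draws, minlength=num_samples)
```

For billions of records the counts would of course be generated chunk by chunk rather than materialized in full, but the principle (a shared seed instead of shared indices) is the same.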

2.3 Mapping Sample Indices to Node Indices

At any point during training, each bagged sample is attached to a single leaf, initially the root node. When a leaf is split into two children, each sample of this node is re-assigned to one of its child nodes according to the result of the node condition (condition = chosen split). As in Sliq (Mehta et al., 1996), DRF splitters and tree builders need to represent the mapping from a sample index to a leaf node. DRF monitors the number $a$ of active leaves (i.e. the number of leaf nodes which can be further split). Therefore, $\lceil \log_2 a \rceil$ bits of information are needed to index a leaf. If there is at least one non-active leaf, $\lceil \log_2 (a+1) \rceil$ bits are needed, in order to also encode the case of a sample being in a closed leaf. Therefore, this mapping requires $n \lceil \log_2 (a+1) \rceil$ bits of memory to store in which leaf each of the $n$ samples is. Depending on the size of the dataset, this mapping can either be stored entirely in memory, or it can be distributed among several chunks such that only one chunk is in memory at any time. Unlike Sliq (Mehta et al., 1996), DRF does not store the label values in memory.
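For example (our numbers, purely illustrative), the width of each mapping entry can be computed as follows:

```python
import math

def bits_per_mapping_entry(num_active_leaves: int, has_closed_leaf: bool) -> int:
    """Number of bits needed to store, for one sample, the index of its leaf
    (plus one extra symbol for 'in a closed leaf' when such leaves exist)."""
    symbols = num_active_leaves + (1 if has_closed_leaf else 0)
    return max(1, math.ceil(math.log2(symbols)))

# 1000 active leaves plus the 'closed' marker fit in 10 bits per sample,
# far less than a 64-bit node identifier per sample.
print(bits_per_mapping_entry(1000, True))   # -> 10
```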

2.4 Finding the Best Split

During training, each splitter searches for the optimal split among the candidate attributes it owns. The global optimal split is the best (e.g. according to information gain or Gini index) of the splitters' optimal splits. A split is defined as a column index and a condition over the values of this column. For numerical columns, the condition is of the form $x \ge t$ with $t \in \mathbb{R}$. For categorical columns, the condition is of the form $x \in S$ with $S \subseteq V$ and $V$ the support of the column. In case of attribute sampling (e.g. RF), only a random subset of attributes is considered. We call a supersplit a set of splits mapped one-to-one with the open leaves at a given depth of a tree. The following subsections present how DRF computes the optimal splits for all the nodes at a given depth, i.e. the optimal supersplit at a given depth, in a single pass per feature. Computing optimal splits on categorical attributes is easily parallelized, whereas computing optimal splits on numerical attributes requires presorting. Details are given in the SM for all cases, and Alg. 1 presents the algorithm for finding the optimal splits of a given numerical feature for all nodes of a given depth in one pass.

    $h_l$ will be the histogram of the already traversed labels for leaf $l$ (initially empty).
    $t_l$ is the last tested threshold (initially null) for leaf $l$.
    $A$ is the list of records sorted according to the attribute, i.e. $A$ is a list of tuples $(v, y, j)$, sorted in increasing order of $v$, where $v$ is the numerical attribute value, $y$ is the label value, and $j$ is the sample index.
    $b_l$ will be the best threshold for leaf $l$ (initially null).
    $s_l$ will be the score of $b_l$ (initially 0).
    for all $(v, y, j)$ in $A$ do
        $l \leftarrow$ the leaf currently containing sample $j$ (sample-to-node mapping)
        if $l$ is a closed node then continue
        if sample $j$ is not selected by the bagging then continue
        $w \leftarrow$ number of times sample $j$ is selected by the bagging
        if $v \neq t_l$ then
            $s \leftarrow$ the score of the split “attribute $< v$” for leaf $l$ (computed using $h_l$ and the total label histogram of $l$)
            if $s > s_l$ then
                $s_l \leftarrow s$
                $b_l \leftarrow v$
            end if
            $t_l \leftarrow v$
        end if
        Add label $y$ weighted by $w$ to $h_l$
    end for
    return $b_l$ and $s_l$ for every open leaf $l$
Algorithm 1: Find the best supersplit for a given numerical attribute and a given tree, i.e. the best threshold for every open leaf of the current depth, in a single pass over the presorted attribute.
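For readers who prefer code, here is a minimal single-machine sketch of the one-pass scan of Alg. 1 (Python, Gini-based scoring; function and argument names are ours, and the real splitters operate on presorted per-worker columns with the forest's actual split score):

```python
from collections import Counter

def gini_impurity(counts, total):
    """Gini impurity of a label histogram containing `total` weighted records."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def best_numerical_supersplit(sorted_column, leaf_of, bag_count, leaf_totals):
    """One pass over one presorted numerical column (cf. Alg. 1).

    sorted_column: iterable of (value, label, sample_index), sorted by value.
    leaf_of(sample_index): leaf id, or None if the sample is in a closed leaf.
    bag_count(sample_index): number of times the sample was selected by bagging.
    leaf_totals: dict leaf id -> Counter of the labels of all its bagged samples.
    Returns a dict leaf id -> (best_threshold, best_score) for the condition
    "value < best_threshold", scored by Gini decrease.
    """
    hist = {}        # h_l: labels already traversed, per leaf
    last_value = {}  # t_l: last tested threshold, per leaf
    best = {}        # (b_l, s_l): best threshold and its score, per leaf

    for value, label, idx in sorted_column:
        leaf = leaf_of(idx)
        if leaf is None:           # closed leaf: nothing to do
            continue
        weight = bag_count(idx)
        if weight == 0:            # sample not selected by the bagging
            continue
        h = hist.setdefault(leaf, Counter())
        if last_value.get(leaf) != value:
            total = leaf_totals[leaf]
            n_total = sum(total.values())
            n_left = sum(h.values())
            right = total - h      # labels with value >= current threshold
            parent = gini_impurity(total, n_total)
            children = (n_left * gini_impurity(h, n_left)
                        + (n_total - n_left) * gini_impurity(right, n_total - n_left)
                        ) / n_total
            score = parent - children
            if score > best.get(leaf, (None, 0.0))[1]:
                best[leaf] = (value, score)
            last_value[leaf] = value
        h[label] += weight         # add the traversed label, weighted by bagging
    return best
```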
1:   Create a decision tree with only a root. Initially, the root is the only open leaf.
2:   Initialize the mapping from sample index to node index so that all samples are assigned to the root.
3:   Query the splitters for the optimal supersplit. Each splitter returns a partial optimal supersplit computed only from the columns it has access to (using Alg. 1 in the case of numerical splits). The (global) optimal supersplit is chosen by the tree builder by comparing the answers of the splitters.
4:   Update the tree structure with the optimal supersplit.
5:   Query the splitters for the evaluation of all the conditions in the best supersplit. Each splitter only evaluates the conditions it has found (if any). Each splitter sends the results to the tree builder as a dense bitmap. In total, all the splitters are sending one bit of information for each sample selected at least once in the bagging and still in an open leaf.
6:   Compute the number of active leaves and update the mapping from sample index to node index.
7:   Broadcast the evaluation of conditions to all the splitters so they can also update their sample index to node index mapping.
8:   Close leaves with not enough records or no good conditions.
9:   If at least one leaf remains open, go to step 3.
10:   Send the DT to the manager.
Algorithm 2: Tree builder algorithm for DRF.
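Step 5 of Alg. 2 sends one bit per relevant sample; a sketch of that dense bitmap encoding (with NumPy, names ours) could look like this:

```python
import numpy as np

def pack_condition_evaluations(goes_to_positive_child: np.ndarray) -> bytes:
    """Splitter side of step 5: encode, as a dense bitmap, the outcome of the
    chosen condition for every bagged sample still in an open leaf."""
    return np.packbits(goes_to_positive_child.astype(np.uint8)).tobytes()

def unpack_condition_evaluations(payload: bytes, num_samples: int) -> np.ndarray:
    """Receiver side (tree builder, then the other splitters after the broadcast
    of step 7): recover the boolean vector used to update the
    sample-index -> node-index mapping."""
    bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8), count=num_samples)
    return bits.astype(bool)
```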

2.5 Training a Random Forest

Each decision tree is built by a tree builder with Alg. 2. To train a Random Forest, the manager queries the tree builders in parallel. Each query contains the index of the requested tree (the tree index is used in the seeding, Section 2.2) as well as a list of splitters such that each column of the dataset is owned by at least one splitter. The tree builder's answer is the trained decision tree.
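Schematically (a sketch only; `TreeBuilder.build_tree` is a hypothetical interface, and in the real system the manager and the tree builders are separate processes communicating over the network):

```python
from concurrent.futures import ThreadPoolExecutor

def train_random_forest(tree_builders, splitters, num_trees):
    """Manager side: request every tree in parallel. Each request carries the
    tree index (used for seeding, Section 2.2) and a set of splitters covering
    all columns of the dataset; the reply is the trained decision tree."""
    with ThreadPoolExecutor(max_workers=len(tree_builders)) as pool:
        futures = [
            pool.submit(tree_builders[t % len(tree_builders)].build_tree,
                        tree_index=t, splitters=splitters)
            for t in range(num_trees)
        ]
        return [f.result() for f in futures]
```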

3 Complexity Analysis

We present and compare the theoretical complexities (memory, parallel time, I/O and network) of generic DT, generic RF, DRF, Sprint, Sliq, Sliq/R and Sliq/D. All these algorithms operate differently, and benefit from different situations in terms of time complexity. Sprint prunes records in closed leaves: a tree with a large number of records in shallow closed leaves is fast to train. However, Sprint continuously scans and writes both the candidate and non-candidate features, i.e. Sprint does not benefit from the small size of the set of candidate features. Compared to Sprint, DRF benefits from records being in closed leaves differently: records in closed leaves are not pruned, but since Sliq and DRF only scan candidate features (i.e. features randomly chosen and not closed in earlier conditions), a smaller number of records leads to a smaller number of candidate features. Although our experiments focus on the classical case of features randomly drawn at each node, we point out that Sliq and DRF benefit greatly (by a factor proportional to the number of features) from limiting the number of unique candidate features at a given depth. In particular, the trend (see Section 3.2) of using the same set of features for all nodes at a given depth leads to a fast DRF with a number of machines proportional to the number of randomly drawn features instead of the total number of features. We also study the impact of equipping DRF with a mechanism to prune records similarly to Sprint: when DRF detects that this pruning becomes beneficial, the algorithm can prune the records in closed leaves. This operation is not triggered during the experimentation on the large dataset reported in Section 4.

3.1 Distributed Forests: Complexity Overview

The present section compares the complexities of variants of distributed random forests; the formalization of our results is in Section 3.2. We mention Sprint (Shafer et al., 1996) and Sliq (Mehta et al., 1996), but also Sliq/D and Sliq/R, which were proposed in (Shafer et al., 1996) as baselines for comparisons with Sprint. Sliq/D and Sliq/R are not designed by the authors of Sliq; DRF is another solution (as opposed to Sliq/R and Sliq/D) for distributing Sliq. We use $w$ workers (with distinct memories) for parallelizing the analysis of $f$ features per node for a given depth, these features being randomly drawn (as usual in RF) out of the $m$ features of the dataset. In RF, the best split is chosen among the splits of these $f$ features. There are variants of RF for which these features are the same for each node at a given depth; this is in particular the case in the implementation Xgboost (He, 2015) (which covers both GBTs and RFs). This has a big impact on the complexity in the distributed case, as discussed in Section 3.2. Although our theoretical analysis and preliminary results suggest a clear improvement with such a variant, for the sake of an exact comparison with the most classical variant of random forest, we present experiments in the original case. In the mathematical analysis, we use a variable $k$ counting the number of distinct subsets of features drawn at a given depth; $k = 1$ corresponds to the case in which all nodes in a given depth level of the tree use the same set of features, whereas $k$ equals the total number of open nodes of the current level in the case of independently drawn subsets of features. We call $n$ the number of samples in the dataset, $m$ the number of attributes in the dataset, $q$ the minimum number of records in a leaf, $d$ the user-chosen maximum depth of a tree, $\hat d$ the effective depth of the tree, i.e. the depth of the deepest leaf, and $\bar d$ the average depth of the leaves (weighted by the number of training samples). By definition $\bar d \le \hat d \le d$; $d$ is chosen by the user while $\hat d$ and $\bar d$ are known only after the tree is built, and equality means that the tree is perfectly balanced and the user limit is reached. We denote by $w$ the number of workers, $T$ the number of trees to train, $p = \lceil m/w \rceil$ the maximum number of attributes owned by a worker when there is no redundancy (we might have $r \lceil m/w \rceil$ features on a same worker in the case of $r$-redundancies), $n_j$ the number of training samples that reach node $j$, and $N$ the total number of nodes of a tree.

Data structures: comparing Sliq, Sprint and DRF. Distributed versions of RF require storing (i) features and label values (i.e. the initial data, possibly after preprocessing such as distributing and presorting) and (ii) class lists that map training examples to open nodes. For storing the data, Sprint and Sliq/D divide the data per rows (e.g. one shard per worker); Sliq/R and DRF divide the data per feature (i.e. a subset of features per worker). For storing the class lists, Sliq/R and DRF duplicate the class list in each worker. To do so, DRF requires $\lceil \log_2(a+1) \rceil$ bits per training example, where $a$ is the number of open leaves (Section 2.3). In practice this figure is significantly smaller than using a full integer (i.e. 64 bits). Sliq/D and Sprint store a class list restricted to open nodes only. This saves space and computation in later stages of the tree building, in particular when a large part of the records are already in (closed) leaves. On the other hand, the class list contains the label (and possibly the weight) for each record, which is expensive, and we need many passes of writing, namely one for each level of the tree. Sprint stores the class list as a distributed hashmap.
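As an order-of-magnitude illustration (our numbers, not the paper's): with $a = 10^6$ open leaves and $n = 10^{10}$ training examples,
\[
\lceil \log_2(a+1) \rceil = 20 \ \text{bits per example}, \qquad
\frac{10^{10} \times 20\ \text{bits}}{8 \times 10^{9}\ \text{bits/GB}} = 25\ \text{GB (spread across the workers)},
\]
versus $10^{10} \times 64$ bits $= 80$ GB if a full 64-bit integer were stored per example.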

Communications in Sprint, in Sliq and in DRF. In Sliq/D, the class list is distributed over workers; this forces workers to query each other continuously, once per example. Sprint requires communications for updating the distributed hashmap, once per node. In DRF, for each supersplit (i.e. for each level of the tree), one bit of data is broadcast for each training example in an open leaf. DRF and Sliq work with two passes per depth level, whereas Sprint, Sliq/D and Sliq/R work at the level of nodes. In addition, for computing a random forest (compared to only a tree), we need bagging; instead of sending massive lists of record indices over the network, DRF just sends the seed of the randomized sampling (Section 2.2).
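For a rough sense of scale (again illustrative figures of our own): with $n = 10^{10}$ bagged examples, all of them still in open leaves, and a tree of depth $d = 20$, the per-tree broadcast volume is about
\[
n \times d = 10^{10} \times 20 = 2 \times 10^{11}\ \text{bits} \approx 25\ \text{GB per tree},
\]
independently of the number of features, since DRF sends one bit per example per depth level.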

Computations and passes over the data in Sprint, in Sliq and in DRF. For deciding the split, Sprint, for categorical attributes, builds count tables “attribute value $\times$ class $\rightarrow$ number of records”. For numerical attributes, it incrementally computes the quality of splits (e.g. information gain), thanks to one histogram per class computed incrementally for each candidate threshold (i.e. each unique numerical attribute value) in order, thanks to the presorted attributes; the histogram stores the number of individuals below the considered threshold, for each class. Sliq does the same, but for all open nodes of a given depth before broadcasting. For Sliq/D, based on shards, a large part of the cost is due to communications for combining histograms obtained on different shards. For adapting the data to the splits, Sprint splits the attribute list for the chosen feature, collects the split information in a hashmap, and broadcasts this hashmap; Sliq updates the mapping row id $\rightarrow$ node. DRF can be seen as an alternative solution for distributing Sliq (compared to Sliq/D and Sliq/R); as mentioned above, thanks to the same distribution of the data as in Sliq/R (per feature), we do not have to write anything regarding features (except during the initialization); features are simply read in one pass per level of the tree (and not per node of the tree!); and we broadcast one bit per record for updating the class lists, which are themselves stored with cost logarithmic in the number of open leaves.
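A minimal sketch of the count table used for categorical attributes (our own simplification):

```python
from collections import Counter

def categorical_count_table(column, labels, weights):
    """Weighted count table 'attribute value x class -> number of records'; from
    this single pass, the score of every candidate categorical split can be
    derived without touching the data again."""
    table = Counter()
    for value, label, weight in zip(column, labels, weights):
        table[(value, label)] += weight
    return table
```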

3.2 Complexity Analysis: Formalization

$\phi$: a critical quantity for the performance of DRF. We distribute the features uniformly over the workers (i.e. at most $\lceil m/w \rceil$ features per worker), though redundancies could be considered as detailed later. At each node, we randomly draw $f$ features for which optimal splits are computed; the total number of drawn features at a given depth is at most $fk$, with $k$ the number of independent subsets of features drawn at that depth. Except in the USB case detailed below, $k$ is the number of open nodes for the current depth. The computational cost associated with computing splits for a node is therefore, for a worker, proportional to the number of drawn features attributed to this worker. We define $\phi$ as the maximum, over workers, of the number of drawn features attributed to a given worker at a given depth. $\phi$ depends on whether we apply USB, on the number $f$ of features used in each node, on how many (and possibly which) workers have access to each feature (i.e. redundancies), and on the number $w$ of workers. Several questions naturally arise in the analysis of $\phi$.

Random forest with unique set of bagged features per depth (USB). Importantly, the number $k$ of independently randomly drawn subsets of features has a big impact on $\phi$ and therefore on the overall complexity. We might consider a variant of RF in which all nodes at a given depth consider a same subset of features, in which case $k = 1$ independently of the number of nodes. This feature, referred to as USB in the present document, was already explored by (He, 2015).

Can we have a small $\phi$ at a given depth without having a narrow tree or USB? Equivalently, we check whether, without USB, the features selected over the different open nodes in a single supersplit can have sufficient redundancies for reducing the total number of distinct selected features well below $fk$. Lemmas in the SM show that there is no hope: up to a constant factor, the number of distinct selected features is as large as possible (see the SM for the precise statement).

Correctness of the approximation “nearly the same number of features are drawn on each worker”. If we have full redundancies (all features stored on all workers and optimal allocation of tasks to workers), or if $fk$ is much larger than $w$, then the maximum number of features to be tackled on a given worker, for a given depth level, is clearly of order $fk/w$. Is there a risk, in the case of no redundancies, that the number of selected features is very unbalanced, so that one worker takes much more time than the others? Essentially, Lemmas in the SM, based on VC-type inequalities for independent sampling (Lafferty et al., 2010) in the independent sampling case, and on VC-type inequalities for rejective sampling (Clémençon et al., 2016) in the non-independent case, show that this is not the case: the per-worker cost remains of order $fk/w$ when $fk$ increases faster than $w \log w$, and remains of order $\log w / \log \log w$ when $fk$ is of the order of $w$, even without redundancies (Gonnet, 1981); details in SM.

When the number of workers is of the order of the number of drawn features, a redundant storage of features improves the complexity, in particular in the USB case. One worker per feature is for sure an ideal case (each worker deals with one and only one feature), but this might be impossible when $m$ is large; could we save computational resources while preserving high performance when $m$ is large? The answer, proved under the assumption of feature sampling with replacement, is yes. Assume that the number of workers equals the number of drawn features, and let us store each feature on two (randomly chosen) workers, each drawn feature being handled by the least loaded of the two. Then, instead of a maximum load of order $\log w / \log \log w$, we get $O(\log \log w)$ (detailed proof in (Azar et al., 1999), more details in the SM). This, in conjunction with USB, leads to fast computation with $w = f$ workers instead of $w = m$, a significant improvement in the classical case $f = \sqrt m$.
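The effect of redundant storage can be illustrated with a small Monte-Carlo experiment in the spirit of Azar et al. (1999) (a toy simulation of our own, not the paper's allocation mechanism):

```python
import numpy as np

def max_features_per_worker(num_features: int, num_workers: int, copies: int,
                            seed: int = 0) -> int:
    """Each feature is stored on `copies` random workers and handled by the least
    loaded of them; return the resulting maximum load."""
    rng = np.random.default_rng(seed)
    load = np.zeros(num_workers, dtype=int)
    for _ in range(num_features):
        candidates = rng.choice(num_workers, size=copies, replace=False)
        load[candidates[np.argmin(load[candidates])]] += 1
    return int(load.max())

# With as many features as workers, copies=1 typically gives a maximum load of
# order log(w)/log(log(w)), while copies=2 brings it down to order log(log(w)).
print(max_features_per_worker(10_000, 10_000, 1),
      max_features_per_worker(10_000, 10_000, 2))
```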

The key advantage of DRF is the moderate number of passes over the data. If, for some hardware, this does not matter, and Sprint's pruning might perform particularly well because many leaves are closed early in the tree, we can implement a rule for switching to Sprint's pruning mode; this rule detects the issue early enough to preserve the complexity of Sprint in such a case (details in SM).

Algorithm | Max. memory (per worker) | Computational cost (max per worker), i.e. parallel time complexity | Writing on disk & number of passes (per worker) | Network & number of passes | Reading on disk & number of passes (per worker)
Generic sequential recursive tree, all in memory | | | 0 | 0 | $(m+1)n$ in one pass
Sliq (on one machine) | | | 0 | 0 | in passes
Sprint | | | PS + passes of total size | row indices for bagging + row indices in broadcasts; if bitmaps are used to save communication, we pay for sorting | in passes
Sliq/D | | + coordination | PS | row indices for bagging, coordination, and broadcasts of bits | in passes
Sliq/R | | | PS | row indices for bagging + bits in allreduce | in passes
DRF | | | PS | bits in allreduce | in passes
DRF-USB ($k = 1$) | | | PS | bits in allreduce | in passes
Table 1: Complexities of the discussed algorithms in the context of bagged features (i.e. $f$ features are randomly drawn instead of all $m$ features, with $f$ typically scaling as $\sqrt m$) and bagging (bagged records). Sliq/D contains a complex, expensive, implementation-dependent coordination between workers (in particular for numerical features) which is not detailed here. We assume classification; $N$ is the number of nodes in the tree, and the maximum number of nodes per depth also appears in some of the bounds. PS refers to the complexity of presorting features; the memory bounds also depend on the number of bits used for storing an integer and for storing one entry of one feature or label. $\phi$ is defined in Section 3.2 and depends on $k$, $f$, $m$ and $w$; it should be averaged over the depth levels. If the conditions of Section 3.2 are met, then $\phi$ is of order $fk/w$.

4 Experiments: Artificial Datasets

In this section, we report the performance of DRF on a set of families of synthetic binary classification datasets published specifically for large scale machine learning (P. Geurts, 2018). Each family is associated with a ground truth function (e.g. XOR, Majority). The members of each family differ in the number of training samples as well as in the number of informative and uninformative features. The datasets include various numbers of useless variables (UV), with no correlation with the labels. Rote learning is used as a baseline for comparison; it consists of labelling a test sample correctly if it was in the training set, and randomly otherwise. We test DRF with the following hyperparameters: 1, 3 or 10 trees; unbounded depth; minimum number of examples per leaf equal to 1; number of splitters equal to the number of features. Given the large number of datasets, we did not replicate any of the experiments, but each data point is obtained independently. These runs are performed with a low priority; this shows that the approach remains reliable in spite of interruptions (workers can be killed by tasks with higher priority). Figure 1 and the SM show the AUC as a function of the training set size, while Figure 2 shows the training time.

Figure 1: Impact of the number of trees and training set sizes. Area Under the ROC Curve (AUC) on artificial datasets. DRF, randomly drawn features per node, unlimited depth, at least one record per node. Similarly to the real world data in Section 5, even for gigantic datasets, increasing the training set size and/or adding trees helps, in particular with many UV (compare rows, for each column). Rote learning fails (AUC = 0.5) when we have UV; random labelling, or labelling according to the majority class, also leads to AUC = 0.5. One independent run per point in the plot; hence the highly imbalanced “needle” dataset (dashed line) leads to irregular curves, while the others are more stable.
Figure 2: Training time in seconds as a function of training set size. Exact random forest, randomly drawn features per node, unlimited depth, at least one record per node. We have e.g. 1900s - 3000s for building one random forest tree on 3e8 examples in dimension 18. This is in an environment with preemptions; hence irregular results. The number of workers is equal to the dimension, independently of the number of trees - the different trees are built sequentially, only the presorting is amortized.
Figure 3: Tree and RF metrics during depth-by-depth training: training time, number of open leaves, node density, sample density, individual trees' AUC and RF AUC for depth levels between 0 and 20.

5 Experiments: Real World Dataset

The Leo dataset is a large, unbalanced, proprietary binary classification dataset containing 18 billion records. Each record is defined by 3 numerical features and 69 categorical features with arities ranging from 2 to 10'000. To put the size of this dataset into perspective, storing a single 8 byte integer for each of the samples (e.g. the index of a sample) would require 114 gigabytes of memory; by comparison, high-end consumer workstations have between 8 and 16 gigabytes of memory. If densely represented and uncompressed (and assuming zero overhead for the dataset structure), the dataset occupies 6 terabytes of memory. As far as we know, the Leo dataset is the largest dataset ever used to train an (exact or approximate) Random Forest. We consider three versions of this dataset: Leo 1%, Leo 10%, and Leo 100%, containing respectively 1%, 10%, and 98% of the full dataset. We reached an AUC of 0.847; the best AUC previously obtained on this dataset (by deep learning) was 0.81. We apply the DRF algorithm on the datasets described above. The number of workers is set to 82. Small and moderate size subsets of the datasets can be run with the dataset loaded entirely in memory (distributed across the workers). Reading datasets from memory is significantly faster than reading from drive; however, for the sake of the comparison, all experiments have been run with the datasets remaining on drive. In all the runs, the hyperparameters of the Random Forest have been set to reasonable default values: the number of candidate attributes for each split is equal to the square root of the total number of attributes, the minimum number of records in each node is set to 10, 100 and 1000 respectively (for Leo 1%, Leo 10% and Leo 100%), and the maximum depth is set to 20. In the case of subsets of the dataset, the minimum number of records has been reduced proportionally to the relative size of the subset with respect to the original set. Table 2 shows the training time, number of leaves, node density and sample density of each tree (averaged over all the trees in the model). The node density is the ratio between the number of leaves of the tree and the number of leaves of a dense tree of the same depth. The sample density is the ratio of training samples that reached the bottom leaves of the tree (i.e. the leaves at depth 20). Both density measures are expected to decrease with the depth of the tree.
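In symbols (our notation), for a tree of depth $\hat d$:
\[
\text{node density} = \frac{\#\{\text{leaves of the tree}\}}{2^{\hat d}}, \qquad
\text{sample density} = \frac{\#\{\text{training samples reaching depth } \hat d\}}{\#\{\text{training samples of the tree}\}}.
\]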

Leo | Samples | Train time (h) | Leaves | Node density | Sample density
1% | | 0.838 | | 0.134 | 0.766
10% | | 3.156 | | 0.305 | 0.904
100% | | 22.29 | | 0.415 | 0.969
Table 2: Average training time, number of nodes, node density and sample density for the various datasets.

Figure 3 shows the average training time, number of leaves, node density and sample density when varying the maximum tree depth between 0 (i.e. a tree reduced to a single root node attached to the majority class) and 20. Figures have been averaged across the trees of a RF. The total training time of a tree is the sum of the training times of each depth level. As expected, the number of leaves and the training time increase with the depth of the trees. However, while the number of leaves increases exponentially with the depth of the trees, the computation time does not. This is explained by the fact that most of the computation is spent on scanning the dataset, and this step does not depend on the number of leaves. At depth 20, in the case of Leo 100%, 96.9% of the training samples are still in an open leaf (i.e. a leaf that could be split if the maximum depth were increased). This indicates that the tree could still grow if the maximum depth were increased. It also shows that pruning the dataset by removing samples in closed leaves (as in Sprint) would not speed up the computation: given that 96.9% of the samples remain in open leaves, the cost of pruning would exceed its gains. Figure 3 also shows the AUC (Area Under the Receiver Operating Characteristic Curve, computed on a test set) of individual trees (averaged over several trees) and of the entire RF model when varying the maximum tree depth. We see that using several billion examples is useful for improving the AUC. Non-pruned DTs are highly susceptible to overfitting. Among other causes, DT overfitting appears when the number of training samples in a node becomes too small: the tree starts “learning the noise” of the training dataset, the test AUC of an individual tree decreases, and the AUC of the overall RF plateaus. The depth of a tree is the main factor leading to nodes with few training samples. Therefore, we expect the effect described above to be correlated with the depth of the trees: in the case of Leo 1% and Leo 10%, the overfitting of individual trees starts at depths 13 and 17 respectively; in the case of Leo 100%, the overfitting of individual trees has not yet started at depth 20. The overfitting of individual trees does not imply that the overall RF cannot benefit from deeper training: the AUC of the corresponding RFs plateaus at depth 16 for Leo 1%, and keeps increasing after depth 20 for Leo 10% and Leo 100%. We also observe that the RFs trained on more data plateau at greater AUCs (0.823, 0.837, and 0.847, respectively for Leo 1%, Leo 10%, and Leo 100%). This indicates that RF benefits from using large datasets and training deeper trees.

6 Conclusions

We introduced DRF, an exact distributed Random Forest (Breiman, 2001). Our method stands out from existing exact distributed approaches by its smaller memory, disk and network complexity. We demonstrated its application to 17.3B training examples with dense features. This figure is 1000x larger than for any exact decision tree (from RF or gradient-boosted trees) found in the literature (He, 2015; Chen & Guestrin, 2016; Lo et al., 2014), and at least 10x larger than in any related approximate work (Chan & Stolfo, 1993; Gehrke et al., 1999; Genuer et al., 2015; Ristin-Kaufmann, 2015). It is known that more training data improves accuracy or AUC; however, it is not obvious that this remains the case when increasing the number of training examples from 1 to 10 billion. This question is also central to learning algorithms working on subsets of the dataset (Gehrke et al., 1999). The present results show examples of datasets (both artificial and real world) for which more training data is beneficial even for datasets of billions of examples. Further work: the mathematical analysis has shown the importance of USB and of redundant storage of features, to be experimentally investigated; we might also switch to pure in-memory processing for nodes with small numbers of records.

References

  • Azar et al. (1999) Azar, Yossi, Broder, Andrei Z., Karlin, Anna R., and Upfal, Eli. Balanced allocations. SIAM Journal on Computing, 29(1):180–200, 1999. doi: 10.1137/S0097539795288490. URL https://doi.org/10.1137/S0097539795288490.
  • Breiman (2001) Breiman, Leo. Random forests. Mach. Learn., 45(1):5–32, October 2001. ISSN 0885-6125. doi: 10.1023/A:1010933404324. URL https://doi.org/10.1023/A:1010933404324.
  • Chan & Stolfo (1993) Chan, Philip K. and Stolfo, Salvatore J. Experiments on multistrategy learning by meta-learning. In Proceedings of the Second International Conference on Information and Knowledge Management, CIKM ’93, pp. 314–323, New York, NY, USA, 1993. ACM. ISBN 0-89791-626-3. doi: 10.1145/170088.170160. URL http://doi.acm.org/10.1145/170088.170160.
  • Chen & Guestrin (2016) Chen, Tianqi and Guestrin, Carlos. Xgboost: A scalable tree boosting system. In Krishnapuram, Balaji, Shah, Mohak, Smola, Alexander J., Aggarwal, Charu C., Shen, Dou, and Rastogi, Rajeev (eds.), Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pp. 785–794. ACM, 2016. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2939785. URL http://doi.acm.org/10.1145/2939672.2939785.
  • Clémençon et al. (2016) Clémençon, Stéphan, Bertail, Patrice, and Papa, Guillaume. Learning from survey training samples: Rate bounds for horvitz-thompson risk minimizers. In Durrant, Robert J. and Kim, Kee-Eung (eds.), Proceedings of The 8th Asian Conference on Machine Learning, ACML 2016, Hamilton, New Zealand, November 16-18, 2016., volume 63 of JMLR Workshop and Conference Proceedings, pp. 142–157. JMLR.org, 2016. URL http://jmlr.org/proceedings/papers/v63/clemencon64.html.
  • Gehrke et al. (1999) Gehrke, Johannes, Ganti, Venkatesh, Ramakrishnan, Raghu, and Loh, Wei-Yin. BOAT: optimistic decision tree construction. SIGMOD Rec., 28(2):169–180, June 1999. ISSN 0163-5808. doi: 10.1145/304181.304197. URL http://doi.acm.org/10.1145/304181.304197.
  • Genuer et al. (2015) Genuer, Robin, Poggi, Jean-Michel, Tuleau-Malot, Christine, and Villa-Vialaneix, Nathalie. Random forests and big data. In 47ème Journées de Statistique de la SFdS, Lille, France, June 2015. Société Française de Statistique. URL https://hal.archives-ouvertes.fr/hal-01160643.
  • Gonnet (1981) Gonnet, Gaston H. Expected length of the longest probe sequence in hash code searching. J. ACM, 28(2):289–304, April 1981. ISSN 0004-5411. doi: 10.1145/322248.322254. URL http://doi.acm.org/10.1145/322248.322254.
  • He (2015) He, Tong. Xgboost: extreme gradient boosting. Talk at the NYC Data Science Academy, 2015. URL http://www.saedsayad.com/docs/xgboost.pdf.
  • Lafferty et al. (2010) Lafferty, John, Liu, Han, and Wasserman, Larry. Concentration of measure. Handout, 2010. URL http://www.stat.cmu.edu/~larry/=sml/Concentration.pdf.
  • Lo et al. (2014) Lo, Win-Tsung, Chang, Yue-Shan, Sheu, Ruey-Kai, Chiu, Chun-Chieh, and Yuan, Shyan-Ming. CUDT: A CUDA based decision tree algorithm. The Scientific World Journal, July 2014. doi: 10.1155/2014/745640.
  • Mehta et al. (1996) Mehta, Manish, Agrawal, Rakesh, and Rissanen, Jorma. Sliq: A fast scalable classifier for data mining. In Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology, EDBT ’96, pp. 18–32, London, UK, UK, 1996. Springer-Verlag. ISBN 3-540-61057-X. URL http://dl.acm.org/citation.cfm?id=645337.650384.
  • P. Geurts (2018) Geurts, P., Guillame-Bert, Mathieu, and Teytaud, Olivier. Synthetic vectorized datasets for large scale machine learning experiments. https://github.com/oteytaud/hugeml/blob/master/datasets.pdf, 2018.
  • Ristin-Kaufmann (2015) Ristin-Kaufmann, Marko. Large-Scale Image Recognition with Random Forests. PhD thesis, ETH Zürich, 2015.
  • Shafer et al. (1996) Shafer, John C., Agrawal, Rakesh, and Mehta, Manish. Sprint: A scalable parallel classifier for data mining. In Proceedings of the 22th International Conference on Very Large Data Bases, VLDB ’96, pp. 544–555, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc. ISBN 1-55860-382-4. URL http://dl.acm.org/citation.cfm?id=645922.673491.
  • Ye et al. (2009) Ye, Jerry, Chow, Jyh-Herng, Chen, Jiang, and Zheng, Zhaohui. Stochastic gradient boosted distributed decision trees. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM ’09, pp. 2061–2064, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-512-3. doi: 10.1145/1645953.1646301. URL http://doi.acm.org/10.1145/1645953.1646301.