GBDT-MO: Gradient Boosted Decision Trees for Multiple Outputs


Zhendong Zhang and Cheolkon Jung. This work was supported by the National Natural Science Foundation of China (No. 61872280) and the International S&T Cooperation Program of China (No. 2014DFG12780). Z. Zhang and C. Jung (corresponding author) are with the School of Electronic Engineering, Xidian University, Xi’an 710071, China (e-mail: zhd.zhang.ai@gmail.com; zhengzk@xidian.edu.cn). Manuscript received September 2, 2019.
Abstract

Gradient boosted decision trees (GBDTs) are widely used in machine learning, and the output of current GBDT implementations is a single variable. When there are multiple outputs, GBDT constructs one tree per output variable. Such a strategy ignores the correlations between output variables, which causes redundancy in the learned tree structures. In this paper, we propose a general method to learn GBDT for multiple outputs, called GBDT-MO. Each leaf of GBDT-MO constructs the predictions of all output variables or of a subset of automatically selected variables. This is achieved by considering the summation of objective gains over all output variables. Moreover, we extend histogram approximation to the multiple-output case to speed up training. Various experiments on synthetic and real-world datasets verify that the learning mechanism of GBDT-MO plays a role in indirect regularization. Our code is available online.

gradient boosting, decision tree, multiple outputs, variable correlations, indirect regularization.

I Introduction

Machine learning and data-driven approaches have achieved great success in recent years. Gradient boosted decision tree (GBDT) [1] [2] is a powerful machine learning tool widely used in many applications such as multi-class classification [3], learning to rank [4] and click prediction [5]. It also produces state-of-the-art results for many data mining competitions such as the Netflix prize [6]. GBDT uses decision trees as the base learner and sums the predictions of a series of trees. At each step, a new decision tree is trained to fit the residual between ground truth and current prediction. GBDT is popular due to its accuracy, efficiency and interpretability. Many improvements have been proposed after [1]. XGBoost [7] used the second order gradient to guide the boosting process and improve the accuracy. LightGBM [8] aggregated gradient information in histograms and significantly improved the training efficiency. CatBoost [9] proposed a novel strategy to deal with categorical features.

A limitation of current GBDT implementations is that the output of each decision tree is a single variable, because each leaf of a decision tree produces a single value. However, multiple outputs are required for many machine learning problems, including multi-class classification, multi-label classification [10] and multi-output regression [11]. Other machine learning methods, such as neural networks [12], adapt to any output dimension straightforwardly by changing the number of neurons in the last layer. This flexibility in the output dimension may be one reason why neural networks are popular. Current GBDT implementations, in contrast, handle multiple outputs awkwardly: at each step, they construct one decision tree per output variable and then concatenate the predictions of all trees to obtain the multiple outputs. This strategy is used in the most popular open-sourced GBDT libraries: XGBoost [7], LightGBM [8], and CatBoost [9].

The major drawback of the abovementioned strategy is that correlations between variables are ignored during the training process, because the output variables are treated in isolation and learned independently. However, correlations more or less exist between output variables. For example, there are correlations between classes in multi-class classification. It is verified in [13] that such correlations improve the generalization ability of neural networks. Ignoring variable correlations also leads to redundancy in the learned tree structures. Thus, it is necessary to learn GBDT for multiple outputs via better strategies. Up to now, only a few works have explored this direction [14] [15]. [14] transformed the multiple output problem into a single output problem by kernelizing the output space. However, this method is not scalable because the space complexity of its kernel matrix is $O(n^2)$, where $n$ is the number of training samples. [15] proposed GBDT for sparse output, focusing mainly on extreme multi-label classification problems where the outputs are represented in sparse format; a sparse split finding algorithm was designed for the squared hinge loss. Both [14] and [15] work only for specific losses, and they employ neither the second order gradient nor histogram approximation.

In this paper, we propose a novel and general method to learn GBDT for multiple outputs, named GBDT-MO, which is scalable and efficient. Unlike previous works, we employ the second order gradient and histogram approximation to improve GBDT-MO. The learning mechanism is designed based on them to jointly fit all output variables in a single tree: each leaf of a decision tree constructs multiple outputs at once. This is achieved by maximizing the summation of objective gains over all output variables. Sometimes, only a subset of the output variables is correlated. We expect the proposed method to automatically select those variables and construct predictions for them at a leaf, which we achieve by adding a sparsity ($L_0$) constraint to the objective. Since the learning mechanism of GBDT-MO enforces the learned trees to capture variable correlations, it plays a role in indirect regularization. Experiments on both synthetic and real-world datasets show that GBDT-MO achieves better generalization ability than the standard GBDT, indicating that the learning mechanism indeed acts as an indirect regularizer. Moreover, GBDT-MO achieves fast training speed, especially when the number of outputs is large.

Fig. 1: Illustration of GBDT and GBDT-MO constructing 3-dimensional outputs. The number within each box is the value of a leaf. Each of the first three trees constructs a single output variable and their tree structures are the same, whereas GBDT-MO constructs all three variables at the same time in the last (rightmost) tree.

Compared with existing methods, the main contributions of this paper are as follows:

  • We formulate the problem of learning multiple outputs for GBDT, and propose a split finding algorithm by deriving a general approximate objective for this problem.

  • To learn a subset of outputs, we add a sparse constraint to the objective and provide two sparse split finding algorithms.

  • We extend histogram approximation [16] to the multiple-output case and use it to speed up the training process.

The rest of this paper is organized as follows. First, we review GBDT for single output and introduce basic definitions in Section II. Then, we describe the details of GBDT-MO in Section III. We address related work in Section IV. Finally, we perform experiments and conclude in Sections V and VI, respectively.

II GBDT for Single Output

In this section, we review GBDT for single output. First, we show how to derive the objective of GBDT based on the second order Taylor expansion of the loss, as used in XGBoost. This objective will be generalized to the multiple-output case in Section III. Then, we explain the split finding algorithms which exactly or approximately minimize the objective.

II-A Objective

Denote $\mathcal{D}=\{(x_i, y_i)\}_{i=1}^{n}$ as a dataset with $n$ samples, where $x_i \in \mathbb{R}^m$ is an $m$-dimensional input. Denote $f$ as the function of a decision tree which maps $x$ into a scalar. Based on the construction mechanism of decision trees, $f$ can be further expressed as follows:

$f(x) = w_{q(x)}, \quad q: \mathbb{R}^m \rightarrow \{1, \dots, T\}$ (1)

where $T$ is the number of leaves of a decision tree, $q$ is a function which selects a leaf given $x$, and $w_j$ is the value of the $j$-th leaf. That is, once a decision tree is constructed, it first maps an input into a leaf, then returns the value of that leaf. Since GBDT integrates decision trees in an additive manner, the prediction of GBDT is as follows:

$\hat{y}_i = \sum_{k} f_k(x_i)$ (2)

where $f_k$ is the function of the $k$-th decision tree.

Now, we consider the objective of the $t$-th decision tree given the prediction $\hat{y}^{(t-1)}$ of the first $t-1$ trees:

$\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$ (3)

where the first term is the fidelity term, $\Omega(f_t)$ is the regularization term of $f_t$, and $\lambda$ controls the trade-off between the two terms. We suppose $l$ is a second order differentiable loss. Based on the hypothesis space of $f_t$, i.e. a constant value for each leaf, the fidelity term of (3) is separable w.r.t. each leaf. Then (3) is rewritten as follows:

$\mathcal{L} = \sum_{j=1}^{T} \left[ \sum_{i \in I_j} l\left(y_i, \hat{y}_i^{(t-1)} + w_j\right) + \Omega(w_j) \right]$ (4)

where $I_j$ is the set of samples belonging to the $j$-th leaf. Although there are many choices of $\Omega$, we set $\Omega(w) = \frac{\lambda}{2} w^2$, which is commonly used. Because (4) is separable w.r.t. each leaf, we only consider the objective of a single leaf as follows:

$\mathcal{L} = \sum_{i} l\left(y_i, \hat{y}_i^{(t-1)} + w\right) + \frac{\lambda}{2} w^2$ (5)

where $w$ is the value of a leaf and $i$ is enumerated over the samples belonging to that leaf. $\mathcal{L}$ can be approximated by the second order Taylor expansion of $l$. Then, we have

$\mathcal{L} \approx \sum_{i} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i w + \frac{1}{2} h_i w^2 \right] + \frac{\lambda}{2} w^2$ (6)

where $g_i$ and $h_i$ are the first and second order derivatives of $l$ w.r.t. $\hat{y}_i^{(t-1)}$. By setting $\partial \mathcal{L} / \partial w$ to $0$, we obtain the optimal value of $w$ as follows:

$w^* = -\frac{\sum_i g_i}{\sum_i h_i + \lambda}$ (7)

Substituting (7) into (6), we get the optimal objective as follows:

$\mathcal{L}^* = -\frac{1}{2} \frac{\left(\sum_i g_i\right)^2}{\sum_i h_i + \lambda}$ (8)

We ignore $\sum_i l(y_i, \hat{y}_i^{(t-1)})$ since it is a constant term given $\hat{y}_i^{(t-1)}$.
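To make (7) and (8) concrete, the following is a minimal NumPy sketch of the single-leaf computation; the function and variable names (leaf_value_and_objective, g, h, lam) are ours for illustration and are not taken from any GBDT library.

import numpy as np

def leaf_value_and_objective(g, h, lam=1.0):
    """Optimal leaf value (7) and optimal objective (8) for a single leaf.

    g, h: arrays of first/second order derivatives of the loss for the
    samples falling into this leaf; lam is the L2 regularization weight.
    """
    G, H = g.sum(), h.sum()
    w = -G / (H + lam)               # Eq. (7)
    obj = -0.5 * G * G / (H + lam)   # Eq. (8), constant term dropped
    return w, obj

# toy usage: squared error l = 0.5*(y - yhat)^2 gives g = yhat - y, h = 1
y = np.array([1.0, 2.0, 3.0])
yhat = np.zeros_like(y)
w, obj = leaf_value_and_objective(yhat - y, np.ones_like(y))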

  Input: set of used samples $I$, column index $k$, and number of bins $B$
  Output: histogram of the $k$-th column
  Initialize Hist.g $\leftarrow 0$, Hist.h $\leftarrow 0$ and Hist.cnt $\leftarrow 0$
  for $i$ in $I$ do
      $b \leftarrow$ bin value of $x_{i,k}$
      Hist.g[$b$] $\leftarrow$ Hist.g[$b$] + $g_i$
      Hist.h[$b$] $\leftarrow$ Hist.h[$b$] + $h_i$
      Hist.cnt[$b$] $\leftarrow$ Hist.cnt[$b$] + 1
  end for
Algorithm 1 Histogram

II-B Split Finding

One of the most important problems in decision tree learning is to find the best split given a set of samples. Specifically, samples are divided into left and right parts based on the following rule:

$x_{i,k} \le \tau \Rightarrow x_i \in \text{left}, \qquad x_{i,k} > \tau \Rightarrow x_i \in \text{right}$ (9)

where $x_{i,k}$ is the $k$-th element of $x_i$ and $\tau$ is the threshold. The goal of split finding algorithms is to find the best column $k$ and threshold $\tau$ such that the gain between the optimal objectives before and after the split is maximized. The optimal objective after the split is defined as the sum of the optimal objectives of the left and right parts:

$\text{gain} = \mathcal{L}^*_{before} - \left(\mathcal{L}^*_{left} + \mathcal{L}^*_{right}\right)$ (10)

where maximizing the gain is equivalent to minimizing $\mathcal{L}^*_{left} + \mathcal{L}^*_{right}$, i.e. the optimal objective after the split, because $\mathcal{L}^*_{before}$ is fixed for a given set of samples. The gain is also used to determine whether a tree is grown: if the gain is smaller than a threshold, we stop the growth to avoid over-fitting.

Exact and approximate split finding algorithms have been developed. The exact algorithm enumerates all possible splits on all columns. For efficiency, it first sorts the samples according to the values of each column and then visits the samples in sorted order to accumulate $\sum_i g_i$ and $\sum_i h_i$, which are used to compute the optimal objective. Once samples are divided into two parts, one might re-sort the samples in both parts, which is time-consuming. Fortunately, this can be avoided by storing the sorted order in memory: the relative order of the left and right parts is preserved in the order before the split, so we obtain the order of both parts by scanning the pre-split order once, i.e. in linear time.

The exact split finding algorithm is accurate. However, when the number of samples is large, it is time-consuming to enumerate all possible splits, so approximate algorithms become necessary. The key idea of approximate algorithms is to divide samples into buckets and enumerate over these buckets instead of individual samples. The gradient statistics are accumulated within each bucket, so the complexity of the enumeration process is independent of the number of samples. In the literature, there are two strategies for bucketing: quantile-based and histogram-based. Since the latter is significantly faster than the former [8], we focus on it in this work. For histogram-based bucketing, a bucket is called a bin. Samples are divided into bins by adjacent intervals:

$\left\{[e_{k,0}, e_{k,1}), [e_{k,1}, e_{k,2}), \dots, [e_{k,B-1}, e_{k,B})\right\}$ (11)

where $e_{k,0}$ and $e_{k,B}$ are usually set to $-\infty$ and $+\infty$ respectively. These intervals are constructed based on the distribution of the $k$-th input column over the whole dataset. Once constructed, they are kept unchanged. Given a sample $x_i$, it belongs to the $b$-th bin if and only if $x_{i,k} \in [e_{k,b-1}, e_{k,b})$. The bin value of $x_{i,k}$ is obtained by binary search with complexity $O(\log B)$. Given the bin values, the histogram of the $k$-th input column is constructed by a single fast scan as described in Algorithm 1. When samples are divided into two parts, it is unnecessary to construct the histograms of both parts. One can store the histogram of the parent node in memory and construct the histogram of only one part. Then, the histogram of the other part is obtained based on the following relation:

$\text{Hist}_{sibling} = \text{Hist}_{parent} - \text{Hist}_{child}$ (12)

This trick reduces the running time of histogram construction by at least half.
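As an illustration of Algorithm 1 and the subtraction trick in (12), here is a small NumPy sketch; the helper names and the tuple-of-arrays histogram layout are our own assumptions, not the paper's actual data structures.

import numpy as np

def build_histogram(bin_idx, g, h, n_bins):
    """Accumulate gradient statistics per bin for one input column (Algorithm 1).

    bin_idx: precomputed integer bin index of each sample for this column,
    g, h: first/second order gradients of the samples in this node.
    """
    hist_g = np.zeros(n_bins)
    hist_h = np.zeros(n_bins)
    hist_cnt = np.zeros(n_bins, dtype=np.int64)
    np.add.at(hist_g, bin_idx, g)
    np.add.at(hist_h, bin_idx, h)
    np.add.at(hist_cnt, bin_idx, 1)
    return hist_g, hist_h, hist_cnt

def sibling_histogram(parent, child):
    """Histogram subtraction trick of Eq. (12): sibling = parent - child."""
    return tuple(p - c for p, c in zip(parent, child))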

III GBDT for Multiple Outputs

In this section, we describe GBDT-MO in detail. We first formulate the general problem of learning GBDT for multiple outputs. Specifically, we derive the objective for learning multiple outputs based on the second order Taylor expansion of the loss, approximate this objective, and connect it with the objective for single output. We also formulate the problem of learning a subset of output variables and derive its objective, which is achieved by adding a sparsity constraint. Then, we propose split finding algorithms which minimize the corresponding objectives. Finally, we discuss our implementation and analyze the complexity of the proposed split finding algorithms. In this work, we denote $A_{i,:}$ as the $i$-th row of a matrix $A$, $A_{:,j}$ as its $j$-th column, and $A_{i,j}$ as its element in the $i$-th row and $j$-th column.

III-A Objective

We derive the objective of GBDT-MO, where each leaf of a decision tree constructs multiple outputs. Denote $\mathcal{D}=\{(x_i, y_i)\}_{i=1}^{n}$ as a dataset with $n$ samples, where $x_i \in \mathbb{R}^m$ is an $m$-dimensional input and $y_i \in \mathbb{R}^d$ is a $d$-dimensional output instead of a scalar. Denote $f$ as the function of a decision tree which maps $x$ into the output space. Based on the construction mechanism of decision trees, $f$ can be further expressed as follows:

$f(x) = W_{q(x),:}, \quad q: \mathbb{R}^m \rightarrow \{1, \dots, T\}, \quad W \in \mathbb{R}^{T \times d}$ (13)

where $T$ is the number of leaves of a decision tree, $q$ is a function which selects a leaf given $x$, and $W_{j,:}$ contains the values of the $j$-th leaf. That is, once a decision tree is constructed, it first maps an input into a leaf, then returns the $d$-dimensional vector of that leaf. Then, the prediction of the first $t$ trees is $\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i)$.

We consider the objective of the $t$-th tree given $\hat{y}^{(t-1)}$. Because it is separable w.r.t. each leaf (see (3) and (4)), we only consider the objective of a single leaf as follows:

$\mathcal{L} = \sum_{i} l\left(y_i, \hat{y}_i^{(t-1)} + w\right) + \frac{\lambda}{2} \|w\|_2^2$ (14)

We highlight that $w$ is a vector with $d$ elements which belongs to a leaf, and $i$ is enumerated over the samples belonging to that leaf. Again, we suppose $l$ is a second order differentiable function. $\mathcal{L}$ can be approximated by the second order Taylor expansion of $l$ around $\hat{y}^{(t-1)}$. Then we have:

$\mathcal{L} \approx \sum_{i} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_{(i)}^{\top} w + \frac{1}{2} w^{\top} H_{(i)} w \right] + \frac{\lambda}{2} \|w\|_2^2$ (15)

where $g_{(i)} = \partial l / \partial \hat{y}_{(i)}^{(t-1)} \in \mathbb{R}^d$ and $H_{(i)} = \partial^2 l / \partial (\hat{y}_{(i)}^{(t-1)})^2 \in \mathbb{R}^{d \times d}$. To avoid notation conflicts with the subscripts of vectors or matrices, we use $(i)$ to indicate that an object belongs to the $i$-th sample. This notation is omitted when there is no ambiguity. By setting $\partial \mathcal{L} / \partial w = 0$ in (15), we obtain the optimal leaf values:

$w^* = -\left(\sum_i H_{(i)} + \lambda I\right)^{-1} \sum_i g_{(i)}$ (16)

where $I$ is a $d \times d$ identity matrix. By substituting $w^*$ into (15) and ignoring the constant term $\sum_i l(y_i, \hat{y}_i^{(t-1)})$, we get the optimal objective as follows:

$\mathcal{L}^* = -\frac{1}{2} \left(\sum_i g_{(i)}\right)^{\top} \left(\sum_i H_{(i)} + \lambda I\right)^{-1} \left(\sum_i g_{(i)}\right)$ (17)

We have derived the optimal leaf values and the optimal objective for multiple outputs. Comparing (16) with (7) and (17) with (8), it is easy to see that this is a natural generalization of the single output case. In fact, when the loss function is separable w.r.t. different output dimensions, or equivalently, when its Hessian matrix is diagonal, each element of $w^*$ is obtained in the same way as in (7):

$w_j^* = -\frac{\sum_i g_{(i),j}}{\sum_i h_{(i),j} + \lambda}, \quad j = 1, \dots, d$ (18)

where $h_{(i)}$ is the vector of diagonal elements of $H_{(i)}$. The optimal objective in (17) can then be expressed as the sum of objectives over all output dimensions:

$\mathcal{L}^* = -\frac{1}{2} \sum_{j=1}^{d} \frac{\left(\sum_i g_{(i),j}\right)^2}{\sum_i h_{(i),j} + \lambda}$ (19)

GBDT-MO and GBDT are different even when $H_{(i)}$ is diagonal, because GBDT-MO considers the objectives of all output variables at the same time.

However, it is problematic when $l$ is not separable, or equivalently when $H_{(i)}$ is non-diagonal. First, it is difficult to store $H_{(i)}$ for every sample when the output dimension $d$ is large. Second, to get the optimal objective for each possible split, it is required to compute the inverse of a $d \times d$ matrix, which is time-consuming. Thus, it is impractical to learn GBDT-MO using the exact objective in (17) and the exact leaf values in (16). Fortunately, it is shown in [17] that (18) and (19) are good approximations of the exact leaf values and the exact objective when the diagonal elements of $H_{(i)}$ are dominant; they can be derived from an upper bound of (15). See Appendix A for details and further discussion. Although it is possible to derive better approximations for some specific loss functions, we use the above diagonal approximation in this work and leave better approximations as future work.
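The following sketch contrasts the exact leaf values and objective of (16)-(17) with the diagonal approximation of (18)-(19); it is a minimal NumPy illustration under our own naming, not the paper's implementation.

import numpy as np

def exact_leaf(G, H_sum, lam=1.0):
    """Exact multi-output leaf values (16) and objective (17).

    G: sum of per-sample gradient vectors, shape (d,).
    H_sum: sum of per-sample d x d Hessians.
    """
    d = G.shape[0]
    w = -np.linalg.solve(H_sum + lam * np.eye(d), G)
    obj = 0.5 * G @ w                 # equals -0.5 * G^T (H + lam I)^{-1} G
    return w, obj

def diag_leaf(G, h_sum, lam=1.0):
    """Diagonal approximation (18)-(19): treat outputs independently
    but sum their objectives, which is what drives split finding."""
    w = -G / (h_sum + lam)
    obj = -0.5 * np.sum(G * G / (h_sum + lam))
    return w, obj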

III-B Sparse Objective

In Section III-A, we define the objective as the sum over all output variables because there are correlations between variables. However, in practice not all variables but only a subset of them may be correlated. Thus, we may learn the values of only a suitable subset of $w$ while setting the others to $0$. This can be achieved by adding an $L_0$ constraint to the objective. Based on the diagonal approximation, the optimal sparse leaf values are obtained from:

$w^* = \arg\min_{w} \sum_{j=1}^{d} \left[ G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^2 \right], \quad \text{s.t. } \|w\|_0 \le k$ (20)

where $k$ is the maximum number of non-zero elements of $w$. We denote $G_j = \sum_i g_{(i),j}$ and $H_j = \sum_i h_{(i),j}$ to simplify the notation. Note the difference between $H_j$ and $H_{(i)}$.

For the $j$-th element of $w$, its value is either $-G_j / (H_j + \lambda)$ or $0$. Accordingly, the objective contributed by the $j$-th column is either $-\frac{1}{2} G_j^2 / (H_j + \lambda)$ or $0$. Thus, to minimize (20), we select the $k$ columns with the largest $G_j^2 / (H_j + \lambda)$. Let $\pi$ be the sorted order of the columns such that:

$\frac{G_{\pi(1)}^2}{H_{\pi(1)} + \lambda} \ge \frac{G_{\pi(2)}^2}{H_{\pi(2)} + \lambda} \ge \dots \ge \frac{G_{\pi(d)}^2}{H_{\pi(d)} + \lambda}$ (21)

Then, the solution of (20) is as follows:

$w_{\pi(j)}^* = \begin{cases} -\dfrac{G_{\pi(j)}}{H_{\pi(j)} + \lambda}, & j \le k \\ 0, & j > k \end{cases}$ (22)

That is, the $k$ columns with the largest $G_j^2/(H_j + \lambda)$ keep their values while the others are set to $0$. The corresponding optimal objective is as follows:

$\mathcal{L}^* = -\frac{1}{2} \sum_{j=1}^{k} \frac{G_{\pi(j)}^2}{H_{\pi(j)} + \lambda}$ (23)

Our sparse objective is similar to the objective in [15]. The difference is that our derivation is based on second order gradient statistics.
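A minimal sketch of the sparse solution (21)-(23): sort the per-column scores, keep the top-k columns and zero out the rest. The function name and the use of a full sort (instead of a top-k priority queue) are our own simplifications.

import numpy as np

def sparse_leaf(G, h_sum, k, lam=1.0):
    """Sparse leaf values (22) and objective (23) under an L0 constraint.

    Keep the k output columns with the largest G_j^2 / (H_j + lam);
    all other leaf values are set to zero.
    """
    score = G * G / (h_sum + lam)
    keep = np.argsort(score)[::-1][:k]      # columns pi(1), ..., pi(k)
    w = np.zeros_like(G)
    w[keep] = -G[keep] / (h_sum[keep] + lam)
    obj = -0.5 * score[keep].sum()
    return w, obj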

  Input: set of used samples $I$, column index $k$, output dimension $d$ and number of bins $B$
  Output: histogram of the $k$-th column
  Initialize Hist.g $\leftarrow 0$, Hist.h $\leftarrow 0$ and Hist.cnt $\leftarrow 0$
  for $i$ in $I$ do
      $b \leftarrow$ bin value of $x_{i,k}$
      Hist.cnt[$b$] $\leftarrow$ Hist.cnt[$b$] + 1
      for $j \leftarrow 1$ to $d$ do
         Hist.g[$b,j$] $\leftarrow$ Hist.g[$b,j$] + $g_{(i),j}$
         Hist.h[$b,j$] $\leftarrow$ Hist.h[$b,j$] + $h_{(i),j}$
      end for
  end for
Algorithm 2 Histogram for Multiple Outputs
  Input: set of used samples $I$, input dimension $m$ and output dimension $d$
  Output: split with maximum gain
  gain $\leftarrow 0$
  $G \leftarrow \sum_{i \in I} g_{(i)}$, $H \leftarrow \sum_{i \in I} h_{(i)}$
  for $k \leftarrow 1$ to $m$ do
     $G_L \leftarrow 0$, $H_L \leftarrow 0$
     for $i$ in $I$, sorted by $x_{i,k}$ do
        $G_L \leftarrow G_L + g_{(i)}$, $H_L \leftarrow H_L + h_{(i)}$
        $G_R \leftarrow G - G_L$, $H_R \leftarrow H - H_L$
        obj $\leftarrow \sum_{j=1}^{d} \left[ G_{L,j}^2/(H_{L,j}+\lambda) + G_{R,j}^2/(H_{R,j}+\lambda) \right]$
        gain $\leftarrow \max(\text{gain}, \text{obj})$
     end for
  end for
Algorithm 3 Exact Split Finding for Multiple Outputs
  Input: histograms of current node, input dimension $m$ and output dimension $d$
  Output: split with maximum gain
  gain $\leftarrow 0$
  for $k \leftarrow 1$ to $m$ do
     Hist $\leftarrow$ histogram of the $k$-th column
     $B \leftarrow$ number of bins of Hist
     $G \leftarrow \sum_b$ Hist.g[$b$], $H \leftarrow \sum_b$ Hist.h[$b$]
     $G_L \leftarrow 0$, $H_L \leftarrow 0$
     for $b \leftarrow 1$ to $B$ do
        $G_L \leftarrow G_L +$ Hist.g[$b$], $H_L \leftarrow H_L +$ Hist.h[$b$]
        $G_R \leftarrow G - G_L$, $H_R \leftarrow H - H_L$
        obj $\leftarrow \sum_{j=1}^{d} \left[ G_{L,j}^2/(H_{L,j}+\lambda) + G_{R,j}^2/(H_{R,j}+\lambda) \right]$
        gain $\leftarrow \max(\text{gain}, \text{obj})$
     end for
  end for
Algorithm 4 Approximate Split Finding for Multiple Outputs

III-C Split Finding

Split finding algorithms for single output maximize the gain of the objective before and after a split. When dealing with multiple outputs, the objective is defined over all output variables, so finding the maximum gain requires scanning all columns of the outputs. We summarize the exact split finding algorithm for multiple outputs in Algorithm 3. The exact algorithm is inefficient because it enumerates all possible splits. In this work, we use the histogram-approximation-based algorithm to speed up the training process.

To deal with multiple outputs, one should extend the histogram construction algorithm from the single-variable case to the multiple-variable case. Such an extension is straightforward. Denote $B$ as the number of bins of a histogram. Then, the gradient information is stored in a $B \times d$ matrix whose $j$-th column corresponds to the gradient information of the $j$-th output variable. We describe the histogram construction algorithm for multiple outputs in Algorithm 2. Once the histogram is constructed, we scan its bins to find the best split, as described in Algorithm 4. Compared with the single output case, the objective gain is the sum of gains over all outputs.
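Before turning to the sparse variants, the following sketch illustrates the bin-scanning step of Algorithm 4 for one input column; the histogram layout (a B x d matrix per statistic) follows the description above, while the function name and the omission of the parent-node objective (constant for a given node) are our own simplifications.

import numpy as np

def best_split_one_column(hist_g, hist_h, lam=1.0):
    """Scan the bins of one input column and return (best_obj, best_bin).

    hist_g, hist_h: arrays of shape (B, d) holding per-bin sums of the
    first/second order gradients for every output variable (Algorithm 2).
    The objective of a split is summed over all d outputs.
    """
    G, H = hist_g.sum(0), hist_h.sum(0)       # totals of this node
    GL = np.zeros_like(G)
    HL = np.zeros_like(H)
    best_obj, best_bin = -np.inf, -1
    for b in range(hist_g.shape[0] - 1):      # split between bin b and b+1
        GL += hist_g[b]
        HL += hist_h[b]
        GR, HR = G - GL, H - HL
        obj = 0.5 * (np.sum(GL * GL / (HL + lam)) +
                     np.sum(GR * GR / (HR + lam)))
        if obj > best_obj:
            best_obj, best_bin = obj, b
    return best_obj, best_bin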

  Input: gradient statistics $(G_L, H_L, G_R, H_R)$ of the current split, output dimension $d$ and sparse constraint $k$.
  Output: gain of the current split
  $Q_L$, $Q_R$ $\leftarrow$ top-$k$ priority queues
  for $j \leftarrow 1$ to $d$ do
     append $G_{L,j}^2/(H_{L,j}+\lambda)$ to $Q_L$
     append $G_{R,j}^2/(H_{R,j}+\lambda)$ to $Q_R$
  end for
  gain $\leftarrow \sum_{v \in Q_L} v + \sum_{v \in Q_R} v$
Algorithm 5 Gain for Sparse Split Finding
  Input: gradient statistics $(G_L, H_L, G_R, H_R)$ of the current split, output dimension $d$ and sparse constraint $k$.
  Output: gain of the current split
  $Q$ $\leftarrow$ top-$k$ priority queue
  for $j \leftarrow 1$ to $d$ do
     append $G_{L,j}^2/(H_{L,j}+\lambda) + G_{R,j}^2/(H_{R,j}+\lambda)$ to $Q$
  end for
  gain $\leftarrow \sum_{v \in Q} v$
Algorithm 6 Gain for Restricted Sparse Split Finding
(a) Non-sparse split finding
(b) Sparse split finding
(c) Restricted sparse split finding
Fig. 2: Examples of different split finding algorithms, with samples in rows and outputs in columns, under a sparse constraint $k$. The dashed line divides samples into two parts. Numbers denote the improvements of the objectives for each column. The selected columns of the left part are marked in blue and the selected columns of the right part are marked in red. Best viewed on screen.

III-D Sparse Split Finding

We propose histogram-approximation-based sparse split finding algorithms. Compared with the non-sparse one, the key difference is how to compute the objective gain of a possible split:

$\text{gain} = \sum_{j=1}^{k} \frac{G_{L,\pi_L(j)}^2}{H_{L,\pi_L(j)} + \lambda} + \sum_{j=1}^{k} \frac{G_{R,\pi_R(j)}^2}{H_{R,\pi_R(j)} + \lambda}$ (24)

where $\pi_L$ and $\pi_R$ are the sorted orders of $G_{L,j}^2/(H_{L,j}+\lambda)$ and $G_{R,j}^2/(H_{R,j}+\lambda)$, respectively. The objective before the split is omitted because it is fixed for every possible split. When we scan over the columns, we maintain the top-$k$ columns of each part whose $G_j^2/(H_j+\lambda)$ is the largest. This can be achieved by a top-$k$ priority queue. We provide the algorithm in Algorithm 5.

In Algorithm 5, the sets of selected columns of the left and right parts are not necessarily the same. We may instead restrict those two sets to be identical; in other words, the selected columns of the two parts are shared. The objective gain in this case becomes:

$\text{gain} = \sum_{j=1}^{k} \left[ \frac{G_{L,\pi(j)}^2}{H_{L,\pi(j)} + \lambda} + \frac{G_{R,\pi(j)}^2}{H_{R,\pi(j)} + \lambda} \right]$ (25)

where $\pi$ is the sorted order of the column-wise sums $G_{L,j}^2/(H_{L,j}+\lambda) + G_{R,j}^2/(H_{R,j}+\lambda)$. We call this the restricted sparse split finding algorithm and describe its gain computation in Algorithm 6. The restricted variant has two advantages:

  • it has lower computational complexity because it only maintains a single top-$k$ priority queue;

  • it introduces a smoothness prior into the function space because it makes two child nodes with the same parent more similar.

We provide an example illustrating the differences between non-sparse, sparse and restricted sparse split finding in Fig. 2, and a sketch of the gain computation after this paragraph.
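Here is the promised sketch of the gain computations in (24) and (25); sorting replaces the top-k priority queues of Algorithms 5 and 6 for brevity, and the function signature is our own.

import numpy as np

def sparse_gain(GL, HL, GR, HR, k, lam=1.0, restricted=True):
    """Gain of one candidate split under the sparsity constraint k.

    Unrestricted (Algorithm 5): pick the top-k output columns for the left
    and right child independently. Restricted (Algorithm 6): both children
    share the same top-k columns, selected by the summed column scores.
    """
    left = GL * GL / (HL + lam)
    right = GR * GR / (HR + lam)
    if restricted:
        top = np.sort(left + right)[::-1][:k]
        return 0.5 * top.sum()
    return 0.5 * (np.sort(left)[::-1][:k].sum() +
                  np.sort(right)[::-1][:k].sum())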

  Input: set of samples $I$, gradient statistics, and hyper-parameters for tree learning.
  Output: a decision tree
  Hist $\leftarrow$ histograms of $I$
  (split, gain) $\leftarrow$ best split of the root found by Algorithm 4
  $Q$ $\leftarrow$ priority queue sorted by gain
  push (root, split, gain) into $Q$
  repeat
     (node, split, gain) $\leftarrow$ pop from $Q$
     divide the samples of node into left and right parts by split
     if stop condition is not met then
        construct histograms of the left and right children
        find the best splits and gains of both children by Algorithm 4
        push (left, split$_L$, gain$_L$) into $Q$
        push (right, split$_R$, gain$_R$) into $Q$
     end if
  until $Q$ is empty
Algorithm 7 Tree Growth

III-E Implementation Details

We implement GBDT-MO from scratch in C++ and provide a Python interface. We speed up our algorithm using multi-core parallelism, implemented with OpenMP. Several open-sourced GBDT libraries, such as XGBoost and LightGBM, are highly optimized and include advanced features such as distributed training and GPU training. Those advanced features are not included in our implementation; our code is currently intended for academic exploration. A decision tree is grown in a best-first manner, the same as in LightGBM. Specifically, we keep in memory the information of nodes that have not yet been divided. Each time we need to add a node, we select the stored node whose objective gain is the maximum. This is achieved by a priority queue. We describe the tree growth algorithm in Algorithm 7. In practice, we store up to a fixed number of nodes in memory to reduce the memory cost.
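A small Python sketch of this best-first growth, written around hypothetical helpers find_best_split(node) -> (gain, split) and split_node(node, split) -> (left, right); it only illustrates the priority-queue bookkeeping, not the actual C++ implementation.

import heapq

def grow_tree(root, find_best_split, split_node, max_leaves, min_gain):
    """Best-first tree growth (Algorithm 7): always expand the stored
    node with the largest objective gain."""
    gain, split = find_best_split(root)
    heap = [(-gain, 0, root, split)]          # max-heap via negated gain
    counter, n_leaves = 1, 1
    while heap and n_leaves < max_leaves:
        neg_gain, _, node, split = heapq.heappop(heap)
        if -neg_gain < min_gain:              # stop condition for this node
            continue
        left, right = split_node(node, split) # one division for all d outputs
        n_leaves += 1
        for child in (left, right):
            g, s = find_best_split(child)
            heapq.heappush(heap, (-g, counter, child, s))
            counter += 1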

III-F Complexity Analysis

We analyze the complexity of GBDT-MO and compare it with GBDT for a single variable. For training, we focus on the complexity of the split finding algorithms. Recall that the dimension of the input is $m$, the dimension of the output is $d$, and $B$ is the number of bins, which we suppose is fixed for all input dimensions. Then, the complexity of non-sparse split finding is $O(mBd)$ and the complexity of sparse split finding is $O(mBd \log k)$, where $k$ is the sparse constraint, because the complexity of inserting an element into a top-$k$ priority queue is $O(\log k)$. Thus, non-sparse split finding has lower time complexity than the sparse one. The total split finding complexity of GBDT for a single variable is also $O(mBd)$, the same as non-sparse split finding. However, this does not mean they are just as fast in practice. Beyond split finding, GBDT has more overhead, which slows its training down; for example, samples are divided into two parts $d$ times per boosting round for GBDT but only once for GBDT-MO. We also analyze the inference complexity given a sample. We suppose there are $T$ boosting rounds and the maximum height of each tree is $L$. For non-sparse split finding, inference requires $O(TL)$ comparisons and $O(Td)$ additions. For the sparse one, it requires $O(TL)$ comparisons and $O(Tk)$ additions. For standard GBDT, it requires $O(TLd)$ comparisons and $O(Td)$ additions, because one tree is constructed per output variable at each boosting round.

In summary, non-sparse split finding of GBDT-MO should be faster than standard GBDT in practice due to its lower overhead during training. For inference, GBDT-MO should be faster than standard GBDT in theory because the latter requires more comparisons.

IV Related Work

Gradient boosted decision tree (GBDT) proposed in [1] has received much attention due to its accuracy, efficiency and interpretability. GBDT has two characteristics: it uses decision trees as the base learner and its boosting process is guided by the gradient of some loss function. Many variants of [1] have been proposed. Instead of the first order gradient, XGBoost [7] also uses the second order gradient to guide its boost process and derives the corresponding objective for split finding. The histogram approximation of split finding is proposed in [16] which is used as the base algorithm in LightGBM [8]. Because the second order gradient improves the accuracy and histogram approximation improves the training efficiency, those two improvements are also used in the proposed GBDT-MO.

Since machine learning problems with multiple outputs have become common, many tree-based or boosting-based methods have been proposed to deal with multiple outputs. [18] [19] generalize the impurity measures defined for binary classification and ranking tasks to a multi-label scenario for splitting a node. However, they are random forest based methods; that is, new trees are not constructed in a boosting manner. Several works extend adaptive boosting (AdaBoost) to multi-label cases, such as AdaBoost.MH [20] and AdaBoost.LC [21]. The spirit of AdaBoost.MH and AdaBoost.LC is different from that of GBDT: at each step, a new base learner is trained from scratch on the re-weighted samples. Moreover, AdaBoost.MH only works for the Hamming loss, while AdaBoost.LC only works for the covering loss [21].

They do not belong to GBDT families. Two works which do belong to the GBDT family have been proposed for learning multiple outputs [14] [15]. [14] transforms the multiple output problem into a single output problem by kernelizing the output space. To achieve this, an $n \times n$ kernel matrix must be constructed, where $n$ is the number of training samples; thus, this method is not scalable. Moreover, it works only for the square loss. GBDT for sparse output (GBDT-sparse) is proposed in [15]. The outputs are represented in sparse format, and a sparse split finding algorithm is designed by adding an $L_0$ constraint to the objective. The sparse split finding algorithms of GBDT-MO are inspired by this work. There are several differences between GBDT-MO and GBDT-sparse:

  • GBDT-sparse focuses on extreme multi-label classification problems, whereas GBDT-MO focuses on general multiple output problems. GBDT-sparse requires the loss to be separable over output variables. It also requires the gradient to be sparse, i.e. to vanish for most output variables. During training, it introduces a clipping operator into the loss to maintain the sparsity of the gradient. These facts limit the types of loss that can be used.

  • GBDT-sparse does not employ the second order gradient. Its objective for split finding is derived based on the first order Taylor expansion of the loss, as in [1].

  • GBDT-sparse does not employ histogram approximation to speed up the training process. This is understandable because it is not clear how to construct sparse histograms, especially in combination with the second order gradient.

  • The main motivation of GBDT-sparse is to reduce the space complexity of training and the size of models, whereas the main motivation of GBDT-MO is to improve generalization ability by capturing correlations between output variables.

  • We propose a variant of sparse split finding that shares the selected columns between both parts.

It may be hard to store the outputs in memory without a sparse format when the number of classes is very large. In such a situation, GBDT-sparse is a better choice.

V Experiments

We evaluate GBDT-MO on multi-output regression, multi-class classification and multi-label classification problems. First, we show the benefits of GBDT-MO using two synthetic problems. Then, we evaluate GBDT-MO on six real-world datasets. Finally, we evaluate our sparse split finding algorithms. We denote by GBDT-SO our own implementation of GBDT for single output; except for the split finding algorithm, all implementation details are the same as GBDT-MO for a fair comparison. The purpose of our experiments is not to push state-of-the-art results on specific datasets. Instead, we aim to show that GBDT-MO has better generalization ability than GBDT-SO. On synthetic datasets, we compare GBDT-MO with GBDT-SO. On real-world datasets, we compare GBDT-MO with GBDT-SO and LightGBM, because LightGBM is a representative open-source GBDT library and some algorithmic components of our implementation, such as the histogram approximation and the best-first leaf growth, are similar to LightGBM. We provide the hyper-parameter settings in Appendix B. The training process is stopped when the performance does not improve within 25 rounds.

All experiments are done on a workstation with Intel Xeon CPU E5-2698 v4. On synthetic datasets, we use 4 threads. On real-world datasets, we use 8 threads. The source code of GBDT-MO is available on https://github.com/zzd1992/GBDTMO. Experiments of this paper are available on https://github.com/zzd1992/GBDTMO-EX.

friedman1 random projection
GBDT-SO 0.1540 0.0204
GBDT-MO 0.1429 0.0180
TABLE I: RMSE on synthetic datasets
(a) friedman1
(b) random projection
Fig. 3: RMSE curves on friedman1 and random projection. RMSE curves for training samples are drawn in solid lines, while RMSE curves for test samples are drawn in dashed lines. Best viewed on screen.
(a) MNIST
(b) Caltech101
Fig. 4: Cross entropy curves and accuracy curves on MNIST and Caltech101. Cross entropy curves for training samples are drawn in solid lines, while accuracy curves for test samples are drawn in dashed lines. Best viewed on the screen.

V-A Synthetic Datasets

The first dataset is derived from the friedman1 regression problem [22]. Its target is generated by the following process:

$t = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5$ (26)

$y_j = t + \epsilon_j, \quad \epsilon_j \sim \mathcal{N}(0, \sigma^2), \quad j = 1, \dots, d$ (27)

where $x \in \mathbb{R}^{10}$ and $y \in \mathbb{R}^d$. Each element of $x$ is sampled from $U(0,1)$. The last five elements of $x$ are irrelevant to the target. We extend this problem into the multiple output case by adding independent Gaussian noise to the target $t$, as in (27).

The second dataset is generated by random projection:

$y = Ax$ (28)

where $x \in \mathbb{R}^m$, $A \in \mathbb{R}^{d \times m}$ and $d > m$. Each element of $x$ and $A$ is independently sampled from the same distribution.

For both datasets, we generate separate training and test sets. We train via the mean square error (MSE) loss and evaluate the performance on test samples via the root mean square error (RMSE). We repeat the experiments 5 times with different seeds and average the RMSE. As shown in Table I, GBDT-MO is better than GBDT-SO. We provide the training curves in Fig. 3. For fairness, the curves of different methods are plotted with the same learning rate and maximum tree depth. It can be observed that GBDT-MO has better generalization ability on both datasets: its RMSE on test samples is lower and its performance gap is smaller. Here, the performance gap means the difference in performance between training samples and unseen samples, which is commonly used to measure the generalization ability of machine learning algorithms.

The output variables of friedman1 are correlated because they are observations of the same underlying variable corrupted by Gaussian noise. The output variables of random projection are also correlated because the projection is over-complete, i.e. the output dimension is larger than the input dimension. This supports our claim that GBDT-MO has better generalization ability because its learning mechanism encourages it to capture variable correlations. However, the RMSE curves also show that GBDT-MO converges more slowly.
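For reference, here is a sketch of how the multi-output friedman1 data described above can be generated; the noise scale and seed handling are illustrative assumptions, not the paper's exact settings.

import numpy as np

def friedman1_multi(n, d, noise=0.1, seed=0):
    """Multi-output variant of friedman1: every output is the same
    friedman1 target plus independent Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, 10))             # last 5 columns are irrelevant
    t = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) +
         20 * (X[:, 2] - 0.5) ** 2 + 10 * X[:, 3] + 5 * X[:, 4])
    Y = t[:, None] + noise * rng.standard_normal((n, d))
    return X, Y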

Dataset # Training samples # Test samples # Features # Outputs Problem type
MNIST 50,000 10,000 784 10 classification
Yeast 1,038 446 8 10 classification
Caltech101 6,073 2,604 324* 101 classification
MNIST-inpainting 50,000 10,000 200* 24 regression
Student-por 454 159 41* 3 regression
NUS-WIDE 161,789 107,859 128* 81 multi-label
TABLE II: Dataset statistics. * means the features are pre-processed.
MNIST Yeast Caltech101 NUS-WIDE
LightGBM 0.9803 0.6184 0.5610 0.4399
GBDT-SO 0.9808 0.6188 0.5649 0.4410
GBDT-MO 0.9830 0.6193 0.5846 0.4421
TABLE III: Accuracy on real-world datasets
MNIST-inpainting Student-por
LightGBM 0.26090 0.23976
GBDT-SO 0.26157 0.23872
GBDT-MO 0.26025 0.23829
TABLE IV: RMSE on real-world datasets
Fig. 5: Average training time in seconds for each boosting round. The four relatively larger datasets are used.
Restricted/Unrestricted MNIST MNIST-inpainting Caltech101 NUS-WIDE
0.619/0.599 - - -
0.737/0.758 0.149/0.179 - -
0.721/0.778 0.161/0.195 0.578/0.663 1.248/1.266
- 0.153/0.208 0.620/0.738 1.237/1.365
- - 0.628/0.866 1.256/1.453
- - 0.650/0.937 1.263/1.493
TABLE V: Training time for sparse split finding
Restricted/Unrestricted MNIST MNIST-inpainting Caltech101 NUS-WIDE
accuracy RMSE accuracy top-1 accuracy
0.9754/0.9779 - - -
0.9819/0.9818 0.26424/0.26368 - -
0.9807/0.9806 0.26245/0.26391 0.5422/0.5379 0.4420/0.4420
- 0.26232/0.26248 0.5621/0.5500 0.4422/0.4427
- - 0.5721/0.5706 0.4407/0.4404
- - 0.5863/0.5721 0.4423/0.4422
TABLE VI: Performance for sparse split finding

V-B Real-world Datasets

In this subsection, we evaluate GBDT-MO on six real-world datasets and compare with GBDT-SO in terms of test performance, generalization ability and training speed.

MNIST (yann.lecun.com/exdb/mnist/) is widely used for classification; its samples are gray images of handwritten digits from 0 to 9. There are 50,000 samples for training and 10,000 samples for test. Each sample is converted into a vector with 784 elements.

Yeast (archive.ics.uci.edu/ml/datasets/Yeast) has 8 input attributes, each of which is a measurement of the protein sequence. The goal is to predict protein localization sites with 10 possible choices. There are 1,484 samples in total.

Caltech101 (www.vision.caltech.edu/Image_Datasets/Caltech101/) contains images of objects belonging to 101 categories. There are 8,677 samples in total and most categories have about 50 samples. The image sizes vary. To obtain fixed-length features, we resize each image to a fixed size and compute the HOG descriptor [23] of the resized image. Each sample is finally converted into a vector with 324 elements.

MNIST-inpainting is a regression task based on MNIST. We crop the central $20\times20$ patch from the original image because most boundary pixels are 0. Then, the cropped image is divided into upper and lower halves, each of which is $10\times20$. The upper half is used as the input. We further crop a small patch of 24 pixels at the top center of the lower half, which is used as the target. The task is to predict the pixels of this patch given the upper half image.

Student-por (archive.ics.uci.edu/ml/datasets/Student+Performance) predicts the Portuguese language scores of students in three different grading periods based on their demographic, social and school related features. There are 613 samples in total. The original scores range from 0 to 20, and we linearly rescale them. We use one-hot coding to deal with the categorical features of this dataset.

NUS-WIDE (mulan.sourceforge.net/datasets-mlc.html) is a dataset for real-world web image retrieval [24]. [10] selects a subset of its labels and uses it for multi-label classification. There are 161,789 samples for training and 107,859 for test. Images are represented using 128-D cVLAD+ features described in [25].

We summarize the statistics of the above datasets in Table II. These datasets are diverse in terms of scale and complexity. For MNIST, MNIST-inpainting and NUS-WIDE, we use the official training-test split. For the others, we randomly split the samples into training and test sets (the split sizes are given in Table II). We repeat this strategy 5 times with different seeds and report the average results. For multi-class classification, we use the cross-entropy loss. For regression and multi-label classification, we use the MSE loss. For regression, the performance is measured by RMSE. For multi-class and multi-label classification, the performance is measured by top-1 accuracy.

RMSE and top-1 accuracy on the real-world datasets are shown in Tables IV and III, respectively. GBDT-MO is consistently better than GBDT-SO. The performance of LightGBM is provided as a reference; we suggest that readers focus on the comparison between GBDT-SO and GBDT-MO. Since we care about generalization ability, we show the loss curves and the accuracy curves of MNIST and Caltech101 in Fig. 4. We conclude that GBDT-MO has better generalization ability than GBDT-SO: its training loss is higher while its test performance is better.

We also compare the training speed of GBDT-SO and GBDT-MO. Specifically, we run 10 boost rounds and record the average training time for each boost round. We repeat this process three times and report the average training time in seconds as shown in Fig. 5. The training speed for Yeast and Student-por is not reported here because those two datasets are too small. GBDT-MO is remarkably faster than GBDT-SO, especially when the number of outputs is large.

LightGBM is also much faster than GBDT-SO due to its better engineering and optimization. Its training speed is not reported because we focus on the speed difference caused by the learning mechanism for multiple outputs.

V-C Sparse Split Finding

All experiments on GBDT-MO in Sections V-A and V-B use the non-sparse split finding algorithm. In this section, we evaluate our unrestricted sparse split finding algorithm (Algorithm 5) and restricted sparse split finding algorithm (Algorithm 6). We compare their performance and training speed on MNIST, MNIST-inpainting, Caltech101 and NUS-WIDE with different values of the sparse constraint $k$.

Their training times in seconds are shown in Table V. The restricted algorithm is consistently faster than the unrestricted one, and their speed difference is remarkable when the sparse constraint $k$ is large. The performance on test samples is shown in Table VI. The restricted algorithm is slightly better than the unrestricted one. Note that the performance of our sparse split finding algorithms is sometimes better than that of the non-sparse one with a proper $k$ (for example, on Caltech101). We do not further tune their hyper-parameters; they are the same as for the corresponding non-sparse algorithm.

VI Conclusions

In this paper, we have proposed a general method to learn GBDT for multiple outputs. The motivation of GBDT-MO is to capture the correlations between output variables. We have derived the approximate learning objective for both the non-sparse case and the sparse case based on the second order Taylor expansion of the loss. For the sparse case, we have proposed a restricted algorithm, which forces the selected output subsets of the left and right parts to be the same, and an unrestricted algorithm, which has no such restriction. We have extended histogram approximation to the multiple output case and used it to speed up the training process. Our experiments show that GBDT-MO is remarkably and consistently better in generalization ability and faster in training speed than GBDT for single output. They also show that the restricted sparse split finding algorithm is slightly better than the unrestricted one when considering both performance and training speed. However, GBDT-MO suffers from slower convergence, especially at the beginning of training. In future work, we would like to improve its convergence speed.

Learning rate/Maximum depth LightGBM GBDT-SO GBDT-MO
MNIST 0.25/6 0.10/6 0.10/8
Yeast 0.10/4 0.10/5 0.25/6
Caltech101 0.10/8 0.05/8 0.10/9
NUS-WIDE 0.05/8 0.05/8 0.10/8
MNIST-inpainting 0.10/8 0.10/8 0.10/7
Student-por 0.10/4 0.05/4 0.10/5
TABLE VII: Grid searched learning rate and maximum depth
MNIST Yeast Caltech101 NUS-WIDE MNIST-inpainting Student-por
Gain threshold 1e-3 1e-3 1e-3 1e-6 1e-6 1e-6
Min samples in leaf 16 4 16 8 4 4
Max bins 8 32 32 64 16 8
Histogram cache 16 16 16 48 48 48
TABLE VIII: Preliminarily searched hyper-parameters

References

  • [1] Jerome H Friedman, “Greedy function approximation: A gradient boosting machine.,” Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
  • [2] Jerome H Friedman, “Stochastic gradient boosting,” Computational Statistics and Data Analysis, vol. 38, no. 4, pp. 367–378, 2002.
  • [3] Ping Li, “Robust logitboost and adaptive base class (abc) logitboost,” pp. 302–311, 2010.
  • [4] Christopher JC Burges, “From ranknet to lambdarank to lambdamart: An overview,” Learning, vol. 11, no. 23-581, pp. 81, 2010.
  • [5] Matthew Richardson, Ewa Dominowska, and Robert J Ragno, “Predicting clicks: estimating the click-through rate for new ads,” the web conference, pp. 521–530, 2007.
  • [6] James Bennett, Stan Lanning, et al., “The netflix prize,” in Proceedings of KDD cup and workshop. New York, NY, USA., 2007, vol. 2007, p. 35.
  • [7] Tianqi Chen and Carlos Guestrin, “Xgboost: A scalable tree boosting system,” Knowledge Discovery and Data Mining, pp. 785–794, 2016.
  • [8] Guolin Ke, Qi Meng, Thomas William Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tieyan Liu, “Lightgbm: a highly efficient gradient boosting decision tree,” in Advances in Neural Information Processing Systems, 2017, pp. 3149–3157.
  • [9] Liudmila Ostroumova Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin, “Catboost: unbiased boosting with categorical features,” 2018, pp. 6638–6648.
  • [10] Grigorios Tsoumakas, Eleftherios Spyromitros-Xioufis, Jozef Vilcek, and Ioannis Vlahavas, “Mulan: A java library for multi-label learning,” Journal of Machine Learning Research, vol. 12, pp. 2411–2414, 2011.
  • [11] Hanen Borchani, Gherardo Varando, Concha Bielza, and Pedro Larranaga, “A survey on multi-output regression,” Data Mining and Knowledge Discovery, vol. 5, no. 5, pp. 216–233, 2015.
  • [12] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org.
  • [13] Geoffrey E Hinton, Oriol Vinyals, and Jeffrey Dean, “Distilling the knowledge in a neural network,” arXiv: Machine Learning, 2015.
  • [14] Pierre Geurts, Louis Wehenkel, and Florence Dalchebuc, “Gradient boosting for kernelized output spaces,” in Proceedings of the International Conferenece on Machine Learning, 2007, pp. 289–296.
  • [15] Si Si, Huan Zhang, S Sathiya Keerthi, Dhruv Mahajan, Inderjit S Dhillon, and Chojui Hsieh, “Gradient boosted decision trees for high dimensional sparse output,” in Proceedings of the International Conferenece on Machine Learning, 2017, pp. 3182–3190.
  • [16] Stephen Tyree, Kilian Q Weinberger, Kunal Agrawal, and Jennifer Paykin, “Parallel boosted regression trees for web search ranking,” in Proceedings of the International Conference on World Wide Web, 2011, pp. 387–396.
  • [17] Tianqi Chen, Sameer Singh, Ben Taskar, and Carlos Guestrin, “Efficient second-order gradient boosting for conditional random fields,” in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2015, pp. 147–155.
  • [18] Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma, “Multi-label learning with millions of labels: recommending advertiser bid phrases for web pages,” in Proceedings of the International Conference on World Wide Web, 2013, pp. 13–24.
  • [19] Yashoteja Prabhu and Manik Varma, “Fastxml: a fast, accurate and stable tree-classifier for extreme multi-label learning,” in Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2014, pp. 263–272.
  • [20] Robert E Schapire and Yoram Singer, “Improved boosting algorithms using confidence-rated predictions,” in Proceedings of the International Conference on Learning Theory, 1998, vol. 37, pp. 80–91.
  • [21] Yonatan Amit, Ofer Dekel, and Yoram Singer, “A boosting algorithm for label covering in multilabel problems,” in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2007, pp. 27–34.
  • [22] Jerome H Friedman, “Multivariate adaptive regression splines,” Annals of Statistics, vol. 19, no. 1, pp. 1–67, 1991.
  • [23] Navneet Dalal and Bill Triggs, “Histograms of oriented gradients for human detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005, vol. 1, pp. 886–893.
  • [24] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yan-Tao Zheng, “Nus-wide: A real-world web image database from national university of singapore,” in Proceedings of the ACM Conference on Image and Video Retrieval, Santorini, Greece, July 8-10, 2009.
  • [25] Eleftherios Spyromitrosxioufis, Symeon Papadopoulos, Ioannis Kompatsiaris, Grigorios Tsoumakas, and Ioannis P Vlahavas, “A comprehensive study over vlad and product quantization in large-scale image retrieval,” IEEE Transactions on Multimedia, vol. 16, no. 6, pp. 1713–1728, 2014.

Appendix A Approximated Objective

We suppose there exists a constant $\rho \ge 1$ such that, for any $u \in \mathbb{R}^d$ and any sample $i$,

$u^{\top} H_{(i)} u \le \rho\, u^{\top} \mathrm{diag}(h_{(i)})\, u$ (29)

That is, $H_{(i)}$ is dominated by its diagonal elements. Then, we have:

$\sum_i w^{\top} H_{(i)} w \le \rho \sum_i \sum_{j=1}^{d} h_{(i),j} w_j^2$ (30)

The inequality holds based on the assumption in (29). Substituting this result into the second order Taylor expansion of the loss, we obtain an upper bound of (15):

$\mathcal{L} \le \sum_i \left[ g_{(i)}^{\top} w + \frac{\rho}{2} w^{\top} \mathrm{diag}(h_{(i)})\, w \right] + \frac{\lambda}{2} \|w\|_2^2$ (31)

Minimizing this upper bound of $\mathcal{L}$ w.r.t. $w$, we get the optimal value of $w$:

$w_j^* = -\frac{\sum_i g_{(i),j}}{\rho \sum_i h_{(i),j} + \lambda}, \quad j = 1, \dots, d$ (32)

Recall that $h_{(i)}$ is the vector of diagonal elements of $H_{(i)}$. This solution is the same as (18), except for the coefficient $\rho$. In practice, the actual leaf weight is $\eta w^*$, where $\eta$ is the so-called learning rate. The effect of $\rho$ can therefore be canceled out by adjusting $\eta$ and $\lambda$. Thus, we do not need to consider the exact value of $\rho$ in practice.

Appendix B Hyper-parameter Settings

We discuss our hyper-parameter settings. We find the best maximum depth and learning rate via a grid search over a small set of candidate values; the selected maximum depth and learning rate are provided in Table VII. We set the $L_2$ regularization $\lambda$ to 1.0 and fix the maximum number of leaves in all experiments. We preliminarily search the other hyper-parameters and fix them for all methods; we list them for the real-world datasets in Table VIII. Note that the hyper-parameters in Table VIII may not be optimal. When comparing against the gain threshold, we use the average gain over the involved output variables.
