GBDT-MO: Gradient-Boosted Decision Trees for Multiple Outputs
Abstract
Gradient boosted decision trees (GBDTs) are widely used in machine learning, and the output of current GBDT implementations is a single variable. When there are multiple outputs, GBDT constructs multiple trees, each corresponding to one output variable. The correlations between variables are ignored by such a strategy, causing redundancy in the learned tree structures. In this paper, we propose a general method to learn GBDT for multiple outputs, called GBDT-MO. Each leaf of GBDT-MO constructs predictions of all variables or a subset of automatically selected variables. This is achieved by considering the summation of objective gains over all output variables. Moreover, we extend histogram approximation to the multiple-output case and use it to speed up the training process. Various experiments on synthetic and real-world datasets verify that the learning mechanism of GBDT-MO plays a role in indirect regularization. Our code is available online.
I Introduction
Machine learning and data-driven approaches have achieved great success in recent years. Gradient boosted decision trees (GBDTs) [1] [2] are a powerful machine learning tool widely used in many applications such as multi-class classification [3], learning to rank [4] and click prediction [5]. They have also produced state-of-the-art results in many data mining competitions such as the Netflix prize [6]. GBDT uses decision trees as the base learner and sums the predictions of a series of trees. At each step, a new decision tree is trained to fit the residual between the ground truth and the current prediction. GBDT is popular due to its accuracy, efficiency and interpretability. Many improvements have been proposed after [1]. XGBoost [7] used the second-order gradient to guide the boosting process and improve the accuracy. LightGBM [8] aggregated gradient information in histograms and significantly improved the training efficiency. CatBoost [9] proposed a novel strategy to deal with categorical features.
A limitation of current GBDT implementations is that the output of each decision tree is a single variable, because each leaf of a decision tree produces a single value. However, multiple outputs are required in many machine learning problems, including multi-class classification, multi-label classification [10] and multi-output regression [11]. Other machine learning methods, such as neural networks [12], adapt to any output dimension straightforwardly by changing the number of neurons in the last layer. This flexibility in the output dimension may be one of the reasons why neural networks are popular. In contrast, handling multiple outputs with current GBDT implementations is awkward. At each step, they construct multiple decision trees, each of which corresponds to an individual variable of the output, and then concatenate the predictions of all trees to obtain multiple outputs. This strategy is used in the most popular open-sourced GBDT libraries: XGBoost [7], LightGBM [8] and CatBoost [9].
The major drawback of the above-mentioned strategy is that correlations between output variables are ignored during the training process, because the variables are treated in isolation and learned independently. However, correlations more or less exist between output variables; for example, there are correlations between classes in multi-class classification. It is verified in [13] that such correlations improve the generalization ability of neural networks. Ignoring variable correlations also leads to redundancy in the learned tree structures. Thus, it is necessary to learn GBDT for multiple outputs via better strategies. Up to now, only a few works have explored this direction [14] [15]. [14] transformed the multiple-output problem into a single-output problem by kernelizing the output space. However, this method is not scalable because the space complexity of its kernel matrix is $O(n^2)$, where $n$ is the number of training samples. [15] proposed GBDT for sparse output, mainly focusing on extreme multi-label classification problems where the outputs are represented in sparse format; a sparse split finding algorithm was designed for the squared hinge loss. Both [14] and [15] work only for specific losses, and they employ neither the second-order gradient nor histogram approximation.
In this paper, we propose a novel and general method to learn GBDT for multiple outputs, named GBDT-MO, which is scalable and efficient. Unlike previous works, we employ the second-order gradient and histogram approximation. The learning mechanism is designed based on them to jointly fit all output variables in a single tree: each leaf of a decision tree constructs multiple outputs at once, which is achieved by maximizing the summation of objective gains over all output variables. Sometimes, only a subset of the output variables is correlated. We expect the proposed method to automatically select those variables and construct predictions for them at a leaf, which we achieve by adding an $L_0$ constraint to the objective function. Since the learning mechanism of GBDT-MO enforces the learned trees to capture variable correlations, it plays a role in indirect regularization. Experiments on both synthetic and real-world datasets show that GBDT-MO achieves better generalization ability than the standard GBDT, indicating that the learning mechanism indeed acts as an indirect regularizer. Moreover, GBDT-MO achieves fast training speed, especially when the number of outputs is large.
Compared with existing methods, the main contributions of this paper are as follows:

We formulate the problem of learning multiple outputs for GBDT and propose a split finding algorithm by deriving a general approximate objective for this problem.

To learn a subset of outputs, we add a sparse ($L_0$) constraint to the objective and provide two sparse split finding algorithms.

We extend histogram approximation [16] to the multiple-output case and use it to speed up the training process.
The rest of this paper is organized as follows. First, we review GBDT for a single output and introduce basic definitions in Section II. Then, we describe the details of GBDT-MO in Section III. We address related work in Section IV. Finally, we present experiments and conclusions in Sections V and VI, respectively.
II GBDT for Single Output
In this section, we review GBDT for a single output. First, we show how to derive the objective of GBDT based on the second-order Taylor expansion of the loss, as used in XGBoost. This objective will be generalized to the multiple-output case in Section III. Then, we explain the split finding algorithms which exactly or approximately minimize the objective.
II-A Objective
Denote $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ as a dataset with $n$ samples, where $\mathbf{x}_i \in \mathbb{R}^m$ is an $m$-dimensional input. Denote $T(\cdot)$ as the function of a decision tree which maps $\mathbf{x}_i$ into a scalar. Based on the construction mechanism of decision trees, $T$ can be further expressed as follows:

$$T(\mathbf{x}) = w_{q(\mathbf{x})}, \qquad q: \mathbb{R}^m \to \{1, \dots, L\} \tag{1}$$

where $L$ is the number of leaves of a decision tree, $q$ is a function which selects a leaf given $\mathbf{x}$, and $w_j$ is the value of the $j$-th leaf. That is, once a decision tree is constructed, it first maps an input into a leaf, then returns the value of that leaf. Since GBDT integrates decision trees in an additive manner, the prediction of GBDT is as follows:

$$\hat{y}_i = \sum_{t=1}^{k} T_t(\mathbf{x}_i) \tag{2}$$

where $T_t$ is the function of the $t$-th decision tree.

Now, we consider the objective of the $(k{+}1)$-th decision tree given $\hat{y}_i$:

$$\mathcal{O} = \sum_{i=1}^{n} l(y_i, \hat{y}_i + T(\mathbf{x}_i)) + \gamma\, \Omega(T) \tag{3}$$

where the first term is the fidelity term, $\Omega(T)$ is the regularization term of $T$, and $\gamma$ controls the trade-off between the two terms. We suppose $l$ is a second-order differentiable loss. Because $T$ takes a constant value on each leaf, the fidelity term of (3) is separable w.r.t. each leaf. Then (3) is rewritten as follows:

$$\mathcal{O} = \sum_{j=1}^{L} \Big[ \sum_{i \in I_j} l(y_i, \hat{y}_i + w_j) + \gamma\, \Omega(w_j) \Big] \tag{4}$$

where $I_j$ is the set of samples belonging to the $j$-th leaf. Although there are many choices of $\Omega$, we set $\Omega(w) = \frac{1}{2} w^2$, which is commonly used. Because (4) is separable w.r.t. each leaf, we only consider the objective of a single leaf as follows:

$$\mathcal{O}_{\text{leaf}} = \sum_{i} l(y_i, \hat{y}_i + w) + \frac{\gamma}{2} w^2 \tag{5}$$

where $w$ is the value of a leaf and $i$ is enumerated over the samples belonging to that leaf. $l(y_i, \hat{y}_i + w)$ can be approximated by its second-order Taylor expansion around $\hat{y}_i$. Then, we have

$$\mathcal{O}_{\text{leaf}} \approx \sum_{i} \Big[ l(y_i, \hat{y}_i) + g_i w + \frac{1}{2} h_i w^2 \Big] + \frac{\gamma}{2} w^2 \tag{6}$$

where $g_i = \partial l(y_i, \hat{y}_i)/\partial \hat{y}_i$ and $h_i = \partial^2 l(y_i, \hat{y}_i)/\partial \hat{y}_i^2$ are the first- and second-order derivatives of $l$ w.r.t. $\hat{y}_i$. By setting $\partial \mathcal{O}_{\text{leaf}}/\partial w$ to $0$, we obtain the optimal value of $w$ as follows:

$$w^{*} = -\frac{\sum_i g_i}{\sum_i h_i + \gamma} \tag{7}$$

Substituting (7) into (6), we get the optimal objective as follows:

$$\mathcal{O}^{*}_{\text{leaf}} = -\frac{1}{2} \frac{\left(\sum_i g_i\right)^2}{\sum_i h_i + \gamma} \tag{8}$$

We ignore $\sum_i l(y_i, \hat{y}_i)$ in (8) since it is a constant given $\hat{y}_i$.
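As a concrete illustration of (7) and (8), the following minimal Python sketch (our own illustration, not the authors' code; all names are ours) computes the optimal leaf value and objective from the gradient statistics of the samples in a leaf:

```python
import numpy as np

def leaf_value_and_objective(g, h, gamma=1.0):
    """g, h: 1-D arrays of per-sample first/second order gradients in one leaf."""
    G, H = g.sum(), h.sum()
    w_star = -G / (H + gamma)              # Eq. (7)
    obj_star = -0.5 * G * G / (H + gamma)  # Eq. (8), constant term dropped
    return w_star, obj_star

# For the squared loss l(y, p) = 0.5 * (y - p)**2: g = p - y, h = 1.
y = np.array([1.0, 2.0, 3.0])
p = np.zeros_like(y)                       # current prediction
w, obj = leaf_value_and_objective(p - y, np.ones_like(y))
```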
II-B Split Finding
One of the most important problems in decision tree learning is to find the best split given a set of samples. Specifically, samples are divided into left and right parts based on the following rule:

$$\mathbf{x}_i \in \begin{cases} \text{left part}, & x_{i,u} \le \tau \\ \text{right part}, & x_{i,u} > \tau \end{cases} \tag{9}$$

where $x_{i,u}$ is the $u$-th element of $\mathbf{x}_i$ and $\tau$ is the threshold. The goal of split finding algorithms is to find the column $u$ and threshold $\tau$ such that the gain between the optimal objectives before and after the split is maximized:

$$\text{Gain} = \mathcal{O}^{*}_{\text{parent}} - \big(\mathcal{O}^{*}_{\text{left}} + \mathcal{O}^{*}_{\text{right}}\big) \tag{10}$$

where the optimal objective after the split is defined as the sum of the optimal objectives of the left and right parts. Maximizing $\text{Gain}$ is equivalent to minimizing $\mathcal{O}^{*}_{\text{left}} + \mathcal{O}^{*}_{\text{right}}$, i.e. the optimal objective after the split, because $\mathcal{O}^{*}_{\text{parent}}$ is fixed for a given set of samples. $\text{Gain}$ is also used to determine whether a tree is grown: if $\text{Gain}$ is smaller than a threshold, we stop the growth to avoid over-fitting.
Exact and approximate split finding algorithms have been developed. The exact algorithm enumerates all possible splits on all columns. For efficiency, it first sorts the samples according to the values of each column and then visits the samples in sorted order to accumulate $\sum g_i$ and $\sum h_i$, which are used to compute the optimal objective. Once samples are divided into two parts, one might re-sort the samples in both parts, which is time-consuming. Fortunately, this can be avoided by storing the sort order in memory: the relative orders of the left and right parts are contained in the order before the split, so we obtain the orders of both parts by scanning the pre-split order once, i.e. in linear time.
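The following sketch illustrates this single sorted scan for one column (an illustration under simplifying assumptions: thresholds are placed at sample values, and ties between equal feature values are ignored for brevity):

```python
import numpy as np

def best_split_exact(x, g, h, gamma=1.0):
    """x: one input column; g, h: per-sample gradient statistics."""
    order = np.argsort(x)
    G, H = g.sum(), h.sum()
    parent = -0.5 * G * G / (H + gamma)    # Eq. (8) for the parent node
    gl = hl = 0.0
    best_gain, best_tau = 0.0, None
    for j in order[:-1]:                   # last position would leave the right part empty
        gl += g[j]; hl += h[j]
        gr, hr = G - gl, H - hl
        after = -0.5 * (gl * gl / (hl + gamma) + gr * gr / (hr + gamma))
        gain = parent - after              # Eq. (10)
        if gain > best_gain:
            best_gain, best_tau = gain, x[j]
    return best_gain, best_tau
```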
The exact split finding algorithm is accurate. However, when the number of samples is large, it is time-consuming to enumerate all possible splits, so approximate algorithms become necessary. The key idea of approximate algorithms is to divide samples into buckets and enumerate over these buckets instead of individual samples. The gradient statistics are accumulated within each bucket, so the complexity of the enumeration is independent of the number of samples. In the literature, there are two bucketing strategies: quantile-based and histogram-based. Since the latter is significantly faster than the former [8], we focus on it in this work. For histogram-based bucketing, a bucket is called a bin. Samples are divided into bins by adjacent intervals:
$$\{ [z_{b-1}, z_b) \}_{b=1}^{B} \tag{11}$$

where $z_0$ and $z_B$ are usually set to $-\infty$ and $+\infty$, respectively. These intervals are constructed based on the distribution of the $u$-th input column over the whole dataset. Once constructed, they are kept unchanged. Given a sample $\mathbf{x}_i$, it belongs to the $b$-th bin if and only if $z_{b-1} \le x_{i,u} < z_b$. The bin value of $x_{i,u}$ is obtained by binary search with complexity $O(\log B)$. Given the bin values, the histogram of the $u$-th input column is constructed by a single fast scan as described in Algorithm 1. When samples are divided into two parts, it may be unnecessary to construct the histograms of both parts. One can store the histogram of the parent node in memory and construct the histogram of one part only. Then, the histogram of the other part is obtained from the following relation:

$$\text{Hist}_{\text{right}} = \text{Hist}_{\text{parent}} - \text{Hist}_{\text{left}} \tag{12}$$
This trick can reduce the running time of histogram construction by at least half.
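A minimal sketch of the idea (not Algorithm 1 itself; names are ours): bin indices are assumed precomputed by the binary search above, the per-bin gradient sums are accumulated in one scan, and a sibling histogram follows from (12) by subtraction:

```python
import numpy as np

def build_histogram(bin_idx, g, h, n_bins):
    """bin_idx: precomputed integer bin index of each sample (Eq. 11)."""
    hist_g = np.zeros(n_bins)
    hist_h = np.zeros(n_bins)
    np.add.at(hist_g, bin_idx, g)   # one scan over the samples
    np.add.at(hist_h, bin_idx, h)
    return hist_g, hist_h

# Sibling histogram by subtraction, Eq. (12):
# hist_g_right = hist_g_parent - hist_g_left   (and likewise for hist_h)
```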
III GBDT for Multiple Outputs
In this section, we describe GBDT-MO in detail. We first formulate the general problem of learning GBDT for multiple outputs. Specifically, we derive the objective for learning multiple outputs based on the second-order Taylor expansion of the loss, approximate this objective, and connect it with the single-output objective. We also formulate the problem of learning a subset of variables and derive its objective by adding an $L_0$ constraint. Then, we propose split finding algorithms that minimize the corresponding objectives. Finally, we discuss our implementation and analyze the complexity of the proposed split finding algorithms. In this work, we denote $W_{i,:}$ as the $i$-th row of a matrix $W$, $W_{:,j}$ as its $j$-th column, and $W_{i,j}$ as its element in the $i$-th row and $j$-th column.
III-A Objective
We derive the objective of GBDT-MO. Each leaf of a decision tree constructs multiple outputs. Denote $D = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{n}$ as a dataset with $n$ samples, where $\mathbf{x}_i$ is an $m$-dimensional input and $\mathbf{y}_i$ is a $d$-dimensional output instead of a scalar. Denote $T(\cdot)$ as the function of a decision tree which maps $\mathbf{x}_i$ into the output space. Based on the construction mechanism of decision trees, $T$ can be further expressed as follows:

$$T(\mathbf{x}) = W_{q(\mathbf{x}),:}, \qquad W \in \mathbb{R}^{L \times d} \tag{13}$$

where $L$ is the number of leaves of a decision tree, $q$ is a function which selects a leaf given $\mathbf{x}$, and $W_{j,:}$ contains the values of the $j$-th leaf. That is, once a decision tree is constructed, it first maps an input into a leaf, then returns the $d$-dimensional vector of that leaf. Then, the prediction of the first $k$ trees is $\hat{\mathbf{y}}_i = \sum_{t=1}^{k} T_t(\mathbf{x}_i)$.

We consider the objective of the $(k{+}1)$-th tree given $\hat{\mathbf{y}}_i$. Because it is separable w.r.t. each leaf (see (3) and (4)), we only consider the objective of a single leaf as follows:

$$\mathcal{O}_{\text{leaf}} = \sum_{i} l(\mathbf{y}_i, \hat{\mathbf{y}}_i + \mathbf{w}) + \frac{\gamma}{2} \|\mathbf{w}\|_2^2 \tag{14}$$

We highlight that $\mathbf{w}$ is a vector with $d$ elements which belongs to a leaf. Again, we suppose $l$ is a second-order differentiable function. $l(\mathbf{y}_i, \hat{\mathbf{y}}_i + \mathbf{w})$ can be approximated by its second-order Taylor expansion around $\hat{\mathbf{y}}_i$. Dropping the constant term $\sum_i l(\mathbf{y}_i, \hat{\mathbf{y}}_i)$, we have:

$$\mathcal{O}_{\text{leaf}} \approx \sum_{i} \Big[ \mathbf{g}_{(i)}^{T} \mathbf{w} + \frac{1}{2} \mathbf{w}^{T} \mathbf{H}_{(i)} \mathbf{w} \Big] + \frac{\gamma}{2} \mathbf{w}^{T} \mathbf{w} \tag{15}$$

where $\mathbf{g}_{(i)} = \partial l / \partial \hat{\mathbf{y}}_i \in \mathbb{R}^{d}$ and $\mathbf{H}_{(i)} = \partial^2 l / \partial \hat{\mathbf{y}}_i^2 \in \mathbb{R}^{d \times d}$. To avoid notation conflicts with the subscripts of vectors or matrices, we use $(i)$ to indicate that an object belongs to the $i$-th sample. This notation is omitted when there is no ambiguity. By setting $\partial \mathcal{O}_{\text{leaf}} / \partial \mathbf{w} = \mathbf{0}$ for (15), we obtain the optimal leaf values:

$$\mathbf{w}^{*} = -\Big( \sum_i \mathbf{H}_{(i)} + \gamma \mathbf{I} \Big)^{-1} \sum_i \mathbf{g}_{(i)} \tag{16}$$

where $\mathbf{I}$ is a $d \times d$ identity matrix. By substituting $\mathbf{w}^{*}$ into (15) and ignoring the constant term, we get the optimal objective as follows:

$$\mathcal{O}^{*}_{\text{leaf}} = -\frac{1}{2} \Big( \sum_i \mathbf{g}_{(i)} \Big)^{T} \Big( \sum_i \mathbf{H}_{(i)} + \gamma \mathbf{I} \Big)^{-1} \sum_i \mathbf{g}_{(i)} \tag{17}$$
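For illustration, here is a sketch of the exact solution (16) and (17) (our own code, not part of GBDT-MO; the paper argues below that this exact form is impractical for large $d$):

```python
import numpy as np

def exact_multi_output_leaf(g, H, gamma=1.0):
    """g: (n, d) per-sample gradient vectors; H: (n, d, d) per-sample Hessians."""
    g_sum = g.sum(axis=0)                          # sum_i g_(i)
    A = H.sum(axis=0) + gamma * np.eye(g.shape[1]) # sum_i H_(i) + gamma * I
    w_star = -np.linalg.solve(A, g_sum)            # Eq. (16)
    obj_star = 0.5 * g_sum @ w_star                # Eq. (17): -1/2 g^T A^{-1} g
    return w_star, obj_star
```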
We have derived the optimal leaf values and the optimal objective for multiple outputs. Comparing (16) with (7) and (17) with (8), it is easy to see that this is a natural generalization of the single-output case. In fact, when the loss function is separable w.r.t. different output dimensions, or equivalently, when its hessian matrix is diagonal, each element of $\mathbf{w}^{*}$ is obtained in the same way as in (7):

$$w_j^{*} = -\frac{\sum_i g_{(i),j}}{\sum_i h_{(i),j} + \gamma} \tag{18}$$

where $h_{(i),j}$ is the $j$-th diagonal element of $\mathbf{H}_{(i)}$. And the optimal objective in (17) can be expressed as the sum of objectives over all output dimensions:

$$\mathcal{O}^{*}_{\text{leaf}} = -\frac{1}{2} \sum_{j=1}^{d} \frac{\big( \sum_i g_{(i),j} \big)^2}{\sum_i h_{(i),j} + \gamma} \tag{19}$$
GBDT-MO and GBDT are different even when $\mathbf{H}$ is diagonal, because GBDT-MO considers the objectives of all output variables at the same time.
However, it is problematic when $l$ is not separable or, equivalently, when $\mathbf{H}$ is non-diagonal. First, it is difficult to store $\mathbf{H}_{(i)}$ for every sample when the output dimension is large. Second, to get the optimal objective for each possible split, it is required to compute the inverse of a $d \times d$ matrix, which is time-consuming. Thus, it is impractical to learn GBDT-MO using the exact objective in (17) and the exact leaf values in (16). Fortunately, it is shown in [17] that (18) and (19) are good approximations of the exact leaf values and the exact objective when the diagonal elements of $\mathbf{H}$ are dominant; (18) and (19) can be derived from an upper bound of (15). See Appendix A for details and further discussions. Although it is possible to derive better approximations for some specific loss functions, we use the above diagonal approximation in this work. We leave better approximations as future work.
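The diagonal approximation thus reduces each leaf to $d$ independent single-output problems, as in this sketch (our own illustration, with our own names):

```python
import numpy as np

def diagonal_leaf(g_sum, h_sum, gamma=1.0):
    """g_sum, h_sum: length-d sums of gradients / diagonal Hessian entries
    over the samples in one leaf."""
    w = -g_sum / (h_sum + gamma)                        # Eq. (18), per output
    obj = -0.5 * np.sum(g_sum ** 2 / (h_sum + gamma))   # Eq. (19)
    return w, obj
```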
III-B Sparse Objective
In Section III-A, we defined the objective as the sum over all output variables because there are correlations between variables. However, in practice not all variables but only a subset of them may be correlated. Thus, we may learn the values of a suitable subset of $\mathbf{w}$ while setting the others to 0. This can be achieved by adding an $L_0$ constraint to the objective. Based on the diagonal approximation, the optimal sparse leaf values are as follows:

$$\mathbf{w}^{*} = \arg\min_{\mathbf{w}} \sum_{j=1}^{d} \Big[ G_j w_j + \frac{1}{2} (H_j + \gamma) w_j^2 \Big] \quad \text{s.t.} \quad \|\mathbf{w}\|_0 \le k \tag{20}$$

where $k$ is the maximum number of non-zero elements of $\mathbf{w}$. We denote $G_j = \sum_i g_{(i),j}$ and $H_j = \sum_i h_{(i),j}$ to simplify the notation. Note the difference between the scalar $H_j$ and the matrix $\mathbf{H}_{(i)}$.

For the $j$-th element of $\mathbf{w}$, its value is either $-G_j / (H_j + \gamma)$ or 0. Accordingly, the objective contributed by the $j$-th column is either $-\frac{1}{2} \frac{G_j^2}{H_j + \gamma}$ or 0. Thus, to minimize (20), we select the $k$ columns with the largest $\frac{G_j^2}{H_j + \gamma}$. Let $\sigma$ be the sorted order of $\frac{G_j^2}{H_j + \gamma}$ such that:

$$\frac{G_{\sigma(1)}^2}{H_{\sigma(1)} + \gamma} \;\ge\; \frac{G_{\sigma(2)}^2}{H_{\sigma(2)} + \gamma} \;\ge\; \cdots \;\ge\; \frac{G_{\sigma(d)}^2}{H_{\sigma(d)} + \gamma} \tag{21}$$

Then, the solution of (20) is as follows:

$$w_{\sigma(j)}^{*} = \begin{cases} -\dfrac{G_{\sigma(j)}}{H_{\sigma(j)} + \gamma}, & j \le k \\ 0, & j > k \end{cases} \tag{22}$$

That is, the $k$ columns with the largest $\frac{G_j^2}{H_j + \gamma}$ keep their values while the others are set to 0. The corresponding optimal objective is as follows:

$$\mathcal{O}^{*}_{\text{leaf}} = -\frac{1}{2} \sum_{j=1}^{k} \frac{G_{\sigma(j)}^2}{H_{\sigma(j)} + \gamma} \tag{23}$$
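A sketch of this top-$k$ selection (our own code, using NumPy's argpartition for the selection; it assumes $1 \le k \le d$):

```python
import numpy as np

def sparse_leaf(G, H, k, gamma=1.0):
    """G, H: length-d summed gradients / diagonal Hessians; k: sparsity budget."""
    score = G ** 2 / (H + gamma)            # per-column objective contribution
    top = np.argpartition(score, -k)[-k:]   # indices of the k largest scores
    w = np.zeros_like(G)
    w[top] = -G[top] / (H[top] + gamma)     # Eq. (22)
    obj = -0.5 * score[top].sum()           # Eq. (23)
    return w, obj
```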
Our sparse objective is similar to the objective in [15]. The difference is that our derivation is based on second-order gradient statistics.
III-C Split Finding
Split finding algorithms for a single output maximize the gain of the objective before and after the split. When dealing with multiple outputs, the objective is defined over all output variables, so finding the maximum gain requires scanning all columns of the outputs. We summarize the exact split finding algorithm for multiple outputs in Algorithm 3. The exact algorithm is inefficient because it enumerates all possible splits. In this work, we use its histogram-approximation counterpart to speed up the training process.
To deal with multiple outputs, one should extend the histogram construction algorithm from the single-variable case to the multiple-variable case. Such an extension is straightforward. Denote $B$ as the number of bins of a histogram. Then, the gradient information is stored in a $B \times d$ matrix whose $j$-th column corresponds to the gradient information of the $j$-th output variable. We describe the histogram construction algorithm for multiple outputs in Algorithm 2. Once the histogram is constructed, we scan its bins to find the best split, as described in Algorithm 4. Compared with the single-output case, the objective gain is the sum of gains over all outputs.
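A sketch of this bin scan (our own illustration of the idea behind Algorithm 4, not the paper's code): walk the $B$ bins once, keep running per-output sums, and score each boundary by the gain summed over all $d$ outputs:

```python
import numpy as np

def best_split_histogram(hist_g, hist_h, gamma=1.0):
    """hist_g, hist_h: (B, d) per-bin sums of g and h for d outputs."""
    G, H = hist_g.sum(axis=0), hist_h.sum(axis=0)
    parent = -0.5 * np.sum(G ** 2 / (H + gamma))   # Eq. (19) for the parent
    gl, hl = np.zeros_like(G), np.zeros_like(H)
    best_gain, best_bin = 0.0, None
    for b in range(hist_g.shape[0] - 1):           # candidate boundary after bin b
        gl = gl + hist_g[b]; hl = hl + hist_h[b]
        gr, hr = G - gl, H - hl
        after = -0.5 * (np.sum(gl ** 2 / (hl + gamma))
                        + np.sum(gr ** 2 / (hr + gamma)))
        if parent - after > best_gain:
            best_gain, best_bin = parent - after, b
    return best_gain, best_bin
```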
III-D Sparse Split Finding
We propose histogram-approximation-based sparse split finding algorithms. Compared with the non-sparse one, the key difference is how to compute the objective gain given a possible split:

$$\text{Gain} = \frac{1}{2} \Bigg[ \sum_{j=1}^{k} \frac{\big(G^{L}_{\sigma_L(j)}\big)^2}{H^{L}_{\sigma_L(j)} + \gamma} + \sum_{j=1}^{k} \frac{\big(G^{R}_{\sigma_R(j)}\big)^2}{H^{R}_{\sigma_R(j)} + \gamma} \Bigg] + \text{const} \tag{24}$$

where $\sigma_L$ and $\sigma_R$ are the sorted orders of $\frac{(G^{L}_j)^2}{H^{L}_j + \gamma}$ and $\frac{(G^{R}_j)^2}{H^{R}_j + \gamma}$ of the left and right parts, respectively. The constant reflects that the objective before the split is fixed for every possible split. When we scan over columns, we maintain the top-$k$ columns of each part whose $\frac{G_j^2}{H_j + \gamma}$ are the largest. This can be achieved with a top-$k$ priority queue. We provide the algorithm in Algorithm 5.
In Algorithm 5, the sets of selected columns of the left and right parts are not necessarily overlapping. We can restrict those two sets to be completely overlapping; in other words, the selected columns of the two parts are shared. The objective gain in that case becomes:

$$\text{Gain} = \frac{1}{2} \sum_{j=1}^{k} \Bigg[ \frac{\big(G^{L}_{\sigma(j)}\big)^2}{H^{L}_{\sigma(j)} + \gamma} + \frac{\big(G^{R}_{\sigma(j)}\big)^2}{H^{R}_{\sigma(j)} + \gamma} \Bigg] + \text{const} \tag{25}$$

where $\sigma$ is the sorted order of $\frac{(G^{L}_j)^2}{H^{L}_j + \gamma} + \frac{(G^{R}_j)^2}{H^{R}_j + \gamma}$. We call this the restricted sparse split finding algorithm and describe its gain computation in Algorithm 6. There are two advantages of the restricted one (a sketch of both gain computations follows the list):

it has lower computational complexity because it maintains only a single top-$k$ priority queue;

it introduces a smoothness prior into the function space because it makes two child nodes with the same parent more similar.
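As noted above, here is a minimal sketch of the two gain computations in (24) and (25) (our own code; vL and vR stand for the per-column scores $G_j^2/(H_j+\gamma)$ of the left and right parts, the constants are dropped, and argpartition replaces the paper's top-$k$ priority queues):

```python
import numpy as np

def sparse_gain_unrestricted(vL, vR, k):
    # Each child keeps its own top-k column scores, Eq. (24).
    return 0.5 * (np.partition(vL, -k)[-k:].sum()
                  + np.partition(vR, -k)[-k:].sum())

def sparse_gain_restricted(vL, vR, k):
    # Both children share one column set, ranked by vL + vR, Eq. (25).
    shared = np.argpartition(vL + vR, -k)[-k:]
    return 0.5 * (vL[shared].sum() + vR[shared].sum())
```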
We provide an example of the differences between non-sparse split finding, sparse split finding and restricted sparse split finding in Fig. 2.
III-E Implementation Details
We implement GBDT-MO from scratch in C++ and provide a Python interface. We speed up our algorithm using multi-core parallelism, implemented with OpenMP. Several open-sourced GBDT libraries, such as XGBoost and LightGBM, are highly optimized and include advanced features such as distributed training and GPU training. Those advanced features are not included in our implementation; our code is currently intended for academic exploration. A decision tree grows in the best-first manner, the same as in LightGBM. Specifically, we keep the information of nodes that have not yet been divided in memory. Each time we need to add a node, we select the stored node whose objective gain is maximal, which is achieved by a priority queue. We describe the tree growth algorithm in Algorithm 7. In practice, we store only a limited number of nodes in memory to reduce the memory cost.
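A sketch of best-first growth (our own illustration of the idea behind Algorithm 7; find_best_split and split_node are hypothetical callbacks standing in for the split finding and partitioning routines):

```python
import heapq

def grow_tree(root, find_best_split, split_node, max_leaves, min_gain=0.0):
    """find_best_split(node) -> (gain, split); split_node(node, split) -> children."""
    gain, split = find_best_split(root)
    heap = [(-gain, 0, root, split)]   # min-heap, so gains are negated
    n_leaves, uid = 1, 1               # uid breaks ties between equal gains
    while heap and n_leaves < max_leaves:
        neg_gain, _, node, split = heapq.heappop(heap)
        if -neg_gain <= min_gain:      # gain threshold: stop growing this node
            continue
        for child in split_node(node, split):
            g, s = find_best_split(child)
            heapq.heappush(heap, (-g, uid, child, s))
            uid += 1
        n_leaves += 1                  # one leaf replaced by two
    return root
```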
III-F Complexity Analysis
We analyze the complexity of GBDT-MO and compare it with GBDT for a single variable. For training complexity, we focus on the split finding algorithms. Recall that the input dimension is $m$, the output dimension is $d$, and $B$ is the number of bins, which we suppose is fixed for all input dimensions. The complexity of non-sparse split finding is then $O(mBd)$, and the complexity of sparse split finding is $O(mBd \log k)$, where $k$ is the sparsity constraint, because inserting an element into a top-$k$ priority queue costs $O(\log k)$. Thus, non-sparse split finding has lower time complexity than the sparse one. The complexity of GBDT for a single variable is $O(mBd)$ in total over its $d$ trees, the same as non-sparse split finding. However, this does not mean they are equally fast in practice: beyond split finding, GBDT has more overhead which slows its training down. For example, samples are divided into two parts $d$ times for GBDT but only once for GBDT-MO. We also analyze the inference complexity for a single sample. Suppose there are $R$ boosting rounds and the maximum height of each tree is $s$. Non-sparse GBDT-MO requires $O(Rs)$ comparisons and $O(Rd)$ additions; the sparse variant requires $O(Rs)$ comparisons and $O(Rk)$ additions. Standard GBDT requires $O(Rsd)$ comparisons and $O(Rd)$ additions, because $d$ trees are constructed per boosting round.
In summary, the non-sparse split finding of GBDT-MO should be faster than standard GBDT in practice due to less overhead during training. For inference, GBDT-MO should be faster than standard GBDT in theory because the latter requires more comparisons.
IV Related Work
Gradient boosted decision trees (GBDTs), proposed in [1], have received much attention due to their accuracy, efficiency and interpretability. GBDT has two characteristics: it uses decision trees as the base learner, and its boosting process is guided by the gradient of some loss function. Many variants of [1] have been proposed. Instead of only the first-order gradient, XGBoost [7] also uses the second-order gradient to guide its boosting process and derives the corresponding objective for split finding. The histogram approximation of split finding was proposed in [16] and is used as the base algorithm in LightGBM [8]. Because the second-order gradient improves accuracy and histogram approximation improves training efficiency, both improvements are also used in the proposed GBDT-MO.
Since machine learning problems with multiple outputs have become common, many tree-based or boosting-based methods have been proposed to deal with them. [18] [19] generalize the impurity measures defined for binary classification and ranking tasks to the multi-label scenario for splitting a node. However, they are random-forest-based methods; that is, new trees are not constructed in a boosting manner. Several works extend adaptive boosting (AdaBoost) to multi-label cases, such as AdaBoost.MH [20] and AdaBoost.LC [21]. The spirit of AdaBoost.MH and AdaBoost.LC is different from that of GBDT: at each step, a new base learner is trained from scratch on reweighted samples. Moreover, AdaBoost.MH only works for the Hamming loss, while AdaBoost.LC only works for the covering loss [21].
They do not belong to the GBDT family. Two works which do belong to the GBDT family have been proposed for learning multiple outputs [14] [15]. [14] transforms the multiple-output problem into a single-output problem by kernelizing the output space. To achieve this, an $n \times n$ kernel matrix must be constructed, where $n$ is the number of training samples; thus, this method is not scalable. Moreover, it works only for the square loss. GBDT for sparse output (GBDT-sparse) is proposed in [15], where the outputs are represented in sparse format and a sparse split finding algorithm is designed by adding an $L_0$ constraint to the objective. The sparse split finding algorithms of GBDT-MO are inspired by this work. There are several differences between GBDT-MO and GBDT-sparse:

GBDT-sparse focuses on extreme multi-label classification problems, whereas GBDT-MO focuses on general multiple-output problems. GBDT-sparse requires the loss to be separable over output variables. It also requires the gradient to be sparse, i.e. zero for most output variables. During training, it introduces a clipping operator into the loss to maintain the sparsity of the gradient. These facts limit the types of usable losses.

GBDT-sparse does not employ the second-order gradient. Its objective for split finding is derived from the first-order Taylor expansion of the loss, as in [1].

GBDT-sparse does not employ histogram approximation to speed up the training process. This is non-trivial because it is not clear how to construct sparse histograms, especially in combination with the second-order gradient.

The main motivation of GBDT-sparse is to reduce the space complexity of training and the size of models, whereas the main motivation of GBDT-MO is to improve generalization ability by capturing correlations between output variables.

We propose a variant of sparse split finding that shares the selected columns between both parts.
It may be hard to store the outputs in memory without a sparse format when the number of classes is very large. In such a situation, GBDT-sparse is a better choice.
V Experiments
We evaluate GBDT-MO on multi-output regression, multi-class classification and multi-label classification problems. First, we show the benefits of GBDT-MO on two synthetic problems. Then, we evaluate GBDT-MO on six real-world datasets. Finally, we evaluate our sparse split finding algorithms. We denote by GBDT-SO our own implementation of GBDT for a single output; except for the split finding algorithm, all implementation details are the same as GBDT-MO for a fair comparison. The purpose of our experiments is not to push state-of-the-art results on specific datasets; instead, we aim to show that GBDT-MO has better generalization ability than GBDT-SO. On synthetic datasets, we compare GBDT-MO with GBDT-SO. On real-world datasets, we also compare with LightGBM, because LightGBM is a representative open-source GBDT library and some components of our implementation, such as histogram approximation and best-first leaf growth, are similar to LightGBM's. We provide hyperparameter settings in Appendix B. The training process is stopped when the performance does not improve within 25 rounds.
All experiments are done on a workstation with an Intel Xeon CPU E5-2698 v4. On synthetic datasets, we use 4 threads; on real-world datasets, we use 8 threads. The source code of GBDT-MO is available at https://github.com/zzd1992/GBDTMO. The experiments of this paper are available at https://github.com/zzd1992/GBDTMO-EX.
Table I: Test RMSE on the synthetic datasets.

             friedman1   random projection
GBDT-SO      0.1540      0.0204
GBDT-MO      0.1429      0.0180
V-A Synthetic Datasets
The first dataset is derived from the friedman1 regression problem [22]. Its target is generated by the following process:

$$z = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 \tag{26}$$

$$y_j = z + \epsilon_j, \qquad j = 1, \dots, d \tag{27}$$

where $\mathbf{x} \in \mathbb{R}^{10}$ and each element of $\mathbf{x}$ is sampled from $U(0, 1)$. The last five elements of $\mathbf{x}$ are irrelevant to the target. We extend this problem to the multiple-output case by adding independent Gaussian noise $\epsilon_j$ to $z$ for each output dimension.
The second dataset is generated by random projection:

$$\mathbf{y} = A \mathbf{x} \tag{28}$$

where $\mathbf{x} \in \mathbb{R}^{m}$, $\mathbf{y} \in \mathbb{R}^{d}$ and $A \in \mathbb{R}^{d \times m}$ with $d > m$. Each element of $\mathbf{x}$ and $A$ is sampled independently.
For both datasets, we generate separate training and test sets. We train via the mean squared error (MSE) and evaluate on test samples via the root mean squared error (RMSE). We repeat the experiments 5 times with different seeds and average the RMSE. As shown in Table I, GBDT-MO is better than GBDT-SO. We provide the training curves in Fig. 3; for fairness, curves of different methods are plotted with the same learning rate and maximum tree depth. It can be observed that GBDT-MO has better generalization ability on both datasets: its RMSE on test samples is lower and its performance gap is smaller. Here, the performance gap means the difference in performance between training samples and unseen samples, which is commonly used to measure the generalization ability of machine learning algorithms.
The output variables of friedman1 are correlated because they are observations of the same underlying variable corrupted by Gaussian noise. The output variables of random projection are also correlated because the projection is overcomplete, i.e. the output dimension is larger than the input dimension. This supports our claim that GBDT-MO generalizes better because its learning mechanism encourages it to capture variable correlations. However, judging from its RMSE curves, GBDT-MO suffers from a slower convergence speed.
Table II: Statistics of the real-world datasets. * denotes features obtained by the preprocessing described in Section V-B.

Dataset            # Training samples   # Test samples   # Features   # Outputs   Problem type
MNIST              50,000               10,000           784          10          classification
Yeast              1,038                446              8            10          classification
Caltech101         6,073                2,604            324*         101         classification
MNIST-inpainting   50,000               10,000           200*         24          regression
Student-por        454                  159              41*          3           regression
NUS-WIDE           161,789              107,859          128*         81          multi-label
Table III: Top-1 accuracy on the real-world classification datasets.

            MNIST    Yeast    Caltech101   NUS-WIDE
LightGBM    0.9803   0.6184   0.5610       0.4399
GBDT-SO     0.9808   0.6188   0.5649       0.4410
GBDT-MO     0.9830   0.6193   0.5846       0.4421
Table IV: RMSE on the real-world regression datasets.

            MNIST-inpainting   Student-por
LightGBM    0.26090            0.23976
GBDT-SO     0.26157            0.23872
GBDT-MO     0.26025            0.23829
Table V: Average training time per boosting round in seconds of the sparse split finding algorithms (restricted/unrestricted). Rows correspond to increasing values of the sparsity factor k from top to bottom.

MNIST         MNIST-inpainting   Caltech101    NUS-WIDE
0.619/0.599   -                  -             -
0.737/0.758   0.149/0.179        -             -
0.721/0.778   0.161/0.195        0.578/0.663   1.248/1.266
-             0.153/0.208        0.620/0.738   1.237/1.365
-             -                  0.628/0.866   1.256/1.453
-             -                  0.650/0.937   1.263/1.493
Table VI: Test performance of the sparse split finding algorithms (restricted/unrestricted). Rows correspond to increasing values of the sparsity factor k from top to bottom.

MNIST           MNIST-inpainting   Caltech101      NUS-WIDE
(accuracy)      (RMSE)             (accuracy)      (top-1 accuracy)
0.9754/0.9779   -                  -               -
0.9819/0.9818   0.26424/0.26368    -               -
0.9807/0.9806   0.26245/0.26391    0.5422/0.5379   0.4420/0.4420
-               0.26232/0.26248    0.5621/0.5500   0.4422/0.4427
-               -                  0.5721/0.5706   0.4407/0.4404
-               -                  0.5863/0.5721   0.4423/0.4422
V-B Real-world Datasets
In this subsection, we evaluate GBDT-MO on six real-world datasets and compare it with GBDT-SO in terms of test performance, generalization ability and training speed.
MNIST (yann.lecun.com/exdb/mnist/) is widely used for classification; its samples are gray-scale images of handwritten digits from 0 to 9. There are 50,000 samples for training and 10,000 samples for test. Each sample is converted into a vector with 784 elements.
Yeast (archive.ics.uci.edu/ml/datasets/Yeast) has 8 input attributes, each of which is a measurement of the protein sequence. The goal is to predict protein localization sites with 10 possible choices. There are 1,484 samples in total.
Caltech101 (www.vision.caltech.edu/Image_Datasets/Caltech101/) contains images of objects belonging to 101 categories. There are 8,677 samples in total, and most categories have about 50 samples. The size of each image is roughly 300 × 200 pixels. To obtain fixed-length features, we resize each image and compute the HOG descriptor [23] of the resized image. Each sample is finally converted into a vector with 324 elements.
MNIST-inpainting is a regression task based on MNIST. We crop the central 20 × 20 patch from the original image, because most boundary pixels are 0. Then, the cropped image is divided into upper and lower halves of 10 × 20 pixels each. The upper half (200 pixels) is used as the input. We further crop a small patch of 24 pixels at the top center of the lower half; this small patch is used as the target. The task is to predict the pixels of this patch given the upper half image.
Student-por (archive.ics.uci.edu/ml/datasets/Student+Performance) predicts the Portuguese language scores of students in three different grading periods based on their demographic, social and school-related features. There are 613 samples in total. The original scores range from 0 to 20; we rescale them linearly. We use one-hot coding to deal with the categorical features of this dataset.
NUS-WIDE (mulan.sourceforge.net/datasets-mlc.html) is a dataset for real-world web image retrieval [24]. [10] selects a subset of its labels and uses it for multi-label classification. There are 161,789 samples for training and 107,859 for test. Images are represented using 128-D cVLAD+ features described in [25].
We summarize the statistics of the above datasets in Table II. They are diverse in scale and complexity. For MNIST, MNIST-inpainting and NUS-WIDE, we use the official training-test split. For the others, we randomly split the samples into training and test sets (sizes in Table II), repeat this 5 times with different seeds, and report the average results. For multi-class classification, we use the cross-entropy loss. For regression and multi-label classification, we use the MSE loss. Performance is measured by RMSE for regression and by top-1 accuracy for multi-class and multi-label classification.
RMSE and top-1 accuracy on the real-world datasets are shown in Tables IV and III, respectively. GBDT-MO is consistently better than GBDT-SO. The performance of LightGBM is provided as a reference; we suggest that readers focus on the comparison between GBDT-SO and GBDT-MO. Since we care about generalization ability, we show the loss and accuracy curves of MNIST and Caltech101 in Fig. 4. We conclude that GBDT-MO has better generalization ability than GBDT-SO: its training loss is higher while its test performance is better.
We also compare the training speed of GBDT-SO and GBDT-MO. Specifically, we run 10 boosting rounds and record the average training time per round; we repeat this process three times and report the average training time in seconds in Fig. 5. The training speed on Yeast and Student-por is not reported because those two datasets are too small. GBDT-MO is remarkably faster than GBDT-SO, especially when the number of outputs is large.
LightGBM is also much faster than GBDT-SO due to its better engineering and optimization. Its training speed is not reported because we focus on the speed difference caused by the learning mechanism for multiple outputs.
V-C Sparse Split Finding
All experiments on GBDT-MO in Sections V-A and V-B use the non-sparse split finding algorithm. In this section, we evaluate our unrestricted sparse split finding algorithm (Algorithm 5) and restricted sparse split finding algorithm (Algorithm 6). We compare their performance and training speed on MNIST, MNIST-inpainting, Caltech101 and NUS-WIDE with different values of the sparsity factor $k$.
Their training speed in seconds is shown in Table V. The restricted algorithm is consistently faster than the unrestricted one, and the speed difference is remarkable when $k$ is large. The test performance is shown in Table VI. The restricted algorithm is slightly better than the unrestricted one. Note that with a proper $k$, the sparse split finding algorithms are sometimes better than the non-sparse one (for example, on Caltech101). We do not further adjust their hyperparameters; they are the same as for the corresponding non-sparse runs.
VI Conclusions
In this paper, we have proposed a general method to learn GBDT for multiple outputs, called GBDT-MO. The motivation of GBDT-MO is to capture the correlations between output variables. We have derived the approximated learning objective in both the non-sparse and sparse cases based on the second-order Taylor expansion of the loss. For the sparse case, we have proposed a restricted algorithm, which forces the selected column subsets of the left and right parts to be the same, and an unrestricted algorithm without such a restriction. We have extended histogram approximation to the multiple-output case and used it to speed up the training process. Our evaluations show that GBDT-MO is remarkably and consistently better in generalization ability, and faster in training speed, than GBDT for a single output. They also show that the restricted sparse split finding algorithm is slightly better than the unrestricted one when considering both performance and training speed. However, GBDT-MO suffers from a slower convergence speed, especially at the beginning of training. In future work, we aim to improve its convergence speed.
Table VII: Selected learning rate / maximum depth.

                   LightGBM   GBDT-SO   GBDT-MO
MNIST              0.25/6     0.10/6    0.10/8
Yeast              0.10/4     0.10/5    0.25/6
Caltech101         0.10/8     0.05/8    0.10/9
NUS-WIDE           0.05/8     0.05/8    0.10/8
MNIST-inpainting   0.10/8     0.10/8    0.10/7
Student-por        0.10/4     0.05/4    0.10/5
Table VIII: Other hyperparameter settings on the real-world datasets.

                      MNIST   Yeast   Caltech101   NUS-WIDE   MNIST-inpainting   Student-por
Gain threshold        1e-3    1e-3    1e-3         1e-6       1e-6               1e-6
Min samples in leaf   16      4       16           8          4                  4
Max bins              8       32      32           64         16                 8
Histogram cache       16      16      16           48         48                 48
References
 [1] Jerome H. Friedman, "Greedy function approximation: A gradient boosting machine," Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
 [2] Jerome H. Friedman, "Stochastic gradient boosting," Computational Statistics and Data Analysis, vol. 38, no. 4, pp. 367–378, 2002.
 [3] Ping Li, "Robust LogitBoost and adaptive base class (ABC) LogitBoost," in Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2010, pp. 302–311.
 [4] Christopher J. C. Burges, "From RankNet to LambdaRank to LambdaMART: An overview," Learning, vol. 11, no. 23-581, p. 81, 2010.
 [5] Matthew Richardson, Ewa Dominowska, and Robert Ragno, "Predicting clicks: Estimating the click-through rate for new ads," in Proceedings of the International Conference on World Wide Web, 2007, pp. 521–530.
 [6] James Bennett, Stan Lanning, et al., "The Netflix prize," in Proceedings of KDD Cup and Workshop, New York, NY, USA, 2007, vol. 2007, p. 35.
 [7] Tianqi Chen and Carlos Guestrin, "XGBoost: A scalable tree boosting system," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.
 [8] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu, "LightGBM: A highly efficient gradient boosting decision tree," in Advances in Neural Information Processing Systems, 2017, pp. 3149–3157.
 [9] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin, "CatBoost: Unbiased boosting with categorical features," in Advances in Neural Information Processing Systems, 2018, pp. 6638–6648.
 [10] Grigorios Tsoumakas, Eleftherios Spyromitros-Xioufis, Jozef Vilcek, and Ioannis Vlahavas, "Mulan: A Java library for multi-label learning," Journal of Machine Learning Research, vol. 12, pp. 2411–2414, 2011.
 [11] Hanen Borchani, Gherardo Varando, Concha Bielza, and Pedro Larrañaga, "A survey on multi-output regression," Data Mining and Knowledge Discovery, vol. 5, no. 5, pp. 216–233, 2015.
 [12] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org.
 [13] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
 [14] Pierre Geurts, Louis Wehenkel, and Florence d'Alché-Buc, "Gradient boosting for kernelized output spaces," in Proceedings of the International Conference on Machine Learning, 2007, pp. 289–296.
 [15] Si Si, Huan Zhang, S. Sathiya Keerthi, Dhruv Mahajan, Inderjit S. Dhillon, and Cho-Jui Hsieh, "Gradient boosted decision trees for high dimensional sparse output," in Proceedings of the International Conference on Machine Learning, 2017, pp. 3182–3190.
 [16] Stephen Tyree, Kilian Q. Weinberger, Kunal Agrawal, and Jennifer Paykin, "Parallel boosted regression trees for web search ranking," in Proceedings of the International Conference on World Wide Web, 2011, pp. 387–396.
 [17] Tianqi Chen, Sameer Singh, Ben Taskar, and Carlos Guestrin, "Efficient second-order gradient boosting for conditional random fields," in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2015, pp. 147–155.
 [18] Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma, "Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages," in Proceedings of the International Conference on World Wide Web, 2013, pp. 13–24.
 [19] Yashoteja Prabhu and Manik Varma, "FastXML: A fast, accurate and stable tree-classifier for extreme multi-label learning," in Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2014, pp. 263–272.
 [20] Robert E. Schapire and Yoram Singer, "Improved boosting algorithms using confidence-rated predictions," in Proceedings of the International Conference on Learning Theory, 1998, vol. 37, pp. 80–91.
 [21] Yonatan Amit, Ofer Dekel, and Yoram Singer, "A boosting algorithm for label covering in multi-label problems," in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2007, pp. 27–34.
 [22] Jerome H. Friedman, "Multivariate adaptive regression splines," Annals of Statistics, vol. 19, no. 1, pp. 1–67, 1991.
 [23] Navneet Dalal and Bill Triggs, "Histograms of oriented gradients for human detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005, vol. 1, pp. 886–893.
 [24] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yan-Tao Zheng, "NUS-WIDE: A real-world web image database from National University of Singapore," in Proceedings of the ACM Conference on Image and Video Retrieval, Santorini, Greece, July 8-10, 2009.
 [25] Eleftherios Spyromitros-Xioufis, Symeon Papadopoulos, Ioannis Kompatsiaris, Grigorios Tsoumakas, and Ioannis Vlahavas, "A comprehensive study over VLAD and product quantization in large-scale image retrieval," IEEE Transactions on Multimedia, vol. 16, no. 6, pp. 1713–1728, 2014.
Appendix A Approximated Objective
We suppose there exists $\epsilon \ge 0$ such that, for any $\mathbf{w}$,

$$\Big| \sum_{j \ne k} H_{jk} w_j w_k \Big| \le \epsilon \sum_{j} H_{jj} w_j^2 \tag{29}$$

That is, $\mathbf{H}$ is dominated by its diagonal elements. Then, we have:

$$\mathbf{w}^{T} \mathbf{H} \mathbf{w} = \sum_j H_{jj} w_j^2 + \sum_{j \ne k} H_{jk} w_j w_k \le (1 + \epsilon) \sum_j H_{jj} w_j^2 \tag{30}$$

The last inequality holds based on the assumption in (29). Substituting this result into the Taylor expansion of the loss yields an upper bound:

$$l(\mathbf{y}, \hat{\mathbf{y}} + \mathbf{w}) \lesssim l(\mathbf{y}, \hat{\mathbf{y}}) + \mathbf{g}^{T} \mathbf{w} + \frac{1 + \epsilon}{2} \sum_j H_{jj} w_j^2 \tag{31}$$

Substituting this upper bound of $l$ into (15), we get the optimal value of $w_j$:

$$w_j^{*} = -\frac{1}{1 + \epsilon} \cdot \frac{G_j}{H_j + \gamma / (1 + \epsilon)} \tag{32}$$

Recall that $H_j$ sums the $j$-th diagonal elements of the per-sample Hessians. This solution is the same as (18) except for the coefficient $\frac{1}{1+\epsilon}$ and the rescaled regularization $\gamma / (1+\epsilon)$. In practice, the actual leaf weight is $\eta w^{*}$ where $\eta$ is the so-called learning rate. Then, the effect of $\epsilon$ can be canceled out by adjusting $\eta$ and $\gamma$. Thus, we do not need to consider the exact value of $\epsilon$ in practice.
Appendix B Hyperparameter Settings
We discuss our hyperparameter settings. We find the best maximum depth and learning rate via a grid search and provide the selected values in Table VII. We set the regularization coefficient $\gamma$ to 1.0 and fix the maximum number of leaves in all experiments. We preliminarily searched the other hyperparameters and fixed them for all methods; we list them for the real-world datasets in Table VIII. Note that the hyperparameters in Table VIII may not be optimal. When comparing with the gain threshold, we use the average gain over the involved output variables.