NGBoost: Natural Gradient Boosting for Probabilistic Prediction

Tony Duan*, Anand Avati*, Daisy Yi Ding, Sanjay Basu, Andrew Y. Ng, Alejandro Schuler

Abstract

We present Natural Gradient Boosting (NGBoost), an algorithm that brings probabilistic prediction capability to gradient boosting in a generic way. Predictive uncertainty estimation is crucial in many applications, such as healthcare and weather forecasting. Probabilistic prediction, in which the model outputs a full probability distribution over the entire outcome space, is a natural way to quantify those uncertainties. Gradient Boosting Machines have been widely successful in prediction tasks on structured input data, but a simple boosting solution for probabilistic prediction of real-valued outputs is yet to be made. NGBoost is a gradient boosting approach that uses the Natural Gradient to address the technical challenges that make generic probabilistic prediction hard with existing gradient boosting methods. Our approach is modular with respect to the choice of base learner, probability distribution, and scoring rule. We show empirically on several regression datasets that NGBoost provides competitive performance on both predictive uncertainty estimates and traditional metrics.

1 Introduction
Many real-world supervised machine learning problems have tabular features and real-valued targets. Weather forecasting (predicting temperature of the next day based on today’s atmospheric variables (gneiting_probabilistic_2014)) and clinical prediction (predicting time to mortality with survival prediction on structured medical records of the patient (avati_improving_2018)) are important examples. But rarely should the model be absolutely confident about a prediction. In such tasks it is crucial to estimate the uncertainty in predictions. This is especially the case when the predictions are directly linked to automated decision making, as probabilistic uncertainty estimates are important in determining manual fall-back alternatives in the workflow (ml_meet_econ).
Bayesian methods naturally generate predictive uncertainty estimates by integrating predictions over the posterior, but they also have practical shortcomings when one is only interested in predictive uncertainty (as opposed to inference over parameters of interest). Exact solutions to Bayesian models are limited to simple models, and calculating the posterior distribution of more powerful models such as Neural Networks (NN) (Neal:1996:BLN:525544) and Bayesian Additive Regression Trees (BART) (chipman_bart:_2010) is difficult. Inference in these models requires computationally expensive approximation via, for example, MCMC sampling. Moreover, sampling-based inference requires some statistical expertise and thus limits the ease-of-use of Bayesian methods. Among non-parametric Bayesian approaches, scalability to very large datasets can be a challenge (rasmussen_gaussian_2005). Bayesian Deep Learning is gaining popularity (gal_dropout_2016; lakshminarayanan_simple_2017), but, while neural networks have empirically excelled at perception tasks (such as with visual and audio input), they perform on par with traditional methods when data are tabular. Small training datasets and informative prior specification are also challenges for Bayesian neural nets.
Alternatively, meteorology has adopted probabilistic forecasting as the preferred methodology for weather prediction. In this setting, the output of the model, given the observed features, is a probability distribution over the entire outcome space. The models are trained to maximize sharpness, subject to calibration, by optimizing scoring rules such as Maximum Likelihood Estimation (MLE) or the more robust Continuous Ranked Probability Score (CRPS) (gneiting_weather_2005; gneiting_strictly_2007). This yields calibrated uncertainty estimates. Outside of meteorology, sharp and calibrated probabilistic prediction of time-to-event outcomes (such as mortality) has recently been explored in healthcare (avati_countdown_2019).
Meanwhile, Gradient Boosting Machines (GBMs) (chen_xgboost:_2016; friedman_greedy_2001) are a set of highly modular methods that work well on structured input data, even with relatively small datasets. This can be seen in their empirical success on Kaggle and other competitions (chen_xgboost:_2016). In classification tasks, their predictions are probabilistic by default (by use of the sigmoid or softmax link function). But in regression tasks, they output only a scalar value. Under a squared-error loss, these scalars can be interpreted as the mean of a conditional Gaussian distribution with some (unknown) constant variance. However, such a probabilistic interpretation has little use if the variance is assumed constant. The predicted distribution needs to have at least two degrees of freedom (two parameters) to effectively convey both the magnitude and the uncertainty of the prediction, as illustrated in Figure 1. It is precisely this problem of simultaneously boosting multiple parameters from the base learners that makes probabilistic forecasting with GBMs a challenge, and NGBoost addresses it with the use of natural gradients (amari_natural_1998).
Summary of Contributions
We present Natural Gradient Boosting, a modular boosting algorithm for probabilistic forecasting (sec 2.4) which uses natural gradients to integrate any choice of:
Base learner (e.g. Decision Tree),
Parametric probability distribution (Normal, Laplace, etc.), and
Scoring rule (MLE, CRPS, etc.).
We present a generalization of the Natural Gradient to other scoring rules such as CRPS (sec 2.2).
We demonstrate empirically that NGBoost performs competitively relative to other models in its predictive uncertainty estimates as well as on traditional metrics (sec 3).
2 Natural Gradient Boosting
In standard prediction settings, the object of interest is an estimate of the scalar function $\mathbb{E}[y|x]$, where $x$ is a vector of observed features and $y$ is the prediction target. In our setting we are interested in producing a probabilistic forecast with probability density $P_\theta(y|x)$, by predicting the parameters $\theta$. We denote the corresponding cumulative distribution functions as $F_\theta(y|x)$.
2.1 Proper Scoring Rules
We start with an overview of proper scoring rules and their corresponding induced divergences, which lays the groundwork for describing the natural gradient.
A proper scoring rule $S$ takes as input a forecasted probability distribution $P$ and one observation (outcome) $y$, and assigns a score $S(P, y)$ to the forecast such that the true distribution of the outcomes gets the best score in expectation (gneiting_strictly_2007). In mathematical notation, a scoring rule $S$ is a proper scoring rule if and only if it satisfies
$$\mathbb{E}_{y \sim Q}\left[S(Q, y)\right] \leq \mathbb{E}_{y \sim Q}\left[S(P, y)\right] \qquad (1)$$
where $Q$ represents the true distribution of outcomes $y$, and $P$ is any other distribution (such as the probabilistic forecast from a model). Proper scoring rules encourage the model to output calibrated probabilities when used as loss functions during training.
We limit ourselves to a parametric family of probability distributions, and identify a particular distribution by its parameters $\theta$.
The most commonly used proper scoring rule is the log score $\mathcal{L}$, also known as MLE:
$$\mathcal{L}(\theta, y) = -\log P_\theta(y)$$
The CRPS is another proper scoring rule, generally considered a robust alternative to MLE. The CRPS applies only to real-valued probability distributions and outcomes. While the MLE enjoys better asymptotic properties in the well-specified case, the CRPS empirically tends to produce sharper prediction distributions when the noise model is mis-specified (gebetsberger_estimation_2018). The CRPS (denoted $\mathcal{C}$) is defined as
$$\mathcal{C}(\theta, y) = \int_{-\infty}^{y} F_\theta(z)^2 \, dz + \int_{y}^{\infty} \left(1 - F_\theta(z)\right)^2 dz$$
where $F_\theta$ is the cumulative distribution function of $P_\theta$.
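As a concrete check of this definition, the sketch below (our own illustration, not part of the paper's implementation) evaluates the CRPS of a Normal forecast both by numerical quadrature of the integral above and via the well-known closed form for the Normal distribution:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def crps_normal_numeric(mu, sigma, y):
    """CRPS via its integral definition; lower is better."""
    F = lambda z: stats.norm.cdf(z, loc=mu, scale=sigma)
    left, _ = quad(lambda z: F(z) ** 2, -np.inf, y)
    right, _ = quad(lambda z: (1 - F(z)) ** 2, y, np.inf)
    return left + right

def crps_normal_closed(mu, sigma, y):
    """Closed form of the CRPS for a Normal forecast."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * stats.norm.cdf(z) - 1)
                    + 2 * stats.norm.pdf(z) - 1 / np.sqrt(np.pi))

print(crps_normal_numeric(0.0, 1.0, 0.5))  # ≈ 0.3314
print(crps_normal_closed(0.0, 1.0, 0.5))   # ≈ 0.3314
```

The closed form makes the CRPS cheap to use as a training loss, while the integral form works for any distribution whose CDF can be evaluated.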
Every proper scoring rule satisfies the inequality of Eqn 1. The excess score of the right-hand side over the left is the divergence induced by that proper scoring rule (dawid_theory_2014):
$$D_S(Q \,\|\, P) = \mathbb{E}_{y \sim Q}\left[S(P, y)\right] - \mathbb{E}_{y \sim Q}\left[S(Q, y)\right]$$
which is necessarily non-negative, and can be interpreted as a measure of difference from one distribution $Q$ to another $P$.
The MLE scoring rule induces the Kullback-Leibler divergence (KL divergence, or $D_{KL}$):
$$D_{\mathcal{L}}(Q \,\|\, P) = \mathbb{E}_{y \sim Q}\left[\log \frac{Q(y)}{P(y)}\right] = D_{KL}(Q \,\|\, P)$$
while CRPS induces the $L^2$ divergence (dawid_geometry_2007):
$$D_{\mathcal{C}}(Q \,\|\, P) = \int_{-\infty}^{\infty} \left(F_Q(z) - F_P(z)\right)^2 dz$$
where $F_Q$ and $F_P$ are the CDFs of the two distributions. A more detailed derivation of the above can be found in (machete_contrasting_2013).
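The identity between the CRPS-induced divergence and the $L^2$ distance between CDFs can be verified numerically. The sketch below (our own illustration, using SciPy quadrature) computes the expected excess CRPS of a forecast $Q$ under outcomes drawn from $P$, and compares it with the integrated squared difference of the two CDFs for a pair of Normal distributions:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

P = stats.norm(0.0, 1.0)   # "true" outcome distribution
Q = stats.norm(0.5, 1.5)   # a forecast that differs from P

def crps(dist, y):
    """CRPS of a forecast `dist` at outcome y, via the integral definition."""
    left, _ = quad(lambda z: dist.cdf(z) ** 2, -np.inf, y)
    right, _ = quad(lambda z: (1 - dist.cdf(z)) ** 2, y, np.inf)
    return left + right

# Divergence induced by CRPS: expected excess score of Q over P under y ~ P ...
excess, _ = quad(lambda y: P.pdf(y) * (crps(Q, y) - crps(P, y)), -8, 8)
# ... equals the L2 distance between the two CDFs.
l2, _ = quad(lambda z: (P.cdf(z) - Q.cdf(z)) ** 2, -np.inf, np.inf)

print(excess, l2)  # the two quantities agree up to quadrature error
```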
The divergences $D_{\mathcal{L}}$ and $D_{\mathcal{C}}$ are invariant to the choice of parametrization. Though divergences in general are not symmetric (and hence not a measure of distance), at small changes of parameter they are almost symmetric and can serve as a distance measure locally. When used as a local measure of distance, each divergence induces a statistical manifold in which every point corresponds to a probability distribution (dawid_theory_2014).
2.2 The Generalized Natural Gradient
The (ordinary) gradient of a scoring rule $S(\theta, y)$, over a parameterized probability distribution with parameter $\theta$ and outcome $y$, with respect to the parameters is denoted $\nabla S(\theta, y)$. It is the direction of steepest ascent, such that moving the parameters an infinitesimally small amount in that direction (as opposed to any other direction) will increase the scoring rule the most. That is,
$$\nabla S(\theta, y) \propto \lim_{\epsilon \to 0} \operatorname*{arg\,max}_{d \,:\, \|d\| = \epsilon} S(\theta + d, y)$$
It should be noted that this gradient is not invariant to reparametrization. Consider reparametrizing $\theta$ to $\psi = \varphi(\theta)$, so that $P_\theta(y) = P_\psi(y)$ for all $y$ whenever $\psi = \varphi(\theta)$. If the gradient is calculated relative to $\theta$ and an infinitesimal step is taken in that direction, say from $\theta$ to $\theta + d\theta$, the resulting distribution will be different than if the gradient had been calculated relative to $\psi$ and a step was taken from $\psi$ to $\psi + d\psi$. In other words, $P_{\theta + d\theta} \neq P_{\psi + d\psi}$ in general. Thus the choice of parametrization can drastically impact the training dynamics, even though the minima are unchanged. Because of this, it behooves us to instead update the parameters in a way that reflects how we are moving in the space of distributions, which is ultimately what is of interest.
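The non-invariance is easy to demonstrate. In the toy sketch below (our own illustration; the specific values are arbitrary), a gradient-descent step on the Normal negative log-likelihood taken in the $\sigma$ parametrization lands on a different distribution than the equivalent step taken in the $\log\sigma$ parametrization:

```python
import numpy as np

def neg_loglik_grads(mu, sigma, y):
    """Gradients of the negative log-likelihood of N(mu, sigma^2) at y."""
    d_sigma = 1 / sigma - (y - mu) ** 2 / sigma ** 3   # w.r.t. sigma
    d_logsigma = 1 - (y - mu) ** 2 / sigma ** 2        # w.r.t. log(sigma)
    return d_sigma, d_logsigma

mu, sigma, y, eps = 0.0, 2.0, 3.0, 0.1
d_sigma, d_logsigma = neg_loglik_grads(mu, sigma, y)

# A descent step taken in the sigma parametrization...
sigma_step = sigma - eps * d_sigma
# ...versus the same-size step taken in the log(sigma) parametrization.
sigma_logstep = np.exp(np.log(sigma) - eps * d_logsigma)

print(sigma_step, sigma_logstep)  # 2.0625 vs ≈ 2.2663: different distributions
```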
This motivates the natural gradient (denoted $\tilde\nabla$), whose origins can be traced to the field of information geometry (amari_natural_1998). While the natural gradient was originally defined for the statistical manifold with the distance measure induced by $D_{KL}$ (martens_new_2014), we provide a more general treatment here that applies to any divergence that corresponds to some proper scoring rule. The generalized natural gradient is the direction of steepest ascent in Riemannian space, which is invariant to parametrization, and is defined:
$$\tilde\nabla S(\theta, y) \propto \lim_{\epsilon \to 0} \operatorname*{arg\,max}_{d \,:\, D_S(P_\theta \| P_{\theta + d}) = \epsilon} S(\theta + d, y)$$
If we solve the corresponding optimization problem, we obtain the natural gradient of the form
$$\tilde\nabla S(\theta, y) \propto \mathcal{I}_S(\theta)^{-1} \nabla S(\theta, y)$$
where $\mathcal{I}_S(\theta)$ is the Riemannian metric of the statistical manifold at $\theta$, which is induced by the scoring rule $S$.
By choosing $S = \mathcal{L}$ (i.e. MLE) and solving the above optimization problem, we get:
$$\tilde\nabla \mathcal{L}(\theta, y) \propto \mathcal{I}_{\mathcal{L}}(\theta)^{-1} \nabla \mathcal{L}(\theta, y)$$
where $\mathcal{I}_{\mathcal{L}}(\theta)$ is the Fisher Information carried by an observation about $P_\theta$, which is defined as:
$$\mathcal{I}_{\mathcal{L}}(\theta) = \mathbb{E}_{y \sim P_\theta}\left[\nabla_\theta \mathcal{L}(\theta, y) \, \nabla_\theta \mathcal{L}(\theta, y)^T\right]$$
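As a worked example (our own illustration, not code from the paper): for a Normal distribution in the $(\mu, \log\sigma)$ parametrization, the Fisher Information works out to $\mathrm{diag}(1/\sigma^2,\ 2)$, and the natural gradient is the ordinary gradient pre-multiplied by its inverse:

```python
import numpy as np

def fisher_normal(mu, log_sigma):
    """Fisher Information of N(mu, sigma^2) in the (mu, log sigma) parametrization."""
    sigma2 = np.exp(2 * log_sigma)
    return np.diag([1 / sigma2, 2.0])

def grad_nll(mu, log_sigma, y):
    """Ordinary gradient of -log p(y) w.r.t. (mu, log sigma)."""
    sigma2 = np.exp(2 * log_sigma)
    return np.array([-(y - mu) / sigma2, 1 - (y - mu) ** 2 / sigma2])

def natural_grad(mu, log_sigma, y):
    """Natural gradient: ordinary gradient pre-multiplied by the inverse Fisher."""
    return np.linalg.solve(fisher_normal(mu, log_sigma), grad_nll(mu, log_sigma, y))

g = grad_nll(0.0, 0.0, 2.0)       # [-2.0, -3.0]
ng = natural_grad(0.0, 0.0, 2.0)  # [-2.0, -1.5]: the log-sigma component is halved
```

Note how the inverse Fisher rescales the two components relative to each other; this is the "optimal pre-scaling" that Section 2.5 relies on.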
Similarly, by choosing $S = \mathcal{C}$ (i.e. CRPS) and solving the above optimization problem, we get:
$$\tilde\nabla \mathcal{C}(\theta, y) \propto \mathcal{I}_{\mathcal{C}}(\theta)^{-1} \nabla \mathcal{C}(\theta, y)$$
where $\mathcal{I}_{\mathcal{C}}(\theta)$ is the Riemannian metric of the statistical manifold that uses $D_{\mathcal{C}}$ as the local distance measure, given by (dawid_geometry_2007):
$$\mathcal{I}_{\mathcal{C}}(\theta) = 2 \int_{-\infty}^{\infty} \nabla_\theta F_\theta(z) \, \nabla_\theta F_\theta(z)^T \, dz$$
Using the natural gradient for learning the parameters makes the optimization problem invariant to parametrization, and tends to produce much more efficient and stable learning dynamics than the ordinary gradient (amari_natural_1998). Figure 3 shows the vector fields of gradients and natural gradients for both MLE and CRPS on the parameter space of a Normal distribution parameterized by $\mu$ (mean) and $\log\sigma$ (logarithm of the standard deviation).
2.3 Gradient Boosting
Gradient boosting (friedman_greedy_2001) is a supervised learning technique where several weak learners (or base learners) are combined in an additive ensemble. The model is learnt sequentially, where the next base learner is fit against the training objective residual of the current ensemble. The output of the fitted base learner is then scaled by a learning rate and added into the ensemble.
Gradient boosting is effectively a functional gradient descent algorithm. The residual at each stage is the functional gradient of the loss function with respect to the current model. The gradient is then projected onto the range of the base learner class by fitting the learner to the gradient.
The boosting framework can be generalized to any choice of base learner but most popular implementations use shallow decision trees because they work well in practice (chen_xgboost:_2016; ke_lightgbm:_2017).
When fitting a decision tree to the gradient, the algorithm partitions the data into axis-aligned slices. Each slice of the partition is associated with a leaf node of the tree, and is made as homogeneous in its response variable (the gradients at that set of data points) as possible. The criterion of homogeneity is typically the sample variance. The prediction value of the leaf node (which is common to all the examples ending up in the leaf node) is then set to be the additive component to the predictions that minimizes the loss the most. This is equivalent to doing a “line search” in the functional optimization problem for each leaf node, and, for some losses, closed form solutions are available. For example, for squared error, the result of the line search will yield the sample mean of the response variables in the leaf.
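The stage described above can be sketched in a few lines (our own illustration, using scikit-learn's regression trees on synthetic data). Under squared error, fitting a tree to the residuals already yields leaf means, which are exactly the per-leaf line-search solutions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=500)

# One boosting stage under squared error: the negative functional gradient is
# the residual, and the per-leaf "line search" solution is the mean residual
# in each leaf -- exactly what a regression tree fit to the residuals predicts.
current_pred = np.full_like(y, y.mean())   # initial model: the marginal mean
residual = y - current_pred                # negative gradient of squared error
tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
learning_rate = 0.1
new_pred = current_pred + learning_rate * tree.predict(X)

assert np.mean((y - new_pred) ** 2) < np.mean((y - current_pred) ** 2)
```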
We now consider adapting gradient boosting for prediction of parameters in the probabilistic forecasting context. We highlight two opportunities for improvements.
Splitting mismatch. Even when the loss to be optimized is not the squared error, the splitting criterion of the decision tree algorithm is generally the sample variance of the gradients, for efficiency of implementation. But the subsequent line search at each leaf node is then performed to minimize the chosen loss. Two examples are the algorithms described in friedman_greedy_2001 for least absolute deviation regression and second-order likelihood classification. Grouping training examples based on similarity of gradient while eventually assigning them a common curvature-adjusted gradient can be suboptimal. Ideally we seek a partition where the splitting criterion in the decision tree is consistent with the objective of the line search, while retaining the computational efficiency of the mean-squared-error criterion.
Multi-parameter boosting. It is a challenge to implement gradient boosting to estimate multiple parameters $\theta$ instead of a single conditional mean $\mathbb{E}[y|x]$. Using a single tree per stage with multiple parameter outputs per leaf node would not be ideal, since a splitting criterion based on the gradient of one parameter might be suboptimal with respect to the gradient of another parameter. We thus require one tree per parameter at each boosting stage. But with multiple trees, the differences between the resulting partitions would make per-leaf line searches impossible.
Gradient boosting for $K$-class classification effectively estimates multiple parameters at each stage. Since the parameters in this case are all symmetric, most implementations minimize a second-order approximation of the loss function with a diagonal Hessian. This breaks the line search problem down into $K$ independent one-parameter subproblems, where a separate line search can be implemented at each leaf of each of the $K$ trees, thereby sidestepping the issues of multi-parameter boosting. In a more general setting, like boosting for the two parameters of a Normal distribution, such an approximation would not work well.
While addressing the splitting mismatch is an incremental improvement (and has also been addressed by chen_xgboost:_2016; ke_lightgbm:_2017), allowing for multi-parameter boosting is a radical innovation that enables boosting-based probabilistic prediction in a generic way (including probabilistic regression and survival prediction). Our approach provides both of these improvements, as we describe next.
2.4 NGBoost: Natural Gradient Boosting
The NGBoost algorithm is a supervised learning method for probabilistic forecasting that approaches boosting from the point of view of predicting the parameters $\theta$ of the conditional probability distribution $P_\theta(y|x)$ as a function of $x$. Here $y$ could be one of several types (real-valued, positive real-valued, categorical, time-to-event, etc.) and $\theta$ is a vector in $\mathbb{R}^p$. In our experiments we focus mostly on real-valued outputs, though all of our methods are applicable to other modalities such as classification and time-to-event prediction.
The algorithm has three modular components, which are chosen upfront as a configuration:
Base learner ($f$),
Parametric probability distribution ($P_\theta$), and
Proper scoring rule ($S$).
A prediction on a new input $x$ is made in the form of a conditional distribution $P_\theta$, whose parameters $\theta$ are obtained by an additive combination of $M$ base learner outputs (corresponding to the $M$ gradient boosting stages) and an initial $\theta^{(0)}$. Note that $\theta$ can be a vector of parameters (not limited to be scalar valued), and they completely determine the probabilistic prediction $P_\theta(y|x)$. For example, when using the Normal distribution, $\theta = (\mu, \log\sigma)$ in our experiments. To obtain the predicted parameters for some $x$, each of the base learners $f^{(m)}$ takes $x$ as its input. Here $f^{(m)}$ collectively refers to the set of base learners, one per parameter, of stage $m$. For example, for a Normal distribution with parameters $\mu$ and $\log\sigma$, there will be two base learners, $f_\mu^{(m)}$ and $f_{\log\sigma}^{(m)}$, per stage, collectively denoted as $f^{(m)} = \left(f_\mu^{(m)}, f_{\log\sigma}^{(m)}\right)$. The predicted outputs are scaled with a stage-specific scaling factor $\rho^{(m)}$ and a common learning rate $\eta$:
$$\theta = \theta^{(0)} - \eta \sum_{m=1}^{M} \rho^{(m)} \cdot f^{(m)}(x)$$
The scaling factor $\rho^{(m)}$ is a single scalar, even if the distribution has multiple parameters. The model is learnt sequentially, one set of base learners $f^{(m)}$ and one scaling factor $\rho^{(m)}$ per stage. The learning algorithm starts by estimating a common $\theta^{(0)}$ that minimizes the sum of the scoring rule $S$ over the response variables of all training examples, essentially fitting the marginal distribution of $y$. This becomes the initial predicted parameter vector for all examples.
In each iteration $m$, the algorithm calculates, for each example $(x_i, y_i)$, the natural gradient $g_i^{(m)}$ of the scoring rule $S$ with respect to that example's predicted parameters up to that stage, $\theta_i^{(m-1)}$. Note that $g_i^{(m)}$ has the same dimension as $\theta$. The set of base learners $f^{(m)}$ for that iteration is fit to predict the corresponding components of the natural gradients $g_i^{(m)}$ from each $x_i$.
The output of the fitted base learner is the projection of the natural gradient onto the range of the base learner class. This projected gradient is then scaled by a scaling factor $\rho^{(m)}$, since local approximations might not hold true very far away from the current parameter position. The scaling factor is chosen to minimize the overall true scoring rule loss along the direction of the projected gradient, in the form of a line search. In practice, we found that implementing this line search by successive halving of $\rho^{(m)}$ (starting with $\rho^{(m)} = 1$) until the scaled gradient update results in a lower overall loss relative to the previous iteration works reasonably well and is easy to implement.
Once the scaling factor is determined, the predicted per-example parameters are updated to $\theta_i^{(m)}$ by stepping each $\theta_i^{(m-1)}$ along the scaled projected gradient, further scaled by a small learning rate $\eta$ (typically 0.1 or 0.01). The small learning rate helps fight overfitting by not stepping too far along the direction of the projected gradient in any particular iteration.
The pseudo-code is presented in Algorithm 1. For very large datasets, computational performance can easily be improved by randomly sub-sampling mini-batches within each boosting iteration.
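For illustration, here is a minimal end-to-end sketch of the algorithm (our own simplification, not the authors' released implementation), with a Normal distribution, the MLE scoring rule, and scikit-learn decision trees as base learners:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def nll(theta, y):
    """Sum of negative log-likelihoods; theta[:, 0] = mu, theta[:, 1] = log sigma."""
    mu, log_sigma = theta[:, 0], theta[:, 1]
    return np.sum(log_sigma + 0.5 * ((y - mu) / np.exp(log_sigma)) ** 2)

def natural_grads(theta, y):
    """Per-example natural gradient of the NLL; the Fisher here is diag(1/sigma^2, 2)."""
    mu, log_sigma = theta[:, 0], theta[:, 1]
    sigma2 = np.exp(2 * log_sigma)
    g_mu = -(y - mu) / sigma2
    g_ls = 1 - (y - mu) ** 2 / sigma2
    return np.column_stack([sigma2 * g_mu, g_ls / 2])   # inverse-Fisher scaling

def fit_ngboost(X, y, n_stages=100, lr=0.1):
    theta = np.tile([y.mean(), np.log(y.std())], (len(y), 1))  # fit marginal of y
    for _ in range(n_stages):
        g = natural_grads(theta, y)
        # One tree per distribution parameter, fit to that parameter's gradients.
        trees = [DecisionTreeRegressor(max_depth=3).fit(X, g[:, j]) for j in range(2)]
        proj = np.column_stack([t.predict(X) for t in trees])  # projected gradient
        rho = 1.0                      # line search by successive halving of rho
        while nll(theta - rho * proj, y) > nll(theta, y) and rho > 1e-4:
            rho /= 2
        theta = theta - lr * rho * proj
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = X[:, 0] + rng.normal(scale=0.5, size=400)
theta = fit_ngboost(X, y)  # per-example (mu, log sigma) on the training set
```

The sketch omits predicting on new inputs (which would store the trees and scaling factors), but it exercises every step of the loop: marginal initialization, per-example natural gradients, one tree per parameter, a single scalar line search, and a small learning rate.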
2.5 Qualitative Analysis and Discussion
The splitting mismatch phenomenon can occur when using decision trees as the base learner. NGBoost uses the natural gradient as the response variable for fitting the base learner. This is in contrast with friedman_greedy_2001, where the ordinary gradient is used, and with chen_xgboost:_2016, where a second-order Taylor approximation is minimized in a Newton-Raphson step. When NGBoost uses regression trees as the base learner, the sample variance of the natural gradient serves as the splitting criterion and the sample mean in each leaf node as the prediction. This is another way of making the splitting and line search criteria consistent, and it comes "for free" as a consequence of using the natural gradient, in a less decision-tree-specific way. This is a crucial aspect of NGBoost's modularity.
When the probability distribution is in the exponential family and the parameters are the natural parameters of that family, then a Newton-Raphson step is equivalent to a natural gradient descent step. However, in other parametrizations and distributions, the equivalence need not hold. This is especially important in the boosting context because, depending on the inductive biases of the base learners, certain parametrization choices may result in more suitable model spaces than others. For example, one setting we are particularly interested in is the two-parameter Normal distribution. Though it is in the exponential family, we use a mean ($\mu$) and log-scale ($\log\sigma$) parametrization for both ease of implementation and modeling convenience (to disentangle the magnitude of predictions from the uncertainty estimates). Since natural gradients are invariant to parametrization, this does not pose a problem, whereas the Newton-Raphson method would fail, as the problem is no longer convex in this parametrization.
For predicting multiple parameters of a probability distribution, overall line search (as opposed to per-leaf line search) is an inevitable consequence. NGBoost’s use of natural gradient makes this less of a problem as the gradients of all the examples come “optimally pre-scaled” (in both the relative magnitude between parameters, and across examples) due to the inverse Fisher Information factor. The use of ordinary gradients instead would be sub-optimal, as shown in Figure 4. With the natural gradient the parameters converge at approximately the same rate despite different conditional means and variances and different “distances” from the initial marginal distribution, even while being subjected to a common scaling factor in each iteration. We attribute this stability to the “optimal pre-scaling” property of the natural gradient.
3 Experiments

Table 1: Comparison of performance on regression benchmark UCI datasets: RMSE (left three columns) and NLL (right three columns); lower is better.

| Dataset | N | RMSE: MC dropout | RMSE: Deep Ensembles | RMSE: NGBoost | NLL: MC dropout | NLL: Deep Ensembles | NLL: NGBoost |
|---|---|---|---|---|---|---|---|
| Boston | 506 | 2.97 ± 0.85 | 3.28 ± 1.00 | 2.94 ± 0.53 | 2.46 ± 0.25 | 2.41 ± 0.25 | 2.43 ± 0.15 |
| Concrete | 1030 | 5.23 ± 0.53 | 6.03 ± 0.58 | 5.06 ± 0.61 | 3.04 ± 0.09 | 3.06 ± 0.18 | 3.04 ± 0.17 |
| Energy | 768 | 1.66 ± 0.19 | 2.09 ± 0.29 | 0.46 ± 0.06 | 1.99 ± 0.09 | 1.38 ± 0.22 | 0.60 ± 0.45 |
| Kin8nm | 8192 | 0.10 ± 0.00 | 0.09 ± 0.00 | 0.16 ± 0.00 | -0.95 ± 0.03 | -1.20 ± 0.02 | -0.49 ± 0.02 |
| Naval | 11934 | 0.01 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | -3.80 ± 0.05 | -5.63 ± 0.05 | -5.34 ± 0.04 |
| Power | 9568 | 4.02 ± 0.18 | 4.11 ± 0.17 | 3.79 ± 0.18 | 2.80 ± 0.05 | 2.79 ± 0.04 | 2.79 ± 0.11 |
| Protein | 45730 | 4.36 ± 0.04 | 4.71 ± 0.06 | 4.33 ± 0.03 | 2.89 ± 0.01 | 2.83 ± 0.02 | 2.81 ± 0.03 |
| Wine | 1588 | 0.62 ± 0.04 | 0.64 ± 0.04 | 0.63 ± 0.04 | 0.93 ± 0.06 | 0.94 ± 0.12 | 0.91 ± 0.06 |
| Yacht | 308 | 1.11 ± 0.38 | 1.58 ± 0.48 | 0.50 ± 0.20 | 1.55 ± 0.12 | 1.18 ± 0.21 | 0.20 ± 0.26 |
| Year MSD | 515345 | 8.85 ± NA | 8.89 ± NA | 8.94 ± NA | 3.59 ± NA | 3.35 ± NA | 3.43 ± NA |
Our experiments use datasets from the UCI Machine Learning Repository, and follow the protocol first proposed in hernandez-lobato_probabilistic_2015. For all datasets, we hold out a random 10% of the examples as a test set. From the other 90% we initially hold out 20% as a validation set to select $M$ (the number of boosting stages) that gives the best log-likelihood, and then re-fit on the entire 90% using the chosen $M$. The re-fit model is then evaluated on the held-out 10% test set. This entire process is repeated 20 times for all datasets except Protein and Year MSD, for which it is repeated 5 times and 1 time respectively. The Year MSD dataset, being extremely large relative to the rest, was fit using a learning rate of 0.1, while the rest of the datasets were fit with a learning rate of 0.01. In general we recommend small learning rates, subject to computational feasibility.
Traditional predictive performance is captured by the root mean squared error (RMSE) of the forecasted means (i.e. $\mathbb{E}_\theta[y|x]$) as estimated on the test set. The quality of predictive uncertainty is captured by the average negative log-likelihood (NLL) (i.e. $-\log P_\theta(y|x)$) as measured on the test set.
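Concretely, with Normal forecasts these two metrics reduce to a few lines (our own helper functions, shown for clarity):

```python
import numpy as np
from scipy import stats

def rmse(y, mu):
    """Root mean squared error of the forecasted means."""
    return float(np.sqrt(np.mean((y - mu) ** 2)))

def avg_nll(y, mu, sigma):
    """Average negative log-likelihood of predicted Normal forecasts."""
    return float(-np.mean(stats.norm(loc=mu, scale=sigma).logpdf(y)))

y = np.array([1.0, 2.0, 3.0])
mu = np.array([1.1, 1.9, 3.2])
sigma = np.array([0.5, 0.5, 0.5])
print(rmse(y, mu), avg_nll(y, mu, sigma))
```

Note that RMSE depends only on the predicted means, while NLL also penalizes over- or under-confident scale estimates, which is why the two metrics can rank methods differently.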
Though our methods are most similar to gradient boosting approaches, the problem we are trying to address is probabilistic prediction. Hence, our empirical results are on probabilistic prediction tasks and datasets, and likewise our comparison is against other probabilistic prediction methods. We compare our results against MC dropout (gal_dropout_2016) and Deep Ensembles (lakshminarayanan_simple_2017), as they are the most comparable in simplicity and approach. MC dropout fits a neural network to the dataset and interprets Bernoulli dropout as a variational approximation, obtaining predictive uncertainty by integrating over Monte Carlo samples. Deep Ensembles fit an ensemble of neural networks to the dataset and obtain predictive uncertainty by making an approximation to the Gaussian mixture arising out of the ensemble.
Our results are summarized in Table 1. For all results, NGBoost was configured with the Normal distribution, decision tree base learner with a maximum depth of three levels, and MLE scoring rule.
4 Related Work
Calibrated uncertainty estimation. Approaches to probabilistic prediction can broadly be distinguished as Bayesian or non-Bayesian. Bayesian approaches (which include a prior and perform posterior inference) that leverage decision trees for tabular datasets include chipman_bart:_2010 and lakshminarayanan_mondrian_2016. A non-Bayesian approach similar to our work is lakshminarayanan_simple_2017, which takes a parametric approach to training heteroskedastic uncertainty models. Such a heteroskedastic approach to capturing uncertainty has also been called aleatoric uncertainty estimation (kendall_what_2017). Uncertainty that arises due to dataset shift or out-of-distribution inputs (ovadia_can_2019) is not in the scope of our work. Post-hoc calibration techniques such as Platt scaling have also been proposed (guo_calibration_2017; kuleshov_accurate_2018; kumar_learning_2019), though we focus on learning models that are naturally calibrated. We note, however, that such post-hoc calibration techniques are not incompatible with our proposed methods.
Proper scoring rules. gneiting_strictly_2007 proposed the paradigm of maximizing sharpness subject to calibration in probabilistic forecasting, and introduced the CRPS scoring rule for meteorology. Recently avati_countdown_2019 extended the CRPS to the time-to-event setting and gebetsberger_estimation_2018 empirically examined its tradeoffs relative to MLE. We extend this line of work by introducing the natural gradient to CRPS and making practical recommendations for regression tasks.
Gradient Boosting. friedman_greedy_2001 proposed the gradient boosting framework, though its use has primarily been in homoskedastic regression as opposed to probabilistic forecasting. We are motivated in part by the empirical success of decision tree base learners, which have favorable inductive biases for tabular datasets (such as Kaggle competitions and electronic health records). Popular scalable implementations of tree-based boosting methods include chen_xgboost:_2016; ke_lightgbm:_2017.
5 Conclusion

We propose a method for probabilistic prediction (NGBoost) and demonstrate state-of-the-art performance on a variety of datasets. NGBoost combines a multi-parameter boosting algorithm with the natural gradient to efficiently estimate how parameters of the presumed outcome distribution vary with the observed features. NGBoost is flexible, modular, easy-to-use, and fast relative to existing methods for probabilistic prediction.
There are many avenues for future work. The natural gradient loses its invariance property with finite step sizes, which could be addressed with differential equation solvers for higher-order invariance (song_accelerating_2018). Better tree-based base learners and regularization (e.g. chen_xgboost:_2016; ke_lightgbm:_2017) are worth exploring. Many time-to-event predictions are made from tabular datasets, and we expect NGBoost to perform well in such settings too (schmid_flexible_2008).
This work was funded in part by the National Institutes of Health.