Interpretability is Harder in the Multiclass Setting: Axiomatic Interpretability for Multiclass Additive Models

Abstract

Generalized additive models (GAMs) are favored in many regression and binary classification problems because they are able to fit complex, nonlinear functions while still remaining interpretable. In the first part of this paper, we generalize a state-of-the-art GAM learning algorithm based on boosted trees to the multiclass setting, and show that this multiclass algorithm outperforms existing GAM fitting algorithms and sometimes matches the performance of full complex models.

In the second part, we turn our attention to the interpretability of GAMs in the multiclass setting. Surprisingly, the natural interpretability of GAMs breaks down when there are more than two classes. Drawing inspiration from binary GAMs, we identify two axioms that any additive model must satisfy to not be visually misleading. We then develop a post-processing technique (API) that provably transforms pretrained additive models to satisfy the interpretability axioms without sacrificing accuracy. The technique works not just on models trained with our algorithm, but on any multiclass additive model. We demonstrate API on a 12-class infant-mortality dataset.

1 Introduction

Interpretable models, though sometimes less accurate than black-box models, are preferred in many real-world applications. In criminal justice, finance, hiring, and other domains that impact people’s lives, interpretable models are often used because their transparency helps determine if a model is biased or unsafe [Zeng, Ustun, and Rudin 2016; Tan et al. 2017]. And in critical applications such as healthcare, where human experts and machine learning models often work together, being able to understand, learn from, edit and trust the learned model is also important [Caruana et al. 2015].

Generalized additive models (GAMs) are among the most powerful interpretable models when individual features play major effects [Hastie and Tibshirani 1990; Lou, Caruana, and Gehrke 2012]. In the binary classification setting, we consider standard GAMs with logistic probabilities: $P(y=1 \mid x) = 1/(1 + e^{-F(x)})$, where the logit $F(x)$ is an additive function of individual features, $F(x) = \sum_{j=1}^{p} f_j(x_j)$. Here, $x_j$ is the $j$-th feature of data point $x$, and we denote $f_j$ the shape function of feature $j$ for the positive class. Previously, Lou, Caruana, and Gehrke (2012) evaluated various GAM fitting algorithms, and found that gradient boosting with shallow trees restricted to one feature at a time outperformed other methods on a number of regression and binary classification datasets. They call their model iGAM. The first part of this paper generalizes iGAM to the multiclass setting. We consider standard GAMs with softmax probabilities:

$$P(y = k \mid x) = \frac{e^{F_k(x)}}{\sum_{k'=1}^{K} e^{F_{k'}(x)}}, \qquad (1)$$

where the logit of class $k$, $F_k(x) = \sum_{j=1}^{p} f_{jk}(x_j)$, is also an additive function of individual features, and $f_{jk}$ is the shape function of feature $j$ for class $k$. We present our multiclass GAM fitting algorithm in Section 3.1. In Section 3.2, we empirically evaluate its performance on large-scale real-world datasets.
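To make the softmax form (1) concrete, here is a minimal sketch of how a multiclass GAM turns per-feature, per-class shape functions into class probabilities. The representation of shape functions as plain Python callables, and the toy feature and class counts, are illustrative assumptions rather than details of our implementation.

import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gam_predict_proba(x, shape_functions):
    # shape_functions[j][k] is the shape function f_jk for feature j and class k,
    # here represented as any callable from a scalar feature value to a logit contribution.
    p, K = len(shape_functions), len(shape_functions[0])
    logits = np.array([sum(shape_functions[j][k](x[j]) for j in range(p))
                       for k in range(K)])
    return softmax(logits)

# Toy usage: 2 features, 3 classes, hand-picked shape functions (illustrative only).
shapes = [
    [lambda v: 0.5 * v, lambda v: -0.2 * v, lambda v: 0.0 * v],
    [lambda v: np.sin(v), lambda v: 0.1 * v, lambda v: -np.sin(v)],
]
print(gam_predict_proba(np.array([1.0, 2.0]), shapes))  # probabilities sum to 1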

Figure 1: Age shape in the pneumonia example [Caruana et al. 2015]. (a) Binary GAM age shape; (b) multiclass GAM age shape.
Figure 2: GAM shape functions for a toy 3-class problem. (a)-(d) Toy models 1-4; (e) toy models after API; (f) toy class probabilities.

Binary GAMs are readily interpretable because the influence of each feature on the outcome is captured by a single 1-d shape function that is easily visualized. For example, Figure 1 shows the relationship between risk of pneumonia death and age.¹ When interpreting shape functions like this, practitioners often focus on two key factors: the trend of the curve and the existence of jumps (if the feature is continuous). For example, the ‘age’ plot in Figure 1 might be interpreted by a physician as: “Risk is low and constant from age 18-50, rises slowly from age 50-67, and then rises quickly from age 67-90. There is a small jump in risk at about age 67, a few years after typical retirement age…” In a binary GAM, the rising, falling and jumps in each shape function faithfully correspond to the rising, falling and jumps in the predicted probability, so this kind of summary is a faithful representation of the model’s predictions.

In the multiclass setting, however, the influence of feature $j$ on class $k$ is no longer captured by a single shape function $f_{jk}$, but through the interaction of all the shape functions $f_{jk'}$ for that feature, $k' = 1, \ldots, K$. In particular, even if the logit for class $k$ is increasing, the probability for class $k$ might still decrease if the logits for other classes are increasing more rapidly. As a result, the learned shape functions can be visually misleading. For example, Figure 2a-d show the shape functions of 3-class toy GAM models with only one feature. Each model appears to have very different shape functions: (2a) some falling, some rising, (2b) all decreasing, (2c) decreasing and then increasing, or (2d) oscillating. Interestingly, however, all of these models make identical predictions. Because these models have only one feature, we can plot class probabilities as functions of the feature value (this is not possible with multiple features). In Figure 2f, one class's probability is monotonically increasing while the other two classes' probabilities are monotonically decreasing, which is vastly different from the shape functions in 2a-d. This problem, if not solved, greatly reduces the intelligibility of GAMs in multiclass problems.
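The following sketch illustrates why such visually different shape functions can encode the same model: under a softmax, adding any function of the feature value to all classes' shape functions for that feature leaves the predicted probabilities unchanged (the equivalence property formalized in Section 4.2). The particular shape and shift functions below are made up for illustration.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

v = np.linspace(-3, 3, 200)                      # values of the single feature
base = np.stack([v, -0.5 * v, 0.2 * v ** 2], 1)  # one column of logits per class

# Second model: add the SAME arbitrary function of v to every class's shape.
shift = 2.0 * np.sin(v) - v ** 2                 # arbitrary, for illustration
shifted = base + shift[:, None]

p1, p2 = softmax(base), softmax(shifted)
print(np.allclose(p1, p2))  # True: identical predictions despite very different shapes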

The second half of this paper focuses on mitigating the unintelligibility of multiclass GAMs. We start by examining how users interpret binary GAMs, and identify a set of interpretability axioms — criteria that GAM shapes should satisfy to guarantee interpretability. We then present properties of additive models that make it possible to regain interpretability. Making use of these properties, we design an Additive Post-Processing for Interpretability (API) method that provably transforms any pretrained additive model to satisfy the axioms of interpretability without sacrificing predictive accuracy. Figure 2e shows the shape functions that result from passing any of models 2a-d through API. After API, the now canonical shape functions successfully match the probability trend for the corresponding classes in Figure 2f.

2 Notation and Problem Definition

In this section, we define notation that will be used throughout the paper. We focus on multiclass classification where $\mathcal{X}$ is the input space, $\mathcal{Y}$ is the output space, and $K$ is the number of classes. Let $D = \{(x_i, y_i)\}_{i=1}^{n}$ denote a training set of size $n$, where each $x_i$ is a feature vector with $p$ features and $y_i \in [K]$ is the target. For $k \in [K]$, let $w_k$ denote the empirical proportion of class $k$ in $D$, where $[K]$ denotes the set $\{1, \ldots, K\}$. Given a model $\mathcal{F}$, let $\mathcal{F}(x_i)$ denote the prediction of the model on data point $x_i$. Our learning objective is to minimize the expected value of some loss function $\ell(y, \mathcal{F}(x))$. In multiclass classification, the model output is a probability distribution over the $K$ classes, $\{p_k(x)\}_{k=1}^{K}$. We use multiclass cross entropy as our loss function:

$$\ell(y, \mathcal{F}(x)) = -\sum_{k=1}^{K} \mathbb{1}[y = k] \log p_k(x). \qquad (2)$$
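As a quick illustration of the loss in equation (2), the short helper below computes the mean multiclass cross entropy for a batch of predicted probability vectors; the array layout and names are ours, not the paper's.

import numpy as np

def multiclass_cross_entropy(y, proba, eps=1e-12):
    # y: integer labels of shape (n,); proba: predicted probabilities of shape (n, K).
    picked = proba[np.arange(len(y)), y]  # probability assigned to each true class
    return float(-np.mean(np.log(np.clip(picked, eps, 1.0))))

y = np.array([0, 2, 1])
proba = np.array([[0.70, 0.20, 0.10],
                  [0.10, 0.30, 0.60],
                  [0.25, 0.50, 0.25]])
print(multiclass_cross_entropy(y, proba))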

We focus on generalized additive models of the form (1) with softmax probabilities. We denote $\mathcal{F} = \{f_{jk}\}_{j \in [p], k \in [K]}$ as the set of shape functions for a multiclass GAM model, and also use $\mathcal{F}$ for the model itself. Throughout the paper, we make the following assumptions on the multiclass shape functions $f_{jk}$. For a continuous feature $j$, the domain of $f_{jk}$ is a finite interval; for categorical or ordinal features, the domain is a finite ordered discrete set. We denote the domain of feature $j$ as $V_j$. For the API post-processing method (Section 4.3), we also assume that shape functions of continuous features are continuous everywhere except for a finite number of points². Finally, we overload the operator $\nabla_v$ as follows: in the continuous domain, $\nabla_v f(v) = f'(v)$ when the shape functions are all continuous at $v$, and $\nabla_v f(v) = f(v^+) - f(v^-)$ when some of them are discontinuous at $v$; in the discrete domain, $\nabla_v f(v) = f(v^{+1}) - f(v)$, where $v^{+1}$ denotes the immediate next value.

3 Multiclass GAM Learning via Cyclic Gradient Boosting

We now describe the training procedure for MC-iGAM, our generalization of binary iGAM [Lou, Caruana, and Gehrke 2012] to the multiclass setting. Like Lou et al., we use bagged trees as the base learner for boosting, with largest variance reduction as the splitting criterion, and control tree complexity by limiting the number of leaves $L$.

3.1 Cyclic Gradient Boosting

Our optimization procedure is cyclic gradient boosting [Buhlmann and Yu 2003; Lou, Caruana, and Gehrke 2012], a variant of standard gradient boosting [Friedman 2001] where features are cycled through sequentially to learn individual shape functions for each feature. Algorithm 1 presents our algorithm for cyclic gradient boosting in the multiclass setting.

In standard gradient boosting, each boosting step fits a base learner to the pseudo-residual, the negative gradient of the loss in function space [Friedman 2001]. In the multiclass setting with cross-entropy loss (Equation (2)) and softmax probabilities (Equation (1)), the pseudo-residual of data point $i$ for class $k$ is:

$$r_{ik} = \mathbb{1}[y_i = k] - p_k(x_i), \qquad p_k(x_i) = \frac{e^{F_k(x_i)}}{\sum_{k'=1}^{K} e^{F_{k'}(x_i)}}.$$

Adding the fitted base learner (multiplied by a typically small constant $\eta$) to the ensemble corresponds to taking an approximate gradient step in function space with learning rate $\eta$. However, as suggested by Friedman et al., to speed up computation one can instead take an approximate Newton step using a diagonal approximation to the Hessian [Friedman, Hastie, and Tibshirani 2000]. The resulting additive update to learn a multiclass GAM then becomes:

$$f_{jk}(x_j) \;\leftarrow\; f_{jk}(x_j) + \eta \sum_{l} \gamma_{lk} \, \mathbb{1}[x_j \in R_l], \qquad (3)$$
$$\gamma_{lk} \;=\; \frac{K-1}{K} \cdot \frac{\sum_{i \in R_l} r_{ik}}{\sum_{i \in R_l} |r_{ik}| \, (1 - |r_{ik}|)}, \qquad (4)$$

for $k = 1, \ldots, K$, where $R_l$ is the set of training points falling in tree leaf $l$ for the current feature $j$. Applying the above boosting procedure cyclically to individual features gives our multiclass cyclic boosting algorithm (Algorithm 1).

1: Initialize $f_{jk} \leftarrow 0$ for all features $j$ and classes $k$.
2: for each boosting iteration $m = 1$ to $M$ do
3:     for each feature $j = 1$ to $p$ do
4:         Compute pseudo-residuals $r_{ik}$ for all data points $i$ and classes $k$.
5:         for each bag $b = 1$ to $B$ do
6:             Create bootstrap sample $D_b$ from the training set $D$.
7:             Learn a tree with $L$ leaf nodes on bootstrap sample $D_b$, splitting only on feature $j$.
8:             Compute its leaf values using equation (4).
9:         Update $f_{jk}$ with the bag-averaged tree output scaled by $\eta$, for $k = 1, \ldots, K$.
Algorithm 1 Multiclass GAM Learning via Cyclic Gradient Boosting (MC-iGAM)
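To make one cyclic boosting pass concrete, here is a simplified sketch in the spirit of Algorithm 1. It uses sklearn's DecisionTreeRegressor on a single feature as the base learner and the bag-averaged tree prediction as the per-class update; the exact leaf-value formula (4), the bagging details, and the data structures in our implementation may differ, so treat this as an illustrative approximation rather than the released code.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cyclic_boosting_pass(X, y, F, eta=0.01, n_leaves=4, n_bags=8, seed=0):
    # One pass over all features. F holds the current logits, shape (n, K).
    rng = np.random.default_rng(seed)
    n, p = X.shape
    K = F.shape[1]
    Y = np.eye(K)[y]                                  # one-hot labels, shape (n, K)
    for j in range(p):                                # cycle through features
        R = Y - softmax(F)                            # pseudo-residuals, shape (n, K)
        for k in range(K):
            update = np.zeros(n)
            for _ in range(n_bags):                   # bagged single-feature trees
                idx = rng.integers(0, n, size=n)      # bootstrap sample
                tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves)
                tree.fit(X[idx, j:j + 1], R[idx, k])  # fit residual of class k on feature j only
                update += tree.predict(X[:, j:j + 1])
            F[:, k] += eta * update / n_bags          # additive update for feature j, class k
    return F

In a full implementation, the per-feature, per-class tree ensembles would be accumulated into the shape functions $f_{jk}$ (so the model remains a GAM that can be plotted feature by feature) rather than into a raw logit matrix, and the leaf values would follow equation (4).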

Hyperparameters.

We found the following hyperparameters for MC-iGAM to be high-performing across many datasets, including ones not included in this paper due to space constraints: the learning rate $\eta$, the number of leaves per tree $L$, the number of bagged trees in each base learner $B$, and the number of boosting iterations $M$, with early stopping based on held-out validation loss. These are the default hyperparameter choices in our soon-to-be-released software package.

3.2 Accuracy on Real Datasets

In this section, we evaluate MC-iGAM against other multiclass baselines. We select five datasets with interpretable features and different numbers of classes, features, and data points; Table 1 describes them. Diabetes, Covertype, Sensorless and Shuttle are from the UCI repository; SIDS is from the Centers for Disease Control and Prevention³. We use normalized Shannon entropy to report the degree of imbalance in each dataset: a value of 1 indicates a perfectly balanced dataset (same number of points per class) while a value of 0 denotes a perfectly unbalanced dataset (all points in one class).

Dataset Classes Features Size
SIDS
Diabetes
Covertype
Sensorless
Shuttle
Table 1: Dataset characteristics.
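The normalized Shannon entropy reported above can be computed directly from the class labels; a minimal sketch (our own helper, not the paper's code) is shown below.

import numpy as np

def normalized_entropy(y):
    # Normalized Shannon entropy of the class distribution: 1 = balanced, 0 = all one class.
    _, counts = np.unique(y, return_counts=True)
    w = counts / counts.sum()
    if len(w) == 1:
        return 0.0
    return float(-(w * np.log(w)).sum() / np.log(len(w)))

print(normalized_entropy([0, 0, 1, 1, 2, 2]))  # 1.0 (perfectly balanced)
print(normalized_entropy([0, 0, 0, 0, 0, 1]))  # about 0.65 for this tiny example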

Baselines.

Model SIDS Diabetes Covertype Sensorless Shuttle
Balanced Accuracy on Test Sets
GBT
MC-iGAM
MGCV
LR
Cross-Entropy Loss on Test Sets
GBT
MC-iGAM
MGCV
LR
Table 2: Accuracy of MC-iGAM compared to three baselines on five datasets.

We compare MC-iGAM to three baselines:

  • Multiclass logistic regression (LR), the simplest multiclass additive model, to tell us how much accuracy improvement is due to the non-linearity of MC-iGAM. We use the Python sklearn implementation.

  • Multiclass gradient boosted trees (GBT), an unrestricted, full-complexity model to get a sense of how much accuracy we sacrifice to gain the interpretability of GAMs. We use the xgboost implementation [Chen and Guestrin 2016] and tune hyperparameters using random search.

  • GAMs with splines (MGCV), a widely-used R package that fits GAMs with spline-based learners using a penalized likelihood procedure [Wood 2011]. Unfortunately, as noted in the documentation⁴ and confirmed in our experiments, mgcv's multiclass GAM fitting procedure does not scale beyond several thousand data points and five classes. Therefore, for each class $k$ we trained a GAM with binary targets to predict whether a point belongs to class $k$, then generated multiclass predictions for each point by normalizing the per-class probabilities to sum to one (a short sketch of this normalization follows the list).
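A minimal sketch of the normalization used for the MGCV baseline: $K$ independently trained one-vs-rest probabilities are rescaled so each point's class probabilities sum to one. The array names are ours.

import numpy as np

def one_vs_rest_to_multiclass(binary_proba):
    # binary_proba: shape (n, K), entry [i, k] = P(point i belongs to class k)
    # from K independently trained binary GAMs. Returns normalized multiclass probabilities.
    return binary_proba / binary_proba.sum(axis=1, keepdims=True)

p_binary = np.array([[0.9, 0.3, 0.2],
                     [0.1, 0.7, 0.4]])
print(one_vs_rest_to_multiclass(p_binary))  # each row now sums to 1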

Experimental design.

For each dataset, we generated five 80-20% train-test splits to account for potential variability between test sets, and report the mean and standard deviation of each metric over the splits. We track two performance metrics on the test sets: balanced accuracy and cross-entropy loss. Balanced accuracy addresses the imbalance of classes in classification tasks [Brodersen et al. 2010]: it is the unweighted average of the per-class recalls, $\frac{1}{K} \sum_{k=1}^{K} \frac{\#\text{ correctly predicted points of class } k}{\#\text{ points of class } k}$.
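For reference, balanced accuracy can be computed as the mean per-class recall; the short helper below (our own, equivalent to sklearn's balanced_accuracy_score) illustrates it.

import numpy as np

def balanced_accuracy(y_true, y_pred):
    # Mean of per-class recall over the classes present in y_true.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == k] == k) for k in np.unique(y_true)]
    return float(np.mean(recalls))

print(balanced_accuracy([0, 0, 0, 1, 2], [0, 0, 1, 1, 2]))  # (2/3 + 1 + 1) / 3 ≈ 0.889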

Results.

The results are shown in Table 2. The top half of the table reports the balanced accuracy of each model on the five datasets; the bottom half reports the cross-entropy loss on the test sets. Several clear patterns emerge:
(1) There is a large performance gap between the linear model (LR) and MC-iGAM under both metrics. On four out of five datasets, the performance of MC-iGAM is much closer to the full-complexity model (GBT) than to the linear model, suggesting MC-iGAM benefits from the additional non-linearity.
(2) MC-iGAM consistently outperforms MGCV across all five datasets on both metrics, supporting the earlier finding [Lou, Caruana, and Gehrke 2012] that boosted trees are more accurate than splines as GAM base models.
(3) Interestingly, on datasets with very imbalanced classes (SIDS and Shuttle), MC-iGAM still performs reasonably well compared to GBT, even though no method countering class imbalance (e.g. loss function re-weighting) is used.

4 Interpretability of Multiclass GAMs

Multiclass GAMs are fundamentally hard to interpret because each class's prediction necessarily involves the shape functions of all classes. However, research has found that human perception cannot effectively dissect interactions between more than a few function curves [Javed, McDonnel, and Elmqvist 2010]. Therefore, we need a way to allow each shape function to be viewed individually, while still conveying useful and faithful information about the model's predictions. To do so, we first revisit the binary classification setting and define what 'useful and faithful information' is. Throughout this section we will use the notation defined in Section 2.

4.1 Axioms of Interpretability: Inspiration from Binary GAMs

What information do people gain from binary shape functions and what aspect of shape functions carries that information? As demonstrated in the pneumonia example in Figure 1, when practitioners look at a binary GAM shape plot, they try to determine which feature values contribute positively or negatively to the outcome by looking at trends in different regions of the feature’s domain. They also look for jumps in the shape function that indicate sudden increase or decrease in predicted probability. For example, one might expect the influence of age on pneumonia risk to be smooth — one’s health at age 67 should not be dramatically different than at age 66 — and the appearance of jumps may hint at the existence of hidden variables such as retirement that warrant further investigation. Because human perception naturally focuses on discontinuities in otherwise smooth curves, it is important for shape functions to be smooth when possible.

In binary GAMs, the rising and falling trends and jumps of individual shape functions faithfully represent the trend and jumps of the model’s predictions. We would like to be able to interpret multiclass GAMs the same way. To achieve this, we propose two interpretability axioms that every multiclass GAM should satisfy.
A1: The axiom of monotonicity asks that, for each feature, the trend of each class's shape function should match the trend of the 'average' predicted probability of that class. Mathematically:

Definition 1 (The axiom of monotonicity).

For each class $k$, feature $j$ and feature value $v$, denote the marginal distribution of points satisfying $x_j = v$ as $D_{jv}$. Then, a multiclass GAM $\mathcal{F}$ satisfies the axiom of monotonicity if

$$\operatorname{sign}\big(\nabla_v f_{jk}(v)\big) = \operatorname{sign}\big(\nabla_v \, \mathbb{E}_{x \sim D_{jv}}[p_k(x)]\big), \qquad (5)$$

for all $k \in [K]$, $j \in [p]$, and $v \in V_j$.
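A discrete-feature sketch of checking condition (5) empirically: for each consecutive pair of feature values, compare the sign of the change in the shape function with the sign of the change in the average predicted probability among points taking that value. The helper below is our own and assumes the shape function is stored as a per-value array.

import numpy as np

def check_monotonicity(values, shape_jk, X, proba_k, j):
    # Axiom A1 check for one (feature j, class k) pair on a discrete feature.
    # values   : sorted distinct values of feature j (each assumed to occur in X[:, j])
    # shape_jk : shape_jk[s] = f_jk(values[s])
    # proba_k  : model's predicted probability of class k for every point, shape (n,)
    avg_p = np.array([proba_k[X[:, j] == v].mean() for v in values])
    d_shape = np.diff(shape_jk)   # discrete nabla of the shape function
    d_prob = np.diff(avg_p)       # discrete nabla of the average class probability
    return bool(np.all(np.sign(d_shape) == np.sign(d_prob)))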

A2: The axiom of smoothness asks that the shape functions do not have any artificial or unnecessary jumps or unsmoothness. Mathematically:

Definition 2 (The axiom of smoothness).

$\mathcal{F}$ satisfies the axiom of smoothness if

$$\mathcal{F} \in \operatorname*{arg\,min}_{\mathcal{G} \in [\mathcal{F}]} \; \sum_{j=1}^{p} \sum_{k=1}^{K} \Omega(g_{jk}), \qquad (6)$$

where $\Omega$ is some smoothness metric and $[\mathcal{F}]$ denotes the equivalence class of $\mathcal{F}$, defined in the next section.

To measure the smoothness of 1-d functions such as our shape functions, we use quadratic variation:

Definition 3 (Quadratic Variation).

For a function $f$ defined on a finite ordered discrete domain $\{v_1 < \cdots < v_S\}$ of size $S$, the quadratic variation is

$$\Omega(f) = \sum_{s=1}^{S-1} \big(f(v_{s+1}) - f(v_s)\big)^2.$$

For a function $f$ defined on a continuous interval with a finite set of discontinuity points $\{d_1, \ldots, d_T\}$, the quadratic variation is

$$\Omega(f) = \int \big(f'(v)\big)^2 \, dv \;+\; \sum_{t=1}^{T} \big(f(d_t^+) - f(d_t^-)\big)^2.$$
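For a shape function stored on a discrete grid, the quadratic variation above is simply a sum of squared successive differences; a one-line helper (ours) follows.

import numpy as np

def quadratic_variation(values):
    # Quadratic variation of a shape function sampled on an ordered grid.
    return float(np.sum(np.diff(np.asarray(values)) ** 2))

print(quadratic_variation([0.0, 0.5, 0.4, 1.0]))  # 0.25 + 0.01 + 0.36 = 0.62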

Does there exist a multiclass GAM model that satisfies both axioms? Figure 1 in Section 1 shows an example. Transforming the binary pneumonia GAM (Figure 1a) into a multiclass GAM with two classes (Figure 1b) replaces the single shape function per feature with one shape function per class per feature. The blue curve, representing the risk of death, is exactly the same as the binary age shape and is therefore faithful to the model prediction. The orange curve, representing the 'risk' of surviving, is exactly the mirror image of the risk of death. Since in the binary case the probability of death is always one minus the probability of survival, the orange curve is faithful to its own class as well.

4.2 Leveraging Key Properties of Multiclass GAMs to Regain Interpretability

We have proposed two axioms satisfied by binary GAMs that multiclass GAMs should also satisfy in order to not be visually misleading, and provided an example of a (two-class) multiclass GAM model that satisfies these axioms. We now highlight two key properties shared by all multiclass GAM models that we will leverage in Section 4.3 to post-process any multiclass GAM model to satisfy these axioms. These properties stem from the softmax formulation (Equation (1)) used by these models.

P1: Equivalence class of multiclass GAMs. Different GAMs can produce equivalent model predictions. In particular, we have the following equivalence relationship:

Corollary 1.

Let $\mathcal{F} = \{f_{jk}\}$ and $\bar{\mathcal{F}} = \{\bar{f}_{jk}\}$ be two GAMs defined as

$$\bar{f}_{jk}(x_j) = f_{jk}(x_j) + g_j(x_j), \qquad j \in [p], \; k \in [K],$$

for some arbitrary functions $g_j$. Then, $\mathcal{F}$ and $\bar{\mathcal{F}}$ are equivalent in terms of model prediction, and we define the equivalence class of $\mathcal{F}$ as $[\mathcal{F}] = \{\bar{\mathcal{F}} : \bar{f}_{jk} = f_{jk} + g_j \text{ for some } g_j\}$.

Proof.

Notice that unlike the binary GAM's logistic probabilities, softmax probabilities are invariant to a constant shift of the logits, because the softmax is overparametrized. Therefore we can add a constant $c$ to all logits without changing the predicted probability, i.e.

$$\frac{e^{F_k(x) + c}}{\sum_{k'=1}^{K} e^{F_{k'}(x) + c}} = \frac{e^{F_k(x)}}{\sum_{k'=1}^{K} e^{F_{k'}(x)}}, \qquad \forall k.$$

For any fixed data point, $g_j(x_j)$ is exactly such a constant added to every class's logit, so the predictions of $\mathcal{F}$ and $\bar{\mathcal{F}}$ agree. ∎

We will use this invariance property in our additive post-processing (API) method presented in Section 4.3 to find a more interpretable equivalent to $\mathcal{F}$.

P2: Ranking consistency between shape functions and class probabilities. Another characteristic of the softmax is the ranking consistency between the change in shape function values and the change in predicted class probability.

Corollary 2.

Let $x$ and $x'$ be two data points sharing the exact same feature values except for one particular feature $j$. Let $\Delta_k = F_k(x') - F_k(x) = f_{jk}(x'_j) - f_{jk}(x_j)$ be the differences between their corresponding logits due to the difference in feature $j$. Then, the ranking of $\Delta_k$ across $k$ is consistent with the ranking of the ratios of predicted probabilities $P(y=k \mid x') / P(y=k \mid x)$ across $k$.

Proof.

Simple calculation shows that $\frac{P(y=k \mid x')}{P(y=k \mid x)} = C \, e^{\Delta_k}$ for all $k$, where $C = \frac{\sum_{k'} e^{F_{k'}(x)}}{\sum_{k'} e^{F_{k'}(x')}}$ does not depend on $k$. Now, suppose that $\Delta_{k_1} \geq \Delta_{k_2}$ for some particular $k_1, k_2$; then we have

$$C \, e^{\Delta_{k_1}} \;\geq\; C \, e^{\Delta_{k_2}}, \qquad (7)$$

which implies that

$$\frac{P(y=k_1 \mid x')}{P(y=k_1 \mid x)} \;\geq\; \frac{P(y=k_2 \mid x')}{P(y=k_2 \mid x)}. \qquad (8)$$

This property holds for all $(k_1, k_2)$ pairs. ∎

This ranking consistency property will come in useful in the optimization of our API method (cf. Section 4.3).

4.3 Additive Post-Processing for Interpretability

We now describe a post-processing method, API, that leverages the softmax's properties (cf. Section 4.2) to modify any multiclass additive model to regain interpretability (cf. Section 4.1) while keeping its predictions unchanged. Given a pretrained GAM model $\mathcal{F}$, API finds an equivalent additive model $\bar{\mathcal{F}} \in [\mathcal{F}]$ that satisfies the axiom of monotonicity while fulfilling the minimization condition of the axiom of smoothness. We formulate this as a constrained optimization problem in functional space: find the set of shape functions $\{\bar{f}_{jk}\} = \{f_{jk} + g_j\}$ defining $\bar{\mathcal{F}}$ that minimizes objective (6) while satisfying condition (5):

$$\min_{\{g_j\}} \;\; \sum_{j=1}^{p} \sum_{k=1}^{K} \Omega\big(f_{jk} + g_j\big) \qquad (9)$$
$$\text{s.t.} \;\; \operatorname{sign}\big(\nabla_v (f_{jk}(v) + g_j(v))\big) = \operatorname{sign}\big(\nabla_v \, \mathbb{E}_{x \sim D_{jv}}[p_k(x)]\big) \quad \forall j, k, v. \qquad (10)$$

Before we discuss how to solve this optimization problem, we first show that there is a solution:

Theorem 1.

Condition (10) is feasible.

Proof.

Let $j$ be a feature and $x$ be a data point with $x_j = v$. Due to space constraints, we only present the proof for the case where the domain of feature $j$ is continuous and the shape functions are differentiable at $v$. The proofs for the other two cases are similar.

Applying the definition of $\nabla_v$ to the additive logits, we have $\nabla_v F_k(x) = \nabla_v f_{jk}(v)$ for every class $k$, since the other features of $x$ are held fixed.

The ranking consistency property (Corollary 2) therefore guarantees that the ranking among the per-class changes $\{\nabla_v f_{jk}(v)\}_k$ is the same as the ranking among the corresponding per-class changes in predicted probability. This is true for every individual data point $x$ with $x_j = v$. Then, because the inequalities are preserved under expectation, the same ranking holds among $\{\nabla_v \, \mathbb{E}_{x \sim D_{jv}}[p_k(x)]\}_k$. Therefore, there must exist a constant $c_{jv}$ such that the sign of $\nabla_v f_{jk}(v) + c_{jv}$ equals the sign of $\nabla_v \, \mathbb{E}_{x \sim D_{jv}}[p_k(x)]$ for all $k$. This holds for all features $j$ and values $v$, so Condition (10) is feasible. ∎

INPUT: A pretrained GAM $\mathcal{F}$.

OUTPUT: Interpretable GAM $\bar{\mathcal{F}} \in [\mathcal{F}]$.

1: for $j = 1$ to $p$ do
2:     for $k = 1$ to $K$ do
3:         Define function .      
4:     Define function .
5:     Define function .
6:     Define function .
7:     .
8:     Recover the post-processed shape functions via integration or summation, depending on the domain type of feature $j$.
9: Return $\bar{\mathcal{F}}$.
Algorithm 2 Additive Post-Processing for Interpretability (API)

Now, to solve optimization problem (9), observe that both the objective function and the constraints are separable with respect to the features $j$ and the feature values $v$, and the problem can be reparametrized as a problem over the local changes $\nabla_v g_j(v)$. Therefore, problem (9) can be solved by individually solving

$$\min_{c} \;\; \sum_{k=1}^{K} \big(\nabla_v f_{jk}(v) + c\big)^2$$
$$\text{s.t.} \;\; \operatorname{sign}\big(\nabla_v f_{jk}(v) + c\big) = \operatorname{sign}\big(\nabla_v \, \mathbb{E}_{x \sim D_{jv}}[p_k(x)]\big), \quad k = 1, \ldots, K,$$

for all features $j$ and values $v$. Each subproblem is a 1-d quadratic program with linear constraints, which can be solved in closed form. The closed-form solution gives rise to the API post-processing method presented in Algorithm 2.
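Below is a sketch of how one such per-value subproblem can be solved in closed form: the unconstrained minimizer of the quadratic is the negative mean of the current local changes, which is then projected onto the interval implied by the sign constraints (nonempty by Theorem 1). This is our own illustration of a 1-d quadratic program with sign constraints, not necessarily the exact closed form used in API, and it ignores the zero-sign corner case for brevity.

import numpy as np

def solve_local_shift(d, target_sign, tol=1e-9):
    # Find the constant c minimizing sum_k (d[k] + c)^2
    # subject to sign(d[k] + c) == target_sign[k] (entries +1 or -1).
    # d           : current local changes, one per class (e.g. nabla_v f_jk(v))
    # target_sign : desired signs, e.g. sign of the change in average class probability
    d, s = np.asarray(d, float), np.asarray(target_sign, float)
    lo = np.max(-d[s > 0], initial=-np.inf) + tol   # need d_k + c > 0 for these classes
    hi = np.min(-d[s < 0], initial=np.inf) - tol    # need d_k + c < 0 for these classes
    c = -d.mean()                                   # unconstrained minimizer
    return float(np.clip(c, lo, hi))                # project onto the feasible interval

d = np.array([0.8, 0.5, -0.1])   # all shapes locally rising or nearly flat
s = np.array([1.0, -1.0, -1.0])  # but only class 0's probability actually rises
c = solve_local_shift(d, s)
print(c, np.sign(d + c))         # shifted changes now match the target signs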

In the next section, we apply API to improve the interpretability of shape functions of a 12-class multiclass GAM.

4.4 Interpretability in Action on Real Data: Sudden Infant Death Syndrome (SIDS)

The SIDS dataset classifies newborn infants into 12 classes: alive and 11 distinct causes of death (see the Figure 3 legend). The usual way of visualizing multiclass additive models, used in packages such as mgcv [Wood 2011], plots the logit relative to a base class that is the majority or 'normal' class; in SIDS, the class 'alive' is the natural base class.⁵

The first column in Figure 3 shows this view of the shape functions for features ‘birthweight’ and ‘gestation length’. Interpreting the model from these two plots (Figure 3a,e), one may think that the risk for almost all causes of death increases for infants with low birthweight or short gestation length. However, experts who examined these two plots found them misleading and questioned why risk did not appear to differ more by cause of death.

Figure 3: Shape functions for the SIDS data, before and after applying our API post-processing method.

The three columns on the right show the shape functions for the same two features, 'birthweight' and 'gestation length', after applying the API method. For the sake of demonstration, for each feature we split the 12 shapes across three figures. Keep in mind that after API post-processing, the trend of the shapes agrees with the trend of the corresponding class probabilities. One can see that the chance of living (class 0) is indeed monotonically decreasing as birthweight gets lower and gestation gets shorter (Figure 3b,f). Note that not all causes of death are affected similarly by the two features.

Low birthweight infants are more likely to die from preterm low birthweight, complications of pregnancy, placenta cord membranes and respiratory distress (2,3,6,8 in Figure 3c), and less likely to die from SUID and accidents (4,5 in Figure 3b). For congenital malformation, bacterial sepsis and neonatal hemorrhage, these risks peak at birthweight 1-2kg (1,7,10 in Figure 3d). Finally for circulatory diseases and other causes, their shapes are relatively flat, suggesting that birthweight variation does not affect them as much (9,11 in Figure 3d). This finding was confirmed by experts.

For gestation, the causes of death exhibit three different patterns. As gestation length gets shorter, death is less likely from congenital malformation, SUID and accidents (1,4,5 in Figure 3f), but the risk of death from preterm low birthweight, complications of pregnancy and placenta cord membranes increases (2,3,6,8 in Figure 3g). The third category (Figure 3h) is especially interesting: the risk of death from bacterial sepsis, respiratory distress, circulatory system, neonatal hemorrhage and other causes appears to peak around weeks 24-27, near the end of the second trimester.

This short case study demonstrates that multiclass GAM shape functions are more readily interpretable after API (three columns on right in Figure 3) compared to the traditional presentation (column on left in Figure 3). In particular, the shape plots after API successfully show the diversity between different causes of death that is not immediately apparent in the plots before API.

5 Related Work

Generalized additive models (GAMs) were first introduced to model features with more flexible base learners [Hastie and Tibshirani 1990; Wood 2006]. They are traditionally fitted using splines [Eilers and Marx 1996]; other base learners include trees [Lou, Caruana, and Gehrke 2012], trend filtering [Tibshirani 2014], etc.

Comparing several GAM fitting procedures, Binder and Tutz (2008) found that boosting performs well particularly in high-dimensional settings. Lou et al. developed iGAM [Lou, Caruana, and Gehrke 2012] using boosting with shallow bagged-tree base learners. This paper generalizes iGAM to the multiclass setting.

We briefly summarize GAM software besides those already mentioned in Section 3.2. mboost [Hothorn et al. 2018] fits GAMs using component-wise gradient boosting [Buhlmann and Yu 2003]; pyGAM [Serven and Brummitt 2018] fits GAMs with P-spline base learners using penalized iteratively reweighted least squares. However, neither supports multiclass classification. To the best of our knowledge, our soon-to-be-released package is the first that can learn large-scale, high-performance multiclass GAMs.

Recent research in interpretability can be roughly divided into two categories. The first category tries to explain predictions of a black-box model, either locally [Ribeiro, Singh, and Guestrin 2016; Baehrens et al. 2010] or globally [Ribeiro, Singh, and Guestrin 2018; Tan et al. 2018]. The second category builds interpretable models from the ground up, such as rule lists [Letham et al. 2015], decision sets [Lakkaraju, Bach, and Leskovec 2016], additive models [Lou et al. 2013], etc. This paper addresses the interpretability challenges of additive models in the multiclass realm.

6 Discussion and Conclusions

We have presented a comprehensive framework for training interpretable multiclass generalized additive models. The framework consists of a multiclass GAM learning algorithm, MC-iGAM, and a model-agnostic post-processing procedure, API, that transforms any multiclass additive model into a more interpretable, canonical form. The API post-processing method provably satisfies two interpretability axioms that, when satisfied, make the learned shape functions easier to interpret and prevent them from being visually misleading. The API method is very general, and can even be applied to simple additive models such as multiclass logistic regression to create a more interpretable, canonical form.

The MC-iGAM algorithm and API post-processing method are efficient and easily scale to large datasets with hundreds of thousands of points and hundreds or thousands of features. We are currently generalizing both the MC-iGAM algorithm and API post-processing method to work with GAMs that include higher-order interactions such as pairwise interactions.

The definition of interpretability is in flux [Doshi-Velez and Kim 2017]. Ultimately, we hope that the two axioms of interpretability proposed in this paper for additive models will contribute to a more precise definition of interpretability.

Footnotes

  1. Thanks to Caruana et al. for providing Figure 1.
  2. This is a weak assumption, as most base learners used for fitting GAM shapes satisfy this assumption (e.g. splines are continuous and trees are piece-wise constant with a finite number of discontinuities).
  3. https://www.cdc.gov/sids/data.htm
  4. https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/multinom.html
  5. This forces the logit for class ‘alive’ to zero for all values of each feature so that the risk of other classes is relative to the ’alive’ class.

References

  1. Baehrens, D.; Schroeter, T.; Harmeling, S.; Kawanabe, M.; Hansen, K.; and Muller, K.-R. 2010. How to explain individual classification decisions. Journal of Machine Learning Research 11(Jun):1803–1831.
  2. Binder, H., and Tutz, G. 2008. A comparison of methods for the fitting of generalized additive models. Statistics and Computing 18(1):87–99.
  3. Brodersen, K. H.; Ong, C. S.; Stephan, K. E.; and Buhmann, J. M. 2010. The balanced accuracy and its posterior distribution. In ICPR.
  4. Buhlmann, P., and Yu, B. 2003. Boosting with the l2 loss: regression and classification. Journal of the American Statistical Association 98(462):324–339.
  5. Caruana, R.; Lou, Y.; Gehrke, J.; Koch, P.; Sturm, M.; and Elhadad, N. 2015. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In KDD.
  6. Chen, T., and Guestrin, C. 2016. XGBoost: A scalable tree boosting system. In KDD.
  7. Doshi-Velez, F., and Kim, B. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
  8. Eilers, P., and Marx, B. 1996. Flexible smoothing with b-splines and penalties. Statistical Science 11(2):89–121.
  9. Friedman, J.; Hastie, T.; and Tibshirani, R. 2000. Additive logistic regression: a statistical view of boosting. The Annals of Statistics 28(2):337–407.
  10. Friedman, J. 2001. Greedy function approximation: a gradient boosting machine. The Annals of Statistics 29(5):1189–1232.
  11. Hastie, T., and Tibshirani, R. 1990. Generalized Additive Models. Chapman and Hall/CRC.
  12. Hothorn, T.; Buhlmann, P.; Kneib, T.; Schmid, M.; and Hofner, B. 2018. mboost: Model-Based Boosting. https://CRAN.R-project.org/package=mboost.
  13. Javed, W.; McDonnel, B.; and Elmqvist, N. 2010. Graphical perception of multiple time series. IEEE Transactions on Visualization & Computer Graphics.
  14. Lakkaraju, H.; Bach, S. H.; and Leskovec, J. 2016. Interpretable decision sets: A joint framework for description and prediction. In KDD.
  15. Letham, B.; Rudin, C.; McCormick, T. H.; and Madigan, D. 2015. Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics 9(3):1350–1371.
  16. Lou, Y.; Caruana, R.; Gehrke, J.; and Hooker, G. 2013. Accurate intelligible models with pairwise interactions. In KDD.
  17. Lou, Y.; Caruana, R.; and Gehrke, J. 2012. Intelligible models for classification and regression. In KDD.
  18. Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. “why should i trust you?”: Explaining the predictions of any classifier. In KDD.
  19. Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2018. Anchors: High-precision model-agnostic explanations. In AAAI.
  20. Serven, D., and Brummitt, C. 2018. pygam: Generalized additive models in python. https://doi.org/10.5281/zenodo.1208723.
  21. Tan, S.; Caruana, R.; Hooker, G.; and Lou, Y. 2017. Auditing black-box models using transparent model distillation with side information. arXiv preprint arXiv:1710.06169.
  22. Tan, S.; Caruana, R.; Hooker, G.; and Gordo, A. 2018. Transparent model distillation. arXiv preprint arXiv:1801.08640.
  23. Tibshirani, R. J. 2014. Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics 42(1):285–323.
  24. Wood, S. N. 2006. Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC.
  25. Wood, S. N. 2011. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society (B).
  26. Zeng, J.; Ustun, B.; and Rudin, C. 2016. Interpretable classification models for recidivism prediction. Journal of the Royal Statistical Society (A).