Interpretability is Harder in the Multiclass Setting: Axiomatic Interpretability for Multiclass Additive Models
Abstract
Generalized additive models (GAMs) are favored in many regression and binary classification problems because they are able to fit complex, nonlinear functions while still remaining interpretable. In the first part of this paper, we generalize a stateoftheart GAM learning algorithm based on boosted trees to the multiclass setting, and show that this multiclass algorithm outperforms existing GAM fitting algorithms and sometimes matches the performance of full complex models.
In the second part, we turn our attention to the interpretability of GAMs in the multiclass setting. Surprisingly, the natural interpretability of GAMs breaks down when there are more than two classes. Drawing inspiration from binary GAMs, we identify two axioms that any additive model must satisfy to not be visually misleading. We then develop a postprocessing technique (API) that provably transforms pretrained additive models to satisfy the interpretability axioms without sacrificing accuracy. The technique works not just on models trained with our algorithm, but on any multiclass additive model. We demonstrate API on a 12class infantmortality dataset.
1 Introduction
Interpretable models, though sometimes less accurate than blackbox models, are preferred in many realworld applications. In criminal justice, finance, hiring, and other domains that impact people’s lives, interpretable models are often used because their transparency helps determine if a model is biased or unsafe [\citeauthoryearZeng, Ustun, and Rudin2016, \citeauthoryearTan et al.2017]. And in critical applications such as healthcare, where human experts and machine learning models often work together, being able to understand, learn from, edit and trust the learned model is also important [\citeauthoryearCaruana et al.2015].
Generalized additive models (GAMs) are among the most powerful interpretable models when individual features play major effects [\citeauthoryearHastie and Tibshirani1990, \citeauthoryearLou, Caruana, and Gehrke2012]. In the binary classification setting, we consider standard GAMs with logistic probabilities: , where the logit is an additive function of individual features, . Here, is the th feature of data point , and we denote the shape function of feature for the positive class. Previously, \citeauthorlou2012intelligible evaluated various GAM fitting algorithms, and found that gradient boosting with shallow trees restricted to one feature at a time outperformed other methods on a number of regression and binary classification datasets. They call their model iGAM. The first part of this paper generalizes iGAM to the multiclass setting. We consider standard GAMs with softmax probabilities:
(1) 
where the logit of class , , is also an additive function of individual features, and is the shape function of feature for class . We present our multiclass GAM fitting algorithm in section 3.1. In section 3.2, we empirically evaluate its performance on large scale real world datasets.
Binary GAMs are readily interpretable because the influence of each feature on the outcome is captured by a single 1d shape function that is easily visualized.
For example, Figure 1 shows the relationship between risk of pneumonia death and age.
In the multiclass setting, however, the influence of feature on class is no longer captured by a single shape function , but through the interaction of all ’s, . In particular, even if the logit for class is increasing, the probability for class might still decrease if the logits for other classes are increasing more rapidly. As a result, the learned shape functions can be visually misleading. For example, Figure 2ad show the shape functions of 3class toy GAM models with only one feature. Each model appears to have very different shape functions: (2a) some falling, some rising, (2b) all decreasing, (2c) decreasing and then increasing, or (2d) oscillating. Interestingly, however, all of these models make identical predictions. Because these models have only one feature, we can plot class probabilities as functions of the feature value (this is not possible with multiple features). In Figure 2f, class probability is monotonically increasing, while class and probabilities are monotonically decreasing, which is vastly different from the shape functions ad. This problem, if not solved, greatly reduces the intelligibility of GAMs in multiclass problems.
The second half of this paper focuses on mitigating the unintelligibility of multiclass GAMs. We start by examining how users interpret binary GAMs, and identify a set of interpretability axioms — criteria that GAM shapes should satisfy to guarantee interpretability. We then present properties of additive models that make it possible to regain interpretability. Making use of these properties, we design an Additive PostProcessing for Interpretability (API) method that provably transforms any pretrained additive model to satisfy the axioms of interpretability without sacrificing predictive accuracy. Figure 2e shows the shape functions that result from passing any of models 2ad through API. After API, the now canonical shape functions successfully match the probability trend for the corresponding classes in Figure 2f.
2 Notation and Problem Definition
In this section, we define notation that will be used throughout the paper. We focus on multiclass classification where is the input space and is the output space, and is the number of classes. Let denote a training set of size , where is a feature vector with features and is the target. For , let denote the empirical proportion of class in , where denotes the set . Given a model , let denote the prediction of the model on data point . Our learning objective is to minimize the expected value of some loss function . In multiclass classification, the model output is a probability distribution among the classes, . We use multiclass cross entropy as our loss function:
(2) 
We focus on generalized additive models of the form (1) with softmax probabilities. We denote , as the set of shape functions for a multiclass GAM model, and also as the model itself. Throughout the paper, we make the following assumptions on , the multiclass shape functions. For continuous feature , ’s domain is a continuous finite interval ; for categorical or ordinal features, ’s domain is a finite ordered discrete set. We denote the domain of feature as . For the API postprocessing method (Section 4.3), we also assume that shape functions of continuous features are continuous everywhere except for a finite number of points
3 Multiclass GAM Learning via Cyclic Gradient Boosting
We now describe the training procedure for MCiGAM, our generalization of binary iGAM [\citeauthoryearLou, Caruana, and Gehrke2012] to the multiclass setting. Like Lou et al., we use bagged trees as the base learner for boosting, with largest variance reduction as the splitting criterion, and control tree complexity by limiting the number of leaves .
3.1 Cyclic Gradient Boosting
Our optimization procedure is cyclic gradient boosting [\citeauthoryearBuhlmann and Yu2003, \citeauthoryearLou, Caruana, and Gehrke2012], a variant of standard gradient boosting [\citeauthoryearFriedman2001] where features are cycled through sequentially to learn individual shape functions for each feature. Algorithm 1 presents our algorithm for cyclic gradient boosting in the multiclass setting.
In standard gradient boosting, each boosting step fits a base learner to the pseudoresidual, the negative gradient in the functional space [\citeauthoryearFriedman2001]. In a multiclass setting with cross entropy loss (Equation (2)) and softmax probabilities (Equation (1)), the pseudoresidual for class is:
Adding the fitted base learner (multiplied by a typically small constant ) to the ensemble corresponds to taking an approximate gradient step in the functional space with learning rate . However, as suggested by Friedman et al., to speed up computation one can instead take an approximate Newton step using a diagonal approximation to the Hessian [\citeauthoryearFriedman, Hastie, and Tibshirani2000]. The resulting additive update to learn a multiclass GAM then becomes:
(3)  
(4) 
for , where is the set of training points in tree leaf for current feature . Applying the above boosting procedure cyclically on individual features gives our multiclass cyclic boosting algorithm (Algorithm 1).
Hyperparameters.
We found the following hyperparameters for MCiGAM to be high performing across many datasets, including ones not included in this paper due to space constraints: learning rate , number of leaves in tree , number of bagged trees in each base learner , number of boosting iterations with early stopping based on heldout validation loss. These are the default hyperparameter choices in our soontobereleased software package.
3.2 Accuracy on Real Datasets
In this section, we evaluate MCiGAM against other multiclass baselines. We select five datasets with interpretable features and different numbers of classes, features, and data points.
Table 1 describes them. Diabetes, Covertype, Sensorless and Shuttle are from the UCI repository; SIDS is from the Centers for Disease Control and Prevention
Dataset  Classes  Features  Size  

SIDS  
Diabetes  
Covertype  
Sensorless  
Shuttle 
Baselines.
Model  SIDS  Diabetes  Covertype  Sensorless  Shuttle 

Balanced Accuracy on Test Sets  
GBT  
MCiGAM  
MGCV  
LR  
CrossEntropy Loss on Test Sets  
GBT  
MCiGAM  
MGCV  
LR 
We compare MCiGAM to three baselines:

Multiclass logistic regression (LR), the simplest multiclass additive model, to tell us how much accuracy improvement is due to the nonlinearity of MCiGAM. We use the Python sklearn implementation.

Multiclass gradient boosted trees (GBT), an unrestricted, fullcomplexity model to get a sense of how much accuracy we sacrifice to gain the interpretability of GAMs. We use the xgboost implementation [\citeauthoryearChen and Guestrin2016] and tune hyperparameters using random search.

GAMs with splines (MGCV), a widelyused R package that fits GAMs with splinebased learners using a penalized likelihood procedure [\citeauthoryearWood2011]. Unfortunately, as noted in the documentation
^{4} and found by us, mgcv’s multiclass GAM fitting procedure does not scale beyond several thousand data points and five classes. Therefore, we trained GAMs with binary targets to predict whether a point belongs in class , then generated multiclass predictions for each point by normalizing the probabilities to sum to one.
Experimental design.
For each dataset, we generated five traintest splits of size 8020% to account for potential variability between test set splits, and report the mean and standard deviation of metrics over test set splits. We track two performance metrics on testsets: balanced accuracy and crossentropy loss. Balanced accuracy addresses the imbalance of classes in classification tasks [\citeauthoryearBrodersen et al.2010]: .
Results.
The results are shown in Table 2. The top half of the table reports the balanced accuracies of each model on the five datasets. The bottom half reports the crossentropy loss on test set. Several clear patterns emerge in both tables:
(1) There is a large performance gap between the linear model (LR) and MCiGAM under both metrics. On four out of five datasets, the performance of MCiGAM is much closer to the fullcomplexity model (GBT) than to the linear model, suggesting MCiGAM benefits from the additional nonlinearity.
(2) MCiGAM consistently outperforms MGCV across all five datasets over both metrics, supporting the earlier finding [\citeauthoryearLou, Caruana, and
Gehrke2012] that boosted trees are more accurate than splines as GAM basemodels.
(3) Interestingly, on datasets with very imbalanced classes (SIDS and Shuttle), MCiGAM still performs reasonably well compared to GBT, even though no method countering class imbalance (e.g. loss function reweighting) is used.
4 Interpretability of Multiclass GAMs
Multiclass GAMs are hard to interpret fundamentally because each class’s prediction necessarily involves the shape functions of all classes. However, research has found that human perception cannot effectively dissect interactions between more than a few function curves [\citeauthoryearJaved, McDonnel, and Elmqvist2010]. Therefore, we need to find a way to allow each shape function to be viewed individually, while still conveying useful and faithful information about the model’s predictions. To do so, we first revisit the binary classification setting and define what ‘useful and faithful information’ is. Throughout this section we will use notation defined in Section 2.
4.1 Axioms of Interpretability: Inspiration from Binary GAMs
What information do people gain from binary shape functions and what aspect of shape functions carries that information? As demonstrated in the pneumonia example in Figure 1, when practitioners look at a binary GAM shape plot, they try to determine which feature values contribute positively or negatively to the outcome by looking at trends in different regions of the feature’s domain. They also look for jumps in the shape function that indicate sudden increase or decrease in predicted probability. For example, one might expect the influence of age on pneumonia risk to be smooth — one’s health at age 67 should not be dramatically different than at age 66 — and the appearance of jumps may hint at the existence of hidden variables such as retirement that warrant further investigation. Because human perception naturally focuses on discontinuities in otherwise smooth curves, it is important for shape functions to be smooth when possible.
In binary GAMs, the rising and falling trends and jumps of individual shape functions faithfully represent the trend and jumps of the model’s predictions. We would like to be able to interpret multiclass GAMs the same way. To achieve this, we propose two interpretability axioms that every multiclass GAM should satisfy.
A1: The axiom of monotonicity asks that for each feature, the trend of shape functions for all classes should match the trend of the ‘average’ predicted probability of that class. Mathematically:
Definition 1 (The axiom of monotonicity).
For each class , feature and feature value , denote the marginal distribution of points satisfying as . Then, a multiclass GAM satisfies the axiom of monotonicity if
(5) 
,
A2: The axiom of smoothness asks that the shape functions do not have any artificial or unnecessary jumps or unsmoothness. Mathematically:
Definition 2 (The axiom of smoothness).
satisfies the axiom of smoothness if
(6) 
where is some smoothness metric and denote the equivalence class of , defined in the next section.
To measure the smoothness of 1d functions such as our shape functions, we use quadratic variation:
Definition 3 (Quadratic Variation).
For functions defined on a finite ordered discrete domain of size S, quadratic variation is
For functions defined on a continuous interval with finite points of discontinuity , quadratic variation is:
Does there exist a multiclass GAM model that satisfies both axioms? Figure 1 in Section 1 is an example of one. By transforming the binary pneumonia GAM model (Figure 1) to a multiclass GAM model with two classes (Figure 1), the model changes from to . The blue curve representing risk of death is exactly the same as the binary age shape and is therefore faithful to the model prediction. The orange curve representing the ’risk’ to survive is exactly the mirror image of the risk of death. Since in the binary case the probability of death is always one minus the chance to survive, the orange curve is faithful to its own class as well.
4.2 Leveraging Key Properties of Multiclass GAMs to Regain Interpretability
We have proposed two axioms satisfied by binary GAMs that multiclass GAMs should also satisfy in order to not be visually misleading, and provided an example of a (twoclass) multiclass GAM model that satisfies these axioms. We now highlight two key properties shared by all multiclass GAM models that we will leverage in Section 4.3 to postprocess any multiclass GAM model to satisfy these axioms. These properties stem from the softmax formulation (Equation (1)) used by these models.
P1: Equivalence class of multiclass GAMs. Different GAMs can produce equivalent model predictions. In particular, we have the following equivalence relationship:
Corollary 1.
Let and be two GAMs defined as
for some arbitrary functions ’s. Then, and are equivalent in terms of model prediction, and we define the equivalence class of as .
Proof.
Notice that unlike the binary GAMs’ logistic probabilities, softmax probabilities are invariant with respect to a constant shift of the logits due to the softmax being overparametrized. Therefore we can add a constant to all logits without changing the predicted probability, i.e.
∎
We will use this invariance property in our additive postprocessing (API) method presented in Section 4.3 to find a more interpretable equivalent to .
P2: Ranking consistency between shape functions and class probabilities. Another characteristic of the softmax is the ranking consistency between the change in shape function values and the change in predicted class probability.
Corollary 2.
Let and be two data points sharing the exact same feature values except for one particular feature . Let be the differences between their corresponding logits due to the difference in feature . Then, the ranking of across is consistent with the ranking of the ratios of predicted probabilities across .
Proof.
Simple calculation shows that , for all . Now, suppose that for some particular , then we have
(7) 
which implies that
(8) 
This property holds for all pairs. ∎
This ranking consistency property will come in useful in the optimization of our API method (cf. Section 4.3).
4.3 Additive PostProcessing for Interpretability
We now describe a postprocessing method, API, that leverages the softmax’s properties (cf. Section 4.2) to modify any multiclass additive model to regain interpretability (cf. Section 4.1), while keeping its predictions unchanged. Given a pretrained GAM model , API finds another equivalent additive model that satisfies the axiom of monotonicity while fulfilling the minimization condition of the axiom of smoothness. We formulate this as a constrained optimization problem in functional space to find the set defining while minimizing objective (6) and satisfying condition (5):
(9)  
s.t.  (10)  
Before we discuss how to solve this optimization problem, we first show that there is a solution:
Theorem 1.
Condition (10) is feasible.
Proof.
Let be a feature and be a data point with . Due to space constraints, we only present the proof for the case where the domain of feature is continuous and the shape functions are differentiable at . The proofs for the other two cases are similar.
Applying the definition of , we have
The ranking consistency property (Corollary 2) therefore guarantees that the ranking among is the same as the ranking among . This is true for every individual data point with . Then, due to the invariance of the inequality under expectation, we have that the ranking among is the same as the ranking among . Therefore, there must exist a constant such that the sign of equals the sign of for all . This holds for all features and values . Therefore, Condition (10) is feasible. ∎
Now to solve optimization problem (9), observe that both the objective function and the constraints are separable with respect to the feature set and the feature values , and the optimization problem can be reparametrized to be a problem over . Therefore, problem (9) can be solved by individually solving
s.t.  
for all and . It therefore becomes a set of 1d quadratic programs with linear constraints, which can be solved in closed form. The closed form solution gives rise to the API postprocessing method presented below.
In the next section, we apply API to improve the interpretability of shape functions of a 12class multiclass GAM.
4.4 Interpretability in Action on Real Data: Sudden Infant Death Syndrome (SIDS)
The SIDS dataset classifies newborn infants into 12 classes: alive and 11 distinct causes of death (see Figure 3 legend). The usual way of visualizing multiclass additive models, used in packages such as mgcv [\citeauthoryearWood2011], plots the logit relative to a base class that is the majority or ‘normal’: in SIDS the class ‘alive’ is the natural base class.
The first column in Figure 3 shows this view of the shape functions for features ‘birthweight’ and ‘gestation length’. Interpreting the model from these two plots (Figure 3a,e), one may think that the risk for almost all causes of death increases for infants with low birthweight or short gestation length. However, experts who examined these two plots found them misleading and questioned why risk did not appear to differ more by cause of death.
The three columns on the right show the shape functions for the same two features, ‘birthweight’ and ‘gestation length’, after applying the API method. For the sake of demonstration, for each feature we split the 12 shapes into three figures. Keep in mind that after API postprocessing, the trend of the shapes agrees with the trend of the corresponding class probabilities. One can see that the chance of living (class 0) is indeed monotonically decreasing as birthweight gets lower and gestation gets shorter (Figure 3b,f). Now, not all causes of death are affected in similarly by the two features.
Low birthweight infants are more likely to die from preterm low birthweight, complications of pregnancy, placenta cord membranes and respiratory distress (2,3,6,8 in Figure 3c), and less likely to die from SUID and accidents (4,5 in Figure 3b). For congenital malformation, bacterial sepsis and neonatal hemorrhage, these risks peak at birthweight 12kg (1,7,10 in Figure 3d). Finally for circulatory diseases and other causes, their shapes are relatively flat, suggesting that birthweight variation does not affect them as much (9,11 in Figure 3d). This finding was confirmed by experts.
For gestation, the causes of death exhibit three different patterns. As gestation length gets shorter, death is less likely from congenital malformation, SUID and accidents (1,4,5 in Figure 3f), but the risk of death from preterm low birthweight, complications of pregnancy and placenta cord membranes increases (2,3,6,8 in Figure 3g). The 3rd category (Figure 3h) is especially interesting. The risk of death from bacterial sepsis, respiratory distress, circulatory system, neonatal hemorrhage and others appear to peak around weeks 2427, near the end of the second trimester.
This short case study demonstrates that multiclass GAM shape functions are more readily interpretable after API (three columns on right in Figure 3) compared to the traditional presentation (column on left in Figure 3). In particular, the shape plots after API successfully show the diversity between different causes of death that is not immediately apparent in the plots before API.
5 Related Work
Generalize additive models (GAMs) were first introduced to model features with more flexible base learners [\citeauthoryearHastie and Tibshirani1990, \citeauthoryearWood2006]. They are traditionally fitted using splines [\citeauthoryearEilers and Marx1996]; other base learners include trees [\citeauthoryearLou, Caruana, and Gehrke2012], trend filtering [\citeauthoryearTibshirani2014], etc.
Comparing between several GAM fitting procedures,
binder2008comparison found that boosting performs well particularly in highdimensional settings. Lou et al. developed iGAM [\citeauthoryearLou, Caruana, and Gehrke2012] using boosting with shallow bagged tree base learners. This paper generalizes iGAM to the multiclass setting.
We briefly summarize GAM software besides those already mentioned in Section 3.2. mboost [\citeauthoryearHothorn et al.2018] fits GAMs using componentwise gradient boosting [\citeauthoryearBuhlmann and Yu2003]; pyGAM [\citeauthoryearServen and Brummitt2018] fits GAMs with Psplines base learners using penalized iteratively reweighted least squares. However, neither supports multiclass classification. To the best of our knowledge, our soontobereleased package is the first that can learn largescale, highperformance multiclass GAMs.
Recent research in interpretability can be roughly divided into two categories. The first category tries to explain predictions of a blackbox model, either locally [\citeauthoryearRibeiro, Singh, and Guestrin2016, \citeauthoryearBaehrens et al.2010] or globally [\citeauthoryearRibeiro, Singh, and Guestrin2018, \citeauthoryearTan et al.2018]. The second category builds interpretable models from the ground up, such as rule lists [\citeauthoryearLetham et al.2015], decision sets [\citeauthoryearLakkaraju, Bach, and Leskovec2016], additive models [\citeauthoryearLou et al.2013], etc. This paper addresses the interpretability challenges of additive models in the multiclass realm.
6 Discussion and Conclusions
We have presented a comprehensive framework for training interpretable multiclass generalized additive models. The framework consists of a multiclass GAM learning algorithm, MCiGAM, and a modelagnostic postprocessing procedure, API, that transforms any multiclass additive model into a more interpretable, canonical form. The API postprocessing method provably satisfies two interpretability axioms that, when satisfied, make the learned shape functions easier to interpret and prevent them from being visually misleading. The API method is very general, and can even be applied to simple additive models such as multiclass logistic regression to create a more interpretable, canonical form.
The MCiGAM algorithm and API postprocessing method are efficient and easily scale to large datasets with hundreds of thousands of points and hundreds or thousands of features. We are currently generalizing both the MCiGAM algorithm and API postprocessing method to work with GAMs that include higherorder interactions such as pairwise interactions.
The definition of interpretability is in flux [\citeauthoryearDoshiVelez and Kim2017]. Ultimately, we hope that the two axioms of interpretability proposed in this paper for additive models will contribute to a more precise definition of interpretability.
Footnotes
 Thanks to Caruana et al. for providing Figure 1.
 This is a weak assumption, as most base learners used for fitting GAM shapes satisfy this assumption (e.g. splines are continuous and trees are piecewise constant with a finite number of discontinuities).
 https://www.cdc.gov/sids/data.htm
 https://stat.ethz.ch/Rmanual/Rdevel/library/mgcv/html/multinom.html
 This forces the logit for class ‘alive’ to zero for all values of each feature so that the risk of other classes is relative to the ’alive’ class.
References
 Baehrens, D.; Schroeter, T.; Harmeling, S.; Kawanabe, M.; Hansen, K.; and Muller, K.R. 2010. How to explain individual classification decisions. Journal of Machine Learning Research 11(Jun):1803–1831.
 Binder, H., and Tutz, G. 2008. A comparison of methods for the fitting of generalized additive models. Statistics and Computing 18(1):87–99.
 Brodersen, K. H.; Ong, C. S.; Stephan, K. E.; and Buhmann, J. M. 2010. The balanced accuracy and its posterior distribution. In ICPR.
 Buhlmann, P., and Yu, B. 2003. Boosting with the l2 loss: regression and classification. Journal of the American Statistical Association 98(462):324–339.
 Caruana, R.; Lou, Y.; Gehrke, J.; Koch, P.; Sturm, M.; and Elhadad, N. 2015. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30day readmission. In KDD.
 Chen, T., and Guestrin, C. 2016. XGBoost: A scalable tree boosting system. In KDD.
 DoshiVelez, F., and Kim, B. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
 Eilers, P., and Marx, B. 1996. Flexible smoothing with bsplines and penalties. Statistical Science 11(2):89–121.
 Friedman, J.; Hastie, T.; and Tibshirani, R. 2000. Additive logistic regression: a statistical view of boosting. The Annals of Statistics 28(2):337–407.
 Friedman, J. 2001. Greedy function approximation: a gradient boosting machine. The Annals of Statistics 29(5):1189–1232.
 Hastie, T., and Tibshirani, R. 1990. Generalized Additive Models. Chapman and Hall/CRC.
 Hothorn, T.; Buhlmann, P.; Kneib, T.; Schmid, M.; and Hofner, B. 2018. mboost: ModelBased Boosting. https://CRAN.Rproject.org/package=mboost.
 Javed, W.; McDonnel, B.; and Elmqvist, N. 2010. Graphical perception of multiple time series. IEEE Transactions on Visualization & Computer Graphics.
 Lakkaraju, H.; Bach, S. H.; and Leskovec, J. 2016. Interpretable decision sets: A joint framework for description and prediction. In KDD.
 Letham, B.; Rudin, C.; McCormick, T. H.; and Madigan, D. 2015. Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics 9(3):1350–1371.
 Lou, Y.; Caruana, R.; Gehrke, J.; and Hooker, G. 2013. Accurate intelligible models with pairwise interactions. In KDD.
 Lou, Y.; Caruana, R.; and Gehrke, J. 2012. Intelligible models for classification and regression. In KDD.
 Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. “why should i trust you?”: Explaining the predictions of any classifier. In KDD.
 Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2018. Anchors: Highprecision modelagnostic explanations. In AAAI.
 Serven, D., and Brummitt, C. 2018. pygam: Generalized additive models in python. https://doi.org/10.5281/zenodo.1208723.
 Tan, S.; Caruana, R.; Hooker, G.; and Lou, Y. 2017. Auditing blackbox models using transparent model distillation with side information. arXiv preprint arXiv:1710.06169.
 Tan, S.; Caruana, R.; Hooker, G.; and Gordo, A. 2018. Transparent model distillation. arXiv preprint arXiv:1801.08640.
 Tibshirani, R. J. 2014. Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics 42(1):285–323.
 Wood, S. N. 2006. Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC.
 Wood, S. N. 2011. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society (B).
 Zeng, J.; Ustun, B.; and Rudin, C. 2016. Interpretable classification models for recidivism prediction. Journal of the Royal Statistical Society (A).