Identifying Significant Predictive Bias in Classifiers
We present a novel subset scan method to detect whether a probabilistic binary classifier has statistically significant bias (over- or under-predicting risk) for some subgroup, and to identify the characteristics of this subgroup. This form of model checking and goodness-of-fit testing provides an interpretable way to detect classifier bias or regions of poor classifier fit. It allows consideration not just of subgroups of a priori interest or of low dimension, but of the space of all possible subgroups of features. To address the difficulty of considering these exponentially many subgroups, we use subset scan and parametric bootstrap-based methods. Extending this method, we can penalize the complexity of the detected subgroup and also identify subgroups with high classification errors. We demonstrate these methods and find interesting results on the COMPAS crime recidivism and credit delinquency data.
Increasingly, data-driven tools like probabilistic classifiers are used for decision support and risk assessment in various sectors: criminal justice, public policy, health, banking, and online platforms (Angwin et al., 2016; Goel et al., 2016; Miller, 2015; Starr, 2014). Evaluations of such methods usually focus on overall predictive performance. However, recent academic and popular writing has also emphasized the importance of potential biases or discrimination in these predictions. Earlier this year, ProPublica conducted a widely discussed analysis (Angwin et al., 2016) of the COMPAS recidivism risk prediction algorithm, arguing that the predictions, controlling for actual risk, were more likely to mistakenly label black defendants as high-risk of reoffending.
Bias in data-driven classifiers can have several possible sources and forms. We focus on bias arising when a classification technique is insufficiently flexible to predict well for some subgroups in the data, whether due to optimizing for overall performance or to model mis-specification. In this paper, we focus on the predictive bias in probabilistic classifiers or risk predictions that results from that source. As a simplified example (we define predictive bias in detail in Section 2), consider a subgroup $S$, with binary outcomes $y_i$ and a classifier's predictions $\hat{p}_i$ for those outcomes; over-estimation predictive bias occurs when

$$ \mathbb{E}\Big[\textstyle\sum_i y_i \, \mathbb{1}_S(i)\Big] < \sum_i \hat{p}_i \, \mathbb{1}_S(i), $$

where $\mathbb{1}_S(i)$ is an indicator function for membership in subgroup $S$, and vice-versa for under-estimation. Predictive bias is different from predictive fairness, which emphasizes comparable predictions between subgroups of a priori interest, like race or gender, while predictive bias emphasizes comparable predictions and observations within a subgroup.
In this paper, we (1) define a measure of predictive bias based on how a subgroup's observed outcome odds differ from its predicted odds, and (2) operationalize this definition into a bias scan method to detect and identify which subgroup(s) have statistically significant predictive bias, given a classifier's predictions $\hat{p}_i$. Further, we briefly discuss extending this method to penalize subgroup complexity or to detect subgroups with higher-than-expected classification errors, and present novel case study results from the bias scan on crime recidivism and loan delinquency predictions.
Existing literature on predictive bias focuses on sets of subgroups defined by one dimension of a priori interest, such as race, gender, or income. However, some important subgroups may not be describable so simply or considered a priori. ProPublica's COMPAS analysis (Angwin et al., 2016) and follow-up analyses (Chouldechova, 2017; Skeem and Lowenkamp, 2016) focus on predictive bias for subgroups defined by race. In our analysis of COMPAS, we do not detect a significant predictive bias along racial lines, but instead identify bias in a more subtle multi-dimensional subgroup: females who initially committed misdemeanors (rather than felonies), for half of the COMPAS risk groups, have their recidivism risk significantly over-estimated.
Assessing bias in all of the exponentially many subgroups is a difficult task, both computationally and statistically. First, exhaustively evaluating all subgroups for predictive bias quickly becomes computationally infeasible. Given a dataset with $k$ features, each discretized into $m_j$ values, we define a subgroup as any $k$-dimension Cartesian set product of non-empty subsets of feature-values from each feature. With this axis-aligned criterion, we consider only subgroups that are interpretable, rather than arbitrary collections of dataset rows. There are then $\prod_{j=1}^{k}(2^{m_j} - 1)$ unique subgroups; consider a dataset with only 4 discretized features (e.g. age, income, ethnicity, location), each with arity $m = 5$, say: then there are $31^4 \approx 9.2 \times 10^5$ possible subgroups to consider (and roughly $2.9 \times 10^7$ if there were a fifth feature).
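This count follows from choosing a non-empty subset of values independently for each feature. A minimal sketch of the calculation (the helper name is ours):

```python
def n_subgroups(arities):
    # Each feature with m discretized values contributes (2**m - 1)
    # non-empty value subsets; an axis-aligned subgroup picks one such
    # subset per feature, so the counts multiply.
    total = 1
    for m in arities:
        total *= 2 ** m - 1
    return total

# Four features of arity 5 already yield ~9.2e5 subgroups;
# a fifth feature multiplies the count by another 31.
print(n_subgroups([5, 5, 5, 5]))     # 923521
print(n_subgroups([5, 5, 5, 5, 5]))  # 28629151
```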
A second difficulty is estimating the statistical significance of a detected subgroup. It is trivial to find some measure of predictive bias: any subgroup where the fraction of observed outcomes does not exactly equal the mean predicted probability. The relevant question is instead: can we identify a subgroup with significantly more predictive bias than would be expected from an unbiased classifier?
To address these difficulties, this work develops a novel extension of fast subset scan anomaly detection methods (Neill, 2012; Neill et al., 2013; Kumar and Neill, 2012; Speakman et al., 2016). This enables our bias scan method to approximately identify the most statistically biased subgroup in linear time (rather than exponential). We then use parametric bootstrapping (Efron and Tibshirani, 1994) to adjust for multiple testing and estimate the statistical significance of the detected subgroup. A distinguishing mechanism of our method is the ability to statistically consider all possible subgroups, expanding the search space beyond just interaction effects, and in essence, enabling the collective consideration of groups of weak, but related signals.
Topically, the problem of assessing bias in data-driven decision making covers areas including predictive bias, problems in the original training data, disparate impacts of predictions, and adjusting predictions to ensure fairness (Adler et al., 2016; Dwork et al., 2012; Feldman et al., 2015; Romei and Ruggieri, 2014), in areas like criminal justice (Angwin et al., 2016; Chouldechova, 2017; Flores et al., 2016; Skeem and Lowenkamp, 2016) but also various other sectors (House, 2016; O'Neil, 2016; Miller, 2015; Starr, 2014). We focus here on predictive bias. Existing literature on predictive bias has focused only on subgroups of a priori interest, such as race or gender (Angwin et al., 2016; Chouldechova, 2017; Flores et al., 2016; Skeem and Lowenkamp, 2016). We contribute a more general method that can detect and characterize such bias, or poor classifier fit, in the larger space of all possible subgroups, without a priori specification.
Methodologically, our method is comparable to others that analyze the residuals between a classifier's predictions $\hat{p}_i$ and observed outcomes $y_i$. This includes a long-standing literature on model checking, goodness-of-fit methods, and visualization of residuals. Identifying patterns in residuals, first through one-dimensional visualization, is an early key lesson when teaching regression. A common, more rigorous extension is to use interpretable predictive methods to characterize patterns in residuals, for example linear models with quadratic or interaction terms. These models cannot collectively consider groups of signals or interactions, though, unless specified ex ante, e.g. via the group lasso (Yuan and Lin, 2006). More flexible assessment of patterns in residuals (comparing residual sums of squares between linear models and non-linear methods like random forests) can be formalized via a generalized F-test-style test and parametric bootstrapping, as Shah and Buhlmann (2017) show for regression. Their tests aim to detect the general presence of poor fit for model selection, but do not characterize where the bias is. Tree-based methods with top-down optimization may split apart subgroups of interest and require a way to distinguish the significance of individual leaves.
2. Bias Subset Scan Methodology
We extend methodology from the anomaly detection literature, specifically fast, expectation-based subset scans (Neill, 2012; Neill et al., 2013; Kumar and Neill, 2012; Speakman et al., 2016). This methodology can identify, or closely approximate, the most anomalous subgroup of feature space in linear time, amongst the exponentially many possible subgroups, enabling tractable subgroup analysis. The general form of these methods is:

$$ S^* = F_{SS}(D, \mathcal{E}, F_{\text{score}}) = \arg\max_{S} F_{\text{score}}(S; D, \mathcal{E}), $$

where $S^*$ is the detected most anomalous subgroup, $F_{SS}$ is one of several subset scan algorithms for different problem settings, $D$ is a dataset with outcomes $y_i$ and discretized features $X_i$, $\mathcal{E}$ is a set of expectations or "normal" values for the outcomes, and $F_{\text{score}}$ is an expectation-based scoring statistic that measures the amount of anomalousness between subgroup observations and their expectations. For this to be tractable, the statistic must satisfy the Linear Time Subset Scanning (LTSS; Neill, 2012) or Additive Linear Time Subset Scanning (ALTSS; Speakman et al., 2016) properties, which prove that the feature-values of one feature can be optimally ordered to reduce the number of subgroups to consider.
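To illustrate the key idea behind the additive property: once the bias parameter is held fixed, a subgroup's score decomposes additively over the feature-values of any one feature, so the optimal subset of values for that feature consists of exactly the values with positive contribution. A sketch of that inner step (our naming, not the full MDSS implementation):

```python
def best_value_subset(contribution):
    # contribution: feature-value -> its additive score contribution,
    # computed with the bias parameter q and all other features held fixed.
    # Because the score is additive over values, the optimal non-empty
    # subset keeps every positive-contribution value (or the single best
    # value if none are positive).
    positive = {v: c for v, c in contribution.items() if c > 0}
    if positive:
        return set(positive), sum(positive.values())
    best = max(contribution, key=contribution.get)
    return {best}, contribution[best]

subset, score = best_value_subset({"age<25": 0.5, "25-45": -0.3, ">45": 0.25})
print(subset, score)
```

Iterating this step over features, with periodic re-optimization of the bias parameter, is what lets the scan avoid enumerating all value subsets.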
In the bias scan method, we develop a novel extension of the Multi-Dimensional Subset Scan (MDSS) method (Neill et al., 2013), described by Algorithm 1. We contribute (1) a new subgroup scoring statistic, $\text{score}_{\text{bias}}(S)$, that measures the bias in a given subgroup, and prove that it satisfies the ALTSS property; and (2) the application of parametric bootstrapping in the subset scanning setting to estimate the statistical significance of detections.
First, we define the statistical measure of predictive bias, $\text{score}_{\text{bias}}(S)$. It is a likelihood ratio score evaluated over a given subgroup $S$. The null hypothesis is that the predicted odds are correct for all subgroups: $H_0: \text{odds}(y_i) = \frac{\hat{p}_i}{1 - \hat{p}_i}$. The alternative hypothesis assumes a constant multiplicative bias in the odds for a given subgroup $S$: $H_1: \text{odds}(y_i) = q \, \frac{\hat{p}_i}{1 - \hat{p}_i}$ for $i \in S$, with $q \neq 1$.
In the classification setting, each observation's likelihood is Bernoulli distributed and assumed independent. This yields the following scoring function for a subgroup $S$:

$$ \text{score}_{\text{bias}}(S) = \max_{q} \sum_{i \in S} \Big( y_i \log q - \log\big(1 - \hat{p}_i + q \hat{p}_i\big) \Big). $$

(Following common practice in the scan statistics literature, we maximize the free parameter $q$ to identify the most likely alternative hypothesis, which also maximizes the score. The resulting score is influenced by both the number of observations in the subgroup and the most likely value of $q$.)
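For a fixed subgroup, the maximization over the multiplicative odds-bias parameter q can be sketched as a one-dimensional search; this is our illustrative implementation of the Bernoulli log-likelihood ratio described above, not the paper's code:

```python
import math

def score_bias_at_q(ys, ps, q):
    # Log-likelihood ratio of H1 (predicted odds multiplied by q) vs. H0,
    # summed over the observations (y_i, p_i) in a fixed subgroup.
    return sum(y * math.log(q) - math.log(1 - p + q * p)
               for y, p in zip(ys, ps))

def score_bias(ys, ps):
    # Maximize over q > 0 via golden-section search on log q; the
    # log-likelihood ratio is concave in log q for Bernoulli data,
    # so the search converges to the global maximum.
    lo, hi = math.log(1e-6), math.log(1e6)
    inv_phi = (math.sqrt(5) - 1) / 2
    while hi - lo > 1e-10:
        a = hi - inv_phi * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        if score_bias_at_q(ys, ps, math.exp(a)) < score_bias_at_q(ys, ps, math.exp(b)):
            lo = a
        else:
            hi = b
    q = math.exp((lo + hi) / 2)
    return score_bias_at_q(ys, ps, q), q
```

For example, three observations with predictions of 0.5 and two positive outcomes have observed odds of 2 against predicted odds of 1, so the maximizing q is 2; a perfectly calibrated subgroup yields q near 1 and a score near 0.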
Our bias scan is thus represented as $S^* = \arg\max_{S} \text{score}_{\text{bias}}(S)$.
Second, to estimate the p-value for whether the given classifier/model has a biased subgroup, we use parametric bootstrapping: we simulate predictions under the null of a correctly specified classifier, as also done by Shah and Buhlmann (2017) in their residual prediction tests. As extensions, with little added computation, we also introduce penalties for subgroup complexity, with an increasing penalty for each feature based on the size of that feature's subset of feature-values, but with no penalty if the feature includes 1 or all feature-values. This encourages a lower-dimensional detected subgroup and can be used in an "elbow-curve"-style heuristic trading off bias score against complexity. Also, to detect subgroups with higher-than-expected classification errors, we can adjust the bias scan based on the observation that a subgroup's predictions also imply its expected classification error rate.
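The significance test can be sketched as follows, with `scan` standing in for the full bias scan returning the maximum subgroup score (the function names and simulation count are our illustrative choices):

```python
import random

def bootstrap_pvalue(scan, ys, ps, n_sims=200, seed=0):
    # Parametric bootstrap: simulate outcomes under the null that the
    # classifier is correctly specified (y_i ~ Bernoulli(p_i)), re-run
    # the scan on each simulated dataset, and compare the observed
    # maximum score to the null distribution of maximum scores.
    rng = random.Random(seed)
    observed = scan(ys, ps)
    exceed = sum(
        scan([1 if rng.random() < p else 0 for p in ps], ps) >= observed
        for _ in range(n_sims)
    )
    # Add-one correction keeps the estimated p-value strictly positive.
    return (exceed + 1) / (n_sims + 1)
```

Because the scan is re-run in full on every simulated dataset, the resulting p-value automatically accounts for the multiple testing over all subgroups.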
3. Demonstrative Bias Scan Results
In synthetic experiments, we compare the bias scan method to lasso and stepwise regression analyses of residuals. (We calibrate the lasso analysis of residuals to identify the lasso penalty hyper-parameter that yields a 5% false positive rate on data with no bias injections, matching the chosen 5% false positive rate of the bias scan; this is visualized in the top-left of Figure 1. Though we inject only one subgroup with bias and the lasso can detect multiple subgroups simultaneously, the bias scan could also be applied repeatedly to detect multiple subgroups.) The detection performance results are shown in Figure 1; the stepwise regression is omitted because it always has worse detection performance than the lasso analysis. In the experiment, we generate evenly distributed data with 4 categorical features, each with arity 6, and draw Bernoulli outcomes (each feature-value has a random coefficient in an additive log-odds model, e.g. a logistic regression model, which generates a probability for each observation). The experiment varies the injected predictive bias: we inject additional log-odds bias of size 1.5 into one, or several, interaction effects of 2, 3, or 4 dimensions. To demonstrate the power of grouping weak signals, we affect only 100 observations, ranging from concentrating them all in one specific interaction to spreading them across several related interactions. (The total number of observations changes with the size of the injected region, to ensure that data remains evenly distributed across all feature-values.)
We find that the lasso analysis of residuals, which considers the space of all 2-, 3-, and 4-way interactions and uses the cross-validation-optimal penalty (using the "1SE" penalty term gives worse detection performance), has a better rate of non-zero interaction coefficients (top-right of Figure 1) when the injected bias is concentrated in one 2-way or 3-way interaction. However, when the injected bias is spread across four 2-way interactions, eight 3-way interactions, or sixteen 4-way interactions, the lasso detection rate falls below the bias scan detection rate, e.g. approximately 25% compared to 60%. A similar pattern, where the lasso has more difficulty when bias is spread across related interactions, appears in the recall and precision on the biased observations/tensor cells, shown in the bottom half of Figure 1. For example, when the bias is injected in eight 3-way interactions ("2x2x2x6"), the lasso has an average recall/precision of 35%/45%, compared to the bias scan's average recall/precision of 75%/80%. This highlights the potential improvement from considering subgroups rather than interactions, grouping weak but related signals together. If we use the bias scan to detect a subgroup and add it as a term in a logistic regression model, this also improves out-of-sample prediction performance over a lasso logistic regression with 2-, 3-, and 4-way interaction effects.
Recidivism Prediction Case Study
As a case study in identifying classification bias, we apply our bias scan method to the COMPAS crime recidivism risk prediction dataset provided by ProPublica. This dataset includes age, race, gender, number of prior offenses, and crime severity (felony vs. misdemeanor) for each individual, along with a binary gold-standard label (reoffending within a 2-year period) and the classification prediction made by the COMPAS algorithm (categorized risk groups 1, 2, …, 10). We find notable biases in the COMPAS predictions that we have not seen noted elsewhere. We assume the provided decile scores adequately represent all the private information that COMPAS uses, and initialize by fitting an unpenalized logistic regression on the categorized decile scores. Using the bias scan, we find that the COMPAS decile scores clearly have predictive bias on subgroups defined by counts of priors. Defendants with >5 priors are significantly under-estimated by the COMPAS deciles (mean predicted recidivism rate of 0.60 in the subgroup vs. an observed rate of 0.72), while those with 0 priors are significantly over-estimated (mean predicted rate of 0.38 vs. an observed rate of 0.29).
Using this initial finding, we refit the model to account for both decile score and discretized prior counts. Applying the bias scan again to the predictions of this improved classifier, we again identify two significant subgroups of classifier bias. Young (<25 years) males are under-estimated, regardless of race or initial crime type, with an observed recidivism rate of 0.60 against a predicted rate of 0.50. Additionally, females whose initial crimes were misdemeanors, within a subset of COMPAS decile scores, are over-estimated, with an observed recidivism rate of 0.21 against a predicted rate of 0.38. In Figure 2, we compare the original COMPAS decile model (black dashed line) with the logistic model that accounts for the four detected subgroups.
The two detected subgroups involve 2 and 3 features, respectively; they were identified by penalizing the complexity of the detected subgroup. The unpenalized detected subgroups involved 4 and 5 features, respectively. Adding penalty terms for complexity slightly reduces the scores, which remain significant at a 5% FPR, while reducing the number of involved features to 2 and 3, respectively. For the young males, the penalty removed the involvement of race and COMPAS decile score; for the females with misdemeanors, it removed the involvement of the priors and race features.
Other Datasets of Interest
Expanding our analysis, we identify predictive bias from the use of various classifiers (e.g., lasso regression on all 2-way interactions, tree-based classifiers) applied to various datasets (credit risk, stop-and-frisk weapon carrying prediction [based on the stop-and-frisk data-driven model proposed by (Goel et al., 2016)], income prediction, COMPAS, breast cancer prediction, and diabetes prediction). For each type of classifier, we detect significant subgroups of predictive bias in some of those datasets. Furthermore, when we hold out half of the dataset, we find the significant detected subgroups also have the same directional bias in the held-out data, though the magnitude of the bias was smaller, as expected.
As an example, we discuss the credit delinquency prediction dataset (the "Give Me Some Credit" dataset provided by Kaggle). In this dataset, using the cross-validation-optimal lasso regression on all the discretized features, the top identified over-estimated subgroup is defined by users in the top half of credit utilization in the data (>15% of credit limit) who have at least 1 occurrence of being 30-59, 60-89, and 90+ days late (i.e., on 3 separate payments), with an observed rate of 2-year delinquency of 0.79 against a predicted rate of 0.90. Such accounts make up about 1.7% of the dataset; for comparison, the mean rate of delinquency in the entire dataset is 15%. In this same data, we also detect a high-error subgroup, with both a predicted and observed delinquency rate of 61%, but with many more classification errors than expected, due to over-confidence by the classifier on both low and high predicted-risk consumers.
To understand the potential impact of this predictive bias, consider using this data to rank customers by their delinquency risk: 470 of the 496 consumers ranked in the riskiest 1% belong to the over-estimated subgroup. If the observations in that subgroup had their predicted odds adjusted by a constant multiplicative factor, only 286 consumers from that subgroup would remain in the top 1%.
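The constant multiplicative adjustment to odds mentioned above maps a predicted probability p to p / (p + q(1 - p)) when the odds are divided by a factor q > 1; a small helper (our naming, for probabilities strictly between 0 and 1) illustrates the re-ranking effect:

```python
def deflate_odds(p, q):
    # Divide the predicted odds p/(1-p) by q and convert back to a
    # probability; q > 1 corrects an over-estimated subgroup downward.
    # Valid for 0 < p < 1.
    odds = p / (1 - p) / q
    return odds / (1 + odds)

# A 0.90 predicted risk with its odds halved (q = 2) drops to ~0.818,
# which can move such consumers out of a top-1% risk ranking.
print(round(deflate_odds(0.90, 2.0), 3))  # 0.818
```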
- Adler et al. (2016) Philip Adler, Casey Falk, Sorelle A Friedler, Gabriel Rybeck, Carlos Scheidegger, Brandon Smith, and Suresh Venkatasubramanian. 2016. Auditing black-box models by obscuring features. arXiv preprint arXiv:1602.07043 (2016).
- Angwin et al. (2016) Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine bias: There’s software used across the country to predict future criminals. and it’s biased against blacks. ProPublica, May 23 (2016).
- Chouldechova (2017) Alexandra Chouldechova. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. arXiv preprint arXiv:1703.00056 (2017).
- Dwork et al. (2012) Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference. ACM, 214–226.
- Efron and Tibshirani (1994) Bradley Efron and Robert J Tibshirani. 1994. An introduction to the bootstrap. CRC press.
- Feldman et al. (2015) Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. 2015. Certifying and removing disparate impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 259–268.
- Flores et al. (2016) Anthony W Flores, Kristin Bechtel, and Christopher T Lowenkamp. 2016. False Positives, False Negatives, and False Analyses: A Rejoinder to Machine Bias: There’s Software Used across the Country to Predict Future Criminals. And It’s Biased against Blacks. Fed. Probation 80 (2016), 38.
- Goel et al. (2016) Sharad Goel, Justin M Rao, Ravi Shroff, et al. 2016. Precinct or prejudice? Understanding racial disparities in New York City's stop-and-frisk policy. The Annals of Applied Statistics 10, 1 (2016), 365–394.
- House (2016) White House. 2016. Big Data: A Report on Algorithmic Systems, Opportunity, and Civil Rights. Washington, DC: Executive Office of the President, White House (2016).
- Kumar and Neill (2012) Tarun Kumar and Daniel B Neill. 2012. Fast multidimensional subset scan for outbreak detection and characterization. In Proceedings of the International Society of Disease Surveillance Annual Conference. ISDS.
- Miller (2015) Claire Cain Miller. 2015. When algorithms discriminate. New York Times 9 (2015).
- Neill (2012) Daniel B Neill. 2012. Fast subset scan for spatial pattern detection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74, 2 (2012), 337–360.
- Neill et al. (2013) Daniel B Neill, Edward McFowland, and Huanian Zheng. 2013. Fast subset scan for multivariate event detection. Statistics in medicine 32, 13 (2013), 2185–2208.
- O’Neil (2016) Cathy O’Neil. 2016. Weapons of math destruction: How big data increases inequality and threatens democracy. Crown Publishing Group (NY).
- Romei and Ruggieri (2014) Andrea Romei and Salvatore Ruggieri. 2014. A multidisciplinary survey on discrimination analysis. The Knowledge Engineering Review 29, 05 (2014), 582–638.
- Shah and Buhlmann (2017) Rajen D. Shah and Peter Buhlmann. 2017. Goodness-of-fit tests for high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2017). https://doi.org/10.1111/rssb.12234
- Skeem and Lowenkamp (2016) Jennifer L Skeem and Christopher T Lowenkamp. 2016. Risk, race, and recidivism: predictive bias and disparate impact. Criminology 54, 4 (2016), 680–712.
- Speakman et al. (2016) Skyler Speakman, Sriram Somanchi, Edward McFowland III, and Daniel B Neill. 2016. Penalized fast subset scanning. Journal of Computational and Graphical Statistics 25, 2 (2016), 382–404.
- Starr (2014) Sonja Starr. 2014. Sentencing, by the Numbers. New York Times (Aug. 10, 2014), available at http://www.nytimes.com/2014/08/11/opinion/sentencing-by-the-numbers.html (2014).
- Yuan and Lin (2006) Ming Yuan and Yi Lin. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 1 (2006), 49–67.