ProPublica’s COMPAS Data Revisited

Matias Barenstein
June 11, 2019
Abstract

In this paper I re-examine the COMPAS recidivism score and criminal history data collected by ProPublica in 2016, which has fueled intense debate and research in the nascent field of ‘algorithmic fairness’ or ‘fair machine learning’ over the past three years. ProPublica’s COMPAS data is used in an ever-increasing number of studies to test various definitions and methodologies of algorithmic fairness. This paper takes a closer look at the actual datasets put together by ProPublica. By doing so, I find that ProPublica made an important data processing mistake when it created some of the key datasets most often used by other researchers, in particular the datasets built to study the likelihood of recidivism within two years of the original COMPAS screening date. As I show in this paper, ProPublica made a mistake implementing the two-year sample cutoff rule for recidivists in such datasets (whereas it implemented an appropriate two-year sample cutoff rule for non-recidivists). As a result, ProPublica incorrectly kept a disproportionate share of recidivists. This data processing mistake leads to biased two-year recidivism datasets, with artificially high recidivism rates. It also affects the positive and negative predictive values. On the other hand, this data processing mistake does not impact some of the key statistical measures highlighted by ProPublica and other researchers, such as the false positive and false negative rates, nor the overall accuracy.

Keywords: Fair Machine Learning, Algorithmic Fairness, Recidivism, Risk, Bias, COMPAS, ProPublica111E-mail: mbarenstein@gmail.com. The author is a staff economist at the Federal Trade Commission. However, he developed this work independently, on his own personal time. The views expressed in this article are therefore those of the author. They do not necessarily represent those of the Federal Trade Commission or any of its Commissioners.

Due to the rise in data collection and its use, the accompanying development of more predictive and complex machine learning models, and the still nascent field of artificial intelligence, the past several years have seen a marked spike in interest and research regarding what is now often referred to as “algorithmic fairness” or “fair machine learning” (see Corbett-Davies and Goel, 2018; Cowgill and Tucker, 2019; Kleinberg et al., 2018).222See also a seminal paper in this literature by Barocas and Selbst (2016), although they do not use this specific terminology.

One notable event in the chronological development of this field, which helped propel the interest and research into algorithmic fairness, was the groundbreaking investigative journalism work of ProPublica on the COMPAS recidivism risk score (Angwin et al., 2016), which is sometimes used to aid various decisions in the judicial system.333COMPAS is short for: Correctional Offender Management Profiling for Alternative Sanctions. In 2016, a team of investigative journalists from ProPublica constructed a dataset of defendants from Broward County, FL, who had been arrested in 2013 or 2014 and assessed with the COMPAS screening system. ProPublica then collected data on future arrests for these defendants through the end of March 2016, in order to study how the COMPAS score predicted recidivism.

Based on its analysis, focusing on one set of predictive metrics, ProPublica concluded that the COMPAS risk score was biased against African-Americans. The company that developed the COMPAS risk scoring system, Northpointe, focusing on a different predictive metric, defended the model as unbiased (Dieterich et al., 2016). This sparked intense debate and research on the various possible definitions of fairness. Some of this work has been primarily conceptual, for example the theoretical work showing the impossibility of simultaneously attaining some of the more popular fairness goals (Chouldechova, 2016; Kleinberg et al., 2018).

ProPublica’s investigation was groundbreaking, since it used public records requests to obtain COMPAS scores and dates and personal information on a group of defendants, as well as prison and jail information for them, and was able to match and merge these disparate data sources. Moreover, perhaps for full transparency and following best practices for reproducibility, ProPublica made the dataset it collected available to the public. As a result, the ProPublica COMPAS data has become one of the key benchmarking datasets with which a growing number of researchers have tested novel algorithmic fairness definitions. Indeed, it has become perhaps the most widely used dataset in the field of algorithmic fairness (Bilal Zafar et al., 2016, 2017; Chouldechova, 2016; Corbett-Davies and Goel, 2018; Corbett-Davies et al., 2017; Cowgill, 2018; Flores et al., 2016; Rudin et al., 2018).

While ProPublica’s COMPAS data is used in an ever-increasing number of studies to test various definitions and methodologies of algorithmic fairness, researchers have taken the data as is to test their methodologies, but do not appear to have examined closely the data itself for data processing issues.444Except for Rudin et al. (2018), who reconstruct datasets from the original ProPublica Python database, “partly to ensure the quality of the features and partly to create new features.” (p.32) While it appears that in so doing they may have avoided making the same data processing mistake as ProPublica, they do not generally highlight the dataset differences between their datasets and ProPublica’s, and do not identify ProPublica’s data processing mistake. Their focus is altogether different, as they attempt to reverse engineer the COMPAS recidivism risk scores, to understand how Northpointe builds those scores. This paper, instead of testing a novel fairness definition or procedure, takes a closer look at the actual datasets put together by ProPublica. Doing so, I find that ProPublica made an important data processing mistake creating some of the key datasets most often used by other researchers, in particular the datasets built to study the likelihood of recidivism within two years of the original offense and COMPAS screening date.

ProPublica made a mistake implementing the two-year sample cutoff rule for recidivists in the two-year recidivism datasets (whereas it implemented an appropriate two-year sample cutoff rule for non-recidivists). As a result, ProPublica incorrectly kept a disproportionate share of recidivists in such datasets. This data processing mistake leads to biased two-year recidivism datasets, with artificially high recidivism rates. To my knowledge, this is the first paper to highlight this key data processing mistake.

To construct these datasets, ProPublica presumably wanted to keep only people who could be observed for at least two years, given that its criminal history data collection window ended on April 1, 2016. Therefore, we should not have expected to see anybody in the two-year datasets with COMPAS screening (or arrest) dates after 4/1/2014 (i.e. less than two years prior to the end of data collection). However, as we will see below, there are many people in ProPublica’s two-year recidivism datasets who do indeed have a COMPAS screening (or arrest) date after this cutoff, all the way through December 31, 2014, the end date of the original database.

Taking a closer look at the data, I find that ProPublica dropped non-recidivists with COMPAS screening dates post 4/1/2014. However, it kept people with COMPAS screening dates after 4/1/2014 if they recidivated. ProPublica’s data processing logic that created the two-year recidivism datasets is as follows: keep a person if they are observed for at least two years OR keep a person if they recidivate within two years. This leads to the issue noted above. Unfortunately, this results in a biased sample dataset.555It is not clear whether ProPublica intended to actually process the data this way, in which case it is a conceptual mistake, or whether they did not intend to use this faulty logic, in which case it is a data processing mistake. In either case, it leads to the same biased sample dataset.

As I show in this paper, the bias in the two-year dataset is clear: there is a disproportionate number of recidivists. This fundamental problem in the dataset construction affects some statistics more than others. It obviously has a substantial impact on the total number of recidivists, and hence on the relative share or rate of recidivism. In particular, it artificially inflates the prevalence of recidivism, which in turn affects the positive predictive value (PPV), or precision, and the negative predictive value (NPV). On the other hand, it has relatively little impact on several other key statistics, such as accuracy, the false positive rate (FPR), and the false negative rate (FNR).666Or one minus these rates, i.e. specificity and sensitivity.

Most of the algorithmic fairness research using ProPublica’s COMPAS data has focused on some of the latter prediction metrics, which happen to be less impacted by the data processing error highlighted here. (Although the utility of focusing on those particular prediction metrics has been called into question; see Corbett-Davies and Goel (2018))777See also Northpointe (Dieterich et al., 2016), who argue that PPV and NPV are more relevant. Nevertheless, as I examine below, various authors have produced numerous Figures and Tables with clearly biased numbers, especially regarding the recidivism rate.

In the remainder of this paper, I examine in detail the data processing issue just highlighted, as well as an anomaly in the original COMPAS scores data obtained by ProPublica. I then construct a new dataset that drops all the people whose COMPAS screening date occurs after 4/1/2014, thus implementing a more appropriate sample cutoff for all people for the two year recidivism analysis. I then replicate some of the key numbers, figures, and tables using this new “corrected” dataset, and show how they differ from the results obtained with the biased two-year recidivism sample dataset created by ProPublica, and widely used in the literature.

ProPublica obtained a dataset of pretrial defendants and probationers from Broward County, FL, who had been assessed with the COMPAS screening system sometime between January 1, 2013, and December 31, 2014. ProPublica then collected data on future arrests through the end of March 2016, for the more than 11 thousand pretrial defendants in this dataset, in order to study how the COMPAS score predicts recidivism for these defendants.888The set of more than 11 thousand pretrial defendants is what I call the full dataset. It has 11,757 people.999ProPublica obtained criminal history information (both before and after the COMPAS screen date) for their sample of COMPAS pretrial defendants from public criminal records from the Broward County Clerk’s Office website through April 1, 2016. It also obtained jail records from the Broward County Sheriff’s Office from January 2013 to April 2016, and downloaded public incarceration records from the Florida Department of Corrections website.

The ProPublica data is available on the Web at https://github.com/propublica/compas-analysis. ProPublica collected the data for its study and created a Python database. From that database it constructed various sub-datasets that merged and calculated various important features, for example the period of time between arrests, and the presence of a re-arrest for a new crime within two years of the original one. ProPublica then exported these sub-datasets into .csv files. These files are the ones most often used by other researchers. ProPublica’s two-year datasets have some important problems, which are in fact the topic of this paper.

I primarily use two of the .csv files that ProPublica created, named “compas-scores.csv” and “compas-scores-two-years.csv”. The first file contains the full dataset (of pretrial defendants that ProPublica obtained from the Broward County Sheriff’s Office). That file contains 11,757 people.101010This file does not contain some key information present in other datasets, such as any prison time served for the original crime if convicted, which is necessary to calculate whether the person was free for at least two years. Thus, it does not have a flag for two-year recidivism. This total is trimmed down by ProPublica to 10,331 people (I discuss the trimming done by ProPublica in the Appendix).
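As a point of reference, the following sketch shows how these two files can be loaded with pandas. The file names are those published in ProPublica’s GitHub repository; the screening date column name (compas_screening_date) is the one I believe the csv files use, but it should be verified against the actual data.

    import pandas as pd

    # Load ProPublica's full pretrial file and its two-year general recidivism file,
    # as published in the propublica/compas-analysis GitHub repository.
    full = pd.read_csv("compas-scores.csv")
    two_year = pd.read_csv("compas-scores-two-years.csv")

    # Parse the COMPAS screening date for the cutoff checks discussed below.
    for df in (full, two_year):
        df["compas_screening_date"] = pd.to_datetime(df["compas_screening_date"])

    print(len(full))      # 11,757 people in the full file
    print(len(two_year))  # 7,214 people in the two-year file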

The second file I use is one that ProPublica created for the purpose of studying two-year general recidivism. (I use the term general to distinguish it from the smaller subset of violent recidivism; general recidivism includes both violent and non-violent offenses. I focus on the general recidivism two-year dataset in this paper, but the two-year violent recidivism data created by ProPublica has the same data processing issue.)111111ProPublica did not actually make the two-year violent recidivism csv file it uses for its key analysis available, but it can be easily reconstructed. The more reduced two-year violent recidivism csv file it did make available, which it uses only in certain parts of its analyses, has the further problem that it drops people who are non-violent recidivists. This file contains, in theory, a subset of people who are observed for at least two years, and it tags people who recidivated within two years with the two_year_recid flag turned on. This file contains 7,214 people.

The approximately three thousand people dropped from the full dataset to generate the two-year recidivism dataset are dropped because they are neither observed for at least two years (outside prison) nor recidivate within two years.

To construct the two-year recidivism datasets,121212ProPublica also constructed two-year violent recidivism dataset(s). ProPublica presumably wanted to keep only people who could be observed for at least two years, given that its criminal history data collection window ended on April 1, 2016. As mentioned in the introduction, we should not have expected, therefore, to see anybody in the two-year datasets with COMPAS screening (or arrest) dates after 4/1/2014 (i.e. less than two years prior to ProPublica’s data collection). However, as I show here, there are many people in the data who do indeed have a COMPAS screening (or arrest) date after this cutoff, all the way through December 31, 2014, which is the end date of the full database.

Here I graph the number of cases or arrests by COMPAS screening date, and I draw a vertical red line at April 1, 2014. (This is the point in time after which we should not have seen any people entering the two-year recidivism datasets, as just explained.) For this and the subsequent histograms of COMPAS screening dates, I use 7-day (i.e. week-long) bins.
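A rough sketch of how such a histogram can be produced, reusing the DataFrames loaded in the earlier sketch; this is an illustrative approximation, not the exact plotting code behind the Figures.

    import matplotlib.pyplot as plt
    import pandas as pd

    def screening_date_histogram(df, title):
        # Count people in 7-day bins of the COMPAS screening date.
        weekly = df.set_index("compas_screening_date").resample("7D").size()
        plt.figure()
        plt.bar(weekly.index, weekly.values, width=6)
        plt.axvline(pd.Timestamp("2014-04-01"), color="red")  # two-year cutoff date
        plt.xlabel("COMPAS screen date (7-day bins)")
        plt.ylabel("Persons")
        plt.title(title)
        plt.show()

    screening_date_histogram(full, "Full dataset")
    screening_date_histogram(two_year, "ProPublica two-year dataset")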

Figure 1: Persons by COMPAS Screen Date (7-day bins) - Full Dataset

Other than the very noticeable drop in COMPAS screen dates in mid-2013, this graph appears reasonable.131313Also noticeable is the higher number of COMPAS screen dates in the first half of 2013. The dates where the mid-2013 drop occurs are in June and July 2013. It is not clear why there is such a drop in COMPAS cases during these two months, but I do not address this issue in my current research. (To the extent this is a problem, it appears to be a problem with the original dataset that ProPublica received from Broward County, since it is also evident in the “raw-scores” dataset. So it does not appear to be a data processing issue by ProPublica, and it is not clear what can be done about it.)141414I did check whether the relatively few people with COMPAS screen dates during those two months looked different in various dimensions, but they did not. (Except they did have a somewhat longer time between the arrest date and COMPAS screen date, with a mean of 5, compared to 1 for the rest of the data.)

Figure 2: Persons by COMPAS Screen Date (7-day bins)- ProPublica Two-Year Dataset

Here, in ProPublica’s two-year general recidivism dataset, we begin to see ProPublica’s data processing problem. In this two-year general recidivism dataset we should not have expected to have any records with COMPAS screen dates after 4/1/14, since ProPublica collected criminal history data only through the end of March 2016. For someone to have at least two years of exposure post initial arrest, the two-year recidivism dataset should have only contained people with COMPAS screen dates prior to 4/1/2014. This date is indicated by the red vertical line. However, we clearly see that while the number of people drops substantially after 4/1/2014, there are still non-trivial numbers of people after that date. This is because, as mentioned above, there was an error in ProPublica’s data processing used to create the two-year recidivism datasets.

To create the two-year dataset, ProPublica used the following logic: you are kept if your COMPAS screen date is at least two years prior to ProPublica’s data collection time, that is, two years prior to the end of March 2016 (net of any jail and prison time); or you are kept with less than two years of observation, if you recidivated. Unfortunately, for the latter group ProPublica did not apply the 4/1/2014 cutoff to the COMPAS screen date. This creates an unbalanced dataset with too many recidivists, as shown more clearly in the Figures below.
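Schematically, ProPublica’s rule and the rule it presumably intended can be written as follows. This is a reconstruction of the logic, not ProPublica’s actual code, and the exposure and recidivism column names (days_observed_outside, recid_within_two_years) are hypothetical placeholders.

    def propublica_rule(df):
        # Keep a person if observed for at least two years outside jail/prison,
        # OR if they recidivated within two years. The second condition has no
        # screening-date cutoff, so late-screened recidivists slip in.
        keep = (df["days_observed_outside"] >= 730) | (df["recid_within_two_years"] == 1)
        return df[keep]

    def intended_rule(df, cutoff="2014-04-01"):
        # The cutoff should bind for everyone: only people screened early enough
        # to be followed for two years, recidivists and non-recidivists alike.
        early_enough = df["compas_screening_date"] <= cutoff
        keep = early_enough & (
            (df["days_observed_outside"] >= 730) | (df["recid_within_two_years"] == 1)
        )
        return df[keep]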

To see the data processing mistake more clearly, I now take a look at these COMPAS screen dates separately for recidivists and non-recidivists.151515I do this by overall or any recidivism (i.e. the “is_recid” variable in ProPublica’s datasets), not two-year recidivism, since the full dataset does not have a two-year recidivism flag.

Figure 3: Persons by COMPAS Screen Date (7-day bins) by Recidivism Status - Full Dataset

Below I show the same Figures using ProPublica’s two-year recidivism dataset. For comparison to the full data figures above, I also do this for the overall or any recidivism variable (i.e. the “is_recid” variable in ProPublica’s dataset, instead of the “two_year_recid” variable).161616As we see in the Recidivism Rates Section below, there are 220 people who have the general “is_recid” flag turned on, but not the “two_year_recid” flag. These are people who recidivated, but did so more than two years after the original COMPAS date, and before the end of ProPublica’s criminal history data window at the end of March 2016. These 220 people represent a 0.06 share of the 3,471 people who recidivate in total.

Figure 4: Persons by COMPAS Screen Date (7-day bins) by Recidivism Status - ProPublica Two-Year Dataset

These Figures show that ProPublica correctly dropped all non-recidivists with COMPAS screening dates post 4/1/2014. However, it kept people with COMPAS screening dates after 4/1/2014 if they recidivated. Indeed, in the Tables below we see that the two-year recidivism dataset has almost exactly the same number of people who recidivate at any point in time as the full data does, 3,471 vs. 3,473.171717The difference of two people is because these 2 people have score_text = “N/A”, which ProPublica drops from the two-year recidivism dataset.

Table 1: Any Recidivism - Full data
is_recid Freq
0 6858
1 3473
Total 10331
Table 2: Any Recidivism - ProPublica Two-Year dataset
is_recid Freq
0 3743
1 3471
Total 7214

ProPublica’s data processing logic that created the two-year recidivism datasets is as follows: keep a person if they are “observed” for at least two years outside of jail/prison OR keep a person if they recidivate within two years.181818ProPublica obtained criminal history information from the Broward County Clerk’s Office website, and jail records from the Broward County Sheriff’s Office, as well as public incarceration records from the Florida Department of Corrections website. I am not sure what happens if someone from their sample moves away from Florida after the COMPAS screen date. In particular, it is not clear whether they would show up in their data again if they commit a crime in a different state. I also do not know what happens if any of the people in their sample become deceased. There could be some sample attrition.

ProPublica should have dropped from the two year recidivism dataset any people with COMPAS screen dates post 4/1/2014. But it kept people after that date if they recidivated. Unfortunately, this results in a biased sample dataset.191919It is not clear whether ProPublica intended to actually process the data this way, in which case it is a conceptual mistake, or whether they did not intend to use this faulty logic, in which case it is a data processing mistake. In either case, it leads to the same biased sample dataset.

Here, I construct a corrected version of the two-year recidivism dataset where I drop all people with a COMPAS screen date after April 1, 2014, including recidivists. In this corrected dataset, I end up with the same number of non-recidivists as in the ProPublica two-year dataset, but I have substantially fewer recidivists.
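In code, the correction amounts to a single filter on ProPublica’s two-year file (reusing the two_year DataFrame from the loading sketch above; whether April 1, 2014 itself is kept or dropped is a minor boundary detail):

    import pandas as pd

    # Drop everyone screened after April 1, 2014, recidivists included, so that all
    # remaining people have roughly two years of follow-up before the end of
    # ProPublica's criminal history window (end of March 2016).
    cutoff = pd.Timestamp("2014-04-01")
    corrected = two_year[two_year["compas_screening_date"] <= cutoff].copy()

    print(len(two_year))   # 7,214 people in ProPublica's two-year dataset
    print(len(corrected))  # 6,216 people after the correction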

Table 3: Pre vs. Post April 1, 2014 COMPAS screen dates - Two-Year Dataset
post_april_2014 Freq
0 6216
1 998
Total 7214
Table 4: Any Recidivism by Pre-Post April 1, 2014 COMPAS screen date - ProPublica Two-Year data
is_recid  post_april_2014 = 0  post_april_2014 = 1
0         3743                 0
1         2473                 998
Total     6216                 998

To avoid right-censoring, one should arguably implement an even earlier sample cutoff, since people with COMPAS screen dates just before the April 1, 2014 cutoff who have jail/prison stints will not have two years of exposure net of prison time.202020In the Appendix, I explore this and find that the best sample cutoff that still preserves the most data is around February 1, 2014. But I will use April 1, 2014 here for simplicity. If we look at the COMPAS screening dates for this corrected dataset, we have the following:212121Again, for comparison to the full data Figures displayed earlier, I do this for the overall or any recidivism variable (i.e. the “is_recid” variable in ProPublica’s dataset, instead of the “two_year_recid” variable). As noted above, there are 220 people who have the general “is_recid” flag turned on, but not the “two_year_recid” flag; see the Recidivism Rates Section below. These people, by definition, all have COMPAS screen dates before April 2014.

Figure 5: Persons by COMPAS Screen Date (7-day bins) by Recidivism Status - Corrected Two-Year Dataset

Here I will focus on the two-year recidivism variable. The COMPAS screen date Figures above, split by recidivism status, were done using the overall or any recidivism variable “is_recid”, not the “two_year_recid” variable, for comparison purposes to the full data. As we see here, there are 220 more people with is_recid=1 than two_year_recid=1. These are people who recidivated, but did so more than two years after the original COMPAS screen date (and before the end of ProPublica’s criminal history data window at the end of March 2016).
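Cross-tabulations like the ones in the Tables below can be produced directly from the two datasets (again reusing the two_year and corrected DataFrames from the sketches above):

    import pandas as pd

    # Any recidivism vs. two-year recidivism, with totals, in ProPublica's two-year
    # dataset and in the corrected one (cf. Tables 5 and 6).
    print(pd.crosstab(two_year["is_recid"], two_year["two_year_recid"], margins=True))
    print(pd.crosstab(corrected["is_recid"], corrected["two_year_recid"], margins=True))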

Table 5: Any vs. Two-Year Recidivism - ProPublica Two-Year dataset
is_recid  two_year_recid = 0  two_year_recid = 1  Total
0         3743                0                   3743
1         220                 3251                3471
Table 6: Any vs. Two-Year Recidivism - Corrected Two-Year dataset
is_recid  two_year_recid = 0  two_year_recid = 1  Total
0         3743                0                   3743
1         220                 2253                2473

These 220 people represent a 0.06 share of the 3,471 people who recidivate in total. These 220 people, by definition, all have COMPAS dates before April 2014. So if I had done the COMPAS screen date by recidivism status Figures for the two-year ProPublica dataset using the two_year_recid variable instead, the two_year_recid = 0 graph would also show no people after April 2014. Also by definition, all the people with is_recid = 1 who have COMPAS screen dates after April 2014 would also have two_year_recid = 1 (since April 1, 2014, is less than two years before the end of March 2016), so the number of people with post April 1, 2014 COMPAS dates would be the same in the two_year_recid = 1 graph. Thus the main findings of a data processing error are similar for the general recidivism variable and the two-year recidivism variable.

Either way, the bias in ProPublica’s two-year recidivism datasets is clear: there is a disproportionate number of recidivists. This fundamental problem in the dataset construction affects some statistics more than others. It obviously has a substantial impact on the total number of recidivists, and hence, the relative share or rate of recidivism. In particular, it artificially inflates the prevalence of recidivism.

Table 7: Two-Year Recidivism - ProPublica Two-Year dataset
two_year_recid  Share
0               0.55
1               0.45
Table 8: Two-Year Recidivism vs. Pre-Post April 1, 2014 COMPAS screen date - ProPublica Two-Year data
two_year_recid  post_april_2014 = 0  post_april_2014 = 1
0               3963                 0
1               2253                 998
Total           6216                 998
Table 9: Two-Year Recidivism vs. Pre-Post April 1, 2014 COMPAS screen date (shares) - ProPublica Two-Year data
two_year_recid  post_april_2014 = 0  post_april_2014 = 1
0               0.64                 0
1               0.36                 1

From these Tables we see that two-year recidivism is 0.45 in ProPublica’s two-year data.222222This number is also reported by ProPublica in the results for item (51) in its GitHub Jupyter notebook (Larson et al., 2017). We also see that since ProPublica kept recidivists (but did not keep non-recidivists) with COMPAS screen dates post 4/1/14, all people with COMPAS screen dates post 4/1/2014 in the two-year recidivism dataset are recidivists. The correct two-year recidivism rate for the two-year data should therefore be 0.36. But due to the post 4/1/2014 recidivists that ProPublica incorrectly kept, the two-year recidivism rate is artificially inflated to 0.45. This is a difference of 9 percentage points, and thus the two-year recidivism rate calculated by ProPublica is 25 percent higher than the true rate.

One can further examine how this artificial inflation of the recidivism rate holds for people across the COMPAS score decile distribution. I do so in the next Figure.
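A sketch of how such a Figure can be built, assuming the decile column is named decile_score as I believe it is in ProPublica’s files, and reusing the DataFrames from the earlier sketches:

    import matplotlib.pyplot as plt

    # Mean two-year recidivism rate within each COMPAS score decile, in ProPublica's
    # two-year dataset and in the corrected one.
    pp_by_decile = two_year.groupby("decile_score")["two_year_recid"].mean()
    corr_by_decile = corrected.groupby("decile_score")["two_year_recid"].mean()

    plt.plot(pp_by_decile.index, pp_by_decile.values, marker="o", label="ProPublica two-year data")
    plt.plot(corr_by_decile.index, corr_by_decile.values, marker="o", label="Corrected two-year data")
    plt.xlabel("COMPAS score decile")
    plt.ylabel("Two-year recidivism rate")
    plt.legend()
    plt.show()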

Figure 6: Two-year Recidivism by COMPAS score decile

Another way of seeing ProPublica’s data processing mistake when creating the two-year recidivism datasets is by doing a survival analysis. In the Appendix I do such an analysis, and it confirms the results presented here.

Here I explore how the data processing mistake impacts other results. In particular, I look at the effect on the results from the contingency table analysis performed by ProPublica. For this analysis, ProPublica turned the COMPAS score categories of Low, Medium, and High, into a binary classifier, grouping Medium and High scores into an overall High score category. I do the same here, and report the results obtained by ProPublica with their two year recidivism dataset, and the analogous set of results obtained using the corrected two-year recidivism dataset.
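A sketch of this binarization, assuming the category column is named score_text with values “Low”, “Medium”, and “High” (as I believe it is in ProPublica’s files):

    import pandas as pd

    # Group Medium and High into an overall High (=1) category; Low stays 0.
    for df in (two_year, corrected):
        df["high_score"] = df["score_text"].isin(["Medium", "High"]).astype(int)

    # Confusion matrix of actual two-year recidivism vs. the binary score (cf. Table 12).
    print(pd.crosstab(two_year["two_year_recid"], two_year["high_score"]))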

Table 10: COMPAS Score Categories - ProPublica Two-Year Dataset
factor_score_text Freq
Low 3897
Medium 1914
High 1403
Total 7214
Table 11: COMPAS Score Categories [Converted to Binary] - ProPublica Two-Year Dataset
high_score Freq
0 3897
1 3317
Total 7214
Table 12: Confusion Matrix - ProPublica Two-Year Data - Recidivism vs. Low/High COMPAS Score
Actual two_year_recid  Predicted Low  Predicted High  Share
0                      2681           1282            0.55
1                      1216           2035            0.45
Table 13: COMPAS Score Categories [Converted to Binary] - Corrected Two-Year Dataset
high_score Freq
0 3522
1 2694
Total 6216
Table 14: Confusion Matrix - Corrected Two-Year Data - Recidivism vs. Low/High COMPAS Score
Actual two_year_recid  Predicted Low  Predicted High  Share
0                      2681           1282            0.64
1                      841            1412            0.36
Table 15: Confusion Matrix: Results that are similar between ProPublica Two-Year vs. Corrected Two-Year data
                    N     Accuracy  FPR    FNR
ProPublica_results  7214  0.654     0.323  0.374
Corrected_results   6216  0.658     0.323  0.373
Table 16: Confusion Matrix: Results that are different between ProPublica Two-Year vs. Corrected Two-Year data
                    N     Prevalence  Pos Pred Value  Neg Pred Value  Detection Rate
ProPublica_results  7214  0.45        0.61            0.69            0.28
Corrected_results   6216  0.36        0.52            0.76            0.23

Above I have replicated some of the results obtained by ProPublica.232323In particular, those reported in item (51) in their GitHub Jupyter notebook (Larson et al., 2017). Although the accuracy and detection rate are not reported by ProPublica. I also report the analogous results using the corrected sample cutoff two-year recidivism dataset. In addition to the prevalence of recidivism, we see that the biased two-year dataset used by ProPublica also affects the positive predictive value (PPV) (which is often referred to as “precision”), and the negative predictive value.242424And the detection rate, as well as the no-information rate, are also different.252525ProPublica highlights the 0.61 PPV for general recidivism (and a 0.20 PPV for violent recidivism) in its article (Angwin et al., 2016). In the corrected data, with the lower prevalence of recidivism, not surprisingly, we see in the Table above that the PPV is lower and the NPV is higher. Northpointe focuses on the complements to PPV and NPV (i.e. 1 minus these) (see Dieterich et al., 2016). If we focus on these instead, we see that with the biased ProPublica two-year dataset 0.39 of people were labeled high risk but did not reoffend, whereas with the corrected data 0.48 of people are labeled high risk but did not reoffend. And before 0.31 were labeled low risk but did re-offend, whereas now 0.24 do so. Again, given the lower prevalence of recidivism in the corrected data, it is not surprising that one type of error goes up and the other goes down.

On the other hand, the biased dataset has relatively little impact on several other key statistics, such as accuracy, the false positive rate (FPR), and the false negative rate (FNR).262626Or one minus these rates, i.e. specificity and sensitivity.
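For completeness, the summary statistics in the Tables above follow directly from the 2x2 counts. A sketch, using the counts from Tables 12 and 14 (tn, fp, fn, tp denote true negatives, false positives, false negatives, and true positives with respect to the binary high-score prediction):

    def confusion_metrics(tn, fp, fn, tp):
        # Summary statistics from a 2x2 confusion matrix.
        n = tn + fp + fn + tp
        return {
            "N": n,
            "Accuracy": (tn + tp) / n,
            "FPR": fp / (fp + tn),
            "FNR": fn / (fn + tp),
            "Prevalence": (fn + tp) / n,
            "PPV": tp / (tp + fp),
            "NPV": tn / (tn + fn),
            "Detection Rate": tp / n,
        }

    # ProPublica two-year data (Table 12) vs. corrected data (Table 14):
    print(confusion_metrics(tn=2681, fp=1282, fn=1216, tp=2035))
    print(confusion_metrics(tn=2681, fp=1282, fn=841, tp=1412))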

Next, following ProPublica’s analysis, I repeat the confusion matrix analysis separately for African-Americans and Caucasians (whom I label black and white, respectively, in the Tables below). This is the key analysis that garnered the most attention, showing a higher false positive rate (FPR) and a lower false negative rate (FNR) for blacks than whites.

Table 17: Confusion Matrix - African Americans - ProPublica Two-Year Data - Recidivism vs. Low/High COMPAS Score
Actual two_year_recid  Predicted Low  Predicted High  Share
0                      990            805             0.49
1                      532            1369            0.51
Table 18: Confusion Matrix - Caucasians - ProPublica Two-Year Data - Recidivism vs. Low/High COMPAS Score
Actual two_year_recid  Predicted Low  Predicted High  Share
0                      1139           349             0.61
1                      461            505             0.39
Table 19: Confusion Matrix - African Americans - Corrected Two-Year Data - Recidivism vs. Low/High COMPAS Score
Actual two_year_recid  Predicted Low  Predicted High  Share
0                      990            805             0.57
1                      375            969             0.43
Table 20: Confusion Matrix - Caucasians - Corrected Two-Year Data - Recidivism vs. Low/High COMPAS Score
Actual two_year_recid  Predicted Low  Predicted High  Share
0                      1139           349             0.7
1                      314            330             0.3
Table 21: Confusion Matrix African-Americans: Results that are similar between ProPublica Two-Year vs. Corrected Two-Year data
                          N     Accuracy  FPR    FNR
ProPublica_black_results  3696  0.638     0.448  0.280
Corrected_black_results   3139  0.624     0.448  0.279
Table 22: Confusion Matrix Caucasians: Results that are similar between ProPublica Two-Year vs. Corrected Two-Year data
                          N     Accuracy  FPR    FNR
ProPublica_white_results  2454  0.670     0.235  0.477
Corrected_white_results   2132  0.689     0.235  0.488
Table 23: Confusion Matrix - African-Americans: Results that are different between ProPublica Two-Year vs. Corrected Two-Year data
                          N     Prevalence  Pos Pred Value  Neg Pred Value  Detection Rate
ProPublica_black_results  3696  0.51        0.63            0.65            0.37
Corrected_black_results   3139  0.43        0.55            0.73            0.31
Table 24: Confusion Matrix - Caucasians: Results that are different between ProPublica Two-Year vs. Corrected Two-Year data
                          N     Prevalence  Pos Pred Value  Neg Pred Value  Detection Rate
ProPublica_white_results  2454  0.39        0.59            0.71            0.21
Corrected_white_results   2132  0.30        0.49            0.78            0.15

As mentioned, this is the key analysis by ProPublica that garnered the most attention, especially the higher false positive rate (FPR) and lower false negative rate (FNR) for blacks than whites.272727Here again, note that accuracy (and the detection rate) is not reported by ProPublica. The lack of reporting for accuracy, especially in these by-race results, is one of Northpointe’s main critiques of ProPublica’s analysis, since the accuracy is similar for blacks and whites (Dieterich et al., 2016). As expected from the combined race sample results earlier, these rates are almost identical with the corrected data, so blacks have a substantially higher FPR and lower FNR than whites in the corrected data too. This key finding by ProPublica therefore does not change with the corrected data.282828Although the utility of focusing on the differences in the FPR and FNR across race groups has been called into question (see for example Corbett-Davies and Goel, 2018). Northpointe also argues that the PPV and NPV are more relevant, see Dieterich et al. (2016).

However, just like we saw with the combined race sample, we see substantial differences in other statistics, in particular regarding prevalence, PPV, NPV, and the detection rate.292929Although PPV and NPV are pretty similar across race groups within the corrected data (as well as within the ProPublica biased data), which Northpointe argues is more useful than comparing the FPR and FNR across race groups, see Dieterich et al. (2016).

Going back to the combined sample results, to test whether the differences are statistically significant, I use one-sample tests. For example, for prevalence, I compare the prevalence obtained with ProPublica’s two-year dataset (0.45) with the prevalence obtained with the corrected two-year dataset (0.36), and vice-versa. I use one-sample tests since these two datasets, and the statistics calculated from them, are not independent samples. I do two types of tests: a t-test, and a chi-squared test, which is more appropriate for comparing proportions or rates.303030I report results for two-sided tests, although one could in principle do one-sided tests here. Those would be even more statistically significant.
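A sketch of these tests with scipy, using the recidivist counts from the Tables in the previous sections; under these inputs the t and chi-squared statistics should come out close to the values reported in Table 26 below.

    import numpy as np
    from scipy import stats

    def one_sample_tests(successes, n, null_p):
        # One-sample t-test and chi-squared goodness-of-fit test of a proportion
        # against a fixed null value.
        outcomes = np.concatenate([np.ones(successes), np.zeros(n - successes)])
        t_stat, t_p = stats.ttest_1samp(outcomes, null_p)
        chi2_stat, chi2_p = stats.chisquare(
            f_obs=[successes, n - successes],
            f_exp=[null_p * n, (1 - null_p) * n],
        )
        return t_stat, t_p, chi2_stat, chi2_p

    # Corrected rate (2,253 recidivists of 6,216) tested against the ProPublica rate
    # of 0.45, and ProPublica's rate (3,251 of 7,214) tested against 0.36.
    print(one_sample_tests(2253, 6216, 0.45))
    print(one_sample_tests(3251, 7214, 0.36))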

Table 25: Recidivism Rate - ProPublica vs. Corrected Two-Year Data
N Mean SE
two_year_recid_ProPublica 7214 0.45 0.006
two_year_recid_Corrected 6216 0.36 0.006
Table 26: Statistical Significance Tests - Recidivism Rate (Prevalence) - ProPublica vs. Corrected Two-Year Data
N Mean Mean Null Low CI Hi CI t stat p-value chi-sq. p-value
Corrected_vs_ProPub 6215 0.362 0.45 0.350 0.374 -14.36 0 192.5 0
ProPub_vs_Corrected 7213 0.451 0.36 0.439 0.462 15.47 0 257.3 0

Given the small standard errors estimated in the first Table above, 0.006, it is not surprising to see in the second Table that the difference in the mean, or the prevalence of recidivism, which is 0.09, is highly statistically significant (the p-values are essentially zero, and hence shown as 0 in the Table, which displays up to three decimal places).

While ProPublica’s COMPAS score and recidivism data is used in an ever-increasing number of studies to test various definitions and methodologies of algorithmic fairness, researchers have taken the data as is to test their methodologies, but do not appear to have examined closely the data itself for data processing issues. This paper, instead of testing a novel fairness definition or procedure, takes a closer look at the actual datasets put together by ProPublica. Doing so, I find that ProPublica made an important data processing mistake creating some of the key datasets most often used by other researchers, in particular the datasets built to study the likelihood of recidivism within two years of the original offense and COMPAS screening date. To my knowledge, this is the first paper to highlight this key data processing mistake.

As I show in this paper, ProPublica made a mistake implementing the two-year sample cutoff rule for recidivists. As a result, the bias in the two-year dataset is clear: there is a disproportionate number of recidivists. This fundamental problem in the dataset construction affects some statistics more than others. It obviously has a substantial impact on the total number of recidivists, and hence on the relative share or rate of recidivism. In particular, it artificially inflates the prevalence of recidivism, raising the two-year recidivism rate from 0.36 to 0.45, or by 25 percent. ProPublica’s data processing mistake also affects the positive predictive value (PPV), or precision, and the negative predictive value (NPV). On the other hand, it has relatively little impact on several other key statistics, such as accuracy, the false positive rate (FPR), and the false negative rate (FNR). While the latter statistics, especially the differentials in the FPR and the FNR by race, have garnered the most attention in the academic research and public debate, the utility of focusing on those particular metrics has been called into question; see Corbett-Davies and Goel (2018).313131Also, Northpointe (Dieterich et al., 2016) has argued that PPV and NPV may be more relevant.

Ultimately, the practical importance of this data processing mistake may be somewhat limited. I am not suggesting that Northpointe itself made a mistake in actually developing the COMPAS score. (While the data used for that, and the actual model, are proprietary and not publicly available, it is unlikely that a similar mistake was made when developing such scores, or other recidivism risk scores by other companies.)323232Moreover, as mentioned previously, it is not clear to what extent a recidivism risk score is used by judges at the pretrial stage to set bail. Although Cowgill, using the ProPublica COMPAS data, finds a non-trivial effect at score class breakpoints (Cowgill, 2018). (I am not sure whether Cowgill corrected the data processing issue highlighted in this paper when doing his analysis, and whether doing so would have any impact on his results.) Still, domain expertise does not always translate into correctly processed data. For example, Northpointe’s critique of ProPublica’s analysis, using ProPublica’s datasets, fails to identify ProPublica’s data processing mistake, and thus produces some biased results (Dieterich et al., 2016). Similarly, the analysis of Flores et al. (2016), who are experienced criminal justice academics and judicial system administrative officers, also fails to identify the data mistake and also produces some biased Figures. In any event, it is clear that when possible, research and public debate should be based on correctly processed data. I am currently working on a GitHub repository to make public the corrected data, although the data correction is straightforward and can be implemented by others independently.333333Additionally, Rudin et al. (2018) have also reconstructed the ProPublica COMPAS datasets from the original ProPublica Python database and made them available on GitHub. In so doing they appear to have avoided making the same data processing mistake as ProPublica. (Although they do not generally highlight the differences between their datasets and ProPublica’s, and do not identify ProPublica’s data processing mistake. As mentioned earlier, their focus is altogether different.)

Angwin, J., Larson, J., Mattu, S., Kirchner, L., 2016. Machine Bias. There’s software used across the country to predict future criminals. And it’s biased against blacks. ProPublica.

Barocas, S., Selbst, A.D., 2016. Big data’s disparate impact. California Law Review.

Bilal Zafar, M., Valera, I., Gomez Rodriguez, M., Gummadi, K.P., 2016. Fairness Beyond Disparate Treatment & Disparate Impact: Learning Classification without Disparate Mistreatment. arXiv e-prints arXiv:1610.08452.

Bilal Zafar, M., Valera, I., Gomez Rodriguez, M., Gummadi, K.P., Weller, A., 2017. From Parity to Preference-based Notions of Fairness in Classification. arXiv e-prints arXiv:1707.00010.

Chouldechova, A., 2016. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. arXiv e-prints arXiv:1610.07524.

Corbett-Davies, S., Goel, S., 2018. The measure and mismeasure of fairness: A critical review of fair machine learning. CoRR abs/1808.00023.

Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., Huq, A., 2017. Algorithmic decision making and the cost of fairness. CoRR abs/1701.08230.

Cowgill, B., 2018. The Impact of Algorithms on Judicial Discretion: Evidence from Regression Discontinuities. Working Paper.

Cowgill, B., Tucker, C.E., 2019. Economics, fairness and algorithmic bias. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3361280

Dieterich, W., Mendoza, C., Brennan, T., 2016. COMPAS Risk Scales: Demonstrating Accuracy Equity and Predictive Parity. Northpointe Inc.

Flores, A.W., Bechtel, K., Lowenkamp, C.T., 2016. False Positives, False Negatives, and False Analyses: A Rejoinder to "Machine Bias: There’s Software Used Across the Country to Predict Future Criminals. And It’s Biased Against Blacks.". Federal Probation Journal 80 Number 2.

Kleinberg, J., Ludwig, J., Mullainathan, S., Rambachan, A., 2018. Algorithmic fairness. AEA Papers and Proceedings 108, 22–27. https://doi.org/10.1257/pandp.20181018

Larson, J., Mattu, S., Kirchner, L., Angwin, J., 2017. COMPAS Analysis.ipynb. ProPublica Jupyter Notebook on GitHub.

Larson, J., Mattu, S., Kirchner, L., Angwin, J., 2016. How We Analyzed the COMPAS Recidivism Algorithm. ProPublica.

Rudin, C., Wang, C., Coker, B., 2018. The age of secrecy and unfairness in recidivism prediction. arXiv e-prints arXiv:1811.00731.

When creating the two-year recidivism dataset, ProPublica reduced the full sample, as described above, based on the amount of time people are “observed” in the data. However, there are two other reasons some people get dropped by ProPublica. If we take the full dataset as a starting point, with 11,757 people, ProPublica for some reason dropped the last 756 person IDs when constructing the two-year datasets, starting from person ID 11002 through person ID 11757 in the full dataset. It is not clear why these were dropped. Many have COMPAS screen dates prior to 4/1/2014, since person IDs are not chronologically ordered, and thus many of these are observed for two years or recidivate within two years. In any case, I also dropped these 756 people in the construction of the corrected two-year recidivism dataset, to make it as comparable to ProPublica’s 7,214-person two-year recidivism dataset as possible (but for the explicit correction I implement).

Additionally, ProPublica also dropped 719 people who did not appear to have good data; ProPublica could not find case/arrest information on these people. ProPublica tagged these as is_recid = -1 in their full dataset.343434Interestingly, ProPublica dropped these people from the main two-year general recidivism dataset, but it generally did not drop them from the two-year violent recidivism dataset. While it did drop them from the more reduced 4,743-person two-year violent csv file, it did not drop them in the 6,454-person two-year violent data it used for the violent recidivism truth tables. There is some overlap between the 756 people mentioned previously and these 719 people, so the net additional drop in this step is actually 670 people.

Table 27: Any Recidivism - Full data
is_recid Freq
-1 719
0 7335
1 3703
Total 11757

I also drop these 670 people in the construction of the corrected two year recidivism dataset, so as to make it more comparable to ProPublica’s two year recidivism dataset (again, but for the explicit correction I implement).

Thus we end up with 10,331 people total in the ‘full’ dataset.353535This is also the same number of people as in ProPublica’s Cox general recidivism dataset.

Another way of seeing ProPublica’s data processing mistake when making the two-year recidivism datasets is by doing a survival analysis. In the Figures below I graph the Kaplan-Meier survival curves for the full data, the ProPublica two-year data, and the corrected two-year data. I use the overall recidivism variable “is_recid” here, not the “two_year_recid” variable, so we can see the full curve, even past two years.363636As we see in the Recidivism Rates Section above, in the two-year datasets there are 220 more people with is_recid=1 than two_year_recid=1. These are people who recidivated, but did so more than two years after the original COMPAS date (and before the end of ProPublica’s criminal history data window at the end of March 2016). These 220 people represent a 0.06 share of the 3,471 people who recidivate in total.

Figure 7: Non-Recidivism Survival Curves - Three Samples

As we see from these graphs, at the two-year mark a 0.34 share of the people have recidivated in the full data. However, in ProPublica’s two-year data, at the two-year mark a much higher fraction of people recidivate, 0.46. (This rate is almost identical to the rate estimated in the Recidivism Rates Section above, 0.45). In the corrected two-year data, at the two year mark, a 0.37 share of the people have recidivated, which is very close to the full data estimate of 0.34. (And is also almost identical to the rate estimated in the Recidivism Rates Section above, 0.36).373737The slight difference between the corrected two-year data and the full data is due to the sample difference; the corrected two-year data does not contain any people with COMPAS dates post-April 2014, so we shouldn’t expect the rates to be exactly the same.
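A sketch of the Kaplan-Meier comparison using the lifelines package. It assumes a pre-computed duration column, which I call duration_days here for illustration (days from the COMPAS screen date to re-arrest for recidivists, or to the end of the follow-up window otherwise), and it reuses the three DataFrames from the earlier sketches (with the full data trimmed as described in this Appendix).

    import matplotlib.pyplot as plt
    from lifelines import KaplanMeierFitter

    samples = [("Full data", full), ("ProPublica two-year", two_year), ("Corrected two-year", corrected)]

    ax = plt.subplot(111)
    kmf = KaplanMeierFitter()
    for label, df in samples:
        # 'duration_days' is a hypothetical pre-computed column: days from the COMPAS
        # screen date until re-arrest, or until the end of the follow-up window.
        kmf.fit(durations=df["duration_days"], event_observed=df["is_recid"], label=label)
        kmf.plot_survival_function(ax=ax)

    ax.axvline(730, linestyle="--")  # the two-year mark
    ax.set_xlabel("Days since COMPAS screen date")
    ax.set_ylabel("Share not yet recidivated")
    plt.show()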

Here I replicate some Figures in prior papers that have used ProPublica’s COMPAS two-year datasets, and which are therefore incorrect. While the relative patterns they show (e.g. across race or sex) remain qualitatively the same, the levels are off. I show the original Figures and then what the Figures look like with the corrected data that drops everyone with a COMPAS screen date post 4/1/2014. I try to replicate the Figures as closely as possible to what they look like in the original publications, using the same color schemes, for example. (Except for axis labels; I use a constant naming convention for axes here for clarity in my paper. I also add a dashed line for the mean recidivism rate.)

Table 28: African-American and Caucasians - ProPublica Two-Year Recidivism Data
race Freq
African-American 3696
Caucasian 2454
Figure 8: Two-year Recidivism by COMPAS score decile by Race (replicating Corbett-Davies et al., 2017)
Table 29: Sex - ProPublica Two-Year Recidivism Data
sex Freq
Female 1395
Male 5819
Figure 9: Two-year Recidivism by COMPAS score decile by Sex (replicating Corbett-Davies and Goel, 2018)
Figure 10: Two-year Recidivism by COMPAS score decile by Race (replicating Chouldechova, 2016)
Figure 11: Two-year Recidivism by COMPAS score decile by Recidivism Status (replicating DistrictDataLabs)

In the last Figure, I added a vertical dashed line in my paper. This line is where the two curves cross; that is the score at which there begin to be more recidivists than non-recidivists. This occurs at a (decile) average score slightly above 5 (around 5.34) in the ProPublica two-year dataset, but it occurs at a substantially higher (decile) average score of almost 7 (around 6.9) in the corrected two-year dataset. This is because there are fewer recidivists in the corrected data.383838I locate these crossings visually, since it is not clear how to obtain the exact crossing given the discrete nature of the decile score data. But since the difference is large, i.e. almost two score decile points, a visual approximation seems sufficient.

To avoid right-censoring due to prison time, one should really implement an earlier sample cutoff than April 1, 2014. Here we see that the most appropriate sample cutoff is around February 1, 2014. (For simplicity and ease of exposition I used the April 1, 2014 cutoff in most of this paper, which already pinpoints all the issues with ProPublica’s two-year dataset.)

Figure 12: Exploring Optimal COMPAS Screening Date Cutoff Rule

The red vertical line is as always at April 1, 2014. The green vertical line is at February 1, 2014. We see that prior to February 1, 2014 most non-recidivists are observed for at least two years outside jail or prison.
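A sketch of the exposure check that motivates this earlier cutoff, reusing the two_year DataFrame from the earlier sketches; days_in_custody is a hypothetical pre-computed column summing jail and prison days between the screen date and the end of the data window.

    import pandas as pd

    END_OF_WINDOW = pd.Timestamp("2016-04-01")  # end of ProPublica's criminal history data

    # Calendar time from the COMPAS screen date to the end of the window, minus time
    # spent in custody, gives exposure time outside jail/prison.
    calendar_days = (END_OF_WINDOW - two_year["compas_screening_date"]).dt.days
    exposure_days = calendar_days - two_year["days_in_custody"]

    # People screened close to April 1, 2014 can fail this check even though they pass
    # the calendar-date cutoff; screening before roughly February 1, 2014 leaves enough
    # slack for most non-recidivists to accumulate two full years of exposure.
    has_two_years_exposure = exposure_days >= 730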

My paper’s key objective has been to point out the fundamental data processing error made by ProPublica in the construction of the original two-year recidivism datasets. As such, I do not engage in a wholesale revision of the ProPublica data and analysis. I therefore mostly take as given many aspects of the data and analysis, and make many of the same assumptions made by ProPublica and other researchers. (While I may revisit some of these assumptions in future work, that is not the purpose of the current paper.) In other words, I am otherwise assuming the data is generally in good shape and that the analytic approach is valid. However, here I list some exceptions to this assumption regarding the quality of the data, as well as the key assumptions made in the analysis.

  • As with many data collection efforts that must obtain different features on a given sample from different data sources, and then match these, the matching is not perfect, and ProPublica acknowledges this:

    “We found that sometimes people’s names or dates of birth were incorrectly entered in some records – which led to incorrect matches between an individual’s COMPAS score and his or her criminal records. We attempted to determine how many records were affected. In a random sample of 400 cases, we found an error rate of 3.75 percent (CI: +/- 1.8 percent).” (Larson et al., 2016)

    I have not explored this issue in my analysis.

  • Related to this, there are some people in their data who have multiple COMPAS screen dates. In results not shown, I find there appear to be 688 people in the 11,757 dataset with multiple COMPAS screen dates. ProPublica appears to have selected a single COMPAS screen date for such people. I have not explored how they did this. But since it is a relatively small number of people, it should not affect the main findings in my paper.

  • There are also some people who ProPublica finds do not have good data. In particular, ProPublica says it could not find some key case and/or arrest information for these people. They total 719 out of the 11,757 in their full dataset, or 0.06. I also drop these people.393939As discussed in an earlier section in this Appendix, ProPublica tagged these as “is_recid = -1” in their full dataset. Interestingly, ProPublica dropped these people from the main two year general recidivism dataset. But it generally did not drop them from the two year violent recidivism dataset. While it did drop them from the more reduced 4743 two year violent csv file, it did not drop them in the 6454 two year violent data it used for the violent recidivism truth tables.

  • A very small number of people appear to have implausible negative time spells outside prison. In calculations not shown here, I find that in the 11,757 full dataset, only 63 people have such negative time spells. ProPublica adds these negative amounts when calculating the total time outside of prison for a given person. I do the same.

  • Some people have a “current” offense date that occurs a long time prior to the COMPAS screen date. However, the jail_in date for these people is close to the COMPAS screen date, so such people could plausibly have committed the offense a long time ago and only been caught/charged recently. So they do not necessarily represent a data problem.

  • As noted earlier, ProPublica obtained criminal history information from the Broward County Clerk’s Office website, and jail records from the Broward County Sheriff’s Office, as well as public incarceration records from the Florida Department of Corrections website. I am not sure what happens if someone from their sample moves away from Florida after the COMPAS screen date. In particular, it is not clear whether they would show up in their data again if they commit a crime in a different state. I also do not know what happens if any of the people in their sample become deceased. There could be some sample attrition.

  • As noted in the main text section with the COMPAS screen date Figures, there are two months with very few people with COMPAS screen dates (June and July 2013).404040Also noticeable is the higher number of COMPAS screen dates in the first half of 2013. It is not clear why there is such a drop in COMPAS cases during these two months. To the extent this is a problem, it appears to be a problem with the original dataset that ProPublica received from Broward County, since it is also evident in the “raw scores” dataset. So it does not appear to be a data processing issue by ProPublica, and it is not clear what can be done about it. I did check whether the relatively few people with COMPAS screen dates during those two months looked different in various dimensions, but they did not. (Except they did have a somewhat longer time in between arrest date and COMPAS screen date, with a mean of 5, compared to 1 for the rest of the data.)

  • As I discuss in an earlier section in this Appendix, for some reason ProPublica dropped the people with the last 756 person IDs in their pretrial defendants sample. It is not clear why it dropped these people. However, I also drop them for comparability to their analysis.

  • As other researchers note, some people in this dataset have low COMPAS scores and yet, surprisingly, have many prior offenses (Rudin et al., 2018). These researchers also note that on the flip-side, some people have high COMPAS scores, but no priors, and their current offense is not violent. (For this group, the researchers hypothesize that maybe ProPublica’s data is missing some criminal history information)

  • Finally, the age variable that ProPublica constructed is not quite accurate. ProPublica calculated age as the difference in years between the point in time when it collected the data, in early April 2016, and the person’s date of birth. However, when studying recidivism, one should really use the age of the person at the time of the COMPAS screen date that starts the two-year time window. So some people may be up to 25 months younger than the age variable that ProPublica created. Since I do not really use age in any of my analyses, I do not take the trouble of correcting this variable. (A small sketch of this correction appears after this list.)

  • Since this analysis is for people in Broward County and for a particular point in time, it may not generalize to other jurisdictions and time windows.

  • I am assuming that it is valid to study the use of the COMPAS recidivism score for pretrial defendants. As Flores et al. (2016) point out, the recidivism score may actually be intended to be applied more to current prison inmates for probation decisions. (Indeed, the ProPublica data has a third score, regarding the risk of failure to appear in court, which may be intended for pretrial decisions)

  • The observed recidivism rate is really a re-arrest rate. It may not reflect the true recidivism rate in the sense that some people may commit new offenses but not get caught. (Clearly, therefore, the amount and aggressiveness of policing may affect the observed recidivism rate)

  • I am not exploring any feedback loop effects. As Cowgill points out, judges sometimes use COMPAS scores in their bail decisions “and longer bailtime exerts a causal influence on defendants’ outcomes, including recidivism” (Cowgill, 2018).

  • I am assuming that netting out prison (and jail) time, as ProPublica does, to focus on people who have at least two years out of prison, is appropriate. Of course, one could potentially keep such people in the sample, since one can also recidivate while in prison.414141Indeed, for the “current” offense that triggers the COMPAS screening, it appears that some defendants committed this offense while in prison.

  • I focus on the study of the binary two-year recidivism outcome. With survival data, however, it is often preferable to apply survival models. Although, in the survival analysis appendix above, I show that at the two-year mark, the two approaches are almost identical (at least without controls). A survival analysis, nonetheless, gives a fuller picture of recidivism, since it is not constrained to a single point in time.

  • Foregoing the fuller picture provided by a survival analysis approach, and doing an analysis of recidivism at a particular point in time instead, I am assuming that the two-year recidivism metric is the appropriate recidivism metric for this approach. (As opposed to, say, one-year recidivism, or three-year recidivism, etc.) ProPublica explains why it chose this time-frame. For example, saying it:

    “based this decision on Northpointe’s practitioners guide, which says that its recidivism score is meant to predict ‘a new misdemeanor or felony offense within two years of the COMPAS administration date.’ ”424242ProPublica also points to “a recent study of 25,000 federal prisoners’ recidivism rates by the U.S. Sentencing Commission, which shows that most recidivists commit a new crime within the first two years after release (if they are going to commit a crime at all).”

  • I am assuming for the contingency table analyses that using a binary score category (Low vs. High) is adequate. As opposed to a more detailed breakdown, such as Low, Medium, High, or deciles, or the continuous raw score. And that the breakpoint used, which groups deciles 1-4 and 5-10 into the two categories is appropriate.434343Note that the binary score breakdown, for example, represents only one point (threshold) on a ROC curve.
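As a small illustration of the age correction mentioned in the list above (assuming the date-of-birth column is named dob, as I believe it is in ProPublica’s files, and reusing the two_year DataFrame from the earlier sketches):

    import pandas as pd

    # Age at the COMPAS screen date, rather than at ProPublica's data collection date.
    dob = pd.to_datetime(two_year["dob"])
    screen = pd.to_datetime(two_year["compas_screening_date"])
    two_year["age_at_screening"] = ((screen - dob).dt.days / 365.25).astype(int)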
