Counterfactual Risk Assessments, Evaluation, and Fairness
Abstract.
Algorithmic risk assessments are increasingly used to help humans make decisions in high-stakes settings, such as medicine, criminal justice and education. In each of these cases, the purpose of the risk assessment tool is to inform actions, such as medical treatments or release conditions, often with the aim of reducing the likelihood of an adverse event such as hospital readmission or recidivism. Problematically, most tools are trained and evaluated on historical data in which the outcomes observed depend on the historical decision-making policy. These tools thus reflect risk under the historical policy, rather than under the different decision options that the tool is intended to inform. Even when tools are constructed to predict risk under a specific decision, they are often improperly evaluated as predictors of the target outcome.
Focusing on the evaluation task, in this paper we define counterfactual analogues of common predictive performance and algorithmic fairness metrics that we argue are better suited for the decision-making context. We introduce a new method for estimating the proposed metrics using doubly robust estimation. We provide theoretical results showing that fairness according to the standard metric and the counterfactual metric can hold simultaneously only under strong conditions. Consequently, fairness-promoting methods that target parity in a standard fairness metric may (and, as we show empirically, do) induce greater imbalance in the counterfactual analogue. We provide empirical comparisons on both synthetic data and a real-world child welfare dataset to demonstrate how the proposed method improves upon standard practice.
1. Introduction
Much of the activity in using machine learning to help address societal problems focuses on algorithmic decision-making and algorithmic decision support systems. In settings such as health, education, child welfare and criminal justice, decision support systems commonly take the form of risk assessment instruments (RAIs), which distill rich case information into risk scores that reflect the likelihood of the case resulting in one or more adverse outcomes (Chouldechova et al., 2018; Kube et al., 2019; Ferguson, 2016; Kehl and Kessler, 2017; Stevenson, 2018; Caruana et al., 2015; Smith et al., 2012). Prior literature has raised significant concerns regarding the fairness, transparency, and effectiveness of existing RAIs (Barocas and Selbst, 2016; Barabas et al., 2017; Dressel and Farid, 2018; Corbett-Davies et al., 2017; Chouldechova and Roth, 2018). Yet RAIs remain very popular in practice, and there is a large body of research on fairness- and transparency-promoting methods that seek to address some of these concerns (e.g., Zafar et al., 2015; Hardt et al., 2016; Kamiran and Calders, 2012; Pleiss et al., 2017; Kamishima et al., 2011; Zemel et al., 2013).
This paper highlights a different issue, one that has not received sufficient attention in the discussion of RAIs but that nonetheless has significant implications for fairness: RAIs are typically trained and evaluated as though the task were prediction when in reality the associated decision-making tasks are often interventions. Models trained and evaluated in this way answer the question: What is the likelihood of an adverse outcome under the observed historical decisions? Yet the question relevant to the decision maker is: What is the likelihood of an adverse outcome under the proposed decision? When decisions do not impact outcomes, that is, when we are in what Kleinberg et al. (2015) call a “pure prediction” setting, these are one and the same. However, many decisions take the form of interventions specifically designed to mitigate risk. RAIs for these settings must be developed and evaluated taking into account the effect of historical decisions on the observed outcomes. Failure to do so will result in RAIs that, despite appearing to perform well according to standard evaluation practices, underperform on cases such as those that have been historically receptive to intervention.
In this paper we propose an approach to counterfactual risk modeling and evaluation to properly account for these intervention effects. Counterfactual modeling has been proposed for medical RAIs (Schulam and Saria, 2017; Shalit et al., 2017; Alaa and van der Schaar, 2017), and prior work has used counterfactual evaluation for off-policy learning in bandit settings (Dudík et al., 2011). However, the question of adapting counterfactual evaluation for risk assessments and in particular for predictive bias assessments remains open. In this paper, we propose a new evaluation method for RAIs that uses doubly-robust estimation techniques from causal inference (Van der Laan et al., 2003; Robins and Rotnitzky, 2001). We also argue that fairness metrics that are functions of the outcome should be defined counterfactually, and we use our evaluation method to estimate these metrics. We theoretically and empirically characterize the relationship between the standard fairness metrics and their counterfactual analogues. Our results suggest that in many cases, achieving parity in the standard metric will not achieve parity in the counterfactual metric.
Our main contributions are as follows: 1) We define counterfactual versions of standard predictive performance metrics and propose doubly-robust estimators of these metrics (§ 3); 2) We provide empirical support that this evaluation outperforms existing methods using a synthetic dataset and a real-world child welfare hotline screening dataset (§ 3); 3) We propose counterfactual formulations of three standard fairness metrics that are more appropriate for decision-making settings (§ 4); 4) We provide theoretical results showing that only under strong conditions, which are unlikely to hold in general, does fairness according to standard metrics imply fairness according to counterfactual metrics (§ 4); 5) We demonstrate empirically that applying existing fairness-corrective methods can increase disparity in the counterfactual redefinition of the metric they target (§ 4).
2. Background and Related Work
2.1. Counterfactual learning and evaluation
Literature on contextual bandits has considered counterfactual learning and evaluation of decision policies. While this literature is methodologically relevant, as we discuss below, it addresses a different problem. In the decision support setting we are considering, human users will ultimately decide what action to take. The goal of the learning and evaluation task is not to learn a decision policy, but rather to learn a risk model that will inform human decisions. That is, the risk assessment task is to accurately and fairly estimate the probability of an outcome under a given intervention.
While the underlying task is different, the statistical methods used in evaluation are related. Swaminathan and Joachims (2015) use propensity score weighting, a form of importance sampling, to correct for the effect of the historical treatment on the observed outcome, and they propose learning the optimal policy based on the minimization of the propensity-score weighted empirical risk. Propensity-score methods are a good candidate when one has a good model of the historical decision-making policy, but may otherwise be biased. Doubly robust (DR) methods, by contrast, are robust to parametric misspecification of the propensity score model if instead one has a correctly specified model of the outcome regression $\mathbb{E}[Y \mid X]$, where $Y$ is the outcome and $X$ are the features/covariates (Van der Laan et al., 2003; Robins et al., 1994; Robins and Rotnitzky, 1995). In a nonparametric setting, DR methods have faster rates of convergence than propensity-score methods (Kennedy, 2016). DR methods have been used for policy learning in the offline bandit setting (Dudík et al., 2011), where the learned policy minimizes a DR estimate of the loss. Their framework can also be used to evaluate a policy by computing the DR estimate of its expected reward.
Prior work has considered counterfactual RAIs in a temporal setting (Schulam and Saria, 2017). In this work, the trained model is evaluated on real data using the observed outcomes, and on simulated data. Evaluating against the observed outcomes can be misleading in settings in which treatment was not assigned randomly (see § 3.3.3). In our work we propose instead to adapt DR techniques, as have been used in the bandit literature for evaluating policies, to provide evaluations of counterfactual RAIs.
Counterfactual learning in the causal inference literature uses model selection based on DR estimation of counterfactual loss (Van der Laan et al., 2003). Whereas this approach evaluates counterfactual metrics implicitly, our approach does so explicitly, providing the estimators for standard classification metrics in § 3.3.3.
There is also a line of work focused on counterfactual learning in the presence of hidden confounders. Kallus and Zhou (2018a) propose policy learning via minimax regret learning over uncertainty sets. Their method is not immediately applicable to decision-support settings where RAIs are more informative to decision-makers than a policy recommendation. Madras et al. (2019) propose using deep latent variable models to model hidden confounders via proxies in the data and evaluate how well this model learns an optimal policy. While their model may be used for learning a risk assessment model, they do not address how to evaluate the model in such a setting, which is the focus of our work. § 3 of our paper assumes no hidden confounders, and future work could attempt to incorporate these techniques for handling hidden confounders. We note that our theoretical analysis in § 4 holds even in the presence of hidden confounding.
2.2. Fairness and causality
A growing literature on counterfactual fairness has offered notions of fairness based on the counterfactual of the protected attribute (or its proxy) (Kusner et al., 2017; Wang et al., 2019; Kilbertus et al., 2017). In this work, a policy is considered fair if it would have made the same decision had the individual had a different value of the protected attribute (and hence, potentially different values of features affected by the attribute). In this setting, the treatment decision is the outcome, and the protected attribute is the ‘treatment’. By contrast, we consider counterfactual treatment decisions and consider a future observation to be the outcome. (This distinction is also made in the survey of fairness literature by Mitchell et al., 2018.)
Another line of work considers unfair causal pathways between the protected attribute (or its proxy) and the outcome variable or target of prediction (Nabi and Shpitser, 2018; Zhang and Bareinboim, 2018b). These papers characterize or explain discrimination via path-specific effects, which are defined by interventions on the protected attribute. We do not consider interventions on (i.e., counterfactuals of) the protected attribute; rather, we propose methods that account for interventions on treatment decisions in training and evaluation.
Fairness definitions based on the counterfactual of the protected attribute are not widely used in RAI settings for two reasons: one technical and one practical. The technical challenge is that the assumptions required to estimate these counterfactual metrics prohibit the use of important features, such as prior history, or require full specification of the structural causal model (SCM) (Zhang and Bareinboim, 2018a; Kusner et al., 2017, 2019). These requirements are too restrictive for our settings of interest, where we have insufficient domain knowledge to construct the SCM and where we are unable to disregard important predictors like prior history. More significantly, the practical concern is that these definitions are ill-suited for risk assessment settings like child welfare screening. As we discuss in § 4, decisions made based on the counterfactual protected attribute may cause further harm to the protected groups.
Our work bears conceptual similarity to the analysis by Kallus and Zhou (2018b) of residual unfairness when selection bias in the training data induces covariate shift at test time. In settings where cases are systematically screened out from the training set, such as loan approvals in which we do not get to see whether someone who was denied a loan would have repaid, they find that applying fairness-corrective methods is insufficient to achieve parity. We consider a different but related setting in which we observe outcomes for all cases, but these outcomes are under different treatments. We propose fairness definitions that account for the effect of these treatments on the observed outcomes, and analyze the conditions under which existing methods can achieve this notion of counterfactual fairness.
3. Counterfactual Modeling and Evaluation
Before introducing the learning approaches and evaluation methods considered in this work, we pause to clarify the types of risk-based decision policies to which our evaluation strategy as presented is tailored, and provide some background on algorithm-assisted decision making in child welfare hotline screening.
RAIs typically inform human decisions either by identifying cases that are the most (or least) risky, or by identifying cases that are the most (or least) responsive. The evaluation metrics we consider are most directly relevant in the paradigm where human decision-makers wish to intervene on the riskiest cases. However, our method can readily be adapted (as discussed in § 3.3) for paradigms in which interventions are being targeted based on responsiveness.
The motivating application for our work is child welfare screening. Child welfare service agencies across the nation field over 4.1 million child abuse and neglect calls each year (U.S. Department of Health & Human Services, 2019). Call workers must decide whether to “screen in” a call, which refers to opening an investigation into the family. The child welfare system is responsible for responding to all cases where there is significant suspicion that the child is in present or impending danger. The standard of practice is therefore to identify the riskiest cases. Jurisdictions in California, Colorado, Oregon and Pennsylvania are all in various stages of developing and integrating RAIs into their call screening processes. The RAIs are trained on historical data to predict adverse child welfare outcomes, such as re-referral to the hotline or out-of-home foster care placement (Chouldechova et al., 2018). The decision to investigate a call can affect the likelihood of the target outcomes.
3.1. Notation
We use $Y \in \{0, 1\}$ to denote the observed binary outcome, and for exposition we assume $Y = 1$ is the unfavorable outcome. $T \in \{0, 1\}$ denotes the decision, which for simplicity we take to be binary. We note, however, that DR estimation methods can be used in any treatment setting, including for continuous treatments such as dosing (VanderWeele and Hernan, 2013; Kennedy et al., 2017). Throughout the remainder of the paper we use the terms ‘decision’ and ‘treatment’ interchangeably to aid the exposition. In describing counterfactual learning and evaluation, we rely on the potential outcomes framework common in causal inference (Rubin, 2005; Neyman, 1923; Kennedy et al., 2013). In this framework, $Y^t$ denotes the outcome under treatment $t$. For any given case we only get to observe $Y^0$ or $Y^1$, depending on whether the case was treated. We take $t = 0$ to be the baseline treatment, the decision under which it is relevant to assess risk. Most risk assessment settings have a natural baseline, which is often the decision to not intervene. For instance, in education one might wish to assess the likelihood of poor outcomes if a student is not offered support; in child welfare it is natural to assess the risk of re-referral if the call is not investigated. We refer to the baseline treatment as control and the non-baseline treatment as treatment. $X$ denotes the covariates (or features), which may include a protected or sensitive attribute $A$. $\pi(x) = P(T = 0 \mid X = x)$ denotes the propensity score, whose estimate we denote by $\hat{\pi}$.
In the child welfare setting, $X$ contains call details and historical information on all associated parties, $T$ is whether the case is screened in for investigation, and $Y$ is whether the case is re-referred to the hotline in a six-month period. We use subscripts to index our data; e.g., $X_i$ are the features for case $i$. We use $\hat{Y}$ to denote our predicted label and $\hat{s}(X)$ to denote the predicted score, which is the model’s estimate of the target outcome (our RAI). ($\hat{Y}$ is typically obtained by thresholding $\hat{s}(X)$.)
3.2. Learning models of risk
In this section we introduce “observational” (standard practice) and “counterfactual” forms of model training.
3.2.1. Observational
The observational RAI produces risk estimates by regressing $Y$ on $X$ for the entire observed dataset; i.e., this RAI estimates $\mathbb{E}[Y \mid X]$. This model answers the question: What is the likelihood of an adverse outcome under the observed historical decisions? The observational RAI is ill-suited for guiding future decisions; it will, for instance, underestimate (baseline) risk for cases that were historically responsive to treatment.
3.2.2. Counterfactual
The counterfactual model of risk estimates the outcome under the baseline treatment. Our counterfactual model of risk targets $\mathbb{E}[Y^0 \mid X]$. Even though we only observe $Y^0$ or $Y^1$ for any given observation, we may nevertheless draw valid inference about both potential outcomes under a set of standard identifying assumptions. (Identification is the process of using a set of assumptions to write a counterfactual quantity in terms of observable quantities.) These assumptions hold by design in our synthetic dataset, and we discuss why they may be reasonable in the child welfare setting under each point.

Consistency: $Y = T\,Y^1 + (1 - T)\,Y^0$. This assumes there is no interference between treated and control units. This is a reasonable assumption in the child welfare setting since opening an investigation into one case will not likely affect another case’s observed outcome. (We set the treatment to be the same value for all children in a family.)

Exchangeability: $Y^0 \perp T \mid X$. This assumes that we measured all variables that jointly influence the intervention decision $T$ and the potential outcome $Y^0$. This is an untestable assumption but it may be reasonable in the child welfare setting where the measured variables capture most of the information the call screeners use to make their decision (see § 3.4.2 for more details).

Weak positivity requirement: $P(\pi(X) > 0) = 1$ requires that each example have some nonzero chance of the baseline treatment. This can hold by construction in decision support settings. We can filter out cases that violate this assumption since the decision for these cases is nearly certain. (Risk assessments are unnecessary for these cases since the decision-maker already knows what to do.)
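Our assumptions identify the target:
\[ \mathbb{E}[Y^0 \mid X] = \mathbb{E}[Y^0 \mid X, T = 0] = \mathbb{E}[Y \mid X, T = 0], \tag{1} \]
where the first equality uses exchangeability and the second uses consistency; positivity ensures that the conditioning event $T = 0$ has nonzero probability.
The counterfactual model estimates $\mathbb{E}[Y^0 \mid X]$ by computing an estimate of $\mathbb{E}[Y \mid X, T = 0]$. We can train such a model by applying any probabilistic classifier to the control population. Since the control population may have a different covariate distribution than the full population, reweighing can be used to correct this covariate shift (Quiñonero-Candela et al., 2009). This may be useful in a setting with limited data or where model misspecification is a concern (Sugiyama et al., 2007).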
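To make the contrast between the two training regimes concrete, the following is a minimal sketch rather than the implementation used in our experiments. It assumes discrete features, so that a simple frequency table can stand in for "any probabilistic classifier"; the function names (`fit_observational_model`, `fit_counterfactual_model`) and the toy data are hypothetical.

```python
from collections import defaultdict


def _fit_rate_table(pairs):
    """Return a function x -> empirical mean of y among the given (x, y) pairs."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in pairs:
        sums[x] += y
        counts[x] += 1
    total = sum(counts.values())
    overall = sum(sums.values()) / total if total else float("nan")

    def predict(x):
        # Fall back to the overall rate for feature values never seen in training.
        return sums[x] / counts[x] if counts.get(x, 0) else overall

    return predict


def fit_observational_model(X, Y):
    """Estimate E[Y | X] using every case, whatever decision it received."""
    return _fit_rate_table(zip(X, Y))


def fit_counterfactual_model(X, T, Y):
    """Estimate E[Y | X, T = 0] using control (baseline) cases only."""
    return _fit_rate_table((x, y) for x, t, y in zip(X, T, Y) if t == 0)


# Hypothetical toy data: feature value, decision (1 = treated), observed outcome.
X = ["a", "a", "a", "a", "b", "b", "b", "b"]
T = [0, 0, 1, 1, 0, 1, 1, 1]
Y = [1, 0, 0, 0, 1, 0, 1, 0]

observational = fit_observational_model(X, Y)
counterfactual = fit_counterfactual_model(X, T, Y)
print(observational("a"), counterfactual("a"))
```

In this toy data the treated cases rarely experience the adverse outcome, so the observational table reports a lower risk for each feature value than the control-only table does, which is the underestimation of baseline risk described above.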
3.3. Evaluation
To evaluate how well our models of risk might inform decision-making in the paradigm where interventions should be targeted at the riskiest cases, we assess performance metrics such as precision, true positive rate (TPR), false positive rate (FPR), and calibration. (In the paradigm where interventions are to be targeted at the most responsive cases, performance metrics such as discounted cumulative gain (DCG) or Spearman’s rank correlation coefficients are more natural choices for evaluation. DR estimates can be constructed for these metrics as well.) Since the task is to evaluate how well the model predicts risk under a baseline intervention, we specify the performance metrics in terms of $Y^0$. The target counterfactual TPR is
\[ \mathbb{E}[\hat{Y} \mid Y^0 = 1]. \tag{2} \]
The target counterfactual precision is
\[ \mathbb{E}[Y^0 \mid \hat{Y} = 1]. \tag{3} \]
The target counterfactual FPR is
\[ \mathbb{E}[\hat{Y} \mid Y^0 = 0]. \tag{4} \]
A model is well-calibrated in the counterfactual sense when
\[ \mathbb{E}[Y^0 \mid r_1 \le \hat{s}(X) \le r_2] \approx \mathbb{E}[\hat{s}(X) \mid r_1 \le \hat{s}(X) \le r_2], \tag{5} \]
where $r_1, r_2$ define a bin of predictions. We describe two standard practice approaches for evaluation, noting why these approaches do not adequately estimate the counterfactual targets. We then introduce our proposed approach, which uses doubly robust (DR) estimation. (All evaluations are computed on a test partition that is separate from the train partition.)
3.3.1. Observational Evaluation
A standard practice approach evaluates the model against the observed outcomes. An observational Precision-Recall (PR) curve plots observational precision, $\mathbb{E}[Y \mid \hat{Y} = 1]$, against observational TPR (equivalently, recall), $\mathbb{E}[\hat{Y} \mid Y = 1]$. An observational ROC curve plots observational TPR against observational FPR, $\mathbb{E}[\hat{Y} \mid Y = 0]$. An observational calibration curve plots $\mathbb{E}[Y \mid r_1 \le \hat{s}(X) \le r_2]$, the observational outcome rate for scores in the interval $[r_1, r_2]$. The observational evaluation answers the question: Does the RAI accurately predict the likelihood of an adverse outcome under the observed historical decisions? This evaluation approach can be misleading since in general $\mathbb{E}[Y \mid X] \neq \mathbb{E}[Y^0 \mid X]$. For instance, it will conclude that a valid counterfactual model of risk under baseline performs poorly, because that model's predictions will systematically differ from the observed outcomes for cases that are responsive to treatment.
3.3.2. Evaluation on the Control Population
The standard practice counterfactual approach to evaluation computes error metrics on the control population (Schulam and Saria, 2017). The PR curve evaluated on the control population plots $\mathbb{E}[Y \mid \hat{Y} = 1, T = 0]$ against $\mathbb{E}[\hat{Y} \mid Y = 1, T = 0]$, and the ROC and calibration curves are similarly defined by conditioning on $T = 0$. When the control population is not representative of the full population (i.e., $P(X \mid T = 0) \neq P(X)$), as is the case in nonexperimental settings, this evaluation may be misleading since, for example, $\mathbb{E}[Y \mid \hat{Y} = 1, T = 0] \neq \mathbb{E}[Y^0 \mid \hat{Y} = 1]$ in general. A method that performs well on the control population may perform poorly on the treated population (or vice versa). In child welfare, cases where the perpetrator has a history of abuse are more likely to be screened in. Since there is more information associated with these cases, a model may be able to discriminate risk better for these cases than for cases in the control population with little history.
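The two standard-practice evaluations differ only in which cases enter the confusion matrix. A minimal sketch, assuming binary predictions and outcomes held in plain lists; the function names and the toy data are hypothetical:

```python
def _ratio(num, den):
    """Safe division: undefined rates are reported as NaN."""
    return num / den if den else float("nan")


def classification_rates(y_pred, y_true):
    """Precision, TPR and FPR of binary predictions against binary labels."""
    pairs = list(zip(y_pred, y_true))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    return {
        "precision": _ratio(tp, tp + fp),
        "tpr": _ratio(tp, tp + fn),
        "fpr": _ratio(fp, fp + tn),
    }


def observational_evaluation(y_pred, Y):
    """Evaluate against observed outcomes for all cases, treated or not."""
    return classification_rates(y_pred, Y)


def control_evaluation(y_pred, T, Y):
    """Evaluate against observed outcomes on the control population (T = 0) only."""
    kept = [(p, y) for p, t, y in zip(y_pred, T, Y) if t == 0]
    return classification_rates([p for p, _ in kept], [y for _, y in kept])


# Hypothetical toy data.
y_pred = [1, 1, 0, 0, 1, 0]
T = [0, 1, 0, 1, 1, 0]
Y = [1, 0, 0, 1, 0, 1]

obs = observational_evaluation(y_pred, Y)
ctl = control_evaluation(y_pred, T, Y)
print(obs)
print(ctl)
```

Neither function uses the treated cases in a way that reflects their outcome under baseline: the first scores them against outcomes that may have been altered by treatment, and the second discards them.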
3.3.3. Doubly-robust (DR) Counterfactual Evaluation
We propose to improve upon the control population evaluation procedure by using DR estimation to perform counterfactual evaluation using both treated and control cases. This ensures that performance is assessed on a representative sample of the population. Our method estimates the counterfactual outcome for all cases and evaluates metrics on this estimate. Other approaches such as inverse-probability weighting (IPW) or plug-in estimates could be used for a counterfactual evaluation, but DR techniques are preferable because they have faster rates of convergence for nonparametric methods, and for parametric methods they are robust to misspecification in one of the nuisance functions, which estimate treatment propensity and the outcome regression (Robins et al., 1994; Robins and Rotnitzky, 1995; Kennedy, 2016). Under sample splitting and convergence in the nuisance function error terms, these estimates are consistent and asymptotically normal. This enables us to compute confidence intervals (see Calibration below for an example).
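We first consider estimates of the average outcome under control, $\mathbb{E}[Y^0]$. Under our causal assumptions in § 3.2.2, $\mathbb{E}[Y^0] = \mathbb{E}[\mathbb{E}[Y \mid X, T = 0]]$. The plug-in estimate is:
\[ \frac{1}{n} \sum_{i=1}^{n} \hat{\mu}_0(X_i), \]
where $\hat{\mu}_0(X)$ denotes the score of our counterfactual model. The IPW estimate uses the observed outcome on the control population and reweighs the control population to resemble the full population:
\[ \frac{1}{n} \sum_{i=1}^{n} \frac{1 - T_i}{\hat{\pi}(X_i)}\, Y_i. \]
DR estimators (known in survey inference as generalized regression estimators; Särndal et al., 1989) combine the plug-in estimate with an IPW-residual bias-correction term for the control cases:
\[ \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{1 - T_i}{\hat{\pi}(X_i)} \bigl( Y_i - \hat{\mu}_0(X_i) \bigr) + \hat{\mu}_0(X_i) \right]. \tag{6} \]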
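The three estimators of the average outcome under control can be sketched as follows. This is an illustration under stated assumptions, not our released code: the argument names `pi0` (estimated probability of the baseline treatment for each case) and `mu0` (counterfactual model score for each case) are hypothetical, and both are assumed to have been computed beforehand on held-out data.

```python
def plugin_estimate(mu0):
    """Average of the counterfactual model scores."""
    return sum(mu0) / len(mu0)


def ipw_estimate(T, Y, pi0):
    """Observed outcomes of control cases, reweighted by 1 / P(T = 0 | X)."""
    return sum((1 - t) * y / p for t, y, p in zip(T, Y, pi0)) / len(Y)


def dr_terms(T, Y, pi0, mu0):
    """Per-case DR terms: plug-in score plus an IPW-weighted residual for controls."""
    return [(1 - t) / p * (y - m) + m for t, y, p, m in zip(T, Y, pi0, mu0)]


def dr_estimate(T, Y, pi0, mu0):
    """Doubly robust estimate of the average outcome under control."""
    terms = dr_terms(T, Y, pi0, mu0)
    return sum(terms) / len(terms)


# Hypothetical toy inputs.
T = [0, 1, 0, 1]
Y = [1, 0, 0, 1]
pi0 = [0.5, 0.5, 0.5, 0.5]
mu0 = [0.5, 0.5, 0.5, 0.5]

plugin = plugin_estimate(mu0)
ipw = ipw_estimate(T, Y, pi0)
dr = dr_estimate(T, Y, pi0, mu0)
print(plugin, ipw, dr)
```

For treated cases the residual term vanishes, so the DR term reduces to the model score; for control cases the observed outcome corrects the model score in proportion to the inverse propensity weight.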
Next we consider the counterfactual targets in Equations 2–5. We identify the target under our causal assumptions and then state the DR estimator. We emphasize the distinction that $\hat{s}(X)$ is the score of any model we wish to evaluate whereas $\hat{\mu}_0(X)$ is the score of our counterfactual model in § 3.2.2.
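TPR (Recall):
Counterfactual TPR is identified as
\[ \mathbb{E}[\hat{Y} \mid Y^0 = 1] = \frac{\mathbb{E}[\hat{Y}\, Y^0]}{\mathbb{E}[Y^0]} = \frac{\mathbb{E}\bigl[\hat{Y}\, \mathbb{E}[Y \mid X, T = 0]\bigr]}{\mathbb{E}\bigl[\mathbb{E}[Y \mid X, T = 0]\bigr]}, \tag{7} \]
where the second equality uses our causal assumptions and the fact that $\hat{Y}$ is a function of $X$.
The DR estimate for the numerator is
\[ \frac{1}{n} \sum_{i=1}^{n} \hat{Y}_i \left[ \frac{1 - T_i}{\hat{\pi}(X_i)} \bigl( Y_i - \hat{\mu}_0(X_i) \bigr) + \hat{\mu}_0(X_i) \right], \tag{9} \]
where $\hat{\mu}_0(X)$ is the score of our counterfactual model and $\hat{\pi}(X)$ is the estimated propensity score.
The DR estimate for the denominator is in Equation 6.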
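A minimal sketch of the resulting TPR estimate, as the ratio of the DR numerator to the DR estimate of the average outcome under control. The argument names (`y_pred`, `pi0`, `mu0`) are hypothetical, and the nuisance estimates are assumed to be given.

```python
def dr_terms(T, Y, pi0, mu0):
    """Per-case DR terms: model score plus an IPW-weighted residual for controls."""
    return [(1 - t) / p * (y - m) + m for t, y, p, m in zip(T, Y, pi0, mu0)]


def dr_counterfactual_tpr(y_pred, T, Y, pi0, mu0):
    """Ratio of DR estimates of E[Yhat * Y^0] and E[Y^0]."""
    terms = dr_terms(T, Y, pi0, mu0)
    n = len(terms)
    numerator = sum(p * v for p, v in zip(y_pred, terms)) / n
    denominator = sum(terms) / n
    return numerator / denominator


# Hypothetical toy inputs.
y_pred = [1, 0, 1, 0]
T = [0, 1, 0, 1]
Y = [1, 0, 0, 1]
pi0 = [0.5, 0.5, 0.5, 0.5]
mu0 = [0.5, 0.5, 0.5, 0.5]

tpr = dr_counterfactual_tpr(y_pred, T, Y, pi0, mu0)
print(tpr)
```

Because individual DR terms can fall outside $[0, 1]$, the ratio is not guaranteed to lie in $[0, 1]$ in small samples; clipping is a design choice that this sketch leaves out.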
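Precision:
The target counterfactual precision is identified as
\[ \mathbb{E}[Y^0 \mid \hat{Y} = 1] = \mathbb{E}\bigl[\mathbb{E}[Y \mid X, T = 0] \mid \hat{Y} = 1\bigr]. \tag{10} \]
The DR estimator for precision is
\[ \frac{\sum_{i=1}^{n} \mathbb{1}\{\hat{Y}_i = 1\} \left[ \frac{1 - T_i}{\hat{\pi}(X_i)} \bigl( Y_i - \hat{\mu}_0(X_i) \bigr) + \hat{\mu}_0(X_i) \right]}{\sum_{i=1}^{n} \mathbb{1}\{\hat{Y}_i = 1\}}, \tag{11} \]
where $\mathbb{1}$ denotes the indicator function, $\hat{\mu}_0(X)$ is the score of our counterfactual model, and $\hat{\pi}(X)$ is the estimated propensity score.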
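The precision estimator averages the DR terms over the cases predicted positive. A minimal sketch with hypothetical argument names (`y_pred`, `pi0`, `mu0`) and precomputed nuisance estimates:

```python
def dr_counterfactual_precision(y_pred, T, Y, pi0, mu0):
    """Mean DR term among cases with a positive prediction."""
    selected = [
        (1 - t) / p * (y - m) + m
        for yh, t, y, p, m in zip(y_pred, T, Y, pi0, mu0)
        if yh == 1
    ]
    if not selected:
        raise ValueError("no positive predictions: precision is undefined")
    return sum(selected) / len(selected)


# Hypothetical toy inputs.
y_pred = [1, 1, 0, 0]
T = [0, 1, 0, 1]
Y = [1, 0, 0, 1]
pi0 = [0.5, 0.5, 0.5, 0.5]
mu0 = [0.5, 0.5, 0.5, 0.5]

precision = dr_counterfactual_precision(y_pred, T, Y, pi0, mu0)
print(precision)
```

Unlike the control-population precision, the treated case with a positive prediction contributes here through its model score.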
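Calibration:
The target in Equation 5 is identified as
\[ \mathbb{E}[Y^0 \mid r_1 \le \hat{s}(X) \le r_2] = \mathbb{E}\bigl[\mathbb{E}[Y \mid X, T = 0] \mid r_1 \le \hat{s}(X) \le r_2\bigr]. \]
The DR estimate for calibration is
\[ \frac{\sum_{i=1}^{n} \mathbb{1}\{r_1 \le \hat{s}(X_i) \le r_2\} \left[ \frac{1 - T_i}{\hat{\pi}(X_i)} \bigl( Y_i - \hat{\mu}_0(X_i) \bigr) + \hat{\mu}_0(X_i) \right]}{\sum_{i=1}^{n} \mathbb{1}\{r_1 \le \hat{s}(X_i) \le r_2\}}. \tag{12} \]
To compute the confidence interval for this estimate, we compute the number of data points in the bin, $n_{\text{bin}} = \sum_{i=1}^{n} \mathbb{1}\{r_1 \le \hat{s}(X_i) \le r_2\}$, and the variance in the bin, $\hat{\sigma}^2_{\text{bin}}$, taken as the empirical variance of the bracketed DR terms over the cases in the bin.
Then we use the normal approximation to compute the interval: the estimate in Equation 12 $\pm\; z \sqrt{\hat{\sigma}^2_{\text{bin}} / n_{\text{bin}}}$, where $z = 1.96$ for a 95% confidence interval.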
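The calibration estimate and its normal-approximation interval for a single bin can be sketched as below. The function name, the argument names, and the use of the unbiased (n - 1) sample variance are assumptions made for illustration.

```python
import math


def dr_calibration_bin(scores, T, Y, pi0, mu0, r1, r2, z=1.96):
    """DR outcome rate under control for scores in [r1, r2], with a normal-approximation CI."""
    in_bin = [
        (1 - t) / p * (y - m) + m
        for s, t, y, p, m in zip(scores, T, Y, pi0, mu0)
        if r1 <= s <= r2
    ]
    n_bin = len(in_bin)
    if n_bin < 2:
        raise ValueError("need at least two cases in the bin to estimate a variance")
    estimate = sum(in_bin) / n_bin
    # Sample variance of the DR terms in the bin (n - 1 denominator, an assumption).
    variance = sum((v - estimate) ** 2 for v in in_bin) / (n_bin - 1)
    half_width = z * math.sqrt(variance / n_bin)
    return estimate, estimate - half_width, estimate + half_width


# Hypothetical toy inputs: three of the four scores fall in the bin [0.1, 0.4].
scores = [0.2, 0.25, 0.3, 0.8]
T = [0, 1, 0, 1]
Y = [1, 0, 0, 1]
pi0 = [0.5, 0.5, 0.5, 0.5]
mu0 = [0.5, 0.5, 0.5, 0.5]

est, lower, upper = dr_calibration_bin(scores, T, Y, pi0, mu0, 0.1, 0.4)
print(est, lower, upper)
```

Repeating the call over a grid of bins and plotting the estimates against the bin midpoints gives a calibration curve with pointwise intervals.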
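FPR:
The target counterfactual FPR is identified as
\[ \mathbb{E}[\hat{Y} \mid Y^0 = 0] = \frac{\mathbb{E}[\hat{Y}\,(1 - Y^0)]}{\mathbb{E}[1 - Y^0]} = \frac{\mathbb{E}\bigl[\hat{Y}\,\bigl(1 - \mathbb{E}[Y \mid X, T = 0]\bigr)\bigr]}{1 - \mathbb{E}\bigl[\mathbb{E}[Y \mid X, T = 0]\bigr]}. \tag{14} \]
The DR estimator for the numerator is
\[ \frac{1}{n} \sum_{i=1}^{n} \hat{Y}_i \left[ 1 - \frac{1 - T_i}{\hat{\pi}(X_i)} \bigl( Y_i - \hat{\mu}_0(X_i) \bigr) - \hat{\mu}_0(X_i) \right], \tag{15} \]
where $\hat{\mu}_0(X)$ is the score of our counterfactual model and $\hat{\pi}(X)$ is the estimated propensity score.
For the denominator we use one minus the DR estimate of $\mathbb{E}[Y^0]$ given in Eq 6.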
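The FPR estimate mirrors the TPR estimate with each DR term replaced by one minus that term. A minimal sketch with hypothetical argument names and precomputed nuisance estimates:

```python
def dr_counterfactual_fpr(y_pred, T, Y, pi0, mu0):
    """Ratio of DR estimates of E[Yhat * (1 - Y^0)] and E[1 - Y^0]."""
    terms = [(1 - t) / p * (y - m) + m for t, y, p, m in zip(T, Y, pi0, mu0)]
    n = len(terms)
    numerator = sum(yh * (1 - v) for yh, v in zip(y_pred, terms)) / n
    denominator = 1 - sum(terms) / n
    return numerator / denominator


# Hypothetical toy inputs.
y_pred = [1, 1, 1, 0]
T = [0, 1, 0, 1]
Y = [1, 0, 0, 1]
pi0 = [0.5, 0.5, 0.5, 0.5]
mu0 = [0.5, 0.5, 0.5, 0.5]

fpr = dr_counterfactual_fpr(y_pred, T, Y, pi0, mu0)
print(fpr)
```

Sweeping the threshold that produces `y_pred` from a score and pairing each FPR with the corresponding DR estimate of TPR traces out a counterfactual ROC curve.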
3.4. Results
We present the results of these three evaluations on a synthetic dataset and our child welfare dataset. Comparing to the true counterfactual for the synthetic data, we find that our DR evaluation is more accurate than either the observational or control evaluations. For the experiments on the real-world child welfare data, where we do not have access to all counterfactuals, we perform a comparison to expert assessment of risk to give further credence to the conclusions from our DR evaluation.
3.4.1. Synthetic example
We begin with a synthetic dataset so that we can compare methods in a setting where we observe both potential outcomes. We specify two groups with different treatment propensities, but the treatment is constructed to be equally effective at reducing the likelihood of adverse outcome ($Y = 1$) for both groups. We generate 100,000 data points where and , a normal distribution with mean 0 and variance 1. , a Bernoulli with mean 0.5. where . where controls the treatment effect. where describes the bias in treatment assignment toward group . (We present results for alternative values of and in Appendix D. The offset is to roughly balance the number of treated/control units.) We set . The base rates are ; ; and . The treatment rates are ; ; and .
We use logistic regression to train both the observational and counterfactual models as well as the propensity model $\hat{\pi}$. Under this choice of model, the propensity model and counterfactual model are both correctly specified, and accordingly, the plug-in and IPW estimates are both consistent in this setting. However, in practice, there is no way to know whether the models are correctly specified, so DR estimates are preferable for real-world settings. We use as the features. (In Appendix D.1 we include $T$ as a feature in the observational model to see if this can appropriately control for treatment effects, but we find that it does not.)
Figure 1 displays PR, ROC, and calibration curves. (Code for this experiment: https://github.com/mandycoston/counterfactual) DR evaluation most closely aligns with the true counterfactual evaluation. Notably, the observational evaluation suggests that the observational model outperforms the counterfactual model when the true counterfactual evaluation shows the counterfactual model performs better.
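The qualitative behavior can be reproduced in a few lines of simulation. The sketch below is a toy illustration and not the data-generating process of our experiment: every coefficient (the 0.5 offsets, the 0.4 risk multiplier, the unit shift for group membership) is a hypothetical choice, and the true nuisance functions are used in place of fitted models.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n=20000, seed=0):
    """Draw (t, y, y0, pi0, mu0) tuples from a hypothetical confounded process."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        a = 1 if rng.random() < 0.5 else 0
        mu0 = sigmoid(x - 0.5)            # P(Y^0 = 1 | X): risk under control
        mu1 = 0.4 * mu0                   # treatment lowers risk equally in both groups
        pi0 = 1.0 - sigmoid(x + a - 0.5)  # P(T = 0 | X, A): group a = 1 is treated more often
        t = 0 if rng.random() < pi0 else 1
        y0 = 1 if rng.random() < mu0 else 0
        y1 = 1 if rng.random() < mu1 else 0
        y = y1 if t == 1 else y0          # consistency: we observe the outcome of the decision taken
        rows.append((t, y, y0, pi0, mu0))
    return rows


rows = simulate()
n = len(rows)
true_mean_y0 = sum(r[2] for r in rows) / n        # known only because the data are simulated
observed_mean = sum(r[1] for r in rows) / n       # observational target
control_outcomes = [r[1] for r in rows if r[0] == 0]
control_mean = sum(control_outcomes) / len(control_outcomes)
dr_mean = sum((1 - t) / pi0 * (y - mu0) + mu0 for t, y, _, pi0, mu0 in rows) / n

print(true_mean_y0, observed_mean, control_mean, dr_mean)
```

In this toy process higher-risk cases are treated more often and treatment lowers risk, so both the observed outcome rate and the control-population outcome rate fall below the true rate under control, while the DR average tracks it.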
3.4.2. Child Welfare
We also apply counterfactual learning and evaluation to the problem of child welfare screening. The baseline intervention is screen-out (which means no investigation occurs). The data consists of over 30,000 calls to the Allegheny County hotline, each containing more than 1000 features describing the call information as well as county records for all individuals associated with the call. The call features are categorical variables describing the allegation types and worker-assessed risk and danger ratings. The county records include demographic information such as age, race and gender as well as criminal justice, child welfare, and behavioral health history. The outcome is re-referral within a six-month period. Our approach contrasts with prior work, which used out-of-home placement as the outcome (Chouldechova et al., 2018; De-Arteaga et al., 2018). That outcome is only observed for cases under investigation; therefore it cannot be used to identify $\mathbb{E}[Y^0 \mid X]$, the risk under no investigation.
We use random forests to train the observational and counterfactual risk assessments as well as the propensity score model. We used reweighing to correct for covariate shift but did not observe a boost in performance, likely because we have sufficient data and we used a nonparametric model.
We present the PR, ROC and calibration curves in Figure 2. The observational evaluation suggests that the observational model performs better. The control evaluation suggests that the counterfactual and observational models of risk perform equally well. Our DR evaluation suggests the counterfactual model has both better discrimination and calibration in estimating the probability of re-referral under screen-out. In Figure 2(c), the observational evaluation suggests that the observational model is well-calibrated whereas the counterfactual model is overestimating risk; this is expected because the counterfactual model assesses risk under no investigation whereas the observed outcomes include cases whose risk was mitigated by child welfare services. The control evaluation suggests that the two models are similarly calibrated. The DR evaluation shows that the counterfactual model is well-calibrated and the observational model underestimates risk. This makes intuitive sense because the observational model does not account for the fact that treatment reduced risk for the screened-in cases.
We see further evidence that the observational model performs poorly on the treated population in the drop in ROC curves between the control evaluation and DR evaluation in Figure 1(b). Deploying such a model would mean failing to identify the people who need and would benefit from treatment. The observational and control evaluations do not show this significant limitation; DR evaluation is the only evaluation that illustrates the poor performance of the observational model on the treated population.
We also evaluate the different models according to whether they are equally predictive, in the sense of being equally well calibrated, across racial groups. Research suggests child welfare processes may disproportionately involve black families (Dettlaff et al., 2011). Here we ask whether the observational or counterfactual model is more equitable. We compare calibration rates by race in Figure 3. The observational evaluation suggests that the counterfactual model of risk is poorly calibrated by race. The DR evaluation shows that the counterfactual model is well-calibrated by race and indicates that the observational model underestimates risk on both black and white cases.
Overall the observational evaluation suggests that the observational model performs better whereas the DR evaluation suggests the counterfactual model performs better. Since we do not have access to the true counterfactual to validate these results, we further consider how well the models align with expert assessment of risk.
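To make the contrast between the observational and DR evaluations concrete, the following is a minimal sketch of a generic doubly robust estimate of a counterfactual base rate. It uses the standard augmented inverse-probability-weighted form and is not claimed to be the exact estimator of § 3.3.3. The function name, the toy population, and the nuisance values are all hypothetical; the symbols T (treatment indicator) and Y^0 (outcome under the baseline) are assumed notation.

```python
def dr_counterfactual_mean(rows, propensity, outcome_model):
    """Doubly robust (AIPW-style) estimate of E[Y^0], the mean outcome under baseline.

    rows: iterable of (x, t, y); t = 1 if treated, 0 if baseline.
    propensity(x): estimate of P(T = 1 | X = x).
    outcome_model(x): estimate of E[Y | X = x, T = 0].
    """
    rows = list(rows)
    total = 0.0
    for x, t, y in rows:
        mu0 = outcome_model(x)
        # Inverse-probability weight; only untreated cases contribute a residual.
        weight = (1 - t) / (1.0 - propensity(x))
        total += mu0 + weight * (y - mu0)
    return total / len(rows)


# Hypothetical toy population (not the paper's data). Treated cases always
# have y = 0 here, mimicking a treatment that fully mitigates risk.
rows = (
    [(0, 1, 0)] * 2 + [(0, 0, 1)] * 4 + [(0, 0, 0)] * 4      # x = 0
    + [(1, 1, 0)] * 5 + [(1, 0, 1)] * 4 + [(1, 0, 0)] * 1    # x = 1
)
true_propensity = {0: 0.2, 1: 0.5}   # P(T = 1 | X = x) in the toy data
true_mu0 = {0: 0.5, 1: 0.8}          # E[Y | X = x, T = 0] in the toy data

# Observed outcome rate, which mixes treated and untreated cases.
observed_rate = sum(y for _, _, y in rows) / len(rows)

# DR estimates with both nuisances right, then with one of them wrong.
dr_both_right = dr_counterfactual_mean(rows, true_propensity.get, true_mu0.get)
dr_wrong_outcome = dr_counterfactual_mean(rows, true_propensity.get, lambda x: 0.0)
dr_wrong_propensity = dr_counterfactual_mean(rows, lambda x: 0.5, true_mu0.get)

print(observed_rate, dr_both_right, dr_wrong_outcome, dr_wrong_propensity)
```

In this toy population the observed rate understates the risk under the baseline because treated cases have their risk mitigated, while the DR estimate stays at the same value whether the outcome model or the propensity model (but not both) is misspecified.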
3.4.3. Expert Evaluation
At various stages in the child welfare process, social workers assign treatment based on their assessment of risk. Social workers sequentially make three treatment decisions:

(1) Whether to screen in a case for investigation
(2) Whether to offer services for a case under investigation
(3) Whether to place a child out-of-home after an investigation
Assuming that social workers are competent at assessing risk, we expect the group placed out-of-home (3) to have the highest risk distribution, followed by the group offered services (2), followed by those screened in (1), and finally we expect the screened-out group to have the lowest risk. Figure 4 shows that the counterfactual model exhibits this expected behavior whereas the observational model does not. The observational model assesses the screened-out population to have more high-risk cases than any other treatment group. This indicates that the observational model underestimates risk on the treated groups (investigated, services, and placed) because it fails to account for the risk-mitigating effects of these treatments. In other words, it underestimates risk for those who were assigned effective treatments. These cases should be assigned treatment, but the observational model would suggest that they are low risk and should be screened out.
Such a mistake can have cascading effects downstream. We are particularly concerned about screening out cases that, had they been screened in, would have been accepted for services or placed out-of-home. Humans determined that these cases needed treatments that will be inaccessible if they are screened out. Figure 5 shows the recall for placed cases and serviced cases as we vary the proportion of cases classified as high-risk. This plot shows that at any proportion the counterfactual model has significantly higher recall for both services and placement cases. In particular, at the 0.5 proportion (which is the rate of screen-in), the counterfactual model screens in 74% of cases that were placed whereas the observational model only screens in 53%. At the 0.5 proportion the counterfactual model screens in 69% of cases that were accepted for services versus 31% for the observational model.
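The recall curve described above can be sketched as follows: rank cases by score, flag the top fraction as high-risk, and measure what share of the cases that were later placed (or serviced) fall in the flagged set. The function name and the toy scores are hypothetical and are not drawn from the child welfare data.

```python
def recall_at_fraction(scores, needed_treatment, fraction):
    """Recall for positive cases when the top `fraction` of scores is flagged high-risk.

    scores: risk scores, one per case.
    needed_treatment: 1 if the case was later placed/serviced, else 0.
    fraction: proportion of all cases classified as high-risk.
    """
    n_flag = int(round(fraction * len(scores)))
    # Indices sorted from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_flag])
    positives = [i for i, z in enumerate(needed_treatment) if z == 1]
    if not positives:
        raise ValueError("recall is undefined without positive cases")
    return sum(1 for i in positives if i in flagged) / len(positives)


# Hypothetical toy example: 8 cases, 3 of which were later placed.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
placed = [1, 0, 1, 0, 0, 1, 0, 0]

for fraction in (0.25, 0.5, 1.0):
    print(fraction, recall_at_fraction(scores, placed, fraction))
```

Sweeping `fraction` from 0 to 1 for each model's scores would trace out one curve per model.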
3.4.4. Task adaptation: Predicting Placement
Another way to evaluate the models is to assess their performance on related risk tasks. While the counterfactual model targets the risk under no investigation, we can assess how well it estimates the risk under investigation. If we have reason to believe there will be common risk factors for risk under no investigation and risk under investigation, then we expect our model to perform well on this task. We use placement out-of-home, an adverse child welfare outcome that is observed for cases under investigation.
Table 1 shows the area under the ROC and PR curves for the placement task. The observational model performs worse than a random classifier, whereas the counterfactual model shows some degree of discrimination. This suggests that the counterfactual model is learning a risk model that is useful in related risk tasks whereas the observational model is not.
        Observ. model       Counterfact. model   Random
AUROC   0.48 (0.46, 0.49)   0.62 (0.61, 0.63)    0.50
AUPR    0.13 (0.11, 0.14)   0.18 (0.16, 0.19)    0.14
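As a reminder of what the AUROC row measures, the sketch below computes AUROC directly as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, counting ties as one half. This is the standard rank interpretation of AUROC, shown on hypothetical toy scores; it says nothing about how the paper's intervals were obtained.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy scores and placement labels.
example_auroc = auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
constant_auroc = auroc([0.5, 0.5, 0.5, 0.5], [0, 0, 1, 1])
print(example_auroc, constant_auroc)
```

A scorer that cannot separate the classes lands at 0.5, which is the "Random" reference value in the table; values below 0.5 mean positives tend to be ranked below negatives.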
The comparison to expert assessment of risk and the performance on a downstream risk task support the conclusions of our DR evaluation: the counterfactual model outperforms the observational model. In decision-making contexts, failure to account for treatment effects can lead one to the wrong conclusions about model performance, even potentially leading to the deployment of a model that underestimates risk for those who stand to gain most from treatment. In the next section, we consider how failure to account for treatment effects can impact fairness.
4. Counterfactual Fairness
Standard observational notions of algorithmic fairness are subject to the same pitfalls as observational model evaluation. In this section we propose counterfactual formulations of several fairness metrics and analyze the conditions under which the standard (observational) metric implies the counterfactual one.
We motivate the importance of defining these metrics counterfactually with an example. Suppose teachers are assessing the effectiveness and fairness of a model that predicts who is likely to fail an exam, which they intend to use to assign tutoring resources. Suppose anyone tutored will pass. The tutoring session conflicts with girls’ sports practice so only male students are tutored. A model that perfectly predicts who will fail without the help of a tutor will have a higher observational FPR for men than women because some male students were tutored, which enabled them to pass. It would be wrong to conclude that this model is unfair with regard to FPR. Someone who would have been high-risk had they not been treated but whose risk was mitigated under treatment should not be considered a false positive. Failure to make this distinction could lead to unfairness, not only in settings where the treatment assignment varies according to the protected attribute but also in settings where the risk under treatment varies according to the protected attribute, as we can see in the next example.
Suppose that the classroom next door is also evaluating the model. This classroom offers tutoring during lunch so girls and boys both can attend; however they hired a tutor who happens to only be effective in preparing male students to pass. The teachers don’t know this and randomly assign this tutor to students regardless of gender. The model that perfectly predicts who will fail without a tutor has a higher observational FPR for men, but as before, it is wrong to conclude that the model is unfair with regard to FPR.
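Both classroom examples can be checked with a small deterministic simulation. The sketch below is purely illustrative: the class sizes, which students are tutored, and all names are hypothetical. The model's prediction is set equal to the outcome without tutoring, so it is perfect in the counterfactual sense by construction.

```python
def classroom(tutored, effective_for):
    """Build 10 boys and 10 girls; the first 4 of each would fail untutored.

    tutored: dict mapping gender -> set of student indices who receive tutoring.
    effective_for: set of genders for whom tutoring guarantees a pass.
    """
    students = []
    for gender in ("boy", "girl"):
        for i in range(10):
            y0 = 1 if i < 4 else 0                      # 1 = fails without a tutor
            helped = i in tutored[gender] and gender in effective_for
            y = 0 if helped else y0                     # observed outcome
            students.append({"gender": gender, "y0": y0, "y": y, "pred": y0})
    return students


def fpr(students, gender, outcome_key):
    """P(pred = 1 | outcome = 0) within one gender, for the chosen outcome."""
    negatives = [s for s in students if s["gender"] == gender and s[outcome_key] == 0]
    return sum(s["pred"] for s in negatives) / len(negatives)


# Classroom 1: only boys are tutored, and tutoring always works.
room1 = classroom({"boy": {0, 1, 5}, "girl": set()}, {"boy", "girl"})
# Classroom 2: same tutoring pattern for both genders, but the tutor only helps boys.
room2 = classroom({"boy": {0, 1, 5}, "girl": {0, 1, 5}}, {"boy"})

for name, room in (("classroom 1", room1), ("classroom 2", room2)):
    for gender in ("boy", "girl"):
        print(name, gender,
              "observational FPR:", fpr(room, gender, "y"),
              "counterfactual FPR:", fpr(room, gender, "y0"))
```

In both classrooms the observational FPR differs by gender while the FPR measured against the outcome without tutoring is identical (zero) for both groups.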
These examples give intuition for the theory presented in the next section. As we show, when there are differences in the propensity to treat and/or the treatment is differentially effective, parity in the observational metric generally implies counterfactual disparity. This suggests that methods that attempt to equalize the observational metric may not be equalizing the counterfactual metric and raises the question whether they could increase the disparity in the counterfactual metric. In Section 4.2, we perform experiments on a synthetic dataset that illustrate examples in which this occurs.
We distinguish our notion of counterfactual fairness from prior work which considered counterfactuals of the protected attribute (Kusner et al., 2017; Kilbertus et al., 2017; Wang et al., 2019), an approach which is counterproductive in our settings of interest. Consider a female student who is at high risk of failing because of gender discrimination at home or in the classroom, e.g., parents or previous teachers have not given her the support they would have had she been male. Treating this student “counterfactually as if she had been male all along” may suggest that we should not assign this student a tutor. In fact we must assign her a tutor in order to correct historical discrimination. Similar arguments can be made in settings like child welfare screening and loan approvals.
4.1. Theoretical results
For three definitions of fairness (parity), we show that observational parity implies counterfactual parity if and only if a balance condition holds. We further show that an independence condition is sufficient for observational parity to imply counterfactual parity. We discuss why it is generally unlikely that the independence condition holds and even more unlikely that the finer balance condition holds when the independence condition fails. All proofs are provided in Appendix B.
4.1.1. Base Rate Parity
Base rate plays a core role in statistical definitions of fairness (also known as group fairness). Base rate parity is similar to the fairness notion of demographic parity, which requires the prediction to be independent of the protected attribute (Dwork et al., 2012; Calders et al., 2009; Zafar et al., 2015). In Section 4.2, we perform experiments on a fairness corrective method that targets base rate parity in order to encourage demographic parity (Kamiran and Calders, 2012). A related fairness notion is prediction-prevalence parity. Satisfying both prediction-prevalence parity and demographic parity requires parity in the base rates. We distinguish observational base rate parity (oBP), which requires the base rate of the observed outcome to be equal across values of the protected attribute, from counterfactual base rate parity (cBP), which requires the same of the potential outcome under the baseline treatment.
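For reference, the parity notions in this paragraph can be written out as follows. The symbols are assumed notation rather than a quotation of the original formulas: Y is the observed outcome, Y^0 the potential outcome under the baseline treatment, \hat{Y} the prediction, A the protected attribute, and a, a' any two of its values.

```latex
\begin{align*}
\text{demographic parity:}\quad
  & \mathbb{E}[\hat{Y} \mid A = a] = \mathbb{E}[\hat{Y} \mid A = a'] \\
\text{observational base rate parity (oBP):}\quad
  & \mathbb{E}[Y \mid A = a] = \mathbb{E}[Y \mid A = a'] \\
\text{counterfactual base rate parity (cBP):}\quad
  & \mathbb{E}[Y^0 \mid A = a] = \mathbb{E}[Y^0 \mid A = a']
\end{align*}
```

For binary outcomes and predictions, each equality of conditional means is equivalent to independence of the corresponding variable from A.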
Theorem 1 (Base Rate Parity).
If oBP holds, then cBP holds if and only if the following balance condition holds,
assuming .
Condition 0 (balBP).
(16) 
BalBP holds under the following independence conditions, which provide sufficient conditions for oBP to imply cBP.
Condition 0 (indBP).
(17) 
It is unlikely that indBP (17) holds in many contexts. In settings such as child welfare screening and criminal justice, research suggests that even when controlling for the true risk, certain races are more likely to receive treatment (Dettlaff et al., 2011; Alexander, 2011; Mauer, 2010). indBP cannot hold in these settings since . Even in settings where there is no such bias, indBP will not hold if the risk distributions under treatment vary by protected attribute since indBP requires that . indBP also requires , which forbids discrimination in treatment assignment when controlling for risk under treatment. If indBP does not hold, it is possible that balBP (16) still holds if the conditional and marginal probabilities are such that all terms in Condition (16) exactly cancel; however there is no semantic reason why this should hold. Theorem 1 assumes , a mild positivity-like assumption that holds in all settings that are suitable for algorithmic risk assessment. Violations of this assumption indicate either completely perfect or imperfect treatment assignment historically for a demographic group.
Proof of Base Rate Necessary Condition.
By consistency . Then we have
Likewise for
By oBP, . We assume cBP holds so . Then, we have
∎
Proof of Base Rate Parity Sufficiency.
where the first line used consistency and the second line applied linearity of expectation and . By oBP, , so it must be true that
∎
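The starting point of both proof sketches above, the consistency expansion, can be written out explicitly. The notation is assumed: T is a binary treatment indicator with T = 0 the baseline, and Y^1, Y^0 are the potential outcomes under treatment and baseline. The final equivalence is one way of reading off what must balance across groups; it is not claimed to be the exact form in which Condition (16) is stated.

```latex
\begin{align*}
Y &= T\,Y^1 + (1 - T)\,Y^0 \;=\; Y^0 + T\,(Y^1 - Y^0)
  && \text{(consistency)} \\
\mathbb{E}[Y \mid A = a]
  &= \mathbb{E}[Y^0 \mid A = a] + \mathbb{E}[T\,(Y^1 - Y^0) \mid A = a]
  && \text{(linearity of expectation)}
\end{align*}
% Subtracting the same identity for A = a' and using oBP,
% E[Y | A = a] = E[Y | A = a'], gives:
\begin{equation*}
\text{cBP holds}
\iff
\mathbb{E}[T\,(Y^1 - Y^0) \mid A = a] = \mathbb{E}[T\,(Y^1 - Y^0) \mid A = a'] .
\end{equation*}
```

Read this way, observational parity carries over to counterfactual parity only when the average effect of treatment actually received is the same in both groups, which fails if either the propensity to treat or the effectiveness of treatment differs by group.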
4.1.2. Predictive parity
Base rate parity and demographic parity may be ill-suited for settings where base rates differ by protected attribute due to disparate needs. Here we may instead desire parity in an error metric, such as precision. Positive predictive parity requires the precision (also known as positive predictive value) to be independent of the protected attribute, and negative predictive parity requires the negative predictive value to be independent of the protected attribute (Chouldechova, 2017; Kleinberg et al., 2016). We define observational Predictive Parity (oPP) as requiring the observed outcome to be independent of the protected attribute given the prediction, and counterfactual Predictive Parity (cPP) as requiring the potential outcome under the baseline treatment to be independent of the protected attribute given the prediction. Conditioning on a negative prediction corresponds to negative predictive parity and conditioning on a positive prediction corresponds to positive predictive parity.
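In symbols, and again under assumed notation (Y observed outcome, Y^0 potential outcome under the baseline treatment, \hat{Y} binary prediction, A protected attribute), the two predictive parity notions can be sketched as:

```latex
\begin{align*}
\text{oPP:}\quad
  & P(Y = 1 \mid \hat{Y} = \hat{y},\, A = a) = P(Y = 1 \mid \hat{Y} = \hat{y},\, A = a') \\
\text{cPP:}\quad
  & P(Y^0 = 1 \mid \hat{Y} = \hat{y},\, A = a) = P(Y^0 = 1 \mid \hat{Y} = \hat{y},\, A = a')
\end{align*}
% \hat{y} = 1: positive predictive parity; \hat{y} = 0: negative predictive parity.
```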
Theorem 2 (Predictive Parity).
If oPP holds, then cPP holds if and only if the following balance condition holds,
assuming .
Condition 0 (balPP).
(18) 
BalPP is satisfied under the following independence conditions, which provide sufficient conditions for oPP to imply cPP.
Condition 0 (indPP).
(19) 
IndPP will not hold in many settings. Note that and .
Conditions require to contain all the information that tells us about treatment assignment that is not contained in . Since is typically trained to predict and not , it is quite unlikely that these conditions will hold in settings where there is bias in treatment assignment even when controlling for true risk. Condition allows differences in the risk distribution under treatment if we can fully explain these differences with . In the best case , but it is unlikely that the observed outcome, which is not causally well-defined, would explain differences in the risk distribution under treatment. As above, even if indPP does not hold, balPP may hold but it is difficult to reason why this should hold in any setting. Like Theorem 1, Theorem 2 also assumes a mild positivity-like assumption that is reasonable in risk assessment settings.
4.1.3. Equalized odds
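In settings where TPR and FPR are more important than predictive value, we may desire parity in TPR and FPR, a fairness notion known as Equalized Odds (Hardt et al., 2016). Let observational Equalized Odds (oEO) require that the prediction be independent of the protected attribute given the observed outcome, and counterfactual Equalized Odds (cEO) require that the prediction be independent of the protected attribute given the potential outcome under the baseline treatment.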
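Written out under assumed notation (Y observed outcome, Y^0 potential outcome under the baseline treatment, \hat{Y} binary prediction, A protected attribute), the two equalized odds notions can be sketched as:

```latex
\begin{align*}
\text{oEO:}\quad
  & P(\hat{Y} = 1 \mid Y = y,\, A = a) = P(\hat{Y} = 1 \mid Y = y,\, A = a'),
  && y \in \{0, 1\} \\
\text{cEO:}\quad
  & P(\hat{Y} = 1 \mid Y^0 = y,\, A = a) = P(\hat{Y} = 1 \mid Y^0 = y,\, A = a'),
  && y \in \{0, 1\}
\end{align*}
% y = 1 gives parity in TPR; y = 0 gives parity in FPR.
```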
Theorem 3 (Equalized Odds).
If oEO holds, then cEO holds if and only if the following balance condition holds,
assuming and .
.
Condition 0 (balEO).
(20) 
The balance condition is satisfied under the following independence conditions, which comprise sufficient conditions for oEO to imply cEO.
Condition 0 (indEO).
(21) 
The first two conditions of indEO require oBP and cBP, so indEO requires balBP to hold. In settings where there is discrimination in treatment assignment even when controlling for true risk, indEO is unlikely to hold. Even if there is no such discrimination, indEO will not hold if there are differences in the risk distributions under treatment since the last condition of (21) requires . indEO requires further conditions such as parity in the TPR/FPR against the outcome under treatment. If these conditions are not met, oEO could imply cEO if balEO holds, but it is difficult to reason about why this would hold for a setting when the independencies do not. Theorem 3 assumes two mild assumptions: the positivity-like assumption of Theorem 2 and .
Our theoretical analysis suggests that in many settings equalizing the observational fairness metric will not equalize the counterfactual fairness metric. We conclude by noting that the theorems hold when conditioning on any feature(s) , and in this context, these theorems are relevant to individual notions of fairness.
4.2. Experiments on synthetic data
We empirically demonstrate that equalizing the observational metric via fairness-corrective methods can increase disparity in the counterfactual metric on the synthetic data described in § 3.4.1. (Footnote 13: We do not perform the experiments on the child welfare data since it is balanced in terms of base rates and FPR/TPR with respect to race.)
4.2.1. Reweighing
One approach to encourage demographic parity reweighs the training data to achieve base rate parity (Kamiran and Calders, 2012). Figure 6 shows that without any processing (“Original”), the counterfactual base rates are equal while the observational base rates show increasing disparity with . Reweighing applied to the observational outcome achieves oBP but induces disparity in the counterfactual base rate. Theorem 1 suggested this result: For , ; then it is unlikely that oBP implies cBP.
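The reweighing step can be sketched as below. The weight for each (group, label) cell is the ratio of the cell's expected frequency under independence to its observed frequency, which is the reweighing rule of Kamiran and Calders (2012) as we understand it; the function names and toy data are hypothetical.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Per-example weights w(a, y) = P(A=a) * P(Y=y) / P(A=a, Y=y)."""
    n = len(labels)
    count_a = Counter(groups)
    count_y = Counter(labels)
    count_ay = Counter(zip(groups, labels))
    return [
        count_a[a] * count_y[y] / (n * count_ay[(a, y)])
        for a, y in zip(groups, labels)
    ]


def weighted_base_rate(groups, labels, weights, group):
    """Weighted share of positive labels within one group."""
    num = sum(w * y for a, y, w in zip(groups, labels, weights) if a == group)
    den = sum(w for a, w in zip(groups, weights) if a == group)
    return num / den


# Hypothetical toy data: observed base rate 0.6 in group 1 and 0.2 in group 0.
groups = [1] * 10 + [0] * 10
labels = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 8

weights = reweighing_weights(groups, labels)
rate_1 = weighted_base_rate(groups, labels, weights, 1)
rate_0 = weighted_base_rate(groups, labels, weights, 0)
print(rate_1, rate_0)
```

The weights are built only from the observed labels, so they equalize the observational base rate by construction. Nothing in the procedure refers to the outcome under the baseline treatment, which is why the counterfactual base rates can be pushed apart.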
4.2.2. Postprocessing for equalized odds
We evaluate a method that modifies scores to achieve a generalized version of equalized odds (Pleiss et al., 2017; Hardt et al., 2016). (Footnote 14: We use the Pleiss implementation at https://github.com/gpleiss/equalized_odds_and_calibration, which extends the method in (Hardt et al., 2016) to probabilistic classifiers.) This method targets parity in the generalized FPR and FNR (GFPR and GFNR), which are computed from the scores rather than from predicted labels. We refer to these observational rates as oGFPR/oGFNR and define their counterfactual counterparts, cGFPR and cGFNR, by conditioning on the potential outcome under the baseline treatment instead of the observed outcome. We use the scores of the counterfactual model as inputs. We compute the cGFNR and cGFPR using our DR method from § 3.3.3. (Footnote 15: The estimator is nearly identical to the estimators for FPR/FNR if we use the score in place of the predicted label.)
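A minimal sketch of the generalized rates follows, assuming the definitions of Pleiss et al. (2017) as we recall them: GFPR is the mean score among cases with outcome 0, and GFNR is the mean of one minus the score among cases with outcome 1. The function name and toy values are hypothetical. Passing observed outcomes yields the observational rates; the counterfactual rates would require the outcome under the baseline treatment, which is unobserved for treated cases and therefore has to be estimated.

```python
def generalized_rates(scores, outcomes):
    """Return (GFPR, GFNR) for probabilistic scores and binary outcomes.

    GFPR: average score over cases with outcome 0.
    GFNR: average of (1 - score) over cases with outcome 1.
    """
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    if not neg or not pos:
        raise ValueError("need at least one case with each outcome")
    gfpr = sum(neg) / len(neg)
    gfnr = sum(1 - s for s in pos) / len(pos)
    return gfpr, gfnr


# Hypothetical toy scores.
soft_gfpr, soft_gfnr = generalized_rates([0.2, 0.4, 0.6, 0.8], [0, 0, 1, 1])
# With hard 0/1 scores the generalized rates reduce to the usual FPR and FNR.
hard_gfpr, hard_gfnr = generalized_rates([0, 1, 1, 0], [0, 0, 1, 1])
print(soft_gfpr, soft_gfnr, hard_gfpr, hard_gfnr)
```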
Table 2 shows that post-processing to equalize oGFPR and oGFNR induces imbalance in cGFPR and cGFNR. (Footnote 16: We report results for other parameter values in Appendix E.) In Figure 7 we see that the original model achieved cEO but post-processing induced disparity to the detriment of the group that was less likely to be treated. Since treatment is beneficial, this “fairness” adjustment actually compounded the discrimination in the treatment assignment.
Group  Method      cGFNR  cGFPR  oGFNR  oGFPR
A=1    Original    0.50   0.33   0.58   0.39
A=0    Original    0.50   0.33   0.56   0.39
A=1    Post-Proc.  0.58   0.30   0.63   0.35
A=0    Post-Proc.  0.64   0.34   0.63   0.35
5. Conclusion
This paper demonstrates that training and evaluating models using observed outcomes can lead to the misallocation of resources due to the misestimation of risk for those most receptive to treatment. Furthermore, fairness-correcting methods that seek to achieve observational parity can lead to disparities on the relevant counterfactual metrics, and may further compound inequities in initial treatment assignment. The counterfactual approaches to learning, evaluation and predictive fairness assessment introduced in this paper provide more accurate and relevant indications of model performance.
Acknowledgements.
We are grateful to the Block Center for Technology and Society for funding this research. This project would not have been possible without the support of Allegheny County Department of Human Services, who shared their data and answered many questions during the research process. Thanks to our reviewers for helpful comments about the project.
References
 Alaa and van der Schaar (2017) Ahmed M Alaa and Mihaela van der Schaar. 2017. Bayesian inference of individualized treatment effects using multitask gaussian processes. In Advances in Neural Information Processing Systems. 3424–3432.
 Alexander (2011) Michelle Alexander. 2011. The new jim crow. Ohio St. J. Crim. L. 9 (2011), 7.
 Barabas et al. (2017) Chelsea Barabas, Karthik Dinakar, Joichi Ito, Madars Virza, and Jonathan Zittrain. 2017. Interventions over predictions: Reframing the ethical debate for actuarial risk assessment. arXiv preprint arXiv:1712.08238 (2017).
 Barocas and Selbst (2016) Solon Barocas and Andrew D Selbst. 2016. Big data’s disparate impact. Calif. L. Rev. 104 (2016), 671.
 Calders et al. (2009) Toon Calders, Faisal Kamiran, and Mykola Pechenizkiy. 2009. Building classifiers with independency constraints. In 2009 IEEE International Conference on Data Mining Workshops. IEEE, 13–18.
 Caruana et al. (2015) Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. 2015. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1721–1730.
 Chouldechova (2017) Alexandra Chouldechova. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data 5, 2 (2017), 153–163.
 Chouldechova et al. (2018) Alexandra Chouldechova, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. 2018. A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions. In Conference on Fairness, Accountability and Transparency. 134–148.
 Chouldechova and Roth (2018) Alexandra Chouldechova and Aaron Roth. 2018. The frontiers of fairness in machine learning. arXiv preprint arXiv:1810.08810 (2018).
 Corbett-Davies et al. (2017) Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. 2017. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 797–806.
 De-Arteaga et al. (2018) Maria De-Arteaga, Artur Dubrawski, and Alexandra Chouldechova. 2018. Learning under selective labels in the presence of expert consistency. arXiv preprint arXiv:1807.00905 (2018).
 Dettlaff et al. (2011) Alan J Dettlaff, Stephanie L Rivaux, Donald J Baumann, John D Fluke, Joan R Rycraft, and Joyce James. 2011. Disentangling substantiation: The influence of race, income, and risk on the substantiation decision in child welfare. Children and Youth Services Review 33, 9 (2011), 1630–1637.
 Dressel and Farid (2018) Julia Dressel and Hany Farid. 2018. The accuracy, fairness, and limits of predicting recidivism. Science advances 4, 1 (2018), eaao5580.
 Dudík et al. (2011) Miroslav Dudík, John Langford, and Lihong Li. 2011. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601 (2011).
 Dwork et al. (2012) Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference. ACM, 214–226.
 Ferguson (2016) Andrew Guthrie Ferguson. 2016. Policing predictive policing. Wash. UL Rev. 94 (2016), 1109.
 Hardt et al. (2016) Moritz Hardt, Eric Price, Nati Srebro, et al. 2016. Equality of opportunity in supervised learning. In Advances in neural information processing systems. 3315–3323.
 Kallus and Zhou (2018a) Nathan Kallus and Angela Zhou. 2018a. Confounding-robust policy improvement. In Advances in Neural Information Processing Systems. 9269–9279.
 Kallus and Zhou (2018b) Nathan Kallus and Angela Zhou. 2018b. Residual Unfairness in Fair Machine Learning from Prejudiced Data. In Proc. International Conference on Machine Learning. Stockholm, Sweden, 2439–2448.
 Kamiran and Calders (2012) Faisal Kamiran and Toon Calders. 2012. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems 33, 1 (2012), 1–33.
 Kamishima et al. (2011) Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. 2011. Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops. IEEE, 643–650.
 Kehl and Kessler (2017) Danielle Leah Kehl and Samuel Ari Kessler. 2017. Algorithms in the criminal justice system: Assessing the use of risk assessments in sentencing. (2017).
 Kennedy (2016) Edward H Kennedy. 2016. Semiparametric theory and empirical processes in causal inference. In Statistical causal inferences and their applications in public health research. Springer, 141–167.
 Kennedy et al. (2017) Edward H Kennedy, Zongming Ma, Matthew D McHugh, and Dylan S Small. 2017. Nonparametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79, 4 (2017), 1229–1245.
 Kennedy et al. (2013) Edward H Kennedy, Wyndy L Wiitala, Rodney A Hayward, and Jeremy B Sussman. 2013. Improved cardiovascular risk prediction using nonparametric regression and electronic health record data. Medical care 51, 3 (2013), 251.
 Kilbertus et al. (2017) Niki Kilbertus, Mateo Rojas Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, and Bernhard Schölkopf. 2017. Avoiding discrimination through causal reasoning. In Advances in Neural Information Processing Systems. 656–666.
 Kleinberg et al. (2015) Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, and Ziad Obermeyer. 2015. Prediction policy problems. American Economic Review 105, 5 (2015), 491–95.
 Kleinberg et al. (2016) Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. 2016. Inherent tradeoffs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807 (2016).
 Kube et al. (2019) Amanda Kube, Sanmay Das, and Patrick J Fowler. 2019. Allocating interventions based on predicted outcomes: A case study on homelessness services. In Proceedings of the AAAI Conference on Artificial Intelligence.
 Kusner et al. (2019) Matt Kusner, Chris Russell, Joshua Loftus, and Ricardo Silva. 2019. Making Decisions that Reduce Discriminatory Impacts. In International Conference on Machine Learning. 3591–3600.
 Kusner et al. (2017) Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfactual fairness. In Advances in Neural Information Processing Systems. 4066–4076.
 Madras et al. (2019) David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. 2019. Fairness through causal awareness: Learning causal latent-variable models for biased data. In Proceedings of the Conference on Fairness, Accountability, and Transparency. ACM, 349–358.
 Mauer (2010) Marc Mauer. 2010. Justice for all? Challenging racial disparities in the criminal justice system. Hum. Rts. 37 (2010), 14.
 Mitchell et al. (2018) Shira Mitchell, Eric Potash, and Solon Barocas. 2018. Prediction-based decisions and fairness: A catalogue of choices, assumptions, and definitions. arXiv preprint arXiv:1811.07867 (2018).
 Nabi and Shpitser (2018) Razieh Nabi and Ilya Shpitser. 2018. Fair inference on outcomes. In Thirty-Second AAAI Conference on Artificial Intelligence.
 Neyman (1923) J Neyman. 1923. Sur les applications de la theorie des probabilites aux experiences agricoles: essai des principes (Masters Thesis); Justification of applications of the calculus of probabilities to the solutions of certain questions in agricultural experimentation. Excerpts English translation (Reprinted). Stat Sci 5 (1923), 463–472.
 Pleiss et al. (2017) Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. 2017. On fairness and calibration. In Advances in Neural Information Processing Systems. 5680–5689.
 Quiñonero-Candela et al. (2009) Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. 2009. Dataset shift in machine learning. The MIT Press.
 Robins and Rotnitzky (1995) James M Robins and Andrea Rotnitzky. 1995. Semiparametric efficiency in multivariate regression models with missing data. J. Amer. Statist. Assoc. 90, 429 (1995), 122–129.
 Robins and Rotnitzky (2001) James M Robins and Andrea Rotnitzky. 2001. Inference for semiparametric models: Some questions and an answer-Comments.
 Robins et al. (1994) James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. 1994. Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association 89, 427 (1994), 846–866.
 Rubin (2005) Donald B Rubin. 2005. Causal inference using potential outcomes: Design, modeling, decisions. J. Amer. Statist. Assoc. 100, 469 (2005), 322–331.
 Särndal et al. (1989) Carl-Erik Särndal, Bengt Swensson, and Jan H Wretman. 1989. The weighted residual technique for estimating the variance of the general regression estimator of the finite population total. Biometrika 76, 3 (1989), 527–537.
 Schulam and Saria (2017) Peter Schulam and Suchi Saria. 2017. Reliable decision support using counterfactual models. In Advances in Neural Information Processing Systems. 1697–1708.
 Shalit et al. (2017) Uri Shalit, Fredrik D Johansson, and David Sontag. 2017. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 3076–3085.
 Smith et al. (2012) Vernon C Smith, Adam Lange, and Daniel R Huston. 2012. Predictive modeling to forecast student outcomes and drive effective interventions in online community college courses. Journal of Asynchronous Learning Networks 16, 3 (2012), 51–61.
 Stevenson (2018) Megan Stevenson. 2018. Assessing risk assessment in action. Minn. L. Rev. 103 (2018), 303.
 Sugiyama et al. (2007) Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. 2007. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research 8, May (2007), 985–1005.
 Swaminathan and Joachims (2015) Adith Swaminathan and Thorsten Joachims. 2015. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research 16, 1 (2015), 1731–1755.
 U.S. Department of Health & Human Services (2019) Administration U.S. Department of Health & Human Services. 2019. Child Maltreatment 2017. https://www.acf.hhs.gov/cb/researchdatatechnology/statisticsresearch/childmaltreatment
 Van der Laan et al. (2003) Mark J Van der Laan, MJ Laan, and James M Robins. 2003. Unified methods for censored longitudinal data and causality. Springer Science & Business Media.
 VanderWeele and Hernan (2013) Tyler J VanderWeele and Miguel A Hernan. 2013. Causal inference under multiple versions of treatment. Journal of causal inference 1, 1 (2013), 1–20.
 Wang et al. (2019) Yixin Wang, Dhanya Sridhar, and David M Blei. 2019. Equal Opportunity and Affirmative Action via Counterfactual Predictions. arXiv preprint arXiv:1905.10870 (2019).
 Zafar et al. (2015) Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. 2015. Fairness constraints: Mechanisms for fair classification. arXiv preprint arXiv:1507.05259 (2015).
 Zemel et al. (2013) Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. 2013. Learning fair representations. In International Conference on Machine Learning. 325–333.
 Zhang and Bareinboim (2018a) Junzhe Zhang and Elias Bareinboim. 2018a. Equality of opportunity in classification: A causal approach. In Advances in Neural Information Processing Systems. 3671–3681.
 Zhang and Bareinboim (2018b) Junzhe Zhang and Elias Bareinboim. 2018b. Fairness in decision-making – the causal explanation formula. In Thirty-Second AAAI Conference on Artificial Intelligence.
Appendix A Identifications
In this section we give the identifications referenced in Sections 3.2.2 and 3.3.3. Identification is the process of writing a counterfactual quantity in terms of observable quantities, based on causal assumptions. Our identifications rely on the causal assumptions of § 3.2 and assume that the model is learned and evaluated on separate train/test partitions (so the predictions are just a function of the features).
A.1. Identification of the Counterfactual Outcome
A.2. Identification of Counterfactual TPR
In § 3.3.3, we identify the counterfactual TPR (or recall) as
The derivation is as follows. By definition of conditional expectation we have
We separately identify the numerator and denominator. Since we are evaluating on a test partition, is a function of . Then for the numerator we have
where the first line used iterated expectation, the second line used the definition of an indicator function, the third line used the fact that , the fourth line used exchangeability, and the fifth line used consistency.
To identify the denominator, we use iterated expectation and then apply exchangeability and consistency as we did for the counterfactual target:
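The steps named in this subsection can be assembled into the following hedged reconstruction of the counterfactual TPR identification. The notation is assumed (X features, T treatment with T = 0 the baseline, Y^0 potential outcome under the baseline, \hat{Y} a binary prediction that is a function of X), and the line-by-line layout may differ from the original derivation. It relies on exchangeability, consistency, and positivity of the baseline, P(T = 0 \mid X) > 0.

```latex
\begin{align*}
P(\hat{Y} = 1 \mid Y^0 = 1)
  &= \frac{\mathbb{E}[\hat{Y}\, Y^0]}{\mathbb{E}[Y^0]}, \\[4pt]
\mathbb{E}[\hat{Y}\, Y^0]
  &= \mathbb{E}\big[\mathbb{E}[\hat{Y}\, Y^0 \mid X]\big]
     && \text{(iterated expectation)} \\
  &= \mathbb{E}\big[\hat{Y}\, \mathbb{E}[Y^0 \mid X]\big]
     && \text{($\hat{Y}$ is a function of $X$)} \\
  &= \mathbb{E}\big[\hat{Y}\, \mathbb{E}[Y^0 \mid X, T = 0]\big]
     && \text{(exchangeability)} \\
  &= \mathbb{E}\big[\hat{Y}\, \mathbb{E}[Y \mid X, T = 0]\big]
     && \text{(consistency)} \\[4pt]
\mathbb{E}[Y^0]
  &= \mathbb{E}\big[\mathbb{E}[Y^0 \mid X]\big]
   = \mathbb{E}\big[\mathbb{E}[Y \mid X, T = 0]\big].
\end{align*}
```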
A.3. Identification of Counterfactual Precision
In § 3.3.3, we identify the counterfactual precision as
.
The derivation is as follows.
.
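One way to write out the precision derivation, offered as a sketch under the same assumptions as the rest of this appendix, is shown here. The notation is assumed rather than taken from the text: $\hat{Y}$ is the binary prediction, $X$ the features, $T$ the decision, and $Y^0$ the potential outcome under the baseline decision $T = 0$.

```latex
\begin{align*}
\mathbb{E}[Y^0 \mid \hat{Y} = 1]
  &= \mathbb{E}\big[\, \mathbb{E}[Y^0 \mid X, \hat{Y} = 1] \,\big|\, \hat{Y} = 1 \big]
     && \text{(iterated expectation)} \\
  &= \mathbb{E}\big[\, \mathbb{E}[Y^0 \mid X] \,\big|\, \hat{Y} = 1 \big]
     && \text{($\hat{Y}$ is a function of $X$)} \\
  &= \mathbb{E}\big[\, \mathbb{E}[Y^0 \mid X, T = 0] \,\big|\, \hat{Y} = 1 \big]
     && \text{(exchangeability)} \\
  &= \mathbb{E}\big[\, \mathbb{E}[Y \mid X, T = 0] \,\big|\, \hat{Y} = 1 \big]
     && \text{(consistency)}
\end{align*}
```

Read this way, counterfactual precision is the average of the control-arm outcome regression over the cases the classifier flags.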
A.4. Identification of Counterfactual Calibration
The derivation is the same as for precision, since the prediction is just a function of the features.
A.5. Identification of Counterfactual FPR
In § 3.3.3, we identify the counterfactual FPR as
The derivation below is similar to that for TPR. We can rewrite the target as
We separately identify the numerator and denominator. For the numerator we have
where the first line used iterated expectation, the second line used the definition of an indicator function, the third line used the fact that the counterfactual outcome is a binary random variable, the fourth line used exchangeability, and the last line used consistency.
For the denominator, we have
where the second line used the derivation for the denominator in TPR.
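The steps named above can be sketched as follows. This is a reconstruction under the stated assumptions, with notation assumed rather than taken from the text: $\hat{Y}$ is the prediction, $X$ the features, $T$ the decision, and $Y^0$ the potential outcome under $T = 0$.

```latex
\begin{align*}
\mathbb{E}\big[\hat{Y}\, \mathbb{1}\{Y^0 = 0\}\big]
  &= \mathbb{E}\big[\, \mathbb{E}[\hat{Y}\, \mathbb{1}\{Y^0 = 0\} \mid X] \,\big]
     && \text{(iterated expectation)} \\
  &= \mathbb{E}\big[\hat{Y}\, \mathbb{P}(Y^0 = 0 \mid X)\big]
     && \text{(indicator function)} \\
  &= \mathbb{E}\big[\hat{Y}\, (1 - \mathbb{E}[Y^0 \mid X])\big]
     && \text{($Y^0$ is binary)} \\
  &= \mathbb{E}\big[\hat{Y}\, (1 - \mathbb{E}[Y^0 \mid X, T = 0])\big]
     && \text{(exchangeability)} \\
  &= \mathbb{E}\big[\hat{Y}\, (1 - \mathbb{E}[Y \mid X, T = 0])\big]
     && \text{(consistency)} \\[6pt]
\mathbb{P}(Y^0 = 0)
  &= 1 - \mathbb{E}[Y^0] \\
  &= 1 - \mathbb{E}\big[\, \mathbb{E}[Y \mid X, T = 0] \,\big]
     && \text{(denominator of TPR)}
\end{align*}
```

Dividing the first quantity by the second gives the counterfactual FPR in terms of observable quantities only.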
Appendix B Proofs
In this section we give the proofs of the theorems in § 4.1. These proofs assume consistency (defined in § 3.2.2).
Proof that balBP is Necessary and Sufficient.
By consistency . Then we have
Likewise for
By oBP, . By the above expansions,
(22) 
Necessary: For oBP to imply cBP, both conditions must hold. By cBP, . Equation 22 then becomes
(23) 
which is the balBP condition since we can rewrite the right-hand side as .
Sufficiency: In addition to oBP, we assume balBP holds. The left-hand sides of balBP (Equation 23) and oBP (Equation 22) are the same. Then applying the transitive property,
Assuming (which is a mild positivity-like assumption), we conclude that . ∎
indBP Sufficiency
The following conditions are sufficient for oBP to imply cBP: and
Predictive Parity
The proofs use the same techniques as for base rate parity.
Proof that balEO is Necessary and Sufficient.
We first expand
(24)  
(25) 
which we can further expand to get
(27) 
Since oEO holds by assumption, then . Using the expansion in Equation 27, we have
(28) 
Rearranging gives
(29) 
Necessary
For oEO to imply cEO, both conditions must hold. By cEO, which would imply that
(30) 