Recently there has been sustained interest in modifying prediction algorithms to satisfy fairness constraints. These constraints are typically complex nonlinear functionals of the observed data distribution. Focusing on the causal constraints proposed by Nabi and Shpitser (2018), we introduce new theoretical results and optimization techniques to make model training easier and more accurate. Specifically, we show how to reparameterize the observed data likelihood such that fairness constraints correspond directly to parameters that appear in the likelihood, transforming a complex constrained optimization objective into a simple optimization problem with box constraints. We also exploit methods from empirical likelihood theory in statistics to improve predictive performance, without requiring parametric models for high-dimensional feature vectors.


Optimal Training of Fair Predictive Models


Razieh Nabi, Daniel Malinsky, Ilya Shpitser


Department of Computer Science,
Johns Hopkins University, Baltimore, MD, USA
(rnabi@, malinsky@, ilyas@cs.)jhu.edu

1 Introduction

Predictive models trained on imperfect data are increasingly being used in socially-impactful settings. Predictions (such as risk scores) have been used to inform high-stakes decisions in criminal justice (Perry et al., 2013), healthcare (Kappen et al., 2018), and finance (Khandani et al., 2010). While automation may bring many potential benefits – such as speed and accuracy – it is also fraught with risks. Predictive models introduce two dangers in particular: the illusion of objectivity and violation of fairness norms. Predictive models may appear to be “neutral,” since humans are less involved and because they are products of a seemingly impartial optimization process. However, predictive models are trained on data that reflects the structural inequities, historical disparities, and other imperfections of our society. Often data includes sensitive attributes (e.g., race, gender, age, disability status), or proxies for such attributes. A particular worry in the context of data-driven decision-making is “perpetuating injustice,” which occurs when unfair dependence between sensitive features and outcomes is maintained, introduced, or reinforced by automated tools.

We study how to construct fair predictive models by correcting for the unfair causal dependence of predicted outcomes on sensitive features. We work with the fairness criteria proposed in Nabi and Shpitser (2018), where fair prediction requires imposing hard constraints on the prediction problem in the form of restricting certain causal path-specific effects. Impermissible pathways are user-specified and context-specific, and hence require input from policymakers, legal experts, or the general public. Some alternative but also causally-motivated constrained prediction methods are proposed in Chiappa (2019); Kusner et al. (2017a) and Zhang and Bareinboim (2018). For a survey and discussion of distinct fairness criteria (both causal and associative) see Mitchell et al. (2018).

We advance the state of the art in two ways. First, we give a novel reparameterization of the observed data likelihood in which unfair path-specific effects appear directly as parameters. This allows us to greatly simplify the constrained optimization problem, which has previously required complex or inefficient algorithms. Second, we demonstrate how tools from the empirical likelihood literature (Owen, 2001) can be readily adapted to construct hybrid (semi-parametric) observed data likelihoods that satisfy given fairness criteria. With this approach, the entire likelihood is constrained, rather than only part of the likelihood as in past proposals (Nabi and Shpitser, 2018). As a result, we use the data more efficiently and achieve better performance. Finally, we show how both innovations may be combined into a single procedure.

As a guiding example, we consider a setting such as automated hiring, in which we want to predict job success from applicant data. We have historical data on job success, resumes, and demographics, as well as new individuals for whom we only see resumes and demographics and for whom we would like to estimate a risk score with our predictive model. This may be considered a variant of semi-supervised learning, or prediction with missing labels on a subset of the population. We aim to estimate those scores subject to path-specific fairness constraints. In order to describe the various components of this proposal, we must review some background on causal inference, path-specific effects, and constrained prediction.

2 Causal Inference and a Causal Approach to Fairness

Causal inference is concerned with quantities which describe the consequences of interventions. Causal models are often represented graphically, e.g. by directed acyclic graphs (DAGs). We will use capital letters (V) to denote sets of random variables as well as corresponding vertices in graphs and lowercase letters (v) to denote values or assignments to those random variables. A DAG G consists of a set of vertices V connected by directed edges (V_i → V_j for some V_i, V_j ∈ V) such that there are no cycles. The set pa_G(V) denotes the parents of V in DAG G. X_V denotes the state space of V.

A causal model of a DAG G is a set of distributions defined on potential outcomes (a.k.a. counterfactuals). For example, we consider distributions p({Y(a) : Y ∈ V}) subject to some restrictions, where Y(a) represents the value of Y had all variables in A ⊆ V been set, possibly contrary to fact, to value a. In this paper, we assume Pearl's functional model (Pearl, 2009) for a DAG G, which stipulates that the sets of potential outcome variables {V(pa_V) : V ∈ V} are mutually independent. All other counterfactuals may be defined using recursive substitution. For any A ⊆ V \ {Y},

Y(a) ≡ Y(a_{pa_Y ∩ A}, {W(a) : W ∈ pa_Y \ A}),

where {W(a) : W ∈ pa_Y \ A} is taken to mean the (recursively defined) set of counterfactuals associated with variables in pa_Y \ A, had A been set to a. Equivalently, Pearl's model may be described by a system of nonparametric structural equations with independent errors.

A causal parameter is said to be identified in a causal model if it is a function of the observed data distribution p(V). In the functional model of a DAG G (as well as some weaker causal models), all interventional distributions p({V \ A}(a)), for any A ⊆ V, are identified by the extended g-formula:

p({V \ A}(a)) = ∏_{V ∈ V \ A} p(v | pa_G(V)), evaluated at A = a.
For example, consider the DAG in Fig. 1(a). Y(a) is defined to be Y(a, M(a)) by recursive substitution, and its distribution is identified as p(Y(a)) = ∑_{c,m} p(Y | a, m, c) p(m | a, c) p(c). The mean difference between Y(a) and Y(a′) for some treatment value of interest a and reference value a′ is E[Y(a)] − E[Y(a′)] and quantifies the average causal effect of treatment A on the outcome Y.
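To make the plug-in g-formula concrete, here is a small computation for the Fig. 1(a) DAG. The conditional distributions below (p_c, p_m, e_y) are invented purely for illustration; they are not from the paper.

```python
# Plug-in g-formula for the average causal effect (ACE) in the Fig. 1(a)
# DAG, assuming binary C, M and known (illustrative) conditionals.

p_c = {0: 0.6, 1: 0.4}                      # p(C)
p_m = {(a, c): 0.2 + 0.5 * a + 0.1 * c      # p(M = 1 | A = a, C = c)
       for a in (0, 1) for c in (0, 1)}
e_y = {(a, m, c): 0.1 + 0.3 * a + 0.4 * m   # E[Y | A = a, M = m, C = c]
       for a in (0, 1) for m in (0, 1) for c in (0, 1)}

def mean_y_do(a):
    """E[Y(a)] = sum_{c,m} E[Y | a, m, c] p(m | a, c) p(c)."""
    total = 0.0
    for c, pc in p_c.items():
        for m in (0, 1):
            pm = p_m[(a, c)] if m == 1 else 1 - p_m[(a, c)]
            total += e_y[(a, m, c)] * pm * pc
    return total

ace = mean_y_do(1) - mean_y_do(0)   # average causal effect E[Y(1)] - E[Y(0)]
```

With these numbers the ACE is 0.5: a direct contribution of 0.3 plus an indirect contribution of 0.2 through M.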

2.1 Mediation Analysis and Path-Specific Effects

An important goal in causal inference is to understand the mechanisms by which some treatment A influences some outcome Y. A common framework for studying mechanisms is mediation analysis, which seeks to decompose the effect of A on Y into the direct effect and the indirect effect mediated by a third variable, or more generally into components associated with particular causal pathways. As an example, the direct effect of A on Y in Fig. 1(a) corresponds to the effect along the edge A → Y, and the indirect effect corresponds to the effect along the path A → M → Y, mediated by M.

In the potential outcome notation, the direct and indirect effects can be defined using nested counterfactuals such as Y(a, M(a′)) for a ≠ a′, which denotes the value of Y when A is set to a while M is set to whatever value it would have attained had A been set to a′. Given a ≠ a′, the natural direct effect (NDE) (on the expectation difference scale) is defined as E[Y(a, M(a′))] − E[Y(a′)], and the natural indirect effect (NIE) is defined as E[Y(a)] − E[Y(a, M(a′))]. Under certain identification assumptions discussed by Pearl (2001), the distribution of Y(a, M(a′)) (and thereby direct and indirect effects) can be nonparametrically identified from observed data by the following formula:

p(Y(a, M(a′))) = ∑_{c,m} p(Y | a, m, c) p(m | a′, c) p(c).
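The mediation formula can likewise be evaluated by a plug-in sum. The sketch below uses illustrative conditionals of our own choosing for Fig. 1(a); with them, NDE + NIE recovers the total effect.

```python
# Plug-in NDE and NIE via the mediation formula, with invented
# conditionals for the Fig. 1(a) DAG (binary C and M for simplicity).
p_c = {0: 0.6, 1: 0.4}
def p_m1(a, c): return 0.2 + 0.5 * a + 0.1 * c      # p(M = 1 | a, c)
def e_y(a, m, c): return 0.1 + 0.3 * a + 0.4 * m    # E[Y | a, m, c]

def mean_y(a, a_mediator):
    """E[Y(a, M(a'))] = sum_{c,m} E[Y | a, m, c] p(m | a', c) p(c)."""
    total = 0.0
    for c, pc in p_c.items():
        for m in (0, 1):
            pm = p_m1(a_mediator, c) if m == 1 else 1 - p_m1(a_mediator, c)
            total += e_y(a, m, c) * pm * pc
    return total

nde = mean_y(1, 0) - mean_y(0, 0)   # E[Y(1, M(0))] - E[Y(0)]
nie = mean_y(1, 1) - mean_y(1, 0)   # E[Y(1)] - E[Y(1, M(0))]
```

Here the two components sum to the total effect: nde + nie = E[Y(1)] − E[Y(0)].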

More generally, when there are multiple pathways from A to Y, one may define various path-specific effects (PSEs). In this case, the effect along a particular set of paths is obtained by comparing two potential outcomes: one where, along the selected paths, all nodes behave as if A were set to a, while along all other paths nodes behave as if A were set to a′.

PSEs are defined by means of nested, path-specific potential outcomes. Fix a set of treatment variables A ⊆ V, and a subset π of proper causal paths from any element in A to Y. A proper causal path only intersects A at the source node. Next, pick a pair of value sets a and a′ for elements in A. For any Y, define the path-specific potential outcome Y(π, a, a′) by setting A to a for the purposes of paths in π, and to a′ for the purposes of proper causal paths from A to Y not in π. Formally, the definition is as follows: for any W ∈ A, W(π, a, a′) ≡ a_W; otherwise

Y(π, a, a′) ≡ Y({W(π, a, a′) : W ∈ pa_Y^π}, {W(a′) : W ∈ pa_Y \ pa_Y^π}),

where W(a′) is given by recursive substitution, pa_Y^π is the set of parents of Y along an edge which is a part of a path in π, and pa_Y \ pa_Y^π is the set of all other parents of Y.

A counterfactual is said to be edge inconsistent if counterfactuals of the form W(a) and W(a′), for a ≠ a′, both occur in it; otherwise it is said to be edge consistent. It is well known that a joint distribution containing an edge-inconsistent counterfactual is not identified in the functional causal model (nor in weaker causal models); the corresponding graphical criterion on π and G is called the 'recanting witness' criterion (Shpitser, 2013; Shpitser and Tchetgen Tchetgen, 2016). Under some assumptions, PSEs are nonparametrically identified by means of the edge g-formula described in Shpitser and Tchetgen Tchetgen (2016).

As an example, consider the DAG in Fig. 1(b). The PSE of A on Y along the paths {A → Y, A → M → Y, A → M → L → Y} is encoded by a counterfactual contrast of the form Y(a, M(a), L(a′, M(a))) vs. Y(a′). This counterfactual density is identified by the edge g-formula as follows:

p(Y(π, a, a′)) = ∑_{c,m,l} p(Y | a, m, l, c) p(l | a′, m, c) p(m | a, c) p(c).
For more details on PSEs, see Shpitser (2013), Shpitser and Sherman (2018), and Nabi et al. (2018).

2.2 Algorithmic Fairness via Constraining Path-Specific Effects

There has been a growing interest in the issue of fairness in machine learning (Pedreshi et al., 2008; Feldman et al., 2015; Hardt et al., 2016; Kamiran et al., 2013; Corbett-Davies et al., 2017; Jabbari et al., 2017; Kusner et al., 2017b; Zhang and Bareinboim, 2018; Zhang et al., 2017). In this paper, we adopt the causal notion of fairness described in Nabi and Shpitser (2018) and Nabi et al. (2019), where unfairness corresponds to the presence of undesirable or impermissible path-specific effects of sensitive attributes on outcomes – a view which generalizes an example discussed in Pearl (2009). We provide a brief summary of their perspective on fairness in the following, without defending it for lack of space; see Nabi and Shpitser (2018) for more details.

Consider an observed data distribution p(Y, X) induced by a causal model, where Y is an outcome and X includes all baseline factors C, sensitive features A, and post-treatment pre-outcome mediators M. Context and background ethical considerations pick out some path-specific effect of the sensitive feature A on the outcome Y as unfair; we assume this effect is identified as a functional g(p(Y, X)). Fix upper and lower bounds (ε_l, ε_u) for the PSE, representing a tolerable range. The most relevant bounds in practice are ε_l = ε_u = 0, or approximately zero. Nabi and Shpitser propose to transform the inference problem on p(Y, X), the "unfair world," into an inference problem on another distribution p*(Y, X), called the "fair world," which is close in the sense of minimal KL-divergence to p(Y, X) while also having the property that the PSE lies within (ε_l, ε_u).

Given a dataset D = {(y_i, x_i)}_{i=1}^n drawn from p(Y, X), a likelihood function L(D; α), an estimator ĝ(D; α) of the unfair PSE, and bounds (ε_l, ε_u), Nabi and Shpitser suggest to approximate p*(Y, X) by solving the following constrained maximum likelihood problem:

α̂ = arg max_α ∑_{i=1}^n log L(y_i, x_i; α)  subject to  ε_l ≤ ĝ(D; α) ≤ ε_u.   (2)
Having approximated the fair world in this way, Nabi and Shpitser point out a key difficulty for using these estimated parameters to predict outcomes for new instances (e.g., new job applicants). A new set of observations is not sampled from the "fair world" p* but from the "unfair world" p. Nabi and Shpitser propose to map new instances from p to p* and to use the result for predicting with the fitted model parameters. They assume X can be partitioned into W and Z such that p(Z) = p*(Z) but p(W | Z) ≠ p*(W | Z). In other words, variables in Z are shared between p and p*, while variables in W are not; W typically corresponds to variables that appear in the estimator ĝ. There is no obvious principled way of knowing exactly what values of W the "fair version" of the new instance would attain. Consequently, all such possible values are averaged out, weighted appropriately by how likely they are according to the estimated p*. This entails predicting Y as the expected value with respect to the distribution p*(Y, W | Z = z).

Next, we explain some limitations of the inference procedure described here and present our main contributions to address these limitations.

Figure 1: (a) A simple causal DAG, with treatment A, outcome Y, baseline variables C, and a mediator M. (b) A causal graph with two mediators M and L and unmeasured confounders captured in U.

3 Fair Predictive Models in a Batch Setting

Prediction problems in machine learning are typically tackled from the perspective of nonparametric risk minimization and the “train-and-test” framework. Here, we instead take the perspective of maximum likelihood and missing data, i.e., we treat unknown outcomes as missing values which we hope to impute in a way that is consistent with our specified likelihood for the entire data set. Our motivation for doing so is the nature of our constrained prediction problem. Specifically, our causal constraints contain “nuisance” components (conditional expectations and conditional distributions derived from the observed data distribution) which must be modeled correctly to ensure the causal effects are reliably estimated. In the subsequent prediction step, we should predict in a way that is consistent with what has already been modeled – or else we fail to exploit all the information we have already committed to in the constraint estimation step. We chose the maximum likelihood framework as the most natural and simplest approach to accomplish this. Alternative methods for coherently combining nuisance estimation with nonparametric risk minimization are left to future work.

Unlike Nabi and Shpitser (2018), we consider a batch prediction setting – this allows us to avoid the inefficient averaging described in the previous section. In our case, historical data (of sample size n) consists of observations on (Y, X) and new instances (of size m) comprise a set of observations with just X. The outcome labels for new instances are missing data which we aim to predict, subject to fairness constraints. Instead of training our constrained model on historical data alone, we train on the combination of historical data and new instances. This seems complicated since the observed data likelihood for the combined data set includes some complete rows and some partially incomplete rows. However, we can borrow ideas from the literature on missing data to accomplish this task. Specifically, we can impute missing outcomes ("labels") using appropriate functions of observed data. In this paper we assume the labels are missing at random (MAR) (Little and Rubin, 2002). However, our methods extend to any identifiable missing not at random (MNAR) model. Let the random variable R denote the missingness status of the outcome variable for each instance. That is, R = 1 for all rows in the historical data (since Y is observed) and R = 0 for all rows in the new instances. Then the observed data likelihood is ∏_{i=1}^{n+m} p(x_i) p(y_i | x_i)^{r_i}.
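A minimal sketch of this likelihood, assuming a logistic model for Y | X with a single scalar feature (our own toy specification): rows with r_i = 0 contribute no outcome terms to the score, and the fitted model then supplies risk scores for the new instances.

```python
# Observed-data likelihood prod_i p(x_i) p(y_i | x_i)^{r_i} under MAR,
# with an illustrative logistic outcome model.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, m = 500, 200                       # historical rows, new instances
x = rng.normal(size=n + m)
y_full = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))
r = np.array([1] * n + [0] * m)       # R = 1 iff the label is observed
obs = r == 1

def neg_log_lik(theta):
    b0, b1 = theta
    p = 1 / (1 + np.exp(-(b0 + b1 * x[obs])))
    p = np.clip(p, 1e-12, 1 - 1e-12)
    # The p(x) terms carry no theta, so only labelled rows enter the score.
    return -np.sum(y_full[obs] * np.log(p) + (1 - y_full[obs]) * np.log(1 - p))

fit = minimize(neg_log_lik, x0=np.zeros(2))
b0_hat, b1_hat = fit.x

# Predicted risk scores for the new instances (rows with r = 0):
scores = 1 / (1 + np.exp(-(b0_hat + b1_hat * x[~obs])))
```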

This likelihood function describes the probability of the entire data set, though it uses outcome values only from the historical data. We can then maximize the likelihood subject to the specified path-specific constraints, and associate predicted values to the new instances. Note that the setting where new instances arrive sequentially one-at-a-time is a special case of this general setup, which would require retraining on the full combined data after the arrival of each instance. Though this is computationally more intensive than the proposal in Nabi and Shpitser (2018) (where they only train once), it will deliver significantly more accurate predictions because it uses all available information. We will elaborate on this point in Section 4.

The approach to fair prediction outlined in Nabi and Shpitser (2018) suffers from two problems: one general and one specific to our setting here. First, their approach requires solving a computationally challenging constrained optimization problem. Likelihood functions are not in general convex, and the constraints on path-specific effects involve nonlinear and complicated functionals of the observed data distribution. This makes the proposed constrained optimization a daunting task that relies on complex optimization software (or computationally expensive methods such as rejection sampling), which does not always find high quality local optima. Second, Nabi and Shpitser propose to constrain only part of the likelihood. Specifically, they do not constrain the density over the baseline features (since this is high-dimensional and thus implausible to model accurately in their parametric approach). The baseline density is instead estimated by placing mass 1/n at every observed data point. This is sub-optimal in the specific setting we consider, where we do not need to average over constrained variables. Constraining a larger part of the joint should lead to a fair world distribution KL-closer to the observed distribution, which leads to better predictive performance as long as the likelihood is correctly specified. This intuition is formalized in the following result:

Theorem 1

Let p denote the observed data distribution, and let p*_1 and p*_2 denote fair worlds obtained by minimizing KL-divergence to p subject to the same fairness constraint, where the minimization for p*_1 allows more components of the joint likelihood to be adjusted than that for p*_2 (the remaining components being fixed at their observed-data values). Then KL(p ‖ p*_1) ≤ KL(p ‖ p*_2).

In other words, if a larger part of the joint is being constrained in p*_1 compared to p*_2, then p*_1 is at least as close to p as p*_2.

To address the first difficulty, we provide a novel reparameterization of the observed data likelihood such that the causal parameter corresponding to the unfair PSE appears directly in the likelihood. This approach generalizes previous work on reparameterizations implied by structural nested models (Robins, 1999; Tchetgen Tchetgen and Shpitser, 2014) to apply to a wide class of PSEs. With such a reparameterization, the MLE with a constrained PSE simply corresponds to maximum likelihood inference in a submodel where a certain likelihood parameter is set to zero. This type of inference can be implemented with standard software.

To address the second difficulty, we propose an approach to constraining the density p(C). An alternative to fully parametric modeling is to consider nonparametric representations of p(C). It is well known that the nonparametric maximum likelihood estimate of p(C) given a set of i.i.d. draws is the empirical distribution, which places mass 1/n at every observed point. Empirical likelihood methods have been developed for settings where such a nonparametric and parametric (hybrid) likelihood must be maximized subject to moment constraints (Owen, 2001). We describe below how these methods may be adapted to our setting.

Finally, we show how both the reparameterization method and the empirical likelihood method can be combined to yield a constrained optimization method that maximizes a semi-parametric (hybrid reparameterized) likelihood using standard software.

4 Efficient Approximation of Fair Worlds

4.1 Imposing Fairness Constraints With Reparameterized Likelihoods

In this section, we describe how to reparameterize the observed data likelihood in terms of causal parameters that correspond to the effect of A on Y along certain causal pathways. The results presented in the following theorem greatly simplify the constrained optimization problem shown in (2) in settings where the PSE includes the direct influence of A on Y. This is due to the fact that the constrained parameter, corresponding to the PSE of interest, now appears as a single coefficient in the outcome regression model.

Theorem 2

Assume the observed data distribution p(Y, M, A, C) is induced by a causal model, where the feature set includes pre-treatment measures C, treatment A, and post-treatment pre-outcome mediators M. Let p(Y(π, a, a′)) denote the potential outcome distribution that corresponds to the effect of A on Y along the proper causal paths in π, where π includes the direct influence of A on Y, and let ψ(a, a′) denote the identifying functional for E[Y(π, a, a′)] obtained from the edge g-formula in (2.1), where the term p(Y | A, M, C) is evaluated at A = a. Then the outcome regression can be written as follows:

E[Y | A = a, M = m, C = c] = θ·a + f(m, c),

where f(m, c) = E[Y | A = 0, M = m, C = c] and θ is a scalar parameter. Furthermore, θ corresponds to the π-specific effect of A on Y.

To illustrate the above reparameterization, consider the graph in Fig. 1(b), discussed in Nabi and Shpitser (2018) and Chiappa (2019). Assume the direct path A → Y and the paths through M of A on Y are the impermissible pathways (depicted with green edges). The corresponding PSE is encoded by a counterfactual contrast with respect to Y(π, a, a′). The reparameterization in Theorem 2 amounts to an outcome regression of the form:

E[Y | A = a, M = m, L = l, C = c] = θ·a + f(m, l, c),

where θ represents the PSE of interest; see the appendix for more details. A special case of this reparameterization, when π includes only the direct edge A → Y, is implicit in the work of Tchetgen Tchetgen and Shpitser (2014).

Under linearity assumptions, the PSE of interest in Fig. 1(b) has a simple form. Assume the data generating process in Fig. 1(b) is the same as the one given in display (2) of Chiappa (2019). In this case, the PSE is a sum of products of the path coefficients along the paths in π, and our reparameterization recovers it as the single coefficient θ in the outcome regression.
In order to move away from the linear setting and exploit more flexible techniques, Chiappa (2019) makes assumptions on the latent variables. However, such assumptions are often hard to verify in practice. In contrast, our result is entirely nonparametric and does not rely on any assumptions beyond what is encoded in the causal DAG.

By Theorem 2, the constrained optimization problem in eq. (2) simplifies significantly to the following optimization problem:

α̂ = arg max_α ∑_{i=1}^n log L(y_i, x_i; α)  subject to  ε_l ≤ θ ≤ ε_u,

where the PSE parameter θ is a coordinate of α, so the constraint is a simple box constraint.
In the prediction setting, i.e., finding optimal parameters for p(Y | A, M, C), this amounts to an unconstrained maximum likelihood problem with the outcome regression taking the specific form:

E[Y | A = a, M = m, C = c] = θ·a + f(m, c), with θ set to zero,

where f is parameterized by a finite-dimensional parameter. In practice, p(C) is replaced for each c_i in the data with its empirical approximation, since a parametric specification of p(C) is not feasible. In the next section, we explain how p(C) can be incorporated into the constrained optimization problem using empirical likelihood methods.
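The practical payoff can be sketched in a few lines (a linear-Gaussian toy of our own, not the paper's specification): when the PSE of interest appears as a single coefficient θ in the outcome regression, imposing θ = 0 is just an ordinary unconstrained fit with the corresponding term removed.

```python
# Reparameterized constrained MLE in the linear case: setting theta = 0
# means fitting the outcome regression without the A term.
import numpy as np

rng = np.random.default_rng(2)
n = 1000
c = rng.normal(size=n)
a = rng.binomial(1, 0.5, size=n)
m = 0.8 * a + 0.5 * c + rng.normal(size=n)
y = 1.0 + 0.7 * a + 0.6 * m + 0.3 * c + rng.normal(size=n)

# Constrained fit (theta = 0): drop the A column entirely.
X0 = np.column_stack([np.ones(n), m, c])
beta_fair = np.linalg.lstsq(X0, y, rcond=None)[0]

# Unconstrained fit for comparison: theta is the coefficient on A.
X1 = np.column_stack([np.ones(n), a, m, c])
beta_full = np.linalg.lstsq(X1, y, rcond=None)[0]
```

No constrained-optimization solver is needed at all; the box constraint θ = 0 is absorbed into the model specification.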

4.2 Imposing Fairness Constraints With Hybrid Likelihoods

In light of Theorem 1, we are interested in constraining the nonparametric component p(C) of the likelihood. Following work in Owen (2001), we use hybrid/semi-parametric empirical likelihood methods to estimate p(C) nonparametrically, which is a novel idea in the fairness setting. First, according to Theorem 1, constraining p(C) brings our learned distribution closer to the observed (unfair) distribution, and hence results in improved model performance, as we demonstrate in our simulations. Second, p(C) is often a high-dimensional object that is difficult to estimate parametrically due to the curse of dimensionality. For simplicity of presentation, we focus on the DAG in Fig. 1(a), and the constraint represented by the NDE, although the methods we describe generalize without difficulty to arbitrary causal models and constraints represented by arbitrary PSEs.

Let (c_1, …, c_n) be independent random vectors with common distribution p(C). We assume known parametric forms for p(Y | A, M, C; β_1) and p(M | A, C; β_2), and leave p(C) unrestricted. Assuming the unfair effect is the NDE, the only constraint on the observed distribution is for the NDE to be zero. Let p(C = c_i) be parameterized by w_i, for i = 1, …, n, respectively. The direct effect can then be identified by E[g(C; β)], where

g(C; β) = ∑_m { E[Y | A = 1, m, C; β_1] − E[Y | A = 0, m, C; β_1] } p(m | A = 0, C; β_2).   (6)
The profile empirical likelihood ratio parameters (ŵ, β̂) are then given by

(ŵ, β̂) = arg max_{w, β} ∑_{i=1}^n log w_i + ∑_{i=1}^n log p(y_i, m_i | a_i, c_i; β),
subject to w_i ≥ 0, ∑_i w_i = 1, and ∑_i w_i g(c_i; β) = 0.   (7)
The above optimization problem involves a semi-parametric hybrid likelihood (Owen, 2001) that contains both nonparametric and parametric terms. In order to solve it (formulated in terms of both the w and β parameters), we can apply the Lagrange multiplier method and solve its dual form (formulated in terms of β and the Lagrange multipliers); see the appendix for more details. Empirical likelihood methods extend naturally to imposing constraints on arbitrary PSEs, since these can be written in the form E[g(C; β)] = 0 for some function g.
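For a fixed β, the inner empirical likelihood step has a well-known dual solution (Owen, 2001): w_i = 1 / (n (1 + λ g_i)), with λ the root of a monotone one-dimensional equation. A sketch, using a synthetic moment vector g in place of g(c_i; β̂):

```python
# Inner EL step: maximize sum_i log w_i subject to sum_i w_i = 1 and
# sum_i w_i g_i = 0, solved via the Lagrange dual. The vector g here is
# synthetic, standing in for the NDE moment g(c_i; beta).
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(3)
g = rng.normal(loc=0.3, scale=1.0, size=200)

def dual_eq(lam):
    # Root of this (monotone decreasing) equation gives the multiplier.
    return np.sum(g / (1 + lam * g))

# lam must keep every 1 + lam * g_i > 0; bracket inside that region.
lo = -1 / g.max() + 1e-6
hi = -1 / g.min() - 1e-6
lam = brentq(dual_eq, lo, hi)
w = 1 / (len(g) * (1 + lam * g))
```

At the root, the weights are automatically positive, sum to one, and satisfy the moment constraint exactly; the weighted mean of g is pulled to zero by up-weighting observations whose moment has the opposite sign of the sample mean.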

If outcomes are missing at random, the NDE is identified by E[g(C; β)] as in (6), with the outcome model term replaced by E[Y | A, M, C, R = 1], which equals E[Y | A, M, C] under MAR:

g(C; β) = ∑_m { E[Y | A = 1, m, C, R = 1; β_1] − E[Y | A = 0, m, C, R = 1; β_1] } p(m | A = 0, C; β_2).

The resulting functional is then used in the profile empirical likelihood in (7).

Unlike the standard unconstrained prediction setting, where it is common to use nonparametric methods to estimate an arbitrary regression function, our task requires a combination of prediction and estimation of the relevant causal parameter (constraint). Estimating the causal parameter requires estimating certain nuisance components (like g in eq. (6)), which we choose to do parametrically in part because we desire certain frequentist properties, namely fast rates of convergence. More fundamentally, the empirical likelihood optimization problem in (7) finds optimal parameter values (ŵ, β̂), where β appears also in the constraint ∑_i w_i g(c_i; β) = 0. That is, the structure of the empirical likelihood optimization problem requires that the p(Y | A, M, C) and p(M | A, C) models are specified parametrically. Though some combination of nonparametric risk minimization and empirical likelihood would be an interesting extension, how to accomplish this is an open question.

4.3 Imposing Fairness Constraints With Hybrid Reparameterized Likelihoods

In Section 4.1, we reformulated the constrained optimization problem of interest by rewriting the likelihood in terms of the parameters we were interested in constraining, and directly setting those parameters to zero. However, we did not place any constraints on p(C). In Section 4.2, we used hybrid likelihoods to constrain a nonparametric estimate of p(C), but did not provide a convenient reparameterization of the likelihood in terms of relevant parameters. In this section we describe an approach to optimizing a hybrid reparameterized likelihood that combines the advantages of both proposals. This allows us to constrain the entire likelihood and do so with standard maximum likelihood software, since the constraint we must satisfy directly corresponds to a parameter in the hybrid likelihood.

For simplicity of presentation, we again focus on constraining the NDE, although the methods we describe generalize without difficulty to arbitrary constraints represented by arbitrary PSEs. The direct effect can then be estimated by ∑_i w_i g(c_i; β), where g is given in (6), and the reparameterized outcome regression is given in (5). Assuming the reparameterization in (5), g is a function of the w_i's as well. The profile empirical likelihood ratio (ŵ, β̂) in this setting is then given by

(ŵ, β̂) = arg max_{w, β} ∑_{i=1}^n log w_i + ∑_{i=1}^n log p(y_i, m_i | a_i, c_i; β, w),
subject to w_i ≥ 0, ∑_i w_i = 1, and ∑_i w_i g(c_i; β, w) = 0.   (8)
Unlike the constrained optimization problem in (7), it is not straightforward to find the dual form of the optimization problem in (8), which is the standard approach for solving such problems in the empirical likelihood literature. The reason is that w appears in multiple places in the constraint corresponding to setting the PSE to zero; that is, g is now a function of both β and the w_i's. As an alternative, we provide a heuristic approach for optimizing (8) via an iterative procedure that starts with an initialization of β and the w_i's, and at the t-th iteration updates the values of β and the w_i's by treating g as a function of the previous iteration's values. The procedure terminates when the difference between two successive updates is sufficiently small. In Algorithm 1, we provide a detailed description of our proposed iterative procedure, which behaves well in experiments.

Input: data {(y_i, m_i, a_i, c_i)}_{i=1}^n and specification of a PSE of the form ∑_i w_i g(c_i; β, w) = θ.

Output: (ŵ, β̂) by solving (8).

1: Pick starting values β^(0) and w^(0).
2: At the t-th iteration, given fixed β^(t−1) and w^(t−1), estimate the following (in order):
  • θ^(t), by solving ∑_i w_i^(t−1) g(c_i; β^(t−1), w^(t−1)) = θ, which is a monotone function in θ.
  • w^(t), using the empirical likelihood dual with the moment function evaluated at β^(t−1) and θ^(t).
  • β^(t), by maximizing the profile likelihood ∑_i log w_i^(t) + ∑_i log p(y_i, m_i | a_i, c_i; β, w^(t)),
where the constrained PSE parameter is held fixed at its target value.
3: Repeat Step 2 until convergence.
Algorithm 1: Hybrid Reparameterized Likelihood
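The following toy illustrates one pass of the hybrid scheme under assumptions of our own (a linear outcome model with an A × C interaction, so the NDE moment g(c) = β_a + β_ac·c genuinely varies with C): fit the parametric part by least squares, then solve the empirical likelihood dual so the weighted NDE is exactly zero. A full implementation would iterate these steps as in Algorithm 1.

```python
# One pass of the hybrid likelihood scheme on a toy model with an
# A x C interaction, so the NDE moment depends on C.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(5)
n = 1000
c = rng.normal(size=n)
a = rng.binomial(1, 0.5, size=n)
m = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * a + 0.3 * c))))
y = 1.0 + 0.6 * a + 0.4 * a * c + 0.5 * m + 0.3 * c + rng.normal(size=n)

# Parametric step: fit E[Y|a,m,c] = b0 + ba*a + bac*a*c + bm*m + bc*c.
# With this specification the NDE moment is g(c) = ba + bac*c, which
# does not involve the M-model (the A-M interaction is absent).
X = np.column_stack([np.ones(n), a, a * c, m, c])
b0, ba, bac, bm, bc = np.linalg.lstsq(X, y, rcond=None)[0]
g = ba + bac * c

# Nonparametric step: EL weights on C so the weighted NDE is zero.
lo, hi = -1 / g.max() + 1e-9, -1 / g.min() - 1e-9
lam = brentq(lambda t: np.sum(g / (1 + t * g)), lo, hi)
w = 1 / (n * (1 + lam * g))   # reweighted empirical p(C)
```

Because the unweighted mean of g is far from zero here, the resulting weights are genuinely non-uniform; they tilt the empirical distribution of C toward values where the conditional direct effect is small.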

5 Experiments

Method | Estimator | Direct Effect | Log-likelihood | KL(p ‖ p*) | MSE
Constrained MLE G-formula
Unconstrained MLE G-formula
Table 1: Comparing different versions of p* estimated by constraining different parts of the likelihood.
Likelihood | Method | Direct Effect | KL(p ‖ p*) | MSE
: Unconstrained MLE
: Constrained MLE (sec. 2.2)
: Reparameterized MLE (sec. 4.1)
: Hybrid MLE (sec. 4.2)
: Hybrid reparameterized MLE (sec. 4.3)
Table 2: Evaluating different estimation methods by KL divergence and predictive accuracy (MSE).

Given Theorem 1, the accuracy of the prediction procedure depends on which components of the likelihood are constrained, and following Nabi and Shpitser (2018) this depends on the chosen estimator ĝ. Here, we illustrate this dependence via experiments by considering four consistent estimators of the NDE presented in Tchetgen Tchetgen and Shpitser (2012) (assuming the model shown in Fig. 1(a) is correct). We fit the models p(Y | A, M, C), p(M | A, C), and p(A | C) by maximum likelihood. The first estimator (G-formula) is the MLE plug-in estimator and uses the p(Y | A, M, C) and p(M | A, C) models to estimate the NDE. The second is the inverse probability weighted (IPW) estimator, which uses the p(A | C) and p(M | A, C) models. The "mixed" estimator uses the p(Y | A, M, C) and p(A | C) models, and the augmented IPW estimator (AIPW) uses all three models. See the appendix for details on these estimators.
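The contrast between estimators can be sketched on synthetic data (the data-generating values here are our own; both estimators are consistent under this model, so they should roughly agree): the plug-in g-formula uses the outcome and mediator models, while IPW uses the treatment and mediator models.

```python
# Two consistent NDE estimators on synthetic Fig. 1(a)-style data:
# plug-in g-formula vs. inverse probability weighting (IPW).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n = 5000
c = rng.normal(size=n)
a = rng.binomial(1, 1 / (1 + np.exp(-0.5 * c)))
m = rng.binomial(1, 1 / (1 + np.exp(-(0.2 + 0.8 * a + 0.3 * c))))
y = 0.1 + 0.3 * a + 0.4 * m + 0.2 * c + rng.normal(size=n)   # true NDE = 0.3

def fit_logistic(X, t):
    def nll(b):
        p = np.clip(1 / (1 + np.exp(-(X @ b))), 1e-12, 1 - 1e-12)
        return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))
    return minimize(nll, np.zeros(X.shape[1])).x

b_m = fit_logistic(np.column_stack([np.ones(n), a, c]), m)   # p(M=1 | A, C)
b_a = fit_logistic(np.column_stack([np.ones(n), c]), a)      # p(A=1 | C)
b_y = np.linalg.lstsq(np.column_stack([np.ones(n), a, m, c]),
                      y, rcond=None)[0]                      # E[Y | A, M, C]

def pm1(av):                      # fitted p(M = 1 | A = av, C)
    return 1 / (1 + np.exp(-(b_m[0] + b_m[1] * av + b_m[2] * c)))

def ey(av, mv):                   # fitted E[Y | A = av, M = mv, C]
    return b_y[0] + b_y[1] * av + b_y[2] * mv + b_y[3] * c

def psi_gformula(av, ap):         # E[Y(a, M(a'))], plug-in g-formula
    return np.mean(ey(av, 1) * pm1(ap) + ey(av, 0) * (1 - pm1(ap)))

def psi_ipw(av, ap):              # E[Y(a, M(a'))], IPW form
    pa1 = 1 / (1 + np.exp(-(b_a[0] + b_a[1] * c)))
    pa = pa1 if av == 1 else 1 - pa1
    dens = lambda z: np.where(m == 1, pm1(z), 1 - pm1(z))
    return np.mean((a == av) / pa * dens(ap) / dens(av) * y)

nde_g = psi_gformula(1, 0) - psi_gformula(0, 0)
nde_ipw = psi_ipw(1, 0) - psi_ipw(0, 0)
```

Both estimates should land near the true NDE of 0.3; they constrain different parts of the likelihood, which is exactly why the resulting fair worlds in Table 1 differ.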

We generated a sample of size n using the data generating process described in the appendix. We approximate the fair world, p*, by constrained MLE as given in Section 2. We estimated the NDE using the four methods described above and evaluated the performance of the approximated p* in each case. In Table 1 we show the estimated NDE with respect to p*, the log-likelihood, the KL-divergence between p and p*, and the mean squared error between the observed outcomes and the predicted ones. We contrast these results with the unconstrained prediction model. Unconstrained MLE is KL-closest to the true distribution and yields the lowest MSE, as expected. However, it suffers from being unfair: the estimated NDE lies outside the tolerable range. AIPW produces the second closest approximation to the true distribution while being fair. However, the MSE under AIPW is relatively large, since new instances are being mapped and more information is averaged out from the predictions. The approximated fair distributions under the other three estimators are KL-farther from the true distribution, and the accuracy of prediction varies, underscoring how the performance of the learned prediction model depends strongly on what part of the information is being averaged out and which estimator is being used.

Next, we illustrate that even in simple settings our proposed methods for solving constrained maximum likelihood problems considerably outperform the existing method described in Nabi and Shpitser (2018). We will use continuous outcomes for simplicity, but our results are not substantially affected if outcomes are discrete. We generated synthetic data (with missing outcomes) according to the causal model shown in Fig. 1(a), where some variables are binary and others are continuous. The model specification details are reported in Appendix D and the code is attached to the submission. For illustration purposes, we assume that the direct effect of the sensitive feature A on the outcome Y is unfair and estimate it via the g-formula. We approximate the fair world, p*, by constrained MLE using the three methods described in Section 4 and contrast them with the constrained MLE described in Section 2 as well as unconstrained MLE. We evaluated the performance of all five methods by computing the direct effect with respect to p*, the KL-divergence between p* and p, and the mean squared error between the observed and predicted outcomes.

Results are displayed in Table 2 (averaged over 20 repetitions). We see that all three proposed methods achieve an approximation to the fair distribution p* that is KL-closer to the true unfair distribution p, compared to standard constrained MLE. Using the reparameterized MLE by itself requires averaging over the constrained covariates as in Nabi and Shpitser (2018), so there is only minimal improvement in prediction accuracy (measured by MSE). However, the last two methods involve prediction in batch mode as described above – that is, they use all information in the data – and so can achieve substantial improvements in prediction accuracy.

6 Conclusion

Imposing hard fairness constraints on predictive models involves a balance of parametric modeling, nonparametric methods, and constrained optimization. In this paper we have proposed two innovations to make the problem easier and make predictions more accurate: a reparameterization of the likelihood such that nonlinear constraints appear explicitly as likelihood parameters constrained to be zero, and an incorporation of techniques from empirical likelihood theory to make the constrained distribution closer to the unconstrained unfair distribution. Our simulations show that even in a relatively simple setting, we can improve significantly on prior proposals, achieving prediction performance comparable to unconstrained (unfair) maximum likelihood, particularly with the hybrid approach. Though we focus primarily on the path-specific fairness constraints proposed in Nabi and Shpitser (2018), the ideas presented here should be applicable more broadly to fair prediction proposals that require imposing constraints on predictive models. At this stage, our method which combines reparameterization with hybrid likelihood is somewhat heuristic; in future work, we hope to develop an approach for optimizing EL weights and likelihood parameters jointly without the need for iteration.


  • Chiappa (2019) Silvia Chiappa. Path-specific counterfactual fairness. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
  • Corbett-Davies et al. (2017) Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 797–806, 2017.
  • Feldman et al. (2015) Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259–268, 2015.
  • Hardt et al. (2016) Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Advances In Neural Information Processing Systems, pages 3315–3323, 2016.
  • Jabbari et al. (2017) Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Fairness in reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2017.
  • Kamiran et al. (2013) Faisal Kamiran, Indre Zliobaite, and Toon Calders. Quantifying explainable discrimination and removing illegal discrimination in automated decision making. Knowledge and Information Systems, 35(3):613–644, 2013.
  • Kappen et al. (2018) Teus H Kappen, Wilton A van Klei, Leo van Wolfswinkel, Cor J Kalkman, Yvonne Vergouwe, and Karel GM Moons. Evaluating the impact of prediction models: lessons learned, challenges, and recommendations. Diagnostic and Prognostic Research, 2(1):11, 2018.
  • Khandani et al. (2010) Amir E. Khandani, Adlar J. Kim, and Andrew W. Lo. Consumer credit-risk models via machine learning algorithms. Journal of Banking & Finance, 34:2767–2787, 2010.
  • Kusner et al. (2017) Matt J. Kusner, Joshua R. Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, 2017.
  • Little and Rubin (2002) Roderick J.A. Little and Donald B. Rubin. Statistical Analysis with Missing Data. Wiley, 2002.
  • Mitchell et al. (2018) Shira Mitchell, Eric Potash, and Solon Barocas. Prediction-based decisions and fairness: A catalogue of choices, assumptions, and definitions. arXiv preprint arXiv:1811.07867, 2018.
  • Nabi and Shpitser (2018) Razieh Nabi and Ilya Shpitser. Fair inference on outcomes. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Nabi et al. (2018) Razieh Nabi, Phyllis Kanki, and Ilya Shpitser. Estimation of personalized effects associated with causal pathways. In Proceedings of the Thirty Fourth Conference on Uncertainty in Artificial Intelligence (UAI-34th). AUAI Press, 2018.
  • Nabi et al. (2019) Razieh Nabi, Daniel Malinsky, and Ilya Shpitser. Learning optimal fair policies. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
  • Owen (2001) Art Owen. Empirical Likelihood. Chapman & Hall, 2001.
  • Pearl (2001) Judea Pearl. Direct and indirect effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 411–420, 2001.
  • Pearl (2009) Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009.
  • Pedreshi et al. (2008) Dino Pedreshi, Salvatore Ruggieri, and Franco Turini. Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 560–568, 2008.
  • Perry et al. (2013) Walter L. Perry, Brian McInnis, Carter C. Price, Susan Smith, and John S. Hollywood. Predictive policing: The role of crime forecasting in law enforcement operations. RAND Corporation, http://www.rand.org/pubs/research_reports/RR233.html, 2013.
  • Qin (2017) Jing Qin. Biased Sampling, Over-identified Parameter Problems and Beyond. Springer, 2017.
  • Robins (1999) James M. Robins. Marginal structural models versus structural nested models as tools for causal inference. In Statistical Models in Epidemiology: The Environment and Clinical Trials. NY: Springer-Verlag, 1999.
  • Shpitser (2013) Ilya Shpitser. Counterfactual graphical models for longitudinal mediation analysis with unobserved confounding. Cognitive Science (Rumelhart special issue), 37:1011–1035, 2013.
  • Shpitser and Sherman (2018) Ilya Shpitser and Eli Sherman. Identification of personalized effects associated with causal pathways. In Proceedings of the 34th Annual Conference on Uncertainty in Artificial Intelligence, 2018.
  • Shpitser and Tchetgen Tchetgen (2016) Ilya Shpitser and Eric J. Tchetgen Tchetgen. Causal inference with a graphical hierarchy of interventions. Annals of Statistics, 44(6):2433–2466, 2016.
  • Tchetgen Tchetgen and Shpitser (2012) Eric J. Tchetgen Tchetgen and Ilya Shpitser. Semiparametric theory for causal mediation analysis: efficiency bounds, multiple robustness, and sensitivity analysis. Annals of Statistics, 2012.
  • Tchetgen Tchetgen and Shpitser (2014) Eric J. Tchetgen Tchetgen and Ilya Shpitser. Estimation of a semiparametric natural direct effect model incorporating baseline covariates. Biometrika, 101(4):849–864, 2014.
  • Wasserman (2013) Larry Wasserman. All of statistics: a concise course in statistical inference. Springer Science & Business Media, 2013.
  • Zhang and Bareinboim (2018) Junzhe Zhang and Elias Bareinboim. Fairness in decision-making – the causal explanation formula. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Zhang et al. (2017) Lu Zhang, Yongkai Wu, and Xintao Wu. A causal framework for discovering and removing direct and indirect discrimination. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pages 3929–3935, 2017.


In Appendix A, we provide additional details for the direct effect reparameterization example (illustrating Theorem 2) discussed in the main paper. In Appendix B, we provide a brief overview of empirical likelihood methods and some additional theoretical details useful for understanding our proposed hybrid likelihood approach. In Appendix C, we report the modeling assumptions and parameter settings used in our simulation experiments. In Appendix D, we describe the estimation strategies used in the simulations. Appendix E contains proofs of our theorems. For a clearer presentation of the materials in this supplement, we use a one-column format.

A. Reparameterized Likelihood Example: Additional Details

Consider the DAG in Fig. 1(a), and assume the natural direct effect is the unfair PSE we wish to constrain to be zero. Theorem 2 leads to the following reparameterization of the regression function:

The coefficient corresponds to the direct effect, since


The observed data likelihood is given by

where has mean

The constrained optimization problem in eq. (2) then simplifies to the following optimization problem:
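As a concrete illustration, consider a linear outcome model with no exposure–mediator interaction, with sensitive feature $A$, mediator $M$, baseline covariates $C$, and outcome $Y$ (a simplifying working model chosen for exposition, which may differ from the specification used in our experiments):

$$\mathbb{E}[Y \mid A, M, C] = \omega_0 + \omega_A A + \omega_M M + \omega_C^\top C.$$

Under this specification the natural direct effect equals the coefficient $\omega_A$, so the fairness constraint $\text{NDE} = 0$ reduces to the box constraint $\omega_A = 0$, and constrained MLE amounts to ordinary MLE with $\omega_A$ fixed at zero.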

B. Hybrid Likelihood: Overview and Details

Empirical Likelihood

We briefly review empirical likelihood methods, described in detail in Owen (2001). Let $X_1, \dots, X_n \in \mathbb{R}^d$ be independent random vectors with common distribution $F_0$. Let $F$ be any CDF, where $p_i = P(X = X_i)$, and let $F_n$ denote the empirical distribution. Suppose that we are interested in $F_0$ through $\theta_0 = T(F_0)$, where $T$ is a real-valued function of the distribution. Proceeding by analogy to parametric MLE, the nonparametric MLE of $F_0$ is $F_n$. The nonparametric likelihood ratio, $R(F) = L(F)/L(F_n)$, is used as a basis for hypothesis testing and deriving confidence intervals. The profile likelihood ratio function is defined as

$$\mathcal{R}(\theta) = \sup\{R(F) : T(F) = \theta,\ F \in \mathcal{F}\},$$

where $\mathcal{F}$ denotes the set of all distributions on $\mathbb{R}^d$.

Often, $\theta_0$ is the solution to an estimating equation of the form $\mathbb{E}[m(X; \theta)] = 0$. A natural estimator for $\theta_0$ is produced by solving the empirical estimating equation $\frac{1}{n}\sum_{i=1}^n m(X_i; \hat{\theta}) = 0$. Assuming $p_i \geq 0$ and $\sum_{i=1}^n p_i = 1$, the profile empirical likelihood ratio function of $\theta$ is defined as

$$\mathcal{R}(\theta) = \max\Big\{\prod_{i=1}^n n p_i \;:\; \sum_{i=1}^n p_i\, m(X_i; \theta) = 0,\ p_i \geq 0,\ \sum_{i=1}^n p_i = 1\Big\}.$$

Since maximizing the likelihood is equivalent to maximizing the logarithm of the likelihood, the profile empirical likelihood ratio can be rewritten in terms of log likelihood as follows:

$$\log \mathcal{R}(\theta) = \max\Big\{\sum_{i=1}^n \log(n p_i) \;:\; \sum_{i=1}^n p_i\, m(X_i; \theta) = 0,\ p_i \geq 0,\ \sum_{i=1}^n p_i = 1\Big\}. \tag{10}$$

In order to solve the above optimization problem, we can apply the Lagrange multiplier method:

$$G = \sum_{i=1}^n \log(n p_i) - n\lambda^\top \sum_{i=1}^n p_i\, m(X_i; \theta) + \gamma\Big(\sum_{i=1}^n p_i - 1\Big),$$

where $\lambda$ and $\gamma$ are the Lagrange multipliers. We take the derivatives of $G$ with respect to the $p_i$'s and set them to zero. Solving the resulting system of equations reveals that $\gamma = -n$, and

$$p_i = \frac{1}{n} \cdot \frac{1}{1 + \lambda^\top m(X_i; \theta)}, \tag{11}$$

where $\lambda$ is the solution to

$$\frac{1}{n} \sum_{i=1}^n \frac{m(X_i; \theta)}{1 + \lambda^\top m(X_i; \theta)} = 0, \tag{12}$$

which is a monotone function in $\lambda$. Maximizing the profile empirical log-likelihood ratio in (10) is equivalent to maximizing the following (substituting (11) into (10)):

$$\log \mathcal{R}(\theta) = -\sum_{i=1}^n \log\big(1 + \lambda^\top m(X_i; \theta)\big). \tag{13}$$

Maximizing over the small set of parameters $(\theta, \lambda)$ is a much simpler optimization problem than maximizing (10) over the $n$ unknowns $p_1, \dots, p_n$. Equation (13) is known as the dual representation of (10). See Owen (2001) for more details.
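To make the dual computation concrete, the following sketch (not from the paper; the mean estimating function $m(x; \theta) = x - \theta$ is chosen purely for illustration) solves the monotone equation for $\lambda$ by bisection and then evaluates the dual objective (13):

```python
import math

def el_log_ratio(x, theta, tol=1e-10):
    """Profile empirical log-likelihood ratio log R(theta) for the mean,
    using the estimating function m(x; theta) = x - theta (illustrative)."""
    n = len(x)
    m = [xi - theta for xi in x]
    # theta must lie strictly inside the convex hull of the data.
    if min(m) >= 0 or max(m) <= 0:
        return float("-inf")
    # Valid lambda keeps every weight p_i = 1/(n*(1 + lam*m_i)) in (0, 1].
    lo = (1.0 / n - 1.0) / max(m) + tol
    hi = (1.0 / n - 1.0) / min(m) - tol
    # The score sum_i m_i / (1 + lam*m_i) is monotone decreasing in lam,
    # so bisection finds the root of the constraint equation.
    for _ in range(200):
        lam = 0.5 * (lo + hi)
        if sum(mi / (1.0 + lam * mi) for mi in m) > 0:
            lo = lam
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    # Dual objective: log R(theta) = -sum_i log(1 + lam*m_i).
    return -sum(math.log(1.0 + lam * mi) for mi in m)
```

At $\theta$ equal to the sample mean the constraint is satisfied with $\lambda = 0$, so the log ratio attains its maximum of zero; moving $\theta$ away from the mean drives the ratio down, which is the basis of EL hypothesis tests and confidence intervals.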

Hybrid Likelihood

Now, consider independent pairs $(X_i, Y_i)$, $i = 1, \dots, n$. Suppose that all observations are independent, and that we have a correctly specified parametric model $f(y \mid x; \beta)$ for $Y$ given $X$, but the marginal distribution of $X$ is unspecified. Let $p_i = P(X = X_i)$. A natural approach for estimating $\beta$ and the $p_i$'s is to form a hybrid likelihood that is nonparametric in the distribution of $X$ but parametric in the conditional distribution of $Y$ given $X$:

$$HL(\beta, F) = \prod_{i=1}^n f(Y_i \mid X_i; \beta)\, p_i.$$

Suppose we are interested in a parameter $\theta$ through the estimating equation $\mathbb{E}[m(X, Y; \theta)] = 0$. Hence, the equivalent form of (10) for the profile hybrid likelihood ratio function is as follows:

$$\log \mathcal{R}_h(\theta) = \max_{\beta,\, p_1, \dots, p_n}\Big\{\sum_{i=1}^n \log f(Y_i \mid X_i; \beta) + \sum_{i=1}^n \log(n p_i) \;:\; \sum_{i=1}^n p_i\, m(X_i, Y_i; \theta) = 0,\ p_i \geq 0,\ \sum_{i=1}^n p_i = 1\Big\}.$$

Similar to the empirical likelihood case, we can apply the Lagrange multiplier method to solve the above optimization problem. For more details, see Owen (2001) and Qin (2017).
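A minimal sketch of maximizing a hybrid likelihood in a simple case follows. The Gaussian linear model for $Y$ given $X$ and the single moment constraint on $X$ are illustrative assumptions, not the models used in the paper. When the constraint involves only $X$, the parametric and nonparametric pieces of the hybrid log-likelihood separate: the parametric part is maximized by ordinary least squares, and the weights come from the EL dual.

```python
import math

def ols_fit(x, y):
    """Closed-form intercept and slope for a Gaussian linear model of Y | X."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    return my - b1 * mx, b1

def el_weights(x, c, tol=1e-10):
    """EL weights p_i maximizing sum_i log(n*p_i) subject to
    sum_i p_i*(x_i - c) = 0; assumes min(x) < c < max(x)."""
    n = len(x)
    m = [xi - c for xi in x]
    lo = (1.0 / n - 1.0) / max(m) + tol
    hi = (1.0 / n - 1.0) / min(m) - tol
    for _ in range(200):  # bisection on the monotone score in lambda
        lam = 0.5 * (lo + hi)
        if sum(mi / (1.0 + lam * mi) for mi in m) > 0:
            lo = lam
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    return [1.0 / (n * (1.0 + lam * mi)) for mi in m]

def hybrid_fit(x, y, c):
    """Maximize the hybrid likelihood when the constraint involves only X:
    the problem separates into an OLS fit and an EL weighting step."""
    return ols_fit(x, y), el_weights(x, c)
```

When the constraint couples $X$ and $Y$, or involves $\beta$ itself (as in our fairness setting), the two pieces no longer separate, which is what motivates the iterative scheme described in the main paper.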

C. Simulation Details

Here we report the precise parameter settings used in our simulation studies. We trained our models on a batch size of using the following data generating process, where the outcome is treated as missing on of the data. Mean squared errors in Tables 1 and 2 are computed only on the missing portion of the outcome.


D. Details on Estimation Strategies

Given Theorem 1, the accuracy of the prediction procedure will depend on which parts of are constrained, and following Nabi and Shpitser (2018) this depends on the choice of estimator . Here, we describe several consistent estimators of the NDE, presented in Tchetgen Tchetgen and Shpitser (2012), assuming the model shown in Fig. 1(a) is correct.

G-formula: The first estimator is the MLE plug-in estimator, where we use the outcome and mediator models to estimate the NDE. We fit these models by maximum likelihood, and use the following formula:


Since solving (2) using (16) entails constraining and , classifying a new instance entails using
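The plug-in computation can be sketched as follows. The variable names and the toy working models below are hypothetical stand-ins for fitted regressions (binary $A$ and $M$ are assumed); this is an illustration of the g-formula averaging step, not the paper's implementation.

```python
def nde_plugin(covariates, outcome_mean, mediator_prob):
    """Plug-in (g-formula) estimate of the natural direct effect:
    average over c of sum_m [E(Y|a=1,m,c) - E(Y|a=0,m,c)] * P(M=m|a=0,c).
    `outcome_mean(a, m, c)` and `mediator_prob(a, c)` stand in for
    fitted regression models; A and M are assumed binary."""
    total = 0.0
    for c in covariates:
        p1 = mediator_prob(0, c)  # P(M=1 | A=0, C=c)
        for m, pm in ((0, 1.0 - p1), (1, p1)):
            total += (outcome_mean(1, m, c) - outcome_mean(0, m, c)) * pm
    return total / len(covariates)
```

For instance, with a linear outcome model and no exposure–mediator interaction, the estimate reduces to the coefficient on $A$, matching the reparameterization discussed in Appendix A.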