Assessing Algorithmic Fairness with Unobserved Protected Class Using Data Combination
Nathan Kallus
Cornell University, kallus@cornell.edu
Xiaojie Mao
Cornell University, xm77@cornell.edu
Angela Zhou
Cornell University, az434@cornell.edu
The increasing impact of algorithmic decisions on people’s lives compels us to scrutinize their fairness and, in particular, the disparate impacts that ostensibly colorblind algorithms can have on different groups. Examples include credit decisioning, hiring, advertising, criminal justice, personalized medicine, and targeted policymaking, where in some cases legislative or regulatory frameworks for fairness exist and define specific protected classes. In this paper we study a fundamental challenge to assessing disparate impacts in practice: protected class membership is often not observed in the data. This is particularly a problem in lending and healthcare. We consider the use of an auxiliary dataset, such as the US census, that includes class labels but not decisions or outcomes. We show that a variety of common disparity measures are generally unidentifiable aside from some unrealistic cases, providing a new perspective on the documented biases of popular proxy-based methods. We provide exact characterizations of the sharpest-possible partial identification set of disparities either under no assumptions or when we incorporate mild smoothness constraints. We further provide optimization-based algorithms for computing and visualizing these sets, which enable reliable and robust assessments, an important tool when disparity assessment can have far-reaching policy implications. We demonstrate this in two case studies with real data: mortgage lending and personalized medicine dosing.
Key words: Disparate Impact and Algorithmic Bias; Partial Identification; Proxy Variables; Fractional Optimization; Bayesian Improved Surname Geocoding
The spread of prescriptive analytics and algorithmic decision-making has given rise to urgent ethical and legal imperatives to avoid discrimination and guarantee fairness with respect to protected classes. In advertising, prescriptive algorithms target for maximal impact and revenue (Iyer et al. 2005, Goldfarb and Tucker 2011), but recent studies found gender-based discrimination in who receives ads for STEM careers (Lambrecht and Tucker 2019) and other worrying disparities (Datta et al. 2015, Sweeney 2013). In hiring, algorithms help employers efficiently screen applicants (Miller 2015), but in some cases this can have unintended biases, e.g., against women and minorities (Dastin 2018). In criminal justice, algorithmic recidivism scores allow judges to assess risk (Monahan and Skeem 2016), while recent studies have revealed systematic race-based disparities in error rates (Angwin et al. 2016, Chouldechova 2017). In healthcare, algorithms that allocate resources like care management have been shown to exhibit racial biases (Obermeyer and Mullainathan 2019) and personalized medicine algorithms can offer disparate benefits to different groups (Rajkomar et al. 2018, Goodman et al. 2018). In lending, prescriptive algorithms optimize credit decisions using predicted default risks and their induced disparities are regulated by law (Comptroller of the Currency 2010), leading to legal cases against discriminatory lending (Consumer Financial Protection Bureau 2013).
For regulated decisions, there are two major legal theories of discrimination:

Disparate treatment (Zimmer 1996): informally, intentionally treating an individual differently on the basis of membership in a protected class; and

Disparate impact (Rutherglen 1987): informally, adversely affecting members of one protected class more than another even if by an ostensibly neutral policy.
Thus, prescriptive algorithms that do not take race, gender, or other sensitive attributes as an input may satisfy equal treatment but may still induce disparate impact (Kleinberg et al. 2017). Indeed, many of the disparities found above take the form of unintended disparate impact of ostensibly class-blind prescriptive algorithms. In some contexts, such as hiring, any disparate impact is prohibited, whereas in other contexts, such as lending, disparate impact is a basis for heightened scrutiny and sometimes sanction while some disparities may be justifiable as “business necessary;” see Fig. 2. Table 1 summarizes the protected classes codified by two US fair lending laws: the Fair Housing Act (FHA) and Equal Credit Opportunity Act (ECOA).
Law  FHA  ECOA 

age  X  
color  X  X 
disability  X  
exercised rights under CCPA  X  
familial status (household composition)  X  
gender identity  X  
marital status (single or married)  X  
national origin  X  X 
race  X  X 
recipient of public assistance  X  
religion  X  X 
sex  X  X 
Assessing the disparate impacts of a prescriptive algorithm involves evaluating the difference in the distributions of decision outcomes received by different groups defined in terms of values of the protected class of interest and potentially some ground truth. For example, the demographic disparity metric might measure the difference between the fraction of black loan applicants approved and white loan applicants approved. The opportunity (or, true-positive-rate) disparity metric might measure this difference after restricting to non-defaulting applicants in order to account for baseline differences in business-relevant variables like income (Hardt et al. 2016). (We define precisely these disparity metrics in id1 and discuss them in Fig. 2.) Note that what size of disparity counts as unacceptable depends on the context. While US employment law often uses the “four-fifths rule” (Equal Employment Opportunity Commission 1978), no such rules of thumb exist in fair lending (Comptroller of the Currency 2010, Consumer Financial Protection Bureau 2018). Therefore, any statistical measures of disparity must be considered in the appropriate legal, ethical, and regulatory context. In any case, they must first be measured.
In this paper, we study a fundamental challenge to assessing the disparity induced by prescriptive algorithms in practice:
protected class membership is often not observed in the data.
There may be many reasons for this missingness in practice: legal, operational, and behavioral. In the US financial service industry, lenders are not permitted to collect race and ethnicity information on applicants for non-mortgage products such as credit cards, auto loans, and student loans.^{1} This considerably hinders auditing fair lending for non-mortgage loans, both by internal compliance officers and by regulators (Zhang 2016). Similarly, health plans and health care delivery entities lack race and ethnicity data on most of their enrollees and patients, as a consequence of high data-collection costs and people’s reluctance to reveal their race information for fear of potential discrimination (Weissman and Hasnain-Wynia 2011). This data collection challenge makes monitoring of racial and ethnic differences in care impractical and impedes the progress of healthcare equity reforms (Gaffney and McCormick 2017).
^{1} The US Home Mortgage Disclosure Act (HMDA) authorizes lenders to collect such information for mortgage applicants and co-applicants.
To address this challenge, some methods heuristically use observed proxies to predict and impute unobserved protected class labels. The most (in)famous example is the Bayesian Improved Surname Geocoding (BISG) method. BISG estimates conditional race membership probabilities given surname and geolocation (e.g., census tract or ZIP code) using data from the US census, and then imputes the race labels based on the estimated probabilities. Since its invention (Elliott et al. 2008, 2009), the BISG method has been widely used in assessing racial disparities in health care (Fremont et al. 2005, Weissman and Hasnain-Wynia 2011, Brown et al. 2016, Fremont et al. 2016, Haas et al. 2019). In 2009, the Institute of Medicine also suggested it as an interim strategy until routine collection of relevant data is feasible (Nerenz et al. 2009). Later on, this methodology was adopted by the Consumer Financial Protection Bureau (CFPB) (Consumer Financial Protection Bureau 2014), a regulator in the US financial industry. In December 2013, CFPB’s analysis based on the BISG method supported a $98 million settlement against Ally Bank for harming minority borrowers in the auto loan market (Consumer Financial Protection Bureau 2013).
However, the validity of using proxies for the unobserved protected class for disparity assessment is controversial, and relevant research is still limited. Although advanced proxy methods like BISG are shown to outperform previous proxy methods that use only surname or only geolocation (Elliott et al. 2008, 2009, Consumer Financial Protection Bureau 2014), some researchers recently employed mortgage datasets to reveal biased disparity assessment resulting from using the race proxies in place of true race labels (Baines and Courchane 2014, Zhang 2016). Chen et al. (2019) further attempted to unveil the underlying mechanism for the disparity assessment bias, and they attributed the bias to the joint dependence among lending outcome, geolocation, and race. However, a systematic understanding of the precise limitations of using proxy methods in disparity assessment in general, and possible remedies to the potential biases, are still lacking. Filling in this gap is an important and urgent need, especially given the wide use of proxy methods and the high impact of disparity assessment in the settings where they are used, which motivates our current work.
In this paper, we study the basic statistical identification limits for assessing disparities when protected class labels are unobserved and provide new optimization-based algorithms for obtaining sharp partial-identification bounds on said disparities, which can enable robust and reliable auditing of the disparate impact of prescriptive algorithms. We first formulate the problem from the perspective of combining two datasets:

a main dataset with the decision outcomes, (potentially) true outcomes, and proxy variables, but where the protected class labels are missing; and

an auxiliary dataset with the proxy variable and protected class label, but without the outcomes.
Based on this formulation, we prove that disparity measures are generally unidentifiable from the observed data. In particular, we give necessary and sufficient conditions for identifiability, and we argue that they are too stringent to hold in practice. This implies that, even with state-of-the-art proxy methods, the observed datasets generally do not contain enough information to measure the disparity metrics of interest, no matter how large the sample sizes. Instead, we fully and exactly characterize the partial identification sets for some common disparity measures, i.e., the sets of all simultaneously valid values of the disparity measures that are compatible with the observed data (and optionally some mild assumptions). These sets provide the sharpest-possible pinpointing of true disparities given the data. Using convex optimization techniques, we provide algorithmic procedures to compute and visualize these partial identification sets (or, their convex hulls). In other words, our approach gives set estimates for the disparity measures, as opposed to the spurious point estimates given by previous proxy methods for disparity assessment. Given the unidentifiability of the disparity measures, the point estimates of disparity measures provided by previous approaches are generally biased, and the bias may be highly dependent on ad hoc modelling specifications and thus have unpredictable behavior (Chen et al. 2019). In contrast, our approach fully acknowledges the intrinsic uncertainty^{2} in learning disparity without direct measurements of the protected class, and fully exploits all information available in the observed data (and assumptions).
^{2} This uncertainty is about the lack of identification of the disparity measures from the observed data because some essential information is unobserved (the protected class). It is very different from the uncertainty resulting from finite-sample variability involved in confidence intervals or statistical hypothesis testing. When the sample size grows to infinity, the finite-sample uncertainty shrinks to zero, but the identification uncertainty remains. In this paper, we exclusively focus on the more problematic identification uncertainty and, for simplicity, ignore the eventually vanishing finite-sample uncertainty.
We highlight our primary contributions below:
Problem formulation. We formulate disparity assessment with proxies for unobserved protected class as a data combination problem. This formulation facilitates a principled analysis of the identifiability of the disparity measures.

Identification Conditions. We characterize the necessary and sufficient conditions for point identifiability using data combination of various disparity measures, which are nonlinear functionals. We argue that these conditions are generally too strong to assume in practice.

Characterizing the Partial Identification Set. We exactly characterize the sharp partial identification sets of some common disparity measures under data combination, that is, the smallest set containing all possible values that disparity measures may simultaneously take while still agreeing with the observed data. We further show how to extend this to incorporate additional mild smoothness assumptions that help reduce uncertainty.

Computing the Partial Identification Set. We propose procedures to compute these partial identification sets (or, their convex hulls) based on linear and linear-fractional optimization. These procedures enable us to visualize the disparity sets and to assess the fairness of prescriptive algorithms in practice.

Robust Auditing. These tools facilitate robust and reliable fairness auditing. Since the sets we describe are sharp in that they are the tightest possible characterization of disparity given the data, their size generally captures the amount of uncertainty that remains in evaluating disparity when the protected class is unobserved and only proxies are available. When the observed data is very informative about the disparity measures, the set tends to be small and may still lead to meaningful conclusions regarding the sign and magnitudes of disparity, despite unidentifiability. In contrast, when the observed data is insufficient, the set tends to be large and gives a valuable warning about the risk of drawing conclusions from the fundamentally limited observed data.

Empirical Analysis. We apply our approach in two real case studies: evaluating the racial disparities (1) in mortgage lending decisions and (2) in personalized Warfarin dosing. We demonstrate how adding extra assumptions may decrease the size of partial identification sets of disparity measures, and illustrate how stronger proxies – either for race or for outcomes – can lead to smaller partial identification sets and informative conclusions on disparities.
Main dataset

Surname  ZIP code  Approval  Nondefault
Jones    94122     Y         N
⋮        ⋮         ⋮         ⋮

Auxiliary dataset

Surname  ZIP code  White %  API %
Jones    94122     47%      31%
⋮        ⋮         ⋮        ⋮
We mainly consider four types of relevant variables:
True outcome, $Y \in \{0,1\}$, is a target variable that justifies an optimal decision. In the lending example (Section id1), we denote $Y=1$ for loan applicants who would not default on loan payment if the loan application were approved. $Y$ is not known to decision makers at the time of decision making.

Decision outcome, $\hat{Y} \in \{0,1\}$, is the prescription by either human decision makers or machine learning algorithms, often based on imperfect predictions of $Y$. For example, $\hat{Y}=1$ represents approval of a loan application, which is often based on some prediction of default risk. We call $\hat{Y}=1$ the positive decision, even if it is not favorable in terms of utility (e.g., high medicine dosage).

Protected attribute, $A \in \mathcal{A}$, is a categorical variable (e.g., race or gender). For clear exposition, our convention is to write $A=a$ for a group understood to be generally advantaged and $A=b$ for a disadvantaged group. Take race as an example: the advantaged group usually refers to the majority class (White), and the disadvantaged group refers to any of the minority classes (Black, Hispanic, API, etc.).

Proxy variables, $Z \in \mathcal{Z}$, are a set of additional observed covariates. In proxy methods, these are used to predict $A$. In the BISG example (Section id1), $Z$ stands for surname and geolocation (census tract, ZIP code, county, etc.). The proxy variables can be categorical, continuous, or mixed.
In this paper, we mainly present the binary outcomes (true outcome and decision outcome), but our results can be straightforwardly extended to multileveled outcomes.
We formulate the problem of using proxy methods from a data combination perspective. Specifically, we assume we have two datasets: the main dataset with observations of $(\hat{Y}, Y, Z)$, and the auxiliary dataset with observations of $(A, Z)$. Because we focus on identification uncertainty rather than finite-sample uncertainty, we characterize this by our knowledge of two probability distributions: $\mathbb{P}(\hat{Y}, Y, Z)$ and $\mathbb{P}(A, Z)$. Given a sample, these can then be estimated in some way and plugged in, as we will actually do in id1.
We cannot simply join these two datasets directly for many possible reasons. For example, no unique identifier for individuals (e.g., social security number) exists in both datasets (if one did, it would fall under the setting of a proxy with perfect prediction; see id1). Alternatively, we might not even have individual-level observations but only summary frequency statistics. This is, for example, the case for the BISG proxy (see id1). Because we have only these two separate, unconnected datasets, we do not know the combined joint distribution $\mathbb{P}(A, \hat{Y}, Y, Z)$.
In this paper, we focus on assessing the disparity in the decision $\hat{Y}$ with respect to the protected attribute $A$. There exist a myriad of disparity measures in the literature (Fig. 2), and the appropriate disparity measure may have to vary depending on the context. Thus it is impossible to present all disparity measures. In this paper, we focus on so-called observational group disparity measures, which are widely used in the fair machine learning literature (Corbett-Davies and Goel 2018, Berk et al. 2018). These disparity measures are often formalized as some measure of classification error, and, if we were given observations of true class labels, they could be computed from a within-class confusion matrix of the decision and true outcome.
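Specifically, we consider the following disparity measures:

Demographic Disparity (DD): $\delta_{\mathrm{DD}}(a,b) = \mathbb{P}(\hat{Y}=1 \mid A=a) - \mathbb{P}(\hat{Y}=1 \mid A=b)$

True Positive Rate Disparity (TPRD; aka opportunity disparity): $\delta_{\mathrm{TPRD}}(a,b) = \mathbb{P}(\hat{Y}=1 \mid A=a, Y=1) - \mathbb{P}(\hat{Y}=1 \mid A=b, Y=1)$

True Negative Rate Disparity (TNRD): $\delta_{\mathrm{TNRD}}(a,b) = \mathbb{P}(\hat{Y}=0 \mid A=a, Y=0) - \mathbb{P}(\hat{Y}=0 \mid A=b, Y=0)$

Positive Predictive Value Disparity (PPVD): $\delta_{\mathrm{PPVD}}(a,b) = \mathbb{P}(Y=1 \mid A=a, \hat{Y}=1) - \mathbb{P}(Y=1 \mid A=b, \hat{Y}=1)$

Negative Predictive Value Disparity (NPVD): $\delta_{\mathrm{NPVD}}(a,b) = \mathbb{P}(Y=0 \mid A=a, \hat{Y}=0) - \mathbb{P}(Y=0 \mid A=b, \hat{Y}=0)$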
We interpret these disparity measures using the running example of making lending decisions. DD measures the disparity in within-class average loan approval rate.^{3} TPRD (respectively, TNRD) measures the disparity in the proportions of people who correctly get approved (respectively, rejected) in loan applications between two classes, given their true non-default or default outcome. Compared to DD, TPRD and TNRD only measure the disparity unmediated by existing base disparities in the true outcome $Y$, and for this reason they are considered more appropriate measures of unfairness than demographic disparity (Hardt et al. 2016). In particular, TPRD can be interpreted as disparity in opportunities offered to deserving or qualified individuals. PPVD (respectively, NPVD) measures the disparity in the proportions of approved applicants who pay back their loan (respectively, rejected applicants who default) between two classes. Such disparities can be interpreted as “disparate benefit of the doubt” in an individual having the positive label.
^{3} Strictly speaking, demographic disparity is not based on classification “error” but it can also be computed from the within-class confusion matrices.
We will focus our attention just on DD, TPRD, and TNRD. Indeed, by swapping the roles of $Y$ and $\hat{Y}$ in TPRD and TNRD, all our results can straightforwardly be extended to PPVD and NPVD, respectively. Similarly, disparities based on false negative rate and false positive rate simply differ from TPRD and TNRD by a minus sign, i.e., are given by swapping $a$ and $b$. Finally, our results can be extended to composite measures that combine any of the above disparities with some weights.
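As a reference point, the following minimal sketch computes the five disparity measures in the idealized case where the protected class is observed alongside decisions and outcomes. The function names and the eight-row toy dataset are hypothetical and chosen only for illustration; the sketch also checks the two symmetries noted above (swapping the roles of true outcome and decision, and swapping the two classes).

```python
import numpy as np

def rate(event, given):
    """Empirical P(event | given) for boolean arrays of equal length."""
    return event[given].mean()

def disparities(y, yhat, A, a, b):
    """Group disparities (class a minus class b) from fully labeled data."""
    y, yhat, A = (np.asarray(v) for v in (y, yhat, A))
    everyone = np.ones(len(y), dtype=bool)
    # Each entry: (event whose rate is compared, conditioning event).
    specs = {
        "DD":   (yhat == 1, everyone),
        "TPRD": (yhat == 1, y == 1),
        "TNRD": (yhat == 0, y == 0),
        "PPVD": (y == 1, yhat == 1),
        "NPVD": (y == 0, yhat == 0),
    }
    return {name: rate(ev, cond & (A == a)) - rate(ev, cond & (A == b))
            for name, (ev, cond) in specs.items()}

# Hypothetical toy data: four individuals in each class.
A = np.array(["a"] * 4 + ["b"] * 4)
y = np.array([1, 1, 0, 0, 1, 1, 0, 0])      # true outcome (1 = non-default)
yhat = np.array([1, 1, 1, 0, 1, 0, 0, 0])   # decision (1 = approved)

d = disparities(y, yhat, A, "a", "b")
swapped_roles = disparities(yhat, y, A, "a", "b")    # outcome <-> decision
swapped_groups = disparities(y, yhat, A, "b", "a")   # a <-> b
print(d)
```

On this toy data, TPRD computed with the roles of outcome and decision exchanged coincides with PPVD, and exchanging the two classes flips the sign of every measure. None of this is computable in the setting studied here, where the class labels are missing from the main dataset.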
The main challenge in this paper is to estimate the disparity measures when the protected class $A$ cannot be observed simultaneously with $\hat{Y}$ and $Y$. Instead, we need to rely on proxies for $A$ from the auxiliary data. Note in particular that we focus on auditing these measures of disparate impact, not necessarily on adjusting algorithms to achieve parity with respect to these. Some robust forms of parity may potentially be achieved using our new tools but it may not necessarily be desirable (see Corbett-Davies and Goel 2018 and discussion in Fig. 2).
Generally we use $\mathbb{P}$ to represent probability measures that are clear from the context. We often use $y$, $\hat{y}$, $\alpha$, $z$ as generic values of the random variables $Y$, $\hat{Y}$, $A$, $Z$, respectively. We also use $a$ and $b$ as additional generic values for $A$, where $a$ is generally understood to be a majority or advantaged class label.
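We further define
$$\mu(\alpha) = \mathbb{P}(\hat{Y}=1 \mid A=\alpha), \qquad \mu_{\hat{y}y}(\alpha) = \mathbb{P}(\hat{Y}=\hat{y} \mid A=\alpha, Y=y),$$
so that $\delta_{\mathrm{DD}}(a,b) = \mu(a) - \mu(b)$, $\delta_{\mathrm{TPRD}}(a,b) = \mu_{11}(a) - \mu_{11}(b)$, and $\delta_{\mathrm{TNRD}}(a,b) = \mu_{00}(a) - \mu_{00}(b)$.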
In this section we present an example of the problem in the case of assessing disparate impacts in lending. We discuss previous proxy approaches based on BISG and study a mortgage lending dataset. We apply the tools we will develop in the paper to describe the set of possible values of disparity that agree with the observed data. We will revisit this case study to provide more details in id1. In summary, finding that this set is quite large, because the proxies are rather weak, explains the large and spurious biases observed previously in this dataset (Zhang 2016, Chen et al. 2019, Baines and Courchane 2014). This is in contrast to the personalized medicine case study we explore in id1, where our robust assessment using strong but imperfect proxies suggests conclusions about disparity can be drawn.
The BISG proxy method used for disparity assessment in lending (Consumer Financial Protection Bureau 2014) estimates the conditional probability of race labels, $\mathbb{P}(A \mid Z_s, Z_g)$, given an individual’s surname $Z_s$ and residence geolocation $Z_g$, either as the census tract or ZIP code. Specifically, BISG uses a naïve Bayes classifier (Friedman et al. 2001, §6.6.3): it assumes surname and geolocation are independent given race and uses Bayes’s law to combine two separate estimates of the conditional probability of race labels given surname and given geolocation. $\mathbb{P}(A \mid Z_s)$ is typically estimated from a census surname list that includes the fraction of different races for surnames occurring at least 100 times (Comenetz 2016), and $\mathbb{P}(A \mid Z_g)$ is typically estimated from census Summary File I (US Census Bureau 2010). Even if the naïve Bayes assumption holds and probabilities are perfectly estimated,^{4} this only gives half the picture.
^{4} See Baines and Courchane (2014) for more implicit assumptions in constructing the BISG proxy probabilities.
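The naïve Bayes combination can be sketched as follows. If surname and geolocation are independent given race, Bayes’s law implies that the probability of a race label given both is proportional to the product of the two single-proxy conditionals divided by the marginal race probability. The function name, the three-class ordering, and all probabilities below are hypothetical values chosen for illustration; they are not census figures, and production BISG implementations involve additional data-cleaning steps not shown here.

```python
import numpy as np

def bisg_posterior(p_race_given_surname, p_race_given_geo, p_race):
    """Naive-Bayes combination of two single-proxy race probabilities.

    Under conditional independence of surname and geolocation given race:
        P(race | surname, geo)  is proportional to
        P(race | surname) * P(race | geo) / P(race).
    """
    s = np.asarray(p_race_given_surname, dtype=float)
    g = np.asarray(p_race_given_geo, dtype=float)
    prior = np.asarray(p_race, dtype=float)
    unnormalized = s * g / prior
    return unnormalized / unnormalized.sum()

# Hypothetical probabilities over (White, Black, API); not real census data.
prior = [0.6, 0.2, 0.2]
given_surname = [0.5, 0.4, 0.1]
given_geo = [0.3, 0.1, 0.6]

posterior = bisg_posterior(given_surname, given_geo, prior)
print(posterior)

# A geolocation that carries no information (equal to the prior) leaves the
# surname-only probabilities unchanged.
uninformative = bisg_posterior(given_surname, prior, prior)
```

The output is a proxy probability for each class given the proxies; by itself it says nothing about how race relates to the decision or outcome, which is the missing half of the picture.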
We also have access to a main dataset including loan applicants’ information. The main dataset typically includes lending decision outcome (approval or not), actual outcome (e.g., default or not within 2 years), proxy variables (surname and geolocation), and other related variables, but it does not include race information. Figure 1 visualizes the two available datasets. Using the proxies, we can use the BISG model to compute conditional race label probabilities.
Consider assessing the demographic disparity – the simplest measure (see Figs. 2 and id1 for others) and one that has often been considered for this problem (Zhang 2016, Chen et al. 2019). It measures the discrepancy in marginal approval rates between groups. Here we will consider White, Black, and API (Asian and Pacific Islander). There are many ways to use the above to compute demographic disparity (and it is not known what specific method was used by Consumer Financial Protection Bureau 2013). One way is to impute the most likely race, and possibly to additionally discard any data point where the highest conditional probability is below a specified certainty threshold. Another way is to duplicate every data point $|\mathcal{A}|$ times, each with a different label $\alpha \in \mathcal{A}$, and weight each by its corresponding conditional probability. Using a publicly available dataset of mortgage applications, the Home Mortgage Disclosure Act (HMDA) dataset, which includes self-reported race/ethnicity labels and geolocation, and using geolocation as a proxy, various authors have demonstrated that all of the above methods lead to biased estimates (Baines and Courchane 2014, Zhang 2016, Chen et al. 2019).
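The two imputation strategies just described can be sketched as below. This is only an illustration of the previous proxy-based point estimators, not of the approach developed in this paper; the function names, the four-row dataset, and the threshold values are hypothetical.

```python
import numpy as np

def dd_thresholded(yhat, probs, a, b, threshold=0.5):
    """Impute the most likely class, discarding points whose highest
    proxy probability falls below `threshold`, then compare approval rates."""
    yhat = np.asarray(yhat, dtype=float)
    probs = np.asarray(probs, dtype=float)
    label = probs.argmax(axis=1)
    keep = probs.max(axis=1) >= threshold
    return yhat[keep & (label == a)].mean() - yhat[keep & (label == b)].mean()

def dd_weighted(yhat, probs, a, b):
    """Count every point once per class, weighted by its proxy probability."""
    yhat = np.asarray(yhat, dtype=float)
    probs = np.asarray(probs, dtype=float)
    w_a, w_b = probs[:, a], probs[:, b]
    return (w_a @ yhat) / w_a.sum() - (w_b @ yhat) / w_b.sum()

# Hypothetical data: decisions and proxy probabilities for classes (0, 1).
yhat = [1, 0, 1, 0]
probs = [[0.9, 0.1], [0.6, 0.4], [0.4, 0.6], [0.2, 0.8]]

est_loose = dd_thresholded(yhat, probs, a=0, b=1, threshold=0.5)
est_strict = dd_thresholded(yhat, probs, a=0, b=1, threshold=0.7)
est_weighted = dd_weighted(yhat, probs, a=0, b=1)
print(est_loose, est_strict, est_weighted)
```

Even on four rows the three estimators disagree substantially, and the thresholded estimate changes with the certainty threshold, which is consistent with the sensitivity to tuning choices reported in the literature cited above.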
Indeed, as we will show in id1, without additional assumptions, demographic disparity is unidentifiable from this data, even with infinite samples. Furthermore, we will develop tools to study exactly how unidentifiable the disparity is by computing the set of all disparities that agree with the data. To demonstrate this, we apply this method to the same dataset above. We consider proxies consisting of either geolocation (county), income, or both and plot the resulting uncertainty sets in Fig. 2 along with the true value of disparity (only known in this case because we are using a mortgage lending dataset). In the case of income alone, we consider both imposing and not imposing smoothness constraints (see id1).
First, we can see that all partial identification sets contain the true value of disparity (the only set that makes an assumption not implied by the data is income proxy with smoothness constraint). Second, we can see that the sets are all quite large. This captures the intrinsic uncertainty in learning the demographic disparity from this very limited data. Because our sets are sharp (see id1), there is no possible way to further pin down identification without making additional (untestable) assumptions that are not implied by the observed data. When seen in this light, the spurious biases previously reported appear inevitable and serve as a warning sign against drawing any conclusions. In contrast, in id1 we will find smaller sets that support reliable, uncertainty-robust conclusions. As the disparity assessment may often involve high-impact policy implications, we believe that the uncertainty quantification our tools provide is invaluable in the challenging setting of unobserved protected class.
Fremont et al. (2005) provide a comprehensive review on methods that use only geolocation or surname to impute unobserved race information and comment on their relative strengths for different groups in a US context. As surname and geolocation proxies complement each other, hybrid approaches like BISG were proposed to combine both (Elliott et al. 2008, 2009) and extended to further include first name (Voicu 2018). In terms of the accuracy of race imputation, BISG has been shown to outperform surname-only and geolocation-only analysis in many datasets, including Medicare administration data (Dembosky et al. 2019), mortgage data (Consumer Financial Protection Bureau 2014), and voter registration records (Imai and Khanna 2016).
However, these evaluations focus on classification accuracy, which is never perfect, and do not consider impact on downstream disparity assessment, mostly because this is usually unknowable. In contrast, Baines and Courchane (2014) and Zhang (2016) assessed disparity on a mortgage dataset and found that using imputed race tends to overestimate the true disparity. Chen et al. (2019) provided a full analysis of this bias and gave sufficient conditions to determine its direction. Their analysis and additional empirics show that imputed-race estimators are extremely sensitive to tuning parameters like the imputation threshold. As we show in id1, disparity is generally unidentifiable from proxies when protected class is unobserved. Consequently, all previous point estimators are generally biased unless very strong assumptions are satisfied.
Over the last several years, a large body of literature has proposed more than twenty mathematical definitions of fairness to facilitate risk assessment for algorithmic decision making (Narayanan 2018, Verma and Rubin 2018, Barocas et al. 2018, Cowgill and Tucker 2019). The appropriate definition clearly depends on the context. There is also no clear agreement on when adjusting for parity is a justified a priori constraint. Selecting fairness criteria to enforce is further complicated by the fact that some are incompatible (Chouldechova 2017, Kleinberg et al. 2017, Feller et al. 2016) and many are closely correlated (Friedler et al. 2019). In this paper, we consider auditing two measures of fairness that have received considerable attention in the fair machine learning community: demographic (dis)parity and classification (dis)parity.
Demographic disparity compares the average decision outcome across different protected groups. This is closely related to the “four-fifths rule” in fair hiring, which states that “a selection rate of any protected group that is less than $4/5$ of the highest rate for other groups is an evidence of disparate impact” (Equal Employment Opportunity Commission 1978, Feldman et al. 2015). Demographic parity and its variants are the focus of numerous early papers in fair machine learning (e.g., Calders et al. 2009, Zemel et al. 2013, Zliobaite 2015, Zafar et al. 2015, Louizos et al. 2015). However, Hardt et al. (2016) argue that demographic parity is at odds with the utility goal of decision making. Take lending as an example: if default rates differ across groups, demographic parity would rule out the ideal decision according to true default outcome, which can hardly be considered discriminatory and moreover is based on business-relevant differences.
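For concreteness, the four-fifths comparison quoted above amounts to a simple ratio check against the highest group selection rate. The sketch below uses hypothetical group names and selection rates, and it assumes the rates are already known per group, which is exactly what fails when the protected class is unobserved.

```python
def flags_four_fifths(selection_rates):
    """Flag groups whose selection rate is below 4/5 of the highest rate."""
    highest = max(selection_rates.values())
    return {group: rate < 0.8 * highest
            for group, rate in selection_rates.items()}

# Hypothetical selection rates by group.
rates = {"group_1": 0.50, "group_2": 0.45, "group_3": 0.30}
flags = flags_four_fifths(rates)
print(flags)
```

Here only the third group falls below four-fifths of the highest rate. The rule is a ratio of rates, whereas demographic disparity as used in this paper is a difference of rates.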
Classification disparity, including TPRD, TNRD, PPVD, and NPVD, compares some measures of classification accuracy (or error) across different protected groups (Corbett-Davies and Goel 2018). In contrast to DD, both TPRD and TNRD measure disparities conditional on the true underlying outcome and thus alleviate the drawbacks of demographic disparity. Classification disparity measures are widely used to characterize disparate impact (Chouldechova 2017) as in the scrutiny of the COMPAS recidivism risk score (Angwin et al. 2016, Feller et al. 2016).
We emphasize that we focus on auditing, not adjusting, disparity measures. Whether observed disparities warrant adjustments depends on the legal, ethical, and regulatory context. For example, as fairness criteria, both demographic and classification parity have been criticized for their inframarginality, i.e., they average over individual risk far from the decision boundary (Corbett-Davies and Goel 2018). However, inframarginality may be unavoidable when outcomes are binary. There may be no true individual “risk,” only the stratified frequencies of binary outcomes (default or recidivation) over strata defined by predictive features, which are in turn chosen by the decision maker. Regardless, disparate impact metrics measure the actual average impact on different groups. Adjudicating whether or not disparities are justifiable (e.g., based on business-relevant factors) still depends on the ability to assess them in the first place.
Different notions of fairness have been studied in assessing performance guarantees for operational decision-making, with solution concepts such as proportional fairness, lexicographic fairness, or min-max fairness (Luss 1999, Bertsimas et al. 2011, Ogryczak et al. 2014, Adler 2012). Proportional fairness is a solution concept from the fair bargaining literature, while lexicographic or min-max fairness corresponds to a notion of fairness related to equity (Rawls 2001, Young 1995). In contrast to these notions of fairness from the literature on fair division or inequity-aversion in social welfare, we focus on algorithmic fairness definitions that have developed and formalized notions of disparate impact discrimination, often for assessing decisions based on predictive models.
There is an extensive literature on partial identification of unidentifiable parameters (e.g., Manski 2003, Beresteanu et al. 2011). There are many reasons parameters may be unidentifiable, including confounding (e.g., Kallus et al. 2019), missingness (e.g., Manski 2005), and multiple equilibria (e.g., Ciliberto and Tamer 2009). One prominent example is data combination, also termed the “ecological inference problem,” where joint distributions must be reconstructed from observation of marginal distributions (Schuessler 1999, Jiang et al. 2018, Freedman 1999, Wakefield 2004). One key tool for studying this problem is the Fréchet-Hoeffding inequalities, which give sharp bounds on joint cumulative distributions and superadditive expectations given marginals (Cambanis et al. 1976, Ridder and Moffitt 2007, Fan et al. 2014). Such tools are also used in risk analysis in finance, where the distribution of returns of a portfolio can be analyzed based only on marginal return distributions (Rüschendorf 2013). In contrast to much of the above work, we focus on assessing nonlinear functionals of partially identified distributions, namely, true positive and negative rates, as well as on leveraging conditional information to integrate marginal information across proxy-value levels with possible smoothness constraints.
In this section we study the fundamental limits of our two separate datasets to identify – i.e., pinpoint – the disparity measures of interest.
We first introduce the concept of identification (Lewbel 2018). We call a parameter of interest (either finite-dimensional or infinite-dimensional) identifiable if it can be uniquely determined by an unlimited amount of observed data. In other words, in the case of iid data, a parameter is identifiable if it is a function of the data-generating distribution, since this distribution is the most we can hope to learn from samples drawn from it. Conversely, it is unidentifiable if multiple different values of this parameter all simultaneously agree with the observed data, i.e., it is not a function of the data-generating distribution. The set of all these values is called the partial identification set for the parameter. Any value within the partial identification set is equally valid, as the observed data cannot distinguish one from another. Identification is equivalent to the partial identification set being a singleton.
In our setting, the data is fully described by the two joint distributions and . In particular, as samples grow infinitely, we can learn these distributions exactly, but we cannot hope to learn more than that. Therefore, we can only hope to learn parameters that are functions of these distributions.
Note that the disparity measures of interest are functions of the full joint distribution and so would immediately be identifiable if we observed the full data simultaneously. However, when the protected class is not observed directly, the identifiability of the disparity measures is not guaranteed. In particular, we only have partial information about this joint via the marginals learned from each dataset.
Analyzing the identifiability of disparity measures matters because it determines what any estimate can mean. Unidentifiability of the disparity measures means that it is impossible to pin down their exact values, even with an infinite number of samples in the main and auxiliary datasets. Consequently, in the absence of additional knowledge that ensures identification, any point estimate is in some sense spurious. It will in general be biased and may be very sensitive to ad hoc modeling specifications (D’Amour 2019). In this case, one must be very cautious about drawing conclusions based on point estimates of disparity measures.
In id1 and id1, we first give two sets of sufficient conditions for identifiability – one about the unknown joint and one about the marginals. We argue these conditions are too stringent in practice. In Section id1, we show that the latter condition is minimal in that in any instance where it is not satisfied, disparity measures are necessarily unidentifiable. In other words, the condition is both necessary and sufficient for identification, barring any additional (untestable) assumptions about unobservables. In Section id1, we will characterize the partial identification set of the disparity measures.
Proposition 1
(i) If , then is identifiable from , .
(ii) If , then and are identifiable from , .
When and , respectively, the joint probability distributions factor:
Since is a function of the former joint and of the latter, the conclusion follows by the identifiability of the above marginal conditional probabilities.
In particular, under the conditional independence condition we have that
which, given two datasets, can be consistently estimated by replacing expectations by empirical sums over the main datasets and replacing proxy conditional probabilities by any consistent estimate based on the auxiliary dataset. For example, doing this for corresponds exactly to the weighted estimator of Chen et al. (2019).
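As a concrete illustration of the plug-in idea just described, the following Python sketch computes a proxy-weighted estimate of demographic disparity. It assumes the identity takes the usual form in which the within-group mean decision is the average of decisions weighted by the proxy-based class probabilities; the function names and the toy numbers are hypothetical and not taken from the paper, and the class probabilities are assumed to have been estimated beforehand from the auxiliary dataset.

```python
def weighted_group_mean(decisions, class_probs):
    """Proxy-weighted estimate of P(decision = 1 | class).

    decisions:   0/1 decisions observed in the main dataset.
    class_probs: estimated P(class | proxies) for the same units,
                 assumed to come from a model fit on the auxiliary dataset.
    """
    numerator = sum(d * p for d, p in zip(decisions, class_probs))
    denominator = sum(class_probs)
    if denominator == 0:
        raise ValueError("class has zero estimated probability mass")
    return numerator / denominator


def weighted_dd(decisions, probs_a, probs_b):
    """Difference of the two proxy-weighted group means."""
    return (weighted_group_mean(decisions, probs_a)
            - weighted_group_mean(decisions, probs_b))


# Hypothetical toy data: four units, binary protected class.
decisions = [1, 0, 1, 1]
probs_a = [0.9, 0.2, 0.6, 0.3]
probs_b = [1 - p for p in probs_a]
dd_hat = weighted_dd(decisions, probs_a, probs_b)
```

The estimate is only consistent for the true disparity when the conditional independence condition holds; otherwise it is one value among the many that the data cannot rule out.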
The conditional independence condition holds if includes all variables that can mediate the dependence between the protected class , and the decision outcome and the true outcome . However, this condition is indefensible in real applications. In practice, often the number of proxy variables existing in both the main dataset and the auxiliary dataset is too small to account for the joint dependence completely. In the BISG example (id1), only includes surname and geolocation, which are unlikely to capture all dependence between race , loan approval and default behavior . For example, both and may be highly correlated with FICO score or other socioeconomic status factors, even after conditioning on surname and geolocation. In this case, the weighted estimator for DD usually produces biased estimates for the disparity measures. Sufficient conditions to identify the direction of this bias were studied by Chen et al. (2019). Furthermore, the structure of dependence among may vary across different levels of geolocation and surname, which renders the overall bias in the weighted estimators for TPRD and TNRD largely unpredictable a priori.
Proposition 2
(i) If for almost all , we have either for
or for , then is identifiable from , .
(ii) If for almost all , we have either for or for ,
then and are identifiable from , .
Consider the statement (ii), about and . By the Law of Total Probability, for any and ,
(1)  
(2) 
If , then by Eq. 1, , which implies that . In contrast, if , then , thus by Eq. 2, . Analogous conclusions hold if . Since this holds almost surely we have , so the result follows by Proposition 1.
A similar argument holds for statement (i), about .
The condition in Proposition 2 requires that the proxy variables can perfectly predict either the protected attribute or the decision outcome and true outcome . This is strictly stronger than the conditional independence condition of Proposition 1. Unlike the latter, the perfect prediction condition is checkable as it only involves each dataset separately.
However, the perfect prediction condition almost never holds in practice: neither race nor true outcome can be perfectly predicted by any observable predictive features, let alone the few features that can be simultaneously found in the two separate datasets. The decision outcome may be predictable if it is deterministic given features (e.g., given by a machine learning algorithm), but it would require that the same features be observed in the auxiliary dataset. It is unrealistic (and illegal) that surname and geolocation determine loan approval. Therefore, point estimators may lead to misleading conclusions. In contrast, the partial identification sets we develop in id1 will capture this uncertainty and, if it so happens that proxies are indeed perfect, this set will capture this and exactly recover the unique disparity measures.
Note that the comparative predictiveness of different proxies for protected class labels, such as race, has been thoroughly studied (Dembosky et al. 2019, Consumer Financial Protection Bureau 2014, Imai and Khanna 2016, Elliott et al. 2008, 2009) as it was understood this impacts the quality of corresponding proxy assessments of disparity. The above result highlights that the proxies’ predictiveness of outcomes is also crucial, or even sufficient. We discuss the impact of better predictiveness for either outcome or class when neither is perfect in id1.
First, in the next subsection, we show that the above condition is minimal in that, in any instance where it is violated, disparity is necessarily unidentifiable. That is, it is both necessary and sufficient.
Since the disparity measures are functions of the full joint distribution , to prove the unidentifiability of the disparity measures we show that there generally exist multiple valid full joint distributions that give rise to different disparities but at the same time agree with the marginal joint distributions and identifiable from the main dataset and the auxiliary dataset, respectively. To formalize the validity of full joint distributions, we introduce the coupling of two marginal distributions (Villani 2008). Because outcomes and protected classes are discrete, we focus on couplings of discrete distributions. (Footnote 5: Proxies can still be continuous, which we will leverage when we impose smoothness constraints.)
Definition 1
Given two discrete probability spaces and (i.e., ), a distribution over is a coupling of if the marginal distributions of coincide with , . The set of all possible couplings is denoted . Specifically:
With the knowledge of marginal distributions, the bounds of the couplings can be rephrased using the Fréchet-Hoeffding inequality (Cambanis et al. 1976, Ridder and Moffitt 2007, Fan et al. 2014). This characterization will prove convenient when characterizing the size of partial identification sets in id1 and computing the partial identification sets in id1.
Proposition 3 (Fréchet-Hoeffding)
The coupling set is equivalently given by
(3) 
The coupling set includes all valid joint distributions that agree with marginal distributions. Whether the coupling set contains more than one valid joint distribution is crucial for the identifiability of parameters of the joint distribution. For example, demographic disparity is determined by the joint distribution ; therefore the identifiability of the joint , which is a coupling of , is a sufficient condition for the identifiability of demographic disparity. (Footnote 6: Note can be easily identified from either the main dataset or the auxiliary dataset.)
An illustration of this is given in Fig. 3. With binary protected class and outcomes, marginal information provides only three independent constraints on four unknowns. If one of the marginals is equal to 1, the nonnegativity of probabilities forces the unknowns to a single point. Indeed this is what drives Proposition 2. This also straightforwardly extends to the coupling set .
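The collapse described above can be checked numerically. The following Python sketch evaluates the standard Fréchet-Hoeffding bounds on the probability of a joint event given the two marginal probabilities; the function name and the example values are hypothetical, chosen only for illustration.

```python
def frechet_hoeffding_bounds(p, q):
    """Sharp bounds on P(E and F) given only P(E) = p and P(F) = q."""
    if not (0.0 <= p <= 1.0 and 0.0 <= q <= 1.0):
        raise ValueError("marginal probabilities must lie in [0, 1]")
    lower = max(0.0, p + q - 1.0)
    upper = min(p, q)
    return lower, upper


# Imperfect marginals: the joint cell is only known up to an interval.
lo, hi = frechet_hoeffding_bounds(0.7, 0.6)

# One marginal equal to 1: the interval collapses to a single point.
lo_one, hi_one = frechet_hoeffding_bounds(1.0, 0.6)
```

With binary class and binary outcome, fixing this one cell fixes the remaining three through the marginals, which is why a degenerate marginal pins down the whole coupling.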
It remains to be shown that when the conditions of Proposition 2 are violated, not only do multiple different couplings necessarily exist, but also that their existence for only certain values of is sufficient to render the disparities, which are differences of nonlinear functions of the couplings, unidentifiable. This shows that, barring assumptions about unobservables such as those in Proposition 1, the conditions of Proposition 2 on the known marginals are both necessary and sufficient for point identification of disparity measures.
Proposition 4
Let . Let any marginal distributions , be given. As long as there exists a set of ’s with positive probability such that the assumptions of Proposition 2(i) are not satisfied, then is unidentifiable. That is, there exist two different joint distributions that agree with these marginals but give rise to different values of .
Similarly, as long as there exists a set of ’s with positive probability such that the assumptions of Proposition 2(ii) are not satisfied, then both and are unidentifiable.
Note that by exchanging and the same conclusions hold for PPVD, NPVD.
To prove Proposition 4 we show that given any distributions that violate the assumptions of Proposition 2, we can still construct feasible couplings that lead to different disparities. In particular, since disparities are differences of nonlinear functions of the coupling, we need to show that the variation in probabilities can be chosen so as not to cancel for any given set of marginals. Since the feasible choices depend on the given marginals, which can be arbitrary as long as they violate Proposition 2, our proof proceeds by considering six exhaustive cases and proving the conclusion in each. See Appendix id1 for the proof.
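A minimal numerical instance may help fix ideas. The Python sketch below is not the six-case construction of the proof; it assumes a single proxy stratum, a binary class with hypothetical labels "a" and "b", and a binary decision, and exhibits two couplings with identical marginals but different demographic disparity.

```python
def coupling_from_cell(p_a, p_decision, joint_a1):
    """2x2 joint of (class, decision) fixed by the two marginals and
    one free cell, joint_a1 = P(class = a, decision = 1)."""
    table = {
        ("a", 1): joint_a1,
        ("a", 0): p_a - joint_a1,
        ("b", 1): p_decision - joint_a1,
        ("b", 0): 1.0 - p_a - p_decision + joint_a1,
    }
    if min(table.values()) < 0:
        raise ValueError("cell value outside the Frechet-Hoeffding range")
    return table


def demographic_disparity(table):
    """P(decision = 1 | class = a) - P(decision = 1 | class = b)."""
    p_a = table[("a", 1)] + table[("a", 0)]
    p_b = table[("b", 1)] + table[("b", 0)]
    return table[("a", 1)] / p_a - table[("b", 1)] / p_b


# Same marginals (0.5 and 0.5), two different feasible couplings.
independent = coupling_from_cell(0.5, 0.5, 0.25)
aligned = coupling_from_cell(0.5, 0.5, 0.5)
dd_independent = demographic_disparity(independent)
dd_aligned = demographic_disparity(aligned)
```

Both tables reproduce the same marginal class and decision probabilities, yet the first gives a disparity of zero and the second a disparity of one, so no amount of marginal data can distinguish them.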
In the last section we showed that DD, TPRD, and TNRD (and symmetrically also PPVD and NPVD) are generally not identifiable from the two separate datasets. Next, we will characterize exactly how identifiable or unidentifiable they are by describing the set of all disparity values that agree with the data, and possibly any additional assumptions.
We first consider the case where we impose no additional assumptions other than what is given directly from the data in terms of marginal distributions. Toward that end, for any functions and , define, respectively,
(4)  
(5) 
Furthermore define
such that and , recalling the definitions from id1.
We make two important observations. First, for any fixed functions , both and are identifiable from just the marginal since every term is just an expectation over that distribution. Second, the special functions depend upon the unknown full joint distribution . Indeed, and are not identifiable from the data, or else disparities would be too.
We use this reformulation to characterize the range of possible disparities. We will fix one class, , to measure all disparities against. Therefore, define and, given sets ,
Any disparity between can be given by contrasting the disparities for and . Similarly, PPVD and NPVD are given by swapping and . Moreover, we can extend the above to sets combining multiple disparities at the same time.
We next show that appropriately defining the sets gives the sharp partial identification sets. Let (where “” refers to the Law of Total Probability)
and note that these sets only depend on the known marginals .
Proposition 5
Given marginals , the sets , , are sharp for the true disparities. That is, for each set, every element corresponds to the true disparity given by some full joint agreeing with the given marginals and, conversely, every such full joint gives rise to one of the elements.
The proof is given in id1. Proposition 5 shows that the given sets are the tightest-possible characterization of the possible simultaneous values of the disparities of interest when we are given only the main and auxiliary datasets and no other additional assumptions that are not already implied by the data itself. In particular, we necessarily have that if the conditions of Proposition 2 hold then these sets are singletons.
For DD, the formulation given above is constructed by using to average over different couplings , each in . Lacking any additional assumptions, each coupling can in principle be chosen independently of the others: i.e., the sets we get in this way are sharp. However, one might expect that, for two similar values , the two true joints are also similar (some limited amount of similarity is already implied by the Law of Total Probability when the given marginals are smooth). Admittedly, there is no way to verify this from the separate datasets alone (again, the sets above are sharp), but such an assumption may be defensible and can help narrow the possible values disparities may take.
We therefore further consider sharp partial identification sets of disparities when we impose the following additional assumptions:
(6)  
(7) 
where is some given metric. In particular, we encode the implicit Lipschitz constant within the metric itself.
Let
Proposition 6
Given marginals , is sharp for DD assuming Eq. 6 and , are sharp for TPRD, TNRD, respectively, assuming Eq. 7. That is, for each set, every element corresponds to the true disparity given by some full joint agreeing with the given marginals and satisfying Eq. 6 or Eq. 7 (respectively) and, conversely, every such full joint gives rise to one of the elements.
The proof is given in id1.
The partial identification sets given in Propositions 5 and 6 are sharp, i.e., they are the smallest sets containing all possible values of the disparity measures that are compatible with the data and any imposed assumptions. A natural question is when these smallest possible sets are also actually small. We next discuss different scenarios where the sets can be small or large and the implications.
If the proxies are very predictive, then the observed data may be informative enough to sufficiently pin down the disparity measures. At the extreme, if proxies are perfectly predictive, Proposition 2 showed the sets will become singletons. Indeed, considering the Fréchet-Hoeffding bounds, Eq. 3, we see that if either marginal is zero or one, the bounds collapse. If proxies are not perfect but are very predictive, either of protected class, of outcomes, or of both, then lower and upper bounds are not equal but they are still close. Consequently, the partial identification sets will be small. This is the case we observe in id1 when using genetic proxies.
If the data is not very informative by itself, we can still combine it with assumptions on the unknown joint to narrow down the options. In id1, we proposed to use smoothness assumptions. These assumptions are both rather mild and possibly defensible even if not verifiable. We observed in id1 that imposing this assumption narrowed down the set based on income proxies slightly. On the other hand, imposing the much stronger assumptions in Proposition 1 would necessarily collapse the partial identification set to a singleton. One could potentially impose some relaxed version of this conditional independence assumption, requiring that conditionals be close rather than equal. This, for example, can easily be incorporated into many of the optimization formulations we present in the next section. However, we do not believe this is advisable in practice and therefore do not present it. The conditional independence assumption is highly unrealistic and so it does not make sense to conduct a sensitivity analysis on its relaxation: we do not expect it to hold even slightly. Therefore, although such strong assumptions are indeed very informative, they may also lead to misleading conclusions if the assumptions are wrong.
If the observed datasets and statistical law alone are not sufficiently informative, and we are not willing to impose overly stringent assumptions, we generally end up with partial identification sets with nontrivial size. In this case, the size of partial identification sets exactly captures the uncertainty in learning disparity measures based on the observed data and imposed assumptions. Large sets are not meaningless: they serve as an important warning about drawing any conclusions from highly flawed data.
In the previous section we described the sharp, i.e., tightest-possible, sets of disparity values that can be simultaneously realized given all that we know from the data and any assumptions. However, it is not immediately clear what to do with these definitions. In this section we show how to actually compute these sets. Specifically, we consider computing their support functions.
Given a set its support function is given by . Not only does the support function provide the maximal and minimal contrasts in a set, it also exactly characterizes its convex hull (Rockafellar 2015). That is, . So computing allows us to compute , and ranging over a grid allows us to visualize the set as the polyhedron given by the corresponding hyperplanes, which gives a safe outer approximation. In the next sections we therefore consider how to compute the support functions for the sets we developed in the above sections. Then, we can both compute maximal/minimal disparities between any two class labels and visualize the whole set of simultaneously realizable disparities between any two class labels.
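To make the role of the support function concrete, the following Python sketch evaluates it for a finite set of points and collects the supporting halfspaces over a grid of directions, whose intersection is an outer approximation of the convex hull. It assumes the usual definition of the support function as the supremum of the inner product between a direction and the elements of the set; the point set and function names are hypothetical, and in the paper's setting each evaluation would instead come from solving an optimization problem.

```python
import math


def support_function(points, direction):
    """Largest inner product between `direction` and any point in the set."""
    return max(sum(r * t for r, t in zip(direction, point))
               for point in points)


def outer_halfspaces(points, n_directions=8):
    """Pairs (direction, offset) such that the convex hull lies inside
    every halfspace {x : direction . x <= offset}."""
    halfspaces = []
    for k in range(n_directions):
        angle = 2.0 * math.pi * k / n_directions
        direction = (math.cos(angle), math.sin(angle))
        halfspaces.append((direction, support_function(points, direction)))
    return halfspaces


# Hypothetical pairs of simultaneously realizable disparities.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)]

# Maximal and minimal value of the first coordinate over the set.
h_max = support_function(points, (1.0, 0.0))
h_min = -support_function(points, (-1.0, 0.0))
halfspaces = outer_halfspaces(points)
```

Refining the grid of directions tightens the polyhedral approximation while never excluding a point of the true set.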
When considering a binary protected class, , the convex hulls of the partial identification sets are simply intervals. For the case of demographic disparity, the endpoints take a particularly simple form.
Proposition 7
Let
Then
Notice that exactly correspond to the endpoints of the Fréchet-Hoeffding inequalities in Eq. 3. The key observations driving the proof are: (a) that must imply that for the denominator in (4), , which does not depend on , and (b) that when we must have by complementarity of the labels. See id1 for the full proof.
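The two observations above suggest a direct computation, sketched here in Python. This is an illustration under stated assumptions rather than the exact expression of Proposition 7: the proxy is assumed to take finitely many values with known probabilities, the protected class is binary, no smoothness constraint is imposed, and the function name and numbers are hypothetical. Because the class shares are fixed by the marginals and the second group's joint is the complement of the first, the disparity is increasing in the joint probability of class and positive decision, so plugging in the stratum-wise Fréchet-Hoeffding endpoints gives the interval.

```python
def dd_interval(strata):
    """Interval of demographic disparity values for a binary class.

    strata: list of (p_z, p_a_given_z, p_decision_given_z), one tuple
            per proxy value, with the p_z summing to one.
    """
    p_a = sum(p_z * p_az for p_z, p_az, _ in strata)
    p_decision = sum(p_z * p_dz for p_z, _, p_dz in strata)
    if not 0.0 < p_a < 1.0:
        raise ValueError("both classes need positive probability")
    p_b = 1.0 - p_a

    def dd_from_joint(joint_a1):
        # joint_a1 = P(class = a, decision = 1); the other class gets
        # the remainder of the positive decisions.
        return joint_a1 / p_a - (p_decision - joint_a1) / p_b

    # Stratum-wise Frechet-Hoeffding endpoints, averaged over proxies.
    joint_low = sum(p_z * max(0.0, p_az + p_dz - 1.0)
                    for p_z, p_az, p_dz in strata)
    joint_high = sum(p_z * min(p_az, p_dz)
                     for p_z, p_az, p_dz in strata)
    return dd_from_joint(joint_low), dd_from_joint(joint_high)


# Informative but imperfect proxies: a nondegenerate interval.
dd_low, dd_high = dd_interval([(0.5, 0.9, 0.8), (0.5, 0.1, 0.4)])

# Proxies that perfectly predict class: the interval is a single point.
point_low, point_high = dd_interval([(0.5, 1.0, 0.8), (0.5, 0.0, 0.4)])
```

In the first example the data alone allow any disparity between the two endpoints, while in the second the perfect proxies recover the disparity exactly, in line with Proposition 2.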
Next we consider the same setting for the classification disparities TPRD and TNRD. Again, the convex hull is an interval and we can express its endpoints in closed form, but they are slightly more intricate.
Proposition 8
Let
Then
To sketch the proof of Proposition 8, consider the case of TPRD. Notice, again, that exactly correspond to the endpoints of the Fréchet-Hoeffding inequalities in Eq. 3. Therefore, the box given by these coordinatewise bounds, , contains . If we consider, say, maximizing over this box, we arrive at the bounds above. Moreover note that only depends on the restriction of to , i.e., . So, to prove Proposition 8, we consider four exhaustive cases and show that in each case, even though it may not be possible to achieve all coordinate bounds simultaneously, we can still find a feasible that achieves the necessary coordinate bounds for the restriction to . See id1 for the detailed proof.
When we either have more than two class labels or we wish to impose smoothness constraints, the identification sets no longer admit closed-form expressions. We first consider the far simpler case of demographic disparity.
Proposition 9
Suppose . Then
When either or , the above gives an infinite linear program since the constraints are linear in . When either is discrete or when expectations are estimated by empirical sample averages, this becomes a regular linear program, which is easily solvable with off-the-shelf software (we use Gurobi).
Proposition 9 follows directly from our sharp characterization of the partial identification set in id1, using the characterization Eq. 4 of within-group mean outcome, and noting that implies which in turn implies that , which is identifiable from the auxiliary dataset.
We next consider the case of classification disparities in the general case. For a concise and clear exposition, we focus on the case of TPRD. The case of TNRD can be symmetrically handled.
Proposition 10
Suppose where . Then
(8)  
s.t.  (9)  
(10)  
(11)  
(12)  
(13) 
Equation 8 is generally a nonconvex infinite program. When either is discrete or when expectations are estimated by empirical sample averages, this becomes a finite program but still nonconvex. Note that the objective is linear in and that Eq. 13 is a perspective constraint, which is convex if is convex and linear if is a polyhedron, as in the case of . Thus, the only nonconvex constraint is Eq. 10. To handle this nonconvexity, we note that the constraint becomes linear once we fix