50 Years of Test (Un)fairness: Lessons for Machine Learning
Abstract.
Quantitative definitions of what is unfair and what is fair have been introduced in multiple disciplines for well over 50 years, including in education, hiring, and machine learning. We trace how the notion of fairness has been defined within the testing communities of education and hiring over the past half century, exploring the cultural and social context in which different fairness definitions have emerged. In some cases, earlier definitions of fairness are similar or identical to definitions of fairness in current machine learning research, and foreshadow current formal work. In other cases, insights into what fairness means and how to measure it have largely gone overlooked. We compare past and current notions of fairness along several dimensions, including the fairness criteria, the focus of the criteria (e.g., a test, a model, or its use), the relationship of fairness to individuals, groups, and subgroups, and the mathematical method for measuring fairness (e.g., classification, regression). This work points the way towards future research and measurement of (un)fairness that builds from our modern understanding of fairness while incorporating insights from the past.
1. Introduction
The United States Civil Rights Act of 1964 effectively outlawed discrimination on the basis of an individual’s race, color, religion, sex, or national origin. The Act contained two important provisions that would fundamentally shape the public’s understanding of what it meant to be unfair, with lasting impact into modern day: Title VI, which prevented government agencies that receive federal funds (including universities) from discriminating on the basis of race, color or national origin; and Title VII, which prevented employers with 15 or more employees from discriminating on the basis of race, color, religion, sex or national origin.
Assessment tests used in public and private industry immediately came under public scrutiny. The question posed by many at the time was whether the tests used to assess ability and fit in education and employment were discriminating on bases forbidden by the new law (Ash, 1966). This stimulated a wealth of research into how to mathematically measure unfair bias and discrimination within the educational and employment testing communities, often with a focus on race. The period of time from 1966 to 1976 in particular gave rise to fairness research with striking parallels to ML fairness research from 2011 until today, including formal notions of fairness based on population subgroups, the realization that some fairness criteria are incompatible with one another, and pushback on quantitative definitions of fairness due to their limitations.
Into the 1970s, there was a shift in perspective, with researchers moving from defining how a test may be unfair to how a test may be fair. It is during this time that we see the introduction of mathematical criteria for fairness identical to the mathematical criteria of modern day. Unfortunately, this fairness movement largely disappeared by the end of the 1970s, as the different and sometimes competing notions of fairness left little room for clarity on when one notion of fairness may be preferable to another. Following the retrospective analysis of Nancy Cole (Cole and Zieky, 2001), who introduced the equivalent of Hardt et al.’s 2016 equality of opportunity (Hardt et al., 2016) in 1973:
The spurt of research on fairness issues that began in the late 1960s had results that were ultimately disappointing. No generally agreed upon method to determine whether or not a test is fair was developed. No statistic that could unambiguously indicate whether or not an item is fair was identified. There were no broad technical solutions to the issues involved in fairness.
By learning from this past, we hope to avoid such a fate.
Before further diving into the history of testing fairness, it is useful to briefly consider the structural correspondences between tests and ML models. Test items (questions) are analogous to model features, and item responses analogous to specific activations of those features. Scoring a test is typically a simple linear model which produces a (possibly weighted) sum of the item scores. Sometimes test scores are normalized or standardized so that scores fit a desired range or distribution. Because of this correspondence, much of the math is directly comparable, and many of the underlying ideas in earlier fairness work trivially map on to modern day ML fairness. “History doesn’t repeat itself, but it often rhymes”; and by hearing this rhyme, we hope to gain insight into the future of ML fairness.
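The correspondence can be made concrete with a minimal sketch (the data, weights, and standardization scheme are all assumptions for illustration): scoring a test as a weighted sum of item scores, followed by standardization.

```python
import numpy as np

def score_test(item_responses, item_weights=None):
    """Score a test as a simple linear model: a (possibly weighted)
    sum of per-item scores, then standardized to mean 0, std 1.
    Hypothetical illustration of the test/model correspondence."""
    responses = np.asarray(item_responses, dtype=float)  # (n_examinees, n_items)
    if item_weights is None:
        item_weights = np.ones(responses.shape[1])
    raw = responses @ item_weights  # weighted sum of item scores
    # Standardize so scores fit a desired distribution.
    return (raw - raw.mean()) / raw.std()

# Three examinees answering four 0/1 items.
responses = [[1, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1, 1]]
z = score_test(responses)
```

Here the items play the role of model features and the weights the role of model parameters.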
Following the terminology of the social sciences and applied statistics, and the notation of (Barocas et al., 2018), we use “demographic variable” to refer to an attribute of individuals such as race, age or gender, denoted by the symbol A. We use “subgroup” to denote a group of individuals defined by a shared value of a demographic variable, e.g., A = a. Y indicates the ground truth or target variable, R denotes a score output by a model or a test, and D denotes a binary decision made using that score. We occasionally make exceptions when referencing original material.
2. History of fairness in testing
2.1. 1960s: Bias and Unfair Discrimination
Concerned with the fairness of tests for black and white students, T. Anne Cleary defined a quantitative measure of test bias for the first time, cast in terms of a formal model for predicting educational outcomes from test scores (Cleary, 1966, 1968):
A test is biased for members of a subgroup of the population if, in the prediction of a criterion for which the test was designed, consistent nonzero errors of prediction are made for members of the subgroup. In other words, the test is biased if the criterion score predicted from the common regression line is consistently too high or too low for members of the subgroup. With this definition of bias, there may be a connotation of “unfair,” particularly if the use of the test produces a prediction that is too low. (Emphasis added.)
According to Cleary’s criterion, the situation depicted in Figure 1(a) is biased for members of a subgroup if the common regression line is used to predict their ability, since it underpredicts their true ability. For Cleary, the situation depicted in Figure 1(b) is not biased: since data from each of the subgroups produce the same regression line, that line can be used to make predictions for either group.
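Cleary’s check can be sketched on synthetic data (the data-generating process, group sizes, and effect size are all assumptions for illustration): fit a single common regression line, then look for consistently nonzero prediction errors within each subgroup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: test score R predicts criterion Y, but the common
# line underpredicts subgroup 1's criterion (a constructed example).
n = 1000
group = rng.integers(0, 2, size=n)
r = rng.normal(0, 1, size=n)
y = r + 0.5 * group + rng.normal(0, 0.3, size=n)

# Common regression line fitted on everyone, as in Cleary's setup.
slope, intercept = np.polyfit(r, y, 1)
errors = y - (slope * r + intercept)

# Cleary's bias check: are prediction errors consistently nonzero
# for a subgroup?
bias_g0 = errors[group == 0].mean()
bias_g1 = errors[group == 1].mean()
```

Under this construction the common line overpredicts group 0 (negative mean error) and underpredicts group 1 (positive mean error), which is exactly the situation Cleary flags as biased.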
In addition to defining bias in terms of predictions by regression models, Cleary also performed a study on real-world data from three state-supported and state-subsidized schools, comparing college GPA with SAT scores. Racial data was obtained from an admissions office, from an NAACP list of black students, and from examining class pictures. Cleary used Analysis of Covariance (ANCOVA) to test the relationships of SAT and high school rank (HSR) scores with GPA. Contrary to some expectations, Cleary found little evidence of the SAT being a biased predictor of GPA. (Later, larger studies found that the SAT overpredicted the GPA of black students (Vars and Bowen, 1998); it may be that the SAT is biased, but less so than the GPA.)
While Cleary’s focus was on education, her contemporary Robert Guion was concerned with unfair discrimination in employment. Arguing for the importance of quantitative analyses in 1966, he wrote that: “Illegal discrimination is largely an ethical matter, but the fulfillment of ethical responsibility begins with technical competence” (Guion, 1966), and defined unfair discrimination to be “when persons with equal probabilities of success on the job have unequal probabilities of being hired for the job.” However, Guion recognized the challenges in using constructs such as the probability of success. We can observe actual success and failure after selection, but the probability of success is not itself observable, and a sophisticated model is required to estimate it at the time of selection.
By the end of the 1960s, there was political and legal support backing concerns with the unfairness of the educational system for black children and the unfairness of tests purporting to measure black intellectual competence. Responding to these concerns, the Association of Black Psychologists, formed in 1969, immediately published “A Petition of Concerns”, calling for a moratorium on standardized tests “(which are used) to maintain and justify the practice of systematically denying economic opportunities” (Williams et al., 1980). The NAACP followed up on this in 1974 by adopting a resolution that demanded “a moratorium on standardized testing wherever such tests have not been corrected for cultural bias” (cited by (Samuda, 1998)). Meanwhile, advocates of testing worried that alternatives to testing such as interviews would introduce more subjective bias (Flaugher, 1974).[1]

[1] For example, the origins of the college entrance essay are rooted in Ivy League universities’ covert attempts to suppress the numbers of Jewish students, whose performance on entrance exams had led them to become an increasing percentage of the student population (Karabel, 2006).
2.2. 1970s: Fairness
As the 1960s turned to the 1970s, work began to arise that parallels the recent evolution of work in ML fairness, marking a change in framing from unfairness to fairness. Following Thorndike (Thorndike, 1971), “The discussion of ‘fairness’ in what has gone before is clearly oversimplified. In particular, it has been based upon the premise that the available criterion score is a perfectly relevant, reliable and unbiased measure…” Thorndike’s sentiment was shared by other academics of the time, who, in examining the earlier work of Cleary, objected that it failed to take into account the differing false positive and false negative rates that occur when subgroups have different base rates (i.e., Y is not independent of A) (Thorndike, 1971; Einhorn and Bass, 1971).
With the goal of moving beyond simplified models, Thorndike (Thorndike, 1971) proposed one of the first quantitative criteria for measuring test fairness. With this shift, Thorndike advocated for considering the contextual use of a test:
A judgment on testfairness must rest on the inferences that are made from the test rather than on a comparison of mean scores in the two populations. One must then focus attention on fair use of the test scores, rather than on the scores themselves.
Contrary to Cleary, Thorndike argued that sharing a common regression line is not important, as one can achieve fair selection goals by using different regression lines and different selection thresholds for the two groups.
As an alternative to Cleary, Thorndike proposed that the ratio of predicted positives to ground truth positives be equal for each group. Using confusion matrix terminology, this is equivalent to requiring that the ratio (TP + FP)/(TP + FN) be equal for each subgroup. According to Thorndike, the situation in Figure 1(a) is fair for a single test cutoff. Figure 1(b) is unfair using any single threshold, but fair if different thresholds are used for the two groups.
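Thorndike’s constant-ratio criterion is straightforward to compute from a confusion matrix; the following is a minimal sketch (the helper name and toy decisions are invented for illustration):

```python
import numpy as np

def thorndike_ratio(y_true, y_pred):
    """Ratio of predicted positives to ground-truth positives:
    (TP + FP) / (TP + FN). Thorndike's criterion asks that this
    ratio be equal across subgroups."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    predicted_pos = (y_pred == 1).sum()  # TP + FP
    actual_pos = (y_true == 1).sum()     # TP + FN
    return predicted_pos / actual_pos

# Hypothetical outcomes for two subgroups, possibly using
# different decision thresholds.
y_true_a, y_pred_a = [1, 1, 0, 0], [1, 0, 1, 0]   # ratio 2/2
y_true_b, y_pred_b = [1, 0, 0, 0], [1, 0, 0, 0]   # ratio 1/1
```

Here both subgroups have a ratio of 1.0, so the criterion is satisfied even though the individual decisions differ between groups.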
Similar to modern day ML fairness, e.g., Friedler et al. in 2016 (Friedler et al., 2016), Thorndike also pointed out the tension between individual notions of fairness and group notions of fairness: “the two definitions of fairness—one based on predicted criterion score for individuals and the other on the distribution of criterion scores in the two groups—will always be in conflict.” The conflict was also raised by others in the period, including Sawyer et al. (Sawyer et al., 1976), in a foreshadowing of the COMPAS debate of 2016:
A conflict arises because the success maximization procedures based on individual parity do not produce equal opportunity (equal selection for equal success) based on group parity and the opportunity procedures do not produce success maximization (equal treatment for equal prediction) based on individual parity.
Almost as an aside, Thorndike mentions the existence of another regression line ignored by Cleary: the line that estimates the value of the test score given the target variable Y. This idea hints at the notion of equal opportunity for those with a given value of Y, an idea which was soon picked up by Darlington (Darlington, 1971) and Cole (Cole, 1973).
At a glance, Cleary’s and Thorndike’s definitions are difficult to compare directly because of the different ways in which they are defined. Darlington (Darlington, 1971) helped to shed light on the relationship between Cleary’s and Thorndike’s conceptions of fairness by expressing them in a common formalism. He defines four fairness criteria in terms of the correlation ρ_{AR} between the demographic variable A and the test score R. Following Darlington:

(1) Cleary’s criterion can be restated in terms of correlations of the “culture variable” A with test scores R. If Cleary’s criterion holds for every subgroup, then ρ_{AR} = ρ_{AY}/ρ_{RY}.[2]

(2) Similarly, Thorndike’s criterion is equivalent to requiring that ρ_{AR} = ρ_{AY}.

(3) The criterion ρ_{AR} = ρ_{AY}·ρ_{RY} is motivated by thinking about R as a dependent variable affected by independent variables A and Y. If A has no direct effect on R once Y is taken into account then we have a zero partial correlation, i.e. ρ_{AR·Y} = 0.[3]

(4) An alternative “starkly simple” criterion of ρ_{AR} = 0 (recognizable as modern day demographic parity (Dwork et al., 2012)) is introduced but not dwelt on.

[2] Although Darlington does not mention this additional constraint, we believe the criterion only holds if A, R and Y have a multivariate normal distribution (Vargha et al., 1996).
[3] See footnote 2.
Darlington’s mapping of Cleary’s and Thorndike’s criteria lets him prove that they are incompatible except in the special cases where the test perfectly predicts the target variable (ρ_{RY} = 1), or where the target variable is uncorrelated with the demographic variable (ρ_{AY} = 0). Figure 2, reproduced from Darlington’s 1971 work, shows that, for any given nonzero correlation between the demographic and target variables, definitions (1), (2), and (3) converge as the correlation between the test score and the target variable approaches 1. When the test has only a poor correlation with the target variable, there may be no fair solution using definition (1).
Figure 2 enables a range of further observations. According to definition (1), for a given correlation between demographic and target variables, the lower the correlation of the test with the target variable, the higher the test is allowed to correlate with the demographic variable and still be considered fair. Definition (3), on the other hand, is the opposite: the lower the correlation of the test with the target variable, the lower too must be the test’s correlation with the demographic variable. Darlington’s criterion (2) is the geometric mean of criteria (1) and (3): “a compromise position midway between [the] two… however, a compromise may end up satisfying nobody; psychometricians are not in the habit of agreeing on important definitions or theorems by compromise.” Darlington shows that definition (3) is the only one of the four whose errors are uncorrelated with the demographic variable, where by “errors” he means errors in the regression task of estimating Y from R.
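Darlington’s correlational criteria can be checked numerically. The sketch below (synthetic jointly Gaussian data with arbitrary coefficients, chosen for illustration) constructs a score R that depends on the demographic variable A only through the target Y, so criterion (3) holds while (1), (2) and (4) fail:

```python
import numpy as np

def corr(u, v):
    return np.corrcoef(u, v)[0, 1]

rng = np.random.default_rng(1)

# Jointly Gaussian A, Y, R: the setting where Darlington's
# correlational criteria line up with independence notions.
n = 100_000
a = rng.normal(size=n)                    # demographic variable
y = 0.4 * a + rng.normal(size=n)          # target, correlated with A
r = 0.8 * y + 0.6 * rng.normal(size=n)    # test score built from Y only

rho_ar, rho_ay, rho_ry = corr(a, r), corr(a, y), corr(r, y)

# Darlington's four criteria for the score R:
crit1 = np.isclose(rho_ar, rho_ay / rho_ry, atol=0.02)   # Cleary-style
crit2 = np.isclose(rho_ar, rho_ay, atol=0.02)            # Thorndike-style
crit3 = np.isclose(rho_ar, rho_ay * rho_ry, atol=0.02)   # zero partial corr. given Y
crit4 = np.isclose(rho_ar, 0.0, atol=0.02)               # demographic parity
```

Because ρ_{AY} ≠ 0 and ρ_{RY} < 1 in this construction, the incompatibility Darlington proved is visible directly: only one criterion can hold at a time.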
In 1973, Cole (Cole, 1973) continued exploring ideas of equal outcomes across subgroups, defining fairness as all subgroups having the same True Positive Rate (TPR), recognizable as modern day equality of opportunity (Hardt et al., 2016). That same year, Linn (Linn, 1973) introduced (but did not advocate for) equal Positive Predictive Value (PPV) as a fairness criterion, recognizable as modern day predictive parity (Chouldechova, 2017).[4]

[4] Although he cites (Guion, 1966) and (Einhorn and Bass, 1971), a seeming misattribution, as pointed out by (Petersen and Novick, 1976).
Under Cleary’s and Darlington’s conceptions, bias or (un)fairness is a property of the test itself. This is contrary to Thorndike, Linn and Cole, who take fairness to be a property of the use of a test. The latter group tended to assume that a test is static, and focused on optimizing its use; whereas Cleary’s concern was with how to improve the tests themselves. Cleary worked for Educational Testing Service, and one can imagine a test being designed to allow for a range of use cases, since neither i) the precise populations on which it will be deployed, nor ii) the number of students to which an institution deploying the test can offer places, may be knowable in advance.
By March 1976, the interest in fairness in the educational testing community was so strong that an entire issue of the Journal of Educational Measurement was devoted to the topic (NCME, 1976), including a lengthy lead article by Petersen and Novick (Petersen and Novick, 1976), in which they consider for the first time the equality of True Negative Rates (TNR) across subgroups, and equal TPR / equal TNR across subgroups (modern day equalized odds (Hardt et al., 2016)). Similarly, they consider the case of equal PPV and equal NPV across subgroups.[5]

[5] They do not advocate for either combination (neither equal TPR and TNR, nor equal PPV and NPV) on the grounds that either combination requires unusual circumstances. However, there is a flaw in their reasoning: for example, arguing against equal TPR and equal TNR, they claim that this requires equal base rates in the ground truth in addition to equal TPR.
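The confusion-matrix rates underlying these criteria (Cole’s TPR parity, Petersen and Novick’s paired TPR/TNR and PPV/NPV parities) can be sketched as follows (the helper name and toy decisions are invented for illustration):

```python
import numpy as np

def rates(y_true, y_pred):
    """Confusion-matrix rates used by the 1970s criteria:
    TPR (Cole), TPR+TNR (Petersen & Novick's conditional
    probability and its converse), PPV (Linn)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = ((y_true == 1) & (y_pred == 1)).sum()
    fn = ((y_true == 1) & (y_pred == 0)).sum()
    tn = ((y_true == 0) & (y_pred == 0)).sum()
    fp = ((y_true == 0) & (y_pred == 1)).sum()
    return {"TPR": tp / (tp + fn), "TNR": tn / (tn + fp), "PPV": tp / (tp + fp)}

# Hypothetical decisions for two subgroups.
ra = rates([1, 1, 0, 0], [1, 0, 1, 0])
rb = rates([1, 1, 1, 0], [1, 1, 1, 1])
```

Cole’s criterion would require ra["TPR"] == rb["TPR"]; in this toy example the rates differ, so the use of the test would be judged unfair with respect to the grouping.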
Work from the mid-1960s to mid-1970s can be summarized along four distinct categories: individual, non-comparative, subgroup parity, and correlation, defined in Table 1. It should be emphasized that not all researchers who defined a criterion also advocated for it. In particular, Darlington, Linn, Jones, and Petersen and Novick all define criteria purely for the purposes of exploring the space of concepts related to fairness. A summary of the period’s technical definitions of fairness is listed in Table 2.
Table 1. Categories of fairness criteria.

Category | Description
individual | Fairness criterion defined purely in terms of individuals
non-comparative | Fairness criterion for each subgroup does not reference other subgroups
subgroup parity | Fairness criterion defined in terms of parity of some value across subgroups
correlation | Fairness criterion defined in terms of the correlation of the demographic variable with the model output
2.3. Mid-1970s: The Fairness Tide Turns
Immediately after the journal issue of 1976, research into quantitative definitions of test fairness seems to have come to a halt. Considering why this happened may provide a valuable lesson for modern day fairness research. The same Cole who in 1973 proposed equality of TPR wrote in 2001 that (Cole and Zieky, 2001):
In short, research over the last 30 or so years has not supplied any analyses to unequivocally indicate fairness or unfairness, nor has it produced clear procedures to avoid unfairness. To make matters worse, the views of fairness of the measurement profession and the views of the general public are often at odds.
Foreshadowing this outcome, statements from researchers in the 1970s indicate an increasing concern with how fairness criteria obscure “the fundamental problem, which is to find some rational basis for providing compensatory treatment for the disadvantaged” (Novick and Petersen, 1976). Following Petersen and Novick, the concepts of culture-fairness and group parity are not viable in practice, leading to models that can sanction the discrimination they seek to rectify (Petersen and Novick, 1976). They argue that fairness should be reconceptualized as a problem in maximizing expected utility (Petersen, 1976), recognizing “high social utility in equalizing opportunity and reducing disadvantage” (Novick and Petersen, 1976).
A related thread of work highlights that different fairness criteria encode different value systems (Hunter and Schmidt, 1976), and that quantitative techniques alone cannot answer the question of which to use. In 1971, Darlington (Darlington, 1971) urged that the concept of “cultural fairness” be replaced by “cultural optimality”, which takes into account a policy-level question concerning the optimum balance between accuracy and cultural factors. In 1974, Thorndike pointed out that “one’s value system is deeply involved in one’s judgment as to what is ‘fair use’ of a selection device” (Novick and Petersen, 1976), and similarly, in 1976, Linn (Linn, 1976) drew attention to the fact that “Values are implicit in the models. To adequately address issues of values they need to be dealt with explicitly.” Hunter and Schmidt (Hunter and Schmidt, 1976) began to address this issue by bringing ethical theory to the discussion, relating fairness to theories of individualism and proportional representation. Current work may learn from this point in history by explicitly connecting fairness criteria to different cultural and social values.

Table 2. Summary of fairness criteria in the testing literature, 1966–1976.

- Guion (1966). Criterion: “people with equal probabilities of success on the job have equal probabilities of being hired for the job”. Category: individual. Proposition: Is the use of the test fair?
- Cleary (1966). Criterion: a subgroup does not have consistent errors of prediction. Category: non-comparative. Proposition: Is the test fair to the subgroup?
- Einhorn and Bass (1971). Criterion: P(Y = 1 | R = r*) is constant for all subgroups, where r* is the decision threshold. Category: subgroup parity. Proposition: Is the use of the test fair with respect to A?
- Thorndike (1971). Criterion: (TP + FP)/(TP + FN) is constant for all subgroups. Category: subgroup parity. Proposition: Is the use of the test fair with respect to A?
- Darlington (1971) (1). Criterion: ρ_{AR} = ρ_{AY}/ρ_{RY} (equivalent to ρ_{AY·R} = 0). Category: correlation. Proposition: Is the test fair with respect to A?
- Darlington (1971) (2). Criterion: ρ_{AR} = ρ_{AY}. Category: correlation. Proposition: Is the test fair with respect to A?
- Darlington (1971) (3). Criterion: ρ_{AR} = ρ_{AY}·ρ_{RY} (equivalent to ρ_{AR·Y} = 0). Category: correlation. Proposition: Is the test fair with respect to A?
- Darlington (1971) (4). Criterion: ρ_{AR} = 0. Category: correlation. Proposition: Is the test fair with respect to A?
- Darlington (1971), cultural optimality. Criterion: ρ_{(Y+kA)R} is maximized, where k is the subjective value placed on the subgroup attribute. Category: correlation. Proposition: Does the test produce the culturally optimal outcome with respect to A?
- Cole (1973). Criterion: TPR is constant for all subgroups. Category: subgroup parity. Proposition: Is the use of the test fair with respect to A?
- Linn (1973). Criterion: PPV is constant for all subgroups. Category: subgroup parity. Proposition: Is the use of the test fair with respect to A?
- Jones (1973), mean fair. Category: non-comparative. Proposition: Is the test fair to the subgroup?
- Jones (1973), general standard. Criterion: a subgroup has equal representation in the top n candidates ranked by model score as it has in the top n candidates ranked by Y, for all n. Category: non-comparative. Proposition: Is the test fair to the subgroup?
- Jones (1973), at position n. Criterion: a subgroup has equal representation in the top n candidates ranked by model score as it has in the top n candidates ranked by Y. Category: non-comparative. Proposition: Is the use of the test fair to the subgroup?
- Petersen and Novick (1976), conditional probability and its converse. Criterion: TPR is constant for all subgroups, and TNR is constant for all subgroups. Category: subgroup parity. Proposition: Is the use of the test fair with respect to A?
- Petersen and Novick (1976), equal probability and its converse. Criterion: PPV is constant for all subgroups, and NPV is constant for all subgroups. Category: subgroup parity. Proposition: Is the use of the test fair with respect to A?
2.4. 1970s on: Differential Item Functioning
Concurrent with the development of criteria for the fair use of tests, another line of research in the measurement community concerned looking for bias in test questions (“items”). In 1968, Cleary and Hilton (Cleary and Hilton, 1968) used an analysis of variance (ANOVA) design to test the interaction between race, socioeconomic level and test item. Ten years later, the related idea of Differential Item Functioning (DIF) was introduced by Scheuneman (Scheuneman, 1979): “an item is considered unbiased if, for persons with the same ability in the area being measured, the probability of a correct response on the item is the same regardless of the population group membership of the individual.” That is, if X_i is the variable representing a correct response on item i, then by this definition item i is unbiased if X_i ⊥ A | Y.
In practice, the best measure of the ability that the item is testing is often the test in which the item is a component (Dorans, 2017):
A major change from focusing primarily on fairness in a domain, where so many factors could spoil the validity effort, to a domain where analyses could be conducted in a relatively simple, less confounded way. … In a DIF analysis, the item is evaluated against something designed to measure a particular construct and something that the test producer controls, namely a test score.
Figure 3 illustrates DIF for a test item.
DIF became very influential in the education field, and to this day DIF is in the toolbox of test designers. Items displaying DIF are ideally examined further to identify the cause of bias, and possibly removed from the test (Penfield, 2016).
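A crude version of a DIF screen, using the total test score as the ability proxy as described above, might look like the following sketch (the function name and toy data are invented; operational DIF analyses use statistics such as Mantel-Haenszel rather than raw per-stratum gaps):

```python
import numpy as np

def dif_gaps(item_correct, total_score, group):
    """Crude DIF screen: stratify examinees by total test score (the
    ability proxy), then compare each group's probability of answering
    the item correctly within each stratum. Large gaps suggest DIF."""
    item_correct = np.asarray(item_correct)
    total_score = np.asarray(total_score)
    group = np.asarray(group)
    gaps = {}
    for s in np.unique(total_score):
        in_stratum = total_score == s
        p = {}
        for g in np.unique(group):
            mask = in_stratum & (group == g)
            if mask.any():
                p[g] = item_correct[mask].mean()
        if len(p) == 2:
            g0, g1 = sorted(p)
            gaps[s] = abs(p[g0] - p[g1])
    return gaps

# An item showing no DIF: within each score stratum, both groups
# answer correctly at the same rate.
item = [1, 1, 0, 0, 1, 1, 0, 0]
score = [3, 3, 1, 1, 3, 3, 1, 1]
grp = [0, 0, 0, 0, 1, 1, 1, 1]
gaps = dif_gaps(item, score, grp)
```

A nonzero gap within a stratum would mean that examinees of equal (proxied) ability answer the item correctly at different rates depending on group, which is exactly what the DIF definition forbids.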
2.5. 1980s and beyond
With the start of the 1980s came renewed public debate about the existence of racial differences in general intelligence, and the implications for fair testing, following the publication of the controversial Bias in Mental Testing (Jensen, 1980). Political opponents of groupbased considerations in educational and employment practices framed them in terms of “preferential treatment” for minorities and “reverse discrimination” against whites. Despite, or perhaps because of, much public debate, neither Congress nor the courts gave unambiguous answers to the question of how to balance social justice considerations with the historical and legal importance placed on the individual in the United States (Council et al., 1989).
Into the 1980s, courts were asked to rule on many cases involving (un)fairness in educational testing. To give just one example, Zwick and Dorans (Zwick and Dorans, 2016) described the case of Debra P. v. Turlington 1984, in which a lawsuit was filed on behalf of “present and future twelfth grade students who had failed or would fail” a high school graduation test. The initial ruling found that the test perpetuated past discrimination and was in violation of the Civil Rights Act. More examples of court rulings on fairness are given by (Phillips, 2016; Zwick and Dorans, 2016).
By the early 1980s, ideas about fairness were having a widespread influence on U.S. employment practices. In 1981, with no public debate, the United States Employment Service implemented a score-adjustment strategy that was sometimes called “race-norming” (Rice and Baptiste, 1994): each individual is assigned a percentile ranking within their own ethnic group, rather than within the test-taking population as a whole. By the mid-1980s, race-norming was “a highly controversial issue sparking heated debate.” The debate was settled through legislation, with the 1991 Civil Rights Act banning the practice of race-norming (West-Faulcon, 2011).
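The score-adjustment strategy just described can be sketched in a few lines (hypothetical scores and groups; the function name is invented):

```python
import numpy as np

def within_group_percentile(scores, group):
    """Convert each raw score to a percentile rank computed within
    the individual's own group, rather than within the whole
    test-taking population."""
    scores = np.asarray(scores, dtype=float)
    group = np.asarray(group)
    pct = np.empty_like(scores)
    for g in np.unique(group):
        mask = group == g
        s = scores[mask]
        # Percent of the group scoring at or below each score.
        pct[mask] = [(s <= v).mean() * 100 for v in s]
    return pct

scores = [50, 60, 70, 80]
group = [0, 0, 1, 1]
pct = within_group_percentile(scores, group)
```

In this toy example a raw score of 50 in group 0 and a raw score of 70 in group 1 both land at the 50th percentile: rankings are comparable within groups but not across them, which is what made the practice controversial.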
3. Connections to ML fairness
3.1. Equivalent Notions
Many of the fairness criteria we have overviewed are identical to modern-day fairness definitions. Here is a brief summary of these connections:

- Petersen and Novick’s “conditional probability and its converse” is equivalent to what in ML fairness is variously called separation (Barocas et al., 2018), equalized odds (Hardt et al., 2016), or conditional procedure accuracy (Berk et al., 2017), sometimes expressed as the conditional independence R ⊥ A | Y.
- Cole’s 1973 fairness definition is identical to equality of opportunity (Hardt et al., 2016): P(D = 1 | Y = 1, A = a) is the same for all subgroups a.
- Linn’s 1973 definition is equivalent to predictive parity (Chouldechova, 2017): P(Y = 1 | D = 1, A = a) is the same for all subgroups a.
- Darlington’s criterion (1) is equivalent to sufficiency in the special case where A, R and Y have a multivariate Gaussian distribution. This is because for this special case the partial correlation condition ρ_{AY·R} = 0 is equivalent to Y ⊥ A | R (Baba et al., 2004). In general though, we cannot assume even a one-way implication, since ρ_{AY·R} = 0 does not imply Y ⊥ A | R (see (Vargha et al., 1996) for a counterexample).
- Similarly, Darlington’s criterion (3) is equivalent to separation only in the special case of multivariate Gaussian distributions.
- Darlington’s definition (4) is a relaxation of what is called independence (Barocas et al., 2018) or demographic parity in ML fairness, i.e. R ⊥ A; it is equivalent when A and R have a bivariate Gaussian distribution.
- Guion’s definition, “people with equal probabilities of success on the job have equal probabilities of being hired for the job”, is a special case of Dwork’s (Dwork et al., 2012) individual fairness, with the presupposition that “probability of success on the job” is a construct that can be meaningfully reasoned about.
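The partial-correlation conditions above can be explored numerically. The sketch below (synthetic linear-Gaussian data with arbitrary coefficients) computes ρ_{AY·R} via the standard formula; because A affects Y only through R in this construction, the partial correlation comes out near zero, matching Y ⊥ A | R in the Gaussian case:

```python
import numpy as np

def partial_corr(x, y, z):
    """Partial correlation of x and y controlling for z, via the
    standard formula:
    rho_{xy.z} = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2))."""
    r = np.corrcoef(np.vstack([x, y, z]))
    r_xy, r_xz, r_yz = r[0, 1], r[0, 2], r[1, 2]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

rng = np.random.default_rng(2)
n = 100_000

# Linear-Gaussian model in which A affects Y only through R.
a = rng.normal(size=n)
r_score = 0.6 * a + rng.normal(size=n)
y = 0.7 * r_score + rng.normal(size=n)

rho_ay_given_r = partial_corr(a, y, r_score)
```

Outside the jointly Gaussian case, a zero partial correlation does not guarantee conditional independence, as noted above.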
The fairness literature in both ML and testing has also been motivated by causal considerations (Kusner et al., 2017; Hardt et al., 2016). Darlington (Darlington, 1971) motivates his definition (3) on the basis of a causal relationship between Y and R (since the ability being measured affects performance on the test). However, Hunter and Schmidt (Hunter and Schmidt, 1976) pointed out that in testing scenarios we typically only have a proxy for ability, such as GPA four years later, and it is wrong to draw a causal connection from GPA to a college entrance exam score.
Hardt et al. (Hardt et al., 2016) describe the challenge in building causal models, by considering two distinct models and their consequences and concluding that “no test based only on the target labels, the protected attribute and the score would give different indications for the optimal score in the two scenarios.” This is remarkably reminiscent of Anastasi (Anastasi, 1961), writing in 1961 about test fairness:
No test can eliminate causality. Nor can a test score, however derived, reveal the origin of the behavior it reflects. If certain environmental factors influence behavior, they will also influence those samples of behavior covered by tests. When we use tests to compare different groups, the only question the tests can answer directly is: “How do these groups differ under existing cultural conditions?”
Both the testing fairness and ML fairness literatures have also paid great attention to impossibility results, such as the distinction between group fairness and individual fairness, and the impossibility of obtaining more than one of separation, sufficiency and independence except under special conditions (Thorndike, 1971; Darlington, 1971; Petersen and Novick, 1976; Barocas et al., 2018; Chouldechova, 2017; Kleinberg et al., 2016).
In addition, we see some striking parallels in the framing of fairness in terms of ethical theories, including explicit advocacy for utilitarian approaches.

- Petersen and Novick’s utility-based approaches relate to Corbett-Davies et al.’s framing of the cost of fairness (Corbett-Davies et al., 2017).
- Hunter and Schmidt’s analysis of the value systems underlying fairness criteria is similar in spirit to Friedler et al.’s relation of fairness criteria and different worldviews (Friedler et al., 2016).
3.2. Variable Independence
As briefly mentioned above, modern day ML fairness has categorized fairness definitions in terms of independence of variables, which includes sufficiency and separation (Barocas et al., 2018). Some historical notions of fairness neatly fit into this categorization, but others shed light on further dimensions of fairness criteria. Table 3 summarizes these connections, linking the historical criteria introduced in Section 2 to modern day categories. (Utilitybased criteria are omitted, but will be discussed below.)
Table 3. Historical criteria and their modern ML fairness counterparts.

Historical criterion | ML fairness criterion | Relationship
Guion (1966) | individual | relaxation
Cleary (1968) | sufficiency | when Cleary’s criterion holds for all subgroups, we have equivalence when A, R and Y have a multivariate Gaussian distribution
Einhorn and Bass (1971) | sufficiency | both involve the probability of Y conditioned on R, but Einhorn and Bass are only concerned with the conditional likelihood at the decision threshold
Thorndike (1971) | — | —
Darlington (1971) (1) | sufficiency | equivalent when the variables have a multivariate Gaussian distribution
Darlington (1971) (2) | — | —
Darlington (1971) (3) | separation | equivalent when the variables have a multivariate Gaussian distribution
Darlington (1971) (4) | independence | equivalent when the variables have a bivariate Gaussian distribution
Cole (1973) | separation | relaxation (equivalent to equality of opportunity)
Linn (1973) | sufficiency | relaxation (equivalent to predictive parity)
Jones (1973), mean fair | — | —
Jones (1973), at position n | — | —
Jones (1973), general criterion | — | —
Petersen and Novick (1976), conditional probability and its converse | separation | equivalent
Petersen and Novick (1976), equal probability and its converse | sufficiency | equivalent
We find that non-comparative criteria (discussed by Cleary and Jones) do not map onto any of the independence conditions used in ML fairness. Similarly, Thorndike’s criterion and Darlington’s definition (2) have no counterparts that we know of. There are conceptual similarities between Jones’ criteria and the constrained ranking problem described by (Celis et al., 2017), and also between Einhorn and Bass’s criterion and concerns about inframarginality (Simoiu et al., 2017).
For a binary classifier, Thorndike’s 1971 group parity criterion is equivalent to requiring that the ratio of positive predictions to ground truth positives be equal for all subgroups. This ratio has no common name that we could find (unlike, e.g., precision or recall), although (Petersen and Novick, 1976) refer to this as the “Constant Ratio Model”. It is closely related to coverage constraints (Goh et al., 2016), class mass normalization (Zhu et al., 2003) and expectation regularization (Mann and McCallum, 2007). Similar arguments can be made for Darlington’s criterion (2) and for Jones’ “at position n” and “general” criteria. When viewed as a model of subgroup quotas (Hunter and Schmidt, 1976), Thorndike’s criterion is reminiscent of fair division in economics.
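To make the constant ratio concrete, here is a minimal Python sketch (function and variable names are ours, purely illustrative) that computes, per subgroup, the ratio of positive predictions to ground-truth positives:

```python
def constant_ratio(y_true, y_pred, groups):
    """For each subgroup, return (# positive predictions) / (# ground-truth positives).

    Thorndike's criterion asks that this ratio be (approximately) equal
    across subgroups. Assumes binary 0/1 labels and predictions, and that
    every subgroup has at least one ground-truth positive.
    """
    ratios = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        pred_pos = sum(y_pred[i] for i in idx)
        true_pos = sum(y_true[i] for i in idx)
        ratios[g] = pred_pos / true_pos
    return ratios

# Toy example: group "a" has 2 predicted positives and 2 actual positives;
# group "b" has 1 predicted positive and 1 actual positive.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b"]
r = constant_ratio(y_true, y_pred, groups)
```

Here both subgroups have a ratio of 1.0, so this toy classifier satisfies the criterion exactly.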
3.3. Regression and Correlation
In reviewing the history of fairness in testing, it becomes clear that regression models have played a much larger role there than they have in the ML community. Similarly, the use of correlation as a fairness criterion is all but absent in the modern ML fairness literature.
Given that correlation of two variables is a weaker criterion than independence, it is reasonable to ask why one might want a fairness criterion defined in terms of correlations. One practical reason is that calculating correlations is far easier than testing independence. Whereas correlation is a descriptive statistic, and so calculating it requires few assumptions, testing independence requires the use of inferential statistics, which can in general be highly non-trivial (Shah and Peters, 2018).
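As an illustration of this computational asymmetry, a descriptive Pearson correlation needs only a few lines and no distributional assumptions (a toy sketch; the helper name is ours):

```python
import math

def pearson(xs, ys):
    """Plain descriptive Pearson correlation. No inferential machinery is
    needed, which is one practical appeal of correlation-based criteria."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Zero correlation between a group indicator and a score is a (weak)
# necessary condition for independence of score and group: cheap to
# compute, but it does not certify independence.
group_indicator = [0, 0, 1, 1]
score = [0.2, 0.8, 0.8, 0.2]  # uncorrelated with group by symmetry
rho = pearson(group_indicator, score)
```

Note the caveat in the comment: a zero correlation does not certify independence, which is exactly the gap between the two classes of criteria.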
Considering the analogy between model features and test items described in the Introduction, we also know of no ML analogs to Differential Item Functioning. Such analogs might test for bias in model features. Instead, one approach adopted in ML fairness has been the use of adversarial methods to mitigate the effects of features with undesirable correlations with subgroups, e.g., (Beutel et al., 2017; Zhang et al., 2018).
3.4. Model vs. Model Use
Section 2 described how the test literature had competing notions of whether fairness is a property of a test, or of the use of a test. A similar discussion of whether ML models can be judged as fair or unfair independent of a specific use (including a specific model threshold) has been largely implicit or missing in the ML fairness literature. Models are sometimes trained to be “fair” at their default decision threshold (e.g., 0.5), although the use of different thresholds can have a major impact on fairness (Hardt et al., 2016). The ML fairness notion of calibration, i.e., P(Y = 1 | R = r, A = a) = r for all score values r and subgroups a, can be interpreted to be a property of the model rather than of its use, since it does not depend on the choice of decision threshold.
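A minimal, binned sketch of such a threshold-free calibration check (the helper name, the binning scheme, and the toy data are illustrative choices on our part, not a standard API):

```python
from collections import defaultdict

def calibration_gaps(scores, labels, groups, n_bins=2):
    """Per (group, score bin), the gap |mean score - empirical label rate|.
    Calibration depends only on the score distribution, not on any decision
    threshold, so it can be read as a property of the model itself.
    Bin-based estimation is one of several possible choices."""
    stats = defaultdict(lambda: [0.0, 0, 0.0])  # (group, bin) -> [sum_score, n, sum_label]
    for s, y, g in zip(scores, labels, groups):
        b = min(int(s * n_bins), n_bins - 1)
        cell = stats[(g, b)]
        cell[0] += s
        cell[1] += 1
        cell[2] += y
    return {k: abs(v[0] / v[1] - v[2] / v[1]) for k, v in stats.items()}

# Perfectly calibrated toy data: within each bin, the label rate equals
# the mean score, for both subgroups.
scores = [0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75]
labels = [1, 0, 0, 0, 1, 1, 1, 0]
groups = ["a"] * 4 + ["b"] * 4
gaps = calibration_gaps(scores, labels, groups)
```

For this toy data every gap is zero; no decision threshold appears anywhere in the check.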
3.5. Race and Gender
Some work on practically assessing fairness in ML has tackled the problem of using race as a construct. This echoes concerns in the testing literature that stem back to at least 1966: “one stumbles immediately over the scientific difficulty of establishing clear yardsticks by which people can be classified into convenient racial categories” (Guion, 1966). Recent approaches have used Fitzpatrick skin type or unsupervised clustering to avoid racial categorizations (Buolamwini and Gebru, 2018; Ryu et al., 2018). We note that the testing literature of the 1960s and 1970s frequently uses the phrase “cultural fairness” when referring to parity between blacks and whites. Other than Thomas (Thomas, 1973), the test fairness literature of the 1960s and 1970s was typically concerned with race rather than gender (although gender received attention later, e.g., (Willingham and Cole, 2013)). The role of culture in gender identity and gender presentation has seen less consideration in ML fairness, but gender labels raise ethical concerns (Hoffmann, 2017; Hamidi et al., 2018).
Echoing modern sentiments about the difficulty of measuring fairness, earlier courtroom decisions highlighted the impossibility of properly accounting for all factors that influence inequalities. For example, in 1964, an Illinois Fair Employment Practices Commission (FEPC) examiner found that Motorola had discriminated against Leon Myart, a black American, in his application to work at Motorola as an “analyzer and phaser”. The examiner found that the five-minute screening test that Myart took did not account for the inequalities and environmental factors affecting culturally deprived groups. The case was appealed to the Illinois Supreme Court, which found that Myart had actually passed the test, and so declined to rule on the fairness of the test (Ash, 1966).
4. Fairness Gaps
4.1. Fairness and Unfairness
In mapping out earlier fairness approaches and their relationship to ML fairness, some conceptual gaps emerge. One noticeable gap relates to the difference in framing between fairness and unfairness. In earlier work on test fairness, there was a focus on defining measurements in terms of unfair discrimination and unfair bias, which brought with it the problem of uncovering sources of bias (Cleary and Hilton, 1968). In the 1970s, this developed into framings in terms of fairness, and the introduction of fairness criteria similar or identical to ML fairness criteria known today. However, returning to the idea of unfairness suggests several new areas of inquiry, including quantifying different kinds of unfairness and bias (such as content bias, selection system bias, etc., cf. (Jencks, 1998)), and a shift in focus from outcomes to inputs and processes (Cojuharenco and Patient, 2013). Quantifying types of unfairness may not only add to the problems that machine learning can address, but also accords with the realities of sentencing and policing that motivate much of the fairness research today: Individuals seeking justice do so when they believe that something has been unfair.
4.2. Differential Item Functioning
Another gap that becomes clear from the historical perspective is the lack of an analog to Differential Item Functioning (Section 2.4) in current ML fairness research. DIF was used by education professionals as a motivation for investigating causes of bias, and a modern-day analog might include unfairness interpretability in ML models. A direct analog in ML could be to compare P(x | y, a) for different input features x, model outputs y, and subgroups a. For example, when predicting loan repayment, this might involve comparing how income levels differ across subgroups for a given predicted likelihood of repaying the loan.
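A rough sketch of such a DIF-style probe (all names here are hypothetical, chosen for illustration): for records receiving the same model output, compare the mean of one input feature across subgroups, and treat large gaps as a flag for auditing.

```python
def feature_means_by_group(features, outputs, groups, output_value):
    """For records with the same model output, return the mean of one
    input feature per subgroup. Large gaps between subgroups flag a
    feature worth auditing, in the spirit of DIF. Assumes every subgroup
    has at least one record with the given output value."""
    means = {}
    for g in set(groups):
        vals = [x for x, o, gi in zip(features, outputs, groups)
                if gi == g and o == output_value]
        means[g] = sum(vals) / len(vals)
    return means

# Toy loan example: among applicants the model scores as "repay" (output 1),
# does recorded income differ by subgroup?
income = [30, 50, 40, 60, 35, 55]
output = [1, 1, 0, 1, 1, 0]
group = ["a", "a", "a", "b", "b", "b"]
m = feature_means_by_group(income, output, group, output_value=1)
```

In this toy data, applicants predicted to repay average an income of 40 in group "a" and 47.5 in group "b", the kind of gap a DIF-style audit would surface for further investigation.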
4.3. Target Variable / Model Score Relationship
Another gap concerns the ways in which the model (test) score R and the target variable Y are related to each other. In many cases in ML fairness and test fairness, there are correspondences between pairs of criteria which differ only in the roles played by the model (test) score R and the target variable Y. That is, one criterion can be transformed into another by swapping the symbols R and Y; for example, separation (R ⊥ A | Y) can be transformed into sufficiency (Y ⊥ A | R). In this section we will refer to this type of correspondence as “converse”, i.e., separation is the converse of sufficiency.
When viewed in this light, some asymmetries stand out:

Converse Cleary criterion: Cleary’s criterion considers the case of a regression model that predicts a target variable Y given test score R. One could also consider the converse regression model (mentioned in passing by (Thorndike, 1971)), which predicts model score R from ground truth Y, as an instrument for detecting bias. (The Cleary regression model and its converse are distinct except in the special case where the variables have been standardized.) The converse Cleary condition would deem a test unfair for a subgroup if the converse regression line has consistent positive errors, i.e., if for each given level of ground truth ability, the subgroup’s test scores are higher than the converse regression line predicts.

Converse calibration: In a regression scenario, the calibration condition can be rewritten as E[Y | R = r, A = a] = r, or equivalently E[Y − R | R, A] = 0. The converse calibration condition is therefore E[R − Y | Y, A] = 0 for all subgroups. In other words, for each subgroup a and level of ground truth performance y, the expected error of the score R as a prediction of the value y is zero.
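For binary ground truth, converse calibration can be checked by averaging scores within each (subgroup, label) cell; a toy sketch (helper name and data are ours):

```python
def converse_calibration_gaps(scores, labels, groups):
    """Converse calibration check for binary Y: for each (subgroup, label y)
    cell, the gap |E[R | Y=y, A=a] - y|, estimated by the cell's mean score.
    Converse calibration holds when every gap is zero."""
    gaps = {}
    for g in set(groups):
        for y in set(labels):
            vals = [s for s, yi, gi in zip(scores, labels, groups)
                    if gi == g and yi == y]
            if vals:
                gaps[(g, y)] = abs(sum(vals) / len(vals) - y)
    return gaps

# Degenerate but instructive toy data: scores exactly match labels,
# so E[R | Y=y, A=a] = y in every cell and all gaps are zero.
scores = [0.0, 0.0, 1.0, 1.0]
labels = [0, 0, 1, 1]
groups = ["a", "b", "a", "b"]
gaps = converse_calibration_gaps(scores, labels, groups)
```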
We point out these overlooked concepts not to advocate for their use, but to map out the geography of concepts related to fairness more completely.
4.4. Compromises
Darlington (Darlington, 1971) points out that Thorndike’s criterion is a compromise between one criterion related to sufficiency and one related to separation (see Section 2.2 and Tables 2 and 3). In general, a space of compromises is possible; in terms of correlations, this might be modeled using a parameter k:

(1)  ρ(A, R) = ρ(A, Y) · ρ(R, Y)^k

where values of k of 1, 0, and −1 imply Darlington’s definitions (1), (2) and (3), respectively.
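A numeric sanity check of this correlation compromise, assuming it takes the form ρ(A, R) = ρ(A, Y) · ρ(R, Y)^k (this form, and the helper name, are assumptions on our part):

```python
def darlington_compromise_residual(rho_ar, rho_ay, rho_ry, k):
    """Residual of the interpolated criterion rho(A,R) = rho(A,Y) * rho(R,Y)**k.
    A residual of zero means the compromise criterion holds for that k;
    k = 1, 0, -1 correspond to Darlington's definitions (1), (2), (3)."""
    return rho_ar - rho_ay * rho_ry ** k

# With rho(A,Y) = 0.3 and rho(R,Y) = 0.5, the k = 1 criterion asks for
# rho(A,R) = 0.15, while the k = 0 criterion asks for rho(A,R) = 0.3.
resid_k1 = darlington_compromise_residual(0.15, 0.3, 0.5, 1)
resid_k0 = darlington_compromise_residual(0.30, 0.3, 0.5, 0)
```

The example makes concrete how different values of k demand different admissible correlations between the demographic variable and the score.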
This also suggests exploring interpolations between the contrasting sufficiency and separation criteria. For example, one way of parameterizing their interpolation is in terms of binary confusion matrix outcomes.
Definition 4.1 ((a, b)-Thorndikian fairness).
A binary classifier satisfies (a, b)-Thorndikian fairness with respect to demographic variable A if both

TP / (TP + a·FP + b·FN) is constant for all values of A, and

TN / (TN + a·FN + b·FP) is constant for all values of A.

Note that (1, 0)-Thorndikian fairness is equivalent to sufficiency, while (0, 1)-Thorndikian fairness is equivalent to separation.
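One possible reading of the two conditions, with the quantities TP/(TP + a·FP + b·FN) and TN/(TN + a·FN + b·FP) (a reconstruction on our part, chosen so that (1, 0) yields PPV/NPV and (0, 1) yields TPR/TNR), can be sketched from per-subgroup confusion counts:

```python
def thorndikian_ratios(conf_by_group, a, b):
    """Per-subgroup values of the two interpolated quantities
    TP/(TP + a*FP + b*FN) and TN/(TN + a*FN + b*FP).
    (a, b) = (1, 0) gives PPV and NPV (sufficiency-style quantities);
    (a, b) = (0, 1) gives TPR and TNR (separation-style quantities).
    conf_by_group maps group -> (TP, FP, FN, TN)."""
    out = {}
    for g, (tp, fp, fn, tn) in conf_by_group.items():
        out[g] = (tp / (tp + a * fp + b * fn),
                  tn / (tn + a * fn + b * fp))
    return out

# Two subgroups with confusion counts chosen so both conditions hold
# at both parameter settings.
conf = {"x": (8, 2, 4, 6), "y": (4, 1, 2, 3)}
ppv_npv = thorndikian_ratios(conf, 1, 0)  # sufficiency-style quantities
tpr_tnr = thorndikian_ratios(conf, 0, 1)  # separation-style quantities
```

Here both subgroups agree at both parameter settings, so this toy classifier satisfies both the (1, 0) and (0, 1) endpoints simultaneously.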
Petersen and Novick (Petersen and Novick, 1976) showed that Thorndikian fairness requires that either (i) for each subgroup, the positive class is predicted in proportion to its ground truth rate; or (ii) every subgroup has the same ground truth rate of positives. We can also consider relaxations of Thorndikian fairness in which only one of the definition’s two conditions is required to hold. For example, requiring only the first condition gives us a way of parameterizing compromises between equality of opportunity and predictive parity.
Our goal here is not to advocate for this particular model of compromise between separation and sufficiency. Rather, since separation and sufficiency criteria can encode competing interests of different parties, our goal is to suggest that ML fairness consider how to encode notions of compromise, which in some scenarios might relate to the public’s notion of fairness. We propose that the economics literature on fair division might provide some useful ideas, as has also been suggested by (Zafar et al., 2017). However, we do heed Darlington’s (Darlington, 1971) warning that “a compromise may end up satisfying nobody; psychometricians are not in the habit of agreeing on important definitions or theorems by compromise.” This statement may be equally true of ML practitioners.
5. Discussion
This short review of historical connections in fairness suggests several concrete steps forward for future research in ML fairness:

Developing methods to explain and reduce model unfairness by focusing on the causes of unfairness. To paraphrase Darlington (Darlington, 1971), asking “What can be said about models that discriminate among cultures at various levels?” yields more actionable insights than asking “What is a fair model?” This is related to research on causality in ML fairness (see Section 3.1), but includes examining full causal pathways and processes that interact well before decision time. In other words: what causes the disparities?

Building from earlier insights of 1970s researchers (Darlington, 1971; Hunter and Schmidt, 1976; Linn, 1976) to incorporate quantitative factors for the balance between fairness goals and other goals, such as a value system or a system of ethics. This will likely include clearly articulating assumptions and choices, as recently proposed in (Mitchell et al., 2018).

Diving more deeply into the question of how subgroups are defined, suggested as early as 1966 (Guion, 1966), including questioning whether subgroups should be treated as discrete categories at all, and how intersectionality can be modeled. This might include, for example, how to quantify fairness along one dimension (e.g., age) conditioned on another dimension (e.g., skin tone), as recent work has begun to address (Kearns et al., 2018; Foulds and Pan, 2018).
6. Conclusions
The spike in interest in test fairness in the 1960s arose during a time of social and political upheaval, with quantitative definitions catalyzed in part by U.S. federal anti-discrimination legislation in the domains of education and employment. The rise of interest in fairness today has corresponded with public interest in the use of machine learning in criminal sentencing and predictive policing, including discussions around COMPAS (Larson et al., 2016; Dieterich et al., 2016; Corbett-Davies et al., 2016) and PredPol (O’Neil, 2016; Ensign et al., 2017). Each era gave rise to its own notions of fairness and relevant subgroups, with overlapping ideas that are similar or identical. In the 1960s and 1970s, the fascination with determining fairness ultimately died out as the work became less tied to the practical needs of society, politics and the law, and more tied to the goal of unambiguously identifying fairness.
We conclude by reflecting on what further lessons the history of test fairness may have for the future of ML fairness. Careful attention should be paid to legal and public concerns about fairness. The experiences of the test fairness field suggest that in the coming years, courts may start ruling on the fairness of ML models. If technical definitions of fairness stray too far from the public’s perceptions of fairness, then the political will to use scientific contributions in advance of public policy may be difficult to obtain. Perhaps ML practitioners should cautiously take heed from Cole and Zieky’s (Cole and Zieky, 2001) portrayal of developments in their field:
Members of the public continue to see apparently inappropriate interpretations of test scores and misuses of test results. They see this area as a primary fairness concern. However, the measurement profession has struggled to understand the nature of its responsibility in this area, and has generally not acted strongly against instances of misuse, nor has it acted in concert to attack misuses.
We welcome broader debate on fairness that includes both technical and cultural causes, how the context and use of ML models further influence potential unfairness, and the suitability of the variables used in fairness research for capturing systemic unfairness. We agree with Linn’s (Linn, 1976) argument from 1976 that values encoded by technical definitions should be made explicit. By concretely relating fairness debates to ethical theories and value systems (as done by (Hunter and Schmidt, 1976; Zwick and Dorans, 2016)), we can make discussions more accessible to the general public and to researchers of other disciplines, as well as helping our own ML Fairness community to be more attuned to our own implicit cultural biases.
7. Acknowledgements
Thank you to Moritz Hardt and Shira Mitchell for invaluable conversations and insight.
References
 (1)
 Anastasi (1961) Anne Anastasi. 1961. Psychological tests: Uses and abuses. Teachers College Record (1961).
 Ash (1966) Philip Ash. 1966. The implications of the Civil Rights Act of 1964 for psychological assessment in industry. American Psychologist 21, 8 (1966), 797.
 Baba et al. (2004) Kunihiro Baba, Ritei Shibata, and Masaaki Sibuya. 2004. Partial correlation and conditional correlation as measures of conditional independence. Australian & New Zealand Journal of Statistics 46, 4 (2004), 657–664.
 Barocas et al. (2018) Solon Barocas, Moritz Hardt, and Arvind Narayanan. 2018. Fairness in Machine Learning. http://fairmlbook.org. (2018).
 Berk et al. (2017) Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. 2017. Fairness in criminal justice risk assessments: the state of the art. arXiv preprint arXiv:1703.09207 (2017).
 Beutel et al. (2017) Alex Beutel, Jilin Chen, Zhe Zhao, and Ed H. Chi. 2017. Data Decisions and Theoretical Implications when Adversarially Learning Fair Representations. CoRR abs/1707.00075 (2017). arXiv:1707.00075 http://arxiv.org/abs/1707.00075
 Buolamwini and Gebru (2018) Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency. 77–91.
 Celis et al. (2017) L Elisa Celis, Damian Straszak, and Nisheeth K Vishnoi. 2017. Ranking with fairness constraints. arXiv preprint arXiv:1704.06840 (2017).
 Chouldechova (2017) Alexandra Chouldechova. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data 5, 2 (2017), 153–163.
 Cleary (1966) T. Anne Cleary. 1966. Test bias: Validity of the Scholastic Aptitude Test for Negro and white students in integrated colleges. ETS Research Bulletin Series 1966, 2 (1966), i–23.
 Cleary (1968) T. Anne Cleary. 1968. Test bias: Prediction of grades of Negro and white students in integrated colleges. Journal of Educational Measurement 5, 2 (1968), 115–124.
 Cleary and Hilton (1968) T Anne Cleary and Thomas L Hilton. 1968. An investigation of item bias. Educational and Psychological Measurement 28, 1 (1968), 61–75.
 Cojuharenco and Patient (2013) Irina Cojuharenco and David Patient. 2013. Workplace fairness versus unfairness: Examining the differential salience of facets of organizational justice. Journal of Occupational and Organizational Psychology 86, 3 (2013), 371–393.
 Cole (1973) Nancy S Cole. 1973. Bias in selection. Journal of educational measurement 10, 4 (1973), 237–255.
 Cole and Zieky (2001) Nancy S Cole and Michael J Zieky. 2001. The new faces of fairness. Journal of Educational Measurement 38, 4 (2001), 369–382.
 Corbett-Davies et al. (2016) Sam Corbett-Davies, Emma Pierson, Avi Feller, and Sharad Goel. 2016. A computer program used for bail and sentencing decisions was labeled biased against blacks. It’s actually not that clear. https://www.washingtonpost.com/news/monkeycage/wp/2016/10/17/cananalgorithmberacistouranalysisismorecautiousthanpropublicas/. (2016).
 Corbett-Davies et al. (2017) Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. 2017. Algorithmic decision making and the cost of fairness. CoRR abs/1701.08230 (2017). arXiv:1701.08230 http://arxiv.org/abs/1701.08230
 Council et al. (1989) National Research Council et al. 1989. Fairness in employment testing: Validity generalization, minority issues, and the General Aptitude Test Battery. National Academies Press.
 Darlington (1971) Richard B Darlington. 1971. Another Look at Cultural Fairness. Journal of Educational Measurement 8, 2 (1971), 71–82.
 Dieterich et al. (2016) William Dieterich, Christina Mendoza, and Tim Brennan. 2016. COMPAS risk scales: Demonstrating accuracy equity and predictive parity. http://go.volarisgroup.com/rs/430MBX989/images/ProPublica_Commentary_Final_070616.pdf. (2016).
 Dorans (2017) Neil J Dorans. 2017. Contributions to the Quantitative Assessment of Item, Test, and Score Fairness. In Advancing Human Assessment. Springer, 201–230.
 Dorans and Holland (1992) Neil J Dorans and Paul W Holland. 1992. DIF Detection and Description: Mantel-Haenszel and Standardization. ETS Research Report Series 1992, 1 (1992), i–40.
 Dwork et al. (2012) Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness Through Awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS ’12). ACM, New York, NY, USA, 214–226. https://doi.org/10.1145/2090236.2090255
 Einhorn and Bass (1971) Hillel J Einhorn and Alan R Bass. 1971. Methodological considerations relevant to discrimination in employment testing. Psychological Bulletin 75, 4 (1971), 261.
 Ensign et al. (2017) Danielle Ensign, Sorelle A Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. 2017. Runaway feedback loops in predictive policing. arXiv preprint arXiv:1706.09847 (2017).
 Flaugher (1974) Ronald L Flaugher. 1974. Bias in Testing: A Review and Discussion. TM Report No. 36. Technical Report. Educational Testing Services.
 Foulds and Pan (2018) James R. Foulds and Shimei Pan. 2018. An Intersectional Definition of Fairness. CoRR abs/1807.08362 (2018).
 Friedler et al. (2016) Sorelle A Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian. 2016. On the (im) possibility of fairness. arXiv preprint arXiv:1609.07236 (2016).
 Goh et al. (2016) Gabriel Goh, Andrew Cotter, Maya Gupta, and Michael P Friedlander. 2016. Satisfying realworld goals with dataset constraints. In Advances in Neural Information Processing Systems. 2415–2423.
 Guion (1966) Robert M Guion. 1966. Employment tests and discriminatory hiring. Industrial Relations: A Journal of Economy and Society 5, 2 (1966), 20–37.
 Hamidi et al. (2018) Foad Hamidi, Morgan Klaus Scheuerman, and Stacy M Branham. 2018. Gender Recognition or Gender Reductionism?: The Social Implications of Embedded Gender Recognition Systems. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 8.
 Hardt et al. (2016) Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of Opportunity in Supervised Learning. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., 3315–3323. http://papers.nips.cc/paper/6374equalityofopportunityinsupervisedlearning.pdf
 Hoffmann (2017) Anna Lauren Hoffmann. 2017. Data, technology, and gender: Thinking about (and from) trans lives. In Spaces for the Future. Routledge, 15–25.
 Hunter and Schmidt (1976) John E Hunter and Frank L Schmidt. 1976. Critical analysis of the statistical and ethical implications of various definitions of test bias. Psychological Bulletin 83, 6 (1976), 1053.
 Jencks (1998) Christopher Jencks. 1998. Racial bias in testing. The BlackWhite test score gap 55 (1998), 84.
 Jensen (1980) Arthur R Jensen. 1980. Bias in mental testing. (1980).
 Jones (1973) Marshall B Jones. 1973. Moderated regression and equal opportunity. Educational and Psychological Measurement 33, 3 (1973), 591–602.
 Karabel (2006) Jerome Karabel. 2006. The chosen: The hidden history of admission and exclusion at Harvard, Yale, and Princeton. Houghton Mifflin Harcourt.
 Kearns et al. (2018) Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. 2018. Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness. In ICML.
 Kleinberg et al. (2016) Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. 2016. Inherent tradeoffs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807 (2016).
 Kusner et al. (2017) Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfactual fairness. In Advances in Neural Information Processing Systems. 4066–4076.
 Larson et al. (2016) Jeff Larson, Surya Mattu, Lauren Kirchner, and Julia Angwin. 2016. How We Analyzed the COMPAS Recidivism Algorithm. https://www.propublica.org/article/howweanalyzedthecompasrecidivismalgorithm. (2016).
 Linn (1973) Robert L Linn. 1973. Fair test use in selection. Review of Educational Research 43, 2 (1973), 139–161.
 Linn (1976) Robert L Linn. 1976. In search of fair selection procedures. Journal of Educational Measurement 13, 1 (1976), 53–58.
 Mann and McCallum (2007) Gideon S Mann and Andrew McCallum. 2007. Simple, robust, scalable semi-supervised learning via expectation regularization. In Proceedings of the 24th international conference on Machine learning. ACM, 593–600.
 Mitchell et al. (2018) Shira Mitchell, Eric Potash, and Solon Barocas. 2018. PredictionBased Decisions and Fairness: A Catalogue of Choices, Assumptions, and Definitions. arXiv:1811.07867 (2018).
 NCME (1976) National Council on Measurement in Education NCME (Ed.). 1976. Journal of Educational Measurement 13, 1 (1976).
 Novick and Petersen (1976) Melvin R Novick and Nancy S Petersen. 1976. Towards equalizing educational and employment opportunity. Journal of Educational Measurement 13, 1 (1976), 77–88.
 O’Neil (2016) Cathy O’Neil. 2016. Weapons of math destruction: How big data increases inequality and threatens democracy. Broadway Books.
 Penfield (2016) Randall D Penfield. 2016. Fairness in Test Scoring. In Fairness in Educational Assessment and Measurement. Routledge, 71–92.
 Petersen (1976) Nancy S Petersen. 1976. An expected utility model for “optimal” selection. Journal of Educational Statistics 1, 4 (1976), 333–358.
 Petersen and Novick (1976) Nancy S Petersen and Melvin R Novick. 1976. An evaluation of some models for culturefair selection. Journal of Educational Measurement 13, 1 (1976), 3–29.
 Phillips (2016) S E Phillips. 2016. Legal Aspects of Test Fairness. In Fairness in Educational Assessment and Measurement, Neil J Dorans and Linda L Cook (Eds.). Routledge, 239–268.
 Rice and Baptiste (1994) Mitchell F Rice and Brad Baptiste. 1994. Race Norming, Validity Generalization, and Employment Testing. Handbook of Public Personnel Administration 58 (1994), 451.
 Ryu et al. (2018) Hee Jung Ryu, Hartwig Adam, and Margaret Mitchell. 2018. InclusiveFaceNet: Improving Face Attribute Detection with Race and Gender Diversity. In Workshop on Fairness, Accountability and Transparency in Machine Learning.
 Samuda (1998) Ronald J Samuda. 1998. Psychological testing of American minorities: Issues and consequences. Vol. 10. Sage.
 Sawyer et al. (1976) Richard L Sawyer, Nancy S Cole, and James WL Cole. 1976. Utilities and the issue of fairness in a decision theoretic model for selection. Journal of Educational Measurement 13, 1 (1976), 59–76.
 Scheuneman (1979) Janice Scheuneman. 1979. A method of assessing bias in test items. Journal of Educational Measurement 16, 3 (1979), 143–152.
 Shah and Peters (2018) Rajen D Shah and Jonas Peters. 2018. The Hardness of Conditional Independence Testing and the Generalised Covariance Measure. arXiv preprint arXiv:1804.07203 (2018).
 Simoiu et al. (2017) Camelia Simoiu, Sam Corbett-Davies, Sharad Goel, et al. 2017. The problem of inframarginality in outcome tests for discrimination. The Annals of Applied Statistics 11, 3 (2017), 1193–1216.
 Thomas (1973) Charles L Thomas. 1973. The Overprediction Phenomenon among Black Collegians: Some Preliminary Considerations. (1973).
 Thorndike (1971) Robert L Thorndike. 1971. Concepts of culturefairness. Journal of Educational Measurement 8, 2 (1971), 63–70.
 Vargha et al. (1996) András Vargha, Tamas Rudas, Harold D Delaney, and Scott E Maxwell. 1996. Dichotomization, partial correlation, and conditional independence. Journal of Educational and Behavioral statistics 21, 3 (1996), 264–282.
 Vars and Bowen (1998) Frederick E Vars and William G Bowen. 1998. Scholastic aptitude test scores, race, and academic performance in selective colleges and universities. The BlackWhite test score gap (1998), 457–79.
 West-Faulcon (2011) Kimberly West-Faulcon. 2011. Fairness Feuds: Competing Conceptions of Title VII Discriminatory Testing. Wake Forest L. Rev. 46 (2011), 1035.
 Williams et al. (1980) Robert L Williams, William Dotson, Patricia Don, and Willie S Williams. 1980. The war against testing: A current status report. The Journal of Negro Education 49, 3 (1980), 263–273.
 Willingham and Cole (2013) Warren W Willingham and Nancy S Cole. 2013. Gender and fair assessment. Routledge.
 Zafar et al. (2017) Muhammad Bilal Zafar, Isabel Valera, Manuel Rodriguez, Krishna Gummadi, and Adrian Weller. 2017. From parity to preference-based notions of fairness in classification. In Advances in Neural Information Processing Systems. 229–239.
 Zhang et al. (2018) Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. 2018. Mitigating Unwanted Biases with Adversarial Learning. (2018).
 Zhu et al. (2003) Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. 2003. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International conference on Machine learning (ICML-03). 912–919.
 Zwick and Dorans (2016) Rebecca Zwick and Neil J Dorans. 2016. Philosophical Perspectives on Fairness in Educational Assessment. In Fairness in Educational Assessment and Measurement, Neil J Dorans and Linda L Cook (Eds.). Routledge, 267–281.
Appendix A: Additional definitions of test fairness
This appendix provides some details of fairness definitions included in Table 2 that were not introduced in the text of Section 2.
Einhorn and Bass
In 1971, Einhorn and Bass (Einhorn and Bass, 1971) noted that even if Cleary’s criterion is satisfied, different rates of false positives and false negatives may be achieved for different subgroups, due to differences in the standard errors of estimate for the two subgroups. That is, differences in variability around the common regression line lead to different false positive and false negative rates. To address this, they propose a criterion based on achieving an equal false discovery rate, or as they put it, “designated risk”, at the decision boundary. That is, P(Y = 1 | R = c, A = a) is constant for all subgroups a, where c is the decision threshold.
Darlington’s “culturally optimum”
Darlington (Darlington, 1971) proposes that the subjective value that one places on test validity (related to accuracy) and on diversity can be scenario-specific. He proposes a technique for eliciting these value judgements, leading to a variable k which measures the amount of trade-off in validity that is acceptable in order to increase diversity. He proposes that the “culturally optimum” test is one that maximizes ρ(R, Y) − k·ρ(A, R).
Jones
In 1973, Jones (Jones, 1973) proposed a “general standard” of fairness that is related to Thorndike’s (and hence also related to quota-based definitions of fairness). In Jones’ criterion, candidates are ranked in descending order both by test score and by ground truth. If an equal proportion of candidates from the subgroup are present in the top n of both ranked lists, then the test is fair “at position n”. Jones’ “general standard” of fairness requires that this hold for all values of n. Jones assumes a regression model relating test scores to ground truth, and also defines a weaker “mean-fair” criterion for a subgroup: that “the group’s average predicted score equals its average performance score on the [ground truth].”
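A toy sketch of the “fair at position n” check (the helper name, the toy data, and tie-breaking by index are our illustrative choices):

```python
def fair_at_position(test_scores, true_scores, groups, target_group, n):
    """Jones-style check: is the proportion of `target_group` candidates the
    same in the top n ranked by test score and the top n ranked by ground
    truth? Jones' general standard would require this for every n."""
    by_test = sorted(range(len(test_scores)), key=lambda i: -test_scores[i])[:n]
    by_true = sorted(range(len(true_scores)), key=lambda i: -true_scores[i])[:n]
    prop = lambda idx: sum(groups[i] == target_group for i in idx) / n
    return prop(by_test) == prop(by_true)

# Four candidates: the two rankings disagree at the very top (n = 1)
# but agree on subgroup proportions in the top two (n = 2).
test_scores = [0.9, 0.8, 0.7, 0.6]
true_scores = [0.6, 0.9, 0.8, 0.7]
groups = ["a", "b", "a", "b"]
ok_at_2 = fair_at_position(test_scores, true_scores, groups, "a", 2)
ok_at_1 = fair_at_position(test_scores, true_scores, groups, "a", 1)
```

This makes the position-dependence of Jones’ criterion concrete: the toy test is fair at position 2 but not at position 1, so it fails the general standard.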