Learning Fair Scoring Functions: Fairness Definitions, Algorithms and Generalization Bounds for Bipartite Ranking


Many applications of artificial intelligence, ranging from credit lending to the design of medical diagnosis support tools through recidivism prediction, involve scoring individuals using a learned function of their attributes. These predictive risk scores are used to rank a set of people, and/or take individual decisions about them based on whether the score exceeds a certain threshold that may depend on the context in which the decision is taken. The level of delegation granted to such systems will heavily depend on how questions of fairness can be answered. While this concern has received a lot of attention in the classification setup, the design of relevant fairness constraints for the problem of learning scoring functions remains largely unexplored. In this paper, we propose a flexible approach to group fairness for the scoring problem with binary labeled data, a standard learning task referred to as bipartite ranking. We argue that the functional nature of the ROC curve, the gold standard measuring ranking performance in this context, leads to several possible ways of formulating fairness constraints. We introduce general classes of fairness conditions in bipartite ranking and establish generalization bounds for scoring rules learned under such constraints. Beyond the theoretical formulation and results, we design practical learning algorithms and illustrate our approach with numerical experiments.


1 Introduction

With the availability of data at ever finer granularity through the Internet-of-Things and the development of technological bricks to efficiently store and process this data, the infatuation with machine learning and artificial intelligence is spreading to nearly all fields (science, transportation, energy, medicine, security, banking, insurance, commerce, etc.). Expectations are high. AI is supposed to allow for the development of personalized medicine that will adapt a treatment to the patient’s genetic traits. Autonomous vehicles will be safer and be in service for longer. There is no denying the opportunities, and we can rightfully hope for an increasing number of successful deployments in the near future. However, AI will keep its promises only if certain issues are addressed. In particular, machine learning systems that make significant decisions for humans, regarding for instance credit lending in the banking sector, diagnosis in medicine or recidivism prediction in criminal justice (see Chen, 2018; Deo, 2015; Rudin et al., 2018), should guarantee that they do not penalize certain groups of individuals.

Hence, stimulated by the societal demand, notions of fairness in machine learning and guarantees that they can be fulfilled by decision-making models trained under appropriate constraints have recently been the subject of a good deal of attention in the literature, see e.g. Dwork et al. (2012) or Kleinberg & Raghavan (2017) among others. Fairness constraints are generally modeled by means of a (qualitative) sensitive variable, indicating membership to a certain group (e.g., ethnicity, gender). The vast majority of the work dedicated to algorithmic fairness in machine learning focuses on binary classification, the flagship problem in statistical learning theory. In this context, fairness constraints force the classifiers to have the same true positive rate (or false positive rate) across the sensitive groups. For instance, Hardt & Srebro (2016) and Pleiss et al. (2017) propose to modify a pre-trained classifier in order to fulfill such constraints without deteriorating classification performance. Other work incorporates fairness constraints in the learning stage (see e.g. Agarwal et al., 2017; Woodworth et al., 2017; Zafar et al., 2017a, b, 2019; Menon & Williamson, 2018; Bechavod & Ligett, 2018, among others). Statistical guarantees (in the form of generalization bounds) for classifiers obtained through empirical risk minimization under fairness constraints are established in Donini et al. (2018).

The present paper is also devoted to algorithmic fairness, but for a different problem: namely, learning scoring functions from binary labeled data. This statistical learning problem, usually referred to as bipartite ranking, is of considerable importance in applications. It covers in particular tasks such as credit scoring in banking, pathology scoring in medicine or recidivism scoring in criminal justice, for which fairness requirements are a major concern (Kallus & Zhou, 2019). While it can be formulated in the same probabilistic framework as binary classification, bipartite ranking is not a local learning problem: the goal is not to guess whether a binary label is positive or negative from an input observation but to rank any collection of observations by means of a scoring function so that observations with positive labels are ranked higher with large probability. Due to the global nature of the task, evaluating the performance is itself a challenge. The gold standard measure, the ROC curve, is functional: it is the PP-plot of the false positive rate vs the true positive rate (the higher the curve, the more accurate the ranking induced by the scoring function). Sup-norm optimization of the ROC curve has been investigated in Clémençon & Vayatis (2009, 2010), while most of the literature focuses on the maximization of scalar summaries of the ROC curve such as the AUC criterion (see e.g. Agarwal et al., 2005; Clémençon et al., 2008; Zhao et al., 2011) or alternative measures (Rudin, 2006; Clémençon & Vayatis, 2007; Menon & Williamson, 2016).

We propose a thorough study of fairness in bipartite ranking, where the goal is to guarantee that sensitive variables have little impact on the rankings induced by a scoring function. Similarly to ranking performance, there are various possible options to measure the fairness of a scoring function. As a first step, we introduce a general family of AUC-based fairness constraints which encompasses recently proposed notions (Borkan et al., 2019; Beutel et al., 2019; Kallus & Zhou, 2019) in a unified framework. However, we argue that AUC criteria may not always be appropriate insofar as two ROC curves with very different shapes may have exactly the same AUC. This motivates our design of richer definitions of fairness for scoring functions related to the ROC curves themselves. Crucially, these definitions have strong implications for fair classification: classifiers obtained by thresholding such fair scoring functions approximately satisfy definitions of classification fairness for a wide range of thresholds. We establish the first generalization bounds for scoring functions learned under AUC and ROC-based fairness constraints, following in the footsteps of Donini et al. (2018) for fair empirical risk minimizers in classification. Beyond our theoretical analysis, we propose training algorithms based on gradient descent and illustrate the practical relevance of our approach on synthetic and real datasets.

The rest of the paper is organized as follows. In Section 2, we briefly recall the key concepts of bipartite ranking and review the related work in fairness for classification and ranking. In Section 3, we introduce our general family of AUC-based fairness constraints and use it to formulate optimization problems and statistical guarantees for learning fair scoring functions. The limitations of AUC-based fairness constraints are discussed in Section 4, leading to the design of richer ROC-based (functional) fairness definitions for which we also provide generalization bounds. Finally, Section 5 presents illustrative numerical experiments on synthetic and real data, and we conclude in Section 6. Due to space limitations, technical details and additional experiments can be found in the supplementary material.

2 Background and Related Work

In this section, we introduce the main concepts involved in the subsequent analysis. We start by introducing the probabilistic framework we consider. We then recall the problem of bipartite ranking and its connections to ROC analysis, and briefly discuss the formalization of fairness in the context of classification. Finally, we review the related work in fairness for ranking, focusing on relevant AUC-based fairness constraints introduced in prior work.

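Here and throughout, the indicator function of any event $\mathcal{E}$ is denoted by $\mathbb{I}\{\mathcal{E}\}$ and the pseudo-inverse of any cumulative distribution function (c.d.f.) $F$ on $\mathbb{R}$ by $F^{-1}(u) = \inf\{v \in \mathbb{R} : F(v) \ge u\}$.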

2.1 Probabilistic Framework

Let $X$ and $Y$ be two random variables: $Y$ denotes the binary output label (taking values in $\{-1, +1\}$) and $X$ denotes the input features, taking values in a feature space $\mathcal{X} \subset \mathbb{R}^d$ with $d \ge 1$ and modeling some information hopefully useful to predict $Y$. For convenience, we introduce the proportion of positive instances $p := P\{Y = +1\}$, as well as $G$ and $H$, the conditional distributions of $X$ given $Y = +1$ and $Y = -1$ respectively. The joint distribution of $(X, Y)$ is fully determined by the triplet $(p, G, H)$. Another way to specify the distribution of $(X, Y)$ is through the pair $(\mu, \eta)$ where $\mu$ denotes the marginal distribution of $X$ and $\eta$ the regression function $\eta(x) := P\{Y = +1 \mid X = x\}$. Equipped with these notations, one may write $\eta(x) = p\,(dG/dH)(x) / (1 - p + p\,(dG/dH)(x))$ and $\mu = pG + (1-p)H$.

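In the context of fairness, we consider a third random variable $Z$ which denotes the sensitive attribute taking values in $\{0, 1\}$. The pair $(X, Y)$ is said to belong to salient group $0$ (resp. $1$) when $Z = 0$ (resp. $Z = 1$). The distribution of the triplet $(X, Y, Z)$ can be expressed as a mixture of the distributions of $(X, Y) \mid Z = z$. Following the conventions described above, we introduce the quantities $p_z, G^{(z)}, H^{(z)}$ as well as $\mu^{(z)}, \eta^{(z)}$. For instance, $p_0 = P\{Y = +1 \mid Z = 0\}$ and the distribution of $X \mid Y = +1, Z = 0$ is written $G^{(0)}$, i.e. for $A \subset \mathcal{X}$, $G^{(0)}(A) = P\{X \in A \mid Y = +1, Z = 0\}$. We denote the probability of belonging to group $z$ by $q_z := P\{Z = z\}$, with $q_0 = 1 - q_1$.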

2.2 Bipartite Ranking

The goal of bipartite ranking is to learn an order relationship on $\mathcal{X}$ for which positive instances are ranked higher than negative ones. This order is defined by transporting the natural order on the real line to the feature space through a scoring rule (or scorer) $s: \mathcal{X} \to \mathbb{R}$. Given a distribution $F$ over $\mathcal{X}$ and a scorer $s$, we denote by $F_s$ the cumulative distribution function of $s(X)$ when $X$ follows $F$. Specifically:

$$G_s(t) := P\{s(X) \le t \mid Y = +1\}, \qquad H_s(t) := P\{s(X) \le t \mid Y = -1\}.$$

ROC analysis. ROC curves are widely used to visualize the dissimilarity between two one-dimensional distributions in a large variety of applications such as anomaly detection, medical diagnosis, information retrieval, etc.

Definition 1.

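(ROC curve) Let $g$ and $h$ be two cumulative distribution functions on $\mathbb{R}$. The ROC curve related to the distributions $g$ and $h$ is the graph of the mapping:

$$\mathrm{ROC}_{h,g}: \alpha \in [0,1] \mapsto 1 - g \circ h^{-1}(1 - \alpha).$$

When $g$ and $h$ are continuous, it can also be defined as the parametric curve $t \in \mathbb{R} \mapsto (1 - h(t), 1 - g(t))$.

The distance of $\mathrm{ROC}_{h,g}$ to the diagonal conveniently quantifies the deviation from the homogeneous case, leading to the classic area under the ROC curve (AUC) criterion:

$$\mathrm{AUC}_{h,g} := \int_0^1 \mathrm{ROC}_{h,g}(\alpha)\, d\alpha = P\{S > S'\} + \tfrac{1}{2} P\{S = S'\},$$

where $S$ and $S'$ denote independent random variables, drawn from $g$ and $h$ respectively.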

In the context of bipartite ranking, one is interested in the ability of the scorer $s$ to separate positive from negative data, which is reflected by the curve $\mathrm{ROC}_{H_s,G_s}$. The global summary $\mathrm{AUC}_{H_s,G_s}$ serves as the standard performance measure (Clémençon et al., 2008).

Empirical estimates. In practice, the scorer is learned based on a training set of i.i.d. copies of the random pair . Let and be the number of positive and negative data points respectively. We introduce and , the empirical counterparts of and :

Note that the denominators and are sums of i.i.d. random variables. For any two distributions over , we denote the empirical counterparts of and by and respectively. In particular, we have:

where for any . Empirical risk minimization for bipartite ranking typically consists in maximizing over a class of scoring rules (see e.g., Clémençon et al., 2008; Zhao et al., 2011).
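As an illustration of the empirical criterion discussed above, the following is a minimal sketch of the empirical AUC computed as a Mann-Whitney-type statistic: the fraction of (positive, negative) pairs that the scorer orders correctly, with ties counted as one half. The function name `empirical_auc`, the toy scores, and the label encoding in $\{-1, +1\}$ are assumptions made for the example, not notation from the paper.

```python
import numpy as np

def empirical_auc(neg_scores, pos_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    neg = np.asarray(neg_scores, dtype=float)
    pos = np.asarray(pos_scores, dtype=float)
    # diff[i, j] = score of positive i minus score of negative j
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Toy data (hypothetical): three positives and three negatives, with one tie.
scores = np.array([0.9, 0.8, 0.35, 0.6, 0.35, 0.1])
labels = np.array([1, 1, 1, -1, -1, -1])
auc = empirical_auc(scores[labels == -1], scores[labels == 1])
print(auc)
```

The pairwise matrix costs memory proportional to the product of the two class sizes, which is fine for a sketch; a rank-based computation would be preferable on large samples.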

2.3 Fairness in Binary Classification

In binary classification, the goal is to learn a mapping function that predicts the output label from the input random variable as accurately as possible (as measured by an appropriate loss function). Any classifier can be defined by its unique acceptance set .

Existing notions of fairness for binary classification (see Zafar et al., 2019, for a detailed treatment) aim to ensure that makes similar predictions (or errors) for the two groups. We mention here the common fairness definitions that depend on the ground truth label . Parity in mistreatment requires that the proportion of errors is the same for the two groups:


where . While this requirement is natural, it considers that all errors are equal: in particular, one can have a high false positive rate (FPR) for one group and a high false negative rate (FNR) for the other. This can be considered unfair when acceptance is an advantage, e.g. for job applications. A solution is to consider parity in false positive rates and/or parity in false negative rates, which respectively write:

Remark 1 (Connection to bipartite ranking).

A scorer induces an infinite collection of binary classifiers . While one could fix a threshold and try to enforce fairness on , we are interested in notions of fairness for the scorer itself, independently of a particular choice of threshold.
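To make the link between a scorer and its induced classifiers concrete, here is a small sketch that thresholds scores and reports the false positive and false negative rates within each group. It assumes labels in $\{-1, +1\}$, groups encoded as 0 and 1, and a classifier that accepts when the score strictly exceeds the threshold; the helper name `group_error_rates` and the data are hypothetical.

```python
import numpy as np

def group_error_rates(scores, labels, groups, threshold):
    """Return {group: (FPR, FNR)} for the classifier 1{score > threshold}."""
    scores, labels, groups = (np.asarray(a) for a in (scores, labels, groups))
    accept = scores > threshold
    rates = {}
    for z in np.unique(groups):
        neg = (groups == z) & (labels == -1)
        pos = (groups == z) & (labels == 1)
        fpr = float(np.mean(accept[neg]))   # accepted negatives of group z
        fnr = float(np.mean(~accept[pos]))  # rejected positives of group z
        rates[int(z)] = (fpr, fnr)
    return rates

scores = np.array([0.9, 0.2, 0.7, 0.4, 0.8, 0.6, 0.3, 0.1])
labels = np.array([1, 1, -1, -1, 1, 1, -1, -1])
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])
rates = group_error_rates(scores, labels, groups, threshold=0.5)
print(rates)
```

Sweeping `threshold` over the range of the scores shows how parity in FPR or FNR may hold at one threshold and fail at another, which is why fairness notions attached to the scorer itself are of interest.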

2.4 Fairness in Ranking

Fairness for rankings has only recently become a research topic of interest, and most of the work originates from the information retrieval and recommender systems communities. Given a set of items with known relevance scores, they aim to extract a (partial) ranking that balances utility and notions of fairness at the group or individual level, or through a notion of exposure over several queries (Zehlike et al., 2017; Celis et al., 2018; Biega et al., 2018; Singh & Joachims, 2018). Singh & Joachims (2019) and Beutel et al. (2019) extend the above work to the learning to rank framework, where the task is to learn relevance scores and ranking policies from a certain number of observed queries that consist of query-item features and item relevance scores (which are typically not binary). This framework is fundamentally different from the bipartite ranking setting considered here.

AUC constraints. In a setting closer to ours, Kallus & Zhou (2019) introduce fairness constraints to better quantify the fairness of a known scoring function on binary labeled data (they do not address learning). Similar definitions of fairness are also considered by Beutel et al. (2019), and by Borkan et al. (2019) in a classification context. Below, we present these definitions in the unified form of equalities between two AUCs. In general, the AUC can be seen as a measure of homogeneity between two distributions. Its empirical counterpart (called the Mann-Whitney statistic in hypothesis testing) is often used to test equality between distributions, see Vayatis et al. (2009) for details on this interpretation of AUCs.

Introduce (resp. ) as the c.d.f. of the score on the positives (resp. negatives) of group , i.e.

for any . Both Beutel et al. (2019) and Borkan et al. (2019) proposed the following fairness constraints:


Eq. 3 is referred to as intra-group pairwise or subgroup AUC fairness and Eq. 4 as pairwise accuracy or Background Negative Subgroup Positive (BNSP) AUC fairness, depending on the authors. Eq. 3 requires the ranking performance (as measured by the AUC) to be equal within groups, which is relevant for instance in situations where groups are ranked separately (e.g., candidates for two types of jobs). Eq. 4 enforces that positive instances from either group have the same probability of being ranked higher than a negative example, and can be seen as the ranking counterpart of parity in FNR for classification, see Eq. 2. Borkan et al. (2019) and Kallus & Zhou (2019) also consider the following notions:


Borkan et al. (2019) refer to Eq. 5 as Background Positive Subgroup Negative (BPSN) AUC fairness; it can be seen as the counterpart of parity in FPR for classification, see Eq. 2. The notion of Average Equality Gap (AEG) introduced by Borkan et al. (2019) can be written for . Eq. 6 thus corresponds to an AEG of zero, which means that the scores of the positives of any group are not stochastically larger than those of the other. Beutel et al. (2019) and Kallus & Zhou (2019) also define, respectively, the inter-group pairwise fairness or xAUC disparity:


which imposes that the positives of a group can be distinguished from the negatives of the other group as effectively for both groups. Next, we generalize these AUC-based definitions and derive generalization bounds and algorithms for learning scoring functions under such fairness constraints.
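The AUC-based notions reviewed in this section can be checked empirically on a scored sample. The sketch below computes, for each notion, the difference between the two AUCs that the constraint asks to be equal, following the verbal descriptions above (intra-group, BNSP, BPSN and inter-group pairwise fairness). It assumes labels in $\{-1, +1\}$ and groups encoded as 0 and 1; the function names, dictionary keys and data are hypothetical, and the sign convention of each gap is arbitrary.

```python
import numpy as np

def auc(neg, pos):
    """Empirical AUC between a negative and a positive sample (ties count 1/2)."""
    diff = np.asarray(pos, float)[:, None] - np.asarray(neg, float)[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def auc_fairness_gaps(scores, labels, groups):
    s, y, z = (np.asarray(a) for a in (scores, labels, groups))
    neg = {g: s[(y == -1) & (z == g)] for g in (0, 1)}
    pos = {g: s[(y == 1) & (z == g)] for g in (0, 1)}
    all_neg, all_pos = s[y == -1], s[y == 1]
    return {
        # Eq. 3: same ranking performance within each group
        "intra_group": auc(neg[0], pos[0]) - auc(neg[1], pos[1]),
        # Eq. 4 (BNSP): positives of each group vs. all negatives
        "bnsp": auc(all_neg, pos[0]) - auc(all_neg, pos[1]),
        # Eq. 5 (BPSN): all positives vs. negatives of each group
        "bpsn": auc(neg[0], all_pos) - auc(neg[1], all_pos),
        # inter-group: positives of one group vs. negatives of the other
        "inter_group": auc(neg[0], pos[1]) - auc(neg[1], pos[0]),
    }

scores = np.array([0.9, 0.2, 0.7, 0.4, 0.8, 0.6, 0.3, 0.1])
labels = np.array([1, 1, -1, -1, 1, 1, -1, -1])
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])
gaps = auc_fairness_gaps(scores, labels, groups)
print(gaps)
```

On this toy sample the inter-group gap vanishes while the intra-group gap does not, a reminder that the different AUC-based notions are not interchangeable.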

3 Fair Scoring via AUC Constraints

In this section, we give a thorough treatment of the problem of statistical learning of scoring functions under AUC-based fairness constraints. First, we introduce a general family of AUC-based fairness definitions which encompasses those presented in Section 2.4 as well as many others. We then derive generalization bounds for the bipartite ranking problem under AUC-based fairness constraints. Finally, we propose a practical algorithm to learn such fair scoring functions.

3.1 A Family of AUC-based Fairness Definitions

Many sensible fairness definitions can be expressed in terms of the AUC between two distributions. We now introduce a framework to formulate AUC-based fairness constraints as a linear combination of a set of five elementary fairness constraints, and prove its generality. Given a scorer , we introduce the vector where the ’s, , are elementary fairness measurements. More precisely, the value of (resp. ) quantifies the resemblance of the distribution of the negatives (resp. positives) between the two sensitive attributes:

while , and measure the difference in ability of a score to discriminate between positive and negative for any two pairs of sensitive attributes:

The elementary fairness constraints are the equations for any .

The family of fairness constraints we consider is the set of linear combinations of the elementary fairness constraints:


where .

The following theorem shows that the family covers a wide array of possible fairness constraints in the form of equalities of the AUCs between mixtures of the distributions . Denote by the canonical basis of , as well as the constant vector . Introducing the probability vectors where , we define the following constraint:

Theorem 1.

The following propositions are equivalent:

  1. Eq. 9 is satisfied for any measurable scorer when , and ,

  2. Eq. 9 is equivalent to for some ,

  3. .

Theorem 1 shows that our general family defined by Eq. 8 compactly captures all relevant AUC-based fairness constraints while ruling out the ones that are not satisfied when and . Such undesirable fairness constraints are those which actually give an advantage to one of the groups, such as which is a special case of Eq. 9 that requires the scores of the positives of group 1 to be higher than those of group 0.

All AUC-based fairness constraints proposed in previous work (see Section 2.4) can be written as instances of our general definition for a specific choice of , see Table 1. Note that might depend on the quantities . Interestingly, new fairness constraints can be expressed using our general formulation. Denoting , consider for instance the following constraint:


It equalizes the expected position of the positives of each group with respect to a reference group (here group ). Another fairness constraint of interest is based on the rate of misranked pairs when one element is in a specific group:

The equality can be seen as the analogue of parity in mistreatment for the task of ordering pairs, see Eq. 1. It is easy to see that this constraint can be written in the form of Eq. 9 and that point 1 of Theorem 1 holds, hence it is equivalent to for some .

Table 1: Value of for all of the AUC-based fairness constraints in the paper for the general formulation of Eq. 8.

3.2 Statistical Learning Guarantees

We now formulate the problem of fair ranking based on the fairness definitions introduced above. While it is tempting to introduce fairness as a hard constraint, this may come at a large cost in terms of the ability of such scorers to separate positive from negative data points. In general, there is a trade-off between the ranking performance and the level of fairness, as illustrated by the following example.

Example 1.

Let . For any , let , as well as and . We have and . Consider linear scorers of the form parameterized by . Fig. 1 plots and for as a function of , illustrating the trade-off between fairness and ranking performance.

Figure 1: Plotting Example 1 for . Under the fairness definition Eq. 3, a fair solution exists for , but the ranking performance for is significantly higher.

For a family of scoring functions and some instantiation of our general fairness definition in Eq. 8, we thus define the learning problem as follows:


where is a hyperparameter balancing ranking performance and fairness.

For the sake of simplicity and concreteness, in the rest of this section we focus on the special case of the fairness definition in Eq. 3 — one can easily extend our analysis to any other instance of our general definition in Eq. 8. The objective in Eq. 11 then writes:

and we denote its maximizer by .

Given a training set of i.i.d. copies of the random triplet , we denote by the number of points in group , and by (resp. ) the number of positive (resp. negative) points in this group. The empirical counterparts of and are then given by:

Recalling the notation introduced in Section 2.2, the empirical problem can thus be written:

We denote its maximizer by . We can now state our statistical learning guarantees for fair ranking.

Theorem 2.

Assume the class of functions is VC-major with finite VC-dimension and that there exists s.t. . Then, for any , for all , we have w.p. at least :

Theorem 2 establishes a learning rate of for our problem of ranking under AUC-based fairness constraint, which holds for any distribution of as long as the probability of observing each combination of label and group is bounded away from zero.

3.3 Training Algorithm

In practice, maximizing directly by gradient ascent is not feasible since the criterion is not continuous. As is standard in the literature, we can use smooth surrogate relaxations of the AUCs based on the logistic function $\sigma: x \mapsto 1/(1 + e^{-x})$. Again, we illustrate our approach for the fairness definition in Eq. 3. The surrogate relaxation of writes:

Similarly, for , we obtain by averaging over pairs of positive/negative points in group . The overall relaxed objective we aim to maximize is then:

We solve the problem using a stochastic gradient descent algorithm in which the constant is set adaptively during the training process based on a small validation set. Specifically, if more errors are made on group 0 than on group 1, i.e. on the validation set, we set to (where is a small positive constant) so as to increase the weight of those errors. In the other case, we set to . We update the value of every fixed number of iterations. Details on the implementation are given in the supplementary material.
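The procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the exact relaxed objective is not reproduced here, so the code assumes a penalty of the form $\lambda \cdot c \cdot (\widetilde{\mathrm{AUC}}_0 - \widetilde{\mathrm{AUC}}_1)$ with a signed constant $c \in [-1, 1]$, a sigmoid surrogate for each AUC, and an update of $c$ by a fixed step. The names `surrogate_auc`, `relaxed_objective`, `update_c` and the step size are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def surrogate_auc(neg_scores, pos_scores):
    """Smooth AUC relaxation: indicator of a correct pair replaced by a sigmoid."""
    diff = np.asarray(pos_scores, float)[:, None] - np.asarray(neg_scores, float)[None, :]
    return float(np.mean(sigmoid(diff)))

def relaxed_objective(scores, labels, groups, lam, c):
    """Assumed form: overall surrogate AUC plus a signed intra-group AUC gap."""
    s, y, z = (np.asarray(a) for a in (scores, labels, groups))
    overall = surrogate_auc(s[y == -1], s[y == 1])
    auc0 = surrogate_auc(s[(y == -1) & (z == 0)], s[(y == 1) & (z == 0)])
    auc1 = surrogate_auc(s[(y == -1) & (z == 1)], s[(y == 1) & (z == 1)])
    return overall + lam * c * (auc0 - auc1)

def update_c(c, val_auc0, val_auc1, step=0.01):
    """Move c toward the group with the lower validation AUC, clipped to [-1, 1]."""
    if val_auc0 < val_auc1:          # more ranking errors on group 0
        return min(c + step, 1.0)
    return max(c - step, -1.0)

scores = np.array([0.9, 0.2, 0.7, 0.4, 0.8, 0.6, 0.3, 0.1])
labels = np.array([1, 1, -1, -1, 1, 1, -1, -1])
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])
c = update_c(0.0, val_auc0=0.5, val_auc1=1.0)
value = relaxed_objective(scores, labels, groups, lam=0.5, c=c)
print(c, value)
```

In a full training loop, `relaxed_objective` would be differentiated with respect to the scorer's parameters on each mini-batch, and `update_c` would be called every fixed number of iterations using AUCs measured on the validation set.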

4 Beyond AUC-based Fairness Constraints

In this section, we highlight some limitations of the AUC-based fairness constraints studied in Section 3. These serve as a motivation to introduce new pointwise ROC-based fairness constraints. We then present generalization bounds and gradient descent algorithms for learning scoring functions under such constraints.

4.1 Limitations of AUC-based Constraints

Figure 2: In this example, Eq. 4 is verified, but and are very different. Indeed, .

As mentioned in Section 2.4, the equality of two AUCs can be used to measure homogeneity between two distributions. However, it only quantifies a stochastic order between the two distributions, and not the equality: see Fig. 2 for an example where two very different distributions are indistinguishable in terms of AUC. For continuous ROCs, the equality between their two AUCs only implies that the two ROCs intersect at some unknown point, as shown by Proposition 1 (a simple consequence of the mean value theorem). Borkan et al. (2019, Theorem 3.3 therein) corresponds to the special case of Proposition 1 when .
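A simple numerical illustration of this limitation, distinct from the example of Fig. 2, takes two centered Gaussian distributions with different variances. This is an assumed toy setting: $h = \mathcal{N}(0, 1)$ and $g = \mathcal{N}(0, \sigma^2)$ with $\sigma = 3$, so that the ROC curve has the closed form $1 - \Phi(\Phi^{-1}(1 - \alpha)/\sigma)$, with $\Phi$ the standard normal c.d.f. By symmetry the AUC equals $1/2$, exactly as for two identical distributions, yet the ROC curve is far from the diagonal and meets it only at isolated points.

```python
from statistics import NormalDist

std_normal = NormalDist(0.0, 1.0)
sigma = 3.0  # assumed standard deviation of the second distribution

def roc(alpha):
    """ROC curve of h = N(0, 1) against g = N(0, sigma^2) at level alpha in (0, 1)."""
    return 1.0 - std_normal.cdf(std_normal.inv_cdf(1.0 - alpha) / sigma)

n = 1000
alphas = [k / n for k in range(1, n)]
values = [roc(a) for a in alphas]

# Trapezoidal integration over [0, 1], adding the endpoints (0, 0) and (1, 1).
grid = [0.0] + alphas + [1.0]
curve = [0.0] + values + [1.0]
auc = sum((grid[i + 1] - grid[i]) * (curve[i + 1] + curve[i]) / 2 for i in range(n))

# Largest distance between the ROC curve and the diagonal on the grid.
max_dev = max(abs(v - a) for a, v in zip(alphas, values))
print(auc, max_dev, roc(0.5))
```

The curve crosses the diagonal at $\alpha = 1/2$, which is the kind of intersection point the equality of AUCs guarantees, while elsewhere the two distributions are clearly distinguishable.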

Proposition 1.

Let be cdfs on s.t. and are continuous. If , then there exists , s.t. .

Proposition 1 implies that when a scorer satisfies some AUC-based fairness constraint, there exists a threshold inducing a non-trivial classifier that satisfies a notion of fairness for classification.

Corollary 1.

Under appropriate conditions on the scorer $s$ (i.e., $s \in \mathcal{S}$ where $\mathcal{S}$ satisfies Assumption 1), we have that:

  • If and satisfies Eq. 3, then there exists , s.t. , which resembles parity in mistreatment (Eq. 1).

  • If satisfies Eq. 4 or (6) or (10), then satisfies fairness in FNR (Eq. 2) for some threshold .

  • If satisfies Eq. 5, then satisfies parity in FPR (Eq. 2) for some threshold .

Unfortunately, AUC-based fairness constraints guarantee classification fairness for a single threshold , corresponding to a specific point of the ROC curve (over which one has no control). In contrast, in many applications, one is interested in learning a scoring function which induces classifiers that are fair in particular regions of the ROC curve. One striking example is in biometric verification, where one is interested in low false positive rates (i.e., large thresholds ), see Grother & Ngan (2019). In many practical scenarios, learning with AUC-based fairness constraints thus leads to inadequate scorers.

4.2 Pointwise ROC-based Fairness Constraints

To impose more restrictive fairness conditions, the ideal goal is to enforce the equality of the score distributions of the positives (resp. negatives) between the two groups, i.e. (resp. ). This stronger functional criterion can be expressed in terms of ROC curves. For , consider the deviations between the positive (resp. negative) inter-group ROCs and the identity function:

The aforementioned conditions of equality between the distributions of the positives (resp. negatives) of the two groups are equivalent to satisfying, for any $\alpha \in [0, 1]$:


When both conditions in Eq. 12 are satisfied, all of the fairness constraints covered by Theorem 1 are verified, since for any . Furthermore, guarantees on the fairness of classifiers induced by (see Corollary 1) hold for all thresholds. While this strong property is desirable, it is challenging to enforce in practice due to its functional nature, and in many cases it may only be achievable by completely jeopardizing the ranking performance.

We thus propose to implement a trade-off between the ranking performance and the satisfaction of a finite number of fairness constraints on the value of and for specific values of . Let be the number of constraints for the negatives and the positives respectively, as well as and the points at which they apply (sorted in strictly increasing order). With the notation , we can introduce the criterion , defined as:

where and are trade-off hyperparameters.
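The pointwise constraints can be evaluated on a sample as sketched below. The code assumes that the deviation at level $\alpha$ is the empirical inter-group ROC curve minus the identity, $\mathrm{ROC}(\alpha) - \alpha$, where the ROC compares the scores of the same class (positives or negatives) across group 0 and group 1; the ordering of the groups, the function names and the weights are assumptions made for the example.

```python
import numpy as np

def empirical_roc(h_sample, g_sample, alpha):
    """Empirical ROC_{h,g}(alpha) = 1 - g_hat(h_hat^{-1}(1 - alpha))."""
    h = np.sort(np.asarray(h_sample, dtype=float))
    g = np.asarray(g_sample, dtype=float)
    t = 1.0 - alpha
    if t <= 0.0:
        return 1.0  # the pseudo-inverse at 0 is -infinity: all of g lies above it
    # Smallest k with k / n >= t (small tolerance guards against rounding).
    k = int(np.ceil(t * len(h) - 1e-12))
    threshold = h[max(k, 1) - 1]
    return 1.0 - float(np.mean(g <= threshold))

def roc_deviations(scores_group0, scores_group1, alphas):
    """Deviation of the inter-group ROC from the diagonal at each alpha."""
    return [empirical_roc(scores_group0, scores_group1, a) - a for a in alphas]

def roc_penalty(deviations, weights):
    """Weighted sum of absolute deviations, as subtracted from the AUC."""
    return float(sum(w * abs(d) for w, d in zip(weights, deviations)))

base = np.arange(1.0, 9.0)              # scores of positives in group 0
same = roc_deviations(base, base, [0.25, 0.5])
shifted = roc_deviations(base, base + 100.0, [0.25, 0.5])
penalty = roc_penalty(shifted, weights=[1.0, 1.0])
print(same, shifted, penalty)
```

Identical score distributions give zero deviations, while a group whose positives all score higher yields deviations of $1 - \alpha$. Choosing the levels $\alpha$ in the region of interest, for instance small false positive rates, targets the constraint where it matters for the application.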

This criterion is flexible enough to address the scenarios outlined in Section 4.1. In particular, under some regularity assumption on the ROC curve, Proposition 2 shows that if a small number of constraints are satisfied for , one obtains guarantees in sup norm on .

Assumption 1.

The candidate scoring functions take their values in for some , and the family of cdfs satisfies: (a) any is continuously differentiable, and (b) there exists s.t. , . The latter condition is satisfied as soon as the score functions present neither flat nor steep parts, see Clémençon & Vayatis (2007, Remark 7 therein) for a discussion.

Proposition 2.

Under Assumption 1, if s.t. for every , , then:

with the convention and .

4.3 Statistical Guarantees

We now proceed to prove statistical guarantees for the maximization of . Its empirical counterpart writes:

where and for any .

We now study the generalization properties of the scorers that maximize . We denote by the maximizer of over , and the maximizer of over .

Theorem 3.

Under the assumptions of Theorem 2 and Assumption 1, we have for any , , w.p. :

where , with and .

4.4 Training Algorithm

To maximize , we can use a stochastic gradient descent procedure similar to the one introduced in Section 3.3. We refer to the supplementary material for more details.

5 Experiments

Method AUC-based fairness ROC-based fairness
Value of
Toy data
Example 1 0.79 0.28 0.73 0.00
Example 2 0.80 0.02 0.80 0.02 0.38 0.75 0.06 0.00
Real data
German 0.77 0.05 0.76 0.04 0.03 0.01 0.73 0.02 0.03 0.00
Adult 0.90 0.04 0.89 0.03 0.27 0.28 0.85 0.05 0.07 0.11
Compas 0.71 0.03 0.70 0.01 0.12 0.10 0.66 0.02 0.07 0.03
Bank 0.93 0.15 0.77 0.03 0.10 0.18 0.81 0.29 0.03 0.02
Table 2: Results on test set. For synthetic data, results are averaged over 100 runs (std. dev. are all smaller than ). The strength of fairness constraints and regularization is chosen by cross-validation to obtain interesting trade-offs, see supplementary for more detail.

In this section, we illustrate our approaches on synthetic and real data. We learn linear scorers for synthetic data and neural network-based scorers for real data, using regularization on the model parameters. Results are summarized in Table 2, where denotes the ranking accuracy , and denotes the absolute difference of the terms in the AUC-based fairness constraint. We highlight in bold the best ranking accuracy, and the fairest algorithm for the relevant constraint. We refer to the supplementary for more details on the setup and further illustrations of the results.

Synthetic data. First, we illustrate learning with the constraint in Eq. 3 on the simple problem in Example 1. Precisely, learning scorers with different ’s allows different trade-offs between ranking performance and fairness (larger leads to more fairness). Example 2 allows us to compare AUC-based and ROC-based approaches. The former uses Eq. 3 as constraint and the latter penalizes . The different constraints lead to scorers with specific trade-offs between fairness and performance. Results with AUC-based fairness are the same for and because the optimal scorer for ranking satisfies Eq. 3. In the supplementary, we show that our algorithm recovers optimal scorers (as measured by the loss) for both examples.

Example 2.

Set . For any with , set , , and