Average Individual Fairness: Algorithms, Generalization and Experiments

Abstract

We propose a new family of fairness definitions for classification problems that combine some of the best properties of both statistical and individual notions of fairness. We posit not only a distribution over individuals, but also a distribution over (or collection of) classification tasks. We then ask that standard statistics (such as error or false positive/negative rates) be (approximately) equalized across individuals, where the rate is defined as an expectation over the classification tasks. Because we are no longer averaging over coarse groups (such as race or gender), this is a semantically meaningful individual-level constraint. Given a sample of individuals and classification problems, we design an oracle-efficient algorithm (i.e. one that is given access to any standard, fairness-free learning heuristic) for the fair empirical risk minimization task. We also show that given sufficiently many samples, the ERM solution generalizes in two directions: both to new individuals, and to new classification tasks, drawn from their corresponding distributions. Finally we implement our algorithm and empirically verify its effectiveness.

1 Introduction

The community studying fairness in machine learning has yet to settle on definitions. At a high level, existing definitional proposals can be divided into two groups: statistical fairness definitions and individual fairness definitions. Statistical fairness definitions partition individuals into “protected groups” (often based on race, gender, or some other binary protected attribute) and ask that some statistic of a classifier (error rate, false positive rate, positive classification rate, etc.) be approximately equalized across those groups. In contrast, individual definitions of fairness have no notion of “protected groups”, and instead ask for constraints that bind on pairs of individuals. These constraints can have the semantics that “similar individuals should be treated similarly” ([awareness]), or that “less qualified individuals should not be preferentially favored over more qualified individuals” ([JKMR16]). Both families of definitions have serious problems, which we will elaborate on. But in summary, statistical definitions of fairness provide only very weak promises to individuals, and so do not have very strong semantics. Existing proposals for individual fairness guarantees, on the other hand, have very strong semantics, but have major obstacles to deployment, requiring strong assumptions on either the data generating process or on society’s ability to instantiate an agreed-upon fairness metric.

Statistical definitions of fairness are the most popular in the literature, in large part because they can be easily checked and enforced on arbitrary data distributions. For example, a popular definition ([HPS16, KMR16, Chou17]) asks that a classifier’s false positive rate should be equalized across the protected groups. This can sound attractive: in settings in which a positive classification leads to a bad outcome (e.g. incarceration), it is the false positives that are harmed by the errors of the classifier, and asking that the false positive rate be equalized across groups is asking that the harm caused by the algorithm should be proportionately spread across protected populations. But the meaning of this guarantee to an individual is limited, because the word rate refers to an average over the population. To see why this limits the meaning of the guarantee, consider the example given in [KNRW18]: imagine a society that is equally split by gender (Male, Female) and by race (Blue, Green). Under the constraint that false positive rates be equalized across both race and gender, a classifier may incarcerate 100% of blue men and green women, and 0% of green men and blue women. This equalizes the false positive rate across all protected groups, but is cold comfort to any individual blue man or green woman. This effect isn’t merely hypothetical: [KNRW18, KNRW19] showed similar effects when using off-the-shelf fairness constrained learning techniques on real datasets.
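The arithmetic behind this example can be checked with a minimal sketch. It assumes, purely for illustration, four equally sized subgroups in which every individual has a true negative label, so that every positive prediction is a false positive; the group size and all names below are hypothetical and do not come from [KNRW18].

```python
from itertools import product

# Hypothetical population: 4 equally sized subgroups, every individual has
# true label 0, so each positive prediction is a false positive.
SIZE = 100
# Classifier from the example: positive for blue men and green women only.
predict_positive = {("Male", "Blue"): True, ("Female", "Green"): True,
                    ("Male", "Green"): False, ("Female", "Blue"): False}

def false_positive_rate(members):
    """FPR over a list of (gender, race) subgroups, each of size SIZE."""
    positives = sum(SIZE for g in members if predict_positive[g])
    return positives / (SIZE * len(members))

subgroups = list(product(["Male", "Female"], ["Blue", "Green"]))
by_gender = {g: false_positive_rate([s for s in subgroups if s[0] == g])
             for g in ["Male", "Female"]}
by_race = {r: false_positive_rate([s for s in subgroups if s[1] == r])
           for r in ["Blue", "Green"]}
by_subgroup = {s: false_positive_rate([s]) for s in subgroups}

print(by_gender)    # both gender groups have FPR 0.5
print(by_race)      # both race groups have FPR 0.5
print(by_subgroup)  # each subgroup has FPR 1.0 or 0.0
```

The group-level rates coincide even though no individual experiences a rate anywhere near the group average.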

Individual definitions of fairness, on the other hand, can have strong individual level semantics. For example, the constraint imposed by [JKMR16, JKMNR17] in online classification problems implies that the false positive rate must be equalized across all pairs of individuals who (truly) have negative labels. Here the word rate has been redefined to refer to an expectation over the randomness of the classifier, and there is no notion of protected groups. This kind of constraint provides a strong individual level promise that one’s risk of being harmed by the errors of the classifier is no higher than it is for anyone else. Unfortunately, in order to non-trivially satisfy a constraint like this, it is necessary to make strong realizability assumptions.

1.1 Our Results

We propose an alternative definition of individual fairness that avoids the need to make assumptions on the data generating process, while giving the learning algorithm more flexibility to satisfy it in non-trivial ways. We observe that in many applications each individual will be subject to decisions made by many classification tasks over a given period of time, not just one. For example, internet users are shown a large number of targeted ads over the course of their usage of a platform, not just one: the properties of the advertisers operating in the platform over a period of time are not known up front, but have some statistical regularities. Public school admissions in cities like New York are handled by a centralized match: students apply not just to one school, but to many, each of which can make its own admissions decisions ([abdulkadirouglu2005new]). We model this by imagining that not only is there an unknown distribution $\mathcal{P}$ over individuals, but there is also an unknown distribution $\mathcal{Q}$ over classification problems (each of which is represented by an unknown mapping from individual features to target labels). With this model in hand, we can now ask that the error rates (or false positive or negative rates) be equalized across all individuals, where now rate is defined as the average, over classification tasks drawn from $\mathcal{Q}$, of the probability of a particular individual being incorrectly classified.

We then derive a new oracle-efficient algorithm for satisfying this guarantee in-sample, and prove novel generalization guarantees showing that the guarantees of our algorithm hold also out of sample. Oracle efficiency is an attractive framework in which to circumvent the worst-case hardness of even unconstrained learning problems, and focus on the additional computational difficulty imposed by fairness constraints. It assumes the existence of “oracles” (in practice, implemented with a heuristic) that can solve weighted classification problems absent fairness constraints, and asks for efficient reductions from the fairness constrained learning problems to unconstrained problems. This has become a popular technique in the fair machine learning literature (see e.g. [agarwal, KNRW18]), and one that often leads to practical algorithms. The generalization guarantees we prove require the development of new techniques because they refer to generalization in two orthogonal directions: over both individuals and classification problems. Our algorithm is run on a sample of $n$ individuals sampled from $\mathcal{P}$ and $m$ problems sampled from $\mathcal{Q}$. It is given access to an oracle (in practice, implemented with a heuristic) for solving ordinary cost sensitive classification problems over some hypothesis space $\mathcal{H}$. The algorithm runs in polynomial time (it performs only elementary calculations except for calls to the learning oracle, and makes only a polynomial number of calls to the oracle) and returns a mapping from problems to hypotheses that has the following properties, so long as $n$ and $m$ are sufficiently large (polynomial in the VC-dimension of $\mathcal{H}$ and the desired error parameters). For any $\alpha$ and $\beta$, with high probability over the draw of the $n$ individuals from $\mathcal{P}$ and the $m$ problems from $\mathcal{Q}$:

1. Accuracy: the error rate (computed in expectation over new individuals $x \sim \mathcal{P}$ and new problems $f \sim \mathcal{Q}$) is within $O(\alpha)$ of that of the optimal mapping from problems to classifiers in $\mathcal{H}$, subject to the constraint that for every pair of individuals $x, x'$ in the support of $\mathcal{P}$, the error rates (or false positive or negative rates) (computed in expectation over problems $f \sim \mathcal{Q}$) on $x$ and $x'$ differ by at most $\alpha$.

2. Fairness: with probability $1-\beta$ over the draw of new individuals $x, x' \sim \mathcal{P}$, the error rate (or false positive or negative rate) of the output mapping (computed in expectation over problems $f \sim \mathcal{Q}$) on $x$ will be within $O(\alpha)$ of that on $x'$.

The mapping from new classification problems to hypotheses that we find is derived from the dual variables of the linear program representing our empirical risk minimization task, and we crucially rely on the structure of this mapping to prove our generalization guarantees for new problems $f \sim \mathcal{Q}$.

The literature on fairness in machine learning has become much too large to comprehensively summarize, but see [survey] for a recent survey. Here we focus on the most conceptually related work, which has aimed to bridge the gap between the immediate applicability of statistical definitions of fairness with the strong individual level semantics of individual notions of fairness. One strand of this literature focuses on the “metric fairness” definition first proposed by [awareness], and aims to ease the assumption that the learning algorithm has access to a task specific fairness metric. [KRR18] imagine access to an oracle which can provide unbiased estimates to the metric distance between any pair of individuals, and show how to use this to satisfy a statistical notion of fairness representing “average metric fairness” over pre-defined groups. [GJKR18] study a contextual bandit learning setting in which a human judge points out metric fairness violations whenever they occur, and show that with this kind of feedback (under assumptions about consistency with a family of metrics), it is possible to quickly converge to the optimal fair policy. [YR18] consider a PAC-based relaxation of metric fair learning, and show that empirical metric-fairness generalizes to out-of-sample metric fairness. Another strand of this literature has focused on mitigating the problems that arise when statistical notions of fairness are imposed over coarsely defined groups, by instead asking for statistical notions of fairness over exponentially many or infinitely many groups with a well defined structure. This line includes [multical] (focusing on calibration), [KNRW18] (focusing on false positive and negative rates), and [KGZ18] (focusing on error rates).

2 Model and Preliminaries

We model each individual in our framework by a vector of features $x \in \mathcal{X}$, and we let each learning problem be represented by a binary function $f \in \mathcal{F}$ mapping $\mathcal{X}$ to $\{0,1\}$. We assume probability measures $\mathcal{P}$ and $\mathcal{Q}$ over the space of individuals $\mathcal{X}$ and the space of problems $\mathcal{F}$, respectively. Without loss of generality, we assume throughout that the supports of the distributions $\mathcal{P}$ and $\mathcal{Q}$ are $\mathcal{X}$ and $\mathcal{F}$, respectively. In the training phase there is a fixed (across problems) set $X = \{x_i\}_{i=1}^n$ of $n$ individuals sampled independently from $\mathcal{P}$, for which we have available the labels of $m$ learning tasks represented by $F = \{f_j\}_{j=1}^m$ drawn independently from $\mathcal{Q}$. Therefore, a training data set of $n$ individuals $X$ for $m$ learning tasks $F$ takes the form: $S = \{x_i, (f_j(x_i))_{j=1}^m\}_{i=1}^n$. We summarize the notations we use for individuals and problems in Table 1.

In general the function class $\mathcal{F}$ will be unknown. We will aim to solve the (agnostic) learning problem over a hypothesis class $\mathcal{H}$, which need bear no relationship to $\mathcal{F}$. We assume throughout that $\mathcal{H}$ contains the constant classifiers $h_0$ and $h_1$, where $h_0(x) = 0$ and $h_1(x) = 1$ for all $x$. This assumption will allow us to argue about feasibility of the constrained optimization problems that we will solve. We allow for randomized classifiers, which we model as learning over $\Delta(\mathcal{H})$, the probability simplex over $\mathcal{H}$.

Unlike usual learning settings where the primary goal is to learn a single hypothesis $h \in \mathcal{H}$, our objective is to learn a mapping $\psi \in \Delta(\mathcal{H})^{\mathcal{F}}$ that maps learning tasks $f \in \mathcal{F}$, represented as labelings of the training data, to hypotheses $p \in \Delta(\mathcal{H})$. We will therefore have to formally define the error rates incurred by a mapping $\psi$ and use them to formalize a learning task subject to our proposed fairness notion. For a mapping $\psi$, we write $\psi_f$ to denote the classifier corresponding to $f$ under the mapping, i.e., $\psi_f = \psi(f) \in \Delta(\mathcal{H})$.

Notice that in the training phase there are only $m$ learning problems to be solved, and therefore the corresponding empirical fair learning problem reduces to learning $m$ randomized classifiers $p = (p_1, \ldots, p_m) \in \Delta(\mathcal{H})^m$, where $p_j$ is the learned classifier for the $j$th problem $f_j$. In general, learning $m$ specific classifiers for the training problems might not give any generalizable rule mapping new problems to hypotheses. The specific algorithm we propose will, however, give such a rule, in the form of a set of weights (derived from the dual variables of the ERM problem) over the training individuals.

2.1 AIF: Average Individual Fairness

Definition 2.1 (Individual and Overall Error Rates of a Mapping ψ).

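For a given individual $x \in \mathcal{X}$, a mapping $\psi \in \Delta(\mathcal{H})^{\mathcal{F}}$, and distributions $\mathcal{P}$ and $\mathcal{Q}$ over $\mathcal{X}$ and $\mathcal{F}$, the individual error rate of $x$ incurred by $\psi$ is defined as follows:

$$\mathcal{E}(x, \psi; \mathcal{Q}) = \mathbb{E}_{f \sim \mathcal{Q}}\left[\mathbb{P}_{h \sim \psi_f}\left[h(x) \neq f(x)\right]\right]$$

The overall error rate of $\psi$ is:

$$\mathrm{err}(\psi; \mathcal{P}, \mathcal{Q}) = \mathbb{E}_{x \sim \mathcal{P}}\left[\mathcal{E}(x, \psi; \mathcal{Q})\right]$$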

In the body of this paper, we will focus on a fairness constraint that asks that the individual error rate should be approximately equalized across all individuals. In Section A of the Appendix, we extend our techniques to equalizing false positive and negative rates across individuals.

Definition 2.2 (Average Individual Fairness (AIF)).

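In our framework, we say a mapping $\psi \in \Delta(\mathcal{H})^{\mathcal{F}}$ satisfies “$(\alpha, \beta)$-AIF” (read: $(\alpha, \beta)$-approximate Average Individual Fairness) with respect to the distributions $(\mathcal{P}, \mathcal{Q})$ if there exists $\gamma \in [0,1]$ such that:

$$\mathbb{P}_{x \sim \mathcal{P}}\left(\left|\mathcal{E}(x, \psi; \mathcal{Q}) - \gamma\right| > \alpha\right) \le \beta$$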
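As an illustration of the two definitions above, the following minimal sketch computes individual error rates on a finite sample and checks whether some $\gamma$ satisfies the AIF condition. It assumes that the empirical (uniform) distributions over a sample of individuals and problems stand in for $\mathcal{P}$ and $\mathcal{Q}$; the function names and the toy data are hypothetical.

```python
import numpy as np

def individual_error_rates(preds, labels):
    """preds[j, i] = probability that the classifier for problem j labels
    individual i as 1; labels[j, i] = f_j(x_i). Returns, for each individual,
    the error rate averaged over problems."""
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Probability of a mistake on (problem j, individual i).
    mistake = preds * (1 - labels) + (1 - preds) * labels
    return mistake.mean(axis=0)

def satisfies_aif(rates, alpha, beta):
    """True if some gamma in [0, 1] leaves at most a beta fraction of
    individuals with |rate - gamma| > alpha."""
    rates = np.sort(np.asarray(rates, dtype=float))
    n = len(rates)
    # An optimal window [gamma - alpha, gamma + alpha] can be slid right until
    # its left end hits a rate, so it suffices to try windows starting at each rate.
    right = np.searchsorted(rates, rates + 2 * alpha, side="right")
    covered = (right - np.arange(n)).max()
    return bool((n - covered) / n <= beta)

# Toy data: 2 problems, 3 individuals, deterministic predictions.
labels = [[0, 1, 1], [1, 1, 0]]
preds = [[0, 1, 0], [1, 0, 0]]
rates = individual_error_rates(preds, labels)
print(rates)                            # per-individual error rates
print(satisfies_aif(rates, 0.1, 0.0))   # exact equalization fails here
print(satisfies_aif(rates, 0.1, 0.34))  # allowing one outlier in three succeeds
```

The overall error rate is simply `rates.mean()`, so accuracy and fairness are both functions of the same per-individual quantities.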

2.2 Notations

We briefly fix some notations:

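• For an event $A$, $\mathbb{1}[A]$ represents the indicator function of $A$: $\mathbb{1}[A] = 1$ if $A$ occurs, and $0$ otherwise.

• For a natural number $n$, $[n] = \{1, 2, \ldots, n\}$.

• $\mathcal{U}(S)$ represents the uniform distribution over the set $S$.

• For a mapping $\psi: A \to B$ and $A' \subseteq A$, $\psi|_{A'}$ represents $\psi$ restricted to the domain $A'$.

• For a hypothesis class $\mathcal{H}$, $\Delta(\mathcal{H})$ represents the probability simplex over $\mathcal{H}$.

• $d_{\mathcal{H}}$ denotes the VC dimension of the hypothesis class $\mathcal{H}$.

• $CSC(\mathcal{H})$ denotes a cost sensitive classification oracle for $\mathcal{H}$, which is defined below.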

Definition 2.3 (Cost Sensitive Classification (CSC) in H).

Let $D = \{x_i, c_i^1, c_i^0\}_{i=1}^n$ denote a data set of $n$ individuals $x_i$, where $c_i^1$ and $c_i^0$ are the costs of classifying $x_i$ as positive (1) and negative (0) respectively. Given $D$, the cost sensitive classification problem defined over $\mathcal{H}$ is the optimization problem:

$$\arg\min_{h \in \mathcal{H}} \sum_{i=1}^n \left\{ c_i^1 h(x_i) + c_i^0 \left(1 - h(x_i)\right) \right\}$$

An oracle $CSC(\mathcal{H})$ takes the data set $D$ as input and outputs the solution to the optimization problem. We use $CSC(\mathcal{H}; D)$ to denote the classifier returned by $CSC(\mathcal{H})$ on an input data set $D$. We say that an algorithm is oracle efficient if it runs in polynomial time given the ability to make unit-time calls to $CSC(\mathcal{H})$.
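For intuition, a cost sensitive classification oracle can be sketched by brute force when the hypothesis class is a small finite list. This is only an illustrative stand-in: in practice the oracle is implemented with a learning heuristic, and the finite class of threshold classifiers and the names below are assumptions made for this example.

```python
import numpy as np

def csc_oracle(hypotheses, X, cost1, cost0):
    """Brute-force stand-in for a CSC oracle: return the h in a finite list
    minimizing sum_i cost1[i] * h(x_i) + cost0[i] * (1 - h(x_i))."""
    cost1 = np.asarray(cost1, dtype=float)
    cost0 = np.asarray(cost0, dtype=float)

    def total_cost(h):
        y = np.array([h(x) for x in X], dtype=float)
        return float(np.sum(cost1 * y + cost0 * (1 - y)))

    return min(hypotheses, key=total_cost)

def make_threshold(t):
    """Hypothetical 1-D classifier h_t(x) = 1[x >= t]."""
    return lambda x: int(x >= t)

# Hypothetical finite class: the two constant classifiers plus three thresholds.
H = [lambda x: 0, lambda x: 1] + [make_threshold(t) for t in (0.25, 0.5, 0.75)]

X = [0.1, 0.4, 0.6, 0.9]
# Unit cost for predicting 1 on the first two points and 0 on the last two.
best = csc_oracle(H, X, cost1=[1, 1, 0, 0], cost0=[0, 0, 1, 1])
print([best(x) for x in X])
```

With these costs the threshold at 0.5 incurs zero total cost, so it is the classifier the oracle returns.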

3 Learning subject to AIF

In this section we first cast the learning problem subject to the AIF fairness constraints as the constrained optimization problem (1) and then develop an oracle efficient algorithm for solving its corresponding empirical risk minimization (ERM) problem (in the spirit of [agarwal]). In the coming sections we give a full analysis of the developed algorithm including its in-sample accuracy/fairness guarantees, and define the mapping it induces from new problems to hypotheses, and finally establish out-of-sample bounds for this trained mapping.

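Fair Learning Problem subject to $(\alpha, 0)$-AIF

$$\min_{\psi \in \Delta(\mathcal{H})^{\mathcal{F}},\ \gamma \in [0,1]} \ \mathrm{err}(\psi; \mathcal{P}, \mathcal{Q}) \quad \text{s.t.} \quad \forall x \in \mathcal{X}: \ \left|\mathcal{E}(x, \psi; \mathcal{Q}) - \gamma\right| \le \alpha \tag{1}$$

Definition 3.1 (OPT).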

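Consider the optimization problem (1). Given distributions $\mathcal{P}$ and $\mathcal{Q}$, and fairness approximation parameter $\alpha$, we denote the optimal solutions of (1) by $\psi^\star(\alpha; \mathcal{P}, \mathcal{Q})$ and $\gamma^\star(\alpha; \mathcal{P}, \mathcal{Q})$, and the value of the objective function at these optimal points by $\mathrm{OPT}(\alpha; \mathcal{P}, \mathcal{Q})$. In other words

$$\mathrm{OPT}(\alpha; \mathcal{P}, \mathcal{Q}) = \mathrm{err}(\psi^\star; \mathcal{P}, \mathcal{Q})$$

We will use OPT as the benchmark with respect to which we evaluate the accuracy of our trained mapping. It is worth noticing that the optimization problem (1) has a nonempty set of feasible points for any $\alpha$ and any distributions $\mathcal{P}$ and $\mathcal{Q}$, because the following point is always feasible: $\gamma = 0.5$ and $\psi_f = 0.5 h_0 + 0.5 h_1$ (i.e. random classification) for all $f \in \mathcal{F}$, where $h_0$ and $h_1$ are the all-zero and all-one constant classifiers.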
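The feasibility of random classification can be verified in one line. Writing $h_0$ and $h_1$ for the all-zero and all-one classifiers, and writing the individual error rate as the expected probability of misclassification over $f \sim \mathcal{Q}$, the mapping that always returns the uniform mixture of $h_0$ and $h_1$ gives, for every individual $x$:

```latex
\mathcal{E}(x, \psi; \mathcal{Q})
  = \mathbb{E}_{f \sim \mathcal{Q}}\Big[ \tfrac{1}{2}\,\mathbb{1}[h_0(x) \neq f(x)]
      + \tfrac{1}{2}\,\mathbb{1}[h_1(x) \neq f(x)] \Big]
  = \mathbb{E}_{f \sim \mathcal{Q}}\Big[ \tfrac{1}{2}\, f(x) + \tfrac{1}{2}\,\big(1 - f(x)\big) \Big]
  = \tfrac{1}{2}.
```

Hence every individual has error rate exactly $1/2$, the constraint holds with $\gamma = 1/2$ for any $\alpha \ge 0$, and the overall error of this trivial solution is $1/2$.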

3.1 The Empirical Fair Learning Problem

We start to develop our algorithm by defining the empirical version of (1) for a given training data set of $n$ individuals $X = \{x_i\}_{i=1}^n$ and $m$ learning problems $F = \{f_j\}_{j=1}^m$. We write $\hat{\mathcal{P}}$ and $\hat{\mathcal{Q}}$ for the empirical (uniform) distributions over $X$ and $F$. We will formulate the empirical problem as finding a restricted mapping $\psi|_F$, by which we mean that the domain of the mapping is restricted to the training set $F \subseteq \mathcal{F}$. We will later see how the dynamics of our proposed algorithm allow us to extend the restricted mapping to a mapping from the entire space $\mathcal{F}$. We slightly change notation and represent a restricted mapping $\psi|_F$ explicitly by a vector $p = (p_1, \ldots, p_m) \in \Delta(\mathcal{H})^m$ of randomized classifiers, where $p_j \in \Delta(\mathcal{H})$ corresponds to $f_j \in F$. Notice the empirical versions of the overall error rate and the individual error rates incurred by the mapping $p$ (see Definition 2.1) can be expressed as:

$$\mathrm{err}(p; \hat{\mathcal{P}}, \hat{\mathcal{Q}}) = \frac{1}{n}\sum_{i=1}^n \mathcal{E}(x_i, p; \hat{\mathcal{Q}}) = \frac{1}{m}\sum_{j=1}^m \frac{1}{n}\sum_{i=1}^n \mathbb{P}_{h_j \sim p_j}\left[h_j(x_i) \neq f_j(x_i)\right] \tag{2}$$

$$\mathcal{E}(x, p; \hat{\mathcal{Q}}) = \frac{1}{m}\sum_{j=1}^m \mathbb{P}_{h_j \sim p_j}\left[h_j(x) \neq f_j(x)\right] \tag{3}$$

Using these empirical quantities, we cast the empirical version of the fair learning problem (1) as the constrained optimization problem (4), where there is one constraint for each individual in the training data set. Note that the optimization problem (4) forms a linear program, and that we consider a slightly relaxed version of (1): instead of $(\alpha, 0)$-AIF, we require $(2\alpha, 0)$-AIF (of course now with respect to the empirical distributions). The only purpose of this relaxation is to make sure that the optimal solution of (1) (in fact, $\psi^\star$ restricted to $F$) is feasible in (4) as long as enough samples are acquired. This will appear later in the generalization analysis of our proposed algorithm.

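Empirical Fair Learning Problem

$$\min_{p \in \Delta(\mathcal{H})^m,\ \gamma \in [0,1]} \ \mathrm{err}(p; \hat{\mathcal{P}}, \hat{\mathcal{Q}}) \quad \text{s.t.} \quad \forall i \in \{1, \ldots, n\}: \ \left|\mathcal{E}(x_i, p; \hat{\mathcal{Q}}) - \gamma\right| \le 2\alpha \tag{4}$$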

3.2 A Reductions Approach: Formulation as a Two-player Game

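We use the dual perspective of constrained optimization problems to reduce the fair ERM (4) to a two-player game between a “Learner” (primal player) and an “Auditor” (dual player). Towards deriving the Lagrangian of (4), we first rewrite its constraints in the form $r(p, \gamma; \hat{\mathcal{Q}}) \le 0$, where

$$r(p, \gamma; \hat{\mathcal{Q}}) = \begin{bmatrix} \mathcal{E}(x_i, p; \hat{\mathcal{Q}}) - \gamma - 2\alpha \\ \gamma - \mathcal{E}(x_i, p; \hat{\mathcal{Q}}) - 2\alpha \end{bmatrix}_{i=1}^{n} \in \mathbb{R}^{2n} \tag{5}$$

represents the “fairness violations” of the pair $(p, \gamma)$ in one single vector. Let the corresponding dual variables be represented by $\lambda = [\lambda_i^+, \lambda_i^-]_{i=1}^n \in \Lambda$, where $\Lambda = \{\lambda \in \mathbb{R}_+^{2n} : \|\lambda\|_1 \le B\}$. Note that we place an upper bound $B$ on the $\ell_1$-norm of $\lambda$ in order to reason about the convergence of our proposed algorithm. $B$ will eventually factor into both the run-time and the approximation guarantees of our solution. Using Equation (5) and the introduced dual variables, we have that the Lagrangian of (4) is

$$\mathcal{L}(p, \gamma, \lambda) = \mathrm{err}(p; \hat{\mathcal{P}}, \hat{\mathcal{Q}}) + \lambda^\top r(p, \gamma; \hat{\mathcal{Q}}) \tag{6}$$

We therefore consider solving the following minmax problem:

$$\min_{p \in \Delta(\mathcal{H})^m,\ \gamma \in [0,1]} \ \max_{\lambda \in \Lambda} \ \mathcal{L}(p, \gamma, \lambda) = \max_{\lambda \in \Lambda} \ \min_{p \in \Delta(\mathcal{H})^m,\ \gamma \in [0,1]} \ \mathcal{L}(p, \gamma, \lambda) \tag{7}$$

where strong duality holds because $\mathcal{L}$ is linear in its arguments and the domains of $(p, \gamma)$ and $\lambda$ are convex and compact ([sion]). From a game theoretic perspective, the solution to this minmax problem can be seen as an equilibrium of a zero-sum game between two players. The primal player (Learner) has strategy space $\Delta(\mathcal{H})^m \times [0,1]$ while the dual player (Auditor) has strategy space $\Lambda$, and given a pair of chosen strategies $(p, \gamma)$ and $\lambda$, the Lagrangian $\mathcal{L}(p, \gamma, \lambda)$ represents how much the Learner has to pay to the Auditor. In other words, it defines the payoff function of a zero-sum game in which the Learner is the minimization player and the Auditor is the maximization player.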

Using no regret dynamics, an approximate equilibrium of this zero-sum game can be found in an iterative framework. In each iteration, we let the dual player run the exponentiated gradient descent algorithm and the primal player best respond. The best response problem of the Learner can be decoupled into $m+1$ separate minimization problems; in particular, the optimal classifiers $p_j$ can be viewed as the solutions to $m$ weighted classification problems in $\mathcal{H}$, where all $m$ problems share the same weights over the training individuals. In the following subsection, we derive and implement the best response of the Learner, where we use the learning oracle $CSC(\mathcal{H})$ (see Definition 2.3) to solve the weighted classification problems.
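To make the Auditor's side of these dynamics concrete, here is a minimal sketch of one exponentiated gradient step. It assumes the standard form used in reductions approaches to fair classification, in which the dual vector is kept nonnegative with $\ell_1$-norm below a bound `B` by adding one slack coordinate; the exact update used in the paper's algorithm may differ, and the function name and arguments are hypothetical.

```python
import numpy as np

def eg_step(theta, violations, eta, B):
    """One exponentiated-gradient step for the Auditor (assumed form).

    theta: running sum of scaled violations, shape (2n,)
    violations: current fairness violation vector, shape (2n,)
    Returns the updated theta and a dual vector lam with lam >= 0 and
    ||lam||_1 < B, where lam_k = B * exp(theta_k) / (1 + sum_k' exp(theta_k')).
    """
    theta = np.asarray(theta, dtype=float) + eta * np.asarray(violations, dtype=float)
    # Subtract a common shift for numerical stability; the slack term "1"
    # becomes exp(-shift), so the ratio is unchanged.
    shift = max(float(theta.max()), 0.0)
    weights = np.exp(theta - shift)
    lam = B * weights / (np.exp(-shift) + weights.sum())
    return theta, lam

# Constraints that are currently violated (positive entries) gain dual weight.
theta, lam = eg_step(np.zeros(4), [0.3, -0.5, -0.1, -0.1], eta=1.0, B=2.0)
print(lam)
```

The multiplicative form means that a constraint violated repeatedly across rounds accumulates weight exponentially fast, which is what pushes the Learner's next best response toward fairness for the affected individual.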

3.3 BEST: The Learner’s Best Response

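We formally describe and analyze the best response problem of the Learner in this subsection and summarize the results in a subroutine called BEST. In each iteration of the described iterative framework, the Learner is given some $\lambda \in \Lambda$ picked by the Auditor and she wants to solve the following minimization problem.

$$\arg\min_{p \in \Delta(\mathcal{H})^m,\ \gamma \in [0,1]} \mathcal{L}(p, \gamma, \lambda)$$

We will use Equations (2) and (3) to expand the Lagrangian (6) and decouple the above minimization problem into $m+1$ minimization problems, each of which depends on only one of the variables the Learner has to pick. Dropping terms that do not depend on $(p, \gamma)$, we have

$$\begin{aligned}
\arg\min_{p \in \Delta(\mathcal{H})^m,\ \gamma \in [0,1]} \mathcal{L}(p, \gamma, \lambda)
&\equiv \arg\min_{p \in \Delta(\mathcal{H})^m,\ \gamma \in [0,1]} \ \mathrm{err}(p; \hat{\mathcal{P}}, \hat{\mathcal{Q}}) + \sum_{i=1}^n \left\{ \lambda_i^+ \left(\mathcal{E}(x_i, p; \hat{\mathcal{Q}}) - \gamma\right) + \lambda_i^- \left(\gamma - \mathcal{E}(x_i, p; \hat{\mathcal{Q}})\right) \right\} \\
&\equiv \arg\min_{p \in \Delta(\mathcal{H})^m,\ \gamma \in [0,1]} \ \frac{1}{n}\sum_{i=1}^n \mathcal{E}(x_i, p; \hat{\mathcal{Q}}) + \sum_{i=1}^n \left\{ \lambda_i^+ \left(\mathcal{E}(x_i, p; \hat{\mathcal{Q}}) - \gamma\right) + \lambda_i^- \left(\gamma - \mathcal{E}(x_i, p; \hat{\mathcal{Q}})\right) \right\} \\
&\equiv \arg\min_{p \in \Delta(\mathcal{H})^m,\ \gamma \in [0,1]} \ \sum_{i=1}^n \left(\lambda_i^- - \lambda_i^+\right) \gamma + \sum_{i=1}^n \left(\frac{1}{n} + \lambda_i^+ - \lambda_i^-\right) \mathcal{E}(x_i, p; \hat{\mathcal{Q}}) \\
&\equiv \arg\min_{p \in \Delta(\mathcal{H})^m,\ \gamma \in [0,1]} \ \sum_{i=1}^n \left(\lambda_i^- - \lambda_i^+\right) \gamma + \frac{1}{m}\sum_{j=1}^m \left\{ \sum_{i=1}^n \left(\frac{1}{n} + \lambda_i^+ - \lambda_i^-\right) \mathbb{P}_{h_j \sim p_j}\left[h_j(x_i) \neq f_j(x_i)\right] \right\}
\end{aligned}$$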

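Therefore, the minimization problem of the Learner gets nicely decoupled into $m+1$ minimization problems. Let $w_i = \lambda_i^+ - \lambda_i^-$ for all $i \in [n]$, and accordingly, let $w = [w_1, \ldots, w_n]^\top$. First, the optimal value for $\gamma$ is chosen according to

$$\gamma = \mathbb{1}\left[\sum_{i=1}^n w_i > 0\right] \tag{8}$$

And for learning problem $j$, the following minimization problem must be solved.

$$\arg\min_{p_j \in \Delta(\mathcal{H})} \sum_{i=1}^n \left(\frac{1}{n} + w_i\right) \mathbb{P}_{h_j \sim p_j}\left[h_j(x_i) \neq f_j(x_i)\right] \ \equiv \ \arg\min_{h_j \in \mathcal{H}} \sum_{i=1}^n \left(\frac{1}{n} + w_i\right) \mathbb{1}\left[h_j(x_i) \neq f_j(x_i)\right]$$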

where the equivalence holds since the Learner can choose to put all the probability mass on a single classifier. This minimization problem represents exactly a weighted classification problem. Since we work with cost sensitive classification oracles in this paper, we further reduce the weighted classification problem to a CSC problem that can be solved by a call to the CSC oracle for $\mathcal{H}$ ($CSC(\mathcal{H})$). For problem $j$, let

$$c_{i,j}^1 = \left(w_i + \frac{1}{n}\right)\left(1 - f_j(x_i)\right), \qquad c_{i,j}^0 = \left(w_i + \frac{1}{n}\right) f_j(x_i)$$

represent the costs associated with individual $i$. Observe that the above weighted classification problem can now be cast as the following CSC problem.

$$h_j = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^n c_{i,j}^1 h(x_i) + c_{i,j}^0 \left(1 - h(x_i)\right) \tag{9}$$

To sum up, in each iteration of the algorithm the Auditor first uses the exponentiated gradient descent algorithm to update the dual variable $\lambda$ (or correspondingly, the vector of weights $w$ over the training individuals), and then the Learner picks $\gamma = \mathbb{1}\left[\sum_{i=1}^n w_i > 0\right]$ and solves the $m$ cost sensitive classification problems cast in (9) by calling the oracle $CSC(\mathcal{H})$ for all $j \in [m]$. We have the best response of the Learner written in Subroutine 1. This subroutine will be called in each iteration of the final AIF learning algorithm.
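The best response just derived can be sketched in a few lines. This is an illustrative sketch rather than the paper's Subroutine 1: the oracle is a hypothetical brute-force solver over three 1-D threshold classifiers (including the two constant ones), and all function names are assumptions.

```python
import numpy as np

def threshold_oracle(X, cost1, cost0, thresholds=(-np.inf, 0.5, np.inf)):
    """Hypothetical brute-force CSC oracle over classifiers h_t(x) = 1[x >= t].
    t = -inf and t = +inf give the all-one and all-zero constant classifiers."""
    X = np.asarray(X, dtype=float)

    def cost(t):
        y = (X >= t).astype(float)
        return float(np.sum(cost1 * y + cost0 * (1 - y)))

    t_best = min(thresholds, key=cost)
    return lambda x, t=t_best: int(x >= t)

def best_response(lam_plus, lam_minus, labels, X, oracle):
    """Learner's best response to dual variables (lam_plus, lam_minus).

    labels[j][i] = f_j(x_i). Returns (list of m classifiers, gamma)."""
    w = np.asarray(lam_plus, dtype=float) - np.asarray(lam_minus, dtype=float)
    n = len(w)
    gamma = 1.0 if w.sum() > 0 else 0.0       # gamma = 1[sum_i w_i > 0]
    classifiers = []
    for f in np.asarray(labels, dtype=float):
        cost1 = (w + 1.0 / n) * (1 - f)       # cost of predicting 1 on x_i
        cost0 = (w + 1.0 / n) * f             # cost of predicting 0 on x_i
        classifiers.append(oracle(X, cost1, cost0))
    return classifiers, gamma

X = [0.2, 0.8]
labels = [[0, 1], [1, 1]]                     # two problems, two individuals
clfs, gamma = best_response([0.0, 0.0], [0.0, 0.0], labels, X, threshold_oracle)
print([[h(x) for x in X] for h in clfs], gamma)

# A large lam_minus on individual 0 makes her weight w_0 + 1/n negative, so the
# Learner is now rewarded for misclassifying her.
clfs_neg, _ = best_response([0.0, 0.0], [0.7, 0.0], labels, X, threshold_oracle)
print([clfs_neg[0](x) for x in X])
```

With zero dual variables the Learner simply minimizes error on each problem. Negative weights are what allow the Auditor to raise the error rate of an individual whose rate is too far below $\gamma$.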

3.4 AIF-Learn: Implementation and In-sample Guarantees

In Algorithm 2 (AIF-Learn), with a slight deviation from what we described in the previous subsections, we implement the proposed algorithm. The deviation arises when the Auditor updates the dual variables $\lambda$ in each round, and is introduced in the service of arguing for generalization. To counteract the inherent adaptivity of the algorithm (which makes the quantities estimated at each round data dependent), at each round $t$ of the algorithm we draw a fresh batch of $m_0$ problems to estimate the fairness violation vector (5). From another viewpoint, which is the way the algorithm is actually implemented, we assume as in usual batch learning models that we have a training set $F$ of $m$ learning problems upfront. However, in our proposed algorithm that runs for $T$ iterations, we partition $F$ into $T$ equally-sized (of size $m_0$) subsets $\{F_t\}_{t=1}^T$ uniformly at random and use only the batch $F_t$ of $m_0$ problems at round $t$ to update the dual variables $\lambda$. Without loss of generality and to avoid technical complications, we assume $m_0 = m/T$ is a natural number. This is represented in Algorithm 2 by writing $\hat{\mathcal{Q}}_t$ for the uniform distribution over the batch of problems $F_t$, and $h_t|_{F_t}$ for the associated learned classifiers for $F_t$. We will see that this modification only introduces an extra term to the regret of the Auditor, and thus we have to assume $m_0$ is sufficiently large (Assumption 3.1) so that the Auditor in fact has low regret.

Notice that AIF-Learn takes as input an approximation parameter $\nu$ which will quantify how close the output of the algorithm is to an equilibrium of the introduced game, and it will accordingly propagate to the accuracy bounds. One important aspect of AIF-Learn is that the algorithm maintains a vector of weights $w_t$ over the training individuals $X$, and that each $\hat{p}_j$ learned by our algorithm is in fact an average over $T$ classifiers, where classifier $t$ is the solution to a CSC problem on $X$ weighted by $w_t$. As a consequence, we propose to extend the learned restricted mapping $\hat{p}$ to a mapping $\hat{\psi}$ that takes any problem $f \in \mathcal{F}$ as input (represented to $\hat{\psi}$ by the labels it induces on the training individuals), uses $X$ along with the set of weights $\{w_t\}_{t=1}^T$ to solve $T$ CSC problems in a similar fashion, and outputs the average of the learned classifiers, denoted by $\hat{\psi}_f \in \Delta(\mathcal{H})$. This extension is consistent with $\hat{p}$ in the sense that $\hat{\psi}$ restricted to $F$ will be exactly the $\hat{p}$ output by our algorithm. We have the pseudocode for $\hat{\psi}$ written in detail in Mapping 3 and we let AIF-Learn output $\hat{\psi}$.
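The extension to new problems can be sketched as follows. This is an illustrative sketch, not the paper's Mapping 3: it assumes the per-round weight vectors have been stored in a list, builds the costs as in Equation (9), and uses a hypothetical oracle over the class containing only the two constant classifiers.

```python
import numpy as np

def psi_hat(f_labels, X, weight_history, oracle):
    """Map a new problem, given by its labels on the training individuals,
    to the T classifiers whose uniform mixture is the output hypothesis."""
    f = np.asarray(f_labels, dtype=float)
    n = len(f)
    classifiers = []
    for w in weight_history:                   # one CSC call per stored round
        w = np.asarray(w, dtype=float)
        cost1 = (w + 1.0 / n) * (1 - f)
        cost0 = (w + 1.0 / n) * f
        classifiers.append(oracle(X, cost1, cost0))
    return classifiers

def predict_proba(classifiers, x):
    """Probability that the uniform mixture of the classifiers labels x as 1."""
    return float(np.mean([h(x) for h in classifiers]))

def constant_oracle(X, cost1, cost0):
    """Hypothetical CSC oracle over the two constant classifiers only."""
    return (lambda x: 1) if cost1.sum() < cost0.sum() else (lambda x: 0)

X = [0, 1, 2]
weight_history = [[0.0, 0.0, 0.0], [0.0, 0.0, 2.0]]   # hypothetical weights, T = 2
mixture = psi_hat([1, 1, 0], X, weight_history, constant_oracle)
print(predict_proba(mixture, 0))
```

Because the mapping depends on the training problems only through the stored weights, a new problem is handled with exactly the same CSC calls as a training problem, which is the structural property the generalization argument over problems relies on.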

We start the analysis of Algorithm 2 by establishing the regret bound of the Auditor over $T$ rounds of the algorithm. The regret bound will help us pick the number of iterations $T$ and the learning rate $\eta$ so that the Auditor has sufficiently small regret (bounded by $\nu$). Notice that the Learner uses her best response in each round of the algorithm, which implies that she has zero regret. Since in this subsection we eventually want to state in-sample guarantees (i.e., guarantees with respect to the distributions $\hat{\mathcal{P}}$ and $\hat{\mathcal{Q}}$), we work with the restricted mapping $\hat{p}$. In the next subsection we will focus on generalization in our framework, and there we will have to state the guarantees for the mapping $\hat{\psi}$. We defer all the proofs to the Appendix.

Lemma 3.1 (Regret of the Auditor).

Let $0 < \delta < 1$. Let $\{\lambda_t\}_{t=1}^T$ be the sequence of exponentiated gradient descent plays (with learning rate $\eta$) by the Auditor in response to the plays $\{(h_t, \gamma_t)\}_{t=1}^T$ of the Learner over $T$ rounds of Algorithm 2. We have that for any set of individuals $X$, with probability at least $1-\delta$ over the problems $F$, the (average) regret of the Auditor is bounded as follows.

$$\frac{1}{T}\max_{\lambda \in \Lambda}\sum_{t=1}^T \mathcal{L}(h_t, \gamma_t, \lambda) - \frac{1}{T}\sum_{t=1}^T \mathcal{L}(h_t, \gamma_t, \lambda_t) \le B\sqrt{\frac{\log(2nT/\delta)}{2m_0}} + \frac{B\log(2n+1)}{\eta T} + \eta B(1+2\alpha)^2$$

The last two terms appearing in the above regret bound come from the usual regret analysis of the exponentiated gradient descent algorithm. The first term, however, originates from a high probability Chernoff-Hoeffding bound because, as explained before, the Auditor is using only $m_0$ randomly selected problems, instead of the whole set of $m$ problems, to estimate the vector of fairness violations at round $t$. Hence at each round $t$, the difference of the two fairness violation estimates (one with respect to $\hat{\mathcal{Q}}$ and another with respect to $\hat{\mathcal{Q}}_t$) will appear in the regret of the Auditor, and it can be bounded by the Chernoff-Hoeffding inequality. We will therefore have to assume that $m_0$ is sufficiently large to make the above regret bound small enough.
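To get a feel for how the three terms trade off, the bound can be evaluated numerically. The sketch below reads the right-hand side of Lemma 3.1 as $B\sqrt{\log(2nT/\delta)/(2m_0)} + B\log(2n+1)/(\eta T) + \eta B(1+2\alpha)^2$; the parameter values are arbitrary, and the learning rate that balances the last two terms is a simple calculus choice made for illustration, not necessarily the one used in Algorithm 2.

```python
import math

def auditor_regret_bound(n, T, m0, B, eta, alpha, delta):
    """Return the three terms of the Auditor's average regret bound."""
    sampling = B * math.sqrt(math.log(2 * n * T / delta) / (2 * m0))
    eg_log = B * math.log(2 * n + 1) / (eta * T)
    eg_step = eta * B * (1 + 2 * alpha) ** 2
    return sampling, eg_log, eg_step

# Arbitrary illustrative parameters.
n, T, m0, B, alpha, delta = 100, 400, 1000, 10.0, 0.05, 0.05
# Setting the last two terms equal gives eta^2 = log(2n+1) / (T * (1+2*alpha)^2).
eta_star = math.sqrt(math.log(2 * n + 1) / T) / (1 + 2 * alpha)
terms = auditor_regret_bound(n, T, m0, B, eta_star, alpha, delta)
print(terms, sum(terms))
```

With the balanced learning rate the last two terms decay like $\sqrt{\log(n)/T}$, while the first term is controlled only by the batch size $m_0$, which is why a separate assumption on $m_0$ is needed.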

Assumption 3.1.

For a given confidence parameter , inputs and of Algorithm 2, we suppose throughout this section that the number of fresh problems used in each round of Algorithm 2 satisfies , or equivalently .

Following Lemma 3.1 and Assumption 3.1, we characterize the average plays output by Algorithm 2 in the following lemma. Informally speaking, this lemma guarantees that neither player would gain more than $\nu$ if they deviated from the proposed strategies output by the algorithm. This is what we call a $\nu$-approximate equilibrium of the game. The proof of the lemma follows from the regret analysis of the Auditor and is fully presented in the Appendix.

Lemma 3.2 (Average Play Characterization).

Let $0 < \delta < 1$. We have that under Assumption 3.1, for any set of individuals $X$, with probability at least $1-\delta$ over the labelings $F$, the average plays $(\hat{p}, \hat{\gamma}, \hat{\lambda})$ output by Algorithm 2 form a $\nu$-approximate equilibrium of the game, i.e.,

$$\begin{aligned}
\mathcal{L}(\hat{p}, \hat{\gamma}, \hat{\lambda}) &\le \mathcal{L}(p, \gamma, \hat{\lambda}) + \nu \quad \text{for all } p \in \Delta(\mathcal{H})^m,\ \gamma \in [0,1] \\
\mathcal{L}(\hat{p}, \hat{\gamma}, \hat{\lambda}) &\ge \mathcal{L}(\hat{p}, \hat{\gamma}, \lambda) - \nu \quad \text{for all } \lambda \in \Lambda
\end{aligned}$$

We are now ready to present the main theorem of this subsection, which takes the guarantees provided in Lemma 3.2 and turns them into accuracy and fairness guarantees for the pair $(\hat{p}, \hat{\gamma})$ using the specific form of the Lagrangian (6). The theorem will in fact show that the set of randomized classifiers $\hat{p}$ achieves optimal accuracy up to $2\nu$ and that it also satisfies the $(3\alpha, 0)$-AIF notion of fairness, all with respect to the empirical distributions $\hat{\mathcal{P}}$ and $\hat{\mathcal{Q}}$.

Theorem 3.3 (In-sample Accuracy and Fairness).

Let $0 < \delta < 1$ and suppose Assumption 3.1 holds. Let $(\hat{p}, \hat{\gamma})$ be the output of Algorithm 2 and let $(p, \gamma)$ be any feasible pair of variables for the empirical fair learning problem (4). We have that for any set of individuals $X$, with probability at least $1-\delta$ over the labelings $F$,

$$\mathrm{err}(\hat{p}; \hat{\mathcal{P}}, \hat{\mathcal{Q}}) \le \mathrm{err}(p; \hat{\mathcal{P}}, \hat{\mathcal{Q}}) + 2\nu$$

and that $\hat{p}$ satisfies $(3\alpha, 0)$-AIF with respect to the empirical distributions $(\hat{\mathcal{P}}, \hat{\mathcal{Q}})$. In other words, for all $i \in [n]$,

$$\left|\mathcal{E}(x_i, \hat{p}; \hat{\mathcal{Q}}) - \hat{\gamma}\right| \le 3\alpha$$

3.5 Generalization Theorems

When it comes to out-of-sample performance in our framework, unlike in usual learning settings, there are two distributions we need to reason about: the individual distribution $\mathcal{P}$ and the problem distribution $\mathcal{Q}$ (see Figure 1 for a visual illustration of generalization directions in our framework). We need to argue that $\hat{\psi}$ induces a mapping that is accurate with respect to $\mathcal{P}$ and $\mathcal{Q}$, and is fair for almost every individual $x \sim \mathcal{P}$, where fairness is defined with respect to the true problem distribution $\mathcal{Q}$. Given these two directions for generalization, we state our generalization guarantees in three steps visualized by arrows in Figure 1. First, in Theorem 3.4, we fix the empirical distribution of the problems $\hat{\mathcal{Q}}$ and show that the output $\hat{\psi}$ of Algorithm 2 is accurate and fair with respect to the underlying individual distribution $\mathcal{P}$ as long as $n$ is sufficiently large. Second, in Theorem 3.5, we fix the empirical distribution of individuals $\hat{\mathcal{P}}$ and consider generalization along the underlying problem generating distribution $\mathcal{Q}$. It will follow from the dynamics of the algorithm, as well as the structure of $\hat{\psi}$, that the learned mapping $\hat{\psi}$ will remain accurate and fair with respect to $\mathcal{Q}$. We will eventually put these pieces together in Theorem 3.6 and argue that $\hat{\psi}$ is accurate and fair with respect to the underlying distributions $(\mathcal{P}, \mathcal{Q})$ simultaneously, given that both $n$ and $m$ are large enough. We will use OPT (see Definition 3.1) as a benchmark to evaluate the accuracy of the mapping $\hat{\psi}$.

Theorem 3.4 (Generalization over P).

Let $0 < \delta < 1$. Let $(\hat{\psi}, \hat{\gamma})$ be the outputs of Algorithm 2, and suppose

$$n \ge \tilde{O}\left(\frac{m\, d_{\mathcal{H}} + \log\left(1/\nu^2\delta\right)}{\alpha^2 \beta^2}\right)$$

where $d_{\mathcal{H}}$ is the VC dimension of $\mathcal{H}$. We have that with probability at least $1 - O(\delta)$ over the observed data set, the mapping $\hat{\psi}$ satisfies $(5\alpha, \beta)$-AIF with respect to the distributions $(\mathcal{P}, \hat{\mathcal{Q}})$, i.e.,

$$\mathbb{P}_{x \sim \mathcal{P}}\left(\left|\mathcal{E}(x, \hat{\psi}; \hat{\mathcal{Q}}) - \hat{\gamma}\right| > 5\alpha\right) \le \beta$$

and that,

$$\mathrm{err}(\hat{\psi}; \mathcal{P}, \hat{\mathcal{Q}}) \le \mathrm{OPT}(\alpha; \mathcal{P}, \hat{\mathcal{Q}}) + O(\nu) + O(\alpha\beta)$$

The proof of this theorem will use standard VC-type generalization techniques, where a Chernoff-Hoeffding bound is followed by a union bound (accompanied by the two-sample trick and Sauer’s Lemma) to guarantee uniform convergence of the empirical estimates to their true expectations. However, compared to the standard VC-based sample complexity bounds in learning theory, there is an extra factor of $m$ because there are $m$ hypotheses to be learned. In addition, a $\beta$ factor appears in the denominator because, in our setting, uniform convergence for all pure classifiers does not simply lead to uniform convergence for all randomized classifiers without blowing up the sample complexity (specifically when proving generalization for fairness). We will therefore directly prove uniform convergence for randomized classifiers. Our argument goes through by sparsifying the learned distributions (replacing each with a small sample drawn from it), coupled with uniform convergence for sparse classifiers (randomized classifiers with small support size), and this is how $\beta$ shows up in the sample complexity bound.

Theorem 3.5 (Generalization over Q).

Let . Let $(\hat{\psi}, \hat{\gamma})$ be the outputs of Algorithm 2 and suppose

$$m \ge \tilde{O}\left(\frac{\log(n)\,\log(n/\delta)}{\nu^{4}\alpha^{4}}\right)$$

We have that for any set of observed individuals , with probability at least over the observed problems , the learned mapping $\hat{\psi}$ satisfies $(4\alpha, 0)$-AIF with respect to the distributions $(\hat{\mathcal{P}}, \mathcal{Q})$, i.e.,

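$$\mathbb{P}_{x \sim \hat{\mathcal{P}}}\left(\left|\mathcal{E}(x, \hat{\psi}; \mathcal{Q}) - \hat{\gamma}\right| > 4\alpha\right) = 0$$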

and that,

$$\mathrm{err}(\hat{\psi}; \hat{\mathcal{P}}, \mathcal{Q}) \le \mathrm{OPT}(\alpha; \hat{\mathcal{P}}, \mathcal{Q}) + O(\nu)$$

This theorem follows directly from Chernoff-type concentration inequalities: because in each round the Auditor in our algorithm uses only a fresh batch of randomly selected problems to estimate the fairness violations, we can prove concentration without appealing to uniform convergence. The sample complexity for $m$ stated in this theorem is equivalent to that of Assumption 3.1 because we needed almost the same type of concentration to control the regret of the Auditor in the previous subsection. Having proved generalization separately for $\mathcal{P}$ and $\mathcal{Q}$, we are now ready to state the final theorem of this section, which provides generalization guarantees simultaneously over both distributions $\mathcal{P}$ and $\mathcal{Q}$.

Theorem 3.6 (Simultaneous Generalization over P and Q).

Let . Let $(\hat{\psi}, \hat{\gamma})$ be the outputs of Algorithm 2 and suppose

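$$n \ge \tilde{O}\left(\frac{m\, d_{\mathcal{H}} + \log\left(1/\nu^{2}\delta\right)}{\alpha^{2}\beta^{2}}\right), \qquad m \ge \tilde{O}\left(\frac{\log(n)\,\log(n/\delta)}{\nu^{4}\alpha^{4}}\right)$$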

where $d_{\mathcal{H}}$ is the VC dimension of $\mathcal{H}$. We have that with probability at least over the observed data set , the learned mapping $\hat{\psi}$ satisfies $(6\alpha, 2\beta)$-AIF with respect to the distributions $(\mathcal{P}, \mathcal{Q})$, i.e.,

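$$\mathbb{P}_{x \sim \mathcal{P}}\left(\left|\mathcal{E}(x, \hat{\psi}; \mathcal{Q}) - \hat{\gamma}\right| > 6\alpha\right) \le 2\beta$$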

and that,

$$\mathrm{err}(\hat{\psi}; \mathcal{P}, \mathcal{Q}) \le \mathrm{OPT}(\alpha; \mathcal{P}, \mathcal{Q}) + O(\nu) + O(\alpha\beta)$$

To prove this theorem, we start with the guarantees we have for the empirical distributions $(\hat{\mathcal{P}}, \hat{\mathcal{Q}})$ and lift them to the corresponding guarantees for $(\mathcal{P}, \hat{\mathcal{Q}})$ by Theorem 3.4. We then take another “lifting” step from $(\mathcal{P}, \hat{\mathcal{Q}})$ to $(\mathcal{P}, \mathcal{Q})$, which differs from what we have shown in Theorem 3.5 and is proved as a separate lemma in the Appendix. Note that the bounds on $n$ and $m$ in Theorem 3.6 are mutually dependent: $n$ must be linear in $m$, but $m$ need only be logarithmic in $n$, and so both bounds can be simultaneously satisfied with sample complexity that is only polynomial in the parameters of the problem.
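The mutual dependence of the two bounds can be illustrated numerically. The sketch below is only a sanity check under loud assumptions: it drops every constant and every logarithmic factor hidden by the $\tilde{O}$ notation, reads $\log(1/\nu^2\delta)$ as $\log(1/(\nu^2\delta))$, and uses hypothetical function names, so the returned numbers are not meaningful sample sizes. It only shows that doubling $n$ eventually satisfies both inequalities, because the requirement on $n$ grows polylogarithmically in $n$ itself.

```python
import math


def n_lower_bound(m, d_h, alpha, beta, nu, delta):
    """Shape of the bound on n, with all hidden constants set to 1."""
    return (m * d_h + math.log(1.0 / (nu ** 2 * delta))) / (alpha ** 2 * beta ** 2)


def m_lower_bound(n, alpha, nu, delta):
    """Shape of the bound on m, with all hidden constants set to 1."""
    return math.log(n) * math.log(n / delta) / (nu ** 4 * alpha ** 4)


def joint_sample_sizes(d_h, alpha, beta, nu, delta):
    """Double n until (n, m) satisfies both bounds, with m the smallest
    integer allowed by the bound on m for that n."""
    n = 2
    while True:
        m = math.ceil(m_lower_bound(n, alpha, nu, delta))
        if n >= n_lower_bound(m, d_h, alpha, beta, nu, delta):
            return n, m
        n *= 2


if __name__ == "__main__":
    n, m = joint_sample_sizes(d_h=10, alpha=0.5, beta=0.5, nu=0.5, delta=0.1)
    print("n =", n, "m =", m)
```

The loop terminates because the right-hand side of the bound on $n$ depends on $n$ only through $\log(n)\log(n/\delta)$, which grows more slowly than $n$.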

4 Experimental Evaluation

We have implemented the AIF-Learn algorithm and conclude with a brief experimental demonstration of its practical efficacy using the Communities and Crime dataset, which contains U.S. census records with demographic information at the neighborhood level. To obtain a challenging instance of our multi-problem framework, we treated each of the first 200 neighborhoods as the “individuals” in our sample, and binarized versions of the first variables as distinct prediction problems. Another of the variables were used as features for learning. For the base learning oracle assumed by AIF-Learn, we used a linear threshold learning heuristic that has worked well in other oracle-efficient reductions ([KNRW18]).

Despite the absence of worst-case guarantees for the linear threshold heuristic, AIF-Learn seems to empirically enjoy the strong convergence properties suggested by the theory. In Figure 2(a) we show trajectory plots of the learned model’s error ( axis) versus its fairness violation (variation in cross-problem individual error rates, axis) over 1000 iterations of the algorithm for varying values of the allowed fairness violation (dashed lines). In each case we see the trajectory eventually converges to a point which saturates the fairness constraint with the optimal error.

In Figure 2(b) we provide a more detailed view of the behavior and performance of AIF-Learn. The $x$ axis measures error rates, while the $y$ axis measures the allowed fairness violation. For each value of the allowed fairness violation (which is the allowed gap between the smallest and largest individual errors on input $\alpha$), there is a horizontal row of 200 blue dots showing the error rates for each individual, and a single red dot representing the overall average of those individual error rates. As expected, for large $\alpha$ (weak or no fairness constraint), the overall error rate is lowest, but the spread of individual error rates (unfairness) is greatest. As $\alpha$ is decreased, the spread of individual error rates is greatly narrowed, at a cost of greater overall error.
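The two quantities plotted per row in Figure 2(b), the individual error rates and their overall average, together with the spread between the smallest and largest individual error, can be computed from a table of predictions as in the following minimal sketch. It assumes deterministic 0/1 predictions stored as nested lists indexed by individual and then by problem; the function names and the toy data are hypothetical and are not taken from our implementation.

```python
def individual_error_rates(predictions, labels):
    """Error rate of each individual, averaged over the problems.

    predictions[i][j] and labels[i][j] are the 0/1 prediction and the
    0/1 label of individual i on problem j.
    """
    rates = []
    for pred_row, label_row in zip(predictions, labels):
        mistakes = sum(1 for p, y in zip(pred_row, label_row) if p != y)
        rates.append(mistakes / len(label_row))
    return rates


def summarize(rates):
    """Return (overall average error, gap between largest and smallest rate)."""
    overall = sum(rates) / len(rates)
    gap = max(rates) - min(rates)
    return overall, gap


if __name__ == "__main__":
    # Toy data: 3 individuals, 4 problems.
    labels = [[1, 0, 1, 0], [0, 0, 1, 1], [1, 1, 0, 0]]
    predictions = [[1, 0, 1, 1], [0, 1, 0, 1], [1, 1, 0, 0]]
    rates = individual_error_rates(predictions, labels)
    print(rates, summarize(rates))
```

For a randomized mapping, the 0/1 mistake indicator would be replaced by the probability of a mistake; the averaging over problems is unchanged.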

A trivial way of achieving zero variability in individual error rates is to make all predictions randomly. So as a baseline comparison for AIF-Learn, the gray dots in Figure 2(b) show the individual error rates achieved by different mixtures of the unconstrained error-optimal model with random classifications, with a black dot representing the overall average of these rates. When the weight on random classification is low (weak or no fairness, top row of gray dots), the overall error is lowest and the individual variation (unfairness) is highest. As we increase the weight on random classification, variation or unfairness decreases and the overall error gets worse. It is clear from the figure that AIF-Learn is considerably outperforming this baseline, both in terms of the average errors (red vs. black lines) and the individual errors (blue vs. gray dots).
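The effect of this baseline on individual error rates can be worked out in closed form. The sketch below assumes binary labels and that "random classification" means an independent fair coin flip, applied to each prediction with probability `random_weight`; both the name and this exact mixing scheme are assumptions made for illustration. Under these assumptions an individual with error rate $e$ under the unconstrained model has expected error $(1-w)e + w/2$, so the spread shrinks by a factor $(1-w)$ while every rate is pulled toward $1/2$.

```python
def mixture_error_rates(base_rates, random_weight):
    """Expected individual error rates when each prediction is replaced by
    a fair coin flip with probability random_weight (assumed scheme)."""
    return [(1.0 - random_weight) * r + random_weight * 0.5 for r in base_rates]


def spread(rates):
    """Gap between the largest and smallest individual error rate."""
    return max(rates) - min(rates)


if __name__ == "__main__":
    base = [0.1, 0.3, 0.6]  # hypothetical error rates of the unconstrained model
    for w in (0.0, 0.25, 0.5, 0.75, 1.0):
        mixed = mixture_error_rates(base, w)
        average = sum(mixed) / len(mixed)
        print(w, round(average, 4), round(spread(mixed), 4))
```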

Finally, we present out-of-sample performance of AIF-Learn, for which we provided theoretical guarantees in Section 3.5, in Figure 3. To be consistent with in-sample results reported in Figure 2(b), for each value of $\alpha$, we trained a mapping on exactly the same subset of the Communities and Crime data set ( individuals, problems) that we used before. Thus the red curve labelled “training” in Figure 3 is the same as the red curve appearing in Figure 2(b). We used a completely fresh holdout consisting of individuals and problems (binarized features from the dataset that weren’t previously used) to evaluate our generalization performance over both individuals and problems, in terms of both accuracy and fairness violation. Similar to the presentation of generalization theorems in Section 3.5, we demonstrate experimental evaluation of generalization in three steps. The blue and green curves in Figure 3 represent generalization results over individuals (test data: test individuals and training problems) and problems (test data: training individuals and test problems), respectively. The black curve represents generalization across both individuals and problems, where test individuals and test problems were used to evaluate the performance of the trained models.
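The three evaluation directions amount to scoring the same trained model on different blocks of an individuals-by-problems error table. The following sketch makes that bookkeeping explicit; it assumes the per-pair errors have already been computed and stored in a nested list `errors`, and all names and the toy data are hypothetical.

```python
def block_stats(errors, individuals, problems):
    """Average error and fairness gap on one block of the error table.

    errors[i][j] is the (expected) error of individual i on problem j;
    individuals and problems are lists of row and column indices.
    """
    rates = [sum(errors[i][j] for j in problems) / len(problems)
             for i in individuals]
    return sum(rates) / len(rates), max(rates) - min(rates)


def generalization_report(errors, train_ind, test_ind, train_prob, test_prob):
    """Score the four (individuals, problems) combinations."""
    return {
        "training": block_stats(errors, train_ind, train_prob),
        "individuals": block_stats(errors, test_ind, train_prob),
        "problems": block_stats(errors, train_ind, test_prob),
        "both": block_stats(errors, test_ind, test_prob),
    }


if __name__ == "__main__":
    # Toy table: 4 individuals by 4 problems.
    errors = [[0, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [0, 0, 0, 1]]
    report = generalization_report(errors, [0, 1], [2, 3], [0, 1], [2, 3])
    for name, stats in report.items():
        print(name, stats)
```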

Two things stand out from Figure 3:

1. As predicted by the theory, our test curves track our training curves, but with higher error and unfairness. In particular, the ordering of the models (each corresponds to one value of $\alpha$) on the Pareto frontier is the same in testing as in training, meaning that the training curve can indeed be used to manage the trade-off out-of-sample as well.

2. The gap in error is substantially smaller than would be predicted by our theory: since our training data set is so small, our theoretical guarantees are vacuous, but all points plotted in our test Pareto curves are non-trivial in terms of both accuracy and fairness. Presumably the gap in error would narrow on larger training data sets.

We present additional experimental results on a synthetic data set in the supplement.

Acknowledgements

AR is supported in part by NSF grants AF-1763307 and CNS-1253345, and an Amazon Research Award.

Appendix A Learning subject to False Positive AIF (FPAIF)

Definition A.1 (Individual False Positive/Negative Error Rates).

For a given individual $x$, a mapping $\psi$, and a distribution $\mathcal{Q}$ over the space of problems, the individual false positive and false negative rates incurred by $\psi$ on $x$ are defined as follows:

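$$\mathcal{E}_{FP}(x, \psi; \mathcal{Q}) = \frac{1}{\mathbb{P}_{f \sim \mathcal{Q}}\left[f(x) = 0\right]} \cdot \mathbb{E}_{f \sim \mathcal{Q}}\left[\mathbb{P}_{h \sim \psi_f}\left[h(x) = 1,\ f(x) = 0\right]\right]$$

$$\mathcal{E}_{FN}(x, \psi; \mathcal{Q}) = \frac{1}{\mathbb{P}_{f \sim \mathcal{Q}}\left[f(x) = 1\right]} \cdot \mathbb{E}_{f \sim \mathcal{Q}}\left[\mathbb{P}_{h \sim \psi_f}\left[h(x) = 0,\ f(x) = 1\right]\right]$$
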
Definition A.2 (FPAIF fairness notion).

In our framework, we say a mapping $\psi$ satisfies “$(\alpha, \beta)$-FPAIF” (reads $(\alpha, \beta)$-approximate False Positive Average Individual Fairness) with respect to the distributions $(\mathcal{P}, \mathcal{Q})$ if there exists $\gamma \in [0, 1]$ such that

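$$\mathbb{P}_{x \sim \mathcal{P}}\left(\left|\mathcal{E}_{FP}(x, \psi; \mathcal{Q}) - \gamma\right| > \alpha\right) \le \beta$$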
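On a finite sample, Definitions A.1 and A.2 can be evaluated directly by taking both distributions to be uniform over the observed individuals and problems. The sketch below does this under stated assumptions: `positive_probs_x[j]` is taken to be the probability that the randomized classifier chosen for problem $j$ labels $x$ positive, individuals with no negative label (for whom the rate is undefined) are skipped, and all function names are hypothetical.

```python
def individual_fp_rate(labels_x, positive_probs_x):
    """Empirical individual false positive rate of one individual x.

    labels_x[j] is the 0/1 label f_j(x); positive_probs_x[j] is the
    probability that the classifier for problem j predicts 1 on x.
    Returns None when x has no negative labels (rate undefined).
    """
    negatives = [j for j, y in enumerate(labels_x) if y == 0]
    if not negatives:
        return None
    return sum(positive_probs_x[j] for j in negatives) / len(negatives)


def satisfies_fpaif(fp_rates, gamma, alpha, beta):
    """Check the empirical (alpha, beta)-FPAIF condition for a given gamma."""
    defined = [r for r in fp_rates if r is not None]
    if not defined:
        return True  # nothing to violate
    violating = sum(1 for r in defined if abs(r - gamma) > alpha)
    return violating / len(defined) <= beta


if __name__ == "__main__":
    rate = individual_fp_rate([0, 0, 1, 0], [1.0, 0.0, 1.0, 0.5])
    print(rate)
    print(satisfies_fpaif([rate, 0.4, 0.9, None], gamma=0.45, alpha=0.1, beta=0.4))
```

With the uniform empirical distribution over $m$ problems, the normalizing factor and the expectation in Definition A.1 combine into a plain average over the problems on which $x$ has a negative label, which is what `individual_fp_rate` computes.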

In this section we consider learning subject to the FPAIF notion of fairness. The FPAIF fairness notion asks that the individual false positive rates be approximately (corresponds to $\alpha$) equalized across almost all (corresponds to $\beta$) individuals. Learning subject to equalizing false negative rates can be developed similarly. We will be brief in this section, as the ideas and the approach are mostly similar to those developed in Section 3. We start by casting the fair learning problem as the constrained optimization problem (10), in which a mapping $\psi$ is to be found such that all individual false positive rates incurred by $\psi$ are within $\alpha$ of some $\gamma$. As before, we denote the optimal error of the optimization problem (10) by OPT and use it as a benchmark to evaluate the accuracy of our algorithm’s trained mapping.

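Fair Learning Problem subject to $(\alpha, 0)$-FPAIF

$$\min_{\psi \in \Delta(\mathcal{H})^{\mathcal{F}},\ \gamma \in [0,1]} \ \mathrm{err}(\psi; \mathcal{P}, \mathcal{Q}) \quad \text{s.t.} \quad \forall x \in \mathcal{X}: \ \left|\mathcal{E}_{FP}(x, \psi; \mathcal{Q}) - \gamma\right| \le \alpha \tag{10}$$
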
Definition A.3.

Consider the optimization problem (10). Given distributions and and fairness approximation parameter , we denote the optimal solutions of (10) by and