# Wasserstein Fair Classification

Ray Jiang (DeepMind, rayjiang@google.com), equal contribution
Aldo Pacchiano (UC Berkeley, DeepMind, pacchiano@berkeley.edu), equal contribution
Tom Stepleton (DeepMind, stepleton@google.com)
Heinrich Jiang
Silvia Chiappa (DeepMind)

###### Abstract

We propose an approach to fair classification that enforces independence between the classifier outputs and sensitive information by minimizing Wasserstein-1 distances. The approach has desirable theoretical properties and is robust to specific choices of the threshold used to obtain class predictions from model outputs. We introduce different methods that enable hiding sensitive information at test time or have a simple and fast implementation. We show empirical performance against different fairness baselines on several benchmark fairness datasets.

## 1 Introduction

The increasing use of machine learning in decision-making scenarios that have serious implications for individuals and society, such as health care, criminal risk assessment, social services, hiring, financial lending, and online advertising (De Fauw et al., 2018; Dieterich et al., 2016; Eubanks, 2018; Hoffman et al., 2018; Malekipirbazari and Aksakalli, 2015; Perlich et al., 2014), is raising concern that bias in the data and model inaccuracies can lead to decisions that are “unfair” towards underrepresented or historically discriminated groups.

In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, 2019. Code available at github.com/deepmind/wasserstein_fairness.

This concern has motivated researchers to investigate ways of ensuring that sensitive information (e.g. race and gender) does not “unfairly” influence the decisions. In the classification case considered in this paper, the most widely used approach is to enforce statistical independence between class predictions and sensitive attributes, a criterion called demographic parity (Feldman et al., 2015).

In the common scenario in which the model outputs continuous values from which class predictions are obtained through thresholds, this approach would however ensure fairness only with respect to the particular choice of thresholds. Furthermore, as independence constraints on the class predictions are difficult to impose in practice, uncorrelation constraints on the model outputs are often imposed instead.

In this paper, we propose an approach that overcomes these limitations by imposing independence constraints directly on the model outputs. This is achieved through enforcing small Wasserstein distances between the distributions of the model outputs corresponding to groups of individuals with different sensitive attributes. We demonstrate that using Wasserstein-1 distances to the barycenter is optimal, in the sense that it achieves independence with minimal changes to the class predictions that would have been obtained without constraints. We introduce a Wasserstein-1 penalized logistic regression method that learns the optimal transport map in the logistic model parameters, with a variation that has the advantage of being demographically blind at test time. In addition, we provide a simpler and faster post-processing method. We show that the proposed methods outperform previous approaches in the literature on four benchmark fairness datasets.

## 2 Strong Demographic Parity

Let $\mathcal{D}=\{(a^n,x^n,y^n)\}_{n=1}^{N}$ be a sequence of i.i.d. samples drawn from an unknown probability distribution over $\mathcal{A}\times\mathcal{X}\times\{0,1\}$. Each datapoint corresponds to information from an individual (or community): $y^n$ indicates a binary class, each element of $a^n$ corresponds to a different sensitive attribute, e.g. to the gender of the individual, and $x^n$ is a feature vector that, possibly together with $a^n$, can be used to form a prediction $\hat{y}^n$ of the class $y^n$. We denote with $\mathcal{D}_a$ the set of individuals belonging to group $a$. We indicate with capital letters, e.g. $A$ and $Y$, the random variables corresponding to the samples $a^n$ and $y^n$, and with $p(\cdot)$ or $p_X$ probability density functions (pdfs), where the latter is used to emphasize the associated random variable.

Many classifiers, rather than a binary class prediction $\hat{y}^n$, output a non-binary value $s^n$. In the logistic regression case considered in this paper, $s^n\in\Omega=[0,1]$ indicates the model belief that individual $n$ belongs to class 1, i.e. the modeled probability that $y^n=1$. (Throughout the paper, we use $\mathbb{P}$ to indicate probability measures associated with the corresponding probability spaces, each equipped with a $\sigma$-algebra on the sample output space.) From $s^n$, a class prediction is obtained using a threshold $\tau\in\Omega$, i.e. $\hat{y}^n=\mathbb{1}_{s^n>\tau}$, where $\mathbb{1}_{s^n>\tau}$ equals one if $s^n>\tau$ and zero otherwise. We call the random variable $S$ corresponding to $s^n$ the belief variable, and denote with $S_a$ the belief variable for group $a$, i.e. with pdf $p_{S_a}$.

We are interested in ensuring that sensitive information does not influence the decisions. This is often achieved by imposing that the model satisfies a fairness criterion called demographic parity (DP), defined as

$$\mathbb{P}(\hat{Y}=1\mid A=a)=\mathbb{P}(\hat{Y}=1\mid A=\bar{a}),\quad\forall a,\bar{a}\in\mathcal{A}.$$

DP can equivalently be expressed as requiring statistical independence between $\hat{Y}$ and $A$, denoted as $\hat{Y}\perp\!\!\!\perp A$.

Enforcing demographic parity at a given threshold does not necessarily imply that the criterion is satisfied for other thresholds. Furthermore, to alleviate difficulties in optimizing on the class prediction $\hat{Y}$, relaxations are often considered, such as imposing the constraint $\mathbb{E}[S\mid A=a]=\mathbb{E}[S\mid A=\bar{a}]$, where $\mathbb{E}$ denotes expectation (Goh et al., 2016; Zafar et al., 2017).

To deal with these limitations, we propose an approach that enforces statistical independence between $S$ and $A$, $S\perp\!\!\!\perp A$. We call this fairness criterion strong demographic parity (SDP), as it ensures that the decision does not depend on the sensitive attribute regardless of the threshold used, since $S\perp\!\!\!\perp A$ implies $\hat{Y}\perp\!\!\!\perp A$ for any value of $\tau$. SDP can be defined as

$$p_{S_a}=p_{S_{\bar{a}}},\quad\forall a,\bar{a}\in\mathcal{A}.$$

In Remark 1, we prove that this definition is equivalent to (we omit the brackets from the expectation to simplify the notation)

$$\mathbb{E}_{\tau\sim U(\Omega)}\,\big|\mathbb{P}(S_a>\tau)-\mathbb{P}(S_{\bar{a}}>\tau)\big|=0,\quad\forall a,\bar{a}\in\mathcal{A},$$

where $U(\Omega)$ denotes the uniform distribution over $\Omega$. This result leads us to use

$$\sum_{a,\bar{a}\in\mathcal{A}\ \text{s.t.}\ a\neq\bar{a}}\mathbb{E}_{\tau\sim U(\Omega)}\,\big|\mathbb{P}(S_a>\tau)-\mathbb{P}(S_{\bar{a}}>\tau)\big|,$$

as a measure of dependence of $S$ on $A$, which we call strong pairwise demographic disparity (SPDD).
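As a rough illustration of the SPDD sum, the sketch below estimates it from per-group belief samples by replacing the expectation over the threshold with an average over a uniform grid on [0, 1]. The function name `spdd`, the dictionary input format, and the grid size are assumptions made for this example and are not taken from the paper's code.

```python
import itertools
import numpy as np

def spdd(scores_by_group, n_thresholds=1000):
    """Estimate SPDD from a dict {group: array of beliefs in [0, 1]}."""
    # Midpoint grid standing in for tau ~ U([0, 1]).
    taus = (np.arange(n_thresholds) + 0.5) / n_thresholds
    # rates[a][k] is the empirical P(S_a > tau_k).
    rates = {
        a: (np.asarray(s, dtype=float)[:, None] > taus[None, :]).mean(axis=0)
        for a, s in scores_by_group.items()
    }
    total = 0.0
    # The sum runs over ordered pairs (a, a_bar) with a != a_bar.
    for a, b in itertools.permutations(rates, 2):
        total += np.abs(rates[a] - rates[b]).mean()
    return total
```

With two groups whose beliefs are concentrated at 0.25 and 0.75, the positive rates differ on half of the threshold range, and each of the two ordered pairs contributes that amount.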

## 3 Wasserstein Fair Classification

We propose to achieve SDP by enforcing the model output pdfs corresponding to groups of individuals with different sensitive attributes, $\{p_{S_a}\}_{a\in\mathcal{A}}$, to coincide with their Wasserstein-1 barycenter distribution $p_{\bar{S}}$. The use of the Wasserstein distance is motivated by the fact that this distance is defined and computable even between distributions with disjoint supports. This is critical because the methods are implemented using the empirical estimates $\hat{p}_{S_a}$, $\hat{p}_{\bar{S}}$ of $p_{S_a}$ and $p_{\bar{S}}$, whose supports are typically disjoint.

### 3.1 Optimality of Wasserstein-1 Distance

#### Preliminary.

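Given two pdfs $p_X$ and $p_Y$ on $\mathcal{X}$ and $\mathcal{Y}$, a transportation map $T:\mathcal{X}\to\mathcal{Y}$ is defined by $\int_{T^{-1}(B)}p_X(x)\,dx=\int_{B}p_Y(y)\,dy$ for any measurable subset $B\subseteq\mathcal{Y}$ (indicating that the mass of the set $B$ with respect to the density $p_Y$ equals the mass of the set $T^{-1}(B)$ with respect to the density $p_X$). Let $\mathcal{T}$ be the set of transportation maps from $p_X$ to $p_Y$, and $c:\mathcal{X}\times\mathcal{Y}\to[0,\infty)$ be a cost function such that $c(x,y)$ indicates the cost of transporting $x$ to $y$. In the original formulation (Monge, 1781), the optimal transport map is the one that minimizes the total transportation cost, i.e.

$$T^*=\operatorname*{argmin}_{T\in\mathcal{T}}\int_{x\in\mathcal{X}}c(x,T(x))\,p_X(x)\,dx.$$

To address limitations of this formulation, Kantorovich (1942) reformulated the optimal transport problem as finding an optimal pdf $\gamma^*$ in the set $\Gamma(p_X,p_Y)$ of joint pdfs on $\mathcal{X}\times\mathcal{Y}$ with marginals over $\mathcal{X}$ and $\mathcal{Y}$ given by $p_X$ and $p_Y$ such that

$$\gamma^*=\operatorname*{argmin}_{\gamma\in\Gamma(p_X,p_Y)}\int_{\mathcal{X}\times\mathcal{Y}}c(x,y)\,\gamma(x,y)\,dx\,dy.$$

The $p$-Wasserstein distance is defined as

$$W_p(p_X,p_Y)=\min_{\gamma\in\Gamma(p_X,p_Y)}\left(\int_{\mathcal{X}\times\mathcal{Y}}d(x,y)^p\,\gamma(x,y)\,dx\,dy\right)^{\frac{1}{p}},$$

where $\mathcal{X}=\mathcal{Y}$, $d$ is a distance on $\mathcal{X}$, and $p\geq 1$.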

#### Fair Optimal Post-Processing.

Let us first consider the problem of post-processing the beliefs of a model to achieve SDP while making minimal model class prediction changes.

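Let $S_1$ and $S_2$ be two belief variables with values in $\Omega=[0,1]$ and pdfs $p_{S_1}$ and $p_{S_2}$, and let $T:\Omega\to\Omega$ be a transportation map satisfying $\int_{T^{-1}(B)}p_{S_1}(x)\,dx=\int_{B}p_{S_2}(y)\,dy$ for any measurable subset $B\subseteq\Omega$. Let $\mathcal{T}$ be the set of all such transportation maps. A class prediction changes due to transportation $T(x)$ if and only if $\tau\in(m^T_x,M^T_x)$ where $m^T_x=\min(x,T(x))$ and $M^T_x=\max(x,T(x))$. This observation leads to the following result.

###### Proposition 1.

Given two belief variables $S_1$ and $S_2$ in $\Omega$ with pdfs $p_{S_1}$ and $p_{S_2}$, the following three quantities are equal:

1. $W_1(p_{S_1},p_{S_2})$.

2. $\mathbb{E}_{\tau\sim U(\Omega)}\,|\mathbb{P}(S_1>\tau)-\mathbb{P}(S_2>\tau)|$.

3. Expected class prediction changes due to transporting $p_{S_1}$ into $p_{S_2}$ through the map $T^*$:

$$\mathbb{E}_{\tau\sim U(\Omega),\,x\sim p_{S_1}}\,\mathbb{P}\big(\tau\in(m^{T^*}_x,M^{T^*}_x)\big).$$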
###### Proof.

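In the one-dimensional case of $p_{S_1}$ and $p_{S_2}$, the total transportation cost can be written as

$$\begin{aligned}
W_1(p_{S_1},p_{S_2}) &=\int_{x=0}^{1}\big|P^{-1}_{S_1}(x)-P^{-1}_{S_2}(x)\big|\,dx\\
&=\int_{\tau=0}^{1}\big|P_{S_1}(\tau)-P_{S_2}(\tau)\big|\,d\tau\quad\text{(by Lemma 6 in Appendix C)}\\
&=\mathbb{E}_{\tau\sim U(\Omega)}\,\big|\mathbb{P}(S_1\leq\tau)-\mathbb{P}(S_2\leq\tau)\big|\\
&=\mathbb{E}_{\tau\sim U(\Omega)}\,\big|\mathbb{P}(S_1>\tau)-\mathbb{P}(S_2>\tau)\big|,
\end{aligned}$$

where $P_{S_1}$ and $P_{S_2}$ are the cumulative distribution functions of $S_1$ and $S_2$ respectively. This proves that (i) equals (ii).

The expected class prediction changes due to applying the transportation map $T$ is given by

$$\begin{aligned}
\mathbb{E}_{\tau\sim U(\Omega),\,x\sim p_{S_1}}\,\mathbb{P}\big(\tau\in(m^T_x,M^T_x)\big) &=\int_{\tau=0}^{1}\int_{x}|x-T(x)|\,p_{S_1}(x)\,dx\,d\tau\\
&=\int_{x}|x-T(x)|\,p_{S_1}(x)\,dx.
\end{aligned}$$

Thus,

$$\begin{aligned}
W_1(p_{S_1},p_{S_2}) &=\min_{T\in\mathcal{T}}\int_{x}|x-T(x)|\,p_{S_1}(x)\,dx\\
&=\int_{x}|x-T^*(x)|\,p_{S_1}(x)\,dx\\
&=\mathbb{E}_{\tau\sim U(\Omega),\,x\sim p_{S_1}}\,\mathbb{P}\big(\tau\in(m^{T^*}_x,M^{T^*}_x)\big).
\end{aligned}$$

This proves that (i) equals (iii). ∎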
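The equality of quantities (i) and (ii) in Proposition 1 can be checked numerically. The sketch below computes the Wasserstein-1 distance between two equal-size one-dimensional samples in two ways: by matching sorted values (the quantile formula), and by averaging the absolute difference of positive rates over a threshold grid. The function names, the equal-size assumption, and the grid resolution are choices made for this illustration only.

```python
import numpy as np

def w1_quantile(s1, s2):
    """W1 between two equal-size 1-D samples via sorted (quantile) matching."""
    s1, s2 = np.sort(s1), np.sort(s2)
    return float(np.abs(s1 - s2).mean())

def w1_threshold(s1, s2, n_thresholds=100_000):
    """Grid estimate of E_tau |P(S1 > tau) - P(S2 > tau)| for tau ~ U([0, 1])."""
    taus = (np.arange(n_thresholds) + 0.5) / n_thresholds
    # searchsorted(..., side="right") counts the sample values <= tau,
    # which gives the empirical CDF after dividing by the sample size.
    cdf1 = np.searchsorted(np.sort(s1), taus, side="right") / len(s1)
    cdf2 = np.searchsorted(np.sort(s2), taus, side="right") / len(s2)
    # |P(S1 > tau) - P(S2 > tau)| equals |CDF1(tau) - CDF2(tau)|.
    return float(np.abs(cdf1 - cdf2).mean())

rng = np.random.default_rng(0)
beliefs_1 = rng.beta(2, 5, size=500)   # synthetic beliefs in [0, 1]
beliefs_2 = rng.beta(5, 2, size=500)
gap = abs(w1_quantile(beliefs_1, beliefs_2) - w1_threshold(beliefs_1, beliefs_2))
```

On these synthetic samples the two estimates should agree up to the discretization error of the threshold grid.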

###### Remark 1.

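Notice that $p_{S_a}=p_{S_{\bar{a}}}$ if and only if $\mathbb{E}_{\tau\sim U(\Omega)}\,|\mathbb{P}(S_a>\tau)-\mathbb{P}(S_{\bar{a}}>\tau)|=0$. Indeed, by Proposition 1 and the property of the $W_1$ metric, $\mathbb{E}_{\tau\sim U(\Omega)}\,|\mathbb{P}(S_a>\tau)-\mathbb{P}(S_{\bar{a}}>\tau)|=W_1(p_{S_a},p_{S_{\bar{a}}})=0$ if and only if $p_{S_a}=p_{S_{\bar{a}}}$.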

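To reach SDP, we need to achieve $p_{S_a}=p^*$ for all $a\in\mathcal{A}$, where $p^*\in\mathcal{P}(\Omega)$, the space of pdfs on $\Omega$. We would like to choose transportation maps $T_a$ and a target distribution $p^*$ such that the transportation process from $p_{S_a}$ to $p^*$ incurs minimal total expected class prediction changes. Assume that the groups are all disjoint, so that the per-group transportation maps are independent from each other. Let $\mathbb{T}(p^*)$ be the set of transportation maps with elements $T$ such that, restricted to group $a$, $T$ is a transportation map from $p_{S_a}$ to $p^*$ (i.e. its restriction belongs to $\mathcal{T}_a$, where $\mathcal{T}_a$ denotes the space of transportation maps from $p_{S_a}$ to $p^*$). We would like to obtain

$$\begin{aligned}
&\min_{T\in\mathbb{T}(p^*),\,p^*\in\mathcal{P}(\Omega)}\mathbb{E}_{\tau\sim U(\Omega),\,x\sim p_{S}}\,\mathbb{P}\big(\tau\in(m^T_x,M^T_x)\big)\\
&\quad=\min_{T\in\mathbb{T}(p^*),\,p^*\in\mathcal{P}(\Omega)}\sum_{a\in\mathcal{A}}\underbrace{p(A=a)}_{p_a}\,\mathbb{E}_{\tau\sim U(\Omega),\,x\sim p_{S_a}}\,\mathbb{P}\big(\tau\in(m^T_x,M^T_x)\big)\\
&\quad=\min_{p^*\in\mathcal{P}(\Omega)}\sum_{a\in\mathcal{A}}p_a\min_{T\in\mathcal{T}_a}\mathbb{E}_{\tau\sim U(\Omega),\,x\sim p_{S_a}}\,\mathbb{P}\big(\tau\in(m^T_x,M^T_x)\big)\\
&\quad=\min_{p^*\in\mathcal{P}(\Omega)}\sum_{a\in\mathcal{A}}p_a\min_{T\in\mathcal{T}_a}\int_{x\in\Omega}|x-T(x)|\,p_{S_a}(x)\,dx\\
&\quad=\min_{p^*\in\mathcal{P}(\Omega)}\sum_{a\in\mathcal{A}}p_a\,W_1(p_{S_a},p^*).
\end{aligned}$$

Therefore we are interested in

$$p_{\bar{S}}=\operatorname*{argmin}_{p^*\in\mathcal{P}(\Omega)}\sum_{a\in\mathcal{A}}p_a\,W_1(p_{S_a},p^*),\tag{1}$$

which coincides with the Wasserstein-1 barycenter with normalized subgroup size as weight to every group distribution (Agueh and Carlier, 2011).

In summary, we have demonstrated that the optimal post-processing procedure that minimizes total expected model prediction changes is to use the Wasserstein-1 optimal transport map to transport all group distributions to their weighted barycenter distribution $p_{\bar{S}}$.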
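To make the objective in Eq. (1) concrete, here is a minimal one-dimensional sketch. It relies on the quantile form of the Wasserstein-1 distance: on a grid of quantile levels, the weighted sum of distances is minimized level by level by a weighted median of the group quantiles. This is an alternative illustration, not the binned, entropy-regularized computation that the paper performs with the POT library; the function names and the grid size are assumptions.

```python
import numpy as np

def w1_barycenter_quantiles(samples_by_group, n_levels=200):
    """Return (Q, weights, bary): group quantiles, weights p_a, and a minimizer of Eq. (1) on a grid."""
    groups = list(samples_by_group)
    sizes = np.array([len(samples_by_group[a]) for a in groups], dtype=float)
    weights = sizes / sizes.sum()                      # normalized group sizes p_a
    levels = (np.arange(n_levels) + 0.5) / n_levels    # quantile levels in (0, 1)
    # Q[a, k] is the empirical quantile of group a at level k.
    Q = np.stack([np.quantile(samples_by_group[a], levels) for a in groups])
    # Weighted median across groups at each level: sort the group quantiles and
    # take the first one whose cumulative weight reaches one half.
    order = np.argsort(Q, axis=0)
    Q_sorted = np.take_along_axis(Q, order, axis=0)
    cum_w = np.cumsum(weights[order], axis=0)
    idx = (cum_w >= 0.5 - 1e-12).argmax(axis=0)
    bary = Q_sorted[idx, np.arange(n_levels)]
    return Q, weights, bary

def barycenter_objective(Q, weights, q):
    """Grid approximation of sum_a p_a W1(p_{S_a}, p*) for a candidate quantile function q."""
    return float((weights[:, None] * np.abs(Q - q[None, :])).sum(axis=0).mean())

rng = np.random.default_rng(0)
samples = {"g0": rng.beta(2, 5, 300), "g1": rng.beta(5, 2, 200), "g2": rng.beta(2, 2, 100)}
Q, weights, bary = w1_barycenter_quantiles(samples)
```

On the grid, the returned quantile function attains an objective value no larger than that of any group distribution or of the weighted average of the group quantiles. Minimizers of Eq. (1) need not be unique, so this is one valid choice.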

We have shown that post-processing the beliefs of a model through optimal transportation achieves SDP (and therefore DP) whilst minimizing expected prediction changes. We now examine the case in which, after transportation, SDP is not attained, i.e. SPDD is positive. By the triangle inequality

$$\begin{aligned}
\text{SPDD} &\leq 2(|\mathcal{A}|-1)\sum_{a\in\mathcal{A}}\mathbb{E}_{\tau\sim U(\Omega)}\,\big|\mathbb{P}(S_a>\tau)-\mathbb{P}(\bar{S}>\tau)\big|\\
&=2(|\mathcal{A}|-1)\sum_{a\in\mathcal{A}}W_1(p_{S_a},p_{\bar{S}}).
\end{aligned}$$

We call this upper bound on SPDD pseudo-SPDD. Pseudo-SPDD is the tightest upper bound to SPDD among all possible target distributions by the definition of the barycenter and Proposition 1. Indeed

$$\begin{aligned}
\sum_{a\in\mathcal{A}}\mathbb{E}_{\tau\sim U(\Omega)}\,\big|\mathbb{P}(S_a>\tau)-\mathbb{P}(\bar{S}>\tau)\big| &=\sum_{a\in\mathcal{A}}W_1(p_{S_a},p_{\bar{S}})\leq\sum_{a\in\mathcal{A}}W_1(p_{S_a},p_{S_0})\\
&=\sum_{a\in\mathcal{A}}\mathbb{E}_{\tau\sim U(\Omega)}\,\big|\mathbb{P}(S_a>\tau)-\mathbb{P}(S_0>\tau)\big|,
\end{aligned}$$

for any distribution $p_{S_0}$. Since optimal trade-offs are difficult to derive for SPDD directly, we derive them with respect to pseudo-SPDD as the measure of fairness instead.

We are interested in changing to , , to reach a fairness bound for pseudo-SPDD such that the required model prediction changes are minimal in expectation. This is obtained by choosing the that minimizes the total expected prediction changes, which equals by Proposition 1, while bounding the pseudo-SPDD by , i.e. . Assuming that the groups are disjoint, we can optimize each group transportation in turn independently assuming the other groups are fixed. This gives

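$$\begin{aligned}
p_{S^*_a} &=\operatorname*{argmin}_{p^*\in\mathcal{P}(\Omega)\ \text{s.t.}\ W_1(p^*,p_{\bar{S}})\leq\lambda-\gamma}p_a\,W_1(p_{S_a},p^*)\\
&=\operatorname*{argmin}_{p^*\in\mathcal{P}(\Omega)\ \text{s.t.}\ W_1(p^*,p_{\bar{S}})\leq\lambda-\gamma}W_1(p_{S_a},p^*),
\end{aligned}$$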

where . By triangle inequality, . The distance reaches its minimum if and only if lies on a shortest path between and . Thus it is optimal to transport along any shortest path between itself and in the Wasserstein-1 metric space. In the approach proposed in the next section, we approximate transporting group distributions along these shortest paths with hyperparameter tuning of a gradient descent method to minimize for every group.

#### Empirical Computation of the Barycenter.

In practice, as building the barycenter from the population distributions is impossible, we use the empirical distributions obtained from . The choice is justified by the following result:

###### Lemma 1.

If the samples in are i.i.d., as , if for all , the empirical barycenter distribution satisfies almost surely (see Klenke (2013) for a formal definition of almost sure convergence of random variables).

The proof is given in Appendix A.

In the next two sections we introduce two different approaches to achieve SDP with Wasserstein-1 distances: a penalization approach to logistic regression and a simpler practical approach consisting of post-processing model beliefs.

### 3.2 Wasserstein-1 Penalized Logistic Regression

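The average logistic regression loss function over $\mathcal{D}$ is given by

$$J_{\mathcal{D}}(\theta)=\frac{1}{N}\sum_{n=1}^{N}-y^n\log s^n-(1-y^n)\log(1-s^n),$$

where the model belief that individual $n$ belongs to class 1, $s^n$, is obtained as $s^n=\sigma(\theta^\top w^n)$, with $\sigma$ the logistic sigmoid and $w^n$ the input features of individual $n$, and where $\theta$ are the model parameters. We denote with $s^i_a$ the model beliefs for group $a$ and with $\bar{s}^j$ the atoms of $\hat{p}_{\bar{S}}$.

The gradient of $J_{\mathcal{D}}$ with respect to $\theta$ is given by

$$\nabla_\theta J_{\mathcal{D}}(\theta)=\frac{1}{N}\sum_{n=1}^{N}w^n\big(\sigma(\theta^\top w^n)-y^n\big).$$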
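The loss and its gradient translate directly into NumPy. The sketch below is an illustration under assumed names (`W` is a matrix whose rows are the input feature vectors, `y` holds the binary labels); it is not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(theta, W, y):
    """Average logistic loss over the rows of W with binary labels y."""
    s = sigmoid(W @ theta)                      # beliefs, one per individual
    return float(np.mean(-y * np.log(s) - (1 - y) * np.log(1 - s)))

def logistic_grad(theta, W, y):
    """Gradient of the average logistic loss: mean of w_n * (sigmoid(theta^T w_n) - y_n)."""
    s = sigmoid(W @ theta)
    return W.T @ (s - y) / len(y)
```

A central finite-difference comparison against `logistic_loss` is a quick way to confirm that the analytic gradient matches.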

We propose to find model parameters that minimize the population level logistic loss under the constraint of small Wasserstein-1 distances between $\hat{p}_{S_a}$ and the empirical barycenter $\hat{p}_{\bar{S}}$, for all $a\in\mathcal{A}$.

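The Wasserstein-1 distance between any two empirical distributions $\hat{p}_b$ and $\hat{p}_c$ underlying two datasets $\mathcal{D}_b$ and $\mathcal{D}_c$ of sizes $N_b$ and $N_c$ is given by

$$W_1(\hat{p}_b,\hat{p}_c)=\min_{T_{b,c}\in U(b,c)}\langle T_{b,c},C\rangle,\tag{2}$$

where $U(b,c)=\{T\in\mathbb{R}_{\geq 0}^{N_b\times N_c}:\ T\mathbb{1}_{N_c}=\tfrac{1}{N_b}\mathbb{1}_{N_b},\ T^\top\mathbb{1}_{N_b}=\tfrac{1}{N_c}\mathbb{1}_{N_c}\}$ with $\mathbb{1}_{N}$ denoting a vector of ones of size $N$. The brackets $\langle\cdot,\cdot\rangle$ denote the trace dot product and $C$ is the cost matrix associated with the Wasserstein-1 cost function $c$, of elements $C_{i,j}=|s^i_b-s^j_c|$.

In particular, the Wasserstein-1 distance $W_1(\hat{p}_{S_a},\hat{p}_{\bar{S}})$ can be computed by solving the optimization problem of Eq. (2) with cost matrix $C^\theta_a$ satisfying

$$(C^\theta_a)_{i,j}=\big|s^i_a-\bar{s}^j\big|,$$

where the superscript in $C^\theta_a$ is maintained to remind the reader that model predictions are a function of the model parameter $\theta$.

The Wasserstein-1 penalized logistic regression objective is given by

$$J_{W_1}(\theta)=\alpha J_{\mathcal{D}}(\theta)+(1-\alpha)\beta\sum_{a\in\mathcal{A}}W_1(\hat{p}_{S_a},\hat{p}_{\bar{S}}),\tag{3}$$

where $\alpha$ and $\beta$ are penalization coefficients.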

###### Lemma 2.

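If the datasets $\mathcal{D}_b$ and $\mathcal{D}_c$ have empirical distributions $\hat{p}_b$ and $\hat{p}_c$, and $C$ is the cost matrix of elements $C_{i,j}$:

$$\nabla_C W_1(\hat{p}_b,\hat{p}_c)=T^*_{b,c},$$

where $T^*_{b,c}$ is the optimal coupling resulting from the optimization objective of Eq. (2).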

###### Proof.

The result follows immediately from the subgradient rule for a pointwise max function (see Boyd and Vandenberghe (2004)). ∎

###### Lemma 3.

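The gradient of the objective in Eq. (3) with respect to $\theta$ is given by

$$\nabla_\theta J_{W_1}(\theta)=\alpha\nabla_\theta J_{\mathcal{D}}(\theta)+(1-\alpha)\beta\Big(\sum_{a\in\mathcal{A}}\sum_{i,j}\big(T^*_a(\theta)\big)_{i,j}\,\nabla_\theta\big|s^i_a-\bar{s}^j\big|\Big),$$

where $T^*_a(\theta)$ is the optimal coupling between $\hat{p}_{S_a}$ and $\hat{p}_{\bar{S}}$ (recall that $\hat{p}_{S_a}$ is a function of $\theta$).

###### Proof.

This formula is a consequence of the chain rule and Lemma 2. ∎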

#### Computation Method.

We propose to optimize the Wasserstein penalized logistic loss objective (Eq. (3)) via gradient descent. The procedure is detailed in Algorithm 1. We start by describing how to perform Step 2. under the assumption that and have been computed. The computation of the optimal coupling family hinges on the following Lemma.

###### Lemma 4.

If , and for all and for all , then: .

This lemma characterizes the coupling matrix between the empirical distributions of two datasets made of real numbers. When and the datasets are and , with , and , then the optimal coupling equals where denotes the identity matrix. Lemma 4 extends this simple case to the general case of datasets of arbitrary orderings and sizes, see Deshpande et al. (2018) for a proof. It is easy to see that the optimal coupling is sparse and has at most nonzero entries (see Cuturi (2013)). As a consequence, the computation of can be performed in linear time where . In the computation of only the nonzero entries of matter.
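One common way to build an optimal coupling between two one-dimensional empirical distributions is to sort both samples and move mass greedily from the smallest remaining atom of one to the smallest remaining atom of the other. The sketch below implements this under the assumption of uniform weights on each sample; the function names are hypothetical and it is not claimed to reproduce the exact statement of Lemma 4. After sorting, the sweep is linear in the total number of atoms and returns the coupling as a sparse list of nonzero entries.

```python
import numpy as np

def coupling_1d(x, y):
    """Sparse optimal W1 coupling between uniform empirical measures on 1-D samples.

    Returns a list of (i, j, mass) where i and j index the original arrays.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    ix, iy = np.argsort(x), np.argsort(y)
    n, m = len(x), len(y)
    i = j = 0
    # Integer masses in units of 1 / (n * m) avoid floating-point drift.
    rem_x, rem_y = m, n
    plan = []
    while i < n and j < m:
        mass = min(rem_x, rem_y)
        plan.append((int(ix[i]), int(iy[j]), mass / (n * m)))
        rem_x -= mass
        rem_y -= mass
        if rem_x == 0:          # current atom of x is exhausted
            i += 1
            rem_x = m
        if rem_y == 0:          # current atom of y is exhausted
            j += 1
            rem_y = n
    return plan

def coupling_cost(x, y, plan):
    """Transport cost <T, C> with C[i, j] = |x[i] - y[j]|, using only nonzero entries."""
    return sum(mass * abs(x[i] - y[j]) for i, j, mass in plan)
```

For equal-size samples every entry carries mass 1/N and the coupling is a permutation matrix scaled by 1/N. For sizes n and m this sweep produces at most n + m - 1 entries.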

We compute the empirical barycenter and , using the POT library by Flamary and Courty (2017). We fix the support of potential barycenters to bins of equal-width spanning the interval, and use the iterative KL-projection method proposed by Benamou et al. (2015). We then generate a number of samples from the normalized probability distribution of the computed barycenter.

#### Demographically-Blind Wasserstein-1 Penalized Logistic Regression.

In real-world applications, the use of sensitive attributes might be prohibited when deploying a system. We therefore consider the variation in which the model inputs consist of the non-sensitive features only. This variation still uses the sensitive attributes to calculate the Wasserstein-1 loss but, by not including them in the feature set, does not require knowledge of sensitive information at test time.

### 3.3 Wasserstein-1 Post-Processing

In this section, we propose a simple, fast quantile matching method to post-process the beliefs of a classifier trained on $\mathcal{D}$. This method corresponds to an approximate Wasserstein-1 optimal transport map by the formulation of Rachev and Rüschendorf (1998):

$$W_1(p_{S_a},p_{\bar{S}})=\int_{\tau=0}^{1}\big|P^{-1}_{S_a}(\tau)-P^{-1}_{\bar{S}}(\tau)\big|\,d\tau.$$

The procedure is detailed in Algorithm 2. For each group $a$, we compute quantiles of $\hat{p}_{S_a}$ and map all group beliefs belonging to each quantile bin to the supremum of those belonging to the corresponding quantile bin of $\hat{p}_{\bar{S}}$.
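The quantile matching step can be sketched as follows. This is an approximation written for illustration, not a transcription of Algorithm 2: it assumes that samples from the target (barycenter) distribution are already available, it uses the upper quantile edge of each target bin as a stand-in for the supremum of that bin, and the function name and default number of bins are hypothetical.

```python
import numpy as np

def quantile_match(group_scores, target_samples, n_bins=100):
    """Map each group belief to the upper edge of the matching target quantile bin."""
    group_scores = np.asarray(group_scores, dtype=float)
    levels = np.linspace(0.0, 1.0, n_bins + 1)
    g_edges = np.quantile(group_scores, levels)     # group quantile bin edges
    t_edges = np.quantile(target_samples, levels)   # target quantile bin edges
    # Bin index k in {0, ..., n_bins - 1}: number of interior group edges below the score.
    k = np.searchsorted(g_edges[1:-1], group_scores, side="left")
    return t_edges[k + 1]

rng = np.random.default_rng(0)
group = rng.uniform(0.0, 0.5, size=1000)      # synthetic group beliefs
target = rng.uniform(0.5, 1.0, size=1000)     # synthetic target samples
matched = quantile_match(group, target)
```

The mapping is monotone, so the ranking of individuals within a group is preserved, and the matched beliefs follow the target distribution up to the resolution of the bins.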

### 3.4 Generalization

The following lemma addresses generalization of the Wasserstein-1 objective. Assume for all . Let and be the cumulative density functions of , and . Assume these random variables all have domain and that all are continuous, then:

###### Lemma 5.

For any , if , with probability :

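$$\sum_{a\in\mathcal{A}}p_a\,W_1(p_{S_a},p_{\bar{S}})\leq\sum_{a\in\mathcal{A}}\hat{p}_a\,W_1(\hat{p}_{S_a},\hat{p}_{\bar{S}})+\epsilon.$$

In other words, provided access to sufficient samples, a low value of $\sum_{a\in\mathcal{A}}\hat{p}_a\,W_1(\hat{p}_{S_a},\hat{p}_{\bar{S}})$ implies a low value for $\sum_{a\in\mathcal{A}}p_a\,W_1(p_{S_a},p_{\bar{S}})$ with high probability and therefore good performance at test time.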

The proof is given in Appendix B.

Lemma 5 implies that under appropriate conditions, the value of the population objective of the Wasserstein cost is upper bounded by the empirical Wasserstein cost plus a small constant.

## 4 Related Work

Broadly speaking, we can group current literature on fair classification and regression into three main approaches. The first approach consists in pre-processing the data to remove bias, or in extracting representations that do not contain sensitive information during training (Beutel et al., 2017; Calders et al., 2009; Calmon et al., 2017; Edwards and Storkey, 2016; Feldman et al., 2015; Fish et al., 2015; Kamiran and Calders, 2009, 2012; Louizos et al., 2016; Zemel et al., 2013; Žliobaite et al., 2011). This approach includes current approaches to fairness using Wasserstein distances, which achieve SDP through transportation of features (Del Barrio et al., 2019; Johndrow and Lum, 2019). The second approach consists in performing a post-processing of the model outputs (Chiappa, 2019; Doherty et al., 2012; Feldman, 2015; Hardt et al., 2016; Kusner et al., 2017). The third approach consists in enforcing fairness notions by imposing constraints on the optimization, or by using an adversary. Some methods transform the constrained optimization problem via the method of Lagrange multipliers (Goh et al., 2016; Zafar et al., 2017; Wu et al., 2018; Agarwal et al., 2018; Cotter et al., 2018; Corbett-Davies et al., 2017; Narasimhan, 2018). Other work similar in spirit adds penalties to the objective (Komiyama et al., 2018; Donini et al., 2018). Adversarial methods maximize the system's ability to predict the class $Y$ while minimizing the ability to predict the sensitive attribute $A$ (Zhang et al., 2018).

## 5 Experiments

In this section, we evaluate the methods introduced in Sections 3.2 and 3.3 on four datasets from the UCI repository (Lichman, 2013). For penalized logistic regression, we refer to the method in which sensitive information is included in the feature set, i.e. , as Wass-1 Penalty; and to the demographically-blind variant in which sensitive information is not included, i.e. , as Wass-1 Penalty DB. We refer to the post-processing method as Wass-1 Post-Process. We also include a variant of this method using instead of the barycenter (Wass-1 Post-Process ), which gives a simpler algorithm that only requires computing basic quantile functions. We compare these methods with the following baselines:

Unconstrained:

Logistic regression with no fairness constraints.

Hardt’s Post-Process:

Post-processing of the logistic regression beliefs of all individuals in group by adding , where the threshold is found using the method of Hardt et al. (2016). This ensures that DP is satisfied at threshold .

Constrained Optimization:

Lagrangian-based method (see e.g. Eban et al. (2017); Goh et al. (2016)) using a linear model as the underlying predictor and equal positive prediction rate between each group and as fairness constraints with threshold .

The same as the previous method, but with more fairness constraints. Specifically, the fairness constraints are equal positive prediction rates for a set of thresholds from to in increments of on the output of the linear model.

### 5.1 Training Details

In the approaches Unconstrained, Hardt’s Post-Process, Wass-1 Penalty, and Wass-1 Post-Process, we trained a logistic regression model using Scikit-Learn with default hyper-parameters (Pedregosa et al., 2011).

For Wass-1 Penalty (Algorithm 1), as initial model parameters we used the ones given by the trained logistic regression. We swept over penalization coefficients , , gradient step sizes , set the maximum number of training steps to , and computed the barycenter once every steps, effectively only once after the initialization of . In the computation of the barycenter (using the POT library by Flamary and Courty (2017)), we swept over numbers of bins , entropy penalty , and used number of iterations . The time complexity of our implementation is . Our gradient steps take on average 0.02 seconds.

For Wass-1 Post-Process (Algorithm 2), we used a number of bins .

For Constrained Optimization, we used the hinge loss as objective and the hinge relaxation for the fairness constraints. We trained by jointly optimizing the model parameters and Lagrange multipliers on the Lagrangian using ADAM with the default step-size of and mini-batch size of , and trained for steps. We allowed an additive slack of on the constraints, as otherwise we found feasibility issues leading to degenerate classifiers.

### 5.2 Datasets

The UCI Adult Dataset. The Adult dataset contains 14 attributes including age, working class, education level, marital status, occupation, relationship, race, gender, capital gain and loss, working hours, and nationality for 48,842 individuals; 32,561 and 16,281 for the training and test sets respectively. The goal is to predict whether the individual’s annual income is above or below \$50,000.

Pre-processing and Sensitive Attributes. We pre-processed the data in the same way as done in Zafar et al. (2017); Goh et al. (2016). The categorical features were encoded into binary features (one for each category), and the continuous features were transformed into binary encodings depending on five quantile values, obtaining a total of features. As sensitive attributes, we considered race (Black and White) and gender (female and male), obtaining four groups corresponding to black females, white females, black males, and white males.

The UCI German Credit Dataset. This dataset contains 20 attributes for 1,000 individuals applying for loans. Each applicant is classified as a good or bad credit risk, i.e. as likely or not likely to repay the loan. We randomly divided the dataset into training and test sets of sizes 670 and 330 respectively.

Pre-processing and Sensitive Attributes. We did not do any pre-processing. As sensitive attributes, we considered age ( and years old), obtaining two groups.

The UCI Bank Marketing Dataset. This dataset contains 20 attributes for 41,188 individuals. Each individual is classified as subscribed or not to a term deposit. We divided the dataset into train and test sets of sizes 32,950 and 8,238 respectively.

Pre-processing and Sensitive Attributes. We pre-processed the data as for the Adult dataset. We transformed the categorical features into binary ones, and the continuous features into five binary features based on five quantile bins, obtaining a total of 60 features. We also subtracted the mean from cons.price.idx, cons.conf.idx, euribor3m, and nr.employed to make them zero-centered. As sensitive attributes, we considered age, which was discretized based on five quantiles leading to five groups.

The UCI Communities & Crime Dataset. This dataset contains 135 attributes for 1994 communities; 1495 and 499 for the training and test sets respectively. The goal is to predict whether a community has high (above the 70-th percentile) crime rate.

Pre-processing and Sensitive Attributes. We pre-processed the data as in Wu et al. (2018). As sensitive attributes, we considered race (Black, White, Asian and Hispanic), thresholded at the median to form eight groups.

### 5.3 Results

We compared the different methods using the following metrics:

Err-.5:

Binary classification error using threshold , i.e. .

Err-Exp:

As above, but averaging over 100 uniformly-spaced thresholds .

DD-.5:

Demographic disparity at , summed over all groups , i.e. , where e.g.  is estimated as