Taking Advantage of Multitask Learning for Fair Classification

Abstract.

A central goal of algorithmic fairness is to reduce bias in automated decision making. An unavoidable tension exists between accuracy gains obtained by using sensitive information (e.g., gender or ethnic group) as part of a statistical model, and any commitment to protect these characteristics. Often, due to biases present in the data, using the sensitive information in the functional form of a classifier improves classification accuracy. In this paper we show how it is possible to get the best of both worlds: optimize model accuracy and fairness without explicitly using the sensitive feature in the functional form of the model, thereby treating different individuals equally. Our method is based on two key ideas. On the one hand, we propose to use Multitask Learning (MTL), enhanced with fairness constraints, to jointly learn group specific classifiers that leverage information between sensitive groups. On the other hand, since learning group specific models might not be permitted, we propose to first predict the sensitive features by any learning method and then to use the predicted sensitive feature to train MTL with fairness constraints. This enables us to tackle fairness with a three-pronged approach, that is, by increasing accuracy on each group, enforcing measures of fairness during training, and protecting sensitive information during testing. Experimental results on two real datasets support our proposal, showing substantial improvements in both accuracy and fairness.

1. Introduction

In recent years there has been a lot of interest in the problem of enhancing learning methods with “fairness” requirements, see (pleiss2017fairness; beutel2017data; hardt2016equality; feldman2015certifying; agarwal2017reductions; agarwal2018reductions; woodworth2017learning; zafar2017fairness; menon2018cost; zafar2017parity; bechavod2018Penalizing; zafar2017fairnessARXIV; kamishima2011fairness; kearns2017preventing; Prez-Suay2017Fair; dwork2018decoupled; berk2017convex; alabi2018optimizing; adebayo2016iterative; calmon2017optimized; kamiran2009classifying; zemel2013learning; kamiran2012data; kamiran2010classification) and references therein. The general aim is to ensure that sensitive information (e.g. knowledge about gender or ethnic group of an individual) does not “unfairly” influence the outcome of a learning algorithm. For example, if the learning problem is to predict what salary a person should earn based on her skills and previous employment records, we would like to build a model which does not unfairly use additional sensitive information such as gender or race.

A central question is how sensitive information should be used during the training and testing phases of a model. From a statistical perspective, sensitive information can improve model performance: removing this information may result in a less accurate model, without necessarily improving the fairness of the solution (dwork2018decoupled; zafar2017fairness; pedreshi2008discrimination). However, it is well known that in some jurisdictions using different classifiers, either explicitly or implicitly, for members of different groups may not be permitted; we refer to the remark at page 3 in (dwork2018decoupled) and references therein. This implies that we can access the sensitive information during the training phase of a model but not during the testing phase. Our principal objective is then to optimize model accuracy while still protecting sensitive information in the data.

As a first step towards not discriminating against minority groups we focus on maximizing average accuracy with respect to each group as opposed to maximizing the overall accuracy (chouldechova2017fair). For the underlying generic learning method, we consider both Single Task Learning (STL) and Independent Task Learning (ITL). While the latter independently learns a different function for each group, the former aims to learn a function that is common between all groups. A well-known weakness of these methods is that they tend to generalize poorly on smaller groups: while STL may learn a model which better represents the largest group, ITL may overfit minority groups (baxter2000model). A common approach to overcome such limitations is offered by Multitask Learning (MTL), see (baxter2000model; caruana1997multitask; evgeniou2004regularized; bakker2003task; argyriou2008convex) and references therein. This methodology leverages information between the groups (tasks) to learn more accurate models. Surprisingly, to the best of our knowledge, MTL has received little attention in the algorithmic fairness domain. We are only aware of the work (dwork2018decoupled) which proposes to learn different classifiers per group, combined with MTL to ameliorate the issue of potentially having too little data on minority groups.

We build upon a particular instance of MTL which jointly learns a shared model between the groups as well as a specific model per group. We show how fairness constraints, measured with Equalized Odds or Equal Opportunity introduced in (hardt2016equality), can be built into MTL directly during the training phase. This is in contrast to other approaches which impose the fairness constraint as a post-processing step (pleiss2017fairness; beutel2017data; hardt2016equality; feldman2015certifying) or by modifying the data representation before employing standard machine learning methods (adebayo2016iterative; calmon2017optimized; kamiran2009classifying; zemel2013learning; kamiran2012data; kamiran2010classification). In many recent works (donini2018empirical; agarwal2017reductions; agarwal2018reductions; woodworth2017learning; zafar2017fairness; menon2018cost; zafar2017parity; bechavod2018Penalizing; zafar2017fairnessARXIV; kamishima2011fairness; kearns2017preventing; Prez-Suay2017Fair; dwork2018decoupled; berk2017convex; alabi2018optimizing) it has been shown how to enforce these constraints during the learning phase of a classifier. Here we opt for the approach proposed in (donini2018empirical) since it is convex, theoretically grounded, and performs favorably against state-of-the-art alternatives. We present experiments on two real datasets which demonstrate that the shared classifier learned by MTL works better than STL and in turn MTL’s group specific classifiers perform better than both ITL and the shared MTL model. These results are in line with previous studies on MTL, which suggest the benefit offered by this methodology, see (Efros; evgeniou2004regularized; donini2016distributed) and references therein. Moreover, we observe that the fairness constraint is effective in controlling the fairness measure.

Unfortunately, as remarked before, all the models which employ the sensitive feature in the testing phase may not be adoptable. Independent models cannot be employed since they use different classifiers for members of different groups. Even the shared model may not be a feasible option, if the sensitive feature is used as a predictor (e.g. if the model is linear, including the sensitive feature entails using a group specific threshold). Therefore, the only feasible option would be to learn a shared model based on the non-sensitive features. This constraint may limit our ability to learn classifiers of high generalization ability.

Figure 1. Our proposal in a graphical abstract: rather than using the sensitive feature as a predictor we propose to learn, with any learning algorithm, a function $g$, which captures the relationship between $x$ and $s$, and then use $g(x)$, instead of $s$, to learn group specific models via MTL.

In order to overcome such limitations, we propose to first use the non-sensitive features to predict the value of the sensitive one and then use the predicted sensitive feature to learn group specific models via MTL. The proposal is depicted in the graphical abstract of Figure 1. We experimentally demonstrate that the proposed approach matches the classification accuracy of the best performing model which uses the sensitive information during testing, in addition to further improving upon measures of fairness.

The rest of the paper is organized as follows. Section 2 presents some preliminary definitions and notions concerning the fair classification framework. Section 3 outlines the central problem that we face in the paper: exploiting the sensitive feature while still treating different groups equally. Section 4 presents our proposal: predicting the sensitive feature based on the non-sensitive ones and then exploiting MTL with fairness constraints in order to increase both accuracy and fairness measures (see Figure 1). In Section 5 we test the proposal on two well-known fairness related datasets (Adult and COMPAS), demonstrating its potential. We conclude the paper with a brief discussion in Section 6.

2. Preliminaries

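We let $\mathcal{D} = \{(x_1,s_1,y_1),\dots,(x_n,s_n,y_n)\}$ be a training set formed by $n$ samples drawn independently from an unknown probability distribution $\mu$ over $\mathcal{X}\times\mathcal{S}\times\mathcal{Y}$, where $\mathcal{Y} = \{-1,+1\}$ is the set of binary output labels, $\mathcal{S} = \{1,\dots,k\}$ represents group membership, and $\mathcal{X}$ is the input space.

For every $t\in\mathcal{S}$ and operator $\diamond\in\{-,+\}$, we define the subset of training points with sensitive feature equal to $t$ as $\mathcal{D}_t = \{(x_i,s_i,y_i) : s_i = t\}$ and the subset of training points negatively and positively labeled with sensitive feature equal to $t$ as $\mathcal{D}^\diamond_t = \{(x_i,s_i,y_i) : s_i = t,\ y_i = \diamond 1\}$. We also let $n_t = |\mathcal{D}_t|$ and, for $\diamond\in\{-,+\}$, we let $n^\diamond_t = |\mathcal{D}^\diamond_t|$.

Let us consider a function (or model) $f:\mathcal{X}\times\mathcal{S}\to\mathbb{R}$ chosen from a set $\mathcal{F}$ of possible models. The error (risk) of $f$ is measured by a prescribed loss function $\ell:\mathbb{R}\times\mathcal{Y}\to\mathbb{R}$. The average accuracy with respect to each group of a model, $L(f)$, together with its empirical counterpart $\hat{L}(f)$, are defined respectively as

$$L(f) = \frac{1}{k}\sum_{t\in\mathcal{S}} L_t(f), \qquad L_t(f) = \mathbb{E}\left[\ell(f(x,s),y) \mid s = t\right],$$

and

$$\hat{L}(f) = \frac{1}{k}\sum_{t\in\mathcal{S}} \hat{L}_t(f), \qquad \hat{L}_t(f) = \frac{1}{n_t}\sum_{(x_i,s_i,y_i)\in\mathcal{D}_t} \ell(f(x_i,s_i),y_i).$$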

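The fairness of the model can be measured w.r.t. many notions of fairness as mentioned in Section 1. In this work we opt for Equal Opportunity (EOp) and Equalized Odds (EOd). For $\diamond\in\{-,+\}$, the EOp$^\diamond$ constraint is defined as (hardt2016equality)

$$P^\diamond_{t_1}(f) = P^\diamond_{t_2}(f) \quad \text{for all groups } t_1, t_2, \tag{1}$$

where $P^\diamond_t(f) = \mathbb{P}\{f(x,s) > 0 \mid s = t,\ y = \diamond 1\}$. The EOd, instead, is just the concurrent verification of EOp$^+$ and EOp$^-$, that is,

$$P^+_{t_1}(f) = P^+_{t_2}(f) \quad\text{and}\quad P^-_{t_1}(f) = P^-_{t_2}(f) \quad \text{for all groups } t_1, t_2. \tag{2}$$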

Since a model , in general, will not be able to exactly fulfill the EOp with nor the EOd constraints we define the Difference of EOp (DEOp) with as

where . Finally, the Difference of EOd (DEOd) is defined as

3. Paradigm

A central problem, when learning a model from data under fairness requirements, is that using a different classification method, or even using different weights on attributes for members of different groups, may not be allowed for certain classification tasks (dwork2018decoupled). In other words, it may not be permitted to use the sensitive feature explicitly or implicitly in the functional form of the model. This means that $f$ should be a function of $x$ only, that is, $f:\mathcal{X}\to\mathbb{R}$.

For instance, if and the sensitive feature is encoded with a one-hot encoding, and we use a linear classifier then

which is forbidden since the model involves a different bias for each of the sensitive groups. The problem is even more apparent when we use a different model per each group, namely we set

(3)

Unfortunately, the above requirement can be highly constraining, resulting in a model with poor accuracy. In practice, due to bias present in the data, learning a model which involves the sensitive feature in its functional form may substantially improve model accuracy.
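To make the one-hot example above concrete, the following minimal Python sketch checks that a linear model fed with the non-sensitive features concatenated with a one-hot encoding of the group reduces to a common linear score plus a bias that depends on the group. The function names, weights, and inputs are hypothetical and chosen only for illustration.

```python
def one_hot(s, k):
    """One-hot encoding of a group index s in {0, ..., k-1}."""
    return [1.0 if t == s else 0.0 for t in range(k)]


def linear_score(w, z):
    """Inner product between a weight vector w and a feature vector z."""
    return sum(wi * zi for wi, zi in zip(w, z))


def score_with_sensitive(w_x, w_s, x, s):
    """Score of a linear model whose input is x concatenated with one_hot(s)."""
    return linear_score(w_x + w_s, x + one_hot(s, len(w_s)))


# Hypothetical weights: w_x acts on the non-sensitive features,
# w_s acts on the one-hot block (one entry per group).
w_x = [0.5, -1.0]
w_s = [0.3, -0.2]
x = [2.0, 1.0]

for s in range(len(w_s)):
    full = score_with_sensitive(w_x, w_s, x, s)
    shared_plus_bias = linear_score(w_x, x) + w_s[s]
    # The one-hot block contributes exactly w_s[s]: a group specific bias.
    assert abs(full - shared_plus_bias) < 1e-12
```

Two individuals with the same non-sensitive features but different groups therefore receive scores that differ by the gap between the two group biases, which is what the requirement above rules out.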

Our proposal to overcome the above limitation is to use the input $x$ to predict the sensitive group $s$. That is, we learn a function $g:\mathcal{X}\to\mathcal{S}$, such that $g(x)$ is the prediction of the sensitive feature of $x$. Therefore, our method replaces the specific model $f(x,s)$ with the composite model $f(x,g(x))$, thereby treating different individuals equally. Indeed, if $(x,s)$ and $(x',s')$ are two instances, then $f(x,g(x)) = f(x',g(x'))$ provided $x = x'$, irrespective of the values of $s$ and $s'$. Hence, we can freely use $g(x)$ in the functional form since, during the testing phase, we do not require any knowledge of $s$. As we shall see, on the one hand, in the regions of the input space where the classifier $g$ predicts well, this approach allows us to exploit MTL to learn group specific models. On the other hand, when the prediction error is high, this approach acts as a randomization procedure which, as we will empirically show, improves the fairness measure of the overall model.
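The composite model can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the group predictor is a fixed threshold rule rather than a learned classifier, the group specific models are linear, and all names and numbers are hypothetical.

```python
def dot(u, v):
    """Inner product between two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def predict_group(x):
    """Hypothetical group predictor: a fixed threshold on the first feature."""
    return 0 if x[0] < 1.0 else 1


# Hypothetical group specific linear models, indexed by (predicted) group.
GROUP_WEIGHTS = {0: [1.0, -0.5], 1: [0.2, 0.8]}


def composite_score(x):
    """Route x to the model of its predicted group; the true group is never read."""
    return dot(GROUP_WEIGHTS[predict_group(x)], x)


# Two individuals with identical non-sensitive features but different true groups.
person_a = {"x": [0.4, 2.0], "s": 0}
person_b = {"x": [0.4, 2.0], "s": 1}
assert composite_score(person_a["x"]) == composite_score(person_b["x"])
```

Since `composite_score` takes only the non-sensitive features as input, the two individuals are necessarily scored identically, whatever their true group.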

In this paper we investigate (i) the effect of having the sensitive feature as part of the functional form of the model, (ii) the effect of using a shared model between the groups or a different model per group, (iii) the effect of learning a shared model with STL or MTL and the effect of learning group specific models with ITL or MTL, and (iv) the effect of using the predicted sensitive feature instead of its actual value inside the functional form of the model. Then we will show that it is possible to take the best result of the different approaches with substantial benefits in terms of both model accuracy and fairness, while still treating different individuals equally.

4. Methodology

In this section, we describe our approach to learning fair and accurate models and highlight the connection to MTL (evgeniou2004regularized). We consider the following functional form

$$f(x,s) = \langle w, \phi(x,s) \rangle, \tag{4}$$

where $\langle\cdot,\cdot\rangle$ is the inner product between two vectors in a Hilbert space $\mathbb{H}$, $w\in\mathbb{H}$ is a vector of parameters, and $\phi:\mathcal{X}\times\mathcal{S}\to\mathbb{H}$ is a prescribed feature mapping.

We can then learn the parameter vector by regularized empirical risk minimization, using the squared Euclidean norm of the parameter vector as the regularizer. The generality of this approach comes from the general form of the feature mapping, which may be implicitly defined by a kernel function, see e.g. (shawe2004kernel; smola2001) and references therein. In the following, we first briefly discuss three approaches for learning the parameter vector, which correspond to the three methods investigated in this paper. Then, we explain how these methods can be enhanced with fairness constraints.

4.1. Single Task Learning

As we argued above, we may not be allowed to explicitly use the sensitive feature in the functional form of the model. A simple approach to overcome this problem would be to train a shared model between the groups, that is, we choose and in Eq. (4), where and , so that (a potentially unregularized threshold may be built in the feature map to include a bias term). We learn the model parameters by solving the Tikhonov regularization problem

(5)

where is a regularization parameter. This method, which we will call Single Task Learning (STL), searches for the linear separator which minimizes a trade-off between the empirical average risk per group and the complexity (smoothness) of the models.
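As a rough illustration of the trade-off described above, the following Python sketch evaluates an STL-style objective for a linear model: the hinge loss is averaged within each group, the group averages are averaged again, and a squared-norm penalty is added. The identity feature map, the function name `stl_objective`, the toy data, and the value of `lam` are assumptions made for the example; the exact constants of Eq. (5) may differ.

```python
def dot(u, v):
    """Inner product between two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def hinge(score, y):
    """Hinge loss for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)


def stl_objective(w, data, lam):
    """Average over groups of the per-group mean hinge loss, plus lam * ||w||^2.

    data is a list of (x, s, y) triples; s is used only to average the loss
    per group, never as an input of the model.
    """
    losses = {}
    for x, s, y in data:
        losses.setdefault(s, []).append(hinge(dot(w, x), y))
    avg_risk = sum(sum(v) / len(v) for v in losses.values()) / len(losses)
    return avg_risk + lam * dot(w, w)


# Toy data: three points in group 0, a single point in group 1.
data = [([1.0, 0.0], 0, +1), ([2.0, 0.0], 0, +1),
        ([0.0, 1.0], 0, -1), ([1.0, 1.0], 1, -1)]
value = stl_objective([1.0, -1.0], data, lam=0.1)
```

In this toy example the only non-zero loss is incurred on the single point of group 1. Averaging per group gives it weight 1/2 in the risk (0.5), whereas a plain average over the four samples would give 0.25, which illustrates why the per-group average protects small groups.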

As we shall see in our experiments below, STL performs poorly, because it does not capture variance across groups. A slight variation which may improve performance is to introduce group specific thresholds. However, we remark again that this approach may not be permitted. Specifically, we choose and where are the canonical basis vectors in and , so that .

4.2. Independent Task Learning

An approach to overcome the potentially underfitting performance of STL is to learn different models for each of the groups; we refer to this approach as Independent Task Learning (ITL). It corresponds to setting and in Eq. (4), where and , so that . As before, the feature map may account for a constant component to accommodate a threshold for each of the groups. To find the vectors we solve independent Tikhonov regularization problems of the form

(6)

Note that, similar to STL, if we substitute the predicted sensitive feature for the true one in this last functional form, then the method treats members of different groups equally. This matters since, as we mentioned before, learning independent models may not be allowed. Furthermore, we remark that, from a statistical point of view, minority groups (small sample sizes) will be prone to overfitting. Nevertheless, as we shall see, ITL works better than STL in our experiments, suggesting that there is a lot of bias in the data. Still, one would expect that ITL can be further improved by leveraging similarities between the groups. We discuss this next.

4.3. Multitask Learning

Let us now discuss the multitask learning approach used in the paper, which is based on regularization around a common mean (evgeniou2004regularized). We choose and in Eq. (4), where and , so that . MTL jointly learns a shared model as well as task specific models by encouraging the specific models and the shared model to be close to each other. To this end, we solve the following Tikhonov regularization problem

(7)

where the parameter forces the dependency between shared and specific models and the parameter captures the relative importance of the loss of the shared model and the group-specific models. This MTL approach is general enough to include STL and ITL, which are recovered by setting and , respectively. Similar to STL and ITL, regularized group specific thresholds could be added in the shared model and in the group specific models.
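The following Python sketch evaluates one common way of writing an MTL objective with regularization around a common mean. It is an assumption-laden illustration, not a transcription of Eq. (7): the linear models, the identity feature map, the parameter names `theta`, `lam`, and `rho`, and the way they weight the terms are all choices made for the example.

```python
def dot(u, v):
    """Inner product between two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def hinge(score, y):
    """Hinge loss for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)


def group_averaged_risk(weights_for, data):
    """Mean over groups of the per-group mean hinge loss.

    weights_for(s) returns the weight vector applied to points of group s.
    """
    losses = {}
    for x, s, y in data:
        losses.setdefault(s, []).append(hinge(dot(weights_for(s), x), y))
    return sum(sum(v) / len(v) for v in losses.values()) / len(losses)


def mtl_objective(w0, w_groups, data, theta, lam, rho):
    """Assumed form: theta trades off shared vs. group specific losses,
    lam trades off the norm of w0 vs. the distance of each w_t from w0,
    and rho scales the whole regularizer."""
    shared_risk = group_averaged_risk(lambda s: w0, data)
    specific_risk = group_averaged_risk(lambda s: w_groups[s], data)
    closeness = sum(
        sum((a - b) ** 2 for a, b in zip(w, w0)) for w in w_groups.values()
    ) / len(w_groups)
    return (theta * shared_risk + (1.0 - theta) * specific_risk
            + rho * (lam * dot(w0, w0) + (1.0 - lam) * closeness))


data = [([1.0, 0.0], 0, +1), ([2.0, 0.0], 0, +1),
        ([0.0, 1.0], 0, -1), ([1.0, 1.0], 1, -1)]
w0 = [1.0, -1.0]
w_groups = {0: [1.0, -1.0], 1: [0.0, -1.0]}
value = mtl_objective(w0, w_groups, data, theta=0.5, lam=0.5, rho=0.1)
```

In this toy example the model of group 1 moves away from the shared one to fit its single point, paying for the deviation through the closeness term. This is the mechanism by which the group specific models borrow strength from the shared model.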

Again, note that the group specific models trained by MTL may not be permitted. Likewise, the shared model trained by MTL may not be permitted if we include the sensitive variable in the input. However, if the sensitive variable is predicted by an external classifier and MTL is then retrained with the predicted values, then this model treats different groups equally (see Figure 1).

4.4. Adding Fairness Constraints

Note that the STL, ITL, and MTL problems are all convex provided that the loss function used to measure the empirical errors in Eqs. (5), (6), and (7) is convex. Since we are dealing with binary classification problems, we will use the hinge loss (see e.g. (shalev2014understanding)), which is defined as $\ell(f(x,s),y) = \max\{0,\, 1 - y f(x,s)\}$.
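For reference, the hinge loss and the hard (0-1) loss can be written in a few lines of Python. The sketch below, with hypothetical function names, also checks numerically on a few scores that the hinge loss upper-bounds the hard loss, which is what makes it a convenient convex surrogate.

```python
def hinge_loss(score, y):
    """Hinge loss max(0, 1 - y * score) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)


def hard_loss(score, y):
    """Hard (0-1) loss: 1 when the sign of the score disagrees with y."""
    return 1.0 if y * score <= 0 else 0.0


# The hinge loss is never smaller than the hard loss.
for score in [-2.0, -0.5, 0.0, 0.5, 1.0, 3.0]:
    for y in (-1, +1):
        assert hinge_loss(score, y) >= hard_loss(score, y)
```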

In many recent papers (pleiss2017fairness; beutel2017data; hardt2016equality; feldman2015certifying; agarwal2017reductions; agarwal2018reductions; woodworth2017learning; zafar2017fairness; menon2018cost; zafar2017parity; bechavod2018Penalizing; zafar2017fairnessARXIV; kamishima2011fairness; kearns2017preventing; Prez-Suay2017Fair; dwork2018decoupled; berk2017convex; alabi2018optimizing; adebayo2016iterative; calmon2017optimized; kamiran2009classifying; zemel2013learning; kamiran2012data; kamiran2010classification; donini2018empirical) it has been shown how to enforce EOp$^\diamond$ constraints, for $\diamond\in\{-,+\}$, during the learning phase of the model. Here we build upon the approach proposed in (donini2018empirical) since it is convex, theoretically grounded, and has been shown to perform favorably against state-of-the-art alternatives. To this end, we first observe that

(8)

where the loss appearing in Eq. (8) is the hard loss function. Then, by substituting Eq. (8) in Eqs. (1) and (2), replacing the deterministic quantities with their empirical counterparts, and approximating the hard loss function with the linear one, we obtain the convex EOp$^\diamond$ constraint, with $\diamond\in\{-,+\}$, which is defined as follows

(9)

while for the EOd we just have to enforce both the EOp$^+$ and EOp$^-$ constraints.
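To illustrate why the linear approximation yields a convex constraint, consider a linear model with the identity feature map (an assumption made here for simplicity). With a linear loss, requiring equal average loss on the examples of a given class in two groups amounts to requiring that the model has the same average score on those examples, which is a linear condition on the weights. The Python sketch below computes the corresponding gap; the function names and the toy data are hypothetical.

```python
def dot(u, v):
    """Inner product between two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def group_mean_features(data, group, label):
    """Mean feature vector of the points of `group` whose label is `label`."""
    return mean_vector([x for x, s, y in data if s == group and y == label])


def linear_eop_gap(w, data, group_a, group_b, label=+1):
    """Difference between the average scores of the two groups on the
    examples of class `label`; the relaxed constraint asks for a zero gap."""
    u_a = group_mean_features(data, group_a, label)
    u_b = group_mean_features(data, group_b, label)
    return dot(w, [a - b for a, b in zip(u_a, u_b)])


data = [([1.0, 0.0], 0, +1), ([3.0, 0.0], 0, +1), ([0.0, 2.0], 1, +1),
        ([5.0, 5.0], 0, -1), ([4.0, 1.0], 1, -1)]
gap_fair = linear_eop_gap([1.0, 1.0], data, 0, 1)    # 0.0: constraint satisfied
gap_unfair = linear_eop_gap([1.0, 0.0], data, 0, 1)  # 2.0: constraint violated
```

Because the gap is linear in `w`, adding it as an equality constraint (or bounding its absolute value) keeps the training problem convex.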

In order to plug the constraint of Eq. (9) inside STL, ITL and MTL we first define the quantities

(10)

It is then straightforward to show that if we wish to enforce the EOp constraint onto the shared model one has to add these constraints to the STL and MTL

(11)

We remark again that for the EOd constraints we just have to insert which means constraints.

If, instead, we want to enforce the EOp constraint onto group specific models we have to add these constraints to the MTL and ITL

(12)

while for the EOd we just have to insert .

Finally, we note that by the representer theorem, as shown in (donini2018empirical), it is straightforward to derive the kernelized version of the fair STL, ITL, and MTL convex problems, which can be solved with any solver, in our case CPLEX (cplex2018ibm).

5. Experiments

The aim of the experiments is to address the questions raised before. Namely, we wish to: (a) study the effect of using the sensitive feature as a way to bias the decision of a common model or to learn group specific models, (b) show the advantage of training either the shared or group specific models via MTL, and (c) show that MTL can be effectively used even when the sensitive feature is not available during testing by predicting the sensitive feature based on the non-sensitive ones.

5.1. Datasets and Setting

We employed the Adult dataset from the UCI repository and the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) dataset.

The Adult dataset contains features concerning demographic characteristics of individuals, split into a training and a test set; two features, Gender (G) and Race (R), can be considered sensitive. The task is to predict if a person has an income per year that is more (or less) than 50,000 USD. Some statistics of the Adult dataset with reference to the sensitive features are reported in Table 1.

Sens.   Group                       (%)
G       Male (M)                    66.9
        Female (F)                  33.2
R       White (W)                   85.5
        Black (B)                    9.6
        Asian-Pac-Islander (API)     3.1
        Amer-Indian-Eskimo (AIE)     1.0
        Other (O)                    0.8
G+R     W&M                         58.8
        W&F                         26.7
        B&M                          4.9
        B&F                          4.7
        API&M                        2.1
        API&F                        1.1
        AIE&M                        0.6
        AIE&F                        0.4
        O&M                          0.5
        O&F                          0.3
Table 1. Adult dataset: statistics with reference to the sensitive features.

The COMPAS dataset is constructed by the commercial algorithm COMPAS, which is used by judges and parole officers for scoring criminal defendants' likelihood of reoffending (recidivism). A two-year follow-up study has shown that the algorithm is biased in favor of white defendants. This dataset contains variables used by the COMPAS algorithm in scoring defendants, along with their outcomes within two years of the decision, for over 10000 criminal defendants in Broward County, Florida. In the original data, three subsets are provided. We concentrate on the one that includes only violent recidivism. Table 2, analogously to Table 1, reports the statistics with reference to the sensitive features.

Sens.   Group                       (%)
G       Female (F)                  19.34
        Male (M)                    80.66
R       African-American (AA)       51.23
        Asian (A)                    0.44
        Caucasian (C)               34.02
        Hispanic (H)                 8.83
        Native American (NA)         0.25
        Other (O)                    5.23
G+R     Female African-American      9.04
        Female Asian                 0.03
        Female Caucasian             7.86
        Female Hispanic              1.48
        Female Native American       0.06
        Female Other                 0.93
        Male African-American       42.20
        Male Asian                   0.45
        Male Caucasian              26.16
        Male Hispanic                7.40
        Male Native American         0.19
        Male Other                   4.30
Table 2. COMPAS dataset: statistics with reference to the sensitive features.
Table 3. Adult Dataset: confusion matrices in percentage (true class in columns and predicted classes in rows) obtained by predicting Gender (classes M, F) and Race (classes W, B, API, AIE, O) from the other non-sensitive features using Random Forests.
Table 4. COMPAS Dataset: confusion matrices in percentage (true class in columns and predicted classes in rows) obtained by predicting Gender (classes M, F) and Race (classes AA, A, C, H, NA, O) from the other non-sensitive features using Random Forests.

In all the experiments, we compare STL, ITL, and MTL in different settings. Specifically, we test each method along four binary settings: whether the models use the sensitive feature or not (S), whether the fairness constraint is active or not (F), whether we consider the group specific models or the shared model between groups (D), and whether we use the true sensitive feature or the predicted one (P). Note that when the shared model is considered we can only compare STL with MTL, since only these two methods produce a shared model between the groups; furthermore, when the group specific models are considered we can only compare ITL with MTL, since these produce group specific models.

We collect statistics concerning the average classification accuracy per group in percentage (ACC) on the test set, the difference of equal opportunity on both the positive and negative class (denoted as DEOp$^+$ and DEOp$^-$, respectively), and the difference of equalized odds (DEOd) of the selected model; see Section 2 for a definition of these quantities.
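A minimal Python sketch of how such statistics can be computed from test predictions is given below. The aggregation over groups used for the difference of equal opportunity (largest gap between any two groups, which for two groups is the absolute difference) is an assumption made for the example, as are the function names and the toy predictions.

```python
def average_group_accuracy(y_true, y_pred, groups):
    """Accuracy computed within each group, then averaged over the groups."""
    correct, total = {}, {}
    for yt, yp, g in zip(y_true, y_pred, groups):
        total[g] = total.get(g, 0) + 1
        correct[g] = correct.get(g, 0) + (1 if yt == yp else 0)
    return sum(correct[g] / total[g] for g in total) / len(total)


def positive_rate(y_true, y_pred, groups, group, label):
    """Fraction of points of `group` with true label `label` predicted as +1.

    Assumes the group contains at least one point with that label.
    """
    preds = [yp for yt, yp, g in zip(y_true, y_pred, groups)
             if g == group and yt == label]
    return sum(1 for yp in preds if yp == 1) / len(preds)


def deop(y_true, y_pred, groups, label):
    """Assumed aggregation: largest gap in positive rate between two groups."""
    rates = [positive_rate(y_true, y_pred, groups, g, label)
             for g in sorted(set(groups))]
    return max(rates) - min(rates)


y_true = [1, 1, -1, -1, 1, 1, -1, -1]
y_pred = [1, 1, -1, 1, 1, -1, -1, -1]
groups = [0, 0, 0, 0, 1, 1, 1, 1]

acc = average_group_accuracy(y_true, y_pred, groups)  # 0.75
deop_pos = deop(y_true, y_pred, groups, label=1)      # 0.5
deop_neg = deop(y_true, y_pred, groups, label=-1)     # 0.5
```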

We selected the best hyperparameters by the two-step 10-fold cross validation (CV) procedure described in (donini2018empirical). In the first step, the value of the hyperparameters with highest accuracy is identified. In the second step, we shortlist all the hyperparameters with accuracy close to the best one (in our case, above a fixed fraction of the best accuracy). Finally, from this list, we select the hyperparameters with the lowest fairness measure. This validation procedure ensures that fairness cannot be achieved by a mere modification of the hyperparameter selection procedure.
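The selection rule can be sketched as follows in Python, assuming that the cross-validated accuracy and fairness measure of each hyperparameter configuration have already been computed. The function name, the dictionary keys, the candidate values, and the `tolerance` of 0.95 are hypothetical and serve only to illustrate the two steps.

```python
def select_hyperparameters(candidates, tolerance):
    """Two-step selection over cross-validated results.

    candidates: list of dicts with keys 'params', 'accuracy', 'unfairness'.
    tolerance: fraction of the best accuracy a candidate must reach
               to be shortlisted (hypothetical value in the example).
    """
    # Step 1: identify the best accuracy.
    best_accuracy = max(c["accuracy"] for c in candidates)
    # Step 2: shortlist the candidates whose accuracy is close to the best.
    shortlist = [c for c in candidates
                 if c["accuracy"] >= tolerance * best_accuracy]
    # Finally, pick the shortlisted candidate with the lowest fairness measure.
    return min(shortlist, key=lambda c: c["unfairness"])


candidates = [
    {"params": {"lam": 0.1}, "accuracy": 0.90, "unfairness": 0.20},
    {"params": {"lam": 1.0}, "accuracy": 0.89, "unfairness": 0.05},
    {"params": {"lam": 10.0}, "accuracy": 0.70, "unfairness": 0.01},
]
chosen = select_hyperparameters(candidates, tolerance=0.95)
```

In this example the third configuration is the fairest but is discarded in the second step because its accuracy is too far from the best one; the second configuration is selected.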

5.2. Results

The results for all possible combinations described above are reported in Table 5. In Figures 2, 3, and 4, we present a visualization of Table 5 for the Adult dataset (results are analogous for the COMPAS one), where both the error (i.e., 1-ACC) and the EOd are normalized column-wise to be between 0 and 1. The closer a point is to the origin, the better the result.
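One simple way to obtain such a normalization is min-max scaling of each column, sketched below in Python. Whether the figures use exactly this scaling is an assumption; the function name and the numbers are illustrative only.

```python
def min_max_normalize(values):
    """Rescale a column of values linearly so that it spans [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no ranking information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical error values (1 - ACC) of three models in one column.
errors = [1.0, 2.0, 3.0]
normalized = min_max_normalize(errors)  # [0.0, 0.5, 1.0]
```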

Figure 2. Adult dataset: complete results set for Gender (text close to the symbols in plot are P, D, F, and S).
Figure 3. Adult dataset: complete results set for Race (text close to the symbols in plot are P, D, F, and S).
Figure 4. Adult dataset: complete results set for Gender+Race (text close to the symbols in plot are P, D, F, and S).
Adult Dataset COMPAS Dataset
STL MTL STL MTL STL MTL STL MTL STL MTL STL MTL
ITL ITL ITL ITL ITL ITL
P D F S ACC DEOp ACC DEOp ACC DEOp ACC DEOp ACC DEOd ACC DEOd ACC DEOp ACC DEOp ACC DEOp ACC DEOp ACC DEOd ACC DEOd
G
R
G+R