Learning to Ignore:Fair and Task Independent Representations

# Learning to Ignore: Fair and Task Independent Representations

## Abstract

Training fair machine learning models, making them interpretable, and solving the problem of domain shift have attracted considerable interest in recent years. There is a vast amount of work addressing these topics, mostly in isolation. In this work we show that they can be seen as instances of a common framework of learning invariant representations. The representations should allow predicting the target while at the same time being invariant to sensitive attributes that split the dataset into subgroups. Our approach is based on the simple observation that it is impossible for any learning algorithm to differentiate samples if they have the same feature representation. This is formulated as an additional loss (regularizer) enforcing a common feature representation across subgroups. We apply it to learn fair models and interpret the influence of the sensitive attribute. Furthermore, it can be used for domain adaptation, transferring knowledge and learning effectively from very few examples. In all applications it is essential not only to learn to predict the target, but also to learn what to ignore.

## 1 Introduction

In June 2020 MIT withdrew Tiny Images (footnote 1), a popular vision dataset, after researchers found that it is socially biased. Biases in training data are a major issue for machine learning algorithms [19], especially as these algorithms are increasingly used to make critical decisions. First, it is important to ensure that those systems are fair and do not discriminate against certain groups. Secondly, interpretability of the decisions, i.e., "why" the system comes to a conclusion or how important a certain factor is for decision making, is desirable for better understanding. Thirdly, trained models should generalize well. In many real-world situations the data seen during training differs from the data the models are applied to in production. Domain adaptation tries to transfer knowledge from a source domain (training set) to a particular target domain. To cope with these challenges, many different approaches have been proposed over the last decade.

Fairness and domain adaptation seem to be very different topics, but the goal of both is actually to learn invariant feature representations. In this paper we propose a simple approach for learning fair representations. A deep learning model is forced to ignore information that would allow conclusions to be drawn about sensitive attributes (fair classifier) or certain areas (domain independent classifier). As seen in Fig. 1, the target variable should still be predictable, but at the same time it should not be possible to tell from which subgroup the examples were taken. The main insight is that similar feature representations for different groups or datasets no longer allow us to differentiate between them. To accomplish this, we introduce an affinity loss which is used as an additional term during the training of a model. Once a fair representation is established, the sensitive attribute can be added back and its impact measured. This paves the way for interpretability or causal reasoning. Furthermore, by reducing the distance between different domains in latent space, a more general representation of the dataset is learned, which helps to generalize better across domains.

#### Related Work.

In order to make machine learning models "fair", existing works aim at modifying the feature representations of the data [30], the class label annotations [24] or the data itself [20]. Learning this latent representation involves an additional cross entropy classification loss [2], a decomposition loss [20], an additional hidden layer for adversarial optimization [1], distribution matching [21], Variational Autoencoders [12], or learning the representation as an adversarial minimax game [27]. The goal is not only to improve fairness but also to interpret how fairness is enforced. Such methods build on special network architectures [4, 11] or a combination of different machine learning algorithms [8]. For the domain adaptation task, approaches re-weight the source samples to better match the target domain [10, 18], learn shared weights [28] or a common subspace [5], modify the network architecture [14, 15], or use Generative Adversarial Networks [12, 6]. Interestingly, if causal aspects are taken into account, predictions can be improved [3, 23]. Along the same lines, recent work aims at analysing those areas in a common way [12, 22, 13].

#### Contribution.

Our main aim is not to beat any particular method for fairness or domain adaptation, but rather to highlight the commonalities. From a technical side, the most closely related work is probably the one by Ganin et al. [5], based on Zemel et al. [30]. However, instead of adding a new gradient reversal layer, we simply minimize an additional affinity loss during training. Hence, our method can easily be applied to any existing network architecture, including classification, regression tasks or auto-encoders (footnote 2). Experiments demonstrate that a pretrained network with fixed weights can be debiased simply by adding a fair representation layer. While many approaches struggle with datasets that are unbalanced in terms of both the target and the sensitive attribute [2], our approach is only mildly affected. We are able to improve fairness, interpretability and domain adaptation within one very simple approach.

## 2 Learning Invariant Representation

#### Problem formulation.

Let $X$ be the entire dataset. Each $x \in X$ is an example represented by attributes, together with its corresponding target variable $y$. Furthermore, let $z$ be the sensitive attribute. We aim to learn a classifier that predicts the target variable $y$ from the attributes while being unable to predict the sensitive attribute $z$.

In order to achieve this we build a (low dim.) representation $g(x)$ which allows for predicting the target $y$ but not the sensitive attribute $z$. Our aim is to make the representation as similar as possible with respect to the sensitive attribute, that is, to obtain invariant features. Many modern deep learning architectures can be seen as having such a representation layer, which has the advantage that any pre-trained model can be used and later fine-tuned.

Figure: Example architecture of a neural network with one or more hidden layers (indicated with the dotted lines). The last hidden layer, $g(x)$, is referred to as the representation and is used to calculate our proposed affinity loss.

In the setting of training a neural network, typically a loss (e.g., cross entropy) is minimized in order to predict the target. We propose to add another loss term which serves as regularizer. See a visualization in Fig. 2. The neural network is then trained on the combined loss

$$l_{\text{total}} = l_{\text{target}} + \lambda \cdot l_{\text{affinity}}. \tag{1}$$

If the weight $\lambda$ of the affinity loss is set to zero, the model is trained normally without the new loss. If $\lambda$ is very large, the neural network optimizes the affinity loss and ignores the target loss. The fairness of a model increases with $\lambda$, but this might result in lower accuracy.

In the following we derive $l_{\text{affinity}}$. The sensitive attribute $z$ splits the dataset into subgroups. For simplicity we focus on two subgroups in the following. Consider the two sets $X_1$ and $X_2$ obtained by splitting on the sensitive attribute $z$. For these two sets to be indistinguishable, the following must hold:

$$\forall x_1 \in X_1 \;\; \exists x_2 \in X_2 : g(x_1) = g(x_2). \tag{2}$$

In other words, for each sample there must be at least one other sample with a different sensitive attribute that has the same representation. Technically, the loss minimizes the distance between each sample and its closest counterpart.

The learnt representation should still allow predicting the target $y$. Trivial representations are avoided by combining the loss terms (see above). However, our experiments show that it is beneficial to add a stricter constraint so that the two examples $x_1$ and $x_2$ are from the same class, i.e., $y_1 = y_2$. This avoids mixing up classes and yields significantly better performance. Averaging over all examples and all classes gives

$$l_{\text{affinity}} = \frac{1}{|Y|\,|X_1|} \sum_{y \in Y} \; \sum_{x_1 \in (X_1 \mid y_1 = y)} \; \min_{x_2 \in (X_2 \mid y_2 = y)} d\big(g(x_1), g(x_2)\big), \tag{3}$$

where $d$ is an arbitrary distance function.

#### Implementation Details.

In order to implement Eq. (3) we use a nearest neighbor search with the norm. To speed up training, we do not use the whole dataset but calculate the affinity loss only on the mini-batches.
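The mini-batch computation of Eq. (3) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the Euclidean distance is an assumption (the norm is not recoverable from the text above), the function names are hypothetical, and $|Y|$ is approximated by the number of classes present in the batch. Samples of the first subgroup that find no same-class partner in the batch are skipped, which is one possible design choice among several.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two vectors (assumed choice of d)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def affinity_loss(reps, labels, groups, dist=euclidean):
    """Mini-batch estimate of the affinity loss in Eq. (3).

    reps:   list of representation vectors g(x)
    labels: class label y of each sample
    groups: sensitive attribute of each sample (0 for X1, 1 for X2)
    """
    # Collect representations per class, separated by subgroup.
    by_class = defaultdict(lambda: ([], []))
    for rep, y, z in zip(reps, labels, groups):
        by_class[y][z].append(rep)

    n_x1 = sum(1 for z in groups if z == 0)
    n_classes = len(by_class)
    if n_x1 == 0 or n_classes == 0:
        return 0.0

    total = 0.0
    for x1_reps, x2_reps in by_class.values():
        if not x2_reps:
            continue  # no same-class partner in this batch (assumed handling)
        for r1 in x1_reps:
            # Distance to the nearest same-class neighbor in the other subgroup.
            total += min(dist(r1, r2) for r2 in x2_reps)
    return total / (n_classes * n_x1)


# Toy batch: two classes, two subgroups, 2-dim. representations.
reps = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0], [5.0, 9.0]]
labels = [0, 0, 0, 1, 1]
groups = [0, 1, 1, 0, 1]
print(affinity_loss(reps, labels, groups))
```

In a real training loop the same quantity would have to be computed with differentiable tensor operations so that gradients flow back into the representation layer; the sketch only shows the arithmetic.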

### 2.1 Experiments

We show the basic behavior of our method in an illustrative experiment based on the well-known MNIST dataset of handwritten digits (footnote 3). Additionally, we created MNIST-I, which contains all original MNIST images, but inverted. Together they form our dataset, where the sensitive attribute indicates whether the digit originates from MNIST-I or the original MNIST. As target we still want to predict which digit is depicted.
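As a rough illustration of how such a combined dataset could be assembled, the sketch below assumes that "inverted" means pixel-wise intensity inversion of 8-bit grayscale images (255 minus the pixel value); the text does not spell this out. The function names and the encoding of the sensitive attribute (0 for MNIST, 1 for MNIST-I) are hypothetical.

```python
def invert_image(image, max_value=255):
    """Return an inverted copy of a grayscale image given as rows of ints."""
    return [[max_value - pixel for pixel in row] for row in image]


def build_combined_dataset(images, labels):
    """Pair every original image with its inverted copy.

    Returns (image, digit_label, sensitive_attribute) triples, where the
    sensitive attribute is 0 for original MNIST and 1 for MNIST-I.
    """
    combined = []
    for image, label in zip(images, labels):
        combined.append((image, label, 0))
        combined.append((invert_image(image), label, 1))
    return combined


# Tiny 2x2 stand-in for a 28x28 digit image.
toy_images = [[[0, 255], [10, 200]]]
toy_labels = [7]
dataset = build_combined_dataset(toy_images, toy_labels)
```

Each digit then appears once per subgroup with the same target label, which is the setting in which a same-class nearest neighbor from the other subgroup always exists.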

We train a simple neural network with two 128- and a single 20-width ReLU hidden layer as representation layer. For training, a batch size of 128 samples is used and the weight of the proposed affinity loss is set to .

#### Embedding.

In order to analyze our learnt representation, we perform a t-Distributed Stochastic Neighbor Embedding (t-SNE) on the representation layer. It maps each high-dim. data point to a low-dim. point such that similar objects lie closer together and dissimilar ones further apart.

Figure: 2-dim. t-SNE plot showing the learnt feature representations for the baseline and our approach (columns), colored by digit (top row) and by the sensitive attribute MNIST/MNIST-I (bottom row). With our approach, the representation shows distinguishable digit clusters, but the dataset origin cannot be traced. Best viewed in colour!

Fig. 2.1 compares two models, trained without (baseline) and with the affinity loss. The baseline model learns two clusters for each digit (one normal and one inverted), and the groups can easily be separated by the sensitive attribute (MNIST/MNIST-I). In contrast, when the proposed affinity loss is added to the training process, the two groups overlap heavily, i.e., they are no longer distinguishable, which creates a fair (more in Sec. 3) and more general (only one cluster per digit) feature representation. The digits can still be predicted very accurately, as their clusters are kept clearly separated from each other. This generality is also exploited for domain adaptation in Sec. 4.

#### Predictions

After training, we fix all hidden layers and retrain only the output layer. The output layer is trained not only to predict the digits but also to predict whether the sample comes from the original MNIST or the inverted MNIST-I dataset. The model trained with our approach should struggle to learn the origin of the sample (inverted/not inverted). In fact, our fair model predicts the target class with 93% accuracy (4% higher than the baseline), whereas the sensitive attribute is hardly predictable anymore (around 57% accuracy). The baseline model can easily predict the sensitive attribute (nearly 100%).

#### Hyperparameter.

If the weight $\lambda$ of the affinity loss is set to zero, the model is trained only on the target loss, with no attempt to make it fair (see Tab. 1(a); each model was trained 5 times and the results averaged). For $\lambda = 0.01$ we get the fairest model: the digits are predicted well (even better than by the baseline model) and the origin dataset of the input samples can hardly be predicted. For a $\lambda$ of 0.1 the model becomes as fair as possible by not learning anything at all. If the representation layer is chosen too small (fewer than 5 nodes), the model does not predict the digits accurately (see Tab. 1(b)). If the number of nodes gets too large (more than 50 in this example), the model becomes unfair again, suffering from the curse of dimensionality.

## 3 Fairness and Interpretability

Simply removing the sensitive attributes from a dataset is insufficient for eliminating their biases, as there almost always exists an indirect influence of the sensitive information [16]. Our approach learns a feature representation of the data that preserves general information while being forced not to encode sensitive information.

After the fair training of the model we are able to interpret the classification and investigate the influence of the sensitive attribute on the classification task [17]. We do so by reattaching the sensitive attribute to the fair model, see Fig. 3. For better interpretability the fair feature representation is linearly combined, forming $r$. The reattachment of $z$ and its interpretation are possible because $r$ is trained to be independent of $z$ (see also [25]):

$$\hat{y} = f(w_r r + w_z z + b), \quad \text{with } z \perp\!\!\!\perp r, \tag{4}$$

where $w_r$ and $w_z$ are the learnt weights of the neural network, $b$ the bias term, $z$ the sensitive attribute and $f$ the transfer function (e.g., linear or sigmoid). The weights $w_r$ and $w_z$ of $r$ and $z$, respectively, indicate how large the influence of the sensitive attribute on the classification is [17]. In the following experiments, a model trained without the affinity loss using the same architecture as the fair one is referred to as the baseline.

Figure: A simple linear unit is added to the fair model representation. Furthermore, the sensitive attribute is reattached. As both units are uncorrelated, we use the weights to interpret the importance of the sensitive attribute for the final classification.
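The output unit of Eq. (4) can be sketched in a few lines of Python. The sigmoid transfer function is one of the two options named in the text; the helper `relative_influence` is a hypothetical summary introduced only for illustration and is not a measure defined in the paper. Comparing weight magnitudes in this way is only meaningful if $r$ and $z$ are on comparable scales, which is assumed here.

```python
import math


def sigmoid(a):
    """Logistic transfer function."""
    return 1.0 / (1.0 + math.exp(-a))


def predict(r, z, w_r, w_z, b, f=sigmoid):
    """Output of the reattached unit: y_hat = f(w_r * r + w_z * z + b)."""
    return f(w_r * r + w_z * z + b)


def relative_influence(w_r, w_z):
    """Share of the total absolute weight carried by the sensitive attribute.

    Hypothetical helper; assumes r and z are on comparable scales.
    """
    total = abs(w_r) + abs(w_z)
    if total == 0:
        raise ValueError("both weights are zero")
    return abs(w_z) / total


# Weights of roughly the size reported for CelebA later in the text;
# the bias value is an arbitrary assumption.
w_r, w_z, b = 0.7, 1.5, 0.0
share = relative_influence(w_r, w_z)
y_hat_group0 = predict(r=0.2, z=0, w_r=w_r, w_z=w_z, b=b)
y_hat_group1 = predict(r=0.2, z=1, w_r=w_r, w_z=w_z, b=b)
```

With identical fair representation $r$, the difference between the two predictions is driven entirely by $w_z$, which is what makes the weight interpretable.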

### 3.1 Fairness Measures

There are many different fairness measures used for classification [26, 2, 7, 29]. Two commonly used ones are summarized in the following. Let $\hat{y}$ be the output of the classifier, $y$ the true label and $z$ the sensitive attribute.

#### Equality of Opportunity/ Equality Gap.

The most common measure is the so-called equality of opportunity. It is reached if the groups $z_1$ and $z_2$ defined by the sensitive characteristic have equal true positive rates (TPR), i.e., $TPR_{z=z_1} = TPR_{z=z_2}$. The equality gap is then calculated as

$$\big|P(\hat{y}=1 \mid z=z_1, y=1) - P(\hat{y}=1 \mid z=z_2, y=1)\big| = \big|TPR_{z=z_1} - TPR_{z=z_2}\big|. \tag{5}$$

#### Parity Gap.

The parity gap measures the independence between prediction and sensitive attribute for positive predictions, i.e.,

$$\big|P(\hat{y}=1 \mid z=z_1) - P(\hat{y}=1 \mid z=z_2)\big|. \tag{6}$$

For a binary sensitive attribute in medical settings, it is the same as the average treatment effect (ATE) [25].
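Both gaps can be computed directly from predictions, labels and the sensitive attribute. The following plain-Python sketch follows Eqs. (5) and (6); the function names and the encoding of the two groups as 0 and 1 are assumptions made for the example.

```python
def positive_rate(y_pred, mask):
    """Fraction of positive predictions among the samples selected by mask."""
    selected = [p for p, keep in zip(y_pred, mask) if keep]
    if not selected:
        raise ValueError("empty subgroup")
    return sum(1 for p in selected if p == 1) / len(selected)


def equality_gap(y_pred, y_true, z, z1=0, z2=1):
    """Absolute difference of true positive rates between groups, Eq. (5)."""
    tpr1 = positive_rate(y_pred, [zi == z1 and yi == 1 for zi, yi in zip(z, y_true)])
    tpr2 = positive_rate(y_pred, [zi == z2 and yi == 1 for zi, yi in zip(z, y_true)])
    return abs(tpr1 - tpr2)


def parity_gap(y_pred, z, z1=0, z2=1):
    """Absolute difference of positive prediction rates between groups, Eq. (6)."""
    rate1 = positive_rate(y_pred, [zi == z1 for zi in z])
    rate2 = positive_rate(y_pred, [zi == z2 for zi in z])
    return abs(rate1 - rate2)


# Toy example with four samples per group.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
z      = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]
print(equality_gap(y_pred, y_true, z), parity_gap(y_pred, z))
```

In this toy example the parity gap is zero while the equality gap is not, which illustrates why a single metric does not characterize fairness on its own.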

It is important to note that it is difficult to minimize all fairness metrics at the same time. The appropriate metric depends on the application, but most often equality of opportunity is targeted. Note that some trivial models yield good results (very small gaps), such as models with a very low TPR. For some tasks a compromise may not even be possible, such as predicting whether someone can give birth. There is a clear causal relationship to gender; thus, if this information (including implicit information) is removed, it becomes impossible for any classifier to make a correct prediction.

### 3.3 Experiment: CelebA

The CelebA image dataset (footnote 5) is significantly more complex than the MNIST or Adult dataset. It contains a total of 202,599 images of celebrities, each with 40 attributes. 162,770 images are used for training, 19,867 for validation and the rest for testing. The annotated attributes reflect the appearance of the celebrities as well as emotional state (e.g. smiling), gender, attractiveness and age. The gender attribute is used as a binary sensitive characteristic and attractiveness as the target label for the classification of the images.

As model we use a fixed VGG19 net trained on ImageNet (to speed up the training process and reduce complexity) and an additional hidden layer with 124 nodes. Tab. 4 compares results for different weights $\lambda$ and shows that we can in fact debias the pretrained VGG net. The CelebA dataset is heavily skewed; around of the images showing women are labeled as attractive, compared to of men. If $\lambda$ is large enough, the influence of the skew on the fairness disappears. The downside is the decrease of accuracy to only as the TNRs for female and male are getting low. Please note that the comparison with Quadrianto [20] is not fully accurate, as our baseline already has a lower accuracy.

#### Fair representation & Interpretability.

The histograms of the fair one-dim. representation (see Fig. 3) for the male and female samples in Fig. 3 show a very similar distribution, supporting the assumption that our model contains a fair representation of the data. The influence of the gender attribute on the classification, see Tab. 5, is checked with the same approach as described in Sec. 3.2. The similar accuracies of the baseline models show that information about the gender attribute is indeed still hidden in the input data. The reattached sensitive attribute helps the fair model to perform better in classifying faces as attractive. This can also be seen in the weight of the sensitive attribute, which is around 1.5, compared to around 0.7 for the fair one-dim. feature representation.

## 4 Domain Adaptation

Data used for training a model might differ from the data seen at test time. This is a big problem for robust real-world applications. The sensitive attribute now corresponds to the different domains or environments [3]. As seen in Sec. 2.1, we force the model to learn a more general feature representation and to ignore domain-specific attributes. This is leveraged to learn representations which are generic across (related) domains and hence generalize better.

#### From MNIST to MNIST-R.

In addition to the MNIST dataset, we created the MNIST-R dataset containing all original MNIST images rotated by 30 degrees. We train a simple neural net with two 128- and one 20-width ReLU representations. MNIST is used as source while the performance is measured on MNIST-R (target). Inspired by Heinze-Deml et al. [9], a few samples of the target set are used to improve the performance. We compare the results in Tab. 6 with a baseline model trained on the same amount of samples (20 per class) of the target dataset using data augmentation. Our approach keeps up with data augmentation and even performs better as the imbalance in the amount of samples used during training becomes larger. It can better leverage the structure in the source data and map it to the target domain than simple data augmentation, which relies on predefined transformations.

Figure: Accuracy of MNIST with different numbers of training samples from the target domain (MNIST-R). Our approach (orange) clearly outperforms the simple baseline, especially when only little data from the target domain is provided.

#### From SVHN to MNIST.

The Street-View House Number (SVHN) dataset (footnote 6) contains house numbers from Google Street View. The challenge of the SVHN dataset is the structured clutter in the background of the images. A Convolutional Neural Network (CNN) with two double-Convolutional layers containing 32 and 64 nodes, respectively, is used. A 20-dim. feature representation on top of this architecture is used to calculate the affinity loss. Results and a comparison to Ganin et al. [5] are shown in Tab. 7. The affinity loss does indeed improve the performance on the target dataset with only a small number of samples. Our model trained on the SVHN dataset with 10 MNIST samples reaches an accuracy of around 75%, which is comparable with Ganin et al. [5]. In comparison, the baseline model achieves an accuracy of only 3% (see Fig. 4) on the same data. If only a few MNIST samples are available, the neural net trained with SVHN and our affinity loss outperforms the baseline model trained solely on the same amount of MNIST samples.

## 5 Discussion and Conclusions

We proposed a new approach for learning invariant feature representations. The main idea is to bring the feature representations of different distributions closer together by introducing an additional loss. We applied this strategy to three different areas: fairness, interpretability and domain adaptation. Our proposed method can be used for different model architectures as well as for readjusting the feature representation of existing, already trained models. Experiments show that the equality gap can be significantly reduced while the accuracy is kept at an acceptable level. The results are comparable with state-of-the-art methods for each task. We demonstrate how to understand the way a sensitive attribute influences the classification of an input sample. A challenge in our approach is to efficiently find the nearest neighbors in the embedding space. We rely on effective, approximate methods here. Not discussed in depth in this paper is that our approach allows using multiple source and target datasets. Thus a model can be trained to be fair regarding multiple attributes. A further extension might be using real-valued attributes as sensitive attributes.

### Footnotes

1. https://groups.csail.mit.edu/vision/TinyImages/, 2020/07/10.
2. In this paper we focus on the classification tasks with one categorical sensitive variable.
3. http://yann.lecun.com/exdb/mnist/, 2020/07/10.
5. http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html, 2020/07/10.
6. http://ufldl.stanford.edu/housenumbers/, 2020/07/10.

### References

1. T. Adel, I. Valera, Z. Ghahramani and A. Weller (2019) One-network adversarial fairness. In AAAI Conference on Artificial Intelligence.
2. A. Beutel, J. Chen, Z. Zhao and E. H. Chi (2017) Data Decisions and Theoretical Implications when Adversarially Learning Fair Representations. Technical report, Google Research.
3. P. Bühlmann (2018) Invariance, Causality and Robustness. Technical report, ETH Zurich.
4. A. Chattopadhyay, P. Manupriya, A. Sarkar and V. N. Balasubramanian (2019) Neural Network Attributions: A Causal Perspective. In International Conference on Machine Learning.
5. Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand and V. Lempitsky (2017) Domain-adversarial training of neural networks. Journal of Machine Learning Research 17.
6. M. Ghifary, W. B. Kleijn, M. Zhang and D. Balduzzi (2015) Domain Generalization for Object Recognition with Multi-task Autoencoders. In International Conference on Computer Vision.
7. M. Hardt, E. P. Price and N. Srebro (2016) Equality of opportunity in supervised learning. In Conference on Neural Information Processing Systems.
8. J. Hartford, G. Lewis, K. Leyton-Brown and M. Taddy (2016) Counterfactual Prediction with Deep Instrumental Variables Networks. Technical report, Microsoft Research and University of British Columbia.
9. C. Heinze-Deml and N. Meinshausen (2017) Conditional Variance Penalties and Domain Shift Robustness. Technical report, ETH Zurich.
10. J. Huang, A. Smola, A. Gretton, K. Borgwardt and B. Schölkopf (2006) Correcting sample selection bias by unlabeled data. In Conference on Neural Information Processing Systems.
11. C. Louizos, U. Shalit, J. Mooij, D. Sontag, R. Zemel and M. Welling (2017) Causal effect inference with deep latent-variable models. In Conference on Neural Information Processing Systems.
12. C. Louizos, K. Swersky, Y. Li, M. Welling and R. Zemel (2015) The Variational Fair Autoencoder. In International Conference on Learning Representations.
13. S. Magliacane, T. van Ommen, T. Claassen, S. Bongers, P. Versteeg and J. M. Mooij (2017) Domain Adaptation by Using Causal Inference to Predict Invariant Conditional Distributions. In Advances in Neural Information Processing Systems.
14. M. Mancini, L. Porzi, S. R. Bulò, B. Caputo and E. Ricci (2018) Boosting Domain Adaptation by Discovering Latent Domains. In Conference on Computer Vision and Pattern Recognition.
15. S. Motiian, M. Piccirilli, D. A. Adjeroh and G. Doretto (2017) Unified Deep Supervised Domain Adaptation and Generalization. In International Conference on Computer Vision.
16. D. Pedreschi, S. Ruggieri and F. Turini (2008) Discrimination-aware Data Mining. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
17. J. Peters, D. Janzing and B. Schölkopf (2017) Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, Cambridge, MA.
18. P. O. Pinheiro (2018) Unsupervised Domain Adaptation with Similarity Learning. In Conference on Computer Vision and Pattern Recognition.
19. V. Prabhu and A. Birhane (2020) Large image datasets: A pyrrhic win for computer vision? Technical report, UnifyID Inc.
20. N. Quadrianto, V. Sharmanska and O. Thomas (2018) Discovering Fair Representations in the Data Domain. In Conference on Computer Vision and Pattern Recognition.
21. N. Quadrianto and V. Sharmanska (2017) Recycling privileged learning and distribution matching for fairness. In Conference on Neural Information Processing Systems.
22. C. Schumann, X. Wang, A. Beutel, J. Chen, H. Qian and E. H. Chi (2019) Transfer of Machine Learning Fairness across Domains. Technical report, Google.
23. H. Singh, R. Singh, V. Mhasawade and R. Chunara (2019) Fair Predictors under Distribution Shift. In Conference on Neural Information Processing Systems.
24. B. Thanh Luong, S. Ruggieri and F. Turini (2011) k-NN as an Implementation of Situation Testing for Discrimination Discovery and Prevention. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
25. W. A. C. van Amsterdam, J. J. C. Verhoeff, P. A. de Jong, T. Leiner and M. J. C. Eijkemans (2019) Eliminating biasing signals in lung cancer images for prognosis predictions with deep learning. npj Digital Medicine 2.
26. S. Verma and J. Rubin (2018) Fairness Definitions Explained. In IEEE/ACM International Workshop on Software Fairness.
27. Q. Xie, Z. Dai, Y. Du, E. Hovy and G. Neubig (2017) Controllable Invariance through Adversarial Feature Learning. In Conference on Neural Information Processing Systems.
28. J. Yang, R. Yang and A. G. Hauptmann (2007) Adapting svm classifiers to data with shifted distributions. In International Conference on Data Mining Workshops.
29. M. B. Zafar, I. Valera, M. G. Rodriguez and K. P. Gummadi (2017) Fairness Beyond Disparate Treatment & Disparate Impact: Learning Classification without Disparate Mistreatment. In International World Wide Web Conference.
30. R. Zemel, Y. Wu, K. Swersky, T. Pitassi and C. Dwork (2013) Learning fair representations. In International Conference on Machine Learning, Vol. 3.