Intersectionality: Multiple Group Fairness in Expectation Constraints
Group fairness is an important concern for machine learning researchers, developers, and regulators. However, how strictly models must be constrained to be considered fair is still under debate. This work focuses on constraining the expected outcome of subpopulations in kernel regression and, in particular, decision tree regression, with application to random forests, boosted trees and other ensemble models. While individual constraints were previously addressed, this work addresses the incorporation of multiple constraints simultaneously. The proposed solution does not affect the order of computational or memory complexity of the decision trees and is easily integrated into models after training.
Jack Fitzsimons, Machine Learning Research Group, University of Oxford, UK, firstname.lastname@example.org
Michael Osborne, Machine Learning Research Group, University of Oxford, UK, email@example.com
Stephen Roberts, Machine Learning Research Group, University of Oxford, UK, firstname.lastname@example.org
Presented at the Workshop on Ethical, Social and Governance Issues in AI at the 32nd Conference on Neural Information Processing Systems (NIPS 2018). Do not distribute.
1 Introduction
The widespread use of machine learning algorithms and fully autonomous systems has greatly transformed the industrial landscape of the twenty-first century. However, with these great advances comes a responsibility for researchers, developers, and regulators to consider the impact of these systems on the broader society. In 2014, the US presidential administration published a report on big data collection and analysis, finding that “big data technologies can cause societal harms beyond damages to privacy” (Executive Office of the President, 2014). The report raised the concern that algorithmic decisions inferred from big data may have harmful biases, potentially leading to discrimination against disadvantaged groups.
This drive towards ethical practices in machine learning has led to many developments in algorithmic fairness. One such advance has been towards developing algorithms which display group fairness, also referred to as statistical, conditional or demographic parity. From a regulatory viewpoint, group fairness is particularly interesting, as affirmative action policies have already been passed to address discrimination on the basis of caste, race and gender (Weisskopf, 2004; Dumont, 1980; Deshpande, 2017). However, it is worth noting that achieving such fairness may involve a considerable cost in some cases (Corbett-Davies et al., 2017).
In a machine learning context, there have primarily been two approaches towards developing systems which demonstrate group fairness: data alteration, which endeavors to modify the original dataset in order to prevent discrimination between groups (Luong et al., 2011; Kamiran and Calders, 2009), and regularisation, which penalizes models for unfair behavior (Kamishima et al., 2011; Berk et al., 2017; Calders et al., 2013; Calders and Verwer, 2010; Raff et al., 2017).
More recently, there has been an effort towards constraining models such that they prohibit unfair behavior, a stricter assertion than regularisation. This work directly follows Fitzsimons et al. (2018), in which group fairness in expectation for regression models is investigated, defined as:
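Group Fairness in Expectation (GFE): A regressor $f(\cdot): \mathbb{R}^d \to \mathbb{R}$ achieves group fairness in expectation with respect to groups $A$ and $B$, with generative distributions $p_A(x)$ and $p_B(x)$ respectively, iff
$$\mathbb{E}_{x \sim p_A}\left[f(x)\right] - \mathbb{E}_{x \sim p_B}\left[f(x)\right] = \int f(x)\,\big(p_A(x) - p_B(x)\big)\,\mathrm{d}x = 0.$$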
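As a minimal sketch, the GFE condition can be checked empirically by comparing the mean prediction within each group, with the observed samples standing in for the generative distributions. The function name `gfe_gap` and the toy data are illustrative assumptions, not part of the original work.

```python
import numpy as np

def gfe_gap(predictions, in_group_a, in_group_b):
    """Mean prediction over group A minus mean prediction over group B.

    A gap of zero corresponds to group fairness in expectation on the
    empirical distributions of the two groups.
    """
    predictions = np.asarray(predictions, dtype=float)
    a = np.asarray(in_group_a, dtype=bool)
    b = np.asarray(in_group_b, dtype=bool)
    return predictions[a].mean() - predictions[b].mean()

# Toy predictions for six individuals; the first three belong to group A.
preds = np.array([3.0, 5.0, 4.0, 6.0, 8.0, 7.0])
is_a = np.array([True, True, True, False, False, False])

gap = gfe_gap(preds, is_a, ~is_a)  # mean of A is 4.0, mean of B is 7.0
print(gap)
```

A non-zero gap, as here, indicates that the regressor does not satisfy GFE for this pair of groups.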
This work addresses an important issue raised in Dwork and Ilvento (2018): a model that satisfies conditional parity with respect to race and gender independently may fail to satisfy conditional parity with respect to the conjunction of race and gender. In the social science literature, concerns about sub-demographics that are potentially discriminated against are referred to as intersectionality (McCall, 2008). More formally, this work proposes a simple approach to ensure group fairness in expectation across an arbitrary set of subgroups. Applied to decision trees, it is shown that, provided the number of parity conditions is negligible compared to the number of training points, the order of computational and memory complexity is not increased.
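The gap between marginal and intersectional parity can be seen in a four-person toy example. The sketch below uses made-up binary attributes and predictions (all values are illustrative assumptions): parity holds for each attribute on its own, yet the four intersections receive very different expected outcomes.

```python
import numpy as np

# One individual per intersection of two hypothetical binary attributes.
gender = np.array([0, 0, 1, 1])
race = np.array([0, 1, 0, 1])
pred = np.array([1.0, 0.0, 0.0, 1.0])

# Marginal gaps: each attribute considered independently.
gender_gap = pred[gender == 0].mean() - pred[gender == 1].mean()
race_gap = pred[race == 0].mean() - pred[race == 1].mean()

# Expected outcome of every intersection of the two attributes.
intersections = {
    (g, r): pred[(gender == g) & (race == r)].mean()
    for g in (0, 1)
    for r in (0, 1)
}
intersection_spread = max(intersections.values()) - min(intersections.values())

print(gender_gap, race_gap, intersection_spread)
```

Both marginal gaps vanish while the spread across intersections does not, which is the failure mode that constraints on the intersections are meant to rule out.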
2 Constrained Kernel Methods
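As shown in Fitzsimons et al. (2018), kernel regressors may be constrained in terms of their expectation by adding auxiliary noiseless quadrature observations. Take, for instance, a two-dimensional Gaussian distribution over $(x_1, x_2)$. Without loss of generality, let the distribution have zero mean, with correlation $\rho$ and variance $\sigma^2$. With independent, identically distributed observation noise of variance $\sigma_n^2$, we can constrain the values of the expected outcomes by incorporating a noiseless observation on the difference $x_1 - x_2$ as follows,
$$\begin{bmatrix} y_1 \\ y_2 \\ 0 \end{bmatrix} \sim \mathcal{N}\!\left(0,\; K + \sigma_n^2\,\mathrm{diag}(1, 1, 0)\right), \qquad K = \sigma^2 \begin{bmatrix} 1 & \rho & 1-\rho \\ \rho & 1 & -(1-\rho) \\ 1-\rho & -(1-\rho) & 2(1-\rho) \end{bmatrix}.$$
The above covariance matrix, $K$, has rank 2 with one perfectly observed value, namely the mean equality constraint. Thus inference on the two dimensions is constrained such that $\mathbb{E}[x_1] = \mathbb{E}[x_2]$.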
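The two-dimensional case can be reproduced numerically. The following is a sketch under assumed values for the correlation, variance and noise level (the variable names and numbers are illustrative, not taken from the original): the difference of the two latent values is appended as a third, noiseless observation equal to zero, and the posterior means of the two dimensions then coincide.

```python
import numpy as np

rho, sigma2, noise = 0.3, 2.0, 0.5  # assumed illustrative values

# Prior covariance of the two latent values (f1, f2).
K = sigma2 * np.array([[1.0, rho],
                       [rho, 1.0]])

# Observation operator: noisy f1, noisy f2, and the noiseless difference f1 - f2.
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, -1.0]])

prior_obs_cov = H @ K @ H.T                      # rank 2: third row is a combination
obs_cov = prior_obs_cov + np.diag([noise, noise, 0.0])
obs = np.array([1.0, 3.0, 0.0])                  # the constraint is observed as exactly 0

# Standard Gaussian conditioning for the posterior mean of (f1, f2).
post_mean = K @ H.T @ np.linalg.solve(obs_cov, obs)
rank_prior = np.linalg.matrix_rank(prior_obs_cov)

print(rank_prior, post_mean)
```

The noise is added only to the first two diagonal entries, so the difference is treated as perfectly observed.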
As shown visually in Figure 1, this principle can be extended to Gaussian processes and other kernel regression techniques by using the differences in quadrature observations in order to incorporate mean equality constraints. Multiple constraints can also be created by simply adding more columns and rows to the kernel matrix accordingly.
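A sketch of the multi-constraint case for a Gaussian process regressor is given below. It assumes an RBF kernel with unit length scale, synthetic data, and two made-up binary attributes; each constraint is the difference of two empirical group means, and contributes one extra row and column to the kernel matrix. None of these choices come from the original experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
x = rng.uniform(-3.0, 3.0, size=n)
y = np.sin(x) + 0.1 * rng.standard_normal(n)

# Two hypothetical binary attributes, chosen deterministically for the demo.
attr_1 = np.arange(n) % 2 == 0
attr_2 = np.arange(n) < n // 2

def mean_difference_weights(mask):
    """Weights w such that w @ f equals mean(f[mask]) - mean(f[~mask])."""
    return mask / mask.sum() - (~mask) / (~mask).sum()

W = np.column_stack([mean_difference_weights(attr_1),
                     mean_difference_weights(attr_2)])  # shape (n, 2)

K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)       # RBF kernel matrix
noise = 0.1

# Each constraint adds one row and one column to the kernel matrix.
augmented = np.block([[K + noise * np.eye(n), K @ W],
                      [W.T @ K, W.T @ K @ W]])
targets = np.concatenate([y, np.zeros(W.shape[1])])     # constraints observed as 0

alpha = np.linalg.solve(augmented, targets)
posterior_mean = np.hstack([K, K @ W]) @ alpha          # mean at the training inputs
gaps = W.T @ posterior_mean                             # one gap per constraint

print(gaps)
```

Both gaps are zero up to numerical precision, so the posterior mean satisfies the two mean equality constraints on the training sample simultaneously.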
3 Constrained Decision Trees
While the constrained kernel inference above is interesting, the widespread impact comes into play when we extend the result to decision tree regression, random forests, boosted trees and other ensemble techniques. This is not to dismiss the importance or value of kernel methods more broadly; rather, it reflects the popularity of tree-based methods amongst data scientists, a profession more common than machine learning researchers (Wu et al., 2008).
While decision trees can be represented in either compressed or explicit kernel representation (Fitzsimons et al., 2018), for the sake of conciseness this work presents results only for the compressed representation. Thus, we will endeavor to minimize the perturbations induced on a per-leaf basis, irrespective of the number of data points per leaf. The core difference between single and multiple constraints is that we can no longer use the arrowhead matrix lemma; instead, we must work out the update using the block matrix inversion lemma. Importantly, the group distributions $p_A$ and $p_B$ for each constraint are defined as the empirical distributions of each subgroup considered. This is an important point, as small subgroups may have empirical distributions which are not good approximations of the true generative distributions, and hence our constrained space for inference may not constrain predictions to equate accordingly.
The kernel function in the compressed representation is simply the identity matrix, $K = I$; that is to say, the leaves are independent of one another. The kernel regression equation can be denoted as,
where the first entries of the test covariance vector, one per leaf, indicate to which leaf the test point belongs and the remaining entries are set to zero, as the point to predict will not contribute to the empirical distributions of the subgroups under an inductive learning paradigm.
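In the compressed representation, each constraint reduces to one number per leaf: the fraction of one subgroup falling in that leaf minus the fraction of the other. A minimal sketch follows; the helper name `constraint_row` and the toy leaf assignments are assumptions made for illustration.

```python
import numpy as np

def constraint_row(leaf_ids, in_a, in_b, n_leaves):
    """Difference of the empirical leaf distributions of two subgroups.

    Entry l is (share of group A in leaf l) - (share of group B in leaf l).
    """
    p_a = np.bincount(leaf_ids[in_a], minlength=n_leaves) / in_a.sum()
    p_b = np.bincount(leaf_ids[in_b], minlength=n_leaves) / in_b.sum()
    return p_a - p_b

# Leaf index assigned by a (hypothetical) fitted tree to eight training points.
leaf_ids = np.array([0, 0, 1, 2, 2, 2, 1, 0])
in_a = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)

z = constraint_row(leaf_ids, in_a, ~in_a, n_leaves=3)

# The inner product with the leaf predictions is the gap in expected outcome.
leaf_means = np.array([10.0, 20.0, 30.0])
expected_gap = z @ leaf_means

print(z, expected_gap)
```

Because the row is built from empirical counts, a subgroup with only a handful of members yields a noisy row, which is the small-subgroup caveat raised above.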
Using the block matrix inversion lemma we find,
By simply inserting this into the kernel regression equation and noting that the elements of are necessarily zero, the following update to the expected mean can be found as,
with indicating the row of relating to the difference of subgroup distributions on leaf . The effect of the noise can be removed by post multiplying by .
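The effect of such an update can be sketched numerically as a projection. The snippet below is not the closed-form expression of this section: it assumes the noise-free limit and takes the constrained leaf means to be the closest vector, in the least-squares sense, that satisfies every constraint. The function name, the orientation of `Z` (one row per constraint, one column per leaf) and the toy numbers are assumptions. A pseudo-inverse is used so that linearly dependent constraints, such as marginals implied by their intersections, do not cause a singular system.

```python
import numpy as np

def constrain_leaf_means(leaf_means, Z):
    """Closest leaf means (per-leaf least squares) such that Z @ mu == 0.

    Z has one row per constraint and one column per leaf.
    """
    leaf_means = np.asarray(leaf_means, dtype=float)
    Z = np.atleast_2d(np.asarray(Z, dtype=float))
    # Component of the leaf means lying in the span of the constraint rows.
    correction = np.linalg.pinv(Z, rcond=1e-10) @ (Z @ leaf_means)
    return leaf_means - correction

leaf_means = np.array([10.0, 20.0, 30.0, 40.0])
Z = np.array([
    [0.25, 0.0, -0.25, 0.0],   # first pair of subgroups
    [0.0, 0.5, 0.0, -0.5],     # second pair of subgroups
    [0.25, 0.5, -0.25, -0.5],  # redundant: sum of the two rows above
])

fair_means = constrain_leaf_means(leaf_means, Z)
print(fair_means)   # [20. 30. 20. 30.]
print(Z @ fair_means)
```

The cost is dominated by the pseudo-inverse of a matrix with as many rows as constraints, so it stays small when constraints are few relative to leaves. Since the constraints are linear in the leaf values, applying the same correction to every tree of an averaging ensemble should preserve them in the averaged prediction.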
4 Experiments
4.1 ProPublica & the COMPAS System
The first experiment reproduces the experiment in Fitzsimons et al. (2018), which uses a random forest to estimate the recidivism decile scores of the COMPAS algorithm applied to the ProPublica dataset while adding a GFE constraint between African Americans and non-African Americans. It can also be noted that Hispanic individuals are subject to similar discrimination. Figure 2 visualizes the effect of GFE constraints on the predicted distributions of the three demographics.
4.2 Illinois State Employee Salaries
The Illinois state employee salaries since 2011 (https://data.illinois.gov/datastore/dump/1a0cd05c-7d17-4e3d-938d-c2bfa2a4a0b1) can be seen to have a gender bias and a bias between veterans and non-veterans. The motivation for this experiment is the case where one wishes to predict a fair salary for future employees based on current staff. Gender labels were inferred from the employees’ first names using the gender-guesser Python library. GFE constraints were applied between all intersections of gender and veteran status, the marginals of gender, and the marginals of veteran status. Table 1 shows the expected outcome of each group before and after GFE constraints are applied, and Figure 3 visualizes the perturbations to the marginals of each demographic intersection due to the GFE constraints. The train-test split was 80%-20%, and incorporating the GFE constraints increased the root mean squared error from $12,086 to $12,772 (roughly 5.7%), the cost of fairness.
| Group | Female Non-Vet. | Male Non-Vet. | Female Vet. | Male Vet. | Male | Female | Vet. | Non-Vet. |
5 Conclusion
Regulatory bodies have shown precedent in developing affirmative action and other group fairness policies. This work extends previous efforts to develop group fairness constrained machine learning techniques. While relatively simple to understand and easy to incorporate into models used by practitioners, the methodology of this paper has a direct impact on four of the top ten data science algorithms according to Wu et al. (2008). All source code used in the experiments is available at https://github.com/OxfordML/Fair_Regression.git.
References
-  R. Berk, H. Heidari, S. Jabbari, M. Joseph, M. Kearns, J. Morgenstern, S. Neel, and A. Roth. A convex framework for fair regression. arXiv preprint arXiv:1706.02409, 2017.
-  T. Calders, A. Karim, F. Kamiran, W. Ali, and X. Zhang. Controlling attribute effect in linear regression. In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pages 71–80. IEEE, 2013.
-  T. Calders and S. Verwer. Three naive Bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 21(2):277–292, 2010.
-  S. Corbett-Davies, E. Pierson, A. Feller, S. Goel, and A. Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 797–806. ACM, 2017.
-  A. Deshpande. Affirmative action in India. In Race and Inequality, pages 77–90. Routledge, 2017.
-  L. Dumont. Homo hierarchicus: The caste system and its implications. University of Chicago Press, 1980.
-  C. Dwork and C. Ilvento. Group fairness under composition. FATML, 2018.
-  J. Fitzsimons, A. Al Ali, M. Osborne, and S. Roberts. Group fairness under composition. arXiv preprint arXiv:1810.05041, 2018.
-  F. Kamiran and T. Calders. Classifying without discriminating. In Computer, Control and Communication, 2009. IC4 2009. 2nd International Conference on, pages 1–6. IEEE, 2009.
-  T. Kamishima, S. Akaho, and J. Sakuma. Fairness-aware learning through regularization approach. In Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, pages 643–650. IEEE, 2011.
-  B. T. Luong, S. Ruggieri, and F. Turini. k-nn as an implementation of situation testing for discrimination discovery and prevention. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 502–510. ACM, 2011.
-  L. McCall. The complexity of intersectionality. In Intersectionality and Beyond, pages 65–92. Routledge-Cavendish, 2008.
-  E. Raff, J. Sylvester, and S. Mills. Fair forests: Regularized tree induction to minimize model bias. arXiv preprint arXiv:1712.08197, 2017.
-  United States. Executive Office of the President and J. Podesta. Big data: Seizing opportunities, preserving values. White House, Executive Office of the President, 2014.
-  T. E. Weisskopf. Affirmative action in the United States and India: A comparative perspective. London and New York: Routledge, 2004.
-  X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, et al. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1–37, 2008.