We investigate the problem of algorithmic fairness in the case where sensitive and non-sensitive features are available and one aims to generate new, ‘oblivious’, features that closely approximate the non-sensitive features, and are only minimally dependent on the sensitive ones. We study this question in the context of kernel methods. We analyze a relaxed version of the Maximum Mean Discrepancy criterion which does not guarantee full independence but makes the optimization problem tractable. We derive a closed-form solution for this relaxed optimization problem and complement the result with a study of the dependencies between the newly generated features and the sensitive ones. Our key ingredient for generating such oblivious features is a Hilbert-space-valued conditional expectation, which needs to be estimated from data. We propose a plug-in approach and demonstrate how the estimation errors can be controlled. Our theoretical results are accompanied by experimental evaluations.
Oblivious Data
Machine learning algorithms trained on historical data may inherit implicit biases which can in turn lead to potentially unfair outcomes for some individuals or minority groups. For instance, gender bias may be present in a historical dataset on which a model is trained to automate the postgraduate admission process at a university. This may in turn render the algorithm biased, leading it to inadvertently generate unfair decisions. In recent years, a large body of work has been dedicated to systematically addressing this problem, whereby various notions of fairness have been considered; see, e.g., (Calders et al., 2009; R. Zemel and Dwork, 2013; Louizos et al., 2015; Hardt et al., 2016; M. Joseph and Roth, 2016; N. Kilbertus and Schölkopf, 2017; M. J. Kusner and Silva, 2017; F. Calmon and Varshney, 2017; Zafar et al., 2017; Kleinberg et al., 2017; Donini et al., 2018; Madras et al., 2018), and references therein. Among the several algorithmic fairness criteria, one important objective is to ensure that a model’s prediction is not influenced by the presence of sensitive information in the data.
In this paper, we address this objective from the perspective of (fair) representation learning. Thus, a central question which forms the basis of our work is as follows.
Can the observed features be replaced by close approximations that are independent of the sensitive ones?
More formally, assume that we have a dataset such that each data-point is a realization of a random variable where and are in turn vector-valued random variables corresponding to the sensitive and non-sensitive features respectively. We further allow and to be arbitrarily dependent, and ask whether it is possible to generate a new random variable which is ideally independent of and close to in some meaningful probabilistic sense. As an initial step, we may assume that is zero-mean, and aim for decorrelation between and . This can be achieved by letting where is the conditional expectation of given . The random variable so-defined is not correlated with and is close to . In particular, it recovers if and are independent. In fact, under mild assumptions, gives the best approximation (in the mean-squared sense) of , while being uncorrelated with . Observe that while the distribution of differs from that of , this new random variable seems to serve the purpose well. For instance, if corresponds to a subject’s gender and to a subject’s height, then corresponds to height of the subject centered around the average height of the class corresponding to the subject’s gender.
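The linear decorrelation step can be illustrated numerically. The following is a minimal sketch on synthetic data, assuming a binary sensitive feature and using within-group means as the estimate of the conditional expectation; the variable names and the data-generating choices are hypothetical and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: S is a binary sensitive feature and X depends on S.
s = rng.integers(0, 2, size=n)
x = 165.0 + 12.0 * s + rng.normal(0.0, 7.0, size=n)

# Plug-in estimate of E[X | S]: the mean of X within each group of S.
group_means = np.array([x[s == g].mean() for g in (0, 1)])
cond_exp = group_means[s]

# New feature: X centred around the mean of its own group.
z = x - cond_exp

corr_before = np.corrcoef(x, s)[0, 1]
corr_after = np.corrcoef(z, s)[0, 1]
print(f"corr(X, S) = {corr_before:.3f}, corr(Z, S) = {corr_after:.3e}")
```

Because the new feature has sample mean zero within each group, its sample correlation with the sensitive feature vanishes up to floating-point error. This says nothing about higher-order dependencies, which is the motivation for the move to Hilbert spaces.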
Building upon this intuition, and using results inspired by testing for independence using the Maximum Mean Discrepancy (MMD) criterion (see e.g. Gretton et al. (2008)), we obtain a related optimization problem in which and are replaced with Hilbert-space-valued random variables and Hilbert-space-valued conditional expectations. While the move to Hilbert spaces does not enforce complete independence between the new features and the sensitive features, it helps to significantly reduce the dependencies between the features. The new features have various useful properties which we explore in this paper. They are also easy to generate from samples . The main challenge in generating the oblivious features is that we do not have access to the Hilbert-space-valued conditional expectation and need to estimate it from data. Since we are concerned with Reproducing Kernel Hilbert Spaces (RKHSs) here, we use the reproducing property to extend the plug-in approach of Grünewälder (2018) to the RKHS setting and tackle the estimation problem. We further show how estimation errors can be controlled. Having obtained the empirical estimates of the conditional expectations, we generate oblivious features and an oblivious kernel matrix to be used as input to any kernel method. This guarantees a significant reduction in the dependence between the predictions and the sensitive features. Our main contributions are as follows.
We cast the objective of finding oblivious features which approximate the original features well while maintaining minimal dependence on the sensitive features , as a constrained optimization problem.
Making use of Hilbert-space-valued conditional expectations, we provide a closed-form solution to the proposed optimization problem. Specifically, we first prove in Section 5.2 that our solution satisfies the constraint of the optimization problem at hand, and then show in Section 5.3 that it is indeed optimal.
Through Proposition 1 we relate the strength of the dependencies between and to how close lies to the low-dimensional manifold corresponding to the image under the feature map . This result is key in providing some insight into the interplay between probabilistic independence and approximations in the Hilbert space.
We extend known estimators for real-valued conditional expectations to estimate those taking values in a Hilbert space, and show via Proposition 3 how to control their estimation errors. This result in itself may be of independent interest in future research concerning Hilbert-space-valued conditional expectations.
We provide a method to generate oblivious features and the oblivious kernel matrix which can be used instead of the kernel matrix to reduce the dependence of the prediction on the sensitive features; the computational complexity of the approach is .
While the key contributions of this work are theoretical, we also provide an evaluation of the proposed approach through examples and some experiments.
Among the vast literature on algorithmic fairness, Donini et al. (2018); Madras et al. (2018), which fit into the larger body of work on fair representation learning, are closest to our approach. Madras et al. (2018) describe a general framework for fair representation learning. The approach taken is inspired by generative adversarial networks and is based on a game played between generative models and adversarial evaluations. Depending on which function classes one considers for the generative models and for the adversarial evaluations one can describe a vast array of approaches. Interestingly, it is possible to interpret our approach in this general context: the encoder corresponds to a map from and to , where our new features live. We do not have a decoder but compare features directly (one could also take our decoder to be the identity map). Our adversary differs from that used by Madras et al. (2018). In their approach a regressor is inferred which maps the features to the sensitive features, while we compare sensitive features and new features by applying test functions to them. The regression approach performs well in their context because they only consider finitely many sensitive features. In the more general framework considered in the present paper, where the sensitive features are allowed to take on continuous values, this approach would be sub-optimal since it cannot capture all dependencies. Finally, we ignore labels when inferring new features. It is also worth pointing out that our approach is not based on a game played between generative models and an adversary; instead, we provide closed-form solutions.
On the other hand, while the focus of Donini et al. (2018) is mostly on empirical risk minimization under fairness constraints, the authors briefly discuss representation learning for fairness as well. In particular, Equation (13) in the reference paper effectively describes a conditional expectation in Hilbert space, though it is not denoted or motivated as such. The conditional expectation is based on the binary features only and the construction is applied in the linear kernel context to derive new features. The authors do not go beyond the linear case for representation learning but there is a clear link to the more general notions of conditional expectation on which we base our work.
The rest of the paper is organized as follows. In Section 2 we introduce our notation and provide preliminary definitions used in the paper. Our problem formulation and optimization objective are stated in Section 3. As part of the formulation we also define the notion of -independence between Hilbert-space-valued features and the sensitive features. In Section 4 we study the relation between -independence and bounds on the dependencies between oblivious and sensitive features. In Section 5 we provide a solution to the optimization objective. In Section 6 we derive an estimator for the conditional expectation and use it to generate oblivious features and the oblivious kernel matrix. We provide some examples and empirical evaluations in Section 7. We conclude in Section 8 with a discussion of the results and future directions.
In this section we introduce some notation and basic definitions. Consider a probability space . For any we let be the indicator function such that if and only if . Let be a measurable space in which a random variable takes values. We denote by the -algebra generated by . Let be an RKHS composed of functions and denote its feature map by where, for some positive definite kernel . As follows from the reproducing kernel property of we have for all . Moreover, observe that is in turn a random variable attaining values in . In Appendix A we provide some technical details concerning Hilbert-space-valued random variables such as .
Let be a random variable taking values in a measurable space . For the random variable defined above, we denote by the random variable corresponding to Kolmogorov’s conditional expectation of given , i.e. , see, e.g., (Shiryaev, 1989). Recall that in the special case where we simply have
where is the familiar conditional expectation of given the event for . Thus, in this case, the random variable is equal to if attains value and is equal to otherwise. Note that the above example is for illustration only, and that and may be arbitrary random variables: they are not required to be binary or discrete-valued. Unless otherwise stated, in this paper we use Kolmogorov’s notion of conditional expectation. We will also be concerned with conditional expectations that attain values in a Hilbert space , which mostly behave like real-valued conditional expectations (see Pisier (2016) and Appendix B for details). Next, we introduce Hilbert-space-valued -spaces which play a prominent role in our results.
For a Hilbert space , we denote by the -valued space. If is an RKHS with a bounded and measurable kernel function then is an element of . The space consists of all (Bochner)-measurable functions from to such that (see Appendix A for more details). We call these functions random variables or Hilbert-space-valued random variables and denote them with bold capital letters. As in the scalar case we have a corresponding space of equivalence classes which we denote by . For we use for the corresponding equivalence classes in . The space is itself a Hilbert space with norm and inner product given by and , where we use a subscript to distinguish this norm and inner product from the ones from . The norm and inner product have a corresponding pseudo-norm and bilinear form acting on and we also denote these by and .
3 Problem Formulation
We formulate the problem as follows. Given two random variables and corresponding to non-sensitive and sensitive features in a dataset, we wish to devise a random variable which is independent of and closely approximates in the sense that for all we have,
Dependencies between random variables can be very subtle and difficult to detect. Similarly, completely removing the dependence of on without changing drastically is an intricate task that is rife with difficulties. Thus, we aim for a more tractable objective, described below, which still gives us control over the dependencies.
We start with a strategic shift from probabilistic concepts to interactions between functions and random variables. Consider the RKHS of functions with feature map as introduced in Section 2, and assume that is large enough to allow for the approximation of arbitrary indicator functions in the -pseudo-norm for any -valued random variable . Observe that if
for all then and are, indeed, independent.
This is because and can be used to approximate arbitrary indicator functions, which together with (2) gives,
This means that the independence constraint of the optimization problem of (1) translates to (2). Note that using RKHS elements as test functions is a common approach for detecting dependencies and is used in the MMD-criterion (e.g. Gretton et al. (2008)).
On the other hand, due to the reproducing property of the kernel of , we can also rewrite the constraint (2) as
Observe that is a random variable that attains values in an arbitrary low-dimensional manifold; the image of under is visualized as the blue curve in Figure 1. Therefore, while Equation (3) is linear in , depending on the shape of the manifold, it can lead to an arbitrarily complex optimization problem.
We propose to relax (3) by moving away from the manifold, replacing with a random variable which potentially has all of as its range. This simplifies the original optimization problem to one over a vector space under a linear constraint. To formalize the problem, we rely on a notion of -independence introduced below.
Definition 1 (-Independence).
We say that and are -independent if and only if for all and all bounded measurable it holds that,
Thus, instead of solving for in (1), we seek a solution to the following optimization problem.
Find that is -independent from (in the sense of Definition 1) and is close to in the sense that
for all which are also -independent of .
If lies in the image of and is a ‘large’ RKHS then -independence also implies complete independence between the estimator and . To see this, assume that there exists a random variable such that and that the RKHS is characteristic. Since for any and bounded measurable
we can deduce that and are independent. Moreover, since is a function of it is also independent of . In general, cannot be represented as some and there can be dependencies between and . In Section 4 below we generalize the above argument to bound the dependence between and depending on how well can be approximated by , for some appropriately chosen to minimize the distance between and .
4 Bounds on the dependence between &
A common approach to quantifying a measure of dependence between random variables is to consider
where and run over suitable families of events. In our setting, these families are the -algebras and , and the difference between and , , quantifies the dependency between the random variables and . Upper bounds on the absolute difference of these two quantities, which are independent of and , correspond to the notion of -dependence which underlies -mixing. In time-series analysis, mixing conditions like -mixing play a significant role since they provide means to control temporal dependencies (see, e.g., (Bradley, 2007; Doukhan, 1994)). We present Proposition 1, which gives a bound on the dependence between and .
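To make this kind of dependence measure concrete, the following sketch estimates the largest absolute difference between a joint probability and the product of the marginals from a sample. It is an illustration only: the search is restricted to threshold events on a quantile grid, which is an assumption of the sketch and yields an estimate of a lower bound on the supremum over the full families of events. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def alpha_dependence_thresholds(u, v, n_grid=25):
    """Largest |P_n(A and B) - P_n(A) P_n(B)| over threshold events.

    Only events A = {u <= a} and B = {v <= b}, with a and b on a quantile
    grid, are searched; richer families of events can only increase the value.
    """
    qs = np.linspace(0.05, 0.95, n_grid)
    a_vals, b_vals = np.quantile(u, qs), np.quantile(v, qs)
    in_a = (u[:, None] <= a_vals[None, :]).astype(float)  # shape (n, n_grid)
    in_b = (v[:, None] <= b_vals[None, :]).astype(float)
    joint = in_a.T @ in_b / len(u)                         # P_n(A and B)
    product = np.outer(in_a.mean(axis=0), in_b.mean(axis=0))
    return np.abs(joint - product).max()

rng = np.random.default_rng(5)
n = 20_000
s = rng.normal(size=n)
x_dep = 0.8 * s + 0.6 * rng.normal(size=n)   # correlated with s
x_ind = rng.normal(size=n)                   # independent of s

dep = alpha_dependence_thresholds(x_dep, s)
indep = alpha_dependence_thresholds(x_ind, s)
print(f"dependent pair: {dep:.3f}, independent pair: {indep:.3f}")
```

For the independent pair the value reflects sampling noise only, whereas for the correlated pair it stays bounded away from zero.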
To improve readability, we summarize the notation used in the proposition statement below and give an intuitive exposition before stating the result.
For fix constants , a function , and a random variable with , such that
for all there exists an with
In words, for a given , we first specify some whose dependence on is controlled by some in that any event can be coupled with some event such that . Note that such a always exists, e.g. it could be a function of in which case the condition would be trivially satisfied. Next, we let denote an upper bound on the error (as measured by the Hilbert-space-valued -norm) of approximating by some appropriate (translation of) . Observe that the error could be arbitrarily large, as we do not require to be particularly small. The result stated below bounds the dependence between and as a function of , , and the size and approximation capacity of .
Suppose that is separable and its feature map satisfies . Consider some that is -independent from . Let
and let where , and are specified by Notation 1. For any and it holds that
The proof is provided in Appendix B.3. ∎
A key factor in the bound is given by (16) which measures how well indicator functions , , can be approximated using RKHS functions acting on the random variable when we penalize with the norm of the RKHS function. The penalization is scaled by the bound on the -norm between the random variables and . The ‘size’ of the RKHS also factors into the bound. When the RKHS is ‘small’ then not many indicator functions , can be approximated well and can be large. On the other hand, if the RKHS lies dense in a certain space, then any relevant indicator can in principle be approximated arbitrarily well. This does not mean that will be small, since the norm of the element that approximates the indicator might be large. But the approximation error, which is in the proposition, can be made arbitrarily small. See also Remark 1 in the Appendix.
Intuitively, as visualized in Figure 1, the proposition states that if mostly attains values in the gray area then the dependence between and is low.
5 Best -independent features
In this section we discuss how to obtain as a closed-form solution to Problem 1. To this end, inspired by the sub-problem in the linear case, we obtain in Section 5.1 using Hilbert-space-valued conditional expectations. In Sections 5.2 and 5.3 we show, respectively, that these features are -independent of and that is the best -independent approximation of .
5.1 Specification of the oblivious features
In the linear case discussed in the Introduction it turned out that is a good candidate for the new features . In the Hilbert-space-valued case a similar result holds. The main difference here is that we do have to work with Hilbert-space-valued conditional expectations. For any random variable , and any -subalgebra of , conditional expectation is defined and is again an element of . We are particularly interested in conditioning with respect to the sensitive random variable . In this case, is chosen as , the smallest -subalgebra which makes measurable, and we denote this conditional expectation by . In the following, we use the notation . A natural choice for the new features is
The expectation is to be interpreted as the Bochner-integral of given measure . Importantly, if and are independent, we have with this choice that and we are back to the standard kernel setting. Also, if then so is .
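For orientation, the construction described in this subsection can be written out explicitly. The symbols used here ($\phi$ for the feature map, $X$ and $S$ for the non-sensitive and sensitive features, $\mathbf{Z}$ for the new features, and $E^{S}$ for conditioning on the $\sigma$-algebra generated by $S$) are assumed notation, and the display is a reconstruction from the surrounding text rather than a verbatim formula.

```latex
\mathbf{Z} \;=\; \phi(X) \;-\; E^{S}\phi(X) \;+\; E(\phi(X)).
```

Under this reading, if $X$ and $S$ are independent then $E^{S}\phi(X) = E(\phi(X))$ almost surely, the last two terms cancel, and $\mathbf{Z} = \phi(X)$, in line with the remark that one is then back in the standard kernel setting.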
5.2 and are -independent
We can verify that the features are, in fact, -independent of . In particular, for any and ,
Since is a constant this implies that A similar argument shows that . Thus, is -independent of .
In Figure 3 the effect of the move from to is visualized. In the figure is plotted against and (blue dots), where corresponds to the quadratic function and to the sine function. The dependencies between and , as well as and , are high and there is a clear trend in the data. The two red curves correspond to the best regression functions, using to predict and . The relation between the new features and is shown in the other two plots (gray dots). In the case of one can observe that the dependence between and is much smaller and, by the design of , and are uncorrelated. The same holds for , although here the dependence on seems to be even lower and it is difficult to visually verify any remaining dependence between and .
An interesting aspect of this transformation from to is that is automatically uncorrelated with for all functions in the corresponding RKHS, without the need to ever explicitly consider a particular .
5.3 is the best -independent approximation
Besides being -independent of these new features also closely approximate our original features if the influence from is not too strong, i.e. the mean squared distance is which is equal to zero if is independent of . In fact, is the best approximation of in the mean squared sense under the -independence constraint. This is essentially a property of the conditional expectation which corresponds to an orthogonal projection in . We summarize this property in the following result.
Given such that is -independent of , then
where . Furthermore, is the unique minimizer (up to almost sure equivalence).
The proof is provided in Appendix B.4. ∎
When replacing by we lose information (we reduce the influence of the sensitive features). An interesting question to ask is, ‘how much does the reduction in information change our predictions?’ A simple way to bound the difference in predictions is as follows. Consider any , for instance corresponding to a regression function, then
where effectively measures the influence of . Hence, the difference in prediction is upper bounded by the product of the norm of the predictor (here ) and a quantity that measures the dependence between and .
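The reasoning behind this bound can be spelled out in one line. The sketch below assumes the notation $\phi$, $X$, $S$, $\mathbf{Z}$, $E^{S}$ and takes the new features to be $\mathbf{Z} = \phi(X) - E^{S}\phi(X) + E(\phi(X))$; it combines the reproducing property $h(X) = \langle h, \phi(X) \rangle$ with the Cauchy–Schwarz inequality and is meant as an illustration rather than as the paper's exact display.

```latex
\bigl| h(X) - \langle h, \mathbf{Z} \rangle \bigr|
  = \bigl| \langle h, \phi(X) - \mathbf{Z} \rangle \bigr|
  = \bigl| \langle h, E^{S}\phi(X) - E(\phi(X)) \rangle \bigr|
  \le \|h\| \, \bigl\| E^{S}\phi(X) - E(\phi(X)) \bigr\| .
```

The second factor vanishes when $X$ and $S$ are independent, in which case the predictions are unchanged.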
6 Generating oblivious features from data
To be able to generate the features we first need to estimate the conditional expectation from data. To this end, we devise a plug-in approach based on an extension of the method in (Grünewälder, 2018). After introducing this approach in Section 6.1 we show in Section 6.2 how the oblivious features can be generated and we introduce the oblivious kernel matrix. In Section 6.3 we discuss how the estimation errors of the plug-in estimator can be controlled. Finally, in Section 6.4, we demonstrate how the approach can be used for statistical problems.
6.1 Plug-in estimator
A common method for estimation is the plug-in approach whereby an unknown probability measure is replaced by the empirical measure. This approach is used in (Grünewälder, 2018) for deriving estimators of conditional expectations. To see how the approach can be generalized to our setting, first observe that we can write
where is a Bochner-measurable function (see Appendix A and Lemma 2 for details). Our aim is to estimate this function from i.i.d. observations . For any subset of the range space of the sensitive features define the empirical measure where is the Dirac measure with mass one at location . We define an estimate of the conditional expectation of given that the sensitive variable falls into a set by
when and through otherwise. Observe that for we have,
We can also write this as . An estimate of the conditional expectation given is provided by
where is a finite partition of the range space of . A common choice for , if is the hypercube , , is the family of dyadic sets. Observe that we can move inner products inside the conditional expectation so that .
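The partition-based plug-in estimate can be sketched for a scalar function of the non-sensitive features. The example below assumes a one-dimensional sensitive feature in the unit interval, a dyadic partition of a fixed depth, and the convention that the estimate is zero on cells containing no sample; the function name, the argument names, and the synthetic data are hypothetical.

```python
import numpy as np

def plugin_conditional_mean(h_x, s, s_query, depth=3):
    """Plug-in estimate of E[h(X) | S] on dyadic cells of [0, 1].

    h_x     : values h(X_i) for the sample, shape (n,)
    s       : sensitive features S_i in [0, 1], shape (n,)
    s_query : points at which the estimate is evaluated
    depth   : the partition consists of 2**depth cells of equal width
    """
    n_cells = 2 ** depth
    # Index of the dyadic cell containing each point (1.0 falls in the last cell).
    cell = np.minimum((s * n_cells).astype(int), n_cells - 1)
    cell_query = np.minimum((np.asarray(s_query) * n_cells).astype(int), n_cells - 1)

    sums = np.bincount(cell, weights=h_x, minlength=n_cells)
    counts = np.bincount(cell, minlength=n_cells)
    # Assumption of this sketch: the estimate is 0 on cells without samples.
    means = np.divide(sums, counts, out=np.zeros(n_cells), where=counts > 0)
    return means[cell_query]

rng = np.random.default_rng(1)
s = rng.uniform(0.0, 1.0, size=5_000)
x = s + rng.normal(0.0, 0.1, size=5_000)   # here E[X | S] = S
est = plugin_conditional_mean(x, s, s_query=[0.1, 0.5, 0.9], depth=3)
print(est)
```

With the identity as test function, each returned value approximates the average of the sensitive feature over the cell containing the query point, i.e. roughly the midpoint of that cell.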
6.2 Generating an oblivious random variable
We consider a simple approach where we split our data into two equal parts of size . We use the second observations to infer the conditional expectation and and . We use the remaining observations to generate oblivious features through Most kernel methods work with the kernel matrix and do not need access to the observations themselves. The same holds in the oblivious case. Instead of the original kernel matrix algorithms use the oblivious kernel matrix, i.e.
The matrix is positive semi-definite since for any . Importantly, the oblivious kernel matrix can be calculated by using kernel evaluations and we never need to represent explicitly in the Hilbert space. The complexity to compute the matrix is . See Appendix D for details on the algorithm.
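The claim that the oblivious kernel matrix only requires kernel evaluations can be illustrated as follows. The sketch assumes that each oblivious feature is the feature-map image of the observation, minus the cell-wise plug-in estimate of the conditional expectation, plus the empirical mean, with both estimates computed on a held-out half of the data. Each oblivious feature is then a finite linear combination of feature-map images, and all inner products reduce to Gram matrices. The function names, the Gaussian kernel, the four-cell partition, and the zero convention for empty cells are hypothetical choices of this sketch.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def oblivious_kernel_matrix(x, s_cell, x_est, s_cell_est, kernel=gaussian_kernel):
    """Gram matrix of Z_i = phi(X_i) - E_n[phi(X) | S_i] + E_n[phi(X)].

    x, s_cell         : features and partition-cell indices used to build Z
    x_est, s_cell_est : held-out half used to estimate both expectations
    """
    m = len(x_est)
    same_cell = (s_cell[:, None] == s_cell_est[None, :]).astype(float)
    counts = same_cell.sum(axis=1, keepdims=True)
    # Weights of the conditional-mean estimate; zero if the cell is empty
    # in the estimation half (an assumption of this sketch).
    w_cond = np.divide(same_cell, counts, out=np.zeros_like(same_cell),
                       where=counts > 0)
    # Z_i = phi(X_i) + sum_j coef[i, j] * phi(x_est[j])
    coef = 1.0 / m - w_cond

    k_xx = kernel(x, x)
    k_xe = kernel(x, x_est)
    k_ee = kernel(x_est, x_est)
    return k_xx + k_xe @ coef.T + coef @ k_xe.T + coef @ k_ee @ coef.T

rng = np.random.default_rng(2)
n = 200
s = rng.uniform(0.0, 1.0, size=2 * n)
x = (s + rng.normal(0.0, 0.3, size=2 * n))[:, None]
cells = np.minimum((s * 4).astype(int), 3)   # four dyadic cells of [0, 1]

# First half generates the features, second half estimates the expectations.
O = oblivious_kernel_matrix(x[:n], cells[:n], x[n:], cells[n:])
print(O.shape, np.linalg.eigvalsh(O).min())
```

The resulting matrix is a Gram matrix of Hilbert-space elements, so it is symmetric and positive semi-definite up to numerical error, and it can be passed to a kernel method in place of the usual kernel matrix.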
6.3 Controlling the estimation error
The estimation error when estimating using is relatively easy to control thanks to the plug-in approach. Essentially, standard results concerning the empirical measure carry over to conditional expectation estimates in the real-valued case (Grünewälder, 2018). Through scalarization, we can transfer some of these results directly to the Hilbert-space-valued case. For instance,
and bounds on the latter term are known. Similarly,
However, both and are random variables and a useful measure of their difference is the -pseudo-norm. This -pseudo-norm should in this case not be taken with respect to itself but conditional on the training sample. Hence, for i.i.d. pairs let and define the ‘conditional’ -pseudo-norm by
Together with Equation (8) we obtain,
The supremum cannot be taken out of the conditional expectation; however, by writing and as simple functions (see Appendix A.1) we can get around this difficulty and control the error in . The following proposition demonstrates this by showing that the rate of convergence of the estimator is , which is optimal.
Given a continuous kernel function acting on a compact set , sensitive features which attain only finitely many values, and independent observations , it holds that
The proof is given in Appendix B.5. ∎
6.4 Oblivious ridge regression
In this section we discuss how this approach can be combined with kernel methods. We showcase this in the context of kernel ridge regression. We have three relevant random variables, namely, the non-sensitive features , the sensitive features and labels which are real valued. We assume that we have i.i.d. observations . We use the observations to generate the oblivious random variables and then use oblivious data for oblivious ridge regression (ORR).
The ORR problem has the following form. Given a positive definite kernel function , a corresponding RKHS and oblivious features , our aim is to find a regression function such that the mean squared error between and is small. Replacing the mean squared error by the empirical least-squares error and adding a regularization term for gives us the optimization problem
where is the regularization parameter.
It is easy to see that the setting is not substantially different from standard kernel ridge regression, and a closed-form solution for can be derived. More specifically, we have a representer theorem in this setting which tells us that the minimizer lies in the span of . One can then solve the optimization problem in the same way as for standard kernel ridge regression; see Appendix C for details. The solution to the optimization problem is , where . The vector is given by . Predicting for a new observation is achieved by first generating the oblivious features and then by evaluating
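The ridge-regression step can be sketched as follows. The example assumes the regularized empirical least-squares objective is normalized by the sample size, so that the coefficient vector solves a linear system in the oblivious kernel matrix plus a scaled identity; the exact scaling, the function names, and the synthetic data are assumptions of this sketch. For brevity it uses a linear kernel and a binary sensitive feature, so that the oblivious features can be formed explicitly.

```python
import numpy as np

def fit_orr(O, y, lam):
    """Solve (O + n * lam * I) alpha = y for the ORR coefficients.

    The n * lam scaling corresponds to an objective of the form
    (1/n) * sum_i (f(Z_i) - y_i)^2 + lam * ||f||^2 (assumed here).
    """
    n = len(y)
    return np.linalg.solve(O + n * lam * np.eye(n), y)

def predict_orr(alpha, O_cross):
    """Evaluate f = sum_i alpha_i <Z_i, .>, with O_cross[k, i] = <Z_new_k, Z_i>."""
    return O_cross @ alpha

rng = np.random.default_rng(3)
n = 300
s = rng.integers(0, 2, size=2 * n)
x_star = rng.normal(size=2 * n)              # part of X unrelated to S
x = x_star + 2.0 * s - 1.0                   # observed non-sensitive feature
y = x_star + rng.normal(0.0, 0.1, size=2 * n)

# Second half: estimate E[X | S] and E[X]. First half: build Z. With a linear
# kernel the feature map is the identity, so Z can be formed explicitly.
x_est, s_est = x[n:], s[n:]
cond_mean = np.array([x_est[s_est == g].mean() for g in (0, 1)])
z = x[:n] - cond_mean[s[:n]] + x_est.mean()

O = np.outer(z, z)                           # oblivious kernel matrix
alpha = fit_orr(O, y[:n], lam=1e-3)
y_hat = predict_orr(alpha, O)
print("training MSE:", np.mean((y_hat - y[:n]) ** 2))
```

For a new observation one would first form its oblivious feature with the same estimated expectations, then compute its inner products with the training features and pass them to `predict_orr`.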
7 Examples and experiments
We start with a fundamental example. Let and be standard normal random variables with covariance . First, let us consider the linear kernel , . In this case and is also normally distributed (see Bertsekas and Tsitsiklis (2002, Sec. 4.7)). Hence, is normally distributed and . This implies that and are, in fact, fully independent, regardless of how large the dependence between the original features and the sensitive features may be. In the case where and are fully dependent, i.e. for some , the features are equal to zero and do not approximate .
Next, we consider a polynomial kernel of second order such that the quadratic function lies within the corresponding RKHS. The inner product between this and is equal to and is not independent of . Hence, the kernel function affects the dependence between and . Also, within the same RKHS we again have linear functions and is independent of for any linear function . Therefore, within the same RKHS we can have directions in which is independent of and directions where there are dependencies left.
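Both examples can be checked by simulation. The sketch below assumes jointly Gaussian standard normal variables with a hypothetical covariance of 0.7 and uses the closed-form Gaussian conditional expectations in place of estimates. It treats the linear direction and the quadratic direction separately, taking the inner product with the new features in each direction to be the function value minus its conditional expectation plus its mean; the sample size and the chosen dependence diagnostics are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, c = 200_000, 0.7

# Standard normal X and S with covariance c.
s = rng.normal(size=n)
x = c * s + np.sqrt(1.0 - c ** 2) * rng.normal(size=n)

# Linear direction: E[X | S] = c * S and E[X] = 0, so the new feature is X - c * S.
z_lin = x - c * s

# Quadratic direction h(x) = x^2: E[X^2 | S] = c^2 S^2 + 1 - c^2 and E[X^2] = 1.
z_quad = x ** 2 - (c ** 2 * s ** 2 + 1.0 - c ** 2) + 1.0

def corr(a, b):
    """Sample correlation coefficient between two one-dimensional arrays."""
    return np.corrcoef(a, b)[0, 1]

lin_first = corr(z_lin, s)
lin_second = corr(z_lin ** 2, s ** 2)
quad_first = corr(z_quad, s ** 2)
quad_second = corr((z_quad - 1.0) ** 2, s ** 2)

print("linear:    corr(Z, S)           =", lin_first)
print("linear:    corr(Z^2, S^2)       =", lin_second)
print("quadratic: corr(<h,Z>, S^2)     =", quad_first)
print("quadratic: corr((<h,Z>-1)^2, S^2) =", quad_second)
```

In the linear direction both diagnostics are close to zero, consistent with full independence. In the quadratic direction the projection is uncorrelated with functions of the sensitive feature, but its conditional variance still depends on the sensitive feature, which shows up as a clearly positive correlation in the last line.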
Finally, we compare ORR and KRR in a simple experiment, see Figure 4. We have samples sensitive features and non-sensitive features which are both uniformly distributed between and are independent. The features are a convex combination of these two, i.e. , . The response variable is , where is normally distributed with variance and is independent of and . In particular, for and ORR behaves poorly in terms of the Mean Squared Error (MSE). On the other hand when we have and ORR has as much information about as the standard KRR. In the plot we can see that the MSE for ORR is slightly higher than the MSE for KRR for high values. This is due to the empirical estimation errors of the conditional expectations.
We have introduced a novel approach to derive oblivious features which approximate non-sensitive features well while maintaining only minimal dependence on sensitive features. We make use of Hilbert-space-valued conditional expectations and estimates thereof. The application of our approach to kernel methods is facilitated by an oblivious kernel matrix which we have derived to be used in place of the original kernel matrix. We characterize the dependencies between the oblivious and the sensitive features in terms of how ‘close’ the sensitive features are to the low-dimensional manifold . One may wonder if this relation can be used to further reduce dependencies, and hopefully achieve full independence. Another question concerns the interplay between the errors induced by the empirical estimation of the conditional expectations and those of the kernel methods applied to .
Appendix A Probability in Hilbert spaces: elementary results
We summarize in this section the few elementary results concerning random variables that attain values in a separable Hilbert space which we use in the main paper.
A.1 Measurable functions
There are three natural definitions of what it means for a function to be measurable. Denote the measure space in the following by with the understanding that these definitions apply, in particular, to and being the corresponding Borel -algebra.
is Bochner-measurable iff is the point-wise limit of a sequence of simple functions, where is a simple function if it can be written as
for some , and .
is strongly-measurable iff for every Borel-measurable subset of . The topology that is used here is the norm-topology.
is weakly-measurable iff for every element the function is measurable in the usual sense (using the Borel-algebra on ).
All three definitions of measurability are equivalent in our setting. We call a function a random variable if it is measurable in this sense.
Appendix B Hilbert space-valued conditional expectations
B.1 Basic properties
We recall a few important properties of Hilbert-space-valued conditional expectations. These often follow from properties of real-valued conditional expectations through ‘scalarization’ (Pisier, 2016). In the following, let and some -subalgebra of . By Pisier (2016, Eq. (1.7)), for any
and the right hand side is just the usual real valued conditional expectation. It is also worth highlighting that the same holds for the Bochner-integral , i.e. for any , . This can be used to derive properties of . For instance, since is a property of real-valued conditional expectations we find right away that
Because and are elements of and for all
it follows that .
Another result we need is that if is -measurable then
Showing this needs a bit more work. Since there exist -measurable simple functions such that converges point-wise to , and the sequence fulfills for all (Pisier, 2016, Prop. 1.2). Consider some and write , for a suitable , then
because is -measurable. For the right hand side point-wise convergence of to tells us that for all we have . Because we also know that is finite almost surely. Therefore, for in the corresponding co-negligible set,
and almost surely.
By the same argument it follows that almost surely. Let and then . Furthermore, . The right hand side lies in and dominates . Using Shiryaev (1989, II.§7, Thm. 2(a)), we conclude that
and the result follows.
The operator is also idempotent and self-adjoint, i.e.
B.2 Representation of conditional expectations
A well known result in probability theory states that a conditional expectation of a real-valued random variable given another real-valued random variable can be written as with some suitable measurable function . This result generalizes to our setting. Here, we include the generalized result together with a short proof for reference.
Consider a probability space , and let be a separable Hilbert space. Let be a random variable and suppose that is a -measurable function. There exists a Bochner-measurable function such that
We first show the statement for simple functions, and observing that any arbitrary Bochner-measurable function can be written as the point-wise limit of a sequence of simple functions, we extend the result to arbitrary .
First, assume that for some and . Since is measurable with respect to there exists some such that . Define as , where denotes the indicator function on . We obtain, so that .
Next, let for some , and . As above, by measurability of , there exists a sequence such that . It follows that ; hence, for . Observe that in both cases is trivially Bochner-measurable by construction, since it is a simple function.
Now, let be an arbitrary Bochner-measurable function that is also measurable with respect to . There exists a sequence of simple functions such that for every we have
Since each is a simple function, by our argument above, there exists a sequence of Bochner-measurable functions such that where for each the function is simple of the form