Improving Consequential Decision Making under Imperfect Predictions
Abstract
Consequential decisions are increasingly informed by sophisticated data-driven predictive models. For accurate predictive models, deterministic threshold rules have been shown to be optimal in terms of utility, even under a variety of fairness constraints. However, consistently learning accurate models requires access to ground truth data. Unfortunately, in practice, some data can only be observed if a certain decision was taken. Thus, collected data always depends on potentially imperfect historical decision policies. As a result, learned deterministic threshold rules are often suboptimal. We address this problem from the perspective of sequential policy learning. We first show that, if decisions are taken by a faulty deterministic policy, the observed outcomes under this policy are insufficient to improve it. We then describe how this undesirable behavior can be avoided using stochastic policies. Finally, we introduce a practical gradient-based algorithm to learn stochastic policies that effectively leverage the outcomes of decisions to improve over time. Experiments on both synthetic and real-world data illustrate our theoretical results and show the efficacy of our proposed algorithm.
1 Introduction
The use of machine learning models to assist consequential decision making—decision making which has significant consequences—is becoming widespread in a large variety of critical applications such as pretrial release decisions by judges, loan decisions by banks, or fraud detection by insurance companies. In pretrial release decisions, a judge may consult a machine learning model, which estimates the probability that the defendant would reoffend upon release, to decide whether she grants bail or not. In loan decisions, a bank may decide whether or not to offer a loan to an applicant on the basis of a machine learning model’s estimate of the probability that the individual would repay the loan. In fraud detection, an insurance company may identify suspicious claims, which are then sent in for closer manual inspection, on the basis of a machine learning model’s estimate of the probability that a claim is fraudulent. In all these scenarios, the goal of the decision maker (bank, law court, or insurance company) is to employ a decision policy that maximizes a given utility function. In contrast, the typical goal of the machine learning model is to provide an accurate prediction of the (binary) outcome of the process, often referred to as ground truth label (in short, label), from a set of observable features. Moreover, the decision maker does not get to observe the label if they decide to take a negative decision—if bail (a loan) is denied, one cannot observe whether the individual would have reoffended (paid back the loan).
In this context, there has been a flurry of work on computational mechanisms to ensure that the above machine learning models do not disproportionately harm particular demographic groups sharing one or more sensitive attributes, e.g., race or gender (Dwork et al., 2012; Feldman et al., 2015). However, most previous work does not distinguish between decisions and label predictions and, as a consequence, an inherent tradeoff between utility (or fairness) and prediction accuracy has been suggested (Chouldechova, 2017; Kleinberg et al., 2017b). Only recently has the distinction been made explicit (Corbett-Davies et al., 2017; Kleinberg et al., 2017a; Mitchell et al., 2018; Valera et al., 2018). This recent line of work has also shown that deterministic threshold rules, a natural class of decision policies in which the decision is derived deterministically from a predictive model simply by thresholding, achieve maximum utility under various fairness constraints if the underlying machine learning model achieves perfect prediction accuracy. This lends support to focusing on deterministic threshold policies and seemingly justifies using predictions and decisions interchangeably.
Unfortunately, the underlying predictive models are likely imperfect in practice, because they are trained on label values from only those individuals who have historically received positive decisions (Lakkaraju et al., 2017). To make things worse, deterministic threshold rules that use even slightly imperfect prediction models can be far from optimal (Woodworth et al., 2017). This negative result raises the following question: Can we do better if we learn directly to decide rather than to predict?
Contributions
We address the above question from the perspective of sequential policy learning. More specifically, we focus on a problem setting that fits a variety of real-world applications, including those mentioned previously: binary decisions are taken sequentially over time according to a policy and, whenever a positive decision is made, the corresponding label value is observed. It can then be used to further improve the policy. We assume that the decision maker aims to maximize a given utility without fairness constraints. It would be interesting, albeit challenging, to extend our work to scenarios in which the decision maker aims to maximize utility under additional fairness constraints.
In the above problem setting, we first show that the outcomes of deterministic policies, e.g., deterministic threshold rules, are not sufficient to improve the predictive model, nor the resulting decision policy. As a consequence, any systematic discrimination against underrepresented groups may be perpetuated. Then, we demonstrate how we can overcome this undesirable behavior by taking decisions according to a particular family of stochastic policies, which we call exploring policies, and introduce a practical gradient-based algorithm to learn exploring policies. The proposed algorithm i) effectively leverages the observed label values over time, and ii) actively trades off exploitation (taking decisions currently believed to maximize utility) and exploration (taking decisions to learn about the true underlying label distribution).
Finally, we experiment both with synthetic and real-world data. Experiments on synthetic data demonstrate that there are scenarios in which threshold rules do fail at leveraging the outcomes of their decisions to improve the underlying predictive models they base their decisions on. As a consequence, exploring policies learned by our algorithm achieve higher utility than deterministic policies. Experiments on real-world data demonstrate that even if, in some scenarios, deterministic threshold rules and exploring policies achieve comparable performance in terms of utility, threshold rules are more susceptible to perpetuating systematic discrimination against the minority group. In contrast, exploring policies effectively reduce disparate impact.
Related work
Our work relates to the rich literature on counterfactual inference and policy learning (Athey & Wager, 2017; Ensign et al., 2018; Gillen et al., 2018; Heidari & Krause, 2018; Joseph et al., 2016; Jung et al., 2018; Kallus, 2018; Kallus & Zhou, 2018; Lakkaraju & Rudin, 2017). However, most previous work assumes that, given a decision, the label is always observed and, moreover, that the decision may influence the label distribution. In contrast, in our work the label is observed only if the decision is positive and the decision does not influence the (ground truth) label distribution. Two notable exceptions are the works of Kallus & Zhou (2018) and Ensign et al. (2018), which also consider limited feedback. However, they differ from our work in several key aspects. Kallus & Zhou (2018) focus on the design of unbiased estimates for several fairness measures, rather than learning a policy. Ensign et al. (2018) consider a sequential decision making setting, assuming a deterministic mapping between features and labels. This allows them to reduce the problem to the apple tasting problem (Helmbold et al., 2000). Remarkably, in this deterministic setting, they also conclude that the optimal policy should be stochastic.
Finally, our work also relates to an emerging line of research analyzing the long-term effects of consequential decisions informed by data-driven predictive models on underrepresented groups (Hu & Chen, 2018; Liu et al., 2018; Mouzannar et al., 2019). However, their focus is on analyzing the evolution of several measures of well-being under a perfect predictive model, neglecting the data collection phase (Holstein et al., 2018). In contrast, we focus on analyzing how to improve a suboptimal policy over time using the features and labels of its positive decisions.
2 Decision Policies and Imperfect Predictive Models
Given an individual with a feature vector , a sensitive attribute and a (ground truth) label , the decision controls whether the label is observed. For simplicity, we assume the sensitive attribute to be binary, potentially resulting in inadequate binary gender or race assignments; however, our work can be easily generalized to categorical sensitive attributes. For example, in loan decisions, the feature vector may include an individual’s salary, education, or credit history; the sensitive attribute may indicate gender, e.g., male or female; the label indicates whether an individual repays a loan () or defaults () upon receiving it; and the decision specifies whether the individual receives a loan () or their application is rejected ().
Further, we assume that the decision is sampled from a policy and, for each individual, the features , sensitive attribute and label are sampled from a ground truth distribution . Inspired by Corbett-Davies et al. (2017), we measure the (immediate) utility as the expected overall profit provided by the policy, i.e.:
(1)  
where typically reflects economic considerations of the decision maker. For example, in a loan scenario, the utility gain is if a loan is granted and repaid, if a loan is granted but the individual defaults, and zero if the loan is not granted. In a pretrial release scenario, the utility gain is if an individual is released and does not reoffend (i.e., the institution saves the detention cost), if the individual goes on to reoffend upon release (i.e., cost to society), and zero if the individual is detained. Finally, we define the (immediate) group benefit as the fraction of beneficial decisions received by a group of individuals sharing a certain value of the sensitive attribute , i.e., , where the function is problem dependent (Zafar et al., 2017).
Under perfect knowledge of the conditional distribution , we could apply the optimal policy that maximizes the above utility, which has been shown to be a simple deterministic threshold rule (Corbett-Davies et al., 2017):
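As a minimal sketch of the utility and group benefit described above, the snippet below assumes that labels and decisions are encoded as 0/1, that a cost constant `c` between 0 and 1 sets the gain to `1 - c` for a granted and repaid loan and to `-c` for a granted loan that defaults, and that the benefit function simply returns the decision. The names `empirical_utility`, `group_benefit`, and `c` are illustrative choices, not notation taken from the text.

```python
from typing import Callable, Sequence


def empirical_utility(decisions: Sequence[int], labels: Sequence[int], c: float) -> float:
    """Average profit per individual: d * (y - c), which is zero when d == 0."""
    return sum(d * (y - c) for d, y in zip(decisions, labels)) / len(decisions)


def group_benefit(
    decisions: Sequence[int],
    labels: Sequence[int],
    groups: Sequence[int],
    group: int,
    f: Callable[[int, int], float] = lambda d, y: d,  # assumed benefit function
) -> float:
    """Average of f(d, y) over individuals whose sensitive attribute equals `group`."""
    members = [(d, y) for d, y, s in zip(decisions, labels, groups) if s == group]
    return sum(f(d, y) for d, y in members) / len(members)


# Toy data (hypothetical): four applicants, two per group.
decisions = [1, 1, 0, 1]
labels = [1, 0, 1, 1]
groups = [0, 0, 1, 1]

utility = empirical_utility(decisions, labels, c=0.5)
benefit_0 = group_benefit(decisions, labels, groups, group=0)
benefit_1 = group_benefit(decisions, labels, groups, group=1)
```

With these toy inputs, the rejected applicant contributes nothing to the utility regardless of the label, which mirrors the fact that the outcome of a negative decision is never observed.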
(2) 
However, in practice, we typically do not have access to the true conditional distribution , but instead to an imperfect predictive model , where is the error. Hence, one could think of naively implementing a deterministic threshold rule using
(3) 
Unfortunately, because of the mismatch between and , this policy will usually be suboptimal in terms of utility (Woodworth et al., 2017). The same issue remains for policies that maximize utility under a variety of fairness constraints on group benefits, such as disparate impact, equality of opportunity, or disparate mistreatment (Feldman et al., 2015; Hardt et al., 2016; Zafar et al., 2017), which have also been shown to be deterministic threshold rules, where the constant in eq. (2) may depend on the sensitive attribute (Corbett-Davies et al., 2017).
To make things worse, it has been observed that the predictive error is often systematically larger for minority groups (Angwin et al., 2016). To understand why, note that the predictive model is trained on data which comprises only those outcomes (features, sensitive attributes, and labels) that correspond to the positive decisions provided by a (potentially faulty) initial policy . In other words, the observed data, collected using policy , are not i.i.d. samples from the ground truth joint distribution , but instead from the weighted distribution:
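The loss in utility caused by a mismatched model can be illustrated with a small sketch. It assumes a threshold rule that decides positively whenever the (true or predicted) probability of a positive label is at least a cost constant `c`, a single score feature spread uniformly over [0, 1], a true probability equal to the score, and a hypothetical model that underestimates it by 0.2. None of these choices come from the text.

```python
def threshold_policy(prob_fn, c):
    """Deterministic rule: positive decision iff the probability is at least c."""
    return lambda x: 1 if prob_fn(x) >= c else 0


def expected_utility(policy, true_prob, xs, c):
    """Average of policy(x) * (P(y = 1 | x) - c) over an evenly spaced grid."""
    return sum(policy(x) * (true_prob(x) - c) for x in xs) / len(xs)


c = 0.5
xs = [i / 1000 for i in range(1001)]            # grid standing in for uniform scores
true_prob = lambda x: x                          # assumed ground truth
model_prob = lambda x: max(0.0, x - 0.2)         # assumed imperfect model

u_optimal = expected_utility(threshold_policy(true_prob, c), true_prob, xs, c)
u_model = expected_utility(threshold_policy(model_prob, c), true_prob, xs, c)
```

In this toy setting the model-based rule rejects every individual with a score between 0.5 and 0.7, all of whom would have contributed positive expected profit, so `u_model` falls below `u_optimal`.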
(4) 
Consequently, the standard supervised learning theory bounds for the error do not apply. In the remainder, we will say that the joint distributions and are induced by the policy . In the next section, we study how to learn the optimal policy if the data is collected from an initial faulty policy . As discussed previously, we will focus on the unconstrained utility maximization problem, without fairness constraints on the group benefits, and leave the constrained problem for future work.
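The selective way in which labels are collected can be sketched as follows. The example assumes a single score feature drawn uniformly from [0, 1], a true probability of a positive label equal to the score, and a hypothetical deterministic initial policy that accepts scores of at least 0.7. The function name `collect` and all constants are illustrative.

```python
import random


def collect(policy, true_prob, n, rng):
    """Return (x, y) pairs only for individuals who received a positive decision."""
    data = []
    for _ in range(n):
        x = rng.random()
        decision = 1 if rng.random() < policy(x) else 0
        if decision == 1:
            # The label is revealed only after a positive decision.
            y = 1 if rng.random() < true_prob(x) else 0
            data.append((x, y))
    return data


rng = random.Random(0)
true_prob = lambda x: x
initial_policy = lambda x: 1.0 if x >= 0.7 else 0.0   # hypothetical faulty policy

observed = collect(initial_policy, true_prob, 5000, rng)
min_observed_x = min(x for x, _ in observed)
observed_rate = sum(y for _, y in observed) / len(observed)
```

Under these assumptions no observation ever falls below a score of 0.7, and the positive-label rate among observed individuals is far above the population rate of one half, so the gathered data are not representative of the ground truth distribution.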
3 Deterministic vs Stochastic Policies
Consider a class of policies , within which we want to maximize utility, as defined in eq. (1), assuming that the maximum exists and is attained within this class:
(5) 
under the assumption that we do not have access to samples from the ground truth distribution , on which the utility depends, but instead only to samples from a distribution induced by a given initial policy , as defined by eq. (4). We first analyze this problem for deterministic threshold rules, before considering general deterministic policies, and finally general stochastic policies.
3.1 Deterministic policies
Assume the initial policy is a given deterministic threshold rule and is the set of all deterministic threshold rules, which means that each (and ) is of the form eq. (3) for some predictive model . Given a hypothesis class of predictive models , the optimization problem in eq. (5) becomes
(6) 
where the utility is simply (using an indicator function that equals 1 if its predicate is true and 0 otherwise)
(7) 
Note that need not be unique, because the utility is not sensitive to the precise values above or below . Hence, we may have that and other may predict the labels more accurately than since the latter is learned to maximize utility rather than accuracy.
By assumption, we do not have access to samples from the ground truth distribution but instead to samples from the distribution induced by . In such a case, one may choose to simply ignore this mismatch and find a predictive model that maximizes , the utility with respect to the induced distribution , i.e.,
(8) 
However, the following negative result shows that, under mild conditions, would lead to a suboptimal deterministic threshold rule.
Proposition 1.
Assume contains the unique global optima and of, respectively, and . If there exists a subset of positive measure under on which , then .
Proof.
First, note that any deterministic policy is fully characterized by the sets for . For a deterministic threshold rule , we write . Because of eqs. (6) and (7), we see that whenever the symmetric difference between the sets and ), , has positive inner measure (induced by ) for and any . Thus it only remains to show that has positive inner measure for . Due to our assumption that , , and we have assumed that has positive measure. Therefore has positive measure. ∎
Appendix A supplements the above result by showing that a sequence of deterministic threshold rules, where the predictive model of each threshold rule is trained using an error-based learning algorithm on data gathered by previous threshold rules, provably fails to converge to the optimal deterministic threshold rule if the initial policy is suboptimal.
Finally, while we have focused on deterministic threshold rules, our results readily generalize to all deterministic policies, because we can always express any arbitrary deterministic policy using eq. (3). For example, is a predictive model for which . Hence, Proposition 1 implies that, if we can only observe the outcomes of previous decisions taken by a deterministic initial policy , these outcomes may be insufficient to find the deterministic policy that maximizes utility. Moreover, the same issue may arise if the initial policy is stochastic, but the induced distribution assigns zero probability to a set with positive probability under the true distribution , which amounts to systematically ignoring part of the applicants.
3.2 Stochastic policies
To overcome the undesirable behavior exhibited by deterministic policies discussed in the previous section, one could just use a fully randomized initial policy , where for all . It readily follows from eq. (4) that samples from the induced distribution are i.i.d. samples from the ground truth distribution , specifically . As a result, if the hypothesis class of predictive models is rich enough, we could learn the optimal policy from data gathered using the policy . However, in practice, using a fully randomized initial policy is unacceptable in terms of utility, since it would entail granting loans (or releasing defendants) by a fair coin flip until sufficient data has been collected. Fortunately, we will show next that an initial stochastic policy does not need to be fully randomized to be able to learn the optimal policy. We only need to choose an initial policy such that on any subset with positive probability under , a requirement that will be more acceptable in terms of initial utility. We refer to any policy with this property as an exploring policy. Formally, a policy is exploring iff the true distribution is absolutely continuous w.r.t. the induced distribution . In simple words, the distribution from which data are collected must not ignore regions where the true distribution puts mass.
For an exploring policy , we can compute eq. (1) using weighted sampling, i.e.,
(9) 
The key insight is that the last equality in eq. (9) only depends on samples from the induced distribution , which are weighted to correct the bias with respect to the ground truth distribution . Thus, we arrive at the following positive result.
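The reweighting idea can be sketched numerically. The example below assumes a single uniform score feature, a true probability of a positive label equal to the score, a cost constant `c = 0.5`, a hypothetical exploring data-collection policy `behavior` that accepts with probability between 0.2 and 0.8, and a target threshold policy whose utility we want to estimate. Dividing the weighted sum by the number of proposed individuals, including the rejected ones, is a normalization choice made here so that the estimate is a utility per individual.

```python
import random


def weighted_utility(samples, policy, behavior, c, n_proposed):
    """Estimate the utility of `policy` from (x, y) pairs observed under `behavior`.

    Each observed pair is weighted by 1 / behavior(x) to undo the selection bias.
    """
    total = sum(policy(x) * (y - c) / behavior(x) for x, y in samples)
    return total / n_proposed


rng = random.Random(1)
c = 0.5
true_prob = lambda x: x
behavior = lambda x: 0.2 + 0.6 * x               # exploring: never zero
target = lambda x: 1.0 if x >= 0.5 else 0.0      # policy to be evaluated

n = 100_000
samples = []
for _ in range(n):
    x = rng.random()
    if rng.random() < behavior(x):               # label seen only if accepted
        y = 1 if rng.random() < true_prob(x) else 0
        samples.append((x, y))

estimate = weighted_utility(samples, target, behavior, c, n)
```

In this toy setting the true utility of the target policy is 0.125, and the estimate recovers it closely even though the target policy itself was never deployed. If `behavior` were zero on part of the score range, the weights there would be undefined, which is exactly what the exploring condition rules out.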
Proposition 2.
Let be the set of exploring policies and let . Then
Proof.
We already know that the supremum is upper bounded by , i.e., it suffices to construct a sequence of policies such that for . Using notation from the proof of Proposition 1, we define
It is clear that is exploring, i.e., , for all . Now we compute
∎
Finally, we would like to highlight that not all exploring policies may be equally acceptable to society. For example, in lending, it may appear wasteful to still deny a loan with probability greater than zero to individuals who are believed to repay by the current model. In those cases, one may like to consider exploring policies that, given sufficient evidence, decide deterministically, i.e., for some values of . At the end of section 4 we provide an example of how to design such policies.
4 Learning Exploring Policies
In this section, our goal is to put Proposition 2 into practice by designing an algorithm that finds an exploring policy that achieves the same utility as the optimal policy using data gathered by a given initial exploring policy , i.e., not from the ground truth distribution . To this end, we consider a class of parameterized exploring policies and we aim to find the policy that maximizes the utility defined in eq. (1).
We now introduce a general gradient-based algorithm to solve the above problem, which only requires the class of parameterized policies to be differentiable. In particular, we resort to stochastic gradient ascent (SGA) (Kiefer et al., 1952), i.e., , where is the learning rate at step . (Depending on the choice of parameterization and the learning rate schedule, our algorithm may enjoy theoretical guarantees, which we leave for future work; here, we demonstrate that it performs well in practice.) It may seem challenging to compute a finite sample estimate of the gradient of the utility since, as before, the expectation in the utility is taken with respect to the ground truth distribution , to which we do not have sample access; and the derivative is taken with respect to the parameters of the policy , which we are trying to learn. However, we can overcome both challenges by using the log-derivative trick (Williams, 1992) and the reweighting trick from eq. (9), which allow us to write the gradient of the utility as:
(10) 
where is often referred to as the score function (Hyvärinen, 2005).
Unfortunately, the above procedure has two main drawbacks. First, it may require an abundance of data drawn from , which can be unacceptable in terms of utility since may be far from optimal and should not be deployed for too long. Second, if is small in a region where often takes positive decisions, one may expect that an empirical estimate of the above gradient has high variance, due to similar arguments as in weighted inverse propensity scoring (Sutton & Barto, 1998).
To overcome the above drawbacks, we proceed sequentially, starting from an initial policy , building a sequence of policies with increasing utility, i.e., for all . More specifically, in step , we find the policy via SGA using samples from the distribution induced by the previous policy . Note that we can obtain an expression for by simply replacing with in eq. (10). Thus we can estimate the gradient with samples from the distribution induced by the previous policy , and sample the decisions from the policy under consideration . This yields an unbiased finite sample Monte Carlo estimator for the gradient
(11) 
where is the number of positive decisions taken by . Here, it is important to notice that, while the decisions by were actually taken and, as a result, (feature and label) data was gathered under , the decisions are just sampled to implement SGA. The overall policy learning process is summarized in Algorithm 1, where Minibatch samples a minibatch of size from the dataset and InitializePolicy initializes the policy parameters.
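The sequential procedure described above can be sketched as follows. This is only an illustration under several assumptions that are not taken from the text: a single uniform score feature `x`, a true probability of a positive label equal to `x`, a cost constant `c = 0.5`, a logistic policy with the hypothetical feature map `(1, x)`, and arbitrary hyperparameters. Averaging over the minibatch rescales the gradient by a positive constant, which the learning rate absorbs.

```python
import math
import random


def sigmoid(z):
    z = max(-30.0, min(30.0, z))               # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def prob_positive(theta, x):
    """Logistic policy with the assumed feature map phi(x) = (1, x)."""
    return sigmoid(theta[0] + theta[1] * x)


def collect(theta, true_prob, n, rng):
    """Deploy the policy on n proposals; keep (x, y) only after a positive decision."""
    data = []
    for _ in range(n):
        x = rng.random()
        if rng.random() < prob_positive(theta, x):
            data.append((x, 1 if rng.random() < true_prob(x) else 0))
    return data


def gradient_estimate(theta, theta_prev, batch, c, rng):
    """Reweighted score-function estimate from data gathered under theta_prev."""
    grad = [0.0, 0.0]
    for x, y in batch:
        p = prob_positive(theta, x)
        d = 1 if rng.random() < p else 0       # sampled only to run SGA, never deployed
        if d == 0:
            continue                           # the term d * (y - c) vanishes
        weight = (y - c) / prob_positive(theta_prev, x)
        score = 1.0 - p                        # (d - p) with d = 1
        grad[0] += weight * score
        grad[1] += weight * score * x
    return [g / len(batch) for g in grad]


def learn(true_prob, c, rounds, n_per_round, steps, batch_size, lr, rng):
    theta = [0.0, 0.0]                         # initial exploring policy (coin flip)
    for _ in range(rounds):
        theta_prev = list(theta)
        data = collect(theta_prev, true_prob, n_per_round, rng)
        if not data:
            continue                           # nothing observed in this round
        for _ in range(steps):
            batch = rng.sample(data, min(batch_size, len(data)))
            grad = gradient_estimate(theta, theta_prev, batch, c, rng)
            theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


def exact_utility(theta, true_prob, c, grid=1000):
    xs = [(i + 0.5) / grid for i in range(grid)]
    return sum(prob_positive(theta, x) * (true_prob(x) - c) for x in xs) / grid


rng = random.Random(0)
theta = learn(lambda x: x, c=0.5, rounds=10, n_per_round=2000,
              steps=150, batch_size=100, lr=0.2, rng=rng)
final_utility = exact_utility(theta, lambda x: x, 0.5)
```

The initial coin-flip policy has zero utility in this toy setting, so any positive `final_utility` reflects what was learned from the selectively observed labels alone.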
Remarks. In Algorithm 1, to learn each policy , we have limited ourselves to data gathered only by the previous policy . However, we may readily use samples from the distribution induced by any previous policy in eq. (11). The average of multiple gradient estimators for several is again an unbiased gradient estimator. In practice, one may decide to consider recent policies , which are more similar to , thus ensuring that the gradient estimator does not suffer from high variance.
The way in which we use weighted sampling to estimate the above gradients closely relates to the concept of weighted inverse propensity scoring (wIPS), commonly used in counterfactual learning (Bottou et al., 2013; Swaminathan & Joachims, 2015a), off-policy reinforcement learning (Sutton & Barto, 1998), and contextual bandits (Langford et al., 2008). However, a key difference is that, in wIPS, the labels are always observed. Despite this difference, we believe that recent advances to reduce the variance of the gradients in weighted inverse propensity scoring, such as clipped-wIPS (Bottou et al., 2013), self-normalized estimators (Swaminathan & Joachims, 2015b), or doubly robust estimators (Dudík et al., 2011), may also be applicable to our setting. This is left for future work.
Example: Logistic Policy
Let us now introduce a concrete parameterization of , a logistic policy given by
where is the logistic function, are the model parameters, and is a fixed feature map. Note that any logistic policy is an exploring policy and we can analytically compute its score function as
where . Using this expression, we can rewrite the empirical estimator for the gradient in eq. (11)
Given the above expression, we have all the necessary ingredients to implement Algorithm 1.
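For a logistic policy, the score function has a simple closed form that can be checked numerically. The sketch below assumes the standard Bernoulli-logistic form, in which the probability of a positive decision is the logistic function applied to the inner product of parameters `theta` and features `phi`, so that the gradient of the log-probability of a decision `d` is `(d - p) * phi`. The helper names and the test values are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(theta, phi):
    return sum(t * f for t, f in zip(theta, phi))


def log_prob(theta, phi, d):
    """Log-probability of decision d under the logistic policy."""
    p = sigmoid(logit(theta, phi))
    return math.log(p) if d == 1 else math.log(1.0 - p)


def score(theta, phi, d):
    """Analytic gradient of log_prob with respect to theta."""
    p = sigmoid(logit(theta, phi))
    return [(d - p) * f for f in phi]


def numeric_score(theta, phi, d, eps=1e-6):
    """Central finite differences, used only to check the analytic form."""
    grads = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grads.append((log_prob(up, phi, d) - log_prob(down, phi, d)) / (2 * eps))
    return grads


theta = [0.3, -1.2]
phi = [1.0, 0.8]
max_error = max(
    abs(a - n)
    for d in (0, 1)
    for a, n in zip(score(theta, phi, d), numeric_score(theta, phi, d))
)
```

Because the logistic function never reaches exactly zero, every individual keeps a nonzero chance of a positive decision, which is why such a policy qualifies as exploring.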
As discussed in the previous section, randomizing decisions may be questionable in certain practical scenarios. For example, in loan decisions, it may appear wasteful for the bank and contestable for the applicant to deny a loan with probability greater than zero to individuals who are believed to repay by the current model. In those cases, one may consider the following modification of the logistic policy, which we refer to as semi-logistic policy:
As with the logistic policy, we can compute the score function analytically as:
and use this expression to compute an unbiased estimator for the gradient in eq. (11) as:
Finally, note that the semi-logistic policy is an exploring policy and thus satisfies the assumptions of Proposition 2.
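One possible form of such a policy is sketched below. It is an assumption made for illustration: the policy decides positively with probability one whenever the logit is non-negative (that is, whenever a logistic policy would accept with probability at least one half) and falls back to the logistic probability otherwise. The function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(theta, phi):
    return sum(t * f for t, f in zip(theta, phi))


def semi_logistic_prob(theta, phi):
    """Probability of a positive decision: deterministic above the assumed cutoff."""
    z = logit(theta, phi)
    return 1.0 if z >= 0.0 else sigmoid(z)


def semi_logistic_score(theta, phi, d):
    """Gradient of the log-probability of d; zero where the policy is deterministic."""
    z = logit(theta, phi)
    if z >= 0.0:
        return [0.0] * len(phi)
    p = sigmoid(z)
    return [(d - p) * f for f in phi]


theta = [-1.0, 2.0]
confident = semi_logistic_prob(theta, [1.0, 0.9])   # logit 0.8: always accept
uncertain = semi_logistic_prob(theta, [1.0, 0.1])   # logit -0.8: randomize
```

Under this form the probability of a positive decision is strictly positive everywhere, so the policy still explores, while individuals above the cutoff are never rejected at random.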
5 Experiments
In this section, we empirically evaluate our gradient-based algorithm on synthetic and real-world data. To this end, we learn a sequence of policies using the following algorithms:

Optimal: decisions are taken using the optimal deterministic threshold rule given by eq. (2), i.e., for all . The ground truth distribution is known.

Threshold rule: decisions are taken using deterministic threshold policies , where are logistic models trained to maximize label likelihood using data gathered by all previous policies , .

Logistic: decisions are taken using logistic policies trained using Algorithm 1. Each policy is trained using data gathered either by the immediately previous policy , or by all previous policies for .

Semi-logistic: decisions are taken using semi-logistic policies trained using Algorithm 1. Each policy is trained using data gathered either by the immediately previous policy , or by all previous policies , for .
Here, note that both logistic and semi-logistic are exploring policies. Moreover, we compare the performance of the above methods using the following quality metrics:

Effective utility: the utility realized during the learning process up to time , i.e.,
where are the data in which the policy took positive decisions . This is the utility that the decision maker accumulates while learning increasingly better policies.

Utility: the utility achieved by the current policy estimated empirically using an unbiased held-out dataset, sampled i.i.d. from the ground truth data distribution . This is the utility that the decision maker would obtain if they decide to keep the current learned policy and deploy it at large in the population of interest.
In addition, for real-world data, we also evaluate the difference in group benefits between sensitive groups, i.e., , where we define group benefits as . This definition of group benefits accounts for disparate impact, one of the most common notions of unfairness, where a decision policy is free of disparate impact iff . Here, the lower the absolute difference in group benefits , the less disparate impact. Finally, it is crucial to note that, while each of the above methods decides over the same set of proposed at each time step , depending on their decisions, they may collect labels for differing subsets and thus receive different amounts of new training data.
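The metrics above can be sketched in a few lines. The example assumes that effective utility accumulates a profit of `y - c` for every positive decision taken so far and normalizes by the total number of proposed individuals, and that the group benefit is the fraction of positive decisions within a group. Both normalizations, the cost constant `c`, and the function names are assumptions made for illustration.

```python
def effective_utility(positive_labels_per_step, c, n_per_step):
    """Profit accumulated over all deployed policies, per proposed individual.

    positive_labels_per_step[t] holds the labels observed at step t, i.e. the
    labels of individuals who received a positive decision at that step.
    """
    total = sum(y - c for labels in positive_labels_per_step for y in labels)
    return total / (n_per_step * len(positive_labels_per_step))


def acceptance_rate(decisions, groups, group):
    members = [d for d, s in zip(decisions, groups) if s == group]
    return sum(members) / len(members)


def benefit_gap(decisions, groups):
    """Difference in acceptance rates between group 0 and group 1."""
    return acceptance_rate(decisions, groups, 0) - acceptance_rate(decisions, groups, 1)


# Toy example (hypothetical): two steps with four proposals each.
eff = effective_utility([[1, 0, 1], [1, 1]], c=0.5, n_per_step=4)

decisions = [1, 1, 0, 1, 0, 0]
groups = [0, 0, 0, 1, 1, 1]
gap = benefit_gap(decisions, groups)
```

A gap of zero would correspond to equal acceptance rates across the two groups, which is the sense in which a policy is free of disparate impact under this benefit definition.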
5.1 Experiments on Synthetic Data
Experimental setup. For the ground truth data, we assume that there is only a single non-sensitive feature per individual and no sensitive attributes. A scenario with single features is highly relevant given the increasing use of score-based decision support systems (e.g., credit scores for lending, or risk scores in pretrial risk assessment), where inputs, training data and the functional form used to estimate the score are not available due to, e.g., privacy or intellectual property reasons. More specifically, the fact that the optimal policy is a deterministic threshold rule lends support to the use of scores, because for any score that is monotonic in the true probability there exists a single decision threshold for the score that results in the optimal policy.
We consider two different settings, illustrated in Figure 1, where represents a score. In the first setting, is sampled from a truncated Normal distribution with and the conditional probability is strictly monotonic in the score. As a result, there exists a single decision boundary for the score that results in the optimal policy, which is also in the class of logistic policies. Note, however, that the score is not well calibrated, i.e., is not directly proportional to .
In the second setting, is sampled from a standard Normal distribution . Here, the conditional probability crosses the cost threshold multiple times, resulting in two disjoint intervals of scores, where the optimal decision is (green areas). Consequently, the optimal policy cannot be implemented by a deterministic threshold rule based on a logistic predictive model.
Results. Figure 2 summarizes the results, which show that our method outperforms deterministic threshold rules in terms of effective utility (first column) and utility (second column) in both experimental settings (rows). Moreover, Figure 3 shows the learned predictive model for the deterministic threshold rules (first row) and the learned policies for the logistic policies (second and third rows) and the semi-logistic policies (fourth and fifth rows), where the columns correspond to the two different settings shown in Figure 1. In the first setting, the exploring policies locate the optimal decision boundary, whereas the deterministic threshold rules, which are based on learned predictive models, do not, even though is monotonic in and has a sigmoidal shape. In the second setting, our methods also explore the left green region and rightly conclude that overall it is beneficial to take mostly positive decisions for right of the leftmost intersection of with . This is indeed the optimal policy within the model class of logistic policies. In contrast, since the deterministic threshold rules are non-exploring policies, they confidently converge to the second threshold, ignoring the left region of promising candidates.
Note that, while both the deterministic threshold rules and the logistic policies learn a logistic function, the former use the logistic functions as predictive models, whose outputs are thresholded to take a deterministic decision, while the latter use them directly as stochastic policies, i.e., their values determine the probability that the decision is . In the case of deterministic threshold rules, the logistic function is optimized using maximum likelihood on the labels .
These results show that the deterministic policies fail to identify the optimal threshold since their learned predictive models do not converge to the ground truth conditional probability . In contrast, the stochastic policies manage to successfully approximate the optimal threshold within the family of logistic functions with and and within the family of semi-logistic functions with . That being said, the stochastic semi-logistic policy with fails to approximate the optimal threshold within the family of semi-logistic functions. This is because the semi-logistic policies collect datasets where examples with large remain overrepresented over time, due to their deterministic behavior in that region of the feature space. As a consequence, the resulting threshold remains in that region of the feature space.
5.2 Experiments on Real Data
Experimental setup. Here, we use the COMPAS recidivism prediction dataset compiled by ProPublica (Angwin et al., 2016), which comprises information about criminal offenders screened through the COMPAS tool in Broward County, Florida during 2013-2014. For each offender, the dataset contains a set of demographic features, the offender’s criminal history, and the risk score assigned to the offender by COMPAS. Moreover, ProPublica also collected whether or not these individuals actually recidivated within two years after the screening. In our experiments, the sensitive attribute is the race ( if white and otherwise), the label indicates whether the individual recidivated () or not () and the decision policy determines whether an individual is released from jail () or not (). We use 80% of the data to learn the decision policies, where at each step , we sample (with replacement) individuals from this set, and the remaining 20% as a held-out set to compute the utility of each learned policy in the population of interest.
Results. Figure 4 summarizes the results. In terms of effective utility, the deterministic policies achieve a small, non-increasing advantage with respect to our logistic policies () due to the early exploration of the logistic policies. In terms of utility (second column), the logistic policies and deterministic policies achieve comparable final values. Indeed, the difference in the final utilities, over 50 runs each, is not statistically significant according to either a two-sample Student’s t-test (with and without Welch correction) or a nonparametric, two-sample Kolmogorov-Smirnov test. Finally, in terms of disparate impact (third column), the logistic and semi-logistic policies beat the deterministic policies. In contrast with the final utility, this difference is statistically significant according to the same statistical tests (). In summary, while the deterministic policies and exploring policies achieve similar (effective) utility in this dataset, the exploring policies effectively reduce disparate impact, whereas the deterministic policies perpetuate a systematic discrimination against the underrepresented group. Finally, note that for real-world experiments, we cannot evaluate the optimal policy, and we do not expect it to reside in our model class.
6 Conclusions
In this paper, we have analyzed consequential decision making under imperfect predictive models, which have been learned using data gathered by potentially biased historical decisions. First, we have shown that if these decisions were taken according to a faulty deterministic policy, the observed outcomes under this policy are insufficient to improve it. Next, we have demonstrated that this undesirable behavior can be avoided by using a particular family of stochastic policies, which we refer to as exploring policies. Finally, we have introduced and evaluated a practical gradient-based algorithm to learn exploring policies that improve over time.
Our work also opens interesting avenues for future work. First, we have considered the unconstrained utility maximization problem. A natural extension would be utility maximization under a variety of fairness constraints such as disparate impact and disparate mistreatment. Second, while the class of exploring policies is vast, we have only experimented with simple parameterized policies such as (semi-)logistic policies. Acquiring a deeper understanding of other types of exploring policies could help extend our work to more powerful models and apply it to complex, high-dimensional datasets. Third, we have assumed that the ground truth distribution does not change over time. Incorporating feedback from the decisions or time-varying externalities, so as to learn to decide under imperfect predictions even in changing environments, could be of great interest for a large variety of applications. Finally, in this work, we have evaluated our algorithm using observational data. It would be very revealing to perform an evaluation based on interventional experiments.
References
 Angwin et al. (2016) Angwin, J., Larson, J., Mattu, S., and Kirchner, L. Machine bias: There is software used across the country to predict future criminals. And it is biased against blacks. ProPublica, May 23, 2016. URL https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
 Athey & Wager (2017) Athey, S. and Wager, S. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2017.
 Bottou et al. (2013) Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14:3207–3260, 2013.
 Chouldechova (2017) Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data, 5(2):153–163, 2017.
 Corbett-Davies et al. (2017) Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., and Huq, A. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797–806. ACM, 2017.
 Dudík et al. (2011) Dudík, M., Langford, J., and Li, L. Doubly Robust Policy Evaluation and Learning. In Proceedings of the 28th International Conference on Machine Learning, pp. 1097–1104. Omnipress, 2011.
 Dwork et al. (2012) Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS ’12, pp. 214–226. ACM, 2012.
 Ensign et al. (2018) Ensign, D., Friedler, S. A., Neville, S., Scheidegger, C., and Venkatasubramanian, S. Decision making with limited feedback: Error bounds for recidivism prediction and predictive policing. JMLR, 2018.
 Feldman et al. (2015) Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C., and Venkatasubramanian, S. Certifying and removing disparate impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 259–268, 2015.
 Gillen et al. (2018) Gillen, S., Jung, C., Kearns, M., and Roth, A. Online learning with an unknown fairness metric. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 2605–2614. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7526-online-learning-with-an-unknown-fairness-metric.pdf.
 Hardt et al. (2016) Hardt, M., Price, E., Srebro, N., et al. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pp. 3315–3323, 2016.
 Heidari & Krause (2018) Heidari, H. and Krause, A. Preventing disparate treatment in sequential decision making. In IJCAI, pp. 2248–2254, 2018.
 Helmbold et al. (2000) Helmbold, D. P., Littlestone, N., and Long, P. M. Apple tasting. Information and Computation, 161(2):85–139, 2000.
 Holstein et al. (2018) Holstein, K., Vaughan, J. W., Daumé III, H., Dudík, M., and Wallach, H. Improving fairness in machine learning systems: What do industry practitioners need? arXiv preprint arXiv:1812.05239, 2018.
 Hu & Chen (2018) Hu, L. and Chen, Y. A short-term intervention for long-term fairness in the labor market. In World Wide Web Conference, WWW ’18, pp. 1389–1398, Republic and Canton of Geneva, Switzerland, 2018. International World Wide Web Conferences Steering Committee.
 Hyvärinen (2005) Hyvärinen, A. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–709, 2005.
 Joseph et al. (2016) Joseph, M., Kearns, M., Morgenstern, J. H., and Roth, A. Fairness in learning: Classic and contextual bandits. In Advances in Neural Information Processing Systems, pp. 325–333, 2016.
 Jung et al. (2018) Jung, J., Shroff, R., Feller, A., and Goel, S. Algorithmic decision making in the presence of unmeasured confounding. arXiv preprint arXiv:1805.01868, 2018.
 Kallus (2018) Kallus, N. Balanced policy evaluation and learning. In Advances in Neural Information Processing Systems, pp. 8909–8920, 2018.
 Kallus & Zhou (2018) Kallus, N. and Zhou, A. Residual unfairness in fair machine learning from prejudiced data. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2439–2448, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
 Kiefer et al. (1952) Kiefer, J., Wolfowitz, J., et al. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
 Kleinberg et al. (2017a) Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J., and Mullainathan, S. Human decisions and machine predictions. The quarterly journal of economics, 133(1):237–293, 2017a.
 Kleinberg et al. (2017b) Kleinberg, J., Mullainathan, S., and Raghavan, M. Inherent Trade-Offs in the Fair Determination of Risk Scores. In Papadimitriou, C. H. (ed.), 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), volume 67 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 43:1–43:23. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2017b. ISBN 9783959770293.
 Lakkaraju & Rudin (2017) Lakkaraju, H. and Rudin, C. Learning Cost-Effective and Interpretable Treatment Regimes. In Singh, A. and Zhu, J. (eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 166–175, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.
 Lakkaraju et al. (2017) Lakkaraju, H., Kleinberg, J., Leskovec, J., Ludwig, J., and Mullainathan, S. The selective labels problem: Evaluating algorithmic predictions in the presence of unobservables. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 275–284. ACM, 2017.
 Langford et al. (2008) Langford, J., Strehl, A., and Wortman, J. Exploration scavenging. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pp. 528–535, New York, NY, USA, 2008. ACM.
 Liu et al. (2018) Liu, L. T., Dean, S., Rolf, E., Simchowitz, M., and Hardt, M. Delayed impact of fair machine learning. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 3150–3158, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
 Mitchell et al. (2018) Mitchell, S., Potash, E., and Barocas, S. Prediction-based decisions and fairness: A catalogue of choices, assumptions, and definitions. arXiv preprint arXiv:1811.07867, 2018.
 Mouzannar et al. (2019) Mouzannar, H., Ohannessian, M. I., and Srebro, N. From fair decision making to social equality. In FAT, 2019.
 Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. ISBN 0262193981.
 Swaminathan & Joachims (2015a) Swaminathan, A. and Joachims, T. Counterfactual risk minimization: Learning from logged bandit feedback. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pp. 814–823. JMLR.org, 2015a.
 Swaminathan & Joachims (2015b) Swaminathan, A. and Joachims, T. The self-normalized estimator for counterfactual learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems, pp. 3231–3239, Cambridge, MA, USA, 2015b. MIT Press.
 Valera et al. (2018) Valera, I., Singla, A., and Gomez-Rodriguez, M. Enhancing the accuracy and fairness of human decision making. In Neural Information Processing Systems, 2018.
 Vapnik (1998) Vapnik, V. Statistical learning theory. Wiley, 1998.
 Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3–4):229–256, 1992.
 Woodworth et al. (2017) Woodworth, B., Gunasekar, S., Ohannessian, M. I., and Srebro, N. Learning non-discriminatory predictors. In Kale, S. and Shamir, O. (eds.), Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pp. 1920–1953, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR.
 Zafar et al. (2017) Zafar, M. B., Valera, I., Gómez Rodriguez, M., and Gummadi, K. P. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pp. 1171–1180, 2017.
Appendix A Error-Based Learning of Deterministic Threshold Rules is Guaranteed to Fail
We have shown in Proposition 1 that deterministic threshold rules cannot be guaranteed to converge to the optimal policy if only biased data is available. In this section, we supplement this result with concrete scenarios in which a sequence of deterministic threshold rules, where each threshold rule uses the data gathered by previous threshold rules to train its associated predictive model, is guaranteed to fail to converge to the optimal deterministic threshold rule if the underlying predictive model is trained using any error-based learning algorithm. First, we introduce the sequential learning setup more formally. Let be the space of features, the space of sensitive attributes, and the space of labels (we assume the standard sigma algebras on these spaces). Moreover, the set of probability distributions on a space is denoted by . Then, a policy is a mapping and we denote the space of all policies by . The subset of deterministic policies is given by , where denote Dirac measures on and , respectively.
Definition 1.
A sequential policy learning task for the true distribution is given by a tuple , where

is the hypothesis class of policies

is the initial policy

is an update rule (for a set , we write ).
The update rule takes an existing policy within the class of allowed policies and a dataset of examples and produces an updated policy. Typically, we think of it as a learning algorithm that aims to improve the policy as measured by the utility in eq. (1).
In our setting, we collect new data according to at each step . This gives rise to two natural sequences of policies:

The iterative sequence with , where only the data gathered by the immediately previous policy are used to update the current policy.

The aggregated sequence with , where the data gathered by all previous policies are used to update the current policy.
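The difference between the two sequences lies only in which data the update rule sees. The sketch below makes this explicit with placeholder components: `collect`, `make_recording_update`, and the integer "policies" are hypothetical stand-ins chosen so that the example is self-contained, and they do not correspond to any particular learning algorithm.

```python
def iterative_sequence(initial_policy, update, collect, num_steps):
    """Each update uses only the data gathered by the immediately previous policy."""
    policies = [initial_policy]
    for _ in range(num_steps):
        data = collect(policies[-1])
        policies.append(update(policies[-1], data))
    return policies

def aggregated_sequence(initial_policy, update, collect, num_steps):
    """Each update uses the data gathered by all previous policies."""
    policies = [initial_policy]
    history = []
    for _ in range(num_steps):
        history.extend(collect(policies[-1]))
        policies.append(update(policies[-1], list(history)))
    return policies

# Toy stand-ins: a "policy" is just an integer index, every policy gathers
# two examples, and the update rule records how many examples it was given.
def collect(policy):
    return [(policy, i) for i in range(2)]

def make_recording_update(sizes):
    def update(policy, dataset):
        sizes.append(len(dataset))
        return policy + 1
    return update

iterative_sizes, aggregated_sizes = [], []
iterative_sequence(0, make_recording_update(iterative_sizes), collect, 3)
aggregated_sequence(0, make_recording_update(aggregated_sizes), collect, 3)
```

In this toy run the iterative update always sees a dataset of constant size, while the aggregated update sees a dataset that grows with every step. In both cases, all data were gathered under earlier policies.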
As described in Section 2, deterministic threshold rules are policies that rely on an underlying predictive model of the conditional distribution . This means that we can consider the sequence of predictive models instead of the sequence of policies and use any supervised learning algorithm for the update rule. If the datasets were i.i.d. samples from the true distribution, i.e., and the update rule were based on a class of learning algorithms such as empirical risk minimization, standard learning theory with asymptotic convergence results would apply (Vapnik, 1998). Unfortunately, these guarantees do not hold in our setting where .
Now, we will prove that, in certain situations, sequential supervised learning of a predictive model to construct a sequence of deterministic threshold rules fails to recover the optimal policy despite it being in the hypothesis class.
First, note that any deterministic threshold policy is fully characterized by the sets for , i.e., we can partition the space into positive and negative decisions. Then, we say an update rule is nonexploring on iff . Intuitively, this means that no individual who has received a negative decision under the old policy would have received a positive decision under the new policy after updating on . Notably, common learning algorithms for classification, such as gradient-boosted trees, are error based, i.e., they only change the decision function if they make errors on the training set. As a result, they lead to nonexploring update rules on whenever the error is zero, i.e., .
Now, assume we start with a faulty predictive model that assigns a probability of a positive label above the threshold only to a subset of the population whose true probability is close to one. In particular, let us assume this is a strict subset of . Then, with high probability, we will exclusively observe positive labels in the data gathered by such a harsh policy, i.e., our current policy will make no errors on the gathered data even when aggregating over multiple time steps. For example, such a policy may be the one that only gives loans to those who will almost always repay them. The resulting update rule is nonexploring on the gathered data with high probability and will therefore never converge to .
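The failure mode described above can be reproduced in a few lines. The simulation below is a toy sketch under stated assumptions and is not the paper's experiment: the scalar feature, the piecewise ground truth in `true_positive_rate`, the perceptron-style `error_based_update`, and the assumed decision cost of 0.5 (under which accepting everyone with x >= 0.3 would be optimal) are all hypothetical.

```python
import random

def true_positive_rate(x):
    """Hypothetical ground truth P(y = 1 | x) for a scalar feature x in [0, 1)."""
    if x >= 0.9:
        return 1.0
    return 0.8 if x >= 0.3 else 0.1

def collect(threshold, num_individuals, rng):
    """Apply the threshold rule; labels are observed only for positive decisions."""
    data = []
    for _ in range(num_individuals):
        x = rng.random()
        if x >= threshold:  # positive decision, so the label is revealed
            y = 1 if rng.random() < true_positive_rate(x) else 0
            data.append((x, y))
    return data

def error_based_update(threshold, data, step_size=0.05):
    """Perceptron-style rule: the threshold moves only on misclassified examples."""
    for x, y in data:
        predicted = 1 if x >= threshold else 0
        if predicted == 1 and y == 0:
            threshold += step_size  # false positive: become stricter
        elif predicted == 0 and y == 1:
            threshold -= step_size  # false negative: become more lenient
    return threshold

rng = random.Random(0)
threshold = 0.9  # overly strict start; under the assumed cost, 0.3 would be optimal
history = []
for _ in range(20):
    history.extend(collect(threshold, 200, rng))        # aggregate over all steps
    threshold = error_based_update(threshold, history)  # never sees an error
```

Every observed example lies above the strict threshold and carries a positive label, so the learner records zero training error and the threshold stays at its initial value, even though most of the individuals it rejects would have had a positive outcome.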
Proposition 3.
Let be a sequential policy learning task, where are deterministic threshold policies based on a class of predictive models, and let the initial policy be more strict than the optimal one, i.e., . Finally, assume that once we have , we set for all . Then, if is nonexploring on with probability at least for , then
Proof.
At each step, we have
By the assumption that , we recursively get which concludes the proof. ∎
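How per-step guarantees of this kind accumulate over time can be illustrated with a generic union-bound argument. The notation here is assumed for illustration and is not taken from the statement above: $E_t$ denotes the event that the update at step $t$ is nonexploring, and $\delta_t$ bounds the probability of its complement, $\Pr[E_t^{c}] \le \delta_t$. The exact form of the bound in the proposition may differ.

```latex
\Pr\Big[\bigcap_{t=0}^{T-1} E_t\Big]
  = 1 - \Pr\Big[\bigcup_{t=0}^{T-1} E_t^{c}\Big]
  \;\ge\; 1 - \sum_{t=0}^{T-1} \Pr\big[E_t^{c}\big]
  \;\ge\; 1 - \sum_{t=0}^{T-1} \delta_t .
```

In words, if the failure probabilities $\delta_t$ sum to a small number, then with high probability no update up to step $T$ ever extends the set of positive decisions, so a policy that starts out too strict remains too strict.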
Corollary 1.
A deterministic threshold policy with under will never converge to under an error-based learning algorithm for the underlying predictive model.
Proof.
Since error-based learning algorithms lead to nonexploring policies whenever , using the assumption , we can use Proposition 3 with for all . ∎