Controlling for Unobserved Confounds in Classification
Using Correlational Constraints
Abstract
As statistical classifiers become integrated into real-world applications, it is important to consider not only their accuracy but also their robustness to changes in the data distribution. In this paper, we consider the case where there is an unobserved confounding variable $Z$ that influences both the features $X$ and the class variable $Y$. When the influence of $Z$ changes from training to testing data, we find that the classifier accuracy can degrade rapidly. In our approach, we assume that we can predict the value of $Z$ at training time with some error. The prediction for $Z$ is then fed to Pearl’s backdoor adjustment to build our model. Because of the attenuation bias caused by measurement error in $Z$, standard approaches to controlling for $Z$ are ineffective. In response, we propose a method to properly control for the influence of $Z$ by first estimating its relationship with the class variable $Y$, then updating predictions for $Z$ to match that estimated relationship. By adjusting the influence of $Z$, we show that we can build a model that exceeds competing baselines on accuracy as well as on robustness over a range of confounding relationships.
Virgile Landeiro and Aron Culotta Department of Computer Science Illinois Institute of Technology Chicago, IL 60616 vlandeir@hawk.iit.edu, aculotta@iit.edu
Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
1 Introduction
Statistical classifiers have become widely used to inform important decisions such as whether to approve a loan (?), hire a job candidate (?), or release a criminal defendant on bond (?). Given the significant real-world consequences of such decisions, it is critical that we can identify and remove sources of systematic bias in classification algorithms. For example, some evidence suggests that existing criminal recidivism models may be racially biased (Angwin et al. 2016).
One important type of classifier bias arises from confounding variables. A confounder is a variable $Z$ that is correlated both with the input variables (or features) $X$ and the target variable (or label) $Y$ of a classifier. When $Z$ is not included in the model, the true relationship between $X$ and $Y$ can be improperly estimated; in the social sciences (originally in econometrics) this is called omitted variable bias. While omitted variable bias is a core focus of social science (?), it has received much less attention in machine learning communities, where prediction accuracy is the main concern. Confounding variables can be particularly problematic in high-dimensional settings, such as text classification, where models may contain thousands or millions of parameters, making manual inspection of models impractical. The common use of text classification in computational social science applications (?) further adds to the urgency of the problem.
Several public health studies have tracked influenza rates in the USA by using Twitter as a sensor (?). These studies demonstrated that machine learning offers more accurate, less expensive, and faster tracking methods than those currently used by the CDC. De Choudhury, Counts, and Horvitz (2013) built models to predict postpartum changes in emotion and behavior using Twitter data and identified mothers who will change significantly following childbirth with an accuracy of 71%, using observations about their prenatal behavior. In a more recent study, ? collected data from Yik Yak, an anonymous social network popular among students, to study anonymous health issues and substance use on college campuses (?). The results of these studies are encouraging for the field of computational social science, but only a few of them take into account the effect of possible confounders. A growing body of work tries to mitigate the effect of observed confounding variables using causal inference techniques. For instance, Cunha, Weber, and Pappa (2017) use a matching approach for causal inference to estimate the effect of online support on weight loss using data from Reddit, and De Choudhury et al. (2016) leverage propensity score matching to detect users who transition from posting about mental health concerns to posting about suicidal ideation on Reddit. In this paper, we wish to provide methods for researchers in computational social science to conduct observational studies while controlling for confounding variables even though these might not be directly observed.
In recent work (?), a text classification algorithm was proposed based on Pearl’s backdoor adjustment (?) as a framework for prediction that controls for an observed confounding variable. It was found that this approach results in classifiers that are significantly more robust to shifts in the relationship between confounder $Z$ and class label $Y$. However, an important limitation of this prior work is that it assumes that a training set is available in which every instance is annotated for both class label $Y$ and confounder $Z$. This is problematic because there are many confounders we may want to control for (e.g., income, age, gender, race/ethnicity) that are rarely available and difficult for humans to label, particularly in addition to the primary label $Y$.
A natural solution is to build statistical classifiers for confounders $Z$, and use the predicted values of $Z$ to control for these confounders. However, the measurement error of $Z$ introduces attenuation bias (?) in the backdoor adjustment, resulting in classifiers that are still confounded by $Z$.
In this paper, we present a classification algorithm based on Pearl’s backdoor adjustment to control for an unobserved confounding variable. Our approach assumes we have a preliminary classifier that can predict the value of the confounder $Z$, and that we have an estimate of the error rate of this classifier. We offer two methods to adjust for the mislabeled $Z$ to improve the effectiveness of backdoor adjustment. A straightforward approach is to remove training instances for which the confidence of the predicted label for $Z$ is too low. While we do find this approach can reduce attenuation bias, it must discard many training examples, degrading the classifier. Our second approach instead uses the error rate of the $Z$ classifier to estimate the correlation between $Z$ and $Y$ in the training set. The assignment to $Z$ is then optimized to match this estimated correlation, while also maximizing classification accuracy. We compare our methods on two real-world text classification tasks: predicting the location of a Twitter user and predicting whether a Twitter user is a smoker. Both prediction tasks use users’ tweets as input data and are confounded by gender. The resulting model exhibits significant improvements in both accuracy and robustness, with some settings producing results similar to fully observed backdoor adjustment.
2 Related Work
In the machine learning field, selection bias has received some attention (?; ?). It arises when the population of a study is not selected randomly. Instead, some users are more inclined to be selected for the study than others, making it more difficult to draw conclusions about the general population. If $S$ denotes whether or not an element of the population is selected, selection bias is present when selection is not independent of the variables under study (e.g., $P(S = 1 \mid X, Y) \neq P(S = 1)$). Dataset shift (?) is a similar issue that appears when the joint distribution of features and labels changes between the training dataset and the testing dataset (i.e., $P_{train}(X, Y) \neq P_{test}(X, Y)$). Covariate shift (?; ?) is a specific case of dataset shift in which only the input distribution differs from training to testing (i.e., $P_{train}(X) \neq P_{test}(X)$ while $P_{train}(Y \mid X) = P_{test}(Y \mid X)$). Similarly, when the underlying target distribution changes over time, either suddenly or gradually, this is called concept drift (?; ?). Recent work has studied “fairness” in machine learning (?; ?) as well as attempted to remove features that introduce bias (?; ?). ? (?) propose an extension of backdoor adjustment to deal with measurement error in the confounder, but it does not scale well when $X$ is high dimensional, as in our setting of text classification.
Although accounting for all of these types of bias is important for conducting a valid observational study, in this paper we direct our attention to the problem of learning under confounding bias shift. In other words, we aim to build a classifier that is robust to changes in the relation between the target variable $Y$ of a classifier and an external confounding variable $Z$. ? (?) use backdoor adjustment for text classification, but assume confounders are observed at training time. This paper introduces methods to enable backdoor adjustment to work effectively when confounders are unobserved and when the features are high dimensional.
3 Methods
In this section, we first review prior work using backdoor adjustment to control for observed confounders in text classification. We then introduce two methods for applying backdoor adjustments when the confounder is unobserved at training time and must instead be predicted by a separate classifier.
3.1 Adjusting for observed confounders
Suppose one wishes to estimate the causal effect of a variable $X$ on a variable $Y$ when a randomized controlled trial is not possible. If a sufficient set of confounding variables $Z$ is available, one can use the backdoor adjustment equation as follows:
$$p(y \mid do(x)) = \sum_{z \in Z} p(y \mid x, z)\, p(z) \tag{1}$$
The backdoor criterion (?) is a graphical test that determines whether $Z$ is a sufficient set of variables to estimate the causal effect. This criterion requires that no node in $Z$ is a descendant of $X$ and that $Z$ blocks every path between $X$ and $Y$ that contains an arrow pointing to $X$. Notice that $p(y \mid do(x)) \neq p(y \mid x)$: this do-notation is used in causal inference to indicate that an intervention has been made on $X$. Omitting the predicted confounder $\hat{Z}$, our model reduces to a standard discriminative approach to classification, e.g., modeling $p(y \mid x)$ with a logistic regression classifier conditioned on the observed term vector $x$. We assume that the confounder $Z$ influences both the term vector through $p(x \mid z)$ as well as the target label through $p(y \mid z)$. The structure of this model ensures that $Z$ meets the backdoor criterion for adjustment.
Backdoor adjustment was originally introduced for causal inference problems, i.e., to estimate the causal effect of performing action $X$ on outcome $Y$. Recently, ? (?) have shown that backdoor adjustment can also be used to improve classification robustness. By controlling for a confounder $Z$, the resulting classifier is robust to changes in the relationship between $Z$ and $Y$.
From the perspective of standard supervised classification, the approach works as follows: Assume we are given a training set $D = \{(x_i, y_i)\}_{i=1}^{n}$. If we suspect that a classifier trained on $D$ is confounded by some additional variable $Z$, we augment the training set by including $z$ as a feature for each instance: $D' = \{(x_i, y_i, z_i)\}_{i=1}^{n}$. We then fit a classifier on $D'$, and at testing time apply Equation 1 to classify new examples, $p(y \mid x) = \sum_{z} p(y \mid x, z)\, p(z)$, where $p(z)$ is simply computed from the observed frequencies of $z$ in $D'$. By controlling for the effect of $Z$, the resulting classifier is robust to the case where $p(y \mid z)$ changes from training to testing data.
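The test-time use of Equation 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: `confounder_prior`, `backdoor_predict`, and `toy_model` are hypothetical names, and `toy_model` is a hard-coded stand-in for a classifier fitted on the augmented training set.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, Sequence


def confounder_prior(z_train: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Estimate p(z) from the observed frequencies of z in the training set."""
    counts = Counter(z_train)
    total = len(z_train)
    return {z: c / total for z, c in counts.items()}


def backdoor_predict(
    x: object,
    p_y_given_xz: Callable[[object, Hashable], float],
    p_z: Dict[Hashable, float],
) -> float:
    """Equation 1: p(y=1 | do(x)) = sum over z of p(y=1 | x, z) * p(z)."""
    return sum(p_y_given_xz(x, z) * pz for z, pz in p_z.items())


def toy_model(x: object, z: Hashable) -> float:
    """Hypothetical stand-in for a fitted p(y=1 | x, z); numbers are illustrative."""
    table = {("a", 0): 0.2, ("a", 1): 0.6, ("b", 0): 0.7, ("b", 1): 0.9}
    return table[(x, z)]


p_z = confounder_prior([0, 0, 0, 1])            # p(z=0) = 0.75, p(z=1) = 0.25
score = backdoor_predict("a", toy_model, p_z)   # 0.2 * 0.75 + 0.6 * 0.25
```

Because the sum runs over every value of the confounder weighted by its marginal frequency, the prediction does not depend on which value of z the test instance actually has.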
In the experiments below, we consider the problem of predicting a user’s location $Y$ based on the text of their tweets $X$, confounded by the user’s gender $Z$. That is, in the training data, there exists a correlation between gender and location, but we want the classifier to ignore that correlation. When the above procedure is applied to a logistic regression classifier, the result is that the magnitudes of coefficients for terms that correlate with gender are greatly reduced, thereby minimizing the effect of gender on the classifier’s predictions.
3.2 Adjusting for unobserved confounders
In the previous approach, it was assumed that we had access to a training set $D = \{(x_i, y_i, z_i)\}_{i=1}^{n}$; that is, each instance is annotated both for the label $Y$ and confounder $Z$. This is a burdensome assumption, given that ultimately we will need to control for many possible confounders (e.g., gender, race/ethnicity, age, etc.). Because many of these confounders are unobserved and/or difficult to obtain, it is necessary to develop adjustment methods that can handle noise in the assignment to $Z$ in the training data.
Our approach assumes we have an (imperfect) classifier for $Z$, trained on a secondary training set $D_z = \{(x_i, z_i)\}$; we call this the preliminary study, with the resulting preliminary classifier $p(z \mid x)$. This is combined with the dataset $D = \{(x_i, y_i)\}$, used to train the primary classifier $p(y \mid x)$. The advantage of allowing for separate training sets $D_z$ and $D$ is that it is often easier to annotate variables for some users than others; for example, ? (?) build training data for ethnicity classification by searching for online users that explicitly state their ethnicity in their user profiles.
After training on $D_z$, the preliminary classifier is applied to $D$ to augment it with predicted annotations for confounder $Z$: $D' = \{(x_i, y_i, \hat{z}_i)\}$, where $\hat{z}_i$ denotes the predicted value of $z_i$. A tempting approach is to simply apply backdoor adjustment as usual to this dataset, ignoring the noise introduced by $\hat{z}$. However, the resulting classifier will no longer properly control for the confounder $Z$ for at least two related reasons:

1. The observed correlation between $\hat{z}$ and $y$ in the training data will underestimate the actual correlation between $z$ and $y$ (i.e., $|r_{obs}| < |r_{true}|$). This attenuation bias reduces the coefficients for the $z$ features, which in turn prevents backdoor adjustment from reducing the coefficients of features in $x$ that correlate with $z$.

2. Similarly, because some training instances have mislabeled annotations for $z$, it is more difficult to detect which features in $x$ correlate with $z$, thereby preventing backdoor adjustment from reducing those coefficients.
To verify this claim, we conduct an experiment in which we observe $Z$ but we inject increasing amounts of noise in $Z$ (e.g., with probability $p$, change the assignment to $z_i$ to be incorrect). In other words, we synthetically decrease the quality of our observations of $Z$ and we observe how that influences the performance of backdoor adjustment. We then measure how the accuracy of the primary classifier for $Y$ varies on a testing set in which the influence of $Z$ is decreased (i.e., $Z$ correlates strongly with $Y$ in the training set, but only weakly in the testing set). These experiments will be discussed in more detail in Section 4.
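The noise-injection step described above can be sketched as follows. This is an illustrative sketch under the assumption that the confounder is binary (0/1) and that each label is flipped independently; the function name `inject_label_noise` and the fixed seed are hypothetical choices, not taken from the paper.

```python
import random
from typing import List, Sequence


def inject_label_noise(z: Sequence[int], noise: float, seed: int = 0) -> List[int]:
    """Flip each binary label in z independently with probability `noise`."""
    if not 0.0 <= noise <= 1.0:
        raise ValueError("noise must be in [0, 1]")
    rng = random.Random(seed)  # local generator so the result is reproducible
    return [1 - zi if rng.random() < noise else zi for zi in z]


z_true = [0, 1] * 500
z_noisy = inject_label_noise(z_true, noise=0.2)

# Fraction of labels that were actually changed; close to the requested noise level.
flip_rate = sum(a != b for a, b in zip(z_true, z_noisy)) / len(z_true)
```

Repeating this for increasing noise levels and retraining the adjusted classifier each time gives the kind of degradation curve the experiment measures.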
We can see in Figure 1 that the F1 score quickly decreases as we add more noise to the confounding variable annotations, indicating the need for new methods to adjust for unobserved confounders. Notice that when noise is 0, backdoor adjustment greatly improves F1 (from .79 F1 with no adjustment to .85 F1), demonstrating the effectiveness of this approach when the confounder is observed at training time. In the following two sections, we propose two methods to fix these issues.
Noise        0.00   0.05   0.10   0.15   0.20
F1 std dev   0.028  0.037  0.052  0.056  0.062
Thresholding on confidence of predictions
Our first approach is fairly simple; its objective is to directly reduce the number of mislabeled annotations in $D'$. Our preliminary model produces the value $\hat{z}_i$ (the prediction of the true confounder $z_i$) as well as $p(\hat{z}_i \mid x_i)$ (the confidence of the prediction; i.e., the posterior distribution over $z$). We use these posteriors to remove predictions with low confidence. By setting a threshold $t$, we filter the original dataset by keeping an instance only if it satisfies $p(\hat{z}_i \mid x_i) \geq t$.
For well-calibrated classifiers like logistic regression, we expect to remove mostly mislabeled data points by thresholding on this confidence. Varying $t$ allows us to modify the output of the preliminary study in order to obtain a subset of the data with as many points correctly labeled as possible. Moreover, when the error of our preliminary classifier is symmetric, this process will also move the estimated correlation $r_{obs}$ towards the true correlation $r_{true}$.
With this smaller set of training instances, we run backdoor adjustment without modification. However, one important drawback of this method is that we remove instances from our training dataset. Depending on the quality of the preliminary classifier and the setting of $t$, only a small fraction of training instances may remain. Thus, in the next section we consider an alternative approach that does not require discarding training instances.
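As a rough illustration, the thresholding filter amounts to a single pass over the augmented dataset. The tuple layout `(x, y, z_hat, confidence)` and the name `threshold_filter` are assumptions made for this sketch, and the confidence values are made up.

```python
from typing import List, Sequence, Tuple

# One training instance: (x, y, z_hat, confidence of the z_hat prediction).
Instance = Tuple[object, int, int, float]


def threshold_filter(data: Sequence[Instance], t: float) -> List[Instance]:
    """Keep only instances whose predicted confounder has confidence >= t."""
    return [row for row in data if row[3] >= t]


data = [
    ("u1", 1, 1, 0.95),
    ("u2", 0, 1, 0.55),
    ("u3", 0, 0, 0.80),
    ("u4", 1, 0, 0.62),
]
kept = threshold_filter(data, t=0.8)
fraction_kept = len(kept) / len(data)  # the cost of a stricter threshold
```

Tracking `fraction_kept` alongside the threshold makes the trade-off explicit: a higher threshold gives cleaner confounder labels but leaves fewer instances for the primary classifier.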
Correlation matching
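While the above approach aims to reduce errors in $\hat{z}$, and as a side effect improves the estimate of $r_{true}$, in this section we propose an approach that directly tries to improve the estimate of $r_{true}$ while also reducing errors in $\hat{z}$. Let $r_{obs}$ be the observed correlation between $y$ and $\hat{z}$, and let $r_{true}$ be the true (unobservable) correlation between $y$ and $z$ in the training data for $y$, $D$. Our proposed approach builds on the insight of Francis, Coats, and Gibson (1999), who show that $r_{true}$ can be estimated from $r_{obs}$ using the variances of $y$ and $z$ as well as the variances of the errors in $y$ and $z$:
$$r_{obs} = \frac{r_{true}}{\sqrt{\left(1 + \frac{\sigma^2_{\epsilon_y}}{\sigma^2_y}\right)\left(1 + \frac{\sigma^2_{\epsilon_z}}{\sigma^2_z}\right)}} \tag{2}$$
where $\sigma^2_y$ is the variance of $y$, and $\sigma^2_{\epsilon_y}$ is the variance of error on $y$, and analogously for $\sigma^2_z$, $\sigma^2_{\epsilon_z}$. Since in our setting $y$ is observed, we can set $\sigma^2_{\epsilon_y} = 0$ and solve for $r_{true}$:
$$r_{obs} = \frac{r_{true}}{\sqrt{1 + \frac{\sigma^2_{\epsilon_z}}{\sigma^2_z}}} \tag{3}$$
$$r_{true} = r_{obs} \sqrt{1 + \frac{\sigma^2_{\epsilon_z}}{\sigma^2_z}} \tag{4}$$
Thus, the factor by which $r_{obs}$ underestimates $r_{true}$ grows with the ratio of the variance of the error in $z$ to the variance of $z$.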
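We can estimate the terms $\sigma^2_{\epsilon_z}$ and $\sigma^2_z$ using cross-validation on the preliminary training data $D_z$. Let $\hat{z}_i$ be the value predicted by the preliminary classifier on instance $i$, where $i$ is in the testing fold of one cross-validation split of the data. Let $\epsilon_i = |z_i - \hat{z}_i|$ be the absolute error of $\hat{z}_i$ on instance $i$. Then, we can first compute the mean absolute error of $\hat{z}$ as $\bar{\epsilon} = \frac{1}{|D_z|} \sum_{i \in D_z} \epsilon_i$. The estimated variance of the errors in $\hat{z}$ is then:
$$\hat{\sigma}^2_{\epsilon_z} = \frac{1}{|D_z|} \sum_{i \in D_z} (\epsilon_i - \bar{\epsilon})^2 \tag{5}$$
Since this variance in the error of $\hat{z}$ in turn affects the observed variance of $\hat{z}$, we can then estimate
$$\hat{\sigma}^2_z = \sigma^2_{\hat{z}} - \hat{\sigma}^2_{\epsilon_z} \tag{6}$$
where $\sigma^2_{\hat{z}}$ is the variance of predictions in the target training data $D$.
Plugging the estimates of Equations 5 and 6 into Equation 4 enables us to estimate the true correlation between $y$ and $z$ in the target training data $D$. We will refer to this estimated correlation as $r_{est}$.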
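The estimation procedure can be sketched numerically. The sketch below assumes the classical additive measurement-error correction, $r_{est} = r_{obs}\sqrt{1 + \sigma^2_{\epsilon_z} / \sigma^2_z}$, with the error variance taken as the variance of held-out absolute errors and $\sigma^2_z$ approximated by the prediction variance minus the error variance. The function names, the clipping to $[-1, 1]$, and the toy numbers are assumptions of this sketch, not details confirmed by the paper.

```python
from math import sqrt
from statistics import pvariance
from typing import Sequence


def error_variance(z_true: Sequence[int], z_pred: Sequence[int]) -> float:
    """Variance of the absolute errors |z - z_hat| on held-out cross-validation folds."""
    abs_err = [abs(a - b) for a, b in zip(z_true, z_pred)]
    return pvariance(abs_err)


def estimate_true_correlation(r_obs: float, var_err: float, var_pred: float) -> float:
    """Correct an attenuated correlation for noise in the predicted confounder.

    var_pred is the variance of the predictions z_hat in the target data; the
    variance of the true z is approximated as var_pred - var_err.
    """
    var_z = var_pred - var_err
    if var_z <= 0:
        raise ValueError("error variance must be smaller than prediction variance")
    r_est = r_obs * sqrt(1.0 + var_err / var_z)
    return max(-1.0, min(1.0, r_est))  # keep the estimate a valid correlation


# Held-out predictions with 2 errors out of 10 (a 20% error rate).
z_cv_true = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
z_cv_pred = [0, 1, 0, 1, 0, 0, 1, 1, 1, 0]
var_err = error_variance(z_cv_true, z_cv_pred)          # 0.2 * 0.8 = 0.16
r_est = estimate_true_correlation(r_obs=0.48, var_err=var_err, var_pred=0.25)
```

For a balanced binary confounder with a symmetric error rate $e$, this correction reduces to dividing $r_{obs}$ by $1 - 2e$, which is a convenient sanity check on the inputs.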
As an example, consider a dataset $D$. The observed correlation $r_{obs}$ may be .5, but the true correlation $r_{true}$ may be .8. Depending on the variances of $\hat{z}$ and its error, Equation 4 yields an estimate $r_{est}$ of $r_{true}$. The next step in the procedure is to optimize the assignment to $\hat{z}$ to minimize the difference $|r_{est} - r_{obs}|$. That is, we use $r_{est}$ as a soft constraint, and attempt to match that constraint by changing the assignments to $\hat{z}$.
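Let $\mathcal{Z}$ be the set of all possible assignments to $z$ in the training set (i.e., if $z$ is a binary variable and $|D| = n$, then $|\mathcal{Z}| = 2^n$). Let $\mathbf{z}' \in \mathcal{Z}$ be a vector of assignments to $z$, and let $r_{\mathbf{z}'}$ indicate the correlation between $\mathbf{z}'$ and $y$. Then our objective is to choose an assignment from $\mathcal{Z}$ to minimize $|r_{est} - r_{\mathbf{z}'}|$, while still maximizing the probability of that assignment according to the preliminary classifier for $z$. We can write this objective as follows:
$$\mathbf{z}^* = \arg\max_{\mathbf{z}' \in \mathcal{Z}} \; \frac{1}{n} \sum_{i=1}^{n} p(z'_i \mid x_i) - |r_{est} - r_{\mathbf{z}'}| \tag{7}$$
Thus, we search for an optimal assignment $\mathbf{z}^*$ that maximizes the average posterior of the predicted value, while minimizing the difference between the estimated correlation $r_{est}$ and the observed correlation $r_{\mathbf{z}'}$.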
This optimization problem can be approached in several ways. We implement a greedy hill-climbing algorithm that iterates through the values in $\mathbf{z}'$ sorted by confidence and flips the value if it reduces $|r_{est} - r_{\mathbf{z}'}|$. The steps are as follows:

1. Initialize $\mathbf{z}'$ to the most probable assignment according to $p(z \mid x)$.
2. Initialize $I$ to be all instances sorted from least to most confident, i.e., in descending order of $1 - p(z'_i \mid x_i)$.
3. While $|r_{est} - r_{\mathbf{z}'}|$ is decreasing:
   (a) Pop the next instance $i$ from $I$.
   (b) If flipping the label $z'_i$ reduces the error $|r_{est} - r_{\mathbf{z}'}|$, do so. Else, skip to the next instance.
4. Return the final $\mathbf{z}'$.
For example, consider the case where $r_{est} > r_{\mathbf{z}'}$. If the instance popped in step 3(a) has labels ($y_i = 1$, $z'_i = 0$), then we know that flipping $z'_i$ to 1 would increase the correlation between $\mathbf{z}'$ and $y$. By considering flips in descending order of $1 - p(z'_i \mid x_i)$, we ensure that we first flip assignments that are likely to be incorrect. In the experiments below, we find that this approach often converges after a relatively small number of flips.
The advantages of this approach are that it not only produces assignments to $z$ that better align with the expected correlation $r_{est}$, but it also results in more accurate assignments to $z$. The latter is possible because we are using prior knowledge about the relationship between $z$ and $y$ to assign values of $z$ when the classifier is uncertain. As with the thresholding approach of the previous section, once the new assignments to $z$ are found, backdoor adjustment is run without modification.
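A compact sketch of the greedy search may help make the procedure concrete. It is a simplification under stated assumptions: the confounder is binary, instances are visited once from least to most confident, and each flip is kept only if it strictly reduces the gap to the estimated correlation. The names `pearson` and `correlation_matching` and the toy inputs are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def correlation_matching(y: Sequence[int], p_z1: Sequence[float], r_est: float) -> List[int]:
    """Greedily flip uncertain confounder labels while |r_est - r_z'| shrinks.

    y     : observed binary class labels
    p_z1  : preliminary classifier posteriors p(z=1 | x), one per instance
    r_est : estimated true correlation between y and z
    """
    z = [1 if p >= 0.5 else 0 for p in p_z1]                 # most probable assignment
    order = sorted(range(len(z)), key=lambda i: abs(p_z1[i] - 0.5))  # least confident first
    err = abs(r_est - pearson(y, z))
    for i in order:
        z[i] = 1 - z[i]                                      # tentatively flip
        new_err = abs(r_est - pearson(y, z))
        if new_err < err:
            err = new_err                                    # keep the flip
        else:
            z[i] = 1 - z[i]                                  # undo and move on
    return z


y = [1, 1, 1, 1, 0, 0, 0, 0]
p_z1 = [0.9, 0.8, 0.45, 0.4, 0.1, 0.2, 0.55, 0.6]
z_adj = correlation_matching(y, p_z1, r_est=0.5)
```

In this toy run the initial assignment has zero correlation with `y`; flipping the two least confident predictions is enough to reach the target, after which no further flip improves the match. Recomputing the correlation after every candidate flip is quadratic in the number of instances, so a practical version would update the sums incrementally.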
4 Experiments
We conducted text classification experiments in which the relationship between the confounder $Z$ and the class variable $Y$ varies between the training and testing set. We consider the scenario in which we directly control the discrepancy between training and testing. Thus, we can determine how well a confounder has been controlled by measuring how robustly the method performs across a range of discrepancy levels.
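To sample train/test sets with different $P(Y \mid Z)$ distributions, we assume we have labeled datasets $D_{train}$, $D_{test}$, with elements $\{(x_i, y_i, z_i)\}$, where $y_i$ and $z_i$ are binary variables. We introduce a bias parameter $P(y = 1 \mid z = 1) = b$; by definition, $P(y = 0 \mid z = 1) = 1 - b$. For each experiment, we sample without replacement from each set $D_{train}$, $D_{test}$. To simulate a change in $P(Y \mid Z)$, we use different bias terms for training and testing, $b_{train}$, $b_{test}$. We thus sample according to the following constraints: $P_{train}(y = 1 \mid z = 1) = b_{train}$, $P_{test}(y = 1 \mid z = 1) = b_{test}$, $P_{train}(Y) = P_{test}(Y)$, and $P_{train}(Z) = P_{test}(Z)$.
The last two constraints are to isolate the effect of changes to $P(Y \mid Z)$. Thus, we fix $P(Y)$ and $P(Z)$, but vary $P(Y \mid Z)$ from training to testing data. We emphasize that we do not alter any of the actual labels in the data; we merely sample instances to meet these constraints. In the rest of the paper, we denote by $r_{train}$ (respectively $r_{test}$) the correlation between $y$ and $z$ in the training set (resp. testing set). We also denote the discrepancy $\delta = r_{train} - r_{test}$.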
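One way to realize such a constrained sample is to fix the number of instances drawn from each (y, z) cell. The sketch below assumes binary labels and balanced marginals $P(y = 1) = P(z = 1) = 0.5$ by default; `cell_counts`, `biased_sample`, and the synthetic pool are hypothetical names and data used only for illustration.

```python
import random
from typing import Dict, List, Sequence, Tuple

Row = Tuple[object, int, int]  # (x, y, z) with binary y and z


def cell_counts(n: int, b: float, p_y: float = 0.5, p_z: float = 0.5) -> Dict[Tuple[int, int], int]:
    """Instances needed per (y, z) cell so that P(y=1 | z=1) = b with fixed marginals."""
    n11 = round(n * p_z * b)        # y=1, z=1
    n01 = round(n * p_z) - n11      # y=0, z=1
    n10 = round(n * p_y) - n11      # y=1, z=0
    n00 = n - n11 - n01 - n10       # y=0, z=0
    counts = {(1, 1): n11, (0, 1): n01, (1, 0): n10, (0, 0): n00}
    if min(counts.values()) < 0:
        raise ValueError("bias b is infeasible for the requested marginals")
    return counts


def biased_sample(pool: Sequence[Row], n: int, b: float, seed: int = 0) -> List[Row]:
    """Sample n rows without replacement so that the sample meets the constraints."""
    rng = random.Random(seed)
    sample: List[Row] = []
    for (y, z), k in cell_counts(n, b).items():
        cell = [row for row in pool if row[1] == y and row[2] == z]
        sample.extend(rng.sample(cell, k))  # raises ValueError if the cell is too small
    return sample


# Synthetic pool with 100 instances in each (y, z) cell.
pool = [(f"u{y}{z}_{i}", y, z) for y in (0, 1) for z in (0, 1) for i in range(100)]
train = biased_sample(pool, n=100, b=0.9)
```

Calling `biased_sample` with different values of `b` on separate training and testing pools changes $P(Y \mid Z)$ while leaving $P(Y)$ and $P(Z)$ fixed, and no label is ever altered.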
4.1 Datasets
Location / Gender
For our first dataset, we use the data from ? (?), where the task is to predict the location of a Twitter user from their messages, with gender as a potential confounder. Thus, $X$ is a term vector, $Y$ is location, and $Z$ is gender. The data contain geolocated tweets from New York City (NYC) and Los Angeles (LA). There are 246,930 tweets for NYC and 218,945 for LA over a four-day period (June 15th to June 18th, 2015). Gender labels are derived by cross-referencing the user’s name (from the profile) with U.S. Census name data, removing ambiguous names. For each user, we have up to the most recent 3,200 tweets, which we represent as a single binary unigram vector per user, using standard tokenization. Finally, we subsample this collection and keep the tweets from 6,000 users such that gender and location are uniformly distributed over the users.
Smoker / Gender
In our second dataset, the task is to predict whether a Twitter user is a smoker, with gender as a potential confounder. We start from approx. 3M tweets collected in January and February 2014 using cigarette-related keywords. We randomly pick 40K tweets for which we can identify the user’s gender using the Twitter screen name and the U.S. Census name data. We then manually annotate 4.5K of these tweets on whether they show that a user is a smoker (yes) or a non-smoker (no) while discarding uncertain tweets (unknown). We use this data to train a classifier (F1 score = 0.84) to label the remaining 35.5K tweets on the smoker dimension. In order to avoid mislabeled tweets as much as possible, we only keep predictions with a confidence of at least 95%, yielding an additional 5.5K automatically labeled tweets. These 10K (4.5K manually annotated + 5.5K automatically annotated) tweets were written by 9K unique users. For each of these users, we collect the most recent tweets (up to 200). Because some users set their profile to be private and some users that existed in early 2014 have since deleted their account, we obtain at least 20 tweets for only 4.6K users. Then we collect all the cigarette-related tweets published by a user in the first two months of 2014 and add them to our dataset. Finally, we balance the dataset on both annotated dimensions by removing users and eventually obtain a dataset of 4084 users.
5 Results
We use the following notation to describe the results below:

$\delta$ is the discrepancy between the correlation of $y$ and $z$ in training versus testing.

$r_{true}$ (respectively $r_{obs}$) is the true (resp. observed) correlation between $y$ and $z$.

$r_{thresh}$ (respectively $r_{match}$) is $r_{obs}$ after it has been adjusted using the thresholding method (resp. the correlation matching method).

$F1_z$ (respectively $F1_y$) is the F1 score for a classifier of $z$ (resp. $y$), i.e. for the preliminary (resp. main) study.
5.1 Effects of correlation adjustments on $F1_z$
For this first part of our results, we obtain nearly identical outcomes for both datasets. Therefore, we only present the results from the location/gender dataset.
Thresholding method: We vary the threshold $t$ and observe how this reduces the difference between $r_{true}$ and $r_{thresh}$. Figure 2(a) shows the result of one setting. The figure demonstrates that by increasing $t$, $r_{thresh}$ gets closer to the true $r_{true}$, and the performance of our preliminary study is improved. This indicates that the classifier is well calibrated (since high confidence predictions are more likely to be correct). However, it takes a high value of $t$ to get a correct approximation of the true association between $y$ and $z$, meaning that we need to discard a large number of data points from our preliminary study to approximate $r_{true}$. For example, for one such value of $t$, roughly half of the training instances remain.
Correlation matching method: For this method, we vary the true correlation $r_{true}$ and plot the results in Figure 2(b). We observe in the top plot that after adjustment, our estimate $r_{match}$ is closer to the true correlation in the worst case than the unadjusted $r_{obs}$. This is a clear improvement in the correlation estimation. (For comparison, achieving a similarly accurate estimate using thresholding requires removing 60% of 1500 instances.) We can also notice that the performance of our preliminary study greatly increases when we improve the estimation of $r_{true}$, particularly when $r_{true}$ is high. For example, when $r_{true}$ is .8, the $F1_z$ improves from .77 to .9, on average. Thus, correlation matching appears both to recover the true correlation and to improve the quality of the classifications of $z$.
5.2 Effects of correlation adjustments on $F1_y$
$F1_z$   No adjustment   Corr. matching   Thresholding
0.784    0.0640          0.0212           0.0610
0.764    0.0674          0.0313           0.0671
0.702    0.0677          0.0357           0.0803
0.670    0.0672          0.0345           0.0783
0.645    0.0705          0.0537           0.101
0.557    0.0715          0.124            0.0954
0.519    0.0709          0.0916           0.0941
Location / Gender
Fixed $F1_z$: As our primary result, we report the $F1_y$ obtained by different correlation adjustment methods across a range of shifts in the discrepancy $\delta$ between training and testing. For the Twitter dataset, we use the preliminary study with the best performance we obtain. We then compare testing $F1_y$ as $r_{train}$ and $r_{test}$ vary. The results are shown in Figure 3(a). Without any adjustment, the performance we get is close to logistic regression. When using thresholding, the performance is slightly improved in the extreme cases but only by a few points at most. However, when using the correlation matching method, we improve $F1_y$ by 10 to 15 points in the most extreme cases. For comparison, the figure also shows the fully observed case (BA), which uses backdoor adjustment on the true values of $z$. We can see that correlation matching is comparable to the fully observed case, even with a 20% error rate on $z$. These results show that by getting a better estimate of the association between $y$ and $z$, we can reduce attenuation bias and improve the robustness of our classifier, even though our observation of $z$ is noisy.
Variable $F1_z$: We showed in the previous section that when we use our best preliminary study, we can build a robust classifier using the correlation matching method combined with backdoor adjustment. We also saw in Figure 1 that backdoor adjustment when $z$ is observed at training time is sensitive to noise in $z$. As a similar study, we want to see how sensitive the correlation adjustment methods are to the quality of $\hat{z}$. To do so, we increasingly add noise to the dataset used to train the preliminary classifier ($D_z$) to make $F1_z$ decrease. Because we want to visualize $F1_y$ against two variables ($\delta$ and $F1_z$), we display the results in a heatmap. In order to make the results clear to the reader, here are additional details to understand what is displayed on the heatmap: The x-axis of a heatmap is $\delta$ and the y-axis is $F1_z$. The line plot on the left of the heatmap shows $F1_y$ given $F1_z$, averaged over all possible values for $\delta$. The error bars are the standard deviations of $F1_y$, indicating how sensitive the model is to variations of $\delta$. Similarly, the scatter plot above the heatmap shows $F1_y$ given $\delta$, averaged over all possible values for $F1_z$. The error bars are the standard deviations of $F1_y$ for the matching $\delta$.
Moreover, Table 2 displays the values of the standard deviations shown in the plot at the left of each heatmap as a measure of robustness. Figure 4(a) shows the heatmap of results for backdoor adjustment when we use the predictions of the preliminary study but none of the methods to fix the mislabeled values in $\hat{z}$ are used. Figures 4(b) and 4(c) respectively show the heatmaps of results when we use thresholding and correlation matching. Similar to Figure 3(a), thresholding brings only a small improvement over no adjustment at all. Furthermore, when $F1_z$ decreases, the correlation adjustment using thresholding performs worse than no correlation adjustment at all and is also less robust. Clearly, the thresholding method is more sensitive to the quality of the preliminary study than the other methods.
The correlation matching method (Figure 2(b)) outperforms the other methods in robustness and $F1_y$ in most cases, except at the lowest values of $F1_z$ (the bottom rows of Table 2), as we can see from the wider range of red values in Figure 4(c). In this latter case, it performs worse than the method without adjustment. This method is also sensitive to the quality of the preliminary study, as we can see that the averaged $F1_y$ decreases with $F1_z$. Recall that we are considering here only preliminary studies with an $F1_z$ of at most .78. Therefore, $F1_z$ could be up to 22 points greater with a different dataset. This would hopefully lead to results with correlation matching that approach the fully observed case, and to better $F1_y$ and robustness with thresholding.
Smoker / Gender
Fixed $F1_z$: As in the previous experiment, we report $F1_y$ while varying $\delta$ as our primary result in Figure 3(b). We observe that predicting whether a user smokes is a much more difficult task than our previous binary location prediction task, as the maximum $F1_y$ is around .75, whereas it was approximately .9 in the previous task. We also notice that the robustness of the backdoor adjustment methods is not as good as for the location/gender dataset. The correlation matching method performs close to BA over part of the range of $\delta$ and outperforms all other methods elsewhere, but we also witness an accuracy drop on the left part of the plot. In addition to this drop, our two most robust methods (BA and correlation matching) are outperformed by approximately 5 points when there is no difference between the training correlation and the testing correlation (when $\delta = 0$).
Variable $F1_z$: When varying $F1_z$ with the smoker/gender dataset, we observe outcomes comparable to those displayed in the heatmaps of Figure 4, but with lower overall accuracy. As backdoor adjustment did not perform as well as with the location/gender dataset in the fixed $F1_z$ case, it unsurprisingly also does not perform as well when $F1_z$ varies. Although we obtain V-shaped heatmaps similar to Figures 4(b) and 4(c), the slope indicating that the classifier’s performance deteriorates when $F1_z$ decreases is steeper. This may show that our adjustment methods are more sensitive to noise in the confounding variable when the classification task is harder overall. For brevity, we do not display the resulting heatmap for the smoker/gender experiment in this paper, but we will make the dataset and the code to reproduce the results available online.
6 Conclusion
In this paper, we have proposed two methods of using backdoor adjustment to control for an unobserved confounder. Using two real-life datasets extracted from Twitter, we have found that correlation matching on the predicted confounder, combined with backdoor adjustment, can recover the underlying correlation and perform close to backdoor adjustment with an observed confounder. We also showed that thresholding can be used to slightly improve the predictions compared to logistic regression. Although thresholding fails to adjust for the unobserved confounder when the preliminary classifier is weak, we showed that correlation matching provides a way to adjust for an unobserved confounder and outperforms plain backdoor adjustment as long as the preliminary classifier is sufficiently accurate. In future work, we will consider hybrid methods that combine thresholding and correlation matching to increase robustness as $F1_z$ decreases.
Acknowledgments
This research was funded in part by the National Science Foundation under awards #IIS-1526674 and #IIS-1618244.
References
 [Angwin et al. 2016] Angwin, J.; Larson, J.; Mattu, S.; and Kirchner, L. 2016. Machine bias. ProPublica 23.
 [Bareinboim, Tian, and Pearl 2014] Bareinboim, E.; Tian, J.; and Pearl, J. 2014. Recovering from selection bias in causal and statistical inference. In Proceedings of The Twenty-Eighth Conference on Artificial Intelligence (CE Brodley and P. Stone, eds.). AAAI Press, Menlo Park, CA.
 [Bickel, Brückner, and Scheffer 2009] Bickel, S.; Brückner, M.; and Scheffer, T. 2009. Discriminative learning under covariate shift. Journal of Machine Learning Research 10(Sep):2137–2155.
 [Chesher 1991] Chesher, A. 1991. The effect of measurement error. Biometrika 78(3):451–462.
 [Cunha, Weber, and Pappa 2017] Cunha, T. O.; Weber, I.; and Pappa, G. L. 2017. A warm welcome matters! the link between social feedback and weight loss in /r/loseit. arXiv preprint arXiv:1701.05225.
 [De Choudhury et al. 2016] De Choudhury, M.; Kiciman, E.; Dredze, M.; Coppersmith, G.; and Kumar, M. 2016. Discovering shifts to suicidal ideation from mental health content in social media. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 2098–2110. ACM.
 [De Choudhury, Counts, and Horvitz 2013] De Choudhury, M.; Counts, S.; and Horvitz, E. 2013. Predicting postpartum changes in emotion and behavior via social media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 3267–3276. ACM.
 [Francis, Coats, and Gibson 1999] Francis, D. P.; Coats, A. J.; and Gibson, D. G. 1999. How high can a correlation coefficient be? effects of limited reproducibility of common cardiological measures. International journal of cardiology 69(2):185–189.
 [Fukuchi, Sakuma, and Kamishima 2013] Fukuchi, K.; Sakuma, J.; and Kamishima, T. 2013. Prediction with model-based neutrality. In Machine Learning and Knowledge Discovery in Databases. Springer. 499–514.
 [Hajian and Domingo-Ferrer 2013] Hajian, S., and Domingo-Ferrer, J. 2013. A methodology for direct and indirect discrimination prevention in data mining. Knowledge and Data Engineering, IEEE Transactions on 25(7):1445–1459.
 [Hand and Henley 1997] Hand, D. J., and Henley, W. E. 1997. Statistical classification methods in consumer credit scoring: a review. Journal of the Royal Statistical Society: Series A (Statistics in Society) 160(3):523–541.
 [King, Keohane, and Verba 1994] King, G.; Keohane, R. O.; and Verba, S. 1994. Designing social inquiry: Scientific inference in qualitative research. Princeton university press.
 [Koratana et al. 2016] Koratana, A.; Dredze, M.; Chisolm, M. S.; Johnson, M. W.; and Paul, M. J. 2016. Studying anonymous health issues and substance use on college campuses with yik yak. In Workshops at the Thirtieth AAAI Conference on Artificial Intelligence.
 [Kuroki and Pearl 2014] Kuroki, M., and Pearl, J. 2014. Measurement bias and effect restoration in causal inference. Biometrika 101(2):423–437.
 [Landeiro and Culotta 2016] Landeiro, V., and Culotta, A. 2016. Robust text classification in the presence of confounding bias. In Thirtieth AAAI Conference on Artificial Intelligence.
 [Lazer et al. 2009] Lazer, D.; Pentland, A. S.; Adamic, L.; Aral, S.; Barabasi, A. L.; Brewer, D.; Christakis, N.; Contractor, N.; Fowler, J.; Gutmann, M.; et al. 2009. Life in the network: the coming age of computational social science. Science (New York, NY) 323(5915):721.
 [Miller 2015] Miller, C. C. 2015. Can an algorithm hire better than a human? The New York Times 25.
 [Monahan and Skeem 2016] Monahan, J., and Skeem, J. L. 2016. Risk assessment in criminal sentencing. Annual Review of Clinical Psychology 12:489–513.
 [Paul and Dredze 2011] Paul, M. J., and Dredze, M. 2011. You are what you tweet: Analyzing twitter for public health. ICWSM 20:265–272.
 [Pearl 2003] Pearl, J. 2003. Causality: models, reasoning and inference. Econometric Theory 19:675–685.
 [Pedreshi, Ruggieri, and Turini 2008] Pedreshi, D.; Ruggieri, S.; and Turini, F. 2008. Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 560–568. ACM.
 [Pennacchiotti and Popescu 2011] Pennacchiotti, M., and Popescu, A.-M. 2011. A machine learning approach to twitter user classification. ICWSM 11(1):281–288.
 [Quiñonero-Candela et al. 2009] Quiñonero-Candela, J.; Sugiyama, M.; Schwaighofer, A.; and Lawrence, N. D. 2009. Dataset shift in machine learning. The MIT Press.
 [Sugiyama, Krauledat, and Müller 2007] Sugiyama, M.; Krauledat, M.; and Müller, K.-R. 2007. Covariate shift adaptation by importance weighted cross validation. The Journal of Machine Learning Research 8:985–1005.
 [Tsymbal 2004] Tsymbal, A. 2004. The problem of concept drift: definitions and related work. Computer Science Department, Trinity College Dublin 106.
 [Widmer and Kubat 1996] Widmer, G., and Kubat, M. 1996. Learning in the presence of concept drift and hidden contexts. Machine learning 23(1):69–101.
 [Zadrozny 2004] Zadrozny, B. 2004. Learning and evaluating classifiers under sample selection bias. In Proceedings of the twenty-first international conference on Machine learning, 114. ACM.
 [Zemel et al. 2013] Zemel, R.; Wu, Y.; Swersky, K.; Pitassi, T.; and Dwork, C. 2013. Learning fair representations. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 325–333.