Measuring the effects of confounders in medical supervised classification problems: the Confounding Index (CI)
Abstract
Over the years, there has been growing interest in using Machine Learning techniques for biomedical data processing. When tackling these tasks, one needs to bear in mind that biomedical data depends on a variety of characteristics, such as demographic aspects (age, gender, etc) or the acquisition technology, which might be unrelated with the target of the analysis. In supervised tasks, failing to match the ground truth targets with respect to such characteristics, called confounders, may lead to very misleading estimates of the predictive performance. Many strategies have been proposed to handle confounders, ranging from data selection, to normalization techniques, up to the use of training algorithm for learning with imbalanced data. However, all these solutions require the confounders to be known a priori. To this aim, we introduce a novel index that is able to measure the confounding effect of a data attribute in a biasagnostic way. This index can be used to quantitatively compare the confounding effects of different variables and to inform correction methods such as normalization procedures or adhocprepared learning algorithms. The effectiveness of this index is validated on both simulated data and realworld neuroimaging data.
keywords:
Machine Learning, Confounding variables, Biomedical Data, Classification1.0
1 Introduction
In the last years, there has been a growing interest in the use of supervised learning in biomedical contexts.
However, such biomedical applications are often subject to the detrimental effects of socalled confounders, that are characteristics of the data generation process that do not represent clinically relevant aspects, but might nevertheless bias the training process of the predictor neto2018using; neto2017learning; greenland2001confounding. In neuroimaging studies, for instance, the confounding effect of demographic characteristics such as gender and age is amply discussed rao2017predictive; brown2012adhd. Studies on biometric sensor data, instead, have shown that the relationship between features and disease class label learned by the classifier is confounded by the identity of the subjects, because the easier task of subject identification replaces the harder task of disease recognition saeb2017need; neto2017learning. Finally, learning algorithms trained with a collection of different databases, a common practice in biomedical applications, suffer from high generalization errors caused by the confounding effects of the different acquisition modalities or recruitment criteria zhao2018multiple. This phenomenon is often referred to as ’batch effect’ in geneexpression studies lazar2012batch and it is proved that it may lead to spurious findings and hide real patterns scherer2009batch; lazar2012batch; leek2010tackling; akey2007design; parker2012practical; soneson2014batch.
The acknowledgement of these problems has thus brought to a precise definition of a confounder as a variable that affects the features under examinations and has an association with the target variable in the training sample that differs from that in the population of interest rao2017predictive. In other words, the training set presents a bias with respect to such confounding variable. The approaches developed to deal with confounders can be summarized in three broad classes. The first and most intuitive one matches training data with respect to the confounder, thus eliminating the bias, at the cost of discarding subjects and impoverishing the dataset Rao2017PredictiveMU; neto2018using. A second approach corrects data with a normalization procedure, regressing out the contribution of the confounder before estimating the predictive model dukart2011age; abdulkadir2014reduction; Rao2017PredictiveMU. However, the dependency of the data from the confounders may not be trivial to capture in a normalization function and this problem is exacerbated when different confounders are considered together. For example, batch effects cannot be easily eliminated by the most common betweensample normalization methods leek2010tackling; luo2010comparison. Alternatively, confounders have been included as predictors along with the original input features during predictive modelling rao2015comparison; Rao2017PredictiveMU. However, it has been noted that the inclusion in the input data of a confounder that is highly associated with the correct response may actually increase its effect, since in this case the confounder alone can be used to predict the response neto2018using. Recently, more articulated approaches have been developed that operate on the learning model rather than on the data itself, for instance resorting to Domain Adaptation techniques zhao2018multiple. Similarly, some attempts have been made using approaches designed to enforce fairness requirements in learning algorithms, e.g. so that sensitive information (e.g. ethnicity) does not influence the outcome of a predictor hardt2016equality; zafar2017fairness; calders2009building; donini2018empirical. However, also in these models, it is very difficult to correct for multiple confounders, as it would be necessary in biomedical studies.
An effective solution to the confounders problem thus requires combining the techniques described above: normalizing for the confounders that have a known effect on the data, matching the subjects if this does not excessively reduce sample size, and adopting a learning algorithm able to manage the biases that have not been eliminated earlier. When planning such an articulated approach, it is useful to have an instrument that can quantify the effect of a confounding variable and assess the effectiveness of the possible countermeasures. To this aim, we present in this paper a novel figure of merit, called ’Confounding Index’ (CI) that measures the confounding effect of a variable in a classification task tackled through Machine Learning (ML) models. Previous works on this subject are limited, to our knowledge, to a single, still unpublished, study neto2018using. However, their measure of the confounding effect of a variable (thoroughly investigated in Section 3), is strictly related to the specific bias present in the training set, while our CI is independent from it.
The proposed CI founds on measuring the variation of the area under the receiver operating characteristic curves (AUCs) obtained using different, engineered biases during training, and thus depends on how the confounder and the class labels affect the input features. The CI ranges from to and allows:

to test the effect of a confounding variable for the specific classifier under use;

to rank variables with respect to their confounding effect;

to anticipate the effectiveness of a normalization procedure and assess the robustness of a training algorithm against confounding effects.
While the proposed approach is formulated for discretevalued variables, it can straightforwardly be applied to assess the confounding effect of continuous ones by simply discretizing their values (an example of this is shown in the empirical assessment on our index). In such a scenario, CI allows to identify the widest range of values for which the effect of such variables can be ignored.
The biomedical sector is the one that we believe to be more suitable for the application of our CI since biomedical data, far more than other data types, depend in complex ways on many known and hidden factors of the data generation process. However, the proposed CI is general enough to be applied in any supervised classification setup. The remainder of the paper is organized as follows: Section 2 introduces the formalization of the problem and the notation used throughout the paper, Section 3 discusses in detail the only other related work on this topic in literature. Section 4 and 5 describe the CI and its implementation. Sections 6 and 7 report the experimental setup and the results of the analysis performed on both simulated and realworld neuroimaging data, while Section 8 concludes the paper.
2 Notation
In this section we introduce the notation used in this paper to describe a binary classification framework and the problem of confounders we want to address.
2.1 Twoclass machine learning
In a typical twoclass classification task, we define the Ldimensional input feature vectors and their binary output labels , where is the sample index. The domain of the task can be defined as a subset of :
(1) 
We suppose that the feature vectors depend on their output label and other unknown variables, which can be considered irrelevant for the applicability and effectiveness of the classification algorithm.
Let us define two subsamples of called and as follows:
(2) 
From here on we will refer generically to any of these subsamples using .
Training a twoclass ML algorithm can be represented as a function as follows:
(3) 
where is the space of the possible inference models such that and denotes the power set of , i.e., the set of all possible subsets of . Thus, training a binary classifier can be viewed as finding the function , where and .
The function is chosen from the subset of the inference models that can be explored by the chosen algorithm as the one that minimizes the error made in labelling only the elements of the training set .
To evaluate the generalizability of , various figures of merit exist that quantify the error resulting from the application of on samples external to the training set. The commonest ones are Accuracy, Sensitivity, Specificity and AUC.
Between these quantities, the AUC is the only one that takes into account all the four possible outcomes of the classifiers (true positive, false positive, true negative and false negative).
It is calculated as the area under the curve obtained by plotting the true positive rate against the false positive rate at various discrimination threshold settings. Thus, the AUC is a measure of the classification performances of a classifier that is independent on the specific threshold used to assign the labels and it is commonly considered the most reliable metric for the evaluation of machine learning algorithms bradley1997use.
For this reason, both our work and the one described in neto2018using, are based on this quantity. In particular, for the definition of the CI we need to introduce as a function that returns the AUC of a model on a validation set , a subset of without any intersections with the training set .
2.2 The confounding effect
Let us suppose that among the various unknown variables that affect the values of , there is a variable (with no causative effect on the actual class label and with values in ) that might have a confounding effect on the classification algorithm, while the others remain irrelevant for our purpose. In this situation we consider as mainly dependent on two variables and and thus we can define four subsamples of : , , and .
For instance, is defined as follows:
(4) 
and similarly for the others. Every vector of features can be described as the sum of three contributions. One term represents the dependence from the unknown variables, while the other two represent the dependence from the variables and . For example, (and similarly for the other subsamples) can be written as:
(5) 
Where is the dimension of the feature vectors , is the canonical base and are scalar weights that depend on the characteristic of the specific example and on the type of feature .
and are the feature vector positions that depend, respectively, on the variable and on the label variable .
The terms , with , depend on the function , while those with depend on .
These functions describe how the the values of and affect the feature vectors.
The main objective of a binary classifier trained to distinguish between and is to learn a pattern based on the differences introduced in by and .
However, when the patterns due to and are more easily identifiable than the ones of interest, and the distribution of the values of is uneven across the training samples and , the classifier can be misled and we say that has a confounding effect on the classification task.
3 Related Works
Previous literature on confounders is divided mainly on statements of the problem and solution proposals (i.e. normalization procedures, corrected lossfunctions, etc.). However, every study estimates the confounding effects on its results differently, often taking arbitrary decisions about which possible confounders to consider and how.
Our objective is thus to propose our CI as a standardized tool that quantifies how much an analysis based on a specific classifier and with a specific classification task can be affected by the presence of a confounder. In this section we analyze the only work with a similar aim (neto2018using), discussing its limits and its differences with respect to our proposal. For consistency, we will use the same notation previously described.
Let us consider a training dataset biased with respect to the variable . This means that, for example, most subjects with have . The authors define a restricted permutation of the class labels of the training dataset as a permutation in which the correlation between and is preserved, i.e., after the permutation, the number of subjects for each pair of remains equal to the original situation (see Fig. 1).
The authors claim that in this kind of permutation ’the direct association between the response and the features is destroyed while the indirect association mediated by the confounder is still preserved’ neto2018using.
Based on this consideration, they suggest to compare the mean value of the AUCs, , obtained with several restricted permutations, to its expected value in the case in which is not confounding (i.e., ).
Our concerns arise from the fact that the restricted permutation does not actually remove the correlation between the confounder and the class label, thus leading to an overestimation of the confounding effect. We tested this hypothesis with simulated data consisting in twofeature vectors in which the variable affects the first feature, while the variable affects the second one.
In our experiment and have the same effect on the feature vectors, because we want to test whether the restricted permutation test is able to correctly identify as not confounding.
In Fig. 2 we report the values of as a function of the percentage of input data with , which represents the bias in the training phase.
If the index was correct we should have obtained independently from the value of , because in these simulated data is not confounding. However, as shown in Fig. 2, increases monotonically with respect to the value of the bias . Given that these data depend only on and , but are not differentiated with respect to the values of , the increase of can be attributed only to the dependence of the data from .
With this discussion, we have shown that the restricted permutation method consistently overestimates the confounding effect of a variable because it is unable to eliminate the association between label and features.
Furthermore, another intrinsic limit of this approach is that it measures a quantity that depends on the specific bias present in the training set. Thus, it does not allow to make any general conclusion on the magnitude of the effects of the confounder on the learning process. This includes the impossibility of ranking different confounders.
4 Definition of Confounding Index (CI)
In this section, we present our definition of Confounding Index (CI). This index makes it possible to compare the confounding effects of categorical variables with respect to a defined twoclass classification task, with a measure that does not depend on the particular bias present in the training dataset.
Basically, it shows how easily the differences due to a possible confounder can be detected by a ML algorithm with respect to the differences due to the classes we want to study. The applicability of the proposed CI can be extended also to study the confounding effects of continuous variables with an appropriate binning.
We begin by defining and discussing the validity of our CI. Then, we provide a pseudocode description for computing it.
4.1 CI definition
In order for our CI not to depend on the particular bias of the training set, we have to study how the model function obtained with the training varies with respect to the bias . Thus, let us consider a group of model functions obtained using different compositions of the training set (in the following notation the set subscripts denote the size of the samples):
(6) 
When , training does not present any bias with respect to the confounding variable, thus, the error committed by the model should not depend on the distribution of over the validation set, which means that all the following expressions should be equivalent (except for finite sample effects):
(7)  
Thus, from now on, we will use the term to refer to any of these three values.
For , is obtained from a biased training with bias , which means that only if and are both true (as above except for finite sample effects). In this case, Eq. (7) should hold for a generic , too.
From this observation it derives that the following condition is necessary for to be a confounding variable:
(8) 
In particular, considering which samples are more and less represented in Eq. (6), if is confounding enough to affect the training phase, we can expect a monotone increase in with respect to . This happens because the training and validation sets are biased in the same way with respect to the values of . Conversely, when the training and validation sets are oppositely biased, the logic that the model learns during the training phase no longer holds in the validation phase and thus the value should monotonically decrease. Summarizing, if the variable has a confounding effect both these monotonicity conditions should hold:
(9) 
When all the three conditions in Eq. (8) and (9) are satisfied, we define as a confounder and we can formulate a possible figure of merit to quantify its effects as the difference between the two terms of inequality (8), integrated over :
(10) 
Eq. (10) can be rewritten in a more explicit form as the sum of two contributions and which are:
(11) 
where () represents the increase (decrease) with respect to in case the training and validation sets are biased in the same (opposite) way with respect to the value of .
It is important to consider both these terms because, since is confined in the interval , depending on the value of , one of the two contributions may not correctly reflect the effect of the bias.
The figure of merit just described, , summarizes the effects that various degrees of bias in the training sets have in a particular classification problem.
However, it has been defined studying the of the group of model functions described in Eq. (6), that have been constructed using a positive correlation (measured by the bias ) between the pairs of labels and , while completely neglecting the possibility of the inverse situation: a positive correlation between the pairs and .
Given that we want to define a measure that does not depend on the particular bias present in the training dataset, we should consider also the calculated from the group of model functions :
(12) 
To understand why and can differ, let us consider for example the case in which both the differences due to and are not easily understandable by the model, and while . In this situation the effects of a bias can depend on which correlation we choose when building the training set, and as a consequence also the validation set. This happens because when we compute we are measuring the effects of a positive correlation between the pairs ; given that both these variables are affecting the same features, the confounding effect will be higher with respect to the case of the opposite correlation, measured with .
Given that the real correlation between these two variables is unknown and that we want to measure the confounding effects in the worst case possible, we define our CI as:
(13) 
4.2 CI applicability
Looking at the definitions of and in Eq. 10 and 12 it is clear that both these measurements have values in , because they represent a difference between two numbers, the , constrained to be in and integrated over a range of unitary length.
However, as previously stated, (and thus also ), can be used to quantify the confounding effects of only if the conditions in Eq. (8) and (9) are valid.
In particular, if the monotonicity conditions hold, and have values in (because the integrand is positive), thus this is also the range of meaningful values of the proposed , where 1 indicates the maximum confounding effect measurable, while 0 means the absence of any effect.
The monotonicity evaluation can be performed both using automated techniques or even with just a visual inspection of the data.
Dealing with real data, which can be scarce and noisy, the values of the can oscillate significantly. We therefore suggest to repeat the calculation more than once, using the mean values to compute the proposed , and to properly evaluate the propagation of the errors associated to these values.
Another fact to take into account is that the has been defined under the hypothesis that the input data depend mainly on two variables: the label one and a possibly confounding one , and considering the dependency from other variables irrelevant for the classification purposes. If this condition is not valid and the data depend strongly on other variables, it is very important to match the training sample with respect to these ones. Skipping this match operation may introduce unwanted biases in the computed with respect to these confounding variables.
Summarizing, the steps to correctly evaluate the confounding effect of a variable are:

Build the various training and validation datasets necessary for the CI calculation, matching the data for other possible variables that can affect this analysis.

Compute the needed to evaluate the confounding effect of and use their average values for the calculation of and .

Evaluate the monotonicity conditions.

Between and , take the one that satisfies the monotonicity conditions as the value of . If both of them do, take the largest one.
4.3 Pseudocode for the calculation of the CI
In this section, we describe the two algorithms necessary to evaluate the confounding effect of a variable . The Algorithm 1 calculates the quantity (the calculation of is equivalent) while the Algorithm 2 describes the steps to asses the value.
Note that, in order to get the most faithful results possible, all training and validation sets should be carefully matched for all the possible confounding variables that are not under study.
In this pseudocode, with respect to the description made in Section 4.1, we have corrected the value of (and thus also of ) for the effect of the discrete steps used to explore the range of . In fact, the maximum value of for a finite step size is not 1 but . Correcting for this factor is necessary in order to obtain a value that can be easily compared to other ones obtained with a different step size.
5 Monotonicity evaluation
As already explained in Section 4.2, our CI can be calculated only under the monotonicity conditions of Eq. (9) which should be verified. This can be done with just a visual inspection of the data, or using various trend analysis methods already described in literature. In this section we will briefly illustrate the method presented in brooks2005scale that we have used for all the analysis described in this paper. This method is easily implementable, computationally inexpensive and can be used on noisy data.
5.1 Deltamonotonicity
The method described in brooks2005scale is based on the notion of scalebased monotonicity, which means that the fluctuations within a certain scale are ignored. The scale of the fluctuations can be chosen by the user and is called . Given a function defined over an ordered set of real values , two elements where are called a if their images under are significantly different, while the images of the points between them can be considered constant on the scale of . This can be summarized with the following two conditions, graphically illustrated in Fig. 3:


implies and
A direction is increasing or decreasing according to whether or . Given these preliminary definitions, the authors of brooks2005scale define as over if all the have the same direction.
6 Materials and Methods
In this section we will first show the effectiveness of our on simulated data and describe a possible application on real world data.
Artificial data in fact allow to analyze how varies with respect to the differences introduced in the input data due to and , while real world data can give a practical idea of the usefulness of the CI.
The real data used in this study are neuroimaging data di2014autism; di2017enhancing which, as all the biomedical data, depend on several variables that can have a confounding effect on many different classification tasks. We will show that our index is able to rank the most important confounding variables that can affect a classifier, giving the possibility to design strategic solutions to the problem. We will also show that this index can be used for continuous variables, and in this case it can help in identifying a range of values in which the confounding effect of a variable can be considered irrelevant.
In all the analyses described in this section we have used a logistic regression classifier, but the CI can be constructed for any binary classifier.
6.1 Simulated Data
As already described in Section 4.1, we are considering the situation in which the input data of a classification problem depend mainly on two binary variables: the class label and a second variable that can have a confounding effect on the classification task. To assess the validity and effectiveness of our , we have generated four subgroups of artificial input data: , , and .
The data belonging to every subgroup have been generated as explained in Eq. (5).
In our simulation every is a vector of 100 features, in which the first contribution (the one that was described as the linear combination of the elements) has been simulated attributing to every feature a random real value in the range . The second and the third contributions (the ones that explain how the input data depend on and ) have been simulated adding or subtracting a constant value to a limited set of features, making the sum of all the contributions equal to zero.
In particular, to avoid the classifier learning a pattern based on the total sum of all the features, instead of finding which features are influenced by the value of a particular variable, the functions , , and add a constant to two features and subtract the same constant from other two features.
We have tested our in two different kinds of situations.
First, when the variables and influence different groups of features, meaning that . Then we have explored some cases in which the two variables affect the same group of features.
6.1.1 CI evaluation when the variables affect different features
In this analysis we consider the case in which and affect different features of the input data, as illustrated in Fig. 4.
As shown in the figure, we assume that the constant added for is the same for and it is called . Thus, the two classes differ only for the set of features depending on that are and . The same happens also for the variable , characterized by an additive constant .
In this situation, in which , we want to study how our responds to changes with respect to both and .
Then, we want to assess whether the confounding effect of depends on which correlation between and we choose when building the training set, which means that we want to study the difference between and .
Finally, we want to prove what we have previously said about the importance of calculating (or ) as the sum of and (or and ), showing how these quantities vary with respect to and .
To perform these analyses we have calculated the values of , , , and thus on the simulated data just described, with and varying from 0 to 10 with steps of 0.5. We have chosen this range because 0 represents the situation of nodifferences and 10 is the range of oscillation of our artificial noise.
6.1.2 CI evaluation when the variables can influence the same features
With this second analysis we want to explore the behaviour of our in more complex situations, when and affect the same group of features and the dependence of the input data from them is expressed by four different constants , , and .
At the end of Section 4.1, we hypothesized that and can differ when, for example, and with this analysis we want to test our hypothesis and experimentally study other situations that can cause a difference between and .
Exploring all the possible combinations of intersections between , , and is clearly computationally infeasible, thus we analyze only the extreme situations represented in Table 1, in which two or more of these groups of indexes are equal.
We let the values of all the four constants vary in , in order to explore the effects of small and large differences.
6.2 Neuroimaging Data
In this section, we describe an example of a possible application of our on a real problem.
The problem taken as example is the classification of subjects affected by Autism Spectrum Disorders (ASD) versus Healthy Controls (HC), through the analysis of their neuroimaging data with ML algorithms.
Neuroimaging data, as every biological data, can possibly depend on a great number of phenotypical characteristics of the subjects, but the relationships that correlate the data to them are unknown. Thus, it is difficult both to apply a proper normalization and to decide for which variables it is essential to match the training subjects.
No standards exist in literature. Some studies take into account some characteristics that others are neglecting. In our analysis we will focus on the main phenotypical characteristics cited in literature as possibly confounding and we will show that our gives us an idea of the importance of the problem, making the design of a study easier and more objective.
For this analysis we consider all the structural Magnetic Resonance Images (sMRI) available in the two collections ABIDE I di2014autism and ABIDE II di2017enhancing, the two biggest public databases for the study of ASD, containing brain magnetic resonance images of 2226 subjects.
These images have been processed with Freesurfer version 6.0 fischl2012freesurfer, the most commonly used software for the segmentation of the brain. This processing extracts quantitative morphological features related to the cortical and subcortical structures and to the whole brain: these last ones are generally called global features.
We selected 296 brain morphometric features, divided into:

volumes of 38 subcortical structures (the cortical structures are defined according to the Aseg Atlasfischl2002whole);

10 whole brain features;

volume, surface area, mean thickness and mean curvature of 31 cortical bilateral structures, segmented according to the DesikanKillianyTourville Atlasklein2012101.
The analysis consists in testing the confounding effects of various variables, computing all the AUCs necessary for the calculation of our with respect to the task of distinguishing between HCs and ASDs.
As already explained in Section 4, in order to obtain a good estimation of the the bias added in the training set must be related exclusively to the variable under examination, while the others should be controlled. This means that, in order to obtain reliable results, the subjects of the two categories must be matched for all the variables that may be confounding and that are not under study.
Furthermore, the matching operation must be performed on a casebycase basis, i.e., for each ASD subject, another HC matched for all the possible confounding parameters, must be included in the training set.
This matching operation can be difficult to perform and unavoidably reduces the number of subjects that can be used for the training. However, in order to be able to compare the calculated for the different possible confounding variables, it is important that all the training sets contain the same number of subjects and that the calculation of the various s are done exploring a sufficient number of biases.
The possibly confounding variables that we want to study are gender, age, handedness and Full Intelligence Quotient (FIQ). In fact, many authors supposed that the differences due to the mental abilities between ASDs and HCs can be a confounding factor that has to be avoided, because the meaning of finding a classifier able to distinguish between them is to help the physician to correctly diagnose which subject are ASDs and which ones are affected by other forms of mental retardation or neurodevelopmental disorders.
Besides these characteristics of the subjects that are typically mentioned as possibly confounding factors, we want also to analyze the of the data acquisition site. In fact, ABIDE, as most of the neuroimaging datasets, is a multicentric database and its sMRIs have been acquired in 40 different sites, each one using its own machines and acquisition protocols.
In most neuroimaging studies, data are analyzed without taking into account their different acquisition modalities for two reasons. First, because usually the database collects data acquired with the same macroscopical sMRI settings, which are considered to produce equivalent images. Second, because data used in the classification task are not the raw image data that may depend on the acquisition settings, but features obtained with segmentation tools that are supposed to extract more abstract quantities, as stated by the authors of Freesufer fischl2004sequence.
However, given the scarce reproducibility of the results obtained in neuroimaging literature, in the last years, an ever growing awareness has spread that the use of multicentric datasets may bias the analysis auzias2016influence.
Summarizing, in this application we want to study the confounding effect of gender, age, handedness, FIQ and site in a classification problem between ASDs and HCs using neuroimaging data.
Age and FIQ are continuous variables, thus, in order to compute the in these cases it is necessary to discretize their values.
This has been done selecting the length of the range of values that represent a single discrete unit and confronting the value of considering the confounding effect of two units separated by a distance .
7 Results and discussion
7.1 Simulated Data
7.1.1 CI evaluation when the variables affect different features
The results of the analysis described in Section 6.1.1 are reported in Fig. 5a, in which the values are plotted as a function of and .
As the plot shows, our depends both on and .
Furthermore, as we would expect, the confounding effect of is weaker for easier tasks (i.e. the ones with higher ) and stronger for harder tasks.
The Fig. 5c and 5d are an example of how every point in Fig. 5a is calculated; in fact is the maximum value between and , the two quantities shown in the plots.
These quantities are calculated considering the two possible correlations between and . They are obtained as the sum between two areas, one labelled , that shows how much the of a classifier increase if trained and validated on a favorably biased dataset, and the other called that on the contrary shows how they decrease when the bias is unfavorable.
For visualization purposes we attributed to the bias a negative value when computing the of the part.
In these images and can be considered equal within the estimated error, and we have found that this is true also for all the s calculated in this simulation, in which and affect different features. This can be intuitively visualized from the plot 5b, in which the values of and are reported on the two axes. As this image shows, all the points lay on the line , which means that the two values are consistent.
Finally, as shown in Fig. 5c and 5d, the and contributions seem equivalent, but these plots have been obtained when the in an unbiased situation is 0.5, i.e., . When considering all the and combinations explored in this analysis we see that the two contributions are different (see Fig. 6). In fact, when the tasks are easy, the unbiased are already very high and thus a favorable bias cannot significantly improve them, resulting in too small contributions even for high values of . In these cases in fact, the confounding effect is not manifested with an improvement of the , but with a change in the classification pattern extracted during the training. Similarly, when the tasks are hard, even a small unfavorable bias can significantly reduce the , bringing it next to 0 and making the contribution solely dependent on the unbiased , which will be lower the harder the task is. The problems just mentioned are both caused by the dependency of the and contributions on the unbiased . Their sum does not suffer from this dependency (see Eq. (10)) and it is, thus, the best figure of merit to correctly assess the entity of the confounding effect for any task complexity.
7.1.2 CI evaluation when the variables can influence the same features
Similarly to Fig. 6, Fig. 7 shows the results obtained with the analysis described in Section 6.1.2, showing along the two axes the values of and . The orange points, that all lay in the diagonal of the plot, have been obtained with the first 4 configurations described in Table 1. All of them are characterized by the absence of intersections between the group of features influenced by and the ones influenced by : .
As expected, the correlation between the values of and chosen for the calculation of the index is irrelevant, and thus , if and affect different features.
The other blue points seem to be grouped in clusters that are symmetrical with respect to the orange line. In these clusters, we can identify two kinds of situations: the one in which both and are positive (first quadrant), and the one in which one of the two quantities is positive while the other one is negative (second and third quadrant).
To better understand when these situations occur, we have analyzed the two clusters (and their respective symmetrical ones) coloured in red.
The points belonging to the clusters in the second and third quadrants, and , are all obtained with data simulated in a configuration in which one specific value of and one specific value of influence the same features, while the other values affect different and independent groups of features (see the last four lines in Table 1). Elements of a specific cluster (and its symmetrical one) have been obtained with the same module of , , , .
The only difference between two symmetrical clusters, e.g., and , is the signs of the constants affecting the same features. In the case depicted in Fig. 8, these signs are those of and .
To better understand why in these situations either or assumes a negative value, let us consider the plots in Fig. 9a and 9b, showing the components of two symmetrical points belonging respectively to the and clusters.
The input data used for the calculation of the quantities in Fig. 9a have as illustrated in Fig. 8 and are characterized by the constants . Thus, when there is a positive correlation between the variable and , their effects cancel each other in a significant portion of the training dataset. This results in a negative value of . Instead, when considering a positive correlation between and , the confounding effect due to is even greater with respect to the situation in which (with the same values). This happens because, if the features belonging to are affected by when and by when and is positively correlated with , it is like is measuring a hypothetically confounding factor with an intensity given by the difference of and , thus of .
For similar reasons, the symmetrical point has a negative and a positive as illustrated in Fig. 9b. In this case , thus measures the effects given by the sum of and , i.e. . When computing , the training dataset is biased in a way that makes the greatest part of the belonging also to , thus presenting the same effect on the same features that characterize the . This makes the data indistinguishable with respect to and brings the of to 0.5.
These two examples show clearly what happens to all the points belonging to these clusters, explaining their symmetry and why there are cases in which and can differ.
However, it is interesting that even in these very degenerate cases, our index gives the same estimation of the confounding effect, correctly reflecting the fact that they are characterized by the same absolute values of the constants. This happens because , being computed as the maximum value between and , always considers the worst possible correlation between and .
Something similar happens also for the clusters of the first quadrant and , that are composed of points obtained with the same configurations of and , but with values that make the task easier and with . For example, let us consider the two points belonging respectively to and illustrated in fig. 10; they have been obtained from simulated data with constants . In this situation, the difference between and is less important than in the one represented in Fig. 9 because the task is easier and because implies that their effects do not cancel each other completely when they are correlated.
Concluding, our analysis on these clusters shows that the situations in which one of or is negative are not inherently different from the situations in which they are both positive but different.
We also believe that the cluster distribution is simply an effect of the coarse grained sampling of the values and of neglecting partial intersections between the groups of features affected by and . Despite not being able to explore all the possible cases described above, it is interesting to note that in none of the performed analyses and were both negative.
7.2 Neuroimaging Data
The s calculated in this analysis are reported in Table 2 for the categorical variables and in Fig. 11 and 12, for the continuous ones, respectively age and FIQ.
For the evaluation of the confounding effect of age, we chose to discretize the range at steps of years, starting from a group of subjects of 14 years old. Thus, the first point of the plot in Fig. 11, in which , shows the calculated considering as and two groups of subjects with an age respectively in the age ranges years and years. Even considering that this two ranges are very close, the is sensibly different from 0. As we would expect, the value of increases with , exceeding 0.6 for . For the evaluation of the FIQ variable we instead considered points, starting from a FIQ score of 76.
The s of this variable are smaller than the s of age and this is also reflected by the oscillations present in Figure 12. However, an increasing trend is clearly detectable, bringing the of FIQ to be significant for high values of . It is remarkable that the points that seem out of the trend are the ones affected by the greatest error.
Summarizing, the application here reported is just an example of how the can be used to understand better the best conditions to design a ML study in the presence of confounding variables. Clearly, to calculate the it is necessary to reduce the number of subjects under examination, in order to match all the possible confounding factors; however we believe that this initial step can help the data analyst to optimize the number of subjects to include in the final analysis, providing a way to objectively evaluate which variables are more confounding (and thus must be matched in training). This analysis, for example, shows that the handedness category is not a confounding variable for the task under examination. Even if these results are related to the features and the specific classifier chosen, we believe that in many studies, recruitment choices such as reducing the dataset to only righthanded subjects, have unduly limited the subjects cohort available for training without a valid justification.
On the other hand, many studies have completely neglected the dependency of the data from the acquisition modalities and from the FIQ. This last variable can be very important, especially if the disease is related to mental disability, while the HCs follow a normal distribution.
This study also shows that the of the acquisition site is higher than the one related to gender and comparable to the one of an age difference of about 11 years, and thus should not be underestimated in a multicentric study.
It is interesting that in all these examples we have never obtained a significantly higher than 0.6, even if, for example, simply training the logistic regression classifier to distinguish between matched subjects acquired from two different sites we obtain a mean AUC of . This is due to the fact that the reaches values around 1 only when the confounding effect is so powerful to completely mislead the classifier even with very small biases in the training set; this rarely happens with real data.
Variable  Error  

Handedness  0.01  0.06 
Sex  0.43  0.03 
Site  0.54  0.02 
8 Conclusions
In this paper we have presented an index for assessing the confounding effect of a categorical variable in a binary classification study.
The study made on simulated data shows the goodness and sensitivity of our CI, the value of which depends on the intensity with which the confounder and the label influence the features under exam.
Furthermore, it has been found that and differ only when the confounder and class labels influence the same features. This phenomenon could give precious insights on the effect of both the confounder and the class labels.
The analysis conducted on neuroimaging data showed very informative results, proving that the CI can also be used on continuous variables by discretizing their values.
The analyses on real and simulated data were aimed at proving the goodness of the CI, but this figure of merit can be successfully used to assess the effectiveness of a normalization procedure or of a training algorithm specifically designed to be robust against biases in the dataset.
The only other work showing a quantity measuring the confounding effect in a classification study bases its concept of confounding on the specific bias in the training set, while our index depends on how much the confounder influences the features under exam. We believe that this information is more useful in an initial phase of study planning. Furthermore, we have shown on simulated data that the other index systematically overestimates the confounding effect of a variable when it is tightly correlated to the class labels.
Concluding, the proposed CI represents a novel and robust instrument to assess confounding effects, useful especially in the biomedical field, in which the data depend on many different variables and correcting the analysis from multiple confounding effects is not straightforward.