Rank-based approach for estimating correlations in mixed ordinal data
High-dimensional mixed data as a combination of both continuous and ordinal variables are widely seen in many research areas such as genomic studies and survey data analysis. Estimating the underlying correlation among mixed data is hence crucial for further inferring dependence structure. We propose a semiparametric latent Gaussian copula model for this problem. We start with estimating the association among ternary-continuous mixed data via a rank-based approach and generalize the methodology to p-level-ordinal and continuous mixed data. Concentration rate of the estimator is also provided and proved. At last, we demonstrate the performance of the proposed estimator by extensive simulations and two case studies of real data examples of algorithmic risk score evaluation and cancer patients survival data.
Keywords: Algorithmic fairness, Gaussian copula model, graphical models, Kendall’s , latent variable models, mixed data, nonparanormal distribution, ordinal data, sparse modeling
High-dimensional multilevel ordinal and continuous mixed data are now routinely collected in many research areas. For example, questionnaires and rating scales are commonly used to measure qualitative variables, such as attitudes and many behavioral, health-related variables and risk score measures and their relation to binary and continuous variables such as income, age, background history, or a phenotype. Another example is that of single nucleotide polymorphism (SNP) and expression data in genetics. Multilevel ordinal variables often arise as a result of discretizing latent continuous variables (Rabe-Hesketh and Skrondal, 2007). Fan et al. (2017) propose a generative latent Gaussian copula model for binary and mixed data, assuming the binary data are obtained by dichotomizing a continuous latent variable. Multilevel ordinal data are common in survey data as well. Estimating the associations between mixed data types is of great importance to gain insights about dependence between the variables, particularly for conditional dependencies and potential causal pathways.
There are several classical rank-based methods for analyzing association among ordinal variables (Agresti, 2010). Specifically, those measures are all based on the numbers of concordant and discordant pairs of observations. A pair of observations, say and , is concordant if the subject that has a higher ranking on also has a higher ranking on , and on the other hand this pair is called discordant if the subject ranking higher on ranks lower on . Kendall’s tau-a () was first proposed by (Kendall, 1938) as a measure that quantifies the difference between proportions of concordant and discordant pairs among all pairs, which is essentially a correlation coefficient for sign scores. Later in 1945, a revised version called tau-b () was introduced (Kendall, 1945) that took tied pairs into consideration. (Goodman and Kruskal, 1954) proposed the gamma measure as the difference between proportions of concordant and discordant pairs among all concordant and discordant pairs. Other similar measures such as Somers’ d (Somers, 1962) also considers the difference between proportions of concordant and discordant pairs, just with a different base as its denominator. Yet it remains an open question to measure the association between ordinal and continuous variables.
With this motivation, we propose a novel method to estimate associations between multilevel ordinal and continuous data using a latent Gaussian copula model approach. We assume that the multilevel ordinal variable is obtained by discretizing a latent variable, and estimate the correlation/covariance matrix underlying the Gaussian copula model via a rank-based approach. These results extend those for the latent Gaussian copula model for binary and continuous data proposed by (Fan et al., 2017).
In the next section we review the concept of Gaussian copula model and define a new latent Gaussian copula model for ordinal-continuous mixed data and review the motivations for Kendall’s rank correlation coefficient. In Section 3, we propose the rank-based estimation for ternary-continuous mixed data and then generalize it to the estimation for ordinal-continuous mixed data. We derive explicit formulas for the bridge functions that connect the Kendall’s of observed data to the latent correlation matrix for different combinations of data types. This requires derivation of new bridge functions, and those derivations are somewhat involved and more complex than in continuous/binary case. We then use these formulas to construct a rank-based estimator of the latent correlation matrix for the mixed data. The significant advantage of bridge function technique is that it allows to estimate the latent correlation structure of Gaussian copula without estimating marginal transformation functions. We also establish theoretical concentration bound results for the new rank-based estimators. In Section 4 we consider the case of tied data. Simulation results are presented in Section 5. In Section 6 we give two real data analysis that highlight the our proposed techniques applies them to the construction graphical models for mixed (binary, continuous, and ordinal) data. The first is a well example in the algorithmic fairness literature about ProPublica’s journalistic investigation on the apparent biases of machine learning based predictive analytics tool, COMPAS, in recidivism risk assessment (Angwin et al., 2016). The second example is another well know example first analyzed by Byar and Green (1980) and subsequently by Hunt and Jorgensen (1999) consisting of 12 mixed type measurements for prostate cancer patients who were diagnosed as having either stage 3 or 4 prostate cancer. We conclude with some discussion in Section 7. The proofs of the main results are given in the appendix.
2.1 Variations of the Gaussian copula model
In recent years, the Gaussian copula model has received a lot of attention due to the ability to relax the normality assumptions of a fully Gaussian model. Formally the Gaussian copula model is defined as follows (Xue and Zou, 2012; Liu et al., 2009, 2012):
(Gaussian copula model). A random vector follows Gaussian copula model if there exists a set of monotonically increasing transformation functions , such that with .
A random vector with these properties is said to follow a nonparanormal distribution denoted by . The distribution is much more flexible that a Gaussian model. In particular, individual components of can have skewed or even multimodal distributions.
Note that the Gaussian copula model only applies to continuous data. We now extend the latent Gaussian copula model to ordinal-continuous mixed data. Following the notation in the binary-continuous mixed case we consider a mixed-data random vector as , where is -dimensional vector of -level discrete variables (with each component of taking values in ) and is a -dimensional vector of continuous variables.
(Latent Gaussian copula models for ordinal-continuous data). The random vector follows the extended latent Gaussian copula model if there exists a -dimensional random vector of latent variables such that if for and , where the cutoff vector is given by and is an increasing sequence of constants, and .
The latent Gaussian copula model for binary-continuous data is just a special case of the above latent Gaussian copula model with . Alternatively, the binary case is retrieved if for . In fact, by setting , where for , we can handle situations with ordinal variables with differing numbers of levels. Fan et al. (2017) proposed the following latent Gaussian copula model as an extension to binary and mixed binary-continuous data:
(Latent Gaussian copula model for binary-continuous mixed data) Consider a mixed-data random vector where is a -dimensional vector of binary variables and is a -dimensional vector of continuous variables. Then follows a latent Gaussian copula model if there exists a -dimensional random vector of latent variables such that for where is a -dimensional vector of constants, with .
Our interest is in estimating the correlation matrix or the precision matrix with for latent Gaussian copula models for ordinal-continuous data. Furthermore, under the Gaussian copula model, the sparsity pattern of the precision matrix reveals the conditional dependencies between for . Hence the graph structure could also be recovered by estimating as in the prostate cancer diagnostic example in Section 6.2.
2.2 Kendall’s rank correlation coefficients
Kendall’s (Kendall’s rank correlation coefficient) is a nonparametric measure of nonlinear dependence between two random variables. It is similar to Spearman’s and Pearson’s , in that is measures the relationship between two variables. Even though Kendall’s is a similar to Spearman’s in that it is a nonparametric measure of relationship it differs in the interpretation of the correlation value. Spearman’s and Pearson’s magnitude are similar, however, Kendall’s is the difference between the probability that the observed data are in the same order versus the probability that the observed data are not in the same order.
Suppose the data consists of independent -dimensional random vectors, , from a latent Gaussian copula model. The rank-based estimation framework for , depending on the data type. Specifically, estimation is based on the “bridge function” that relates Kendall’s parameter, , for each variable pair , , with the correlation, , between them. Here, the parameter is given by
which can be estimated unbiasedly by the corresponding statistic
or equivalently by
where and are the number of concordant and discordant pairs among .
A variation of Kendall’s that accounts for the important case of ties is . Binary and ordinal data are very likely to have a large number of ties in ranking and, as a result, Kendall’s is likely to under-estimate the sample correlation. Therefore we consider a modified version, known as Kendall’s .
where is the number of pairs of tied values of the th response, and similarly .
Since Kendall’s is a ratio of random terms (and the denominator involves a square root), the population bridge function linking it to is intractable. We therefor consider 1st-order and 2nd order Taylor series approximation instead of directly computing its expectation. However, we find there is almost no difference between the 1st- and 2nd-order Taylor series approximations, or between them and a Monte-Carlo approximation of the exact expectation..
Suppose the data consists of independent -dimensional random vectors, , from a latent Gaussian copula model. In this section, we propose a rank-based estimation framework for , depending on the data type. Specifically, estimation is based on the “bridge function” that relates Kendall’s parameter, , for each variable pair , , with the correlation, , between them. The main idea behind our alternative procedure is to exploit Kendall’s statistics to directly estimate the unknown correlation matrix, without explicitly calculating the marginal transformation functions . Recall that the Kendall statistics are invariant under monotonic transformations. For Gaussian random variables there is a one-to-one mapping between these two statistics. For Gaussian copula distributions Kendall’s is connected to the covariance matrix in Definition 1 by .
3.1 Estimate correlation between ternary and ternary data
We begin by considering ternary (3-level) data, and then extend to the general -level case in Section 3.3. Now suppose , are discrete data with 3 categories, taking values . Then under latent Gaussian copula model, we have , and the data are obtained by trichotomizing the latent variable at cutoffs such that
where , for .
To estimate , we divide this into 3 cases where: (i) for , is the correlation between ternary variables; (ii) for , is the correlation between ternary and continuous variables; and (iii) for , is the correlation between continuous variables. In case (iii) it has been shown by Kendall (1948) that . In the remainder of this section, we confine our attention to cases (i) and (ii) respectively.
We first consider Kendall’s for two tenary variables. There are only four cases that need to be considered in order to determine concordance and discordance:
Combining the first two will give a concordant pair and combining the last two will give a discordant pair. So using equations (1) and (3) we can directly calculate the “bridge function” between and as
|(all the tied pairs cases in the first line will cancel out from those in the second line)||(5)|
where the last step follows from
The notation denotes the CDF of standard bivariate normal distribution with correlation , namely where is the probability density function of the standard bivariate normal distribution with correlation .
It follows that the bridge function for the population Kendall’s for variable pair , is given by where
It will be shown in Lemma 3.1 that, for fixed , is an invertible function of .
Simple moment estimators can be derived for the cutoffs using the relations
Specifically, these motivate the estimators
Thus a rank-based estimator of is given by
As will be seen from the following lemma, the bridge function is strictly increasing in , thus there exists a unique root for the equation which can be efficiently solved by Newton’s method.
For any fixed , in equation 7 is a strictly increasing function on . Thus, the inverse function exists.
The proof of Lemma 3.1 is given in Appendix A.1.
3.2 Estimate correlation between ternary and continuous data
We now consider the random vector pairs where variable is ternary and variable continuous. The latent Gaussian copula model assumptions imply that the corresponding latent vector pairs, , satisfy
independently, for , where is the correlation between and .
It follows that
Let denote the CDF for ,
Now we are ready to build the bridge function of the population Kendall’s for ternary and continuous variables as follows.
When is ternary and is continuous, is given by where
The next lemma shows that, for fixed , is an invertible function of , which implies that the equation has unique solution where the unknown cut-offs can be estimated with no bias by considering their expectations: and
For any fixed , in equation 12 is a strictly increasing function on . Thus, the inverse function exists.
The proofs of Lemma 3.2 and the following Lemma 3.3 can be found in Appendices A.2 and A.3.
Combining all three lemmas, we have constructed the rank-based estimate of as follows:
3.3 Generalized rank-based estimate for -level discrete-continuous mixed data
We now generalize the rank-based estimate to -level discrete-continuous mixed data. Suppose that is a -level ordinal variable and continuous, then the bridge function for -level discrete-continuous mixed data is established in the following lemma:
When is -level discrete taking value in , and is continuous, the population version of Kendall’s is given by , where
Moreover, if we consider the entire space, we can extend the estimates as
so that for -level mixed data ranging from then for we have . If we define for , then Therefore we can write the -form bridge function as:
3.4 Theoretical results
We now are ready to establish a theoretical result concerning the convergence rate of the correlation estimate. As mentioned in Fan et al. (2017), these two assumptions impose little restrictions in practice.
Assumption 1: (bounded correlations) There is a constant such that for .
Assumption 2: (bounded cut-offs) There is a constant such that and for any .
In the case of the estimate of correlation between -level ordinal and continuous data we have the following concentration result.
Under Assumptions 1 and 2, at fixed , for any we have the following property
implying that with probability greater than
where and are positive constants defined in Appendix A9. Essentially, Theorem 3.1 implies that for some constant independent of and , with probability .
We have a similar concentration rate for the correlation estimator of ternary-continuous mixed data.
Under assumptions 1 and 2, for any we have
4 Kendall’s for tied data
4.1 Kendall’s estimate for binary and binary variables
When and are both binary discrete random variables, the 1st-order Taylor series approximation of the population version of Kendall’s , given by , is
We can easily see that is strictly increasing in since the denominator is independent of and the numerator is the bridge function for Kendall’s . Therefore the equation has a unique solution.
4.2 Kendall’s estimate for binary and continuous variables
When is binary and is continuous, the 1st-order Taylor series approximation of the population version of Kendall’s , given by , is
This bridge function is also strictly increasing in because the denominator does not involve and the numerator has been shown to be monotonically increasing by Fan et al. (2017).
We also derived the 2nd-order Taylor approximation of the bridge function in this case.
Let . Then, the 2nd-order Taylor approximation of is given by
with the sample space of being , the probability of concordance and discordance respectively as and .
The 1st-order and 2nd-order Taylor and Monte Carlo approximations to are plotted in Figure 1 (right panel) for , and . The difference between the two Taylor approximations is shown in the right panel.
5 Simulation results for generalized -level mixed data
In this section, we show some simulation results for -level mixed data where . We conducted two scenarios here:
Scenario 1: Starting with , we dichotomize the data equally by setting the cutoff . With increasing, we discretize the data by setting the cutoff so that we will have equal counts of each level.
Scenario 2: Starting with , we discretize the continuous Gaussian copula data equally so that each level has about the same number of counts. As decreases, we combine the highest level with one level lower: e.g. when , we collapse “16”s into “15”s. The motivation is that in Genetics research, when encountering ternary data, people sometimes combine “1”s and “2”s to make the data binary. As we can see in the following plot, this will lead to an increased estimation error (see leftmost plots in Figure 2).
For each scenario, we first simulate bivariate Gaussian copula data of size , and , with the correlation/covariance , and we estimate the correlations using the continuous data. Then we discretize the first dimension of the data into level in the way described by each Scenario, and estimate the correlation following the bridge function in equation (13). For each , the same process is repeated by 80 times and we take their mean of the squares as the error measure. We further smooth the curve by averaging the errors over , , etc.
We can see from the following plot that as increases, the estimation error approaches to the one in raw continuous data. However, notice that how much estimation error will be introduced by combining levels as we can see in Figure 2.
6 Real Data analysis
In this section, we present two studies of real data analysis. We start with applying our correlation estimation method to two sets of real data that have been studied intensively in the past, and then pass the correlation estimator to graph estimation procedures in next step. In the graph estimation procedure, we adopt the modified graphical lasso estimation method as in Fan et al. (2017), which essentially consists of two steps: first we project the correlation estimator into the cone of positive semidefinite matrices to facilitate the optimization algorithms in Friedman et al. (2008), denoted as ; second we pass to the graphical lasso estimation to replace the sample covariance matrix, to obtain the following precision matrix estimator:
We set the path of tuning parameter to be the vector of length 10 starting from to , as suggested by Friedman et al. (2008). Furthermore, we did not penalize the diagonal of inverse covariance matrix. We used high-dimensional BIC score (HBIC) as selection criterion, defined in Fan et al. (2017). The estimated graphs are then presented to reveal conditional independence relationships.
6.1 COMPAS Data
ProPublica (Angwin et al., 2016) carried out a journalistic investigation on possible biases of machine learning based predictive analytic tools used in criminal justice. The ProPublica article examined whether black-box risk assessment tools disproportionately recommend nonrelease of African-American defendants. COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a proprietary software tool developed by Northpointe, Inc. that gives a prediction score for a defendant’s likelihood of failing to appear in court or reoffending. Angwin et al. (2016) compiled criminal records from the criminal justice system in Broward County, Florida, combining detailed individual level criminal histories with predictions from the COMPAS risk assessment tool. This data set has served as a key example in the algorithmic fairness literature (e.g. Adler et al. (2018); Berk et al. (2017); Chouldechova (2017); Johndrow and Lum (2017); Kleinberg et al. (2016); Tan et al. (2017); Zhou et al. (2018)).
The COMPAS score is computed by a black-box algorithm and produces a decile score (deciles of the predicted probability of rearrest) as well as a (ordinal) categorical score consisting of three levels of risk (low, medium, and high). Dieterich et al. (2016) suggest that a medium and high COMPAS scores garner more interest from supervision agencies than low scores. In order to assess the accuracy of the recidivism predictions, Larson et al. (2016) compared individual COMPAS score based predictions to a ground truth indicator of whether that particular individual had indeed been rearrested within two years of release. Larson et al. (2016) developed a binary logistic regression model (low versus medium or high) that considered race, age, criminal history, future recidivism, and charge degree, they analyzed both the COMPAS scores for risk of overall and violent recidivism and used their model to assess the odds of getting a higher COMPAS score for certain subgroups.
The ProPublica article (Angwin et al., 2016) mentions three African-Americans that had a medium risk COMPAS score and no subsequent offenses whereas non-African Americans had low risk score but had subsequent serious offenses. So it is of interest to examine the three level COMPAS score (low, medium, and high) rather than the binary classification in Larson et al. (2016). We use our proposed graphical model approach to examine the conditional independence relationship between two-year recidivism (binary) and the three level (both overall and violent) COMPAS score (low, medium, and high) with gender (binary), recorded misdemeanor (binary), age category (<25, 25-45, and >45), number of priors, and juvenile criminal history (felony, misdemeanor, and other – all binary). To better understand the underlying relationships, we separate the data into three race groups: we estimated the underlying correlation matrices for African-American, Caucasian and Hispanic respectively, and also repeat the same procedure for the three races pooled together. Also, these analysis are done for overall COMPAS score categories and violent COMPAS score categories separately. Our estimated conditional independence graphs are in Figures 3 and 4. A first interesting finding is that we notice the graphical structures vary across African-American, Caucasian and Hispanic groups. In Figure 3 for the overall COMPAS score, note that the overall COMPAS score has a direct effect on two-year recidivism for African-Americans but is conditionally independent for Caucasians and Hispanics, however has a quite indirect effect in the pooled model. Conversely, in Figure 4, the violent COMPAS score is conditionally independent of two-year recidivism for African-Americans but has a direct effect for Caucasians and an indirect effect for Hispanics and the pooled groups. It is also interesting how the various juvenile criminal history measures have different associations across the race groups and two COMPAS scores. There is common structure seen across all three races too, misdemeanor and number of priors are consistently connected to two-year recidivism for all races. In contrast, this in not the case for misdemeanor and number of priors and the two COMPAS scores. Also, the graphical models for pooled group are the same for both sets of variables involving score category and violent score category respectively.
6.2 Prostate cancer data analysis
This data set was first analyzed by Byar and Green (1980) and subsequently by Hunt and Jorgensen (1999). It consists of 12 mixed type measurements for 475 prostate cancer patients who were diagnosed as having either stage 3 or 4 prostate cancer. Among the 12 variables, 8 are continuous, 3 are ordinal and 1 is nominal (list of variables and corresponding abbreviations can be found in Table 1). More details of the data can be found at (Andrews, 1985). We are interested in how the ‘Survival Status’ is correlated with the other 11 variables after removing the nominal variable ‘Electrocardiogram code’ since it is not appropriate to infer latent variable for nominal variable. The ’Survival Status’ is transformed into binary variable as either survived or died, regardless of causes of death. Also, we combined performance rating’s level 2 and level 3 as one level since these patients are in bed more than 50% of daytime. The correlation/covariance matrices are given in Table 3 and 3 for Stage 3 and 4 patients respectively. Figure 5 illustrates the recovered graph for the 12 variables for Stage 3 and Stage 4 patients respectively. It is interesting that the set of nodes connected to ’Survival Status’ are different among Stage 3 and 4 patients. For Stage 3 patients, the ’Survival Status’ node is of degree 3, with neighbors including ’Cardiovascular disease history’ (HX), ’Bone metastases’ (BM), and ’Performance Rating’ (PF). Whilst for Stage 4 patients, ’Survival Status’ node is of degree 2 instead, with its neighbors being ’Performance rating’ and ’Serum prostatic acid phosphatase’. It is interesting that ’Performance rating’ (PF) is adjacent to ’Survival Status’ in both networks, which is reasonable since an active patient (Performance rating = 0 or 1) was probably able to move around hence survived. However, PF was not included in the best model found by Hunt and Jorgensen (1999), which we speculate as a result of mistreating the categorical variable PF. Also, we notice that some variables are highly correlated with Surv but not a neighbor of Surv on the network graph, such as the ’Age’ variable for Stage 3 patients, and Bone Metastases (BM) variable for Stage 4. It’s easy to see the reason after a closer look at the correlation tables in Table 3 and 3: for Stage 3, the ’Age’ variable has a higher correlation with ’PF’ than with ’Surv’, implying that the high correlation between ’Age’ and ’Surv’ might be a result of the high correlation between it and ’PF’. It is similar for Stage 4: ’BM’ variable sees a higher correlation with ’PF’ and ’AP’, ’HG’ has a higher correlation with ’PF’ than with ’Surv’, and ’SZ’ finds itself highly correlated with ’AP’ and ’PF’, namely the high correlations between those variables and ’Surv’ can be due to their high correlations with ’AP’ and/or ’PF’, thus they are indirectly connected to ’Surv’ node in the network graph (Figure 5) but rather directly connected to the neighbors of ’Surv’. Another interesting structure can also be discovered from the network graph (Figure 5) that agrees with Hunt and Jorgensen (1999): they found that the cluster consisting of variables ’BM’, ’Wt’, ’HG’, ’SBP’ and ’DBP’ gave the second best likelihood; on the other hand, we found that those 5 variables are consistently clustered for both Stage 3 and 4 patients, which agrees with the finding by Hunt and Jorgensen (1999). One might also notice that ’Size of primary tumor’ node is isolated for only Stage 3 patients’ network. This in fact agrees with the definition of Stage: stage 3 represents local extension of the disease whilst stage 4 represents distant metastasis as evidenced by elevated acid phosphatase and/or X-ray evidence (Hunt and Jorgensen, 1999). In other words, for Stage 3 patients, ’SZ’ (node 10) is not necessarily a good indicator of ’Index of tumor stage, histolic grade’ (node 11) or ’Serum prostatic acid phosphatase’ (node 12), but it might be a good one for stage 4 patients as we can see in the graph that node 10 is connected to node 11 and 12. Another interesting finding is that ’Size of primary tumor’ and ’Serum prostatic acid phosphatase’ are adjacent in the networks for Stage 4 patients, which agrees with the results in McParland and Gormley (2016) that Stage 4 patients on average saw larger tumors and higher levels of serum prostatic acid phosphatase.
|Covariate||Abbreviation||Number of levels|