A model-free estimation for the covariate-adjusted Youden index and its associated cut-point
In medical research, continuous markers are widely employed in diagnostic tests to distinguish diseased and non-diseased subjects. The accuracy of such diagnostic tests is commonly assessed using the receiver operating characteristic (ROC) curve. To summarize an ROC curve and determine its optimal cut-point, the Youden index is popularly used. In literature, estimation of the Youden index has been widely studied via various statistical modeling strategies on the conditional density. This paper proposes a new model-free estimation method, which directly estimates the covariate-adjusted cut-point without estimating the conditional density. Consequently, covariate-adjusted Youden index can be estimated based on the estimated cut-point. The proposed method formulates the estimation problem in a large margin classification framework, which allows flexible modeling of the covariate-adjusted Youden index through kernel machines. The advantage of the proposed method is demonstrated in a variety of simulated experiments as well as a real application to Pima Indians diabetes study.
Key words: diagnostic accuracy, margin, receiver operating charachteristic curve, reproducing kernel Hilbert space, Youden index
In medical research, continuous markers are widely employed in diagnostic tests to distinguish diseased and non-diseased subjects . A subject is diagnosed as diseased if the marker value is higher than a given threshold value, and otherwise non-diseased. The diagnosis accuracy of the marker is usually evaluated through sensitivity and specificity, or the probabilities of a true positive and a true negative for any given threshold value. In addition, the receiver operating characteristic (ROC) curve is defined as sensitivity versus 1-specificity over all possible threshold values for the marker [2, 3]. To summarize the overall property of an ROC curve, different summarizing indices are proposed, including the Youden index  and the area under the ROC curve (AUC; ).
The Youden index, defined as the maximum vertical distance between the ROC curve and the line, is an indicator of how far the curve is from the uninformative test . It ranges from 0 for the uninformative test to 1 for an ideal test. The Youden index has been successfully applied in many medical studies to provide an appropriate one-dimensional summary of the test accuracy and determine its associated cut-point (e.g., [6, 7]).
In literature, various statistical modeling strategies have been proposed to estimate the Youden index and its associated cut-point. One main strategy is to use parametric models (e.g., [8, 9, 10]), which assume that the values of the diagnostic marker for the diseased and non-diseased subjects, respectively, follow certain probability distribution. With the parametric probability distributions, explicit formulas for the Youden index and its associated cut-point can be derived. Another popular strategy uses non-parametric models (e.g., [11, 12, 13]), which estimate the conditional distribution of the diagnostic marker for the diseased and non-diseased subjects via non-parametric density estimation techniques. Fluss et al.  conducted numerical comparisons among a number of popular estimation methods, and suggested that the kernel density estimation is generally the best performer without restricting the data distribution. With the estimated distributions of the diagnostic marker for the diseased and non-diseased subjects, the associated cut-point can be estimated as the value where the two estimated densities are identical.
Furthermore, it is important to appropriately set cut-points in subpopulations and to understand the sources of false positive and false negative results. Zhou et al.  and Pepe  discussed the covariate effect on the accuracy of diagnostic tests and the estimation of the ROC curve. Ignoring the covariate effects may lead to biased inference about the accuracy of the test for distinguishing diseased and non-diseased subjects. In data analysis, as pointed out in Pepe (; page 135), “it might be of interest to calculate both the pooled and covariate-specific ROC curves in order to ascertain the gains in accuracy that can be achieved by using covariate-specific thresholds.” Although much research has been done on the covariate-adjusted ROC curve (e.g., [15, 16, 17]), little has been done on the covariate-adjusted Youden index and its associated cut-point. To the best of our knowledge, Zhou  studied the covariate-adjusted Youden index by using the heteroscedastic regression model .
In this paper, a new model-free estimation framework is proposed, which directly estimates the covariate-adjusted cut-point without estimating the conditional densities. With the estimated cut-point, the covariate-adjusted Youden index can be estimated through any one-dimensional non-parametric density estimation methods. In particular, the estimation framework formulates the estimation problem in a large margin classification framework, where the covariate-adjusted cut-point is modeled non-parametrically in a reproducing kernel Hilbert space (RKHS;). The proposed method is applied to Pima Indians diabetes study, and suggests the important effect of age in estimating the Youden index and its associated cut-point.
The rest of the paper is organized as follows. In Section 2, we briefly review the estimation of the population-based Youden index and its associated cut-point. In Section 3, we introduce the covariate-adjusted Youden index and its associated cut-point, and propose a model-free estimation framework based on the large margin classification for estimating the covariate-adjusted Youden index and its associated cut-point. In Section 4, numerical experiments are conducted to demonstrate the advantage of the proposed method. In Section 5, we apply the proposed method to the Pima Indians diabetes dataset. Section 6 contains some discussion, and the Appendix is devoted to technical proofs.
2 The Youden index and optimal cut-point
Suppose that every observation consists of a continuously supported diagnostic measurement , and a binary variable , where denotes a diseased subject and otherwise. A cut-point is introduced so that a diseased status is predicted if and non-diseased otherwise. The ROC curve is constructed to display the sensitivity, and the specificity, . To summarize the information in the ROC curve, the Youden index is defined as
The Youden index ranges from 0 to 1, where represents a complete separation, and represents a complete overlap. The associated cut-point is the point that yields ,
|Figure 1 about here.|
Since and , direct derivation yields that is a solution of
where and with , and if and otherwise for convenience.
Assume that is increasing in , then the solution of (1) satisfies .
Furthermore, it can be showed that also satisfies that , where and are probability density functions of conditional on and , respectively. Note that it is important to understand how the study is designed before interpreting Theorem 1. Different designs lead to different meanings of , , and . Two popular designs in medical research are case-control study and cohort study. In a case-control study, the diseased status is known when sampled, and then is known and equal to the proportion of diseased subjects among the sampled subjects. Also, , and are the distributions of among the diseased and non-diseased subjects, respectively. In a cohort study, , , are i.i.d., and then is the prevalence of the disease and can be estimated by the proportion of diseased subjects among the sampled subjects.
To estimate the Youden index and its associated cut-point, various modeling strategies have been proposed. The parametric methods [8, 9, 10] impose distributional assumptions on and , and estimate as the solution of . The nonparametric methods [11, 12, 13] relax the distributional assumption and estimate and in a nonparametric fashion, and then estimate as the solution of .
By Proposition 1, rather than focusing on the conditional density estimation, the expectation in (1) could be approximated by its empirical version. Then given the training sample , the estimated is defined as a solution of
where , , , , and denotes the cardinality of a set. It is clear that the solution to (2) is equivalent to estimating and by their empirical distributions. The optimization in (2) can be solved by an exhaustive search over all possible values of , by noting that the objective function does not change when varies between two adjacent values of ’s. Another desirable property of the formulation (2) is that it can be naturally extended to covariate-adjusted formulation, where is allowed to vary according to subject’s profile.
3 Covariate-adjusted formulation
In many situations, the accuracy of diagnostic tests could be largely influenced by various factors such as the demographic characteristics of subjects [15, 16] or the design characteristics of diagnostic tests . For instance, Pepe  investigated an audiology study and found that the test accuracy is associated with auditory stimulus levels for patients.
To incorporate the effect of covariates, Faraggi , Simth and Thompson, and Guttman et al.  employed linear regression models. Pepe  further formulated a general regression framework to evaluate the effect of covariates. To relax the restrictive model assumptions, Pepe , Cai and Pepe  and Cai  proposed a semi-parametric generalized linear model (GLM) for covariate-adjusted ROC curve without predicting the sensitivity and specificity at a given threshold. Yao et al.  and Zhou  employed a non-parametric heteroscedastic regression model to address the covariate adjustment for the ROC curve and the related indices such as the AUC and the Youden index.
In this section, we generalize the formulation of large margin classification in (2) to estimate the covariate-adjusted Youden index and cut-point, and evaluate the effect of covariates.
3.1 Covariate-adjusted cut-point
Let , and , where denotes subject’s profile. For convenience of describing the main idea, here we assume that for any . This holds for cohort studies where subjects are sampled randomly. But for case-control studies, we need to assume that the cases (diseased subjects) and the controls (non-diseased subjects) are sampled with covariates being matched. If covariate is not matched, propensity scores  could be used to estimate .
Extending the formulation in (1), the ideal covariate-adjusted cut-point is a solution to
where is a function of , and the expectation is taken with respect to . Similarly as in (1), we can show that must satisfy
where is assumed to be a continuous and strictly increasing function of for any value of .
To estimate the covariate-adjusted , note that the empirical version of (3)
involves the 0-1 loss and needs to be optimized with respect to functional . It can no longer be solved by the exhaustive grid search or any other efficient optimization techniques. In this paper, we propose a novel surrogate loss, -loss, which extends the -loss [27, 28] by introducing a parameter that controls the difference between the surrogate loss and the 0-1 loss. More importantly, Proposition 2 shows that the -loss is asymptotically Fisher consistent in estimating when approaches 0.
For any given , if the conditional density is continuous and is strictly increasing in , then as uniformly over a compact set containing and
With the -loss, the proposed estimation formulation for the covariate-adjusted cut-point is a solution of
where is a tuning parameter, is a regularization term on the complexity of , and is set as a reproducing kernel Hilbert space (RKHS; ). The final estimation formulation then becomes
where is the RKHS induced by some pre-specified kernel function such as linear kernel or Gaussian kernel, and is the associated RKHS norm of . It follows from the representer theorem  that the solution to (7) is of the form , and thus with and .
3.2 Non-convex optimization
Note that the objective function in (7) is non-convex, and thus we employ the difference convex algorithm (DCA; ) to tackle the non-convex optimization. The key idea of the DCA is to decompose the non-convex objective function into the difference of two convex functions, and then construct a sequence of subproblems by approximating the second convex function with its affine minorization function.
In particular, the -loss is decomposed as
Then the objective function in (7) can be decomposed as , where
and is the coefficient vector for the RKHS representation of .
Next, the DCA constructs a sequence of decreasing upper envelop of by approximating with its affine minorization function, , where is the estimated at the -th iteration, and is the subgradient of at . The updated is then obtained by solving
The updating scheme is iterated until convergence. Although the DCA cannot guarantee global optimum, it delivers a superior numerical performance as demonstrated in the extensive simulation study in .
3.3 Covariate-adjusted Youden index
For any given , with the covariate-adjusted cut-point , the covariate-adjusted Youden index is expressed as
Then estimation of boils down to estimation of the conditional probabilities in (9).
In literature, Faraggi  and Smith and Thompson  estimated the conditional probabilities assuming Weibull or normal distribution, respectively. Yao et al.  and Zhou  proposed to estimate the conditional probability by using the heteroscedatic regression models without assuming any distributional assumption. In this paper, we adopt the similar kernel smoothing method as in Yao et al.  to overcome the lack of observations sharing the same for estimating and . In specific, the estimated is
where and are bandwidths for the diseased and non-diseased subjects, and is any pre-specified kernel density function.
4 Simulation examples
This section examines the proposed estimation method for estimating the covariate-adjusted Youden index and its associated cut-point using simulated examples. The numerical performance of the proposed covariate-adjusted estimation (CAE) method is compared against normal regression model (NRM) of Faraggi  and heteroscedastic regression model (HRM) of Yao et al.  and Zhou .
For illustration, the kernel function used in all methods is set as the Gaussian kernel . In simulated examples, the true and are known, and thus the empirical integrated squared errors
are employed to select the tuning parameter for estimating and bandwidths and for estimating , respectively. In all examples, the grid for selecting is set as and the grid for selecting is set as . For our method, for all simulated examples as discussed in “On minimum clinically important difference” by Hedayat et al..
Example 1. A random sample is generated as follows. First, is generated from , and is generated from . Second, if , then is generated from , where denotes the c.d.f. of standard normal distribution. Otherwise, is generated from . Similar example was used in Yao et al. (2010) and Zhou (2011).
Example 2. A random sample is generated as follows. First, is generated from , and is generated from . Second, if , then is generated from ; otherwise, is generated from .
Example 3. A random sample is generated as follows. First, is generated from with , and is generated from . Second, if , then is generated from , where and ; otherwise, is generated from .
Example 4. A random sample is generated as follows. First, is generated from with , and is generated from . Second, if , then is generated from , where ; otherwise, is generated from .
In all examples, the sample size is set as and . All examples are replicated 50 times and the averaged performance and the corresponding standard deviations of empirical integrated squared errors for and are reported in Table 1 and Table 2.
It is evident that our proposed method outperforms NRM and HRM for estimating the cut-point except for the cases in Examples 1 and the case with in Example 2. The performance of NRM largely depends on whether the data follows a normal distribution or not. HRM yields competitive performance in Examples 1 and 2 where the number of covariate , but its numerical performance appears to be less satisfactory in Examples 3 and 4 with . This is due to the reason that HRM requires a more complicated estimation framework when gets larger (Yao et. al, 2010). Furthermore, our proposed method also delivers competitive performance for estimating .
5 Real application
In this section, we applied our proposed method to analyze the Pima Indians diabetes study dataset. The Pima Indians diabetes study aims to evaluate the effectiveness of plasma glucose concentration in an OGTT for discriminating patients with diabetes, which is a classical and standard diagnostic test for diabetes . The dataset is publicly available at the University of California Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/). It was originally discussed in Smith et al. , and also analyzed by Zhou  using the heteroscedastic regression model for the covariate-adjusted Youden index.
The dataset consists of nine variables: number of times pregnant, plasma glucose concentration in an OGTT, diastolic blood pressure (mm Hg), triceps skin fold thickness(mm), 2-hour serum insulin (mu U/ml), body mass index, diabetes pedigree function, age (years), disease status variable (0 or 1). There are 268 subjects in the case group and 500 subjects in the control group. Incomplete observations with OGTT value 0 are removed for the analysis.
In this section, we attempt to estimate the Youden index and the associated cut-point adjusted for the covariate age. For simplicity, we focus on the subjects who are younger than 60 years old due to the sparseness of senior subjects . We also set and , and select the tuning parameter by 5-fold cross validation targeting on maximizing (5). Figure 2 depicts the estimated the Youden index and the associated cut-point as functions of age.
|Figure 2 about here.|
It is clear that the value of cut-point increases with age and the value of the Youden index decreases with age. Similar conclusion has also been reached in Faraggi  and Zhou . Smith and Thompson  also studied the effect of age on the ROC curve for the Cairo diabetes based on the belief that “aging process may be associated with relative insulin deficiency or resistance among persons who do not have diabetes”.
6 Closing remarks
This paper proposes a new framework for estimating the covariate-adjusted Youden index and its associated cut-point. As opposed to existing methods focusing on estimating conditional density functions, the proposed method targets on directly estimating the covariate-adjusted cut-point, and formulates the estimation problem in a large margin classification framework. A new surrogate loss function is proposed, and the resultant non-convex optimization is solved through difference convex algorithm. One key advantage of our proposed method is its estimation of the covariate-adjusted cut-point when a relatively large number of covariates are present, where multi-dimensional density estimations in existing methods are often unreliable.
Then desired result follows immediately. That is to say,
and by the Bayes’ rule, .
Proof of Proposition 2. For any given , since , we have
Note that is decreasing in , and approaches 0 when . Furthermore, , which is increasing in when . Therefore, there exist and such that
This implies that for any , . Similarly, there exist and such that for any , . Therefore, for any , must lie in a compact set around .
The second term on the right hand side of (11) is bounded below by 0 and above by and is decreasing in . Therefore, by Dini’s theorem, converges to 0 uniformly over as . It further implies that converges to uniformly over as . This, together with the fact that is convex in , implies that
when converges to zero.
|CAE||0.126 (0.0887)||0.060 (0.0401)||0.048 (0.0398)|
|NRM||0.087 (0.0366)||0.075 (0.0147)||0.073 (0.0100)|
|HRM||0.069 (0.0358)||0.051 (0.0277)||0.035 (0.0170)|
|CAE||0.964 (0.9410)||0.462 (0.3466)||0.278 (0.1995)|
|NRM||1.066 (0.5920)||0.918 (0.3896)||0.945 (0.3284)|
|HRM||0.897 (0.5400)||0.496 (0.2659)||0.311 (0.1372)|
|CAE||1.715 (1.1838)||0.796 (0.4590)||0.441 (0.1781)|
|NRM||15.786 (5.4328)||14.863 (3.7421)||15.352 (2.6172)|
|HRM||3.595 (1.8378)||2.008 (0.6182)||1.379 (0.3460)|
|CAE||8.979 (3.1713)||5.545 (2.0988)||3.632 (0.9736)|
|NRM||25.727 (4.8747)||20.963 (3.2498)||21.655 (2.5352)|
|HRM||10.723 (3.9732)||6.377 (2.1690)||4.531 (1.0852)|
|CAE||0.010 (0.0084)||0.006 (0.0041)||0.004 (0.0030)|
|NRM||0.013 (0.0097)||0.011 (0.0056)||0.006 (0.0035)|
|HRM||0.010 (0.0073)||0.007 (0.0071)||0.006 (0.0055)|
|CAE||0.011 (0.0118)||0.005 (0.0047)||0.003 (0.0017)|
|NRM||0.016 (0.0119)||0.013 (0.0059)||0.014 (0.0042)|
|HRM||0.017 (0.0132)||0.010 (0.0099)||0.007 (0.0082)|
|CAE||0.163 (0.0421)||0.141 (0.0225)||0.129 (0.0150)|
|NRM||0.270 (0.0771)||0.288 (0.0480)||0.287 (0.0308)|
|HRM||0.123 (0.0343)||0.091 (0.0250)||0.071 (0.0168)|
|CAE||0.079 (0.0330)||0.062 (0.0158)||0.049 (0.0089)|
|NRM||0.074 (0.0398)||0.050 (0.0227)||0.059 (0.0127)|
|HRM||0.073 (0.0209)||0.063 (0.0139)||0.037 (0.0070)|
- Shapiro, D. (1999). The interpretation of diagnostic tests, Statistical Methods in Medical Research, 8, 113-134.
- Zhou, X., McClish, D., and Obuchowski, N. (2002). Statistical methods in diagnostic medicine. Wiley, New York.
- Pepe, M. (2003). The statistical evaluation of medical tests for classification and prediction. Oxford University Press.
- Youden, W. (1950). An index for rating diagnostic tests, Cancer, 3, 32-35.
- Bamber, D. (1975). The area above the ordinal dominance graph and the area below the receive operationg characteristic graph. Journal of Mathematical Psychology, 12, 387-415.
- Aoki, K., Misumi, J., Kimura, T., Zhao, W. and Xie, T. (1997). Evaluation of cutoff levels for screening of gastric cancer using serum pepsinogens and distributions of levels of serum pepsinogens I, II and of PG I/PG II ratios in a gastric cancer case-control study, Journal of Epidemiology, 7, 143-151.
- Castle, P., Lorincz, A., Scott, D., Sherman, M., Glass, A., Rush, B., Wacholder, S., Burk, R., Manos, M., Schussler, J., Macomber, P. and Schiffman, M. (2003). Comparison between prototype hybrid capture 3 and hybrid capture 2 human papillomavirus DNA assays for detection of high-grade cervical interepithelial neoplasia and cancer, Journal of Clinical Microbiology, 9, 4022-4030.
- Zou, K. and Hall W. (2000). Two transformation models for estimating an ROC curve derived from continuous data. Journal of Applied Statistics, 27, 621-631.
- Faraggi, D. and Reiser, B. (2002). Estimation of the area under the ROC curve, Statistics in Medicine, 21, 2093-3106.
- Schisterman, E., Perkins, N., Liu, A. and Bondell, H. (2005). Optimal cut-point and its corresponding Youden Index to discriminate individuals using pooled blood samples, Epidemiology, 16, 73-81.
- Hsieh, F. and Turnbull, W. (1996). Nonparametric methods for evaluating diagnostic tests, Statistia Sinica, 6, 47-62.
- Zou, K., Tempany, C., Fielding, J. and Silverman, S. (1998). Original smooth receiver operating characteristic curves estimation from continuous data: statistical methods for analysing the predictive value of spiral CT of urethral stones. Academic Radiology, 5, 680-687.
- Knight, K. (2000). Mathematical Statistics. Chapman & Hall/CRC.
- Fluss, R., Faraggi, D. and Reiser, B. (2005). Estimation of the Youden index and its associated cutoff point, Biometrical Journal, 47, 458-472.
- Pepe, M. (2000). An interpretation for the ROC curve and inference using GLM procedures, Biometrics, 56, 352-359.
- Faraggi, D. (2003). Adjusting receiver operating curves and related indices for covariates, The Statistician, 52, 179-192.
- Yao, F., Craiu, R. V. and Reiser, B. (2010). Nonparametric covariate adjustment for receiver operating characteristic curves, The Canadian Journal of Statistics, 38, 27-46.
- Zhou, H. (2011). Statistical Inferences for the Youden Index. Ph.D. Dissertation, Georgia State University.
- Wahba, G. (1990). Spline models for observational data. Philadelphia: SIAM.
- Pepe, M. (1997). A regression modeling framework for receiver operating characteristics curves in medical diagnostic testing, Biometrika, 84, 595-608.
- Pepe, M. (1998). Three approaches to regression analysis of reciever operating characteristic curves for continues test results, Biometrics, 54, 124-135.
- Smith, P. and Thompson, T. (1996). Correcting for confounding in analyzing receiver operating characteristic curves, Biometrical Journal, 38, 857-863.
- Guttman, I., Johnson, r., Bhattacharyya, G. and Reiser, B. (1988). Confidence limits for stress-strength models with explanatory variables, Technometrics, 30, 161-168.
- Cai, T. and Pepe, M. (2002). Semi-parametric receiver operating characteristic analysis to evaluate biomarkers for disease, Journal of the American Statistical Association, 97, 1099-1107.
- Cai, T. (2004). Semi-parametric ROC regression analysis with placement values, Biostatistics, 5, 45-60.
- Rosenbaum, P. and Rubin, D. (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika 70, 41â55.
- Shen, X., Tseng, G., Zhang, X., and Wong, W. (2003). On -learning. Journal of the American Statistical Association, 98, 724-734.
- Liu, Y. and Shen, X. (2006). Multicategory psi-learning. Journal of the American Statistical Association, 101, 500-509.
- An, L. and Tao, P. (1997). Solving a class of linearly constrained indefinite quadratic problems by D.C. algorithms. Journal of Global Optimization, 11, 253-285.
- Liu, S., Shen, X. and Wong W. (2005). Computational development of -learning. Proceedings of the SIAM International Conference on Data Mining, Newport, CA, 1-12.
- World Health Organization (1985). Diabetes mellitus, Technical report series 727, World Health Organization: Geneva.
- Smith, J., Everhart, J., Dickson, W., Knowler, W. and Johannes, R. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus, Johns Hopkins APL Technical Digest, 10, 262-266.