A Descriptive Study of Variable Discretization and Cost-Sensitive Logistic Regression on Imbalanced Credit Data
Training classification models on imbalanced data tends to result in bias towards the majority class. In this paper, we demonstrate how variable discretization and cost-sensitive logistic regression help mitigate this bias on an imbalanced credit scoring dataset, and further apply the variable discretization technique to data from other domains, demonstrating its potential as a generic technique for classifying imbalanced data beyond credit scoring. The performance measurements include ROC curves, Area under ROC Curve (AUC), Type I Error, Type II Error, accuracy, and F1 score. The results show that proper variable discretization and cost-sensitive logistic regression with the best class weights can reduce the model bias and/or variance. From the perspective of the algorithm, cost-sensitive logistic regression is beneficial for increasing the value of predictors even if they are not in their optimized forms while maintaining monotonicity. From the perspective of predictors, variable discretization performs better than cost-sensitive logistic regression, provides more reasonable coefficient estimates for predictors which have nonlinear relationships against their empirical logit, and is robust to penalty weights on misclassifications of events and non-events determined by their a priori proportions.
Class imbalance; variable discretization; cost-sensitive logistic regression; discrimination ability; credit scoring
Class imbalance problems refer to a class of problems related to classifying imbalanced data where many more observations are labeled by the majority class than the minority class. In practice, the minority class is usually the class of interest, such as fraud in the fraud detection problem, malignancy in the breast cancer diagnosis problem, delinquency in the credit scoring problem, sinus bradycardia in the arrhythmia analysis, and poor quality in the product quality inspection.
However, when trained on imbalanced data, most standard statistical and machine learning models are heavily biased towards the majority class (i.e. non-events) and severely misclassify the minority class (i.e. events), because they assume an equal target class distribution and maximize overall accuracy. Models with poor event discrimination are less useful and generate costs associated with Type II errors (money, reputation, health, etc.).
To solve these problems more efficiently, researchers and practitioners have made efforts from various perspectives, such as data sampling, feature selection, cost-sensitive learning, ensemble learning, and kernel-based learning, taking concrete problem characteristics into account.
Previous research has not considered variable discretization as a generic technique for class imbalance problems. In this paper, we empirically explore the effects of variable discretization on classifying imbalanced data and compare it with cost-sensitive logistic regression models. Variable discretization and cost-sensitive logistic regression are studied for their high interpretability and computational efficiency. A credit scoring dataset is used in the case study. The goal is to predict the probability of a debtor's default or delinquency. The proportion of delinquency observations is only 6.68%. We provide a detailed descriptive study of how variable discretization and cost-sensitive logistic regression help mitigate the model bias and/or variance on an imbalanced credit scoring dataset. The variable discretization technique is further applied to two datasets from other domains (i.e. biology, business) to demonstrate its potential for use in a wide range of fields.
The paper is structured as follows. In Section 2, related work is reviewed. In Section 3, the data is explored and discretized. In Section 4, the models on the credit scoring dataset are developed, evaluated, and compared. In Section 5, the performance of variable discretization is examined on two datasets from other domains. In Section 6, conclusions and future work are discussed.
2 Related Work
A comprehensive review of the foundations, algorithms, and applications of imbalanced learning was conducted by He et al. in 2013. It summarized the previous research in five categories: sampling methods, cost-sensitive methods, kernel-based learning methods, active learning methods, and one-class learning methods. It also suggested evaluating models based on both curve-based measures (e.g. ROC curve, AUC) and single-value measures (e.g. Type I Error, Type II Error, F1 score, G-mean), considering that some traditional performance measures (e.g. accuracy) do not serve as good indicators of the discrimination abilities of models. In an imbalanced credit scoring study by Wang et al., AUC and F-measure (i.e. F1 score) were used as model performance metrics.
In 2001, King and Zeng proposed the weighted log-likelihood function in Eq. 2 for logistic regression in rare events data. Compared with the standard log-likelihood function in Eq. 1, a Class 1 Weight and a Class 0 Weight were added to penalize the misclassifications of events and non-events differently. The two weights were determined by the estimated population proportion of events and the sample proportion of events.
The weighted logistic regression in Eq. 2 is referred to as class-dependent cost-sensitive logistic regression. Bahnsen et al. proposed a different version of cost-sensitive logistic regression, called example-dependent cost-sensitive logistic regression, where each example (i.e. observation) in the log-likelihood function was associated with a user-defined constant misclassification cost weight based on domain knowledge. Deng and Maher proposed determining each observation's cost weight by a Gaussian kernel function, resulting in very high computational complexity and limiting its application to big data.
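The equations cited in this section (Eq. 1 to Eq. 3) do not appear in this text. As a reference sketch, the class-weighted log-likelihood of King and Zeng is commonly written as follows. The symbols $\pi_i$, $w_1$, $w_0$, $\tau$, and $\bar{y}$ are notation assumed here and may differ from the notation of the original equations.

```latex
% Standard log-likelihood (Eq. 1), with \pi_i = 1 / (1 + e^{-x_i \beta})
\ln L(\beta \mid y) = \sum_{\{Y_i = 1\}} \ln(\pi_i) + \sum_{\{Y_i = 0\}} \ln(1 - \pi_i)

% Weighted log-likelihood (Eq. 2)
\ln L_w(\beta \mid y) = w_1 \sum_{\{Y_i = 1\}} \ln(\pi_i) + w_0 \sum_{\{Y_i = 0\}} \ln(1 - \pi_i)

% Class weights (Eq. 3): \tau is the population proportion of events,
% \bar{y} is the sample proportion of events
w_1 = \frac{\tau}{\bar{y}}, \qquad w_0 = \frac{1 - \tau}{1 - \bar{y}}
```

With $w_1 = w_0 = 1$, which happens when $\tau = \bar{y}$, the weighted form reduces to the standard log-likelihood.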
Unlike cost-sensitive logistic regression, which has been widely used, the variable discretization method has not been considered for addressing class imbalance problems, although it has been widely used as a domain-specific standard technique in credit scoring. This technique creates more powerful and interpretable predictors from continuous (i.e. interval) data. Dougherty et al. reviewed existing variable discretization methods, compared three of them (i.e. equal width interval, entropy-based, and purity-based) in depth on 16 datasets, and found that the global entropy-based one performed the best on average. For entropy-based discretization methods, the evaluation measures include class information entropy, Gini, dissimilarity, and the Hellinger measure. For the scoring problem, one commonly used variable discretization method is optimal binning, which computes the cutoff points based on conditional inference trees and recursive partitioning.
To select powerful discretized variables, one common measure is the information value defined in Eq. 4. It is computed from two quantities for each level of the variable: the number of non-events (i.e. non-delinquency) in that level divided by the total number of non-events, and the number of events (i.e. delinquency) in that level divided by the total number of events. To interpret the information value, the following rule of thumb is proposed.
to : weak
to : medium
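The information value can be computed directly from the two per-level proportions described above. The sketch below assumes the commonly used form, the sum over levels of (g − b)·ln(g / b), where g and b are the shares of non-events and events in a level. The function name, the symbols g and b, and the choice to skip levels where the logarithm is undefined are illustrative assumptions, not taken from the paper.

```python
import math


def information_value(levels, targets):
    """Information value of a discretized variable.

    levels  : bin label of each observation
    targets : 1 for an event (e.g. delinquency), 0 for a non-event
    """
    total_events = sum(targets)
    total_non_events = len(targets) - total_events

    counts = {}  # level -> [number of non-events, number of events]
    for level, y in zip(levels, targets):
        counts.setdefault(level, [0, 0])[y] += 1

    iv = 0.0
    for non_events, events in counts.values():
        g = non_events / total_non_events  # share of all non-events in this level
        b = events / total_events          # share of all events in this level
        if g > 0 and b > 0:  # assumption: skip levels where the log is undefined
            iv += (g - b) * math.log(g / b)
    return iv


# Toy data: events concentrate in the "high" level.
levels = ["low"] * 4 + ["high"] * 4
targets = [0, 0, 0, 1, 0, 1, 1, 1]
iv = information_value(levels, targets)

# A variable whose levels carry no information about the target.
iv_flat = information_value(["a", "a", "b", "b"], [0, 1, 0, 1])
```

In practice, levels with zero events or zero non-events are often handled by merging bins or by adding a small constant rather than by skipping them.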
Demographic and financial information from borrowers is publicly available in a dataset used in the 2011 Kaggle competition Give Me Some Credit. The characteristics of the individuals in the data are represented by variables, as shown in Table 3. The goal was to predict whether a client would experience financial distress in the next two years or not, as indicated by the dependent variable. Table 3 also gives the numbers of delinquent and non-delinquent observations. The proportion of delinquencies is 6.68%.
There are observations with missing values either in the variable or , which is of the total. These missing values are treated as follows.
Missing Completely at Random (MCAR) analysis is conducted, and no pattern is found in the missing data. Hence, when building the model with original variables, those observations are dropped to ensure data accuracy and support the model training computation. After dropping the missing data, the proportion of delinquencies remains very close to that of the original data.
When building the model with discretized variables, those observations are kept by grouping the missing values into a separate level of the variable.
3.1 Exploratory Analysis
Because the dependent variable is binary and all independent variables are interval, the empirical logit plot is used to examine the linearity of the relationship between the dependent variable and independent variables. If the relationship is linear, it is reasonable to use the interval form of an independent variable. Otherwise, a transformation is required. Moreover, through the empirical logit plots, we can check the univariate effects, positive or negative.
The empirical logit plot is created in the following steps.
For each interval variable, generate percentile ranks from to .
For each rank of each interval variable, calculate the total number of observations , the number of delinquency observations , and the mean of the interval variable .
For each rank of each interval variable, compute the empirical logit using the formula .
For each interval variable, plot the empirical logit against the mean in each rank and their linear regression line. Each point in the plot represents data points from the dataset by their mean.
For each interval variable, plot the empirical logit against the rank and their linear regression line. Each point in the plot represents data points from the dataset by their rank index.
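The steps above can be sketched in plain Python as follows. The exact empirical logit formula is not recoverable from this text, so the sketch assumes one common smoothed form, ln((events + 0.5) / (n − events + 0.5)). The function name, the default of 20 ranks, and the toy data are also illustrative assumptions.

```python
import math


def empirical_logit_table(x, y, n_ranks=20):
    """Return one (rank, n, events, mean_x, elogit) row per non-empty rank."""
    # Step 1: assign quantile ranks 0 .. n_ranks - 1 by sorted position.
    order = sorted(range(len(x)), key=lambda i: x[i])
    groups = {}
    for position, i in enumerate(order):
        rank = position * n_ranks // len(x)
        groups.setdefault(rank, []).append(i)

    rows = []
    for rank in sorted(groups):
        idx = groups[rank]
        # Step 2: per-rank totals and mean of the interval variable.
        n = len(idx)
        events = sum(y[i] for i in idx)
        mean_x = sum(x[i] for i in idx) / n
        # Step 3: assumed smoothed form; adding 0.5 keeps the logit finite
        # when a rank contains 0 or n events.
        elogit = math.log((events + 0.5) / (n - events + 0.5))
        rows.append((rank, n, events, mean_x, elogit))
    return rows


x = list(range(1, 13))
y = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
rows = empirical_logit_table(x, y, n_ranks=3)
```

Plotting the last column against `mean_x` or against `rank` gives the two plots described in the final two steps. This sketch splits tied values across ranks by position and does not merge ranks that share the same minimum and maximum.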
For example, consider one predictor variable, whose percentile ranks can be found in Table 3.1. Some ranks are merged together because their respective minimum and maximum points are the same. As shown in Figure (a), there is a nonlinear relationship between the variable and its empirical logit, mainly caused by extreme values. These extreme values in the empirical logit plot cannot be simply removed, considering they represent several hundred data points in the dataset. However, the relationship between its rank and its empirical logit is approximately linear, as shown in Figure (b). In this case, its rank, the discretized form of its original interval values, is preferred for modeling.
3.2 Variable Discretization
Four variable discretization methods (i.e. distance, quantile, Gini, optimal binning) are compared. On the credit scoring dataset, quantile discretization produces the highest AUC on the test data with the logistic regression model trained on the training data, where the data is split into training data (70%) and test data (30%). Each variable is ranked and discretized into at most 20 bins based on the quantile, with the threshold value selected by the same procedure above.
Information value is used to measure the discrimination power of each individual variable after discretization, as shown in Table 3.2. Note that for some variables, the resulting number of bins is less than 20 because bins with non-significant differences are merged together. For the variable , an additional bin has been included to accommodate missing values. Following the rule suggested by Hand et al., the variables with information value over 0.1 will be studied.
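The quantile discretization described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: cut points are taken at equally spaced quantiles, bins are merged only when their cut points coincide (the significance-based merging mentioned above is not implemented), and missing values, represented here by `None`, get their own level. All function names are hypothetical.

```python
from bisect import bisect_right


def quantile_cut_points(values, max_bins=20):
    """Cut points at equally spaced quantiles, with duplicate cuts removed."""
    ordered = sorted(v for v in values if v is not None)
    n = len(ordered)
    cuts = []
    for k in range(1, max_bins):
        cut = ordered[k * n // max_bins]
        if not cuts or cut > cuts[-1]:  # identical cut points collapse into one
            cuts.append(cut)
    return cuts


def assign_bin(value, cuts):
    """Bin index for a value; a missing value gets a separate level."""
    if value is None:
        return "missing"
    return bisect_right(cuts, value)


# Heavily tied data: fewer bins than max_bins survive.
values = [1, 1, 1, 1, 1, 1, 2, 3, 4, 5]
cuts = quantile_cut_points(values, max_bins=5)
bins = [assign_bin(v, cuts) for v in values + [None]]
```

Because tied values produce repeated cut points, the number of resulting bins can be smaller than `max_bins`, which is one reason a variable may end up with fewer bins than the maximum.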
3.3 Datasets from Other Domains
Beyond the credit scoring data, two public datasets from other domains (i.e. biology, business) are collected. They include 206 and 11 interval variables respectively, as shown in Table 3.3. The goal of the arrhythmia data is to predict sinus bradycardia, and the goal of the wine_quality data is to predict poor quality. The process illustrated in Sections 3.1 and 3.2 is performed on these two datasets. Among all variable discretization methods, the optimal binning method produces the best performance. The resulting discretized variables will be modeled using logistic regression in Section 5.
Logistic regression and class-dependent cost-sensitive logistic regression are used as classifiers for their high interpretability. The models are evaluated by k-fold cross-validation. The performance measurements include ROC curve, AUC, Type I Error, Type II Error, accuracy, and F1 Score. The mean of the AUCs across the k folds is used to measure the model bias, while the standard deviation of the AUCs across the k folds is used to measure the model variance. These are reasonable measurements, considering that the model bias refers to the error introduced by approximating the true model, and the model variance refers to the amount by which the estimated model would change if a different training dataset were used.
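As an illustration of these two summary statistics, the sketch below computes the AUC of each fold from held-out scores and then takes the mean and standard deviation across folds. The fold data are made up, the function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which form was used.

```python
import statistics


def auc(scores, labels):
    """AUC as the probability that a random event outscores a random non-event."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Ties between an event and a non-event count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize_folds(fold_results):
    """Mean (bias proxy) and standard deviation (variance proxy) of fold AUCs."""
    aucs = [auc(scores, labels) for scores, labels in fold_results]
    return statistics.mean(aucs), statistics.stdev(aucs)


# Each fold: (predicted scores on the held-out part, true labels).
folds = [
    ([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]),
    ([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]),
    ([0.6, 0.4, 0.5, 0.1], [1, 1, 0, 0]),
]
mean_auc, sd_auc = summarize_folds(folds)
```

A higher mean indicates better average discrimination, and a smaller standard deviation indicates that the fitted model changes less from one training subset to another.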
To evaluate and compare the performance of variable discretization and class-dependent cost-sensitive logistic regression, the following five models are built.
Model : Logistic regression model on the original interval form of all independent variables in Table 3.
Model : Logistic regression model on the original interval form of the variables with information value over 0.1 in Table 3.2.
Model : Class-dependent cost-sensitive logistic regression model on the same independent variables as in Model . The class weights that produce the highest mean of AUCs of k-fold cross-validation are used in the modeling, indicated by the dashed line in Figure (b). The search for the best class weights is discussed below.
Model : Logistic regression model on the discretized form of the independent variables used in Model . The discretized variables are transformed into binary dummy variables by the one-hot encoder.
Model : Class-dependent cost-sensitive logistic regression model on the same discretized independent variables as in Model . The class weights that produce the highest mean of AUCs of k-fold cross-validation are used in the modeling, indicated by the solid line in Figure (b).
The class weights in Model and Model are determined by the population proportion of events and the sample proportion of events in Eq. 3. The sample proportion is known from the data. The population proportion is typically unknown and hard to estimate accurately. Here the population proportion is tuned as a hyperparameter from 0 to 0.5. As shown in Figure (a), as it increases, the Class 1 Weight increases and the Class 0 Weight decreases linearly. Figure (b) shows how the mean of AUCs on the k-fold cross-validation changes as it increases. When modeling on interval variables in Model , the best occurs at , resulting in and . The changes of the class weights have minimal influence on the modeling of discretized variables used in Model , implying that good variable discretization is robust to penalty weights determined by proportions of events and non-events. Hence, for Model , we take and , making Model the same as Model . Because of this, we will only compare Model with other models in the following section.
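The weight search described above can be sketched as follows, assuming the King and Zeng form of the class weights: Class 1 Weight = τ / ȳ and Class 0 Weight = (1 − τ) / (1 − ȳ), where τ is the population proportion of events and ȳ is the sample proportion. The function names and the grid step are illustrative, and ȳ = 0.0668 is used only as an example value.

```python
import math


def class_weights(tau, y_bar):
    """Class 1 and Class 0 weights from the assumed King and Zeng form."""
    w1 = tau / y_bar
    w0 = (1 - tau) / (1 - y_bar)
    return w1, w0


def weighted_log_likelihood(probs, labels, w1, w0):
    """Class-weighted log-likelihood for predicted event probabilities."""
    return sum(
        w1 * math.log(p) if y == 1 else w0 * math.log(1 - p)
        for p, y in zip(probs, labels)
    )


y_bar = 0.0668                          # example sample proportion of events
grid = [i / 100 for i in range(1, 51)]  # candidate values 0.01 .. 0.50
weights = [class_weights(tau, y_bar) for tau in grid]
```

Each candidate pair of weights would then be used to fit a model, and the pair with the highest mean cross-validated AUC would be kept. When τ equals ȳ, both weights are 1 and the weighted model coincides with standard logistic regression.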
The ROC curve of each model can be found in Figure 3. Model and Model have similar AUCs, indicating that the variables with information value below 0.1 provide minimal contribution. The ROC curves of Model and Model demonstrate stronger results than Model . Moreover, for Model , the ROC curves on k-fold cross-validation are closer to each other, indicating lower model variance. This is further confirmed by the mean and standard deviation of AUCs on k-fold cross-validation in Table 4. Model produces the highest mean and the lowest standard deviation of AUCs, demonstrating the power of variable discretization.
The estimated coefficients of the models are also examined. As shown in Table 4, Model and Model produce different estimates for every independent variable, as well as a different sign for the variable . Its sign is negative in Model , while its sign is positive in Model . Its empirical logit plot in Figure (c) shows a positive relationship. Based on its variance inflation factor (VIF) in Table 4, its sign change in Model is caused by its multicollinearity with the variables and . None of them can be dropped in the modeling because of their information values presented in Table 3.2. Model specifically guarantees a positive estimate, which is consistent with the univariate effect. For other variables, the signs of estimated parameters are consistent with their univariate effects shown in their empirical logit plots in Figures (a), (b), and (d). The estimated parameters of Model are not presented here because of space limitations. Considering these dummy variables are binary indicators transformed by the one-hot encoder, their estimated coefficients are more interpretable.
Further, these models are compared based on Type I Error, Type II Error, accuracy, and F1 score on the test data after splitting the original dataset into training data (70%) and test data (30%), which can be found in Table 4. The probability cutoff is chosen as the intersection point of the specificity plot and sensitivity plot, one of the most frequently used criteria. We have the following findings.
There is no improvement from Model to Model , indicating that variables with information value below 0.1 provide limited contribution.
Compared with Model , Model decreases Type I Error by 8.23%, decreases Type II Error by 8.3%, increases accuracy by 8.24%, and increases F1 score by , indicating the contribution of penalizing the misclassifications of events and non-events on different scales by running the class-dependent cost-sensitive logistic regression.
Compared with Model , Model decreases Type I Error by 11.85%, decreases Type II Error by 11.91%, increases accuracy by 11.85%, and increases F1 score by , indicating the contribution of variable discretization.
Compared with Model , Model decreases Type I Error by 3.62%, decreases Type II Error by 3.61%, increases accuracy by 3.61%, and increases F1 score by , indicating that variable discretization performs better than the inclusion of class-dependent costs in the logistic regression.
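The cutoff rule and the four single-value measures used in these comparisons can be sketched as follows. Type I Error is taken here to be the false positive rate and Type II Error the false negative rate, which is an assumption about the definitions used. The intersection of the sensitivity and specificity plots is approximated by the observed score at which the two are closest. Function names and the toy scores are illustrative.

```python
def confusion_counts(scores, labels, cutoff):
    """Counts of TP, FP, TN, FN when predicting an event if score >= cutoff."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        predicted = 1 if s >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def choose_cutoff(scores, labels):
    """Observed score at which sensitivity and specificity are closest."""
    best = None
    for cutoff in sorted(set(scores)):
        tp, fp, tn, fn = confusion_counts(scores, labels, cutoff)
        gap = abs(tp / (tp + fn) - tn / (tn + fp))
        if best is None or gap < best[0]:
            best = (gap, cutoff)
    return best[1]


def report(scores, labels, cutoff):
    tp, fp, tn, fn = confusion_counts(scores, labels, cutoff)
    return {
        "type_i_error": fp / (fp + tn),   # non-events classified as events
        "type_ii_error": fn / (fn + tp),  # events classified as non-events
        "accuracy": (tp + tn) / len(labels),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 1, 0, 0, 1, 1]
cutoff = choose_cutoff(scores, labels)
metrics = report(scores, labels, cutoff)
```

Because the cutoff balances sensitivity against specificity rather than maximizing accuracy, it tends to sit well below 0.5 when events are rare.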
5 Application of Variable Discretization in Other Domains
To further examine the power of variable discretization, logistic regression models with original interval variables and discretized variables in the datasets arrhythmia and wine_quality are built and compared. The original datasets are split into training data (70%) and test data (30%). Logistic regression models are trained on the training data and then evaluated on the test data.
Their resulting ROC curves on the test data can be found in Figure 5. For both datasets, the ROC curve obtained with discretized variables moves closer to the upper-left corner than the one obtained with interval variables. The improvement can be further checked by other performance measures (i.e. Type I Error, Type II Error, accuracy, F1 score) in Table 1, where the probability cutoff is chosen as the intersection point of the sensitivity plot and specificity plot. For example, on the dataset arrhythmia, Type I Error and Type II Error decrease, while accuracy and F1 score increase. Note that the probability cutoff on this dataset is very small. This is reasonable because some estimated probabilities are very close to 0: they are direct outputs of a sigmoid function ranging from 0 to 1, and the target classes (i.e. non-event, event) are represented by 0 and 1 in the data.
| Dataset | Model | AUC | Type I Error | Type II Error | Accuracy | F1 Score | Probability Cutoff |
6 Discussions and Conclusions
To improve the model performance on imbalanced data, this study has made efforts from the perspectives of both the predictors and the modeling algorithm. Through the detailed study on the credit scoring dataset, we show that proper variable discretization and class-dependent cost-sensitive logistic regression with the best class weights help reduce the model bias and/or variance, based on the ROC curves and AUC on k-fold cross-validation, Type I Error, Type II Error, accuracy, and F1 score. Moreover, class-dependent cost-sensitive logistic regression is beneficial for increasing the prediction power of predictors during the training phase, even if those predictors are not transformed into their best forms, and for keeping the multivariate effect and univariate effect of predictors consistent.
On the other hand, the logistic regression model with properly discretized variables performs better than class-dependent cost-sensitive logistic regression, provides more reasonable coefficient estimates, and is robust to penalty scales of misclassification costs of events and non-events determined by their proportions. This suggests discretizing variables that show nonlinear relationships against their empirical logits.
In this study, logistic regression and its variant (i.e. class-dependent cost-sensitive logistic regression) are used as classifiers. In the future, we will study the performance of variable discretization with other classifiers such as decision trees, support vector machines, and neural networks.
-  A. Ali, S.M. Shamsuddin, and A.L. Ralescu, Classification with class imbalance problem: a review, Int. J. Advance Soft Compu. Appl 7 (2015), pp. 176–204.
-  A.C. Bahnsen, D. Aouada, and B. Ottersten, Example-dependent cost-sensitive logistic regression for credit scoring, in Machine Learning and Applications (ICMLA), 2014 13th International Conference on. IEEE, 2014, pp. 263–269.
-  I. Brown and C. Mues, An experimental comparison of classification algorithms for imbalanced credit scoring data sets, Expert Systems with Applications 39 (2012), pp. 3446–3453.
-  G. Collell, D. Prelec, and K.R. Patil, A simple plug-in bagging ensemble based on threshold-moving for classifying binary and multiclass imbalanced data, Neurocomputing 275 (2018), pp. 330–340.
-  P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis, Modeling wine preferences by data mining from physicochemical properties, Decision Support Systems 47 (2009), pp. 547–553.
-  K. Deng, Omega: On-line memory-based general purpose system classifier, Ph.D. diss., Carnegie Mellon University, 1998.
-  J. Ding and W. Xiong, A new estimator for a population proportion using group testing, Communications in Statistics-Simulation and Computation 45 (2016), pp. 101–114.
-  S. Ding, B. Mirza, Z. Lin, J. Cao, X. Lai, T.V. Nguyen, and J. Sepulveda, Kernel based online learning for imbalance multiclass classification, Neurocomputing 277 (2018), pp. 139–148.
-  S. Donnelly and J. Verkuilen, Empirical logit analysis is not logistic regression, Journal of Memory and Language 94 (2017), pp. 28–42.
-  J. Dougherty, R. Kohavi, and M. Sahami, Supervised and unsupervised discretization of continuous features, in Machine Learning Proceedings 1995, Elsevier, 1995, pp. 194–202.
-  X. Guo, Y. Yin, C. Dong, G. Yang, and G. Zhou, On the class imbalance problem, in 2008 Fourth international conference on natural computation, Vol. 4. IEEE, 2008, pp. 192–201.
-  H.A. Guvenir, B. Acar, G. Demiroz, and A. Cekin, A supervised machine learning algorithm for arrhythmia analysis, in Computers in Cardiology 1997. IEEE, 1997, pp. 433–436.
-  F. Habibzadeh, P. Habibzadeh, and M. Yadollahie, On determining the most appropriate test cut-off value: the case of tests with continuous results, Biochemia medica 26 (2016), pp. 297–307.
-  D.J. Hand and W.E. Henley, Statistical classification methods in consumer credit scoring: a review, Journal of the Royal Statistical Society: Series A (Statistics in Society) 160 (1997), pp. 523–541.
-  H. He and Y. Ma, Imbalanced learning: foundations, algorithms, and applications, John Wiley & Sons, 2013.
-  G. James, D. Witten, T. Hastie, and R. Tibshirani, An introduction to statistical learning, Vol. 112, Springer, 2013.
-  N. Japkowicz, The class imbalance problem: Significance and strategies, in Proc. of the Int'l Conf. on Artificial Intelligence. 2000.
-  H. Jopia, Scoring Modeling and Optimal Binning (2018). Available at https://cran.r-project.org/web/packages/smbinning/smbinning.pdf, Accessed: 2018-10-11.
-  Kaggle, Give Me Some Credit. Available at https://www.kaggle.com/c/GiveMeSomeCredit/data, Accessed: 2018-02-01.
-  G. King and L. Zeng, Logistic regression in rare events data, Political analysis 9 (2001), pp. 137–163.
-  S. Kotsiantis and D. Kanellopoulos, Discretization techniques: A recent survey, GESTS International Transactions on Computer Science and Engineering 32 (2006), pp. 47–58.
-  B. Krawczyk, Learning from imbalanced data: open challenges and future directions, Progress in Artificial Intelligence 5 (2016), pp. 221–232.
-  B. Krawczyk, M. Galar, Ł. Jeleń, and F. Herrera, Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy, Applied Soft Computing 38 (2016), pp. 714–726.
-  B. Krawczyk and M. Woźniak, Cost-sensitive neural network with roc-based moving threshold for imbalanced classification, in International Conference on Intelligent Data Engineering and Automated Learning. Springer, 2015, pp. 45–52.
-  J.L. Leevy, T.M. Khoshgoftaar, R.A. Bauder, and N. Seliya, A survey on addressing high-class imbalance in big data, Journal of Big Data 5 (2018), p. 42.
-  M. Maalouf and T.B. Trafalis, Robust weighted kernel logistic regression in imbalanced and rare events data, Computational Statistics & Data Analysis 55 (2011), pp. 168–183.
-  M. Maalouf, T.B. Trafalis, and I. Adrianto, Kernel logistic regression using truncated newton method, Computational management science 8 (2011), pp. 415–428.
-  mlr-org, Cost-Sensitive Classification. Available at https://mlr-org.github.io/mlr-tutorial/release/html/cost_sensitive_classif/index.html, Accessed: 2018-04-27.
-  A. Moayedikia, K.L. Ong, Y.L. Boo, W.G. Yeoh, and R. Jensen, Feature selection for high dimensional imbalanced class data using harmony search, Engineering Applications of Artificial Intelligence 57 (2017), pp. 38–49.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011), pp. 2825–2830.
-  C. Phua, D. Alahakoon, and V. Lee, Minority report in fraud detection: classification of skewed data, Acm sigkdd explorations newsletter 6 (2004), pp. 50–59.
-  L.A. Pramono, S. Setiati, P. Soewondo, I. Subekti, A. Adisasmita, N. Kodim, and B. Sutrisna, Prevalence and predictors of undiagnosed diabetes mellitus in indonesia, Age 46 (2010), pp. 100–100.
-  F. Provost, Machine learning from imbalanced data sets 101, in Proceedings of the AAAI'2000 workshop on imbalanced data sets. 2000, pp. 1–3.
-  M.M. Rahman and D. Davis, Addressing the class imbalance problem in medical datasets, International Journal of Machine Learning and Computing 3 (2013), p. 224.
-  R. Rousseau, Basic properties of both percentile rank scores and the i3 indicator, Journal of the American Society for Information Science and Technology 63 (2012), pp. 416–420.
-  Scikit-learn, One Hot Encoder. Available at https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html, Accessed: 2018-02-01.
-  N. Siddiqi, Credit risk scorecards: developing and implementing intelligent credit scoring, Vol. 3, John Wiley & Sons, 2012.
-  S. Visa and A. Ralescu, Issues in mining imbalanced data sets-a review paper, in Proceedings of the sixteen midwest artificial intelligence and cognitive science conference, Vol. 2005. sn, 2005, pp. 67–73.
-  H. Wang, Q. Xu, and L. Zhou, Large unbalanced credit scoring using lasso-logistic regression ensemble, PloS one 10 (2015), p. e0117844.
-  G. Zeng, Metric divergence measures and information value in credit scoring, Journal of Mathematics 2013 (2013).