A Descriptive Study of Variable Discretization and Cost-Sensitive Logistic Regression on Imbalanced Credit Data
Abstract
Training classification models on imbalanced data sets tends to result in bias towards the majority class. In this paper, we demonstrate how variable discretization and Cost-Sensitive Logistic Regression help mitigate this bias on an imbalanced credit scoring data set. 10-fold cross-validation is used as the evaluation method, and the performance measurements are ROC curves and the associated Area Under the Curve. The results show that good variable discretization and Cost-Sensitive Logistic Regression with the best class weight can reduce the model bias and/or variance. It is also shown that effective variable selection helps reduce the model variance. From the algorithm perspective, Cost-Sensitive Logistic Regression is beneficial for increasing the prediction ability of predictors even if they are not in their best forms, and for keeping the multivariate effect and univariate effect of predictors consistent. From the predictors' perspective, variable discretization performs slightly better than Cost-Sensitive Logistic Regression, provides more reasonable coefficient estimates for predictors which have a nonlinear relationship against their empirical logit, and is robust to penalty weights for misclassifications of events and non-events determined by their proportions.
Application Note
Keywords: imbalanced learning; variable discretization; cost-sensitive logistic regression; credit scoring
1 Introduction
Imbalanced learning is defined as the knowledge discovery process on severely skewed data sets to support decision making [4]. The tasks include regression, classification, and clustering. For classification, it refers to learning the decision boundary on a data set where the proportion of interesting events in the dependent variable is very low. Effective classification on imbalanced data is key to many real-world problems like anti-money laundering, fraud detection, credit scoring, rare disease diagnosis, spam detection, and cybersecurity. However, classical machine learning algorithms and statistical methods usually perform poorly without any adjustment when the event rate is low [10].
To solve these problems more efficiently, researchers and practitioners have made efforts from various perspectives, such as data sampling and algorithm design, with consideration of concrete problem characteristics. In this paper, we focus on the credit scoring problem: predicting the probability of a debtor's default or delinquency. The default instances are usually far fewer than the non-default instances. We provide a detailed descriptive study on how variable discretization and Cost-Sensitive Logistic Regression help mitigate the bias on an imbalanced credit scoring data set. These two techniques are studied for their high interpretability, which serves the regulatory requirements of credit scoring.
2 Related Work
A comprehensive review on the foundations, algorithms, and applications of imbalanced learning was conducted by He et al. in 2013 [4]. It summarized the past research in five categories: sampling methods, cost-sensitive methods, kernel-based learning methods, active learning methods, and one-class learning methods. In 2001, King and Zeng proposed a weighting technique for logistic regression on rare events data, where the weighted log-likelihood in Eq. 2 is maximized instead of the log-likelihood in Eq. 1 during the training phase [7], with weights $w_1 = \tau / \bar{y}$ and $w_0 = (1 - \tau) / (1 - \bar{y})$, where $\tau$ is the population fraction of events and $\bar{y}$ is the sample proportion of events induced by choice-based sampling.
$$\ln L(\beta) = \sum_{i=1}^{n} \left[\, y_i \ln \pi_i + (1 - y_i) \ln(1 - \pi_i) \,\right] \qquad (1)$$
$$\ln L_w(\beta) = w_1 \sum_{\{i:\, y_i = 1\}} \ln \pi_i \;+\; w_0 \sum_{\{i:\, y_i = 0\}} \ln(1 - \pi_i) \qquad (2)$$
The weighted logistic regression in Eq. 2 is referred to as Class-Dependent Cost-Sensitive Logistic Regression [9]. Bahnsen et al. proposed a different version of Cost-Sensitive Logistic Regression, called Example-Dependent Cost-Sensitive Logistic Regression [1], where the objective cost function is defined in terms of a predefined misclassification cost for each example/observation and minimized during the training phase.
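As a concrete illustration of the class-dependent weighting in Eq. 2, the sketch below fits a logistic regression by gradient ascent on the weighted log-likelihood using plain NumPy. The synthetic data, the specific weights, and the learning-rate settings are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

def fit_weighted_logreg(X, y, w1=1.0, w0=1.0, lr=0.1, n_iter=2000):
    """Maximize the weighted log-likelihood of Eq. 2 by gradient ascent.

    w1 weights the event terms and w0 the non-event terms; in King and
    Zeng's scheme w1 = tau / ybar and w0 = (1 - tau) / (1 - ybar).
    Setting w1 = w0 = 1 recovers the plain log-likelihood of Eq. 1.
    """
    Xb = np.hstack([np.ones((len(X), 1)), X])         # add intercept column
    beta = np.zeros(Xb.shape[1])
    w = np.where(y == 1, w1, w0)                      # per-observation weight
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xb @ beta)))        # pi_i
        beta += lr * (Xb.T @ (w * (y - p))) / len(y)  # gradient of Eq. 2
    return beta

# Toy imbalanced sample (~5% events) with one informative predictor.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = (rng.random(1000) < 1.0 / (1.0 + np.exp(-(X[:, 0] - 3.0)))).astype(float)

beta_plain = fit_weighted_logreg(X, y)        # Eq. 1 behaviour
beta_up = fit_weighted_logreg(X, y, w1=10.0)  # up-weight the rare events
```

Up-weighting the rare events mainly shifts the intercept upwards, raising the predicted event probabilities across the board.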
The past research has rarely considered variable discretization as a technique for the imbalanced classification task, although it has been widely used as a generic technique for creating more powerful and interpretable discretized predictors from continuous ones. Dougherty et al. reviewed existing variable discretization methods, compared three of them (i.e. equal-width interval, entropy-based, and purity-based) in depth on 16 datasets, and found that the global entropy-based one performed the best on average [2]. For entropy-based discretization methods, the evaluation measures include class information entropy, Gini, dissimilarity, and the Hellinger measure [8].
$$IV = \sum_{i=1}^{k} (g_i - b_i) \ln\!\left(\frac{g_i}{b_i}\right) \qquad (3)$$
To select powerful discretized variables in the credit scoring problem, a common measurement is the information value [3]. The information value of a discretized variable with $k$ levels is defined as in Eq. 3, where $g_i$ is the number of non-events (i.e. non-delinquency) in the $i$-th level of the variable divided by the total number of non-events, $b_i$ is the number of events (i.e. delinquency) in the $i$-th level divided by the total number of events, and $\ln(g_i / b_i)$ is referred to as the weight of evidence. It is also pointed out that variables with an information value over a recommended threshold should be considered in the model.
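A minimal sketch of Eq. 3 in Python, assuming pandas is available; the small-count adjustment `eps` is our addition to guard against empty cells and is not part of the paper's definition.

```python
import numpy as np
import pandas as pd

def information_value(levels, target, eps=0.5):
    """Information value of a discretized variable, following Eq. 3:
    IV = sum_i (g_i - b_i) * ln(g_i / b_i), with g_i the share of
    non-events and b_i the share of events in level i. The small-count
    adjustment eps (our addition) guards against empty cells.
    """
    df = pd.DataFrame({"level": levels, "y": target})
    grp = df.groupby("level")["y"].agg(events="sum", total="count")
    grp["nonevents"] = grp["total"] - grp["events"]
    g = (grp["nonevents"] + eps) / (grp["nonevents"].sum() + eps * len(grp))
    b = (grp["events"] + eps) / (grp["events"].sum() + eps * len(grp))
    woe = np.log(g / b)                      # weight of evidence per level
    return float(((g - b) * woe).sum())

# A variable whose levels separate events from non-events scores a much
# higher IV than a purely random grouping.
rng = np.random.default_rng(1)
y = (rng.random(5000) < 0.06).astype(int)    # ~6% event rate
informative = np.where(y == 1, "A", rng.choice(["B", "C", "D"], size=5000))
random_lvls = rng.choice(["A", "B", "C"], size=5000)
iv_strong = information_value(informative, y)
iv_weak = information_value(random_lvls, y)
```

Each term of the sum is non-negative, since $(g_i - b_i)$ and $\ln(g_i/b_i)$ always share the same sign, so IV itself is never negative.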
3 Data
Biographical and financial information from clients is available in the data set from the Kaggle competition Give Me Some Credit [6]. The characteristics of the individuals in the data are represented by the variables shown in Table 3. The goal is to predict whether a client will experience financial distress in the next two years, indicated by the binary dependent variable. As shown in Table 3, the delinquency instances are heavily outnumbered by the non-delinquency instances; the proportion of delinquency instances is about 6%.
Some of the observations have missing values in the original variables provided. They are treated as follows.

When building the models with the original variables, those observations are dropped, to ensure data accuracy and to keep the model training computation manageable.

When building the models with the discretized variables, those observations are kept, by grouping the missing values into a separate level of each affected variable.
3.1 Exploratory Analysis
Because the dependent variable is binary and all independent variables are interval, the empirical logit plot is used to examine whether the relationship between the dependent variable and an independent variable is linear. If it is linear, we can use the interval form of that independent variable. If it is not, we need to discretize that independent variable to represent the nonlinearity. Moreover, through the empirical logit plots, we can check whether the univariate effects are positive or negative.
The empirical logit plot is created in the following steps.

For each interval variable, generate percentile ranks.

For each rank of each interval variable, calculate the total number of observations, the number of delinquency observations, and the mean of the interval variable.

For each rank of each interval variable, compute the empirical logit as $\ln\left(\frac{d_j + 0.5}{n_j - d_j + 0.5}\right)$, where $n_j$ is the total number of observations and $d_j$ is the number of delinquency observations in rank $j$; the 0.5 adjustment avoids taking the logarithm of zero.

For each interval variable, plot the empirical logit against the mean in each rank, together with their linear regression line. Each point in the plot summarizes the data points in one rank by their mean.

For each interval variable, plot the empirical logit against the rank, together with their linear regression line. Each point in the plot represents the data points in one rank by its rank index.
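The steps above can be sketched as follows; pandas, the 0.5-adjusted logit, and the synthetic stand-in variable are our assumptions, not the paper's code.

```python
import numpy as np
import pandas as pd

def empirical_logit_table(x, y, n_ranks=20):
    """Follow the steps above: bin x into percentile ranks, then per rank
    compute the count n_j, the delinquency count d_j, the mean of x, and
    the empirical logit ln((d_j + 0.5) / (n_j - d_j + 0.5)); the 0.5
    adjustment avoids taking the log of zero in ranks with no events.
    """
    df = pd.DataFrame({"x": x, "y": y})
    df["rank"] = pd.qcut(df["x"], q=n_ranks, labels=False, duplicates="drop")
    tab = df.groupby("rank").agg(n=("y", "size"), d=("y", "sum"),
                                 x_mean=("x", "mean"))
    tab["logit"] = np.log((tab["d"] + 0.5) / (tab["n"] - tab["d"] + 0.5))
    return tab  # plot tab["logit"] against tab["x_mean"] or the rank index

# Synthetic check: the event probability rises with x, so the empirical
# logit should increase from the lowest rank to the highest.
rng = np.random.default_rng(2)
x = rng.normal(size=4000)
y = (rng.random(4000) < 1.0 / (1.0 + np.exp(-(x - 2.5)))).astype(int)
tab = empirical_logit_table(x, y)
```

Plotting `tab["logit"]` against `tab["x_mean"]` corresponds to the first kind of plot described above, and against the rank index to the second.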
To show how the empirical logit plot works, take one variable as the example; its percentile rank information can be found in Table 3.1. Note that some of its ranks are merged together because they have the same cutting points (i.e. min, max). As shown in subfigure (a), there is a nonlinear relationship between the variable and its empirical logit, mainly caused by extreme values. Note that these extreme values in the empirical logit plot cannot simply be removed, considering that they represent several hundred data points in the data set instead of a few. However, the relationship between its rank and its empirical logit is quite linear in the positive direction, as shown in subfigure (b). In this case, its rank, the discretized form of its original continuous values, is preferred in the modeling.
3.2 Variable Discretization
Three variable discretization methods (i.e. distance, quantile, and Gini) are compared, and quantile discretization gives the best Area Under the Curve (AUC) after fitting a logistic regression model on the data set partitioned into training and validation data. So each variable is ranked and discretized into at most 20 bins based on quantiles, with this maximum number of bins selected by the same training/validation procedure.
Information value is used as the measurement of the discrimination power of each individual variable after discretization, as shown in Table 3.2. Note that for some variables the resulting number of bins is less than 20, because bins with the same cutting points are merged together. And for one variable there is one bin more than the maximum of 20, because its missing values are separated into one extra bin.
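The quantile discretization with a separate missing-value level described above can be sketched with pandas (an assumption on tooling); the one-dummy-per-bin encoding used later for the models is included as well, and the `-1` code for missing values is an illustrative convention.

```python
import numpy as np
import pandas as pd

def quantile_discretize(x, n_bins=20):
    """Quantile-discretize a variable into at most n_bins levels.

    Duplicate cutting points are merged (so some variables end up with
    fewer bins), and missing values get their own extra level, encoded
    here as -1.
    """
    binned = pd.qcut(x, q=n_bins, labels=False, duplicates="drop")
    return binned.fillna(-1).astype(int)

rng = np.random.default_rng(3)
x = pd.Series(rng.exponential(size=1000))
x.iloc[:50] = np.nan                             # inject some missing values

levels = quantile_discretize(x)
dummies = pd.get_dummies(levels, prefix="bin")   # one dummy variable per bin
```

On this example the variable gets the full 20 quantile bins plus one extra bin for the missing values, mirroring the behavior described above.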
4 Modeling
Logistic Regression and Class-Dependent Cost-Sensitive Logistic Regression are used as the methodology for their high interpretability. 10-fold cross-validation is used for the model evaluation. The performance measurements are the ROC curve and AUC. The mean of the AUCs over 10-fold cross-validation is used to measure the model bias, while the standard deviation of the AUCs is used to measure the model variance. These are reasonable measurements, considering that the model bias refers to the error introduced by approximating the true model and the model variance refers to how much the estimated model would change on a different training data set [5].
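The evaluation scheme above can be sketched with scikit-learn (an assumption on tooling) on a synthetic imbalanced stand-in for the credit data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced stand-in (~6% events) for the credit data set.
X, y = make_classification(n_samples=3000, n_features=8,
                           weights=[0.94, 0.06], random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                       cv=cv, scoring="roc_auc")

# Mean of the 10 AUCs tracks the model bias; their standard deviation
# tracks the model variance.
mean_auc, sd_auc = aucs.mean(), aucs.std()
```

Stratified folds keep the ~6% event rate stable across the 10 splits, which matters when events are rare.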
To evaluate and compare the performance of variable discretization and Class-Dependent Cost-Sensitive Logistic Regression, the following five models are built.

Model 1: Logistic Regression model on all original interval forms of the variables provided.

Model 2: Logistic Regression model on the original interval form of variables with an information value over the threshold.

Model 3: Class-Dependent Cost-Sensitive Logistic Regression model on the original interval form of variables with an information value over the threshold, using the best Class 1 Weight based on the mean of the AUCs of 10-fold cross-validation, as indicated by the gray line in Figure 2.

Model 4: Logistic Regression model on the discretized form of variables with an information value over the threshold, where the discretized variables are transformed by the one-hot encoder with each bin represented by one dummy variable.

Model 5: Class-Dependent Cost-Sensitive Logistic Regression model on the discretized form of variables with an information value over the threshold, using the best Class 1 Weight based on the mean AUC of 10-fold cross-validation, which can be any value in the flat range indicated by the blue line in Figure 2. The discretized variables are encoded into dummy variables in the same way as in Model 4.
For Model 3, to avoid inducing the population proportion of events, which is used in Eq. 2, we use a single hyperparameter $w \in (0, 1)$ to conduct the weighting, as shown in Eq. 4. $w$ is referred to as the Class 1 Weight, which penalizes the misclassification of events as non-events. Correspondingly, $1 - w$ is referred to as the Class 0 Weight, which penalizes the misclassification of non-events as events. The larger $w$ is, the more the misclassifications of events as non-events are penalized. As shown by the gray line in Figure 2, as the Class 1 Weight increases, which means more weight is put on the misclassifications of events as non-events, the mean of the AUCs on 10-fold cross-validation increases gradually and then decreases sharply as $w$ approaches 1. The best mean AUC occurs at an intermediate value of $w$ just before this drop.
$$\ln L_w(\beta) = w \sum_{\{i:\, y_i = 1\}} \ln \pi_i \;+\; (1 - w) \sum_{\{i:\, y_i = 0\}} \ln(1 - \pi_i) \qquad (4)$$
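The single-hyperparameter search over Eq. 4 can be sketched with scikit-learn's `class_weight` option (a tooling assumption); the grid of weights and the synthetic data are illustrative, not the paper's actual search.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced stand-in (~6% events) for the credit data set.
X, y = make_classification(n_samples=3000, n_features=8,
                           weights=[0.94, 0.06], random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Sweep the single hyperparameter w of Eq. 4: class 1 (events) gets weight
# w and class 0 gets 1 - w, so w = 0.5 reduces to the unweighted model.
results = {}
for w in [0.1, 0.3, 0.5, 0.7, 0.9]:
    clf = LogisticRegression(max_iter=1000, class_weight={0: 1 - w, 1: w})
    results[w] = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()
best_w = max(results, key=results.get)
```

The best `w` is then read off from the mean cross-validated AUC, mirroring the gray and blue lines in Figure 2.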
For Model 5, the same search for the best Class 1 Weight is conducted as for Model 3. The blue line in Figure 2 shows the performance of the Class-Dependent Cost-Sensitive Logistic Regression model on the discretized form of variables with an information value over the threshold under different Class 1 Weights. As shown, the Class 1 Weight has no influence over a wide range of values. If we take the Class 1 Weight as 0.5, Model 5 is the same as Model 4, since it penalizes the misclassifications of events as non-events and of non-events as events on the same scale. Moreover, compared to the performance of the Class-Dependent Cost-Sensitive Logistic Regression model on the original interval form of variables with an information value over the threshold, which is Model 3, the Class 1 Weight has much less influence on this model. It implies that good variable discretization is robust to penalty weights determined by the proportions of events and non-events.
The ROC curves of the five models can be found in Figure 3. Because Model 5 ends up the same as Model 4, we only compare Model 4 with the other models. Model 1 and Model 2 have similar AUCs, but the ROC curves of Model 2 are closer to each other, indicating that the variables with an information value below the threshold do not contribute much to the model. The ROC curves of Model 3 and Model 4 are much closer to the upper-left corner than those of Model 2. Moreover, for Model 4, the ROC curves on 10-fold cross-validation are closer to each other. This is further confirmed by the means and standard deviations of the AUCs on 10-fold cross-validation reported in Table 4.
The estimated coefficients of the models are also examined. The estimated parameters of Model 2 and Model 3 can be found in Table 4. Their values differ, as does the sign of one variable: it is negative in Model 2 and positive in Model 3. Its empirical logit plot in subfigure (c) shows a positive relationship, so the positive sign in Model 3 is consistent with its univariate effect. For the other variables, the signs of the estimated parameters are consistent with the univariate effects shown in their empirical logit plots in subfigures (a), (b), and (d). The estimated parameters of Model 4 are not presented here, because listing all the dummy variables is space-consuming. But one-hot encoded discretized variables give more interpretable estimates, considering they are binary indicators.
In short summary, from Model 1 to Model 2, selecting only the interval form of variables with an information value over the threshold reduces the model variance. From Model 2 to Model 3, penalizing the misclassifications of events and non-events on different scales via Class-Dependent Cost-Sensitive Logistic Regression reduces the model bias, and all multivariate effects become consistent with the univariate effects based on the signs of the estimated parameters. From Model 2 to Model 4, using the discretized form of variables with an information value over the threshold reduces both the model bias and the model variance. And Model 4 is slightly better than Model 3. From Model 4 to Model 5, running Class-Dependent Cost-Sensitive Logistic Regression on the discretized variables leaves the model performance unchanged.
5 Discussions and Conclusions
To improve the model performance, two efforts have been made, from the perspective of the predictors and of the modeling algorithm respectively. Based on the ROC curves and AUCs on 10-fold cross-validation, good variable discretization and Class-Dependent Cost-Sensitive Logistic Regression with the best class weight help mitigate the imbalance in the data and reduce the model bias and/or variance. We also observe that effective variable selection can help reduce the model variance. Moreover, Class-Dependent Cost-Sensitive Logistic Regression is beneficial for increasing the prediction power of predictors during the training phase, even if those predictors are not transformed into their best forms, and for keeping the multivariate and univariate effects of predictors consistent.
On the other hand, the model with well-discretized variables performs slightly better than Class-Dependent Cost-Sensitive Logistic Regression, provides more reasonable coefficient estimates, and is robust to the penalty scales for misclassifications of events and non-events determined by their proportions. This indicates that we should always discretize variables which show a nonlinear relationship against their empirical logits.
In this study, we provide a detailed examination of variable discretization and Class-Dependent Cost-Sensitive Logistic Regression on an imbalanced credit data set. In the future, we will consider more data sets, study concretely the relationship between the penalty scales and the proportions of events and non-events, and compare comprehensively with other classification algorithms such as neural networks and with sampling methods for imbalanced learning.
References
 [1] A.C. Bahnsen, D. Aouada, and B. Ottersten, Example-dependent cost-sensitive logistic regression for credit scoring, in Machine Learning and Applications (ICMLA), 2014 13th International Conference on, IEEE, 2014, pp. 263–269.
 [2] J. Dougherty, R. Kohavi, and M. Sahami, Supervised and unsupervised discretization of continuous features, in Machine Learning Proceedings 1995, Elsevier, 1995, pp. 194–202.
 [3] D.J. Hand and W.E. Henley, Statistical classification methods in consumer credit scoring: a review, Journal of the Royal Statistical Society: Series A (Statistics in Society) 160 (1997), pp. 523–541.
 [4] H. He and Y. Ma, Imbalanced learning: foundations, algorithms, and applications, John Wiley & Sons, 2013.
 [5] G. James, D. Witten, T. Hastie, and R. Tibshirani, An introduction to statistical learning, Vol. 112, Springer, 2013.
 [6] Kaggle, Give Me Some Credit. Available at https://www.kaggle.com/c/GiveMeSomeCredit/data, Accessed: 2018-02-01.
 [7] G. King and L. Zeng, Logistic regression in rare events data, Political Analysis 9 (2001), pp. 137–163.
 [8] S. Kotsiantis and D. Kanellopoulos, Discretization techniques: A recent survey, GESTS International Transactions on Computer Science and Engineering 32 (2006), pp. 47–58.
 [9] mlr-org, Cost-Sensitive Classification. Available at https://mlrorg.github.io/mlrtutorial/release/html/cost_sensitive_classif/index.html, Accessed: 2018-04-27.
 [10] L. Zhang, J. Priestley, and X. Ni, Influence of the event rate on discrimination abilities of bankruptcy prediction models, International Journal of Database Management Systems 10 (2018), pp. 1–14.