WOTBoost: Weighted Oversampling Technique in Boosting for imbalanced learning

Wenhao Zhang
University of California, Los Angeles
wenhaoz@ucla.edu
Ramin Ramezani
University of California, Los Angeles
raminr@ucla.edu
Arash Naeim
University of California, Los Angeles
ANaeim@mednet.ucla.edu
Abstract

Machine learning classifiers often stumble over imbalanced datasets where classes are not equally represented. This inherent bias towards the majority class may result in low accuracy in labeling the minority class. Imbalanced learning is prevalent in many real-world applications, such as medical research, network intrusion detection, and fraud detection in credit card transactions. A good number of research works have been reported that tackle this challenging problem. For example, the Synthetic Minority Over-sampling TEchnique (SMOTE) and the ADAptive SYNthetic sampling approach (ADASYN) use oversampling techniques to balance skewed datasets. In this paper, we propose a novel method that combines a Weighted Oversampling Technique and the ensemble Boosting method (WOTBoost) to improve the classification accuracy of minority data without sacrificing the accuracy of the majority class. WOTBoost adjusts its oversampling strategy at each round of boosting to synthesize more targeted minority data samples. The adjustment is enforced using a weighted distribution. We compare WOTBoost with four other classification models (i.e., decision tree, SMOTE + decision tree, ADASYN + decision tree, and SMOTEBoost) extensively on 18 publicly accessible imbalanced datasets. WOTBoost achieves the best G mean on 6 datasets and the highest AUC score on 7 datasets.

Keywords: Imbalanced learning, oversampling, ensemble learning, SMOTE

1 Introduction

Learning from imbalanced datasets can be very challenging as the classes are not equally represented [31]. There might not be enough examples for a learner to form a hypothesis that models the under-represented classes well. Hence, the classification results are often biased towards the majority classes. The curse of imbalanced learning is prevalent in real-world applications. In medical research, models are usually trained to give predictions on a dichotomous outcome based on a series of observable features [11]. For example, learning from a cancer dataset which mostly contains non-cancer data samples is perceived to be difficult. Other practical applications with more severely skewed datasets are fraudulent telephone calls [13], detection of oil spills in satellite images [23], detection of network intrusions [26], and information retrieval and filtering tasks [27]. In these scenarios, the imbalance ratio of the majority class to the minority class can go up to 100,000 [6]. Even though the class imbalance issue can exist in multi-class applications, we focus only on the binary class scenario in this paper, as it is feasible to reduce a multi-class classification problem to a series of binary classification problems [2].

To address the class imbalance issue, we propose a novel method which combines a Weighted Oversampling Technique and the ensemble Boosting method (WOTBoost) to improve the classification accuracy of the minority class without sacrificing the accuracy of the majority class. Essentially, the proposed method synthesizes data samples of the minority class to balance the dataset. In addition, WOTBoost identifies the minority data samples which are mostly enclosed by data samples from the other class. Empirically, it is deemed quite challenging to predict the true labels of these minority data samples. By placing enough synthesized data points within the proximity of these difficult minority data samples, the classification boundaries might be pushed away from them. In other words, these difficult minority samples become more likely to be predicted as the minority class. Therefore, WOTBoost creates more synthesized data points for these "difficult" minority data samples.

The contributions in this paper are as follows:

  • We identify the minority class data examples which are harder to learn at each round of boosting and generate more synthetic data for these examples.

  • We test our proposed algorithm extensively on 18 publicly accessible datasets and compare the results with the most commonly used algorithms. To our knowledge, this might be the first work to carry out such a comprehensive comparison study of ensemble methods combined with oversampling approaches.

  • We inspect the distributions of the 18 datasets and discuss why WOTBoost performs better on certain datasets.

The rest of the paper is organized as follows: Section 2 briefly reviews the literature on dealing with imbalanced datasets. Section 3 describes the WOTBoost algorithm in detail. Section 4 compares the experimental results of the WOTBoost algorithm with other baseline methods in terms of precision, recall, F measure, G mean, specificity, and AUC. Section 5 discusses the results and proposes directions for future work.

2 Background

There have been ongoing efforts in this research domain to find ways to better tackle the imbalanced learning problem. Most of the state-of-the-art research methodologies fall into two major categories: 1) data-level approaches, or 2) algorithm-level approaches [14, 1].

2.1 Data level approach

On the data level, skewed datasets can be balanced by either 1) oversampling the minority class data examples, or 2) undersampling the majority class data examples.

2.1.1 Oversampling

Oversampling aims to overcome class imbalance by artificially creating new data from the under-represented class. However, simply duplicating the minority class samples would potentially cause overfitting. One of the most widely used techniques is SMOTE. The SMOTE algorithm generates synthetic data examples for the minority class by randomly placing the newly created data instances between minority class data points and their neighbors [6]. This technique not only can better model the minority classes by introducing a bias towards the minority instances but also has a lower chance of overfitting, because SMOTE forces the learners to create larger and less specific decision regions. Based on SMOTE, Hui Han et al. propose Borderline-SMOTE, which only synthesizes minority samples on the decision borderline [18]. Borderline-SMOTE classifies minority samples into a "safe type" and a "dangerous type". The "safe type" is located in homogeneous regions where the majority of data examples belong to the same class. On the other hand, the "dangerous type" data points are outliers and most likely lie within the decision regions of the opposite class. The intention behind Borderline-SMOTE is to give more weight to the "dangerous type" minority samples as they are deemed more difficult to learn [30]. Haibo He et al. adopt the same philosophy and propose the ADASYN algorithm, which uses a weighted distribution over the minority class data. The weights are assigned to minority data examples based on their level of difficulty in learning. In other words, harder data examples have larger weights and thus a higher chance of getting more synthesized data. Prior to generating synthetic data, ADASYN inspects the K nearest neighbors of each minority class data example and counts the number of neighbors belonging to the majority class, Δ_i. Next, the difficulty of learning is calculated as the ratio Δ_i / K [19]. ADASYN assigns higher weights to the difficult minority samples. On the contrary, Safe-Level-SMOTE gives more priority to safer minority instances and has better accuracy performance than SMOTE and Borderline-SMOTE [5]. Karia et al. propose a genetic algorithm, GenSample, for oversampling in imbalanced datasets. GenSample accounts for the difficulty in learning minority examples when synthesizing, along with the performance improvement achieved by oversampling [22].
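To make the ADASYN-style weighting concrete, the following sketch computes a difficulty-based weight distribution over the minority class. It is an illustration rather than the original implementation: it assumes NumPy arrays and scikit-learn's NearestNeighbors, and the function name and defaults are our own.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_weights(X, y, minority_label=1, k=5):
    """Illustrative ADASYN-style difficulty weights: for each minority
    sample, count how many of its k nearest neighbors (excluding itself)
    belong to the majority class, then normalize the counts into a
    distribution used to allocate synthetic samples."""
    X_min = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: the query point is its own neighbor
    _, idx = nn.kneighbors(X_min)
    majority_counts = (y[idx[:, 1:]] != minority_label).sum(axis=1)
    r = majority_counts / k                           # difficulty ratio in [0, 1]
    if r.sum() == 0:                                  # every minority sample is "safe"
        return np.full(len(X_min), 1.0 / len(X_min))
    return r / r.sum()                                # weight distribution over minority samples
```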

2.1.2 Undersampling

This technique approaches imbalanced learning by removing a certain number of data examples from the majority class while keeping the original minority data points untouched. Random undersampling is the most common method in this category [1]. Elhassan et al. combine undersampling with Tomek links (T-Link) to create a balanced dataset [11, 37]. However, undersampling methods may suffer from severe information loss. In this paper, we mainly focus on the oversampling technique and its variants [38].

2.2 Algorithm level approach

On the algorithm level, there are typically three mainstream approaches: a) improved algorithms, b) cost-sensitive learning, and c) ensemble methods [1, 34].

2.2.1 Improved algorithms

This approach generally attempts to tailor classification algorithms to learn directly from the skewed dataset by shifting the decision boundary in favor of the minority class. Tasadduq Imam et al. propose z-SVM to counter the inherent bias in datasets by introducing a weight parameter, z, for the minority class to correct the decision boundary during model fitting [21]. Other modified SVM classifiers have also been reported, such as GSVM_RU and BSVM [36, 20]. One special form of an improved algorithm for imbalanced datasets is one-class learning. This method aims to generalize the hypothesis on a training dataset which only contains the target class [28, 8].

2.2.2 Cost-sensitive learning

This technique penalizes the misclassification of different classes with varying costs. Specifically, it assigns a higher cost to the misclassification of the target class. Hence, false negatives are penalized more than false positives [41, 29]. In cost-sensitive learning, a cost weight distribution is predefined in favor of the target classes.
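As a concrete illustration (not part of the original study), many libraries expose this idea through a class-weight parameter. The sketch below uses scikit-learn's DecisionTreeClassifier with a hypothetical 10:1 cost ratio for the minority class.

```python
from sklearn.tree import DecisionTreeClassifier

# Penalize errors on the minority (positive) class ten times more heavily
# than errors on the majority class; the 10:1 ratio is purely illustrative.
cost_sensitive_tree = DecisionTreeClassifier(
    class_weight={0: 1, 1: 10},   # 0 = majority class, 1 = minority class
    random_state=0,
)
# cost_sensitive_tree.fit(X_train, y_train)  # X_train, y_train are placeholders
```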

2.2.3 Ensemble method

An ensemble method trains a series of weak learners over a fixed number of iterations. A weak learner is a classifier whose accuracy is just barely above chance. At each round, a weak learner is created and a weak hypothesis is generalized. The predictive outcome is produced by aggregating all these weak hypotheses using a weighted voting method [9]. For example, the AdaBoost.M2 algorithm calculates the pseudo-loss of each weak hypothesis during boosting. The pseudo-loss is computed over all data examples with respect to the incorrect classifications, and the weight distribution is computed using the pseudo-loss (see Algorithm 1). The weight distribution is updated with respect to the pseudo-loss at the current iteration and is carried over to the next round of boosting. Hence, the learners in the next iteration concentrate on the data examples which are hard to learn [15]. Since the AdaBoost framework is well suited to learning from imbalanced datasets, several works are based on it [7, 32, 17]. SMOTEBoost combines the merits of SMOTE and boosting by adding a SMOTE procedure at the beginning of each round of boosting; it aims to improve the true positive rate without sacrificing the accuracy of the majority class. RUSBoost alleviates class imbalance by introducing random undersampling into the standard boosting procedure, and it is a faster and simpler alternative to SMOTEBoost [32]. Ashutosh Kumar et al. propose the RUSTBoost algorithm, which adds a redundancy-driven, modified Tomek-link-based undersampling procedure before RUSBoost [25]. Tomek-link pairs are pairs of closest data points from different classes. However, all the mentioned boosting algorithms treat the data examples equally. Krystyna Napierala et al. highlight that the various types of minority data examples (e.g., safe, borderline, rare, and outlier) have unequal influence on the outcome of classification. As such, algorithms should be designed to focus on the examples which are not easy to learn [30]. DataBoost-IM is reported to discriminate different types of data examples beforehand and adjust the weight distribution accordingly during boosting [17].

3 WOTBoost: Weighted Oversampling Technique in Boosting

In this section, we propose the WOTBoost algorithm, which combines a weighted oversampling algorithm with the standard boosting procedure. The weighted oversampling technique populates synthetic data based on the weights associated with each minority data sample. In other words, minority data samples with higher weights receive more synthetic samples. The algorithm is an ensemble method and creates a series of classifiers over a specified number of iterations. The boosting procedure is detailed in Algorithms 1 and 2: a) we introduce a weighted oversampling step at the beginning of each iteration of boosting; b) we adjust the weighted oversampling strategy using the updated weights (i.e., at line 8 in Algorithm 1) associated with the minority class during each round of boosting [31]. The boosting algorithm gives more weight to the data samples which were misclassified in the previous round. Hence, WOTBoost can be designed to generate more synthetic data examples for the minority data which were misclassified in previous iterations. Meanwhile, the boosting technique also adds more weight to misclassified majority class data and forces the learner to focus on these data as well. Therefore, we combine the merits of the weighted oversampling technique and AdaBoost.M2. The goal is to improve the discriminative power of the classifier on difficult minority examples without sacrificing the accuracy of the majority class data instances.

Input: Training dataset with m samples {(x_1, y_1), ..., (x_m, y_m)}, where x_i is an instance in the n-dimensional feature space X, and y_i ∈ Y is the label associated with x_i;
Let B = {(i, y) : i = 1, ..., m, y ≠ y_i};
T specifies the number of iterations in the boosting procedure;
Initialize: D_1(i, y) = 1 / |B| for all (i, y) ∈ B
1 for t = 1, 2, 3, ..., T do
2       Create N synthetic examples from the minority class with the weight distribution D_t using Algorithm 2;
3       Fit a weak learner using the temporary training dataset, which is a combination of the original data and the synthetic data;
4       Calculate a weak hypothesis h_t : X × Y → [0, 1];
5       Compute the pseudo-loss of h_t: ε_t = (1/2) Σ_{(i,y) ∈ B} D_t(i, y) (1 − h_t(x_i, y_i) + h_t(x_i, y));
6       Let β_t = ε_t / (1 − ε_t);
7       Update the weight distribution D_t: D_{t+1}(i, y) = D_t(i, y) · β_t^{(1/2)(1 + h_t(x_i, y_i) − h_t(x_i, y))};
8       Normalize D_{t+1}(i, y) = D_{t+1}(i, y) / Z_t, where Z_t is a normalization constant such that D_{t+1} is a distribution
9 end for
Output: H(x) = argmax_{y ∈ Y} Σ_{t=1}^{T} (log(1/β_t)) h_t(x, y)
Algorithm 1 Boosting with weighted oversampling
Input: N is the number of synthetic data examples to create from the minority class;
D_t is the weight distribution passed at line 2 in Algorithm 1
1 Calculate the number of synthetic data examples for each minority class instance: g_i = D̂_t(i) × N, where D̂_t(i) is the weight of minority instance x_i normalized over all minority class instances;
2 For each minority class instance, x_i, in the original training dataset, generate g_i synthetic data examples using the following rules: for j = 1, 2, 3, ..., g_i do
3       Randomly choose a minority class example, x_zi, from the K nearest neighbors of x_i, where x_zi is an n-dimensional feature vector;
4       Calculate the difference vector d_j = x_zi − x_i;
5       Create a synthetic data example using the following equation: s_j = x_i + λ · d_j,
where λ is a random number in [0, 1]
Output: A temporary training dataset combining the original data with the synthetic data
Algorithm 2 Dynamic weighted oversampling procedure
Figure 1: Overview of the comparison study

Algorithm 1 presents the details of the boosting procedure, which is a modified version of AdaBoost.M2 [15]. It takes a training dataset with m data samples, {(x_1, y_1), ..., (x_m, y_m)}, where x_i is the i-th feature vector in the n-dimensional feature space, y_i is the true label associated with x_i, and ŷ_i is the predicted label. We initialize a mislabel distribution over the set B, which contains all the misclassified instance-label pairs (i.e., pairs (i, y) with y ≠ y_i). In addition, we also initialize a weight distribution, D_1, for the training data by assigning equal weights over all samples. During each round of boosting (step 1 to step 9), a weak learner is built on a training dataset which is the output of the weighted oversampling procedure. The weak learner formulates a weak hypothesis which is just slightly better than random guessing, hence the name [15]. But this is good enough, as the final output aggregates all the weak hypotheses using weighted voting. As for error estimation, the pseudo-loss of a weak hypothesis is calculated as specified at step 5. Instead of the ordinary training loss, the pseudo-loss is adopted to force the ensemble method to focus on mislabeled data. More justification for using the pseudo-loss can be found in [15, 16]. Once the pseudo-loss is computed, the weight distribution, D_t, is updated accordingly and normalized at steps 6 to 8.
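As an illustrative sketch (not the authors' code), the binary-case pseudo-loss and weight update of Algorithm 1 can be written as follows, assuming labels coded as 0/1, a weight vector D over the single mislabel pair of each sample, and a weak learner that outputs class probabilities (e.g., via predict_proba):

```python
import numpy as np

def pseudo_loss_and_update(D, prob, y):
    """Binary-case AdaBoost.M2 step: D[i] is the weight of the single
    mislabel pair (i, 1 - y_i), prob[i, c] is the weak hypothesis
    h_t(x_i, c), and y holds labels coded as 0/1."""
    n = len(y)
    h_true = prob[np.arange(n), y]                         # h_t(x_i, y_i)
    h_wrong = prob[np.arange(n), 1 - y]                    # h_t(x_i, y) for the incorrect label
    eps = 0.5 * np.sum(D * (1.0 - h_true + h_wrong))       # pseudo-loss (step 5)
    beta = max(eps, 1e-10) / max(1.0 - eps, 1e-10)         # beta_t (step 6)
    D_next = D * beta ** (0.5 * (1.0 + h_true - h_wrong))  # weight update (step 7)
    return eps, beta, D_next / D_next.sum()                # normalize (step 8)
```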

Algorithm 2 demonstrates the weighted oversampling procedure. The inputs to the oversampling technique are the weight distribution, D_t, and the number of synthetic data samples to create, N. It uses the weight distribution as the oversampling strategy to decide how many samples to synthesize for each minority data sample, as described at step 1 in Algorithm 2. As mentioned previously, the ensemble method assigns more weight to misclassified data. Therefore, this oversampling strategy helps the classifier learn a broader representation of mislabeled data by placing more similar data samples around them.
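A minimal sketch of this weighted oversampling step is given below; it assumes NumPy arrays, scikit-learn's NearestNeighbors, and per-minority-sample weights already derived from D_t. The function name, rounding rule, and defaults are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def weighted_oversample(X_min, weights, N, k=5, seed=0):
    """Illustrative Algorithm 2: allocate N synthetic samples across the
    minority class in proportion to `weights`, then interpolate each new
    sample between a minority point and one of its k nearest neighbors."""
    rng = np.random.default_rng(seed)
    counts = np.rint(weights / weights.sum() * N).astype(int)        # g_i per minority sample
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(X_min))).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for i, g_i in enumerate(counts):
        for _ in range(g_i):
            j = rng.choice(idx[i, 1:])            # random neighbor of x_i (excluding itself)
            lam = rng.random()                    # lambda drawn uniformly from [0, 1]
            synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    if not synthetic:
        return np.empty((0, X_min.shape[1]))
    return np.vstack(synthetic)
```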

4 Experimentation

In this section, we conduct a comprehensive comparison study of the WOTBoost algorithm with a decision tree, SMOTE + decision tree, ADASYN + decision tree, and SMOTEBoost. Figure 1 shows how the models are built and assessed.

Dataset Instances Attributes Outcome Frequency Imbalanced Ratio No. of safe minority No. of unsafe minority unsafe minority%
Pima Indian Diabetes [4] 768 9 Maj: 500 Min: 268 1.9 86 182 67.9%
Abalone [10] 4177 8 Maj: 689 Min: 42 16.4 5 37 88.1%
Vowel Recognition [10] 990 14 Maj:900 Min:90 10.0 89 1 1.1%
Mammography [12] 11183 7 Maj: 10923 Min: 260 42 107 153 58.8%
Ionosphere [10] 351 35 Maj: 225 Min: 126 1.8 57 69 54.8%
Vehicle [10] 846 19 Maj: 647 Min:199 3.3 154 45 22.6%
Phoneme [39] 5404 6 Maj: 3818 Min: 1580 2.4 980 606 38.2%
Haberman [10] 306 4 Maj: 225 Min:81 2.8 8 73 90.1%
Wisconsin [10] 569 31 Maj: 357 Min: 212 1.7 175 37 17.5%
Blood Transfusion [40] 748 5 Maj: 570 Min: 178 3.2 23 83 87.1%
PC1 [33] 1109 9 Maj: 1032 Min: 77 13.4 8 69 89.6%
Heart [10] 294 14 Maj: 188 Min: 106 1.8 17 89 84.0%
Segment [10] 2310 20 Maj: 1980 Min: 330 6.0 246 84 25.5%
Yeast [10] 1484 9 Maj: 1240 Min: 244 5.1 95 149 61.1%
Oil 937 50 Maj: 896 Min: 41 21.9 0 41 100.0%
Adult [10] 48842 7 Maj: 37155 Min: 11687 3.2 873 10814 92.5%
Satimage [10] 6430 37 Maj: 5805 Min: 625 9.3 328 297 47.5%
Forest cover [3] 581012 11 Maj: 35754 Min: 2747 13.0 2079 668 24.3%
Table 1: Characteristics of 18 testing datasets

4.1 Dataset overview

We evaluate these 5 models extensively using 18 publicly accessible imbalanced datasets. The imbalance ratios (i.e., the count of majority class samples to the count of minority class samples) of these datasets vary from 1.7 to 42. Since some of the testing datasets have more than 2 classes, and we are only interested in the binary class problem in this paper, we pre-process these datasets into binary class datasets following the rules in the literature [6, 18, 5, 19, 7, 25]. Only numeric attributes are included when processing the datasets. The details of the data cleaning can be found in prior works [6, 7, 19]. The characteristics of these datasets are summarized in Table 1.

4.2 Experiment setup

We compare the WOTBoost algorithm with a naive decision tree classifier, a decision tree classifier after SMOTE, a decision tree classifier after ADASYN, and SMOTEBoost. Figure 1 shows that the cleaned datasets are split evenly into training and testing sets during each iteration [19]. As a control group, a naive decision tree model is learned directly from the imbalanced training dataset. The SMOTE and ADASYN algorithms are used separately to balance the training dataset before it is input to decision tree classifiers. SMOTEBoost and WOTBoost take in the imbalanced training dataset and synthesize new data samples for the minority class at each round of boosting; both use a decision tree as the weak learner [7]. Models are evaluated on a separate testing dataset. The evaluation metrics used in this study are precision, recall, F1 measure, G mean, specificity, and area under the ROC curve. The final performance assessments are averaged over 100 such runs and summarized in Table 3. During each testing run, we oversample the training dataset so that the minority class and majority class are equally represented in all models [19]. For SMOTE, ADASYN, SMOTEBoost, and WOTBoost, we set the number of nearest neighbors to 5.
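To illustrate this setup, here is a minimal sketch of one evaluation run for the non-boosting baselines, assuming the SMOTE and ADASYN implementations from the imbalanced-learn package (the paper does not specify which implementation was used) and a stratified 50/50 split; the function name and defaults are illustrative.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE, ADASYN   # assumed implementations

def one_run(X, y, seed):
    """One evaluation run: split 50/50, oversample only the training
    half, fit a decision tree, and score AUC on the test half."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    samplers = {"DT": None,
                "SMOTE+DT": SMOTE(k_neighbors=5, random_state=seed),
                "ADASYN+DT": ADASYN(n_neighbors=5, random_state=seed)}
    scores = {}
    for name, sampler in samplers.items():
        X_fit, y_fit = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
        clf = DecisionTreeClassifier(random_state=seed).fit(X_fit, y_fit)
        scores[name] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    return scores

# aggregate over 100 runs, e.g.: results = [one_run(X, y, s) for s in range(100)]
```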

4.3 Metrics

Overall accuracy is typically chosen to evaluate the predictive power of machine learning classifiers on a balanced dataset. For imbalanced datasets, however, overall accuracy is no longer an effective metric. For example, in the information retrieval and filtering domain studied by Lewis and Catlett (1994), only 0.2% of cases are interesting [23]. A dummy classifier that always predicts the majority class would easily achieve an overall accuracy of 99.8%. However, such a predictive model is uninformative, as we are more interested in classifying the minority class. Common alternatives to overall accuracy for assessing imbalanced learning models are the F measure, G mean, and Area Under the Curve (AUC) for the Receiver Operating Characteristic (ROC) [35]. By convention, the majority class is regarded as the negative class and the minority class as the positive class [6, 24]. Table 2 shows the confusion matrix that is typically used to visualize and assess the performance of predictive models. Based on this confusion matrix, the evaluation metrics used in this paper are mathematically formulated as follows:

                      Actual Positive        Actual Negative
Predicted Positive    True Positive (TP)     False Positive (FP)
Predicted Negative    False Negative (FN)    True Negative (TN)
Table 2: Confusion matrix of a binary classification problem
Precision = TP / (TP + FP)    (1)
Recall (Sensitivity) = TP / (TP + FN)    (2)
F_measure = 2 × Precision × Recall / (Precision + Recall)    (3)
G_mean = sqrt(Recall × Specificity)    (4)
Specificity = TN / (TN + FP)    (5)
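For reference, the sketch below computes these confusion-matrix based metrics with scikit-learn, treating the minority class as the positive label (1); the helper name is illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def imbalance_metrics(y_true, y_pred):
    """Compute the confusion-matrix based metrics of Eqs. (1)-(5),
    with the minority class coded as the positive label 1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    precision   = tp / (tp + fp) if tp + fp else 0.0
    recall      = tp / (tp + fn) if tp + fn else 0.0          # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f_measure   = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    g_mean      = np.sqrt(recall * specificity)
    return dict(precision=precision, recall=recall, specificity=specificity,
                f_measure=f_measure, g_mean=g_mean)
```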

5 Results

Dataset Methods OA Precision Recall F_measure G_mean Specificity Sensitivity ROC AUC Outcome Frequency Imbalanced ratio
Pima Indian Diabetes DT 0.71 0.02 0.61 0.04 0.54 0.05 0.57 0.03 0.66 0.02 0.80 0.03 0.54 0.05 0.670.02 Maj: 500 Min:268 1.9
S 0.670.02 0.550.03 0.540.04 0.540.02 0.630.02 0.750.03 0.540.04 0.640.02
A 0.68 0.02 0.560.04 0.580.05 0.570.03 0.660.03 0.740.03 0.580.05 0.660.02
SM 0.660.02 0.520.02 0.860.04 0.640.02 0.680.02 0.540.05 0.860.04 0.700.01
WOT 0.730.02 0.600.03 0.780.05 0.680.02 0.740.02 0.710.03 0.780.05 0.740.02
Abalone DT 0.93 0.01 0.460.12 0.460.10 0.460.08 0.660.08 0.960.01 0.460.10 0.710.04 Maj:689 Min:42 16.4
S 0.880.02 0.240.07 0.380.11 0.290.07 0.590.08 0.920.02 0.380.11 0.650.05
A 0.880.02 0.240.06 0.420.11 0.310.07 0.620.09 0.910.02 0.420.11 0.660.05
SM 0.840.06 0.190.04 0.460.12 0.270.05 0.630.05 0.870.07 0.460.12 0.660.05
WOT 0.940.01 0.550.33 0.340.11 0.420.13 0.58 0.18 0.980.01 0.34 0.11 0.66 0.05
Vowel Recognition DT 0.970.00 0.900.06 0.790.06 0.840.04 0.880.03 0.990.00 0.790.06 0.890.03 Maj:900 Min:90 10.0
S 0.960.00 0.850.06 0.740.06 0.800.04 0.860.03 0.990.00 0.740.06 0.870.03
A 0.970.00 0.880.05 0.790.07 0.830.04 0.880.03 0.990.00 0.790.07 0.890.03
SM 0.980.00 0.830.05 0.960.04 0.890.03 0.970.02 0.980.00 0.960.04 0.970.02
WOT 0.980.01 0.870.10 0.980.01 0.930.07 0.980.02 0.990.01 0.980.01 0.980.01
Ionosphere DT 0.860.02 0.830.06 0.730.06 0.770.04 0.820.03 0.920.04 0.730.06 0.830.03 Maj: 225 Min 126 1.8
S 0.850.03 0.750.05 0.810.06 0.780.04 0.840.03 0.860.03 0.810.06 0.840.03
A 0.880.03 0.840.05 0.800.06 0.820.04 0.860.03 0.920.03 0.800.06 0.860.03
SM 0.910.02 0.890.06 0.850.04 0.870.03 0.900.02 0.950.04 0.850.04 0.900.02
WOT 0.910.02 0.920.05 0.790.04 0.850.03 0.870.02 0.970.02 0.790.04 0.880.02
Vehicle DT 0.940.01 0.850.04 0.880.04 0.870.03 0.920.02 0.950.01 0.880.04 0.920.02 Maj: 647 Min:199 3.3
S 0.900.01 0.750.04 0.880.05 0.810.03 0.890.02 0.910.01 0.880.05 0.890.02
A 0.920.01 0.810.04 0.870.04 0.840.02 0.900.02 0.930.01 0.870.04 0.900.02
SM 0.950.00 0.840.03 0.970.02 0.900.02 0.960.01 0.940.01 0.970.02 0.960.01
WOT 0.890.10 0.700.15 0.970.03 0.810.11 0.920.07 0.870.14 0.970.03 0.920.06
Phoneme DT 0.860.00 0.750.01 0.740.01 0.750.01 0.820.00 0.900.00 0.740.01 0.820.00 Maj: 3818 Min:1580 2.4
S 0.860.00 0.740.01 0.780.01 0.760.01 0.830.01 0.890.00 0.780.01 0.830.00
A 0.830.00 0.680.01 0.780.01 0.730.01 0.820.00 0.850.00 0.780.01 0.820.00
SM 0.770.00 0.570.01 0.860.01 0.690.01 0.800.00 0.740.01 0.860.01 0.800.00
WOT 0.520.06 0.380.03 0.990.01 0.540.03 0.570.07 0.340.09 0.990.01 0.660.04
Haberman DT 0.670.03 0.380.06 0.250.08 0.300.05 0.460.05 0.830.05 0.250.08 0.540.03 Maj: 225 Min:81 2.8
S 0.650.03 0.400.05 0.390.08 0.390.05 0.640.04 0.760.05 0.390.08 0.570.03
A 0.600.03 0.370.05 0.520.08 0.430.05 0.580.05 0.630.05 0.520.08 0.580.04
SM 0.480.06 0.340.03 0.840.07 0.480.03 0.530.10 0.340.11 0.840.07 0.590.02
WOT 0.540.05 0.350.07 0.700.12 0.470.05 0.570.05 0.480.11 0.700.12 0.590.03
Wisconsin DT 0.950.01 0.930.03 0.930.01 0.930.01 0.950.01 0.960.02 0.930.01 0.950.01 Maj: 357 Min: 212 1.7
S 0.920.01 0.890.03 0.90 0.03 0.890.02 0.910.01 0.930.02 0.900.03 0.910.01
A 0.950.01 0.930.03 0.940.03 0.940.02 0.950.01 0.960.02 0.940.03 0.950.01
SM 0.980.01 0.990.00 0.950.01 0.970.01 0.970.01 0.990.00 0.950.01 0.970.01
WOT 0.970.01 0.970.03 0.950.02 0.960.02 0.960.01 0.980.02 0.950.02 0.960.01
Blood Transfusion DT 0.720.01 0.390.06 0.280.08 0.320.06 0.490.07 0.860.01 0.280.08 0.570.04 Maj: 570 Min: 178 3.2
S 0.710.01 0.390.05 0.390.07 0.390.05 0.560.05 0.810.01 0.390.07 0.600.03
A 0.700.01 0.380.05 0.420.08 0.400.06 0.570.07 0.780.01 0.420.08 0.600.04
SM 0.440.03 0.290.03 0.930.10 0.450.03 0.520.04 0.290.04 0.930.10 0.610.04
WOT 0.680.03 0.380.16 0.520.12 0.44 0.09 0.610.14 0.730.03 0.520.12 0.620.05
PC1 DT 0.900.01 0.250.05 0.270.05 0.260.04 0.500.04 0.940.03 0.270.05 0.610.02 Maj: 1032 Min: 77 13.4
S 0.870.02 0.220.04 0.380.05 0.270.04 0.580.03 0.900.03 0.380.05 0.640.02
A 0.870.02 0.260.04 0.510.06 0.350.03 0.680.03 0.900.04 0.510.06 0.710.02
SM 0.820.05 0.160.02 0.410.04 0.230.03 0.590.07 0.850.08 0.410.04 0.630.02
WOT 0.910.03 0.340.03 0.300.07 0.320.03 0.530.02 0.960.03 0.300.07 0.630.02
Heart DT 0.770.03 0.680.06 0.630.08 0.650.05 0.730.04 0.840.05 0.630.08 0.740.03 Maj: 188 Min: 106 1.6
S 0.760.03 0.670.06 0.570.07 0.620.04 0.700.04 0.850.05 0.570.07 0.710.03
A 0.790.03 0.680.05 0.750.06 0.710.04 0.780.03 0.810.04 0.750.06 0.780.03
SM 0.700.03 0.550.05 0.750.06 0.630.03 0.710.03 0.680.06 0.750.06 0.710.03
WOT 0.740.03 0.600.06 0.740.06 0.660.04 0.740.03 0.750.06 0.740.06 0.740.03
Segment DT 0.960.00 0.880.04 0.880.03 0.880.02 0.930.02 0.980.00 0.880.03 0.930.01 Maj: 1980 Min: 330 6.0
S 0.960.00 0.870.03 0.850.03 0.860.02 0.910.01 0.980.00 0.850.03 0.910.01
A 0.960.00 0.880.03 0.870.03 0.870.02 0.920.01 0.980.00 0.870.03 0.920.01
SM 0.950.00 0.800.03 0.870.02 0.830.02 0.920.01 0.960.00 0.870.02 0.920.01
WOT 0.720.09 0.340.08 0.970.07 0.510.08 0.810.06 0.680.11 0.970.07 0.820.05
Yeast DT 0.830.01 0.460.03 0.590.05 0.510.03 0.720.03 0.870.01 0.590.05 0.730.02 Maj: 1240 Min: 244 5.1
S 0.810.01 0.410.04 0.600.04 0.490.03 0.710.02 0.850.02 0.600.04 0.720.02
A 0.820.01 0.430.03 0.660.05 0.520.02 0.750.03 0.840.01 0.660.05 0.750.02
SM 0.700.02 0.320.02 0.820.03 0.460.02 0.750.01 0.680.03 0.820.03 0.750.01
WOT 0.840.02 0.500.05 0.730.05 0.590.03 0.790.02 0.870.03 0.730.05 0.800.02
Oil DT 0.930.01 0.350.11 0.480.13 0.410.11 0.680.12 0.960.01 0.480.13 0.720.06 Maj: 896 Min: 41 21.9
S 0.910.01 0.260.07 0.480.11 0.330.07 0.670.08 0.930.01 0.480.11 0.700.05
A 0.890.01 0.220.09 0.520.11 0.310.08 0.690.09 0.910.01 0.520.11 0.710.05
SM 0.940.01 0.410.07 0.520.10 0.460.06 0.710.06 0.960.02 0.520.10 0.740.05
WOT 0.950.02 0.470.13 0.280.16 0.350.09 0.520.12 0.980.02 0.280.16 0.630.07
Adult DT 0.750.00 0.480.00 0.470.00 0.480.00 0.630.00 0.840.00 0.470.00 0.660.00 Maj: 37155 Min: 11687 3.2
S 0.700.00 0.410.00 0.560.00 0.480.00 0.650.00 0.750.00 0.560.00 0.650.00
A 0.710.00 0.420.00 0.570.00 0.480.00 0.650.00 0.750.00 0.570.00 0.660.00
SM 0.810.00 0.620.01 0.550.01 0.580.00 0.700.01 0.900.00 0.550.01 0.720.00
WOT 0.750.02 0.480.03 0.670.05 0.560.01 0.720.01 0.770.05 0.670.05 0.720.01
Satimage DT 0.91 0.00 0.530.02 0.510.03 0.520.02 0.700.02 0.950.00 0.510.03 0.730.01 Maj: 5805 Min: 625 9.3
S 0.900.00 0.510.02 0.630.02 0.560.01 0.770.01 0.930.00 0.630.02 0.780.01
A 0.890.00 0.450.03 0.600.03 0.520.02 0.750.01 0.920.00 0.600.03 0.760.01
SM 0.900.00 0.490.02 0.720.02 0.580.01 0.810.00 0.920.02 0.720.01 0.820.01
WOT 0.880.01 0.420.03 0.750.03 0.540.02 0.820.01 0.890.01 0.750.03 0.820.01
Forest cover DT 0.970.00 0.810.01 0.820.01 0.820.01 0.900.00 0.990.00 0.820.01 0.900.00 Maj: 35754 Min: 2747 13.0
S 0.970.00 0.780.01 0.850.01 0.810.00 0.910.00 0.980.00 0.850.01 0.92 0.00
A 0.970.00 0.790.01 0.860.01 0.820.01 0.920.00 0.980.00 0.860.01 0.920.00
SM 0.960.00 0.730.01 0.720.01 0.720.00 0.840.00 0.980.00 0.720.01 0.850.00
WOT 0.910.02 0.430.05 0.880.02 0.580.05 0.900.01 0.910.02 0.880.02 0.900.01

DT=Decision Tree, S=SMOTE, A=ADASYN, SM=SMOTEBoost, WOT=WOTBoost, OA=Overall Accuracy
Values are reported as mean ± standard deviation over 100 runs, rounded to 2 decimal places

Table 3: Evaluation metrics and performance comparison

We highlight the best model and its performance in boldface for each dataset in Table 3. Figure 2 presents the performance comparison of these 5 models on G mean and AUC score across the 18 datasets. To assess the effectiveness of the proposed algorithm on these imbalanced datasets, we count the cases in which the WOTBoost algorithm outperforms or matches the other models on each metric. The results presented in Table 4 show that the WOTBoost algorithm has the most wins on G mean (6 times) and AUC (7 times). As defined in Equation 4 in the metrics section, G mean is the square root of the product of the positive accuracy (i.e., recall or sensitivity) and the negative accuracy (i.e., specificity). Meanwhile, the area under the ROC curve, or AUC, is typically used for model selection, and it examines the true positive rate and false positive rate at various thresholds. Hence, both evaluation metrics consider the accuracy of both classes. Therefore, we argue that WOTBoost indeed improves learning on the minority class while maintaining the accuracy of the majority class.

Figure 2: Performance comparison of G mean and AUC score on 18 datasets
Winning counts Precision Recall F_measure G_mean Specificity AUC
Decision Tree 9 0 2 2 11 2
SMOTE 2 1 1 3 2 3
ADASYN 2 3 3 3 1 3
SMOTEBoost 3 10 8 4 2 6
WOTBoost 6 8 4 6 7 7
Table 4: Summary of effectiveness of WOTBoost algorithm on 18 datasets

In Table 3, we observe that WOTBoost has the best G mean and AUC score on Pima Indian Diabetes, whereas SMOTEBoost is the winner on Ionosphere under the same assessments. Considering that these two datasets have similar global imbalance ratios, this naturally raises the question: are there any other factors that influence the classification performance? To understand why WOTBoost performs better on certain datasets, we investigate the local characteristics of the minority class in these datasets. We use t-SNE to visualize the distributions of these two datasets, as shown in Figure 3. The t-SNE algorithm allows us to visualize high-dimensional datasets by projecting them onto a two-dimensional plane. Figure 3 indicates that there is more overlap between the two classes in Pima Indian Diabetes, whereas there are more "safe" minority class samples in Ionosphere. It is likely that WOTBoost is able to learn better when there are more difficult minority data examples. Figure 4 demonstrates the distribution of Pima Indian Diabetes before and after applying WOTBoost. We highlight one of the regions where minority data samples are difficult to learn. The WOTBoost algorithm is able to populate synthetic data for these minority data samples.
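A minimal sketch of this kind of visualization is shown below, assuming NumPy arrays, scikit-learn's TSNE with default perplexity, and matplotlib; the exact plotting settings used for Figure 3 are not specified in the paper.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(X, y, title):
    """Project a dataset to 2-D with t-SNE and color points by class,
    mirroring the visual inspection shown in Figure 3."""
    emb = TSNE(n_components=2, random_state=0).fit_transform(X)
    plt.scatter(emb[y == 0, 0], emb[y == 0, 1], s=8, label="majority")
    plt.scatter(emb[y == 1, 0], emb[y == 1, 1], s=8, label="minority")
    plt.title(title)
    plt.legend()
    plt.show()

# e.g. plot_tsne(X_pima, y_pima, "Pima Indian Diabetes")  # placeholder variable names
```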

Figure 3: (a) Distribution of Pima Indian Diabetes dataset. (b) Distribution of Ionosphere dataset
Figure 4: Pima Indian Diabetes distribution before and after applying WOTBoost

Table 1 shows the number of safe/unsafe minority samples for the 18 datasets. We consider a minority class sample to be safe if its 5 nearest neighbors contain at most 1 majority class sample; otherwise, it is labeled as an unsafe minority [18, 30]. The unsafe minority percentage is computed as

unsafe minority % = (number of unsafe minority samples / total number of minority samples) × 100%
We observe that the unsafe minority percentages are around 50% or higher in most of the datasets on which WOTBoost has the best G mean or AUC in Table 3. For example, Adult, Haberman, Blood Transfusion, Pima Indian Diabetes, and Satimage have 92.5%, 90.1%, 87.1%, 67.9%, and 47.5% unsafe minority samples among their total minority class samples, respectively. Meanwhile, the global imbalance ratios of these datasets range from 1.9 to 10.0. Hence, WOTBoost might be a good candidate for tackling imbalanced datasets with a large proportion of unsafe minority samples and relatively high between-class imbalance ratios.
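The safe/unsafe labeling in Table 1 can be reproduced with a nearest-neighbor query; the sketch below is an illustrative implementation of that rule (function name and array conventions are assumptions).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def unsafe_minority_percentage(X, y, minority_label=1, k=5):
    """A minority sample is 'safe' if at most 1 of its k nearest
    neighbors is from the majority class; otherwise it is 'unsafe'."""
    X_min = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)    # +1: the query point is its own neighbor
    _, idx = nn.kneighbors(X_min)
    majority_counts = (y[idx[:, 1:]] != minority_label).sum(axis=1)
    return 100.0 * np.mean(majority_counts > 1)        # percentage of unsafe minority samples
```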

6 Conclusion

In this paper, we propose the WOTBoost algorithm to better learn from imbalanced datasets. The goal is to improve the classification performance on the minority class without sacrificing the accuracy of the majority class. We carry out a comprehensive comparison between the WOTBoost algorithm and 4 other classification models. The results indicate that WOTBoost has the best G mean on 6 and the best AUC score on 7 of the 18 datasets. WOTBoost also shows more balanced performance across classes, for example in G mean, than the other classification models, particularly SMOTEBoost. Even though WOTBoost is not a cure-all for the imbalanced learning problem, it is likely to produce promising results on datasets that contain a large portion of unsafe minority samples and relatively high global imbalance ratios. We hope that our contribution to this research domain provides useful insights and directions.

In addition, our study demonstrates that having prior knowledge of the minority class distribution can facilitate the learning performance of classifiers [31, 17, 19, 30, 18, 5]. Further investigation of data-driven sampling may produce interesting findings in this domain.

References

  • [1] A. Ali, S. M. Shamsuddin, and A. L. Ralescu (2015) Classification with class imbalance problem: a review. Int. J. Advance Soft Compu. Appl 7 (3), pp. 176–204. Cited by: §2.1.2, §2.2, §2.
  • [2] E. L. Allwein, R. E. Schapire, and Y. Singer (2000) Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of machine learning research 1 (Dec), pp. 113–141. Cited by: §1.
  • [3] J. A. Blackard and D. J. Dean (1999) Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and electronics in agriculture 24 (3), pp. 131–151. Cited by: Table 1.
  • [4] C. L. Blake and C. J. Merz (1998) UCI repository of machine learning databases, 1998. Cited by: Table 1.
  • [5] C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap (2009) Safe-level-smote: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Pacific-Asia conference on knowledge discovery and data mining, pp. 475–482. Cited by: §2.1.1, §4.1, §6.
  • [6] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002) SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research 16, pp. 321–357. Cited by: §1, §2.1.1, §4.1, §4.3.
  • [7] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer (2003) SMOTEBoost: improving prediction of the minority class in boosting. In European conference on principles of data mining and knowledge discovery, pp. 107–119. Cited by: §2.2.3, §4.1, §4.2.
  • [8] D. Devi, S. K. Biswas, and B. Purkayastha (2019) Learning in presence of class imbalance and class overlapping by using one-class svm and undersampling technique. Connection Science, pp. 1–38. Cited by: §2.2.1.
  • [9] T. G. Dietterich (2000) Ensemble methods in machine learning. In International workshop on multiple classifier systems, pp. 1–15. Cited by: §2.2.3.
  • [10] D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: Table 1.
  • [11] T. Elhassan and M. Aljurf (2016) Classification of imbalance data using tomek link (t-link) combined with random under-sampling (rus) as a data reduction method. Cited by: §1, §2.1.2.
  • [12] M. Elter, R. Schulz-Wendtland, and T. Wittenberg (2007) The prediction of breast cancer biopsy outcomes using two cad approaches that both emphasize an intelligible decision process. Medical physics 34 (11), pp. 4164–4172. Cited by: Table 1.
  • [13] T. Fawcett and F. J. Provost (1996) Combining data mining and machine learning for effective user profiling.. In KDD, pp. 8–13. Cited by: §1.
  • [14] A. Fernández, S. Garcia, F. Herrera, and N. V. Chawla (2018) SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of artificial intelligence research 61, pp. 863–905. Cited by: §2.
  • [15] Y. Freund, R. E. Schapire, et al. (1996) Experiments with a new boosting algorithm. In icml, Vol. 96, pp. 148–156. Cited by: §2.2.3, §3.
  • [16] Y. Freund and R. E. Schapire (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences 55 (1), pp. 119–139. Cited by: §3.
  • [17] H. Guo and H. L. Viktor (2004) Learning from imbalanced data sets with boosting and data generation: the databoost-im approach. ACM Sigkdd Explorations Newsletter 6 (1), pp. 30–39. Cited by: §2.2.3, §6.
  • [18] H. Han, W. Wang, and B. Mao (2005) Borderline-smote: a new over-sampling method in imbalanced data sets learning. In International conference on intelligent computing, pp. 878–887. Cited by: §2.1.1, §4.1, §5, §6.
  • [19] H. He, Y. Bai, E. A. Garcia, and S. Li (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322–1328. Cited by: §2.1.1, §4.1, §4.2, §6.
  • [20] C. Hoi, C. Chan, K. Huang, M. R. Lyu, and I. King (2004) Biased support vector machine for relevance feedback in image retrieval. In 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541), Vol. 4, pp. 3189–3194. Cited by: §2.2.1.
  • [21] T. Imam, K. M. Ting, and J. Kamruzzaman (2006) Z-svm: an svm for improved classification of imbalanced data. In Australasian Joint Conference on Artificial Intelligence, pp. 264–273. Cited by: §2.2.1.
  • [22] V. Karia, W. Zhang, A. Naeim, and R. Ramezani (2019) GenSample: a genetic algorithm for oversampling in imbalanced datasets. External Links: 1910.10806 Cited by: §2.1.1.
  • [23] M. Kubat, R. C. Holte, and S. Matwin (1998) Machine learning for the detection of oil spills in satellite radar images. Machine learning 30 (2-3), pp. 195–215. Cited by: §1, §4.3.
  • [24] M. Kubat, S. Matwin, et al. (1997) Addressing the curse of imbalanced training sets: one-sided selection. In Icml, pp. 179–186. Cited by: §4.3.
  • [25] A. Kumar, R. Bharti, D. Gupta, and A. K. Saha (2019) Improvement in boosting method by using rustboost technique for class imbalanced data. In Recent Developments in Machine Learning and Data Analytics, pp. 51–66. Cited by: §2.2.3, §4.1.
  • [26] W. Lee and S. Stolfo (1998) Data mining approaches for intrusion detection. Cited by: §1.
  • [27] D. D. Lewis and J. Catlett (1994) Heterogeneous uncertainty sampling for supervised learning. In Machine learning proceedings 1994, pp. 148–156. Cited by: §1.
  • [28] L. M. Manevitz and M. Yousef (2001) One-class svms for document classification. Journal of machine Learning research 2 (Dec), pp. 139–154. Cited by: §2.2.1.
  • [29] D. D. Margineantu (2002) Class probability estimation and cost-sensitive classification decisions. In European Conference on Machine Learning, pp. 270–281. Cited by: §2.2.2.
  • [30] K. Napierala and J. Stefanowski (2016) Types of minority class examples and their influence on learning classifiers from imbalanced data. Journal of Intelligent Information Systems 46 (3), pp. 563–597. Cited by: §2.1.1, §2.2.3, §5, §6.
  • [31] F. Provost (2000) Machine learning from imbalanced data sets 101. In Proceedings of the AAAI 2000 workshop on imbalanced data sets, Vol. 68, pp. 1–3. Cited by: §1, §3, §6.
  • [32] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano (2010) RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 40 (1), pp. 185–197. Cited by: §2.2.3.
  • [33] J. S. Shirabad and T. J. Menzies (2005) The promise repository of software engineering databases. School of Information Technology and Engineering, University of Ottawa, Canada 24. Cited by: Table 1.
  • [34] Y. Sun, A. K. Wong, and M. S. Kamel (2009) Classification of imbalanced data: a review. International Journal of Pattern Recognition and Artificial Intelligence 23 (04), pp. 687–719. Cited by: §2.2.
  • [35] J. A. Swets (1988) Measuring the accuracy of diagnostic systems. Science 240 (4857), pp. 1285–1293. Cited by: §4.3.
  • [36] Y. Tang, Y. Zhang, N. V. Chawla, and S. Krasser (2009) SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39 (1), pp. 281–288. Cited by: §2.2.1.
  • [37] N. Thai-Nghe, D. Nghi, and L. Schmidt-Thieme (2010) Learning optimal threshold on resampling data to deal with class imbalance. In Proc. IEEE RIVF International Conference on Computing and Telecommunication Technologies, pp. 71–76. Cited by: §2.1.2.
  • [38] I. Tomek (1976) An experiment with the edited nearest-neighbor rule. IEEE Transactions on systems, Man, and Cybernetics (6), pp. 448–452. Cited by: §2.1.2.
  • [39] M. Verleysen, J. Voz, P. Thissen, and J. Legat (1995) A statistical neural network for high-dimensional vector classification. In Proceedings of ICNN’95-International Conference on Neural Networks, Vol. 2, pp. 990–994. Cited by: Table 1.
  • [40] I. Yeh, K. Yang, and T. Ting (2009) Knowledge discovery on rfm model using bernoulli sequence. Expert Systems with Applications 36 (3), pp. 5866–5871. Cited by: Table 1.
  • [41] B. Zadrozny and C. Elkan (2001) Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 204–213. Cited by: §2.2.2.