A Cost-Sensitive Deep Belief Network for Imbalanced Classification
Imbalanced data with a skewed class distribution are common in many real-world applications. Deep Belief Network (DBN) is a machine learning technique that is effective in classification tasks. However, a conventional DBN does not work well for imbalanced data classification because it assumes equal costs for each class. To deal with this problem, cost-sensitive approaches assign different misclassification costs to different classes without disrupting the true data sample distributions. However, due to a lack of prior knowledge, the misclassification costs are usually unknown and hard to choose in practice. Moreover, how cost-sensitive learning could improve DBN performance on imbalanced data problems has not been well studied. This paper proposes an evolutionary cost-sensitive deep belief network (ECS-DBN) for imbalanced classification. ECS-DBN uses adaptive differential evolution to optimize the misclassification costs based on training data, which provides an effective way of incorporating the evaluation measure (i.e. G-mean) into the objective function. We first optimize the misclassification costs, then apply them to the deep belief network. Adaptive differential evolution is used as the optimization algorithm because it automatically updates its control parameters without the need for prior domain knowledge. The experiments show that the proposed approach consistently outperforms the state-of-the-art on both benchmark datasets and a real-world dataset for fault diagnosis in tool condition monitoring.
Class imbalance, in which the numbers of instances per class are disproportionate, commonly affects the quality of learning algorithms. Imbalanced data problems exist in numerous real-world applications, such as fault diagnosis, recommendation systems, fraud detection, risk management, tool condition monitoring [4, 5, 6], medical diagnosis, brain computer interface (BCI) [8, 9], and data visualization. Because they assume equal misclassification costs or a balanced class distribution, traditional learning algorithms are biased toward the majority class when dealing with complicated classification problems that have a skewed class distribution. Such imbalanced data often lead to degradation of performance in learning and classification systems. Typically, imbalance learning can be categorized into two conventional approaches, namely data level approaches and algorithm level approaches. The typical data level approaches are based on resampling [12, 13, 14, 15, 16, 17]. Some well-known resampling-based approaches include the synthetic minority over-sampling technique (SMOTE) and the adaptive synthetic sampling approach (ADASYN). SMOTE is an oversampling technique that generates synthetic samples of the minority class. ADASYN uses a weighted distribution over minority class samples according to their level of difficulty in learning, and generates more synthetic data for the minority samples that are harder to learn. A typical algorithm level approach is cost-sensitive learning [18, 19, 20, 21, 22]. We focus on the algorithm level approach in this paper.
Resampling approaches attempt to manually rebalance dataset by oversampling minority samples and/or under-sampling majority samples. Unfortunately, such approaches may, on one hand miss out potentially useful data, on the other hand add the computational burden with the redundant samples. Essentially, resampling-based approaches would alter the original distribution of classes. In practice, the assumption that all misclassification errors have equal costs is not true in real-world applications. There could be large differences in terms of costs between different misclassification errors. For instance, in fault diagnosis of tool condition, if we are detecting the healthy state versus failure state of a machine, we know that missing the detection of a failure state may cause a catastrophic accident which costs much higher than the others.
Many conventional approaches presume equal costs for all classification errors, and this assumption usually does not hold in practice. Some real-world problems have drastically different costs for different classes, for example, between the failure state and the healthy state of a machine. Cost-sensitive learning methods are popular approaches that deal with imbalanced classification problems with unknown and unequal costs at the algorithm level. The intuition of cost-sensitive learning is to assign an appropriate misclassification cost to each class. We have seen studies on cost-sensitive neural networks, cost-sensitive decision trees, cost-sensitive extreme learning machines, etc. However, there are few studies on cost-sensitive deep belief networks. Deep belief network (DBN) [24, 25], a generative model stacked with several Restricted Boltzmann Machines (RBMs), has drawn tremendous attention recently. It has shown promising results in classification tasks such as image identification, speech recognition, and natural language processing. DBN is known for its end-to-end feature learning and classification characteristics. However, it has not been investigated how cost-sensitive learning could enhance DBN to deal with imbalanced data problems.
A key issue in cost-sensitive learning is to estimate the costs associated with data classes in different problems. The generic population-based evolutionary algorithms (EAs), which successfully deal with multimodal optimization problems [28, 29, 30, 31], offer a solution to this issue. Differential evolution (DE) is a popular variant of EAs. DE solves an optimization problem by iteratively searching for a solution given an evaluation metric. Basically, DE moves the candidate solutions around the search space by using simple mathematical formulae to combine the positions of existing solutions. If a new position gives an improvement, the old position is replaced; otherwise, the new position is discarded. DE can solve non-separable multimodal problems (i.e. problems with many local optima) and is good at avoiding local optima. In comparison with other variants of EAs, DE has better exploration capability with fewer parameters. It is also easy to implement. However, the exploration and exploitation capabilities of DE are mainly controlled by two key parameters, i.e. the mutation factor and the crossover probability. Traditional DEs use fixed parameters, which are not suitable for different problems and are hard to tune. Adaptive DE automatically updates the parameters according to probability matching, which can be easily implemented. Therefore, it becomes a logical choice for solving practical problems.
The above observations have prompted us to study an Evolutionary Cost-Sensitive Deep Belief Network (ECS-DBN) to deal with imbalanced data problems, where we find ways to assign differential misclassification costs to the classes, which we also call class-dependent misclassification costs. The misclassification costs are optimized by an adaptive differential evolution (DE) algorithm. We consider that such a study could help to identify methods to heuristically optimize misclassification costs. In the rest of the paper, we presume a positive label for the minority class and a negative label for the majority class.
The contributions of this paper are summarized as follows.
We formulate a novel learning algorithm for classification prediction of DBN that handles imbalanced data classification;
We show how ECS-DBN works by assigning appropriate misclassification costs and incorporating cost-sensitive learning with deep learning.
We show that ECS-DBN allows us to determine the unknown misclassification costs without prior domain knowledge.
We show that ECS-DBN offers an effective solution of good performance to imbalanced classification problems. The proposed approach can automatically work for both binary and multiclass classification problems.
The rest of the paper is organized as follows. Section II reviews current related literature. Section III and Section IV introduce cost-sensitive deep belief network and present the proposed ECS-DBN, respectively. Section V compares the proposed approach and other state-of-the-art methods on 58 benchmark datasets. Section VI reports the experiments on a real-world dataset of fault diagnosis in tool condition monitoring on gun drilling. Finally, Section VII concludes the discussion and highlights some potential research directions.
II Literature Reviews
II-A Cost-sensitive Learning
Cost-sensitive learning is a learning paradigm that assigns differential misclassification costs to the classes involved in a classification task.
Datta et al. investigated Near-Bayesian Support Vector Machines (NBSVM) for imbalanced classification with equal or unequal misclassification costs in the multi-class scenario. Zong et al. proposed a weighted extreme learning machine (WELM) for imbalance learning. The approach benefits from the idea of the original extreme learning machine (ELM), which is simple and convenient to implement, and it can be applied directly to multiclass classification tasks. WELM is capable of dealing with imbalanced class distributions, with weights assigned to each example according to users’ needs. Krempl et al. proposed OPAL, a fast, non-myopic and cost-sensitive probabilistic active learning approach. However, such approaches cannot determine the optimal misclassification loss without prior domain knowledge. Castro et al. proposed a cost-sensitive algorithm (CSMLP) that uses a single cost parameter to differentiate misclassification errors and improve the performance of MLPs on binary imbalanced class distributions. ABMODLEM addressed imbalanced data with argument based rule learning. By using expert knowledge, CSMLP and ABMODLEM improve the learning of rules from imbalanced data. With argument based rule induction, such approaches make it convenient for domain experts to describe reasons for specific classes.
There are many studies related to neural networks over imbalanced learning. Zheng  proposed cost-sensitive boosting neural networks which incorporate the weight updating rule of boosting procedure to associate samples with misclassification costs. Bertoni et al.  proposed a cost-sensitive neural network for semi-supervised learning in graphs. Cost-sensitive SVM (CS-SVM)  was discussed with model selection via global minimum cross validation error. Tan et al.  proposed an evolutionary fuzzy ARTMAP (FAM) neural network using adaptive incremental learning method to overcome the stability-plasticity dilemma on stream imbalanced data. A cost-sensitive convolutional neural network  was proposed for imbalanced image classification. Despite many studies on imbalanced learning, the potential benefits through DBN with imbalanced learning have not been fully explored yet.
II-B Evolutionary Algorithm (EA)
We note that it is possible to decide misclassification costs either by trial and error [39, 18] or EAs [37, 40, 41]. Inspired by the biological evolution process, EA is a meta-heuristic optimization method, which attracts significant attention when learning from imbalanced data. EA based studies on optimizing imbalanced classification can be broadly grouped into two categories.
In the first category, one can implement EA to optimize the dataset for training classifiers. Early studies focused on using EA to drive the sampling process of the training dataset. Such approaches [15, 42, 43] represent data by expressing the chromosome with a binary representation. However, these approaches scale poorly to large datasets because the chromosome expands proportionally with the size of the dataset, resulting in a cumbersome and time-consuming evolutionary process. Recent methods attempt to circumvent this problem by employing EA to sample smaller subsets of data to represent the imbalanced dataset. Another idea is to use EA to carry out random under-sampling by determining the optimal regions in the sample space. Recent trends incorporating EA into the sample space are limited to repetitive sampling-based solutions. ECO-ensemble incorporates synthetic data generation within an ensemble framework that is simultaneously optimized by EA. Although this approach integrates EA into the whole framework from sample space to model space, the computational complexity also increases accordingly. These data-level approaches are highly sensitive to the quality of imbalanced data (i.e. outliers, sparse data and small disjuncts). The use of synthetic data may change the true distribution of the original dataset and may therefore do more harm than good to the classifier.
In the second category, one can implement EA to optimize classifiers for imbalanced classification at the algorithm level. Such approaches optimize the classifier in the model space by using EA. Some studies [46, 47, 48] implement EAs to enhance rule-based classifiers. Genetic programming has been utilized to acquire sets of optimized classifiers, such as in negative correlated learning [49, 50, 51]. Perez et al. integrate an evolutionary cooperative-competitive algorithm to obtain a set of simple and accurate radial basis function networks. Due to a lack of sufficient domain knowledge, the costs of misclassification in cost-sensitive methods are usually hard to determine. In the literature, we note that the present studies focus mostly on traditional simple network models [18, 40, 37]. To the best of our knowledge, there is no reported work on the study of EA in cost-sensitive deep learning. In this paper, we study how to estimate the misclassification costs automatically to improve the performance of the cost-sensitive deep belief network.
III Technical Details of Deep Belief Network with Cost-sensitive Learning
III-A Deep Belief Network
Deep Belief Network (DBN) is a probabilistic generative model stacked with several Restricted Boltzmann Machines (RBMs). DBN is trained by greedy unsupervised layer-wise pre-training followed by discriminative supervised fine-tuning. The weight connections in DBN are between contiguous layers; there are no connections between the hidden neurons within the same layer.
The fundamental building block of DBN is an RBM which consists of one visible layer and one hidden layer. To construct a DBN, the hidden layer of previous RBM is regarded as the visible layer of its subsequent RBM in the deep structure. To train a DBN, typically each RBM is pre-trained initially from the bottom to the top in a layer-wise manner and subsequently the whole network is fine-tuned with supervised learning methods. Ultimately, the hypothesized prediction is obtained in the output layer based on the posterior probability distribution obtained from the penultimate layer.
DBN is usually trained by progressively untying the weights in each layer from the weights in higher layers. The pre-training is carried out by alternating Gibbs sampling from the true posterior distribution over all the hidden layers between a data sample on the visible variables and the transposed weight matrices to infer the factorial distributions over each hidden layer. All the variables in one layer are updated in parallel via a Markov chain until they reach their stationary equilibrium distribution. This training procedure maximizes the log posterior probability of the data, while the posterior distribution is created by the likelihood term coming from the data. Factorial approximations are used in DBN to replace the intractable true posterior distribution. By implementing a complementary prior, the true posterior is exactly factorial.
The posterior in each layer is approximated by a factorial distribution of independent variables within a layer given the values of the variables in the previous layer. Based on the wake-sleep algorithm proposed in Hinton et al., the weights on the undirected connections at the top level are learned by fitting the top-level RBM to the posterior distribution of the penultimate layer. The fine-tuning starts with a state of the top-level output layer and uses the top-down generative connections to stochastically activate each lower layer in turn. So a DBN can be viewed as an RBM that defines a prior over the top layer of hidden variables in a directed belief net, combined with a set of "recognition" weights to perform fast approximate inference.
The architecture of DBN makes it possible to abstract higher level features through layer conformation . Each layer of hidden variables learns to represent features that capture higher order correlations in the original input data. Applying DBNs to a classification problem, feature vectors from data samples are used to set the states of the visible variables of the lower layer of the DBN. The DBN is then trained to produce a probability distribution over the possible labels of the data based on posterior probability distribution of the data samples.
Suppose a dataset $S = \{(x_n, y_n)\}_{n=1}^{N}$ contains a total of $N$ data sample pairs, where $x_n$ is the $n$-th data sample and $y_n$ is the corresponding $n$-th target label. Assume a DBN consists of $H$ hidden layers, with the parameters of layer $h$ denoted by $\theta_h$, $h = 1, \dots, H$. Given an input data sample $x$ from the dataset, the DBN with $H$ hidden layer(s) presents a complex feature mapping function. After feature transformation, a softmax layer serves as the output layer of the DBN to perform classification predictions, parameterized by $\theta_s = \{W, b\}$. Suppose there are $M$ neurons in the softmax layer, where the $j$-th neuron is responsible for estimating the prediction probability of class $j$ given the input $x$, which is the output of the previous layer, with associated weights $W_j$ and bias $b_j$:
$$P(y = j \mid x) = \frac{\exp(b_j + W_j x)}{\sum_{i=1}^{M} \exp(b_i + W_i x)} \tag{1}$$
where $x$ is the output of the previous layer. Based on the probability estimation, the trained DBN classifier provides a prediction as
$$f(x) = \arg\max_{1 \le j \le M} P(y = j \mid x) \tag{2}$$
In practice, the parameters of the DBN are optimized by stochastic gradient descent with respect to the negative log-likelihood loss over the training set $S$.
III-B Cost-sensitive Deep Belief Network
The concept of cost-sensitive learning is to minimize the overall cost (e.g. Bayes conditional risk ) on the training data set.
Assume the total number of classes is $M$. Given a data sample $x$, $C_{i,j}$ denotes the cost of misclassifying $x$ as class $i$ when $x$ actually belongs to class $j$. In addition, $C_{i,j} = 0$ when $i = j$, which indicates that the cost of a correct classification is 0.
Given the misclassification costs $C_{i,j}$, a data sample should be classified into the class that has the minimum expected cost. Based on decision theory, the expected cost $R(i \mid x)$ of classifying an input vector $x$ into class $i$ can be expressed as:
$$R(i \mid x) = \sum_{j=1,\, j \ne i}^{M} P(j \mid x)\, C_{i,j} \tag{3}$$
where $P(j \mid x)$ is the posterior probability estimate that the data sample $x$ belongs to class $j$. Given the prior probability $P(x)$, the general decision rule indicates which action to take for each data sample $x$, thus the overall risk is
$$R = \int R(i \mid x)\, P(x)\, dx \tag{4}$$
According to the Bayes decision theory, an ideal classifier will give a decision by computing the expectation risk of classifying an input to each class and predicts the label that reaches the minimum overall expectation risk. Misclassification costs represent penalties for classification errors. In cost-sensitive learning, all misclassification costs are essentially non-negative.
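As a small illustration of the minimum-expected-cost rule described above, the following sketch computes the expected cost of each candidate decision from a posterior vector and a cost matrix, then picks the cheapest class. The function names, the convention that `cost[i][j]` is the cost of predicting class `i` when the true class is `j`, and all numeric values are illustrative assumptions rather than values from the paper.

```python
def expected_costs(posterior, cost):
    """Expected cost of predicting each class i.

    posterior[j] is the estimated P(j | x); cost[i][j] is the (assumed) cost
    of predicting class i when the true class is j, with cost[i][i] == 0.
    """
    n = len(posterior)
    return [sum(posterior[j] * cost[i][j] for j in range(n) if j != i)
            for i in range(n)]


def min_expected_cost_class(posterior, cost):
    """Return the index of the class with the smallest expected cost."""
    risks = expected_costs(posterior, cost)
    return min(range(len(risks)), key=risks.__getitem__)


# Hypothetical two-class example: class 0 = healthy (majority),
# class 1 = failure (minority). Missing a failure costs 10, a false alarm 1.
posterior = [0.8, 0.2]
cost = [[0.0, 10.0],   # predict healthy: expensive if the truth is failure
        [1.0, 0.0]]    # predict failure: cheap false alarm if healthy

risks = expected_costs(posterior, cost)
decision = min_expected_cost_class(posterior, cost)
```

With equal off-diagonal costs the same posterior would be assigned to class 0; the asymmetric costs move the decision to the minority class even though its posterior is only 0.2.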
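Mathematically, the probability that a data sample $x$ belongs to a class $j$, a value of a stochastic variable $y$, can be expressed as:
$$P(y = j \mid x) = \frac{\exp(b_j + W_j x)}{\sum_{i=1}^{M} \exp(b_i + W_i x)} \tag{5}$$
where $W_j$ and $b_j$ are the weights and bias of the $j$-th output neuron. The misclassification threshold values are introduced to turn posterior probabilities into class labels such that the misclassification costs are minimized. Applying the misclassification threshold value $c_j$ to the obtained posterior probability $P(y = j \mid x)$, one obtains the new probability $P^{*}(y = j \mid x)$:
$$P^{*}(y = j \mid x) = P(y = j \mid x) \cdot c_j \tag{6}$$
The hypothesized prediction $\hat{y}$ of the sample is the class with the maximum new probability among the $M$ classes, and can be obtained by using the following equation:
$$\hat{y} = \arg\max_{j} P^{*}(y = j \mid x) = \arg\max_{j} P(y = j \mid x) \cdot c_j \tag{7}$$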
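The effect of applying class-dependent costs at the output layer can be sketched as follows. This is a minimal example that assumes the costs enter as per-class multipliers on the softmax posterior before the arg max is taken; the function names, activations and cost values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of output-layer activations."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cost_sensitive_predict(logits, costs):
    """Scale each class posterior by its cost and return the arg max class."""
    probs = softmax(logits)
    weighted = [p * c for p, c in zip(probs, costs)]
    return max(range(len(weighted)), key=weighted.__getitem__)


# Hypothetical activations favouring the majority class (index 0).
logits = [2.0, 1.0]
plain = cost_sensitive_predict(logits, [1.0, 1.0])    # equal costs
shifted = cost_sensitive_predict(logits, [0.2, 0.9])  # minority weighted more
```

With equal costs the prediction is the plain softmax arg max (class 0); with a larger weight on the minority class the same activations are assigned to class 1. Only the output layer is involved, so the trained network itself is left unchanged.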
The proposed cost-sensitive learning method only concerns the output layer of a DBN. In this paper, we follow the same pre-training and fine-tuning procedures as in .
For imbalanced classification problems, the prior probability distribution of different classes is essentially imbalanced or non-uniform. To reflect the class imbalance, misclassification costs need to be introduced at the output layer. In addition, traditional training algorithms generally assume a uniform class distribution with equal misclassification costs, i.e. $C_{i,j} = 1$ if $i \ne j$ and $C_{i,j} = 0$ if $i = j$, which is not true in many real-world applications.
In many real-world applications, the misclassification costs are essentially unknown and vary across classes. Current studies usually attempt to determine misclassification costs by trial and error, which generally does not lead to an optimal solution. Some studies have devised mechanisms to update misclassification costs based on the number of samples in different classes. However, such methods may not be suitable for cases where the classes are important but rare, such as some rare fatal diseases. To avoid hand tuning of misclassification costs, an adaptive differential evolution algorithm is implemented in this paper. Adaptive differential evolution is a simple, effective and efficient evolutionary algorithm that seeks an optimal solution by evolving and updating a population of individuals over several generations. It adaptively self-updates the control parameters without the need for prior knowledge.
IV Evolutionary Cost-Sensitive Deep Belief Network (ECS-DBN)
As discussed in Section II-B, evolutionary algorithm (EA) is a widely used optimization algorithm which is motivated by the biological evolution process. The EA algorithm can be designed to optimize the misclassification costs that are unknown in practice. In this paper, we propose an Evolutionary Cost-Sensitive Deep Belief Network (ECS-DBN) by incorporating cost-sensitive function directly into its classification paradigm with the misclassification costs being optimized through adaptive differential evolution [30, 57]. The main idea of this cost-sensitive learning technique is to assign class-dependent costs. Fig. 1 shows the schematic diagram of a cost-sensitive deep belief network.
The procedure of training the proposed ECS-DBN is summarized in Table I. Firstly, a population of misclassification costs is randomly initialized. We then train a DBN with the training dataset. After applying the misclassification costs to the outputs of the DBN, we evaluate the training error based on the performance of the corresponding cost-sensitive hypothesized prediction. According to the evaluation performance on the training dataset, proper misclassification costs are selected to generate the population of the next generation. In the next generation, mutation and crossover operators are employed to evolve a new population of misclassification costs. The adaptive differential evolution (DE) algorithm proceeds generation by generation, iterating through mutation, crossover and selection until it reaches the maximum number of generations. Eventually, the best misclassification costs found are applied to the output layer of the DBN to form the ECS-DBN, as shown in Fig. 1. At run-time, we test the resulting ECS-DBN on the test dataset to report the performance. The practical steps of ECS-DBN are summarized in Algorithm 1 and discussed next.
Algorithm 1, Step 1: Let $S$ be the training set.
IV-A Chromosome Encoding
Chromosome encoding is an important step in evolutionary algorithms, which aims at effectively representing important variables for better performance. In many real-world applications, the misclassification costs in a cost-sensitive deep belief network are usually unknown. In order to obtain appropriate costs, in our proposed approach each chromosome represents the misclassification costs for the different classes, and the final evolved best chromosome is chosen as the misclassification costs for ECS-DBN. The chromosome directly encodes the misclassification costs as numerical values within a specified range. Fig. 2 illustrates the chromosome encoding and evolution process in ECS-DBN.
IV-B Population Initialization
The initial population is obtained via uniformly random sampling in feasible solution space for each variable within the specified range of the corresponding variable. The population is to hold possible misclassification costs and forms the unit of evolution. The evolution of the misclassification costs is an iterative process with the population in each iteration called a generation.
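A minimal sketch of this initialization step is given below. Each individual is a vector holding one cost per class, sampled uniformly at random within bounds; the function name, the population size, the bounds `[0.0, 1.0]` and the seed are illustrative assumptions, not settings from the paper.

```python
import random


def init_population(pop_size, n_classes, low, high, seed=None):
    """Sample `pop_size` cost vectors uniformly from [low, high] per class."""
    rng = random.Random(seed)
    return [[rng.uniform(low, high) for _ in range(n_classes)]
            for _ in range(pop_size)]


# Hypothetical settings: 20 individuals, binary problem, costs in [0, 1].
population = init_population(pop_size=20, n_classes=2, low=0.0, high=1.0, seed=42)
```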
IV-C Adaptive DE Operators
After initialization, adaptive differential evolution evolves the population generation by generation with a sequence of three evolutionary operations, i.e. mutation, crossover, and selection. Mutation is carried out with a DE mutation strategy to create mutant individuals from the current parent population, as shown in Step 2.1 of Algorithm 1. After mutation, a binomial crossover operation generates the final offspring, as shown in Step 2.2 of Algorithm 1. In adaptive DE, each individual $i$ has its own crossover probability $CR_i$ instead of a fixed value. The selection operation selects the better individuals from the parents and the offspring according to their fitness values, as shown in Step 2.3 of Algorithm 1. Parameter adaptation is conducted at each generation, so the control parameters are automatically updated to appropriate values without prior knowledge of DE parameter settings. The crossover probability $CR_i$ of each individual is generated independently from a normal distribution with mean $\mu_{CR}$ and standard deviation 0.1. Similarly, the mutation factor $F_i$ of each individual is generated independently from a Cauchy distribution with location parameter $\mu_F$ and scale parameter 0.1. Both the mean $\mu_{CR}$ and the location parameter $\mu_F$ are updated at the end of each generation, as shown in Step 2.4 of Algorithm 1.
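One generation of such an adaptive DE loop might look like the sketch below. Several details are assumptions made for illustration and are not confirmed by the text: the DE/rand/1 mutation strategy, truncating the crossover probability to [0, 1], redrawing non-positive mutation factors, the JADE-style update (arithmetic mean of successful crossover probabilities, Lehmer mean of successful mutation factors, learning rate `c`), the initial means of 0.5, and the toy fitness function.

```python
import math
import random


def clip(v, lo, hi):
    return max(lo, min(hi, v))


def sample_cr(mu_cr, rng):
    """Crossover probability ~ Normal(mu_cr, 0.1), truncated to [0, 1]."""
    return clip(rng.gauss(mu_cr, 0.1), 0.0, 1.0)


def sample_f(mu_f, rng):
    """Mutation factor ~ Cauchy(mu_f, 0.1); redraw if <= 0, cap at 1."""
    while True:
        f = mu_f + 0.1 * math.tan(math.pi * (rng.random() - 0.5))
        if f > 0.0:
            return min(f, 1.0)


def adaptive_de_generation(pop, fitness, mu_cr, mu_f, rng, lo=0.0, hi=1.0, c=0.1):
    """Run one generation (maximizing fitness); return (pop, mu_cr, mu_f)."""
    dim = len(pop[0])
    new_pop, good_cr, good_f = [], [], []
    for i, parent in enumerate(pop):
        cr, f = sample_cr(mu_cr, rng), sample_f(mu_f, rng)
        # Mutation (assumed DE/rand/1): combine three distinct other individuals.
        r1, r2, r3 = rng.sample([k for k in range(len(pop)) if k != i], 3)
        mutant = [clip(pop[r1][d] + f * (pop[r2][d] - pop[r3][d]), lo, hi)
                  for d in range(dim)]
        # Binomial crossover: at least one gene is taken from the mutant.
        j_rand = rng.randrange(dim)
        trial = [mutant[d] if (rng.random() < cr or d == j_rand) else parent[d]
                 for d in range(dim)]
        # Selection: the trial replaces its parent only if it is strictly better.
        if fitness(trial) > fitness(parent):
            new_pop.append(trial)
            good_cr.append(cr)
            good_f.append(f)
        else:
            new_pop.append(parent)
    # Parameter adaptation from the successful values (JADE-style assumption).
    if good_cr:
        mean_cr = sum(good_cr) / len(good_cr)
        lehmer_f = sum(x * x for x in good_f) / sum(good_f)
        mu_cr = (1 - c) * mu_cr + c * mean_cr
        mu_f = (1 - c) * mu_f + c * lehmer_f
    return new_pop, mu_cr, mu_f


# Toy usage: evolve 2-dimensional cost vectors toward a hypothetical target.
rng = random.Random(0)
target = [0.3, 0.7]
toy_fitness = lambda costs: -sum((a - b) ** 2 for a, b in zip(costs, target))
pop = [[rng.random() for _ in range(2)] for _ in range(10)]
best_before = max(map(toy_fitness, pop))
mu_cr, mu_f = 0.5, 0.5
for _ in range(30):
    pop, mu_cr, mu_f = adaptive_de_generation(pop, toy_fitness, mu_cr, mu_f, rng)
best_after = max(map(toy_fitness, pop))
```

Because each parent is replaced only by a strictly better trial, the best fitness in the population can never decrease from one generation to the next. In ECS-DBN the toy fitness would be replaced by the training G-mean of the cost-sensitive DBN.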
IV-D Fitness Evaluation
Fitness evaluation allows us to choose the appropriate misclassification costs. In the proposed method, each individual chromosome is introduced into individual DBN as misclassification costs. We generate suitable misclassification costs for DBN using the training set. G-mean of training set is chosen as the objective function for the optimization.
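The fitness of one chromosome could be evaluated along the following lines. This sketch assumes a binary problem with the positive (minority) class at index 1, assumes the DBN posteriors on the training set are already available as plain lists, and assumes the costs act as per-class multipliers on those posteriors; the data are made up for illustration.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sens * spec)


def fitness(costs, posteriors, y_true):
    """Training G-mean obtained when `costs` scale the DBN posteriors."""
    y_pred = []
    for probs in posteriors:
        weighted = [p * c for p, c in zip(probs, costs)]
        y_pred.append(max(range(len(weighted)), key=weighted.__getitem__))
    return g_mean(y_true, y_pred)


# Hypothetical posteriors [P(negative), P(positive)] from a trained DBN.
posteriors = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4], [0.55, 0.45]]
y_true = [0, 0, 0, 1, 1]

equal_cost_fitness = fitness([1.0, 1.0], posteriors, y_true)
tuned_cost_fitness = fitness([0.4, 1.0], posteriors, y_true)
```

With equal costs every sample is labelled negative, so the fitness is 0; down-weighting the majority class recovers both positives at the price of one false alarm and raises the fitness. This is the kind of signal the evolutionary search can exploit.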
IV-E Termination Condition
Evolutionary algorithms are designed to evolve the population generation by generation and maintain the convergence as well as diversity characteristics within the population. A maximum number of generations is set to be a termination condition of the algorithm. In this implementation, we consider the solutions converged when the best fitness value remains unchanged over the past 30 generations . The algorithm terminates either when it reaches the maximum number of generations or when it meets the convergence condition.
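The two stopping criteria can be combined in a small helper such as the one below. The function name and the use of exact equality to detect an unchanged best fitness are assumptions; a tolerance could be substituted for noisy fitness values.

```python
def should_stop(best_history, max_generations, patience=30):
    """Return True when the evolutionary run should end.

    best_history[g] is the best fitness after generation g. The run stops
    when the generation budget is used up, or when the best fitness has not
    changed over the last `patience` generations.
    """
    if len(best_history) >= max_generations:
        return True
    if len(best_history) > patience:
        recent = best_history[-(patience + 1):]
        return all(v == recent[0] for v in recent)
    return False
```

For example, `should_stop([0.5] * 31, max_generations=100)` is true because the best value has stayed the same across the last 30 generations, whereas a history that is still improving keeps the loop running until the budget is exhausted.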
IV-F ECS-DBN Creation
Eventually, the optimization process ends with the best individual which is used as misclassification costs to form an ECS-DBN. The best individual is obtained from the last generation.
V Evaluation On Benchmark Datasets
In this section, the proposed ECS-DBN approach is evaluated on 58 popular KEEL benchmark datasets.
ADASYN, SMOTE and its variants are applied with DBN to generate synthetic minority data on the imbalanced datasets. The nomenclature used in labeling the imbalance learning methods is as follows: the prefixes “ADASYN”, “SMOTE”, “SMOTE-SVM”, “SMOTE-borderline1”, and “SMOTE-borderline2” respectively represent the adaptive synthetic sampling approach, the Synthetic Minority Over-sampling Technique, Support Vectors SMOTE, and Borderline SMOTE of types 1 and 2. The suffix “-DBN” represents deep belief network.
V-B Benchmark Datasets
In this paper, benchmark datasets are selected from the KEEL (Knowledge Extraction based on Evolutionary Learning) dataset repository. The detailed specifications of the 58 binary-class imbalanced datasets are shown in Table II. All datasets are downloaded from the KEEL website (http://sci2s.ugr.es/keel/imbalanced.php). They are known to have a high imbalance ratio between the majority and minority classes.
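The imbalance ratio (IR) is the number of data samples in the majority class divided by that in the minority class, as described by (8):
$$\mathrm{IR} = \frac{N_{\mathrm{maj}}}{N_{\mathrm{min}}} \tag{8}$$
where $N_{\mathrm{maj}}$ and $N_{\mathrm{min}}$ denote the numbers of samples in the majority and minority classes, respectively.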
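For a concrete reading of this definition, the ratio can be computed directly from a label list. The helper name and the toy labels are illustrative.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical dataset with 90 negative and 10 positive samples.
ir = imbalance_ratio([0] * 90 + [1] * 10)
```

Here the ratio is 9, meaning there are nine majority samples for every minority sample.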
V-C Implementation Details
The learning rates of both pre-training and fine-tuning are 0.01. The numbers of pre-training and fine-tuning iterations are 100 and 300, respectively. The number of hidden neurons in each network is randomly selected from a predefined range. Generally speaking, two key parameters affect the DE process, namely the mutation factor $F$ and the crossover probability $CR$. A larger $F$ gives DE better exploration ability, while a smaller $F$ gives it better exploitation ability. DE with better exploitation ability converges better. DE with better exploration ability avoids local optima better, but it may result in slower convergence. The crossover probability $CR$ affects the diversity of the population: a larger $CR$ gives DE better exploitation ability, while a smaller $CR$ gives it better exploration ability. We set the parameters empirically to ensure that DE generally converges. The code for all the resampling methods used for comparison in this paper comes from an existing Python implementation, with the corresponding parameters set to their defaults. All the simulation results are obtained with 5-fold cross validation over 10 trials. All of the simulations are run on an Intel Core i5 3.20GHz machine with 16 GB RAM and an NVIDIA GeForce GTX 980.
V-D Evaluation Metrics
As in [32, 17, 13, 12, 14, 44, 58], accuracy, G-mean, F1-score, recall, and precision are the most commonly used evaluation metrics. Considering an imbalanced binary-class classification problem, let TP, FP, FN, TN represent the numbers of true positives, false positives, false negatives and true negatives, respectively. To evaluate the performance of a classifier, it is common to use the overall accuracy, which is formulated in (9):
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{9}$$
In this section, both accuracy and G-mean are used. We use G-mean (10) because it evaluates the degree of inductive bias by considering both positive and negative accuracy. A higher G-mean value indicates that the classifier achieves better performance on both the minority and majority classes. G-mean is less sensitive to data distributions and is given as follows:
$$\text{G-mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}} \tag{10}$$
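The difference between the two metrics is easiest to see on a small made-up example. The sketch below compares a classifier that labels everything as the majority class with one that recovers most minority samples; the confusion counts are hypothetical and the function names are illustrative.

```python
import math


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def g_mean_from_counts(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical test set: 95 negatives (majority), 5 positives (minority).
# Classifier A labels everything negative; classifier B finds 4 of the 5
# positives at the price of 10 false alarms.
a = dict(tp=0, fp=0, fn=5, tn=95)
b = dict(tp=4, fp=10, fn=1, tn=85)

acc_a, gm_a = accuracy(**a), g_mean_from_counts(**a)
acc_b, gm_b = accuracy(**b), g_mean_from_counts(**b)
```

Classifier A has the higher accuracy yet a G-mean of zero, while classifier B trades a little accuracy for a much higher G-mean. This is why accuracy alone can be misleading on imbalanced data.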
V-E Results of ECS-DBN
In this section, we investigate the performance of DBN in different settings that include ECS-DBN, DBN, ADASYN-DBN, SMOTE-DBN, SMOTE-borderline1-DBN, SMOTE-borderline2-DBN and SMOTE-SVM-DBN. We report the results over all runs on the 58 KEEL benchmark datasets in terms of test accuracy and test G-mean, respectively. A detailed summary can be found in Tables AI and AII in the Annex, with the best results highlighted in boldface. To visualize, Fig. 3 illustrates the overall performance of the 7 variations of DBN algorithms on the 58 benchmark datasets. ECS-DBN clearly stands out from the rest.
From the simulation results, the proposed ECS-DBN exhibits superior overall performance, especially in terms of G-mean. From Table AI, we observe that ECS-DBN performs best on 34 out of 58 benchmark datasets in terms of accuracy. As there are many more samples in the majority class than in the minority class, a classifier can be biased toward the majority class and still achieve a high accuracy. We therefore also report results in G-mean, which takes the performance on both the majority class and the minority class into account. If a method gives a highly biased performance, its G-mean value will be close to 0. It is worth noting that ECS-DBN performs best on 52 out of 58 benchmark datasets in terms of G-mean, as shown in Table AII. We attribute this to the fact that ECS-DBN has been optimized by an evolutionary algorithm with G-mean maximization as the objective. Therefore, the proposed ECS-DBN can provide better performance on the minority class as well as on the majority class.
V-F Computational Time Analysis
The run-time computation of ECS-DBN is closely related to the complexity of the DBN: the larger and deeper the network, the more computation is required. To make a fair and clear comparison between the imbalance learning methods, Table III reports the average computational time with 5-fold cross validation over 10 trials on all 58 benchmark datasets. ECS-DBN shows a higher computational cost, which is mainly due to the evolutionary algorithm. We note that the resampling methods are slightly faster than ECS-DBN because the KEEL benchmark datasets are small.
V-G Statistical Tests for Evaluating Imbalance Learning
Statistical tests provide evidence for the claim that ECS-DBN outperforms the other competing methods. Three common statistical tests [16, 61, 62, 21, 63, 17] serve this purpose. The Wilcoxon paired signed-rank test is adopted for pairwise comparisons between algorithms. For comparisons between multiple algorithms, the Holm post-hoc test is used to conduct a posteriori tests between the control algorithm and the remaining algorithms. The average rank is also computed for a fair comparison.
V-G1 Wilcoxon Paired Signed-Rank Test
To determine whether the results of ECS-DBN and the other imbalance learning methods differ in a statistically significant way, a nonparametric statistical test known as the Wilcoxon paired signed-rank test is conducted at the 5% significance level. The test is applied separately to each pair of algorithms on each dataset. The entries which are significantly better than all the counterparts are marked with in Tables AI and AII. The total numbers of wins, losses and draws between the proposed method and its counterparts are then counted. The pairwise comparisons of the proposed ECS-DBN against the other methods in terms of accuracy and G-mean are shown in Tables AI and AII, respectively. In most cases, the proposed ECS-DBN outperforms the state-of-the-art resampling methods, i.e. ADASYN, SMOTE, SMOTE-borderline1, SMOTE-borderline2 and SMOTE-SVM.
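As an illustration of such a pairwise comparison, the sketch below applies the signed-rank test from SciPy (`scipy.stats.wilcoxon`) to paired scores of two methods and turns the outcome into a win, lose or draw label. The score values, the `compare` helper and the use of the median difference to decide the direction are assumptions made for this example and are not taken from the experiments.

```python
import numpy as np
from scipy.stats import wilcoxon

def compare(scores_a, scores_b, alpha=0.05):
    """Label method A as 'win', 'lose' or 'draw' against method B."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    _, p_value = wilcoxon(a, b)  # two-sided test on the paired differences
    if p_value >= alpha:
        return "draw", p_value   # no significant difference at level alpha
    # Significant difference: the sign of the median difference gives the winner.
    return ("win" if np.median(a - b) > 0 else "lose"), p_value

# Hypothetical G-mean values of two methods over 10 paired runs on one dataset.
baseline = [0.80, 0.82, 0.79, 0.81, 0.83, 0.78, 0.80, 0.82, 0.81, 0.79]
proposed = [0.84, 0.87, 0.85, 0.88, 0.91, 0.87, 0.90, 0.93, 0.93, 0.92]

outcome, p_value = compare(proposed, baseline)
print(outcome, p_value)
```

Repeating the call for every dataset and every competitor, and counting the labels, would give win-lose-draw totals of the kind described above.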
V-G2 Holm post-hoc Test
For multiple comparisons, the algorithms are compared using the Holm post-hoc test to detect statistical differences among them. The proposed ECS-DBN is chosen as the control algorithm. The Holm post-hoc test is then applied to the results of each method on all datasets in terms of accuracy and G-mean, as shown in Tables AI and AII, respectively. The p-values from the Holm post-hoc test shown in Tables AI and AII indicate that the proposed ECS-DBN outperforms the other methods with statistically significant differences over all 58 datasets.
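The Holm procedure itself is short enough to sketch. The function below computes Holm step-down adjusted p-values from the unadjusted p-values of the comparisons against the control algorithm. It is a minimal NumPy sketch with hypothetical input values, not the code used to produce Tables AI and AII.

```python
import numpy as np

def holm_adjust(p_values):
    """Holm step-down adjusted p-values for m comparisons against a control."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                    # smallest p-value first
    scaled = (m - np.arange(m)) * p[order]   # i-th smallest is multiplied by (m - i)
    # Enforce monotonicity along the sorted order and cap the result at 1.
    adjusted_sorted = np.minimum(1.0, np.maximum.accumulate(scaled))
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted        # restore the original order
    return adjusted

# Hypothetical unadjusted p-values of three methods compared with the control.
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
print(adjusted)                  # adjusted p-values in the original order
print(adjusted < 0.05)           # which comparisons stay significant at the 5% level
```

A comparison is declared significant when its adjusted p-value falls below the chosen significance level.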
V-G3 Average Rank
The average rank is the mean of the ranks of an individual method over all the datasets. Average ranks provide a fair comparison of the different methods in terms of accuracy and G-mean, as shown in Tables AI and AII. Based on the ranks in terms of both accuracy and G-mean, the proposed ECS-DBN is ranked first on the majority of the datasets. The results indicate that ECS-DBN outperforms the other competing methods and excels especially in terms of G-mean. For a better illustration, the average ranks of the different algorithms in terms of G-mean and accuracy are shown in Fig. 4. It is apparent that ECS-DBN outranks the others in terms of both G-mean and accuracy.
In sum, the results show that ECS-DBN significantly outperforms the other competing methods. First, according to the Wilcoxon paired signed-rank test, ECS-DBN outperforms the other methods in most cases. Secondly, the p-values from the Holm post-hoc test suggest that ECS-DBN achieves a statistically significant improvement over the other competing methods. Thirdly, the average ranks show that ECS-DBN is ranked first on most of the benchmark datasets. The fact that ECS-DBN outperforms DBN validates the need for cost-sensitive learning. Finally, the significant improvement of ECS-DBN over the other methods demonstrates the effectiveness of the optimization.
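The average rank can be computed directly from a table of per-dataset scores. The sketch below assumes a hypothetical score matrix with one row per dataset and one column per method, and uses `scipy.stats.rankdata`, which assigns tied methods the average of the ranks they share.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical G-mean scores: rows are datasets, columns are methods.
scores = np.array([
    [0.95, 0.90, 0.85],
    [0.80, 0.82, 0.70],
    [0.99, 0.97, 0.97],
])

# Rank 1 is the best method on a dataset, so the scores are negated before ranking.
ranks = np.vstack([rankdata(-row) for row in scores])
average_rank = ranks.mean(axis=0)

print(ranks)
print(average_rank)  # a lower average rank means a better method overall
```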
VI Evaluation on a Real-World Dataset
VI-A Overview of the Imbalanced Gun Drilling Dataset
We report the experimental results on a gun drilling dataset collected from a UNISIG USK25-2000 gun drilling machine in the Advanced Manufacturing Lab at the National University of Singapore, Singapore, in collaboration with the SIMTech-NUS joint lab.
VI-B Experimental Setup
In the experiments, an Inconel 718 workpiece with the size of is machined using gun drills. The tool diameter of gun drills is . The detailed geometry of the tools is shown in Table VI. Four vibration sensors (Kistler Type 8762A50) are mounted on the workpiece to measure the vibration signals in three directions (i.e. X, Y and Z) during the gun drilling process. The details of the sensor types and measurements are shown in Table IV. The sensor signals are acquired via an NI cDAQ-9178 data acquisition device and logged on a laptop.
In data acquisition, 14 channels of raw signals belonging to three types are logged. The measured signals include a force signal, a torque signal and 12 vibration signals (i.e. acquired by 4 accelerometers in the X, Y and Z directions). The tool wear has been measured using a Keyence VHX-5000 digital microscope. In this paper, the maximum flank wear, which is the most widely used indicator in the literature [64, 5, 65, 66, 67, 6], is used as the health indicator of the tool. In this dataset, 3 out of 20 tools are broken, 6 out of 20 tools show chipping in their final state and 11 out of 20 tools are worn after the gun drilling operations.
|Sensor type||Vibration Sensor||
|# Channels per sensors||3||2||-|
|Total number of channels||12||2||-|
|Measurements||Vibration X,Y,Z||Thrust force, torque||Tool wear|
|Sampling frequency (Hz)||20,000||100||-|
|Number of Signal Channels||14|
|Total Number of Data Samples||19,712,414|
|Number of Training Samples||13,798,690|
|Number of Testing Samples||5,913,724|
The hole index, drill depth, tool geometry, tool diameter, feed rates, spindle speeds, machining times and final tool states of the machining operations are shown in Table VI. The drilling depth is 50 mm in -axis direction. The tool wear is captured and measured with the Keyence digital microscope after each drill during the gun drilling operations.
Table V lists the details of the imbalanced gun drilling dataset. The dataset is selected from the raw experimental data by discarding noisy data samples. The total number of data samples is 19,712,414, of which 13,798,690 are training samples and 5,913,724 are test samples. The data are labeled into two classes: healthy (i.e. maximum flank wear of the tool ) and faulty (i.e. maximum flank wear of the tool ). The imbalance ratio (IR) of this dataset is 10. The data preprocessing and time window process are the same as in .
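The dataset sizes can be checked with a few lines of arithmetic. The sketch below verifies that the training and test counts add up to the total and derives the training fraction. It also shows the class shares implied by an IR of 10 under the assumption, made here for illustration only, that IR is defined as the number of majority samples divided by the number of minority samples.

```python
# Sample counts reported for the imbalanced gun drilling dataset.
total, n_train, n_test = 19_712_414, 13_798_690, 5_913_724

# The two splits should partition the whole dataset.
split_is_consistent = (n_train + n_test == total)
train_fraction = n_train / total

# Assumption: IR = (number of majority samples) / (number of minority samples).
imbalance_ratio = 10
minority_fraction = 1 / (1 + imbalance_ratio)
majority_fraction = imbalance_ratio / (1 + imbalance_ratio)

print(split_is_consistent, round(train_fraction, 4))
print(round(minority_fraction, 4), round(majority_fraction, 4))
```

Under that assumption, roughly one sample in eleven belongs to the minority class, and the split corresponds to about 70% training data and 30% test data.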
|DBN||0.9954 ± 0.0006||0.9830 ± 0.0027||0.9831 ± 0.0027||0.9968 ± 0.0005||0.9975 ± 0.0003|
|SVM||0.9894 ± 0.0103||0.8943 ± 0.3143||0.9443 ± 0.1562||0.9898 ± 0.0284||0.9944 ± 0.0148|
|MLP||0.9683 ± 0.0064||0.8492 ± 0.0350||0.8601 ± 0.0296||0.9733 ± 0.0056||0.9827 ± 0.0035|
|KNN||0.9733 ± 0.0026||0.8460 ± 0.0214||0.8579 ± 0.0181||0.9725 ± 0.0035||0.9855 ± 0.0014|
|GB||0.9821 ± 0.0353||0.8088 ± 0.3910||0.9025 ± 0.1947||0.9822 ± 0.0354||0.9906 ± 0.0185|
|LR||0.9412 ± 0.0096||0.5903 ± 0.1058||0.6791 ± 0.0545||0.9398 ± 0.0095||0.9687 ± 0.0049|
|AdaBoost||0.9361 ± 0.0126||0.5362 ± 0.1246||0.6510 ± 0.0719||0.9349 ± 0.0127||0.9661 ± 0.0065|
|Lasso||0.9523 ± 0.0414||0.5065 ± 0.4817||0.7401 ± 0.2303||0.9524 ± 0.0416||0.9749 ± 0.0216|
|SGD||0.9258 ± 0.0169||0.3860 ± 0.2692||0.6098 ± 0.0894||0.9281 ± 0.0162||0.9606 ± 0.0095|
indicates that the difference between marked algorithm and the proposed algorithm is statistically significant using Wilcoxon rank sum test at the significance level.
|ECS-DBN||0.9960 ± 0.0002||0.9946 ± 0.0011||0.9946 ± 0.0011||0.9996 ± 0.0002||0.9978 ± 0.0001|
|CSDBN||0.9957 ± 0.0012||0.9916 ± 0.0026||0.9916 ± 0.0025||0.9987 ± 0.0007||0.9977 ± 0.0006|
|SMOTE-SVM-DBN||0.9961 ± 0.0003||0.9858 ± 0.0011||0.9859 ± 0.0011||0.9974 ± 0.0002||0.9978 ± 0.0002|
|ADASYN-DBN||0.9964 ± 0.0004||0.9857 ± 0.0016||0.9858 ± 0.0016||0.9973 ± 0.0003||0.9980 ± 0.0002|
|SMOTE-borderline2-DBN||0.9962 ± 0.0001||0.9852 ± 0.0010||0.9853 ± 0.0010||0.9972 ± 0.0002||0.9979 ± 0.0001|
|SMOTE-borderline1-DBN||0.9958 ± 0.0003||0.9837 ± 0.0021||0.9838 ± 0.0020||0.9969 ± 0.0004||0.9977 ± 0.0002|
|SMOTE-DBN||0.9959 ± 0.0001||0.9837 ± 0.0010||0.9838 ± 0.0010||0.9969 ± 0.0002||0.9978 ± 0.0001|
|DBN||0.9954 ± 0.0006||0.9830 ± 0.0027||0.9831 ± 0.0027||0.9968 ± 0.0005||0.9963 ± 0.0003|
|WELM||0.7397 ± 0.0177||0.7532 ± 0.0170||0.7841 ± 0.0159||0.6274 ± 0.0202||0.7261 ± 0.0148|
|ECO-ensemble||0.9895 ± 0.0014||0.9713 ± 0.0133||0.9694 ± 0.0185||0.9836 ± 0.0105||0.9925 ± 0.0042|
indicates that the difference between marked algorithm and the proposed algorithm is statistically significant using Wilcoxon rank sum test at the significance level.
The details of the gun drilling cycle are as follows.
Feed internal coolant through coolant hole of gun drill.
Start to drill through the workpiece.
Finish drilling and pull the tool back.
The internally fed coolant carries away the heat generated during the gun drilling process, which helps to achieve high machining accuracy and precision.
VI-C Evaluation Metrics
In this section, in addition to the evaluation metrics used in Section V-D, AUC, precision and F1-score are introduced to evaluate the methods. The formulations of these metrics are listed below:
Precision (11) is a measure of a classifier's exactness, which is an important indicator in this real-world application. Recall, also known as the true positive rate, is a measure of completeness. F1-score (13) is the harmonic mean of precision and recall. F1-score is chosen for this application because it evaluates the performance on the minority class (i.e. faulty), which is very important here. G-mean and F1-score each combine two quantities to express their tradeoff and thus indicate the overall performance. AUC is used to evaluate the overall performance of a method on both classes.
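For concreteness, the sketch below computes precision, recall and F1-score from confusion counts, assuming the standard definitions: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1-score is the harmonic mean of the two. The function names and the counts are hypothetical and serve only as an illustration.

```python
def precision(tp, fp):
    """Exactness: fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Completeness: fraction of true positives that are detected."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical counts for the faulty class: 8 detected faults,
# 2 false alarms and 4 missed faults.
tp, fp, fn = 8, 2, 4
print(round(precision(tp, fp), 4), round(recall(tp, fn), 4), round(f1_score(tp, fp, fn), 4))
```

The same F1 value follows from the equivalent closed form 2TP / (2TP + FP + FN), which is a convenient consistency check.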
VI-D Experiment Results
In this section, all the parameters of DBN and adaptive DE are the same as those in Section V. All conventional machine learning algorithms used for comparison in this paper are from  and their corresponding parameters are set to the defaults. The resampling methods used for comparison are the same as in Section V. Since there is only one real-world dataset, only the Wilcoxon paired signed-rank test is applied in this section.
The simulation results on the imbalanced gun drilling dataset with DBN, multi-layer neural network (MLP), support vector machine (SVM), K-nearest neighbors (KNN), linear classifier with stochastic gradient descent (SGD) training, logistic regression (LR), gradient boosting (GB), AdaBoost classifier and Lasso are shown in Table VII in terms of classification accuracy, G-mean, precision and F1-score on test data. For better illustration, Fig. 5 presents an errorbar plot of the performance of the different algorithms under the different metrics. All results are obtained on test data and are reported as the average performance together with the corresponding standard deviation. Based on the simulation results, DBN outperforms the other 8 conventional machine learning algorithms. This may be attributed to the strong feature learning ability of DBN. Therefore, with its automatic hierarchical feature learning ability, DBN is chosen as the base classifier for this real-world application.
Table VIII presents the comparison between ECS-DBN, CSDBN, weighted extreme learning machine (WELM), ECO-ensemble and various resampling methods including ADASYN, SMOTE, SMOTE-borderline1, SMOTE-borderline2 and SMOTE-SVM in terms of accuracy, G-mean, precision and F1-score. For clear illustration, Fig. 6 shows an errorbar plot comparing the performance of ECS-DBN and the different resampling methods under the different evaluation metrics. WELM  is a state-of-the-art cost-sensitive extreme learning machine. ECO-ensemble  incorporates synthetic data generation within an ensemble framework, both optimized simultaneously by an EA. The experimental results show that ECS-DBN outperforms WELM and ECO-ensemble. Compared with the resampling methods, ECS-DBN performs best on the G-mean and precision metrics. On G-mean in particular, ECS-DBN shows a significant improvement over the others.
As an example, we compare the G-mean and precision of ECS-DBN with those obtained by a grid search over the misclassification costs of DBN in Fig. 8. It is clear that the proposed ECS-DBN benefits from the well optimized misclassification costs to achieve better performance. In terms of accuracy and F1-score, ECS-DBN also provides comparable performance. The performance improvement of ECS-DBN over DBN and over CSDBN with randomly generated cost values on many performance metrics further illustrates the need for cost-sensitive learning and the effectiveness of the optimization. Therefore, ECS-DBN delivers competitive performance not only on the benchmark datasets but also in a real-world application.
We further examine the effect of the proposed ECS-DBN on the majority versus the minority class. Fig. 9 shows that ECS-DBN benefits from more suitable misclassification costs and improves the accuracy on the minority class. Fig. 8 and Fig. 9 validate the ability of ECS-DBN to find suitable misclassification costs via the evolutionary algorithm, which improves the accuracy on the minority class and thus the overall performance. How ECS-DBN affects the majority class depends on how the objective function is defined. While ECS-DBN improves the overall performance, it also provides a mechanism to trade off the performance between the majority class and the minority class.
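The grid-search baseline mentioned above can be sketched in a few lines. The example assumes a two-class problem in which a classifier already outputs a posterior probability for the minority class, and applies a minimum expected cost rule: a sample is labeled as minority when the cost of missing it, weighted by its probability, exceeds the cost-weighted probability of the majority class. The synthetic posteriors, the cost grid and the function names are hypothetical, and the rule is one common way of applying misclassification costs rather than necessarily the exact formulation used in ECS-DBN.

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of the per-class accuracies (label 1 is the minority class)."""
    tpr = np.mean(y_pred[y_true == 1] == 1)
    tnr = np.mean(y_pred[y_true == 0] == 0)
    return float(np.sqrt(tpr * tnr))

def predict_with_costs(p_minority, c_minority, c_majority=1.0):
    """Minimum expected cost decision for two classes."""
    return (c_minority * p_minority > c_majority * (1.0 - p_minority)).astype(int)

# Synthetic stand-in for classifier outputs: about 10% minority samples and
# posteriors that are biased towards the majority class.
rng = np.random.default_rng(0)
y = (rng.random(2000) < 0.1).astype(int)
logit = -2.0 + 2.5 * y + rng.standard_normal(2000)
p_min = 1.0 / (1.0 + np.exp(-logit))

# Grid over the minority-class cost; a cost of 1.0 reproduces the cost-insensitive rule.
grid = np.linspace(1.0, 20.0, 39)
scores = [g_mean(y, predict_with_costs(p_min, c)) for c in grid]
best_cost = grid[int(np.argmax(scores))]

print("equal costs G-mean:", round(scores[0], 4))
print("best cost:", best_cost, "G-mean:", round(max(scores), 4))
```

A grid search of this kind only evaluates a fixed set of candidate costs, whereas an evolutionary optimizer can adapt its search to the data, which is the motivation given in the text for optimizing the costs with adaptive differential evolution.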
VI-E Computational Cost
The average computational times of ECS-DBN, DBN, ADASYN-DBN, SMOTE-DBN, SMOTE-borderline1-DBN, SMOTE-borderline2-DBN and SMOTE-SVM-DBN on the imbalanced gun drilling dataset are presented in Table IX. ECS-DBN consumes less average computational time than the other competing methods. Compared with the KEEL benchmark datasets, the gun drilling dataset contains far more data samples, which increases the computational complexity of the resampling methods. Hence, the proposed ECS-DBN is more efficient than some resampling methods on large datasets. Compared with the time required for DBN training, the time the evolutionary algorithm needs to estimate the misclassification costs is very small and negligible. In short, the proposed ECS-DBN approach is both efficient and effective.
In this paper, an evolutionary cost-sensitive deep belief network (ECS-DBN) is proposed for imbalanced classification problems. We have shown that ECS-DBN significantly outperforms other competing techniques on 58 benchmark datasets and a real-world dataset. The proposed ECS-DBN improves DBN by applying a cost-sensitive learning strategy. To deal with misclassification costs that are unknown in practice, an adaptive differential evolution algorithm is used to find them. Since many real-world data are naturally imbalanced and the misclassification costs of the different classes are usually unknown, ECS-DBN offers an effective solution. ECS-DBN is also computationally more efficient than some popular resampling methods on large-scale datasets, and it can be easily extended to multi-class scenarios. In this paper, we only incorporate the cost-sensitive learning technique at the algorithmic level. However, the imbalanced distribution in feature space may also affect the performance of learning models. In the future, cost-sensitive methods could also be applied to high-dimensional data and dynamic data. Furthermore, online learning usually suffers from concept drift with an imbalance ratio that changes over time. ECS-DBN can be further extended to online imbalanced classification problems with suitable online learning strategies. The proposed approach can also be applied to other deep learning models such as convolutional neural networks.
Chong Zhang and Haizhou Li were supported by Neuromorphic Computing Program, RIE2020 AME Programmatic Grant, A*STAR, Singapore.
-  C. Zhang, J. H. Sun, and K. C. Tan, “Deep belief networks ensemble with multi-objective optimization for failure diagnosis,” in IEEE Int. Conf. Syst. Man Cyb. (SMC), 2015. IEEE, Oct 2015, pp. 32–37.
-  T. Fawcett and F. Provost, “Adaptive fraud detection,” Data Min. Knowl. Disc., vol. 1, no. 3, pp. 291–316, 1997.
-  K. J. Ezawa, M. Singh, and S. W. Norton, “Learning goal oriented bayesian networks for telecommunications risk management,” in ICML, 1996, pp. 139–147.
-  J. Sun, M. Rahman, Y. Wong, and G. Hong, “Multiclassification of tool wear with support vector machine by manufacturing loss consideration,” Inter. J. Machine Tools Manuf., vol. 44, no. 11, pp. 1179–1187, 2004.
-  C. Zhang, G. S. Hong, H. Xu, K. C. Tan, J. H. Zhou, H. L. Chan, and H. Li, “A data-driven prognostics framework for tool remaining useful life estimation in tool condition monitoring,” in IEEE Int. Conf. Emerg. (ETFA), 2017. IEEE, Sept 2017, pp. 1–8.
-  H. Xu, C. Zhang, G. S. Hong, J. H. Zhou, J. Hong, and K. S. Woon, “Gated recurrent units based neural network for tool condition monitoring,” in IEEE Int. Joint Conf. Neural Netw. (IJCNN), 2018, accepted. IEEE, July 2018.
-  R. M. Valdovinos and J. S. Sanchez, “Class-dependant resampling for medical applications,” in Int. Conf. Mach. Learn. Appl., 2005. IEEE, 2005, pp. 6–pp.
-  S. K. Goh, H. A. Abbass, K. C. Tan, and A. Al Mamun, “Artifact removal from eeg using a multi-objective independent component analysis model,” in Int. Conf. Neur. Inform. Process. Springer, 2014, pp. 570–577.
-  S. K. Goh, H. A. Abbass, K. C. Tan, A. Al-Mamun, C. Guan, and C. C. Wang, “Multiway analysis of eeg artifacts based on block term decomposition,” in Int. Joint Conf. Neur. Netw. (IJCNN), 2016. IEEE, 2016, pp. 913–920.
-  Y. Wang, Y.-C. Wang, Q. Zhang, F. Lin, C.-K. Goh, X. Wang, and H.-S. Seah, “Histogram equalization and specification for high-dimensional data visualization using radviz,” in Proc. 34th Ann. Conf. Comput. Graphics Int. ACM, 2017.
-  H. He and Y. Ma, Imbalanced learning: foundations, algorithms, and applications. John Wiley & Sons, 2013.
-  N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic minority over-sampling technique,” J. Artif. Intell. Res., vol. 16, pp. 321–357, 2002.
-  H. He, Y. Bai, E. A. Garcia, and S. Li, “Adasyn: Adaptive synthetic sampling approach for imbalanced learning,” in Neural Networks, 2008. IJCNN 2008.(IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on. IEEE, 2008, pp. 1322–1328.
-  H. Han, W.-Y. Wang, and B.-H. Mao, “Borderline-smote: a new over-sampling method in imbalanced data sets learning,” Adv. Intell. Comput., pp. 878–887, 2005.
-  S. García and F. Herrera, “Evolutionary undersampling for classification with imbalanced datasets: Proposals and taxonomy,” Evol. Comput., vol. 17, no. 3, pp. 275–306, 2009.
-  S. García, I. Triguero, C. J. Carmona, F. Herrera et al., “Evolutionary-based selection of generalized instances for imbalanced classification,” Knowl-based Syst., vol. 25, no. 1, pp. 3–12, 2012.
-  P. Lim, C. K. Goh, and K. C. Tan, “Evolutionary cluster-based synthetic oversampling ensemble (eco-ensemble) for imbalance learning,” IEEE T. Cybernetics, 2016.
-  Z.-H. Zhou and X.-Y. Liu, “Training cost-sensitive neural networks with methods addressing the class imbalance problem,” IEEE T. Knowl. Data En., vol. 18, no. 1, pp. 63–77, 2006.
-  S. Datta and S. Das, “Near-bayesian support vector machines for imbalanced data classification with equal or unequal misclassification costs,” Neural Networks, vol. 70, pp. 39 – 52, 2015.
-  W. Zong, G.-B. Huang, and Y. Chen, “Weighted extreme learning machine for imbalance learning,” Neurocomputing, vol. 101, pp. 229 – 242, 2013.
-  C. L. Castro and A. P. Braga, “Novel cost-sensitive approach to improve the multilayer perceptron performance on imbalanced data,” IEEE T. Neur. Net. Lear., vol. 24, no. 6, pp. 888–899, 2013.
-  G. Krempl, D. Kottke, and V. Lemaire, “Optimised probabilistic active learning (opal),” Mach. Learn., vol. 100, no. 2-3, pp. 449–476, 2015.
-  C. Drummond, R. C. Holte et al., “C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling,” in Workshop on learning from imbalanced datasets II, vol. 11. Citeseer, 2003.
-  G. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
-  G. Hinton, “A practical guide to training restricted boltzmann machines,” Momentum, vol. 9, no. 1, p. 926, 2010.
-  A.-r. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” IEEE T. Audio Speech, vol. 20, no. 1, pp. 14–22, 2012.
-  A.-r. Mohamed, T. N. Sainath, G. Dahl, B. Ramabhadran, G. E. Hinton, and M. A. Picheny, “Deep belief networks using discriminative features for phone recognition,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process. (ICASSP). IEEE, 2011, pp. 5060–5063.
-  B.-Y. Qu, P. N. Suganthan, and J.-J. Liang, “Differential evolution with neighborhood mutation for multimodal optimization,” IEEE T. Evolut. Comput., vol. 16, no. 5, pp. 601–614, 2012.
-  X.-S. Yang, “Firefly algorithms for multimodal optimization,” in International symposium on stochastic algorithms. Springer, 2009, pp. 169–178.
-  J. Zhang and A. C. Sanderson, “Jade: adaptive differential evolution with optional external archive,” IEEE T. Evolut. Comput., vol. 13, no. 5, pp. 945–958, 2009.
-  C. Zhang, P. Lim, A. K. Qin, and K. C. Tan, “Multiobjective deep belief networks ensemble for remaining useful life estimation in prognostics,” IEEE T. Neur. Net. Lear., vol. 28, no. 10, pp. 2306–2318, Oct 2017.
-  H. He, E. Garcia et al., “Learning from imbalanced data,” IEEE T. Knowl. Data En., vol. 21, no. 9, pp. 1263–1284, 2009.
-  K. Napierała and J. Stefanowski, “Addressing imbalanced data with argument based rule learning,” Expert Syst. Appl., vol. 42, no. 24, pp. 9468–9481, 2015.
-  J. Zheng, “Cost-sensitive boosting neural networks for software defect prediction,” Expert Syst. Appl., vol. 37, no. 6, pp. 4537–4543, 2010.
-  A. Bertoni, M. Frasca, and G. Valentini, “Cosnet: a cost sensitive neural network for semi-supervised learning in graphs,” Mach. Learn. Knowl. Discov. Databases, pp. 219–234, 2011.
-  B. Gu, V. S. Sheng, K. Y. Tay, W. Romano, and S. Li, “Cross validation through two-dimensional solution surface for cost-sensitive svm,” IEEE T. Pattern Anal., vol. 39, no. 6, pp. 1103–1121, 2017.
-  S. C. Tan, J. Watada, Z. Ibrahim, and M. Khalid, “Evolutionary fuzzy artmap neural networks for classification of semiconductor defects,” IEEE T. Neur. Net. Lear., vol. 26, no. 5, pp. 933–950, 2015.
-  S. H. Khan, M. Hayat, M. Bennamoun, F. A. Sohel, and R. Togneri, “Cost-sensitive learning of deep feature representations from imbalanced data,” IEEE T. Neur. Net. Lear., 2017.
-  M. Kukar, I. Kononenko et al., “Cost-sensitive learning with neural networks.” in ECAI. Citeseer, 1998, pp. 445–449.
-  L. Zhang and D. Zhang, “Evolutionary cost-sensitive extreme learning machine,” IEEE T. Neur. Net. Lear., 2016.
-  J. Li, X. Li, and X. Yao, “Cost-sensitive classification with genetic programming,” in IEEE Congr. Evolut. Comput. 2005., vol. 3. IEEE, 2005, pp. 2114–2121.
-  D. J. Drown, T. M. Khoshgoftaar, and R. Narayanan, “Using evolutionary sampling to mine imbalanced data,” in Sixth Int. Conf. Mach. Learn. Appl. ICMLA 2007. IEEE, 2007, pp. 363–368.
-  S. Zou, Y. Huang, Y. Wang, J. Wang, and C. Zhou, “Svm learning from imbalanced data by ga sampling for protein domain prediction,” in The 9th Int. Conf. Young Comput. Scientists, ICYCS 2008. IEEE, 2008, pp. 982–987.
-  A. Ghazikhani, H. S. Yazdi, and R. Monsefi, “Class imbalance handling using wrapper-based random oversampling,” in 20th Iranian Conf. Electr. Eng. (ICEE), 2012. IEEE, 2012, pp. 611–616.
-  D. Y. Harvey and M. D. Todd, “Automated feature design for numeric sequence classification by genetic programming,” IEEE T. Evolut. Comput., vol. 19, no. 4, pp. 474–489, 2015.
-  A. Orriols-Puig and E. Bernadó-Mansilla, “Evolutionary rule-based systems for imbalanced data sets,” Soft Comput. A Fusion of Found. Method. Appl., vol. 13, no. 3, pp. 213–225, 2009.
-  P. Ducange, B. Lazzerini, and F. Marcelloni, “Multi-objective genetic fuzzy classifiers for imbalanced and cost-sensitive datasets,” Soft Comput., vol. 14, no. 7, pp. 713–728, 2010.
-  J. Luo, L. Jiao, and J. A. Lozano, “A sparse spectral clustering framework via multiobjective evolutionary algorithm,” IEEE T. Evolut. Comput., vol. 20, no. 3, pp. 418–433, 2016.
-  U. Bhowan, M. Johnston, M. Zhang, and X. Yao, “Evolving diverse ensembles using genetic programming for classification with unbalanced data,” IEEE T. Evolut. Comput., vol. 17, no. 3, pp. 368–386, 2013.
-  ——, “Reusing genetic programming for ensemble selection in classification of unbalanced data,” IEEE T. Evolut. Comput., vol. 18, no. 6, pp. 893–908, 2014.
-  P. Wang, M. Emmerich, R. Li, K. Tang, T. Bäck, and X. Yao, “Convex hull-based multiobjective genetic programming for maximizing receiver operating characteristic performance,” IEEE T. Evolut. Comput., vol. 19, no. 2, pp. 188–200, 2015.
-  M. D. Pérez-Godoy, A. Fernández, A. J. Rivera, and M. J. del Jesus, “Analysis of an evolutionary RBFN design algorithm, CO2RBFN, for imbalanced data sets,” Patt. Recog. Lett., vol. 31, no. 15, pp. 2375–2388, 2010.
-  G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal, “The ‘wake-sleep’ algorithm for unsupervised neural networks,” Science, vol. 268, no. 5214, p. 1158, 1995.
-  G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
-  C. Elkan, “The foundations of cost-sensitive learning,” in Int. Joint Conf. Artif. Intell., vol. 17, no. 1. Citeseer, 2001, pp. 973–978.
-  J. O. Berger, Statistical decision theory and Bayesian analysis. Springer Science & Business Media, 2013.
-  A. K. Qin and P. N. Suganthan, “Self-adaptive differential evolution algorithm for numerical optimization,” in Evolutionary Computation, 2005. The 2005 IEEE Congress on, vol. 2. IEEE, 2005, pp. 1785–1791.
-  H. M. Nguyen, E. W. Cooper, and K. Kamei, “Borderline over-sampling for imbalanced data classification,” Int. J. Knowl. Eng. Soft Data Paradigms, vol. 3, no. 1, pp. 4–21, 2011.
-  J. Alcala-Fdez, L. Sanchez, S. Garcia, M. J. del Jesus, S. Ventura, J. Garrell, J. Otero, C. Romero, J. Bacardit, V. M. Rivas et al., “Keel: a software tool to assess evolutionary algorithms for data mining problems,” Soft Comput., vol. 13, no. 3, pp. 307–318, 2009.
-  G. Lemaître, F. Nogueira, and C. K. Aridas, “Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning,” J. Mach. Learn. Res., vol. 18, no. 17, pp. 1–5, 2017.
-  M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, “A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches,” IEEE T. Syst. Man Cy. C, vol. 42, no. 4, pp. 463–484, 2012.
-  M. Lin, K. Tang, and X. Yao, “Dynamic sampling approach to training neural networks for multiclass imbalance classification,” IEEE T. Neur. Net. Lear., vol. 24, no. 4, pp. 647–660, 2013.
-  B. Wang and J. Pineau, “Online bagging and boosting for imbalanced data streams,” IEEE T. Knowl. Data En., vol. 28, no. 12, pp. 3353–3366, 2016.
-  C. Zhang, X. Yao, J. Zhang, and H. Jin, “Tool condition monitoring and remaining useful life prognostic based on a wireless sensor in dry milling operations,” Sensors, vol. 16, no. 6, p. 795, 2016.
-  Y. Wang, W. Jia, and J. Zhang, “The force system and performance of the welding carbide gun drill to cut aisi 1045 steel,” Inter. J. Adv. Manuf. Technol., vol. 74, no. 9-12, pp. 1431–1443, 2014.
-  D. Biermann and M. Kirschner, “Experimental investigations on single-lip deep hole drilling of superalloy inconel 718 with small diameters,” J. Manuf. Processes, vol. 20, pp. 332–339, 2015.
-  J. Hong, J. Zhou, H. L. Chan, C. Zhang, H. Xu, and G. S. Hong, “Tool condition monitoring in deep hole gun drilling: a data-driven approach,” in IEEE Int. Conf. Ind. Eng. Eng. Manage. (IEEM), 2017. IEEE, Dec 2017, pp. 2148–2152.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.
Chong Zhang received the B.Eng. degree in Engineering from Harbin Institute of Technology, China, and the M.Sc. degree from National University of Singapore in 2011 and 2012, respectively. She is currently a Ph.D. student as well as a research engineer in the Department of Electrical and Computer Engineering at National University of Singapore. Her research interests include computational intelligence, machine learning/deep learning, data science and their applications in big data analytics, diagnostics, prognostics, health condition monitoring, voice conversion, etc.
Kay Chen Tan (SM’08–F’14) is a full Professor with the Department of Computer Science, City University of Hong Kong. He is the Editor-in-Chief of IEEE Transactions on Evolutionary Computation, was the EiC of IEEE Computational Intelligence Magazine (2010-2013), and currently serves as an Editorial Board member of 20+ journals. He is an elected member of IEEE CIS AdCom (2017-2019). He has published 200+ refereed articles and 6 books. He is a Fellow of IEEE.
Haizhou Li (M’91–SM’01–F’14) received the B.Sc., M.Sc., and Ph.D. degrees in electrical and electronic engineering from South China University of Technology, Guangzhou, China in 1984, 1987, and 1990 respectively. Dr Li is currently a Professor at the Department of Electrical and Computer Engineering, National University of Singapore (NUS). He is also a Conjoint Professor at the University of New South Wales, Australia. His research interests include automatic speech recognition, speaker/language recognition, and natural language processing. Prior to joining NUS, he taught in the University of Hong Kong (1988-1990) and South China University of Technology (1990-1994). He was a Visiting Professor at CRIN in France (1994-1995), Research Manager at the Apple-ISS Research Centre (1996-1998), Research Director in Lernout & Hauspie Asia Pacific (1999-2001), Vice President in InfoTalk Corp. Ltd. (2001-2003), and the Principal Scientist and Department Head of Human Language Technology in the Institute for Infocomm Research, Singapore (2003-2016). Dr Li is currently the Editor-in-Chief of IEEE/ACM Transactions on Audio, Speech and Language Processing (2015-2018), a Member of the Editorial Board of Computer Speech and Language (2012-2018). He was an elected Member of IEEE Speech and Language Processing Technical Committee (2013-2015), the President of the International Speech Communication Association (2015-2017), the President of Asia Pacific Signal and Information Processing Association (2015-2016), and the President of Asian Federation of Natural Language Processing (2017-2018). He was the General Chair of ACL 2012 and INTERSPEECH 2014. Dr Li is a Fellow of the IEEE. He was a recipient of the National Infocomm Award 2002 and the President’s Technology Award 2013 in Singapore. He was named one of the two Nokia Visiting Professors in 2009 by the Nokia Foundation.
Geok Soon Hong received the B.Eng. degree in Control Engineering in 1982 from University of Sheffield, UK. He was awarded a university scholarship to further his studies and obtained a Ph.D. degree in control engineering in 1987. The topic of his research was in stability and performance analysis for systems with multi-rate sampling problems. He is now an Associate Professor at the Department of Mechanical Engineering, National University of Singapore (NUS), Singapore. His research interests are in Control Theory, Multirate sampled data system, Neural network Applications and Industrial Automation, Modeling and control of Mechanical Systems, Tool Condition Monitoring, AI techniques in monitoring and Diagnostics.
|Holm post-hoc Test||-||3.60922E-03||3.44739E-11||5.76082E-11||6.06238E-11||3.10875E-11||2.12982E-10|
indicates that the difference between the proposed algorithm and all other compared algorithms is statistically significant using the pairwise Wilcoxon rank sum test at the significance level.