Impacts of Dirty Data: An Experimental Evaluation

Abstract

Data quality issues have attracted widespread attention due to the negative impacts of dirty data on data mining and machine learning results. The relationship between data quality and the accuracy of results could be applied to the selection of an appropriate algorithm under data quality considerations and to the determination of the share of data to clean. However, little research has focused on exploring such a relationship. Motivated by this, this paper conducts an experimental comparison of the effects of missing, inconsistent, and conflicting data on classification, clustering, and regression algorithms. Based on the experimental findings, we provide guidelines for algorithm selection and data cleaning.

Data Quality, Classification, Clustering, Regression

1 Introduction

Data quality has become a serious issue that cannot be overlooked in both the data mining and machine learning communities. We refer to data with quality problems as dirty data. Since dirty data affect the accuracy of data mining and machine learning (e.g., classification, clustering, or regression) tasks, we need to know the relationship between the quality of the input data set and the accuracy of the results. Based on such a relationship, we could select an appropriate algorithm with data quality issues in mind and determine the share of data to clean.

Due to the large collection of classification, clustering, and regression algorithms, it is difficult for users to decide which algorithm should be adopted. Knowledge of the effects of data quality on algorithms is helpful for algorithm selection. Therefore, a study of the impacts of dirty data is in urgent demand.

Before a classification, clustering, or regression task, data cleaning is necessary to guarantee data quality. Various data cleaning approaches have been proposed, e.g., data repairing with integrity constraints [1, 2], knowledge-based cleaning systems [3, 4], and crowdsourced data cleaning [3, 5]. These methods improve data quality dramatically, but the costs of data cleaning remain high [6]. If we know how dirty data affect the accuracy of results, we could clean data selectively according to the accuracy requirements instead of cleaning the entire data set. As a result, data cleaning costs are reduced. Therefore, the study of the relationship between data quality and the accuracy of results is in demand.

Unfortunately, little research has been conducted to explore the specific impacts of dirty data on different algorithms. Thus, this paper aims to fill this gap, which brings the following challenges.

  1. Due to the great number of classification, clustering, and regression algorithms, the first challenge is how to choose algorithms for experiments.

  2. Since existing measures (e.g., Precision, Recall, F-measure) are unable to quantify the fluctuation degrees of results, they are insufficient to evaluate the impacts of dirty data on algorithms. Thus, how to define new metrics for evaluation is the second challenge.

  3. Since there is no well-planned dirty data benchmark, we have to generate data sets with consideration of error type, error rate, and data size. Therefore, the third challenge is how to design data sets to test the impacts of dirty data.

To address these challenges, this paper selects sixteen classical algorithms from the data mining and machine learning communities. We make comprehensive analyses of possible dirty-data impacts on these algorithms. Then, we evaluate the specific effects of different types of dirty data on different algorithms. Based on the experimental results, we provide suggestions for algorithm selection and data cleaning. In the research field, dirty data are classified into a variety of types [7], such as missing data, inconsistent data, and conflicting data. Most existing research focuses on improving data quality for these three kinds of dirty data [8, 9, 10]. Thus, this paper focuses on these three main types.

In summary, our contributions in this paper are listed as follows.

  1. We conduct an experimental comparison for the effects of missing, inconsistent, and conflicting data on classification, clustering, and regression algorithms, respectively. To the best of our knowledge, this is the first paper that studies this issue.

  2. We introduce two novel metrics to evaluate dirty-data impacts on algorithms: one quantifies the sensibility of an algorithm to dirty data, and the other quantifies the error rate an algorithm can tolerate.

  3. Based on the evaluation results, we provide guidelines of algorithm selection and data cleaning for users. We also give suggestions for future work to researchers and practitioners.

The rest of the paper is organized as follows. Dimensions of data quality are reviewed in Section 2. Section 3 analyzes dirty-data impacts on six classical classification algorithms. We discuss impacts of dirty data on six clustering methods in Section 4. Section 5 introduces dirty-data impacts on four regression approaches. Our experiments are described in Section 6, and our guidelines and suggestions are presented in Section 7.

2 Dimensions of Data Quality

Data quality has many dimensions [11]. In this paper, we focus on three basic dimensions, completeness, consistency, and entity identity [7]. For these dimensions, the corresponding dirty data types are missing data, inconsistent data, and conflicting data. In this section, we introduce these three kinds of dirty data.

Missing data refer to values that are absent from databases. For example, in Table 1, t1[Country] and t2[City] are missing data.

Tuple  Student No.  Name    City  Country
t1     170302       Alice   NYC
t2     170302       Steven        FR
t3     170304       Bob     NYC   U.S.A
t4     170304       Bob     LA    U.S.A
Table 1: Student Information

Inconsistent data are identified as violations of consistency rules, which describe semantic constraints on the data. For example, a consistency rule “[Student No.] → [Name]” on Table 1 means that Student No. determines Name. As the table shows, t1[Student No.] = t2[Student No.], but t1[Name] ≠ t2[Name]. Thus, the values t1[Student No.], t2[Student No.], t1[Name], and t2[Name] are inconsistent.

Conflicting data refer to different values that describe the same attribute of the same entity. For example, in Table 1, both t3 and t4 describe Bob's information, but t3[City] and t4[City] are different. Thus, t3[City] and t4[City] are conflicting data.

3 Classification Algorithms

In this section, we analyze possible dirty-data impacts on six classical classification algorithms, Decision Tree, K-Nearest Neighbor Classifier, Naive Bayes, Bayesian Network, Logistic Regression, and Random Forests. We choose these algorithms since they are always used as competitive classification algorithms [12, 13, 14, 15].

For simplicity, we define the notations used throughout this paper in Table 2.

notation     definition
$A_i$        the $i$-th attribute
$C$          the class/target attribute
$c_j$        the $j$-th class label
$m$          the number of attributes
$k$          the number of class labels
$D_{train}$  the training set of the given data
$D_{test}$   the test set of the given data
Table 2: Notation Definition

3.1 Decision Tree

A decision tree [16] is induced by repeatedly splitting the training data on the attribute that optimizes a certain criterion. To determine the best split, a measure of node impurity is needed. Popular measures include the Gini index [17], information gain [18], and misclassification error [18].

Given a node $t$, the Gini index for $t$ is defined as follows.

$$\mathrm{Gini}(t) = 1 - \sum_{j=1}^{k} p(j \mid t)^2 \qquad (1)$$

where $p(j \mid t)$ is the relative frequency of class $c_j$ at node $t$.

Entropy at node $t$ is defined as follows.

$$\mathrm{Entropy}(t) = -\sum_{j=1}^{k} p(j \mid t) \log p(j \mid t) \qquad (2)$$

where $p(j \mid t)$ is the relative frequency of class $c_j$ at node $t$.

Suppose a parent node $p$ is split into $n$ partitions. Based on entropy, the information gain of the split is defined as follows.

$$\mathrm{Gain}_{split} = \mathrm{Entropy}(p) - \sum_{i=1}^{n} \frac{n_i}{n_p}\,\mathrm{Entropy}(i) \qquad (3)$$

where $i$ is the $i$-th partition, $n_i$ is the number of records in partition $i$, and $n_p$ is the number of records at the parent node $p$.

Misclassification error at node $t$ is defined as follows.

$$\mathrm{Error}(t) = 1 - \max_{j} p(j \mid t) \qquad (4)$$

where $p(j \mid t)$ is the relative frequency of class $c_j$ at node $t$.

In decision tree induction, the attribute with the minimum Gini index/maximum information gain/minimum misclassification error is chosen as the split node first. With the induced decision tree, the records in $D_{test}$ are classified.
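
To make the impurity measures concrete, the following minimal Python sketch (not the paper's C++ implementation) computes the Gini index, entropy, misclassification error, and information gain for a hypothetical candidate split; the class labels and the two-way split are illustrative assumptions.

```python
import math
from collections import Counter

def gini(labels):
    """Gini index of a node: 1 - sum_j p(j|t)^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy of a node: -sum_j p(j|t) * log2 p(j|t)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def misclassification_error(labels):
    """Misclassification error of a node: 1 - max_j p(j|t)."""
    n = len(labels)
    return 1.0 - max(Counter(labels).values()) / n

def information_gain(parent_labels, partitions):
    """Entropy of the parent minus the weighted entropy of its partitions."""
    n = len(parent_labels)
    return entropy(parent_labels) - sum(len(p) / n * entropy(p) for p in partitions)

# Example: a candidate split of 10 training records into two partitions.
parent = ["yes"] * 6 + ["no"] * 4
split = [["yes"] * 5 + ["no"] * 1, ["yes"] * 1 + ["no"] * 3]
print(gini(parent), entropy(parent), misclassification_error(parent))
print(information_gain(parent, split))
```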

When some dirty values exist in $D_{train}$, incorrect data could affect the relative class frequencies $p(j \mid t)$. Since $p(j \mid t)$ determines the measure of node impurity (e.g., Gini index, entropy, misclassification error), the split attribute might be poorly chosen. Thus, the poorly induced decision tree could cause an inaccurate classification result. When some values of the attributes ($A_1$, $A_2$,…, $A_m$) in $D_{test}$ are dirty, these data might lead a record down a wrong branch of the decision tree, which results in a wrong class label.

3.2 K-Nearest Neighbor Classifier

The k-nearest neighbor classifier [19] (KNN for brief) requires three things: the set of stored training records $D_{train}$, a distance metric (e.g., Euclidean distance [20]) to compute the distance between records, and the value of $k$, the number of nearest neighbors to retrieve. To classify a record in $D_{test}$, we first compute its distance to the training records. Then, the $k$ nearest neighbors are identified. Finally, we use the class labels of the $k$ nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote).

Given two records $x$ and $y$, the Euclidean distance between them is defined as follows.

$$d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2} \qquad (5)$$

where $x_i$ is the $i$-th attribute value of $x$, and $y_i$ is the $i$-th attribute value of $y$.
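
As an illustration of this procedure, here is a minimal Python sketch (the paper's experiments were implemented in C++) of KNN classification with the Euclidean distance of Eq. (5) and a majority vote; the tiny training set is hypothetical.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Equation (5): square root of the sum of squared attribute differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(train, record, k):
    """Classify `record` by majority vote among its k nearest training records.

    `train` is a list of (feature_vector, class_label) pairs.
    """
    neighbors = sorted(train, key=lambda item: euclidean(item[0], record))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training set with two numeric attributes.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.1), "B"), ((4.8, 5.3), "B")]
print(knn_classify(train, (1.1, 0.9), k=3))  # -> "A"
```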

When some dirty data exist in the attributes ($A_1$, $A_2$,…, $A_m$) of $D_{train}$ or $D_{test}$, the value of $d(x, y)$ may change. Accordingly, the nearest-neighbor list would be affected, which leads to a wrong class label. When some values of the class attribute $C$ in $D_{train}$ are dirty, these incorrect data might affect the vote on class labels, which causes a wrong classification result.

3.3 Bayes Classifiers

A Bayes classifier is a probabilistic framework for solving classification problems. It computes the posterior probability $P(C \mid A_1, A_2, \dots, A_m)$ for all values of the class attribute $C$ according to the Bayes theorem, which is expanded as follows.

$$P(C \mid A_1, A_2, \dots, A_m) = \frac{P(A_1, A_2, \dots, A_m \mid C)\,P(C)}{P(A_1, A_2, \dots, A_m)} \qquad (6)$$

The goal of Bayes classifiers is to choose the value of $C$ that maximizes $P(C \mid A_1, A_2, \dots, A_m)$, which is equivalent to maximizing $P(A_1, A_2, \dots, A_m \mid C)\,P(C)$. Naive Bayes and Bayesian network are classical Bayes classifiers.

Naive Bayes

The Naive Bayes classifier [21] assumes independence among attributes when the class is given, i.e., $P(A_1, A_2, \dots, A_m \mid c_j) = \prod_{i=1}^{m} P(A_i \mid c_j)$. Since $P(A_i \mid c_j)$ for all $A_i$ and $c_j$ can be estimated from $D_{train}$, records in $D_{test}$ are classified to the class $c_j$ that maximizes $P(c_j)\prod_{i=1}^{m} P(A_i \mid c_j)$.
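
The following minimal Python sketch illustrates this estimate-and-maximize procedure for categorical attributes; the add-one smoothing and the toy records are illustrative assumptions, not part of the paper's implementation.

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    """Estimate P(c_j) and P(A_i = v | c_j) from categorical training data."""
    priors = Counter(labels)
    cond = defaultdict(Counter)  # (attribute index, class) -> value counts
    for rec, c in zip(records, labels):
        for i, v in enumerate(rec):
            cond[(i, c)][v] += 1
    return priors, cond, len(labels)

def classify(record, priors, cond, n):
    """Choose the class maximizing P(c_j) * prod_i P(A_i | c_j), with simple add-one smoothing."""
    best, best_score = None, -1.0
    for c, prior_count in priors.items():
        score = prior_count / n
        for i, v in enumerate(record):
            counts = cond[(i, c)]
            score *= (counts[v] + 1) / (sum(counts.values()) + len(counts) + 1)
        if score > best_score:
            best, best_score = c, score
    return best

records = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_naive_bayes(records, labels)
print(classify(("rain", "hot"), *model))  # -> "yes"
```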

When some dirty values exist in the attributes ($A_1$, $A_2$,…, $A_m$) of $D_{train}$ or $D_{test}$, incorrect data could affect the values of $P(A_i \mid c_j)$, which leads to wrong values of $P(c_j)\prod_{i} P(A_i \mid c_j)$. Since the maximal posterior probability determines the final class label, the classification result would be impacted. When some values of the class attribute $C$ in $D_{train}$ are dirty, the values of both $P(c_j)$ and $P(A_i \mid c_j)$ may change. Accordingly, the posterior probability could be affected, which causes an incorrect class label.

Bayesian Network

A Bayesian network [22] is a directed acyclic graph (DAG for brief) with conditional probability tables (CPT for brief). In the DAG, a node is conditionally independent of its non-descendants if its parents are known.

When the Bayesian network structure is fixed, we estimate the conditional probabilities based on $D_{train}$ and learn the CPT. When the Bayesian network structure is unknown, we first estimate the structure using the minimum description length principle [23], and then learn the Bayesian network and CPT. Based on the learned model, we make Bayesian network inference and compute the maximal posterior probability to determine the class labels of $D_{test}$.

Given $D_{train}$, the description length of a Bayesian network $B$ is defined as follows.

$$DL(B \mid D_{train}) = \sigma\,|B| - \sum_{i=1}^{|D_{train}|} \log P_B(D_i) \qquad (7)$$

where $|B|$ is the number of parameters of $B$, $\sigma$ is the number of bits for describing a parameter, and $D_i$ is the $i$-th instance in $D_{train}$.

When some dirty values exist in $D_{train}$, dirty data may affect the value of $DL(B \mid D_{train})$. Accordingly, the learned Bayesian network and CPT would be incorrect, which leads to inaccurate inference based on the network. Since posterior probabilities are computed with this inference, the maximal posterior probability could be impacted, which results in a wrong class label. When some values of the attributes ($A_1$, $A_2$,…, $A_m$) in $D_{test}$ are dirty, wrong values might affect the estimation of the maximal posterior probability, which leads to an incorrect classification result.

3.4 Logistic Regression

Logistic regression [24] is a binary classifier. In order to establish a regression function (the Sigmoid function [25]) describing the classification boundary, we use optimization methods (e.g., the gradient ascent method [26]) to determine the best regression coefficients of the function based on $D_{train}$. Once the regression function is constructed, we use it to classify records in $D_{test}$.

Given input data $x_1$, $x_2$,…, $x_m$, the input $z$ of the Sigmoid function is computed as follows.

$$z = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m \qquad (8)$$

The Sigmoid function is defined as follows.

$$\sigma(z) = \frac{1}{1 + e^{-z}} \qquad (9)$$
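
A minimal Python sketch of this training loop is shown below; it uses a simple stochastic gradient ascent on the log-likelihood as a stand-in for the gradient ascent method cited above, and the learning rate, epoch count, and toy data are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Equation (9): 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit weights w (including the intercept w0) by gradient ascent on the log-likelihood."""
    m = len(X[0])
    w = [0.0] * (m + 1)  # w[0] is the intercept
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            error = yi - sigmoid(z)          # gradient of the log-likelihood w.r.t. z
            w[0] += lr * error
            for j, xj in enumerate(xi):
                w[j + 1] += lr * error * xj
    return w

def predict(w, xi):
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1 if sigmoid(z) >= 0.5 else 0

X = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
y = [0, 0, 1, 1]
w = train_logistic_regression(X, y)
print([predict(w, xi) for xi in X])  # expected: [0, 0, 1, 1]
```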

When some dirty values exist in the attributes ($A_1$, $A_2$,…, $A_m$) of $D_{train}$ or $D_{test}$, wrong values could affect the value of $z$. Accordingly, the learned Sigmoid function would change, which leads to inaccurate class labels of $D_{test}$. When some values of the class attribute $C$ in $D_{train}$ are dirty, incorrect class labels may mislead the establishment of the Sigmoid function. Based on the imprecise function, the classification of $D_{test}$ might be affected.

3.5 Random Forests

In the training process, the random forests algorithm [27] constructs a set of base classifiers of decision trees. In the testing process, class labels of $D_{test}$ are predicted by aggregating the predictions made by multiple classifiers.

When some dirty values exist in $D_{train}$, incorrect data might affect splitting attribute selection in the decision trees, which causes inaccurate decision tree induction and wrong predictions on $D_{test}$. When some values of the attributes ($A_1$, $A_2$,…, $A_m$) in $D_{test}$ are dirty, these data might mislead the class labels of the corresponding records, which causes incorrect classification.

4 Clustering Algorithms

In this section, we discuss possible dirty-data impacts on six classical clustering algorithms, K-Means, LVQ, CLARANS, DBSCAN, BIRCH, and CURE. We choose these algorithms since they are always used as competitive clustering algorithms [28, 29, 30, 31].

4.1 Prototype-Based Clustering

Prototype-based clustering assumes that the clustering structure is portrayed by a group of prototypes. This kind of algorithm initializes prototypes first, and then updates them in an iterative process. K-Means [32], learning vector quantization [33] (LVQ for brief), and CLARANS [34] are three classical clustering methods.

K-Means

The K-Means clustering approach first selects $k$ points as the initial centroids. Then, we form $k$ clusters by assigning every point to its closest centroid, and recompute the centroid of each cluster. The iterative process ends when the centroids no longer change.
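
The following minimal Python sketch illustrates the assign-and-recompute iteration; the random initial centroids and the toy points are illustrative assumptions.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's iteration: assign points to the closest centroid, then recompute centroids."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids no longer change
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.1, 0.9), (5.0, 5.0), (5.2, 4.8)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```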

When some dirty values exist, incorrect data might affect computation of centroids. Accordingly, some points would be clustered to wrong class labels.

LVQ

The LVQ clustering method assumes that class labels exist in the data samples, and these marked labels can assist clustering. Given a sample set $D$ = {($x_1$, $y_1$), ($x_2$, $y_2$),…,($x_m$, $y_m$)}, each $x_j$ (1 ≤ $j$ ≤ $m$) is a feature vector ($x_{j1}$; $x_{j2}$;…; $x_{jn}$) expressed with $n$ attributes, and $y_j$ is the class label of $x_j$. The goal of LVQ is to learn a group of $n$-dimensional prototype vectors {$p_1$, $p_2$,…, $p_q$}, where each vector denotes a cluster and carries a cluster label taken from the set of class labels.

First, the LVQ algorithm initializes the prototype vectors. Then, the vectors are optimized in an iterative process. In each iteration, the algorithm randomly selects a marked training sample and finds the prototype vector with the shortest distance to the selected sample. The prototype vector is then updated: it is moved toward the sample if their labels are the same, and away from it otherwise.
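
A minimal Python sketch of this iterative update is shown below, following the standard LVQ rule that moves the winning prototype toward a same-label sample and away from a different-label one; the learning rate, iteration count, and toy samples are illustrative assumptions.

```python
import math
import random

def lvq(samples, prototypes, lr=0.1, iterations=1000, seed=0):
    """Learning vector quantization.

    `samples` and `prototypes` are lists of (feature_vector, label) pairs;
    the winning prototype is pulled toward same-label samples and pushed away
    from different-label samples.
    """
    random.seed(seed)
    protos = [(list(p), t) for p, t in prototypes]
    for _ in range(iterations):
        x, y = random.choice(samples)                        # pick a marked training sample
        j = min(range(len(protos)),
                key=lambda i: math.dist(x, protos[i][0]))    # nearest prototype
        p, t = protos[j]
        sign = 1.0 if t == y else -1.0                       # move toward or away
        protos[j] = ([pi + sign * lr * (xi - pi) for pi, xi in zip(p, x)], t)
    return [(tuple(p), t) for p, t in protos]

samples = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((5.0, 5.0), "B"), ((4.9, 5.2), "B")]
prototypes = [((0.0, 0.0), "A"), ((6.0, 6.0), "B")]
print(lvq(samples, prototypes))
```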

When some values in the feature vectors $x_j$ ($j$ = 1, 2,…, $m$) are dirty, wrong values could mislead the updating of the prototype vectors {$p_1$, $p_2$,…, $p_q$}. When some dirty values exist in the class labels $y_j$ ($j$ = 1, 2,…, $m$), incorrect class labels would directly affect the labels of {$p_1$, $p_2$,…, $p_q$}. When some values in $p_i$ ($i$ = 1, 2,…, $q$) of {$p_1$, $p_2$,…, $p_q$} are dirty, wrong values might impact the distance computation, which leads to an incorrect cluster label.

CLARANS

The CLARANS algorithm first selects $k$ points as centroids. Then, we randomly choose a centroid as the current point $c$, and a neighbor point $c'$ of $c$. We compute the cost difference between $c$ and $c'$. If the cost of $c'$ is less, we set it as the current point and select a neighbor of $c'$. If not, we find another neighbor of $c$. The iteration ends when the specified number of samplings is reached.

When some dirty values exist, incorrect data might affect computation of the cost difference. Accordingly, some points could be clustered to wrong class labels.

4.2 Density-Based Clustering

Density-based clustering locates regions of high density that are separated from one another by regions of low density. DBSCAN [35] is a basic density-based clustering algorithm. All noise points are discarded in DBSCAN, and clustering is performed on the remaining points. At first, we put an edge between all core points that are within $Eps$ (a specified radius) of each other. Then, each connected component is taken as a separate cluster, and each border point is assigned to one of the clusters of its associated core points.
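
The following minimal Python sketch illustrates the core idea: core points (those with at least min_pts neighbors within the radius) grow clusters, border points join them, and the remaining points are treated as noise; the eps and min_pts values are illustrative assumptions.

```python
import math

def dbscan(points, eps, min_pts):
    """Assign each point a cluster id; noise points get -1."""
    labels = [None] * len(points)
    cluster_id = -1

    def neighbors(i):
        # density = number of points within eps of point i (including itself)
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:           # not a core point: tentatively mark as noise
            labels[i] = -1
            continue
        cluster_id += 1                    # start a new cluster from this core point
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:            # border point previously marked as noise
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:    # j is also a core point: keep expanding
                queue.extend(j_neighbors)
    return labels

points = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (8.0, 8.0)]
print(dbscan(points, eps=0.5, min_pts=2))  # e.g., [0, 0, 0, -1]
```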

When some values are dirty, wrong values would impact the computation of density and point distances, since density is the number of points within $Eps$ of a point. Accordingly, some points might be assigned to incorrect clusters.

4.3 Hierarchical Clustering

Hierarchical clustering produces a set of nested clusters organized as a hierarchical tree. BIRCH [36] and CURE [37] are classical hierarchical algorithms.

BIRCH

BIRCH algorithm introduces a clustering feature tree (CF tree for brief) to summarize the inherent clustering structure of data. Firstly, we scan the given data to build an initial in-memory CF tree. Then, we use an arbitrary clustering algorithm (e.g., an agglomerative hierarchical clustering algorithm) to cluster the leaf nodes of the CF tree. Finally, we scan the data again and assign the data points using the cluster centroids found in the previous step as seeds.

When some dirty data exist, incorrect values could affect construction of CF tree and computation of cluster centroids, which makes data points assigned to wrong clusters.

CURE

Instead of representing clusters by their centroids, CURE algorithm uses a collection of representative points. There are three steps in this method. The first step is initialization. At first, we take a small sample of the given data and cluster it in main memory using a hierarchical method in which clusters are merged when they have a close pair of points (e.g., MIN clustering method). Then, we select a small set of points from each cluster to be representative points. These points should be chosen to be as far from one another as possible. Lastly, we move each of the representative points a fixed fraction of the distance between its location and the centroid of its cluster.

The second step is merging clusters. We merge two clusters if they have a pair of representative points, one from each cluster, that are sufficiently close.

The third step is point assignment. Each point $p$ is brought from secondary storage and compared with the representative points. We assign $p$ to the cluster of the representative point that is closest to $p$.

When some values in the given data are dirty, wrong values would affect the location of the representative points and the computation of the distance between a representative point and the centroid of its cluster. Accordingly, data points might be clustered into incorrect clusters.

5 Regression Algorithms

In this section, we analyze possible dirty-data impacts on four classical regression algorithms, Least Square Linear Regression, Maximum Likelihood Linear Regression, Polynomial Regression, and Stepwise Regression. We choose these algorithms since they are always used as competitive regression algorithms [38, 39, 40, 41].

5.1 Linear Regression

Given a data set $D$ = {($x_1$, $y_1$), ($x_2$, $y_2$),…,($x_n$, $y_n$)}, each $x_i$ = ($x_{i1}$; $x_{i2}$;…; $x_{im}$), $y_i \in \mathbb{R}$ (1 ≤ $i$ ≤ $n$). Linear regression learns a linear model $f(x_i) = w x_i + b$ such that $f(x_i) \simeq y_i$. With the model, we can predict real-valued labels as accurately as possible. The least square method [42] and the maximum likelihood method [43] are classical approaches to solve linear models.

Least Square Method

Least square linear regression minimizes the mean-square error to solve the linear model. In the one-dimensional case, we compute the values of the parameters $w$ and $b$ as follows.

$$w = \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}, \qquad b = \frac{1}{n}\sum_{i=1}^{n} (y_i - w x_i) \qquad (10)$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. After the parameters $w$ and $b$ are determined, a linear model is learned. Then, we predict the real-valued labels in $D_{test}$ with the model.
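
A minimal Python sketch of the closed-form computation in Eq. (10) for the one-dimensional case is shown below; the toy data are illustrative.

```python
def least_square_fit(xs, ys):
    """Closed-form least squares for a one-dimensional linear model y = w*x + b (Eq. 10)."""
    n = len(xs)
    x_mean = sum(xs) / n
    w = sum(y * (x - x_mean) for x, y in zip(xs, ys)) / (
        sum(x * x for x in xs) - sum(xs) ** 2 / n)
    b = sum(y - w * x for x, y in zip(xs, ys)) / n
    return w, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
w, b = least_square_fit(xs, ys)
print(w, b)            # roughly w = 1.94, b = 0.15
print(w * 5.0 + b)     # predict the label of an unseen record
```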

When some dirty values exist in $D_{train}$, incorrect values could affect the values of $w$ and $b$. Accordingly, the linear model would be affected, which leads to inaccurate prediction of the labels in $D_{test}$. When some data in the attributes ($x_1$, $x_2$,…, $x_m$) of $D_{test}$ are dirty, the computation of $f(x)$ might be affected.

Maximum Likelihood Method

Maximum likelihood linear regression computes the parameters with a likelihood function. Assuming Gaussian errors, it is defined as follows.

$$L(w, b) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - w x_i - b)^2}{2\sigma^2}\right) \qquad (11)$$

The goal of the maximum likelihood method is to compute the $w$ and $b$ that maximize $L(w, b)$. After the parameters $w$ and $b$ are determined, a linear model is learned. Then, we predict the real-valued labels in $D_{test}$ with the model.

When some data in $D_{train}$ are dirty, incorrect values could affect the computation of $w$ and $b$. Accordingly, the linear model would be impacted, which leads to inaccurate prediction of the labels in $D_{test}$. When some dirty values exist in the attributes ($x_1$, $x_2$,…, $x_m$) of $D_{test}$, the prediction of the labels might be affected.

5.2 Polynomial Regression

Polynomial regression [44] constructs a linear combination of higher-order terms, converting a feature into higher-order features. For instance, $y = wx + b$ can be transformed as follows.

$$y = w_1 x + w_2 x^2 + \cdots + w_d x^d + b \qquad (12)$$

According to $D_{train}$, the values of all parameters can be computed. After the polynomial regression model is determined, we predict the real-valued labels in $D_{test}$ with the model.
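
As a sketch of this feature expansion, the following Python snippet fits the coefficients of Eq. (12) by ordinary least squares on the expanded features using NumPy; the degree and the toy data are illustrative assumptions.

```python
import numpy as np

def polynomial_fit(xs, ys, degree):
    """Fit y = b + w1*x + w2*x^2 + ... + wd*x^d by least squares on the expanded features."""
    X = np.vander(xs, degree + 1, increasing=True)   # columns: x^0, x^1, ..., x^d
    coeffs, *_ = np.linalg.lstsq(X, np.asarray(ys), rcond=None)
    return coeffs                                     # [b, w1, ..., wd]

def polynomial_predict(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.8, 5.2, 10.1, 17.2]                      # roughly y = 1 + x^2
coeffs = polynomial_fit(xs, ys, degree=2)
print(coeffs)
print(polynomial_predict(coeffs, 5.0))
```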

When some dirty values exist in $D_{train}$, wrong values could affect the computation of the parameters. Accordingly, the polynomial model would be impacted, which leads to inaccurate prediction of the labels in $D_{test}$. When some data in the attributes ($x_1$, $x_2$,…, $x_m$) of $D_{test}$ are dirty, the prediction of the labels might be affected.

5.3 Stepwise Regression

Stepwise regression [45] introduces independent variables one by one; a variable is introduced only if its partial regression sum of squares is significant. After each new variable is added, the old variables in the regression model need to be tested one by one, and variables that test as insignificant are deleted. When no more new variables can be introduced, the iteration stops.

When some data in $D_{train}$ are dirty, incorrect values could affect the computation of the partial regression sums of squares. Accordingly, the test and selection of independent variables would be impacted, which leads to a biased regression model and inaccurate prediction of the labels in $D_{test}$. When some dirty values exist in the attributes ($x_1$, $x_2$,…, $x_m$) of $D_{test}$, the prediction of the labels might be affected.

6 Experimental Study

We evaluated dirty-data impacts on the sixteen classical algorithms discussed in Sections 3, 4, and 5. All datasets and source code are publicly available.

6.1 Experimental Setting

Datasets We selected 13 typical data sets with various types and sizes from the UCI public repository. Their basic information is shown in Table 3. Since these original datasets are complete and correct, we injected errors into them and then evaluated the performance of different algorithms on the dirty versions. Thus, the original datasets were used as the baseline, and the accuracy of the algorithms was measured against the results on the original datasets.

Name         Number of Attributes  Number of Records  Algorithm
Iris         4                     150                Classification, Clustering, Regression
Ecoli        8                     336                Classification
Car          6                     1728               Classification
Chess        36                    3196               Classification
Adult        14                    48842              Classification
Seeds        7                     210                Clustering
Abalone      8                     4177               Clustering
HTRU         9                     17898              Clustering
Activity     3                     67651              Clustering
Servo        4                     167                Regression
Housing      14                    506                Regression
Concrete     9                     1030               Regression
Solar Flare  10                    1389               Regression
Table 3: Datasets Information

Setup All experiments were conducted on a machine powered by two Intel(R) Xeon(R) E5-2609 v3@1.90GHz CPUs and 32GB memory, under CentOS7. All the algorithms were implemented in C++ and compiled with g++ 4.8.5.

Metrics First, we used standard Precision, Recall, and F-measure to evaluate the effectiveness of classification and clustering algorithms. These measures were computed as follows.

$$\mathrm{Precision} = \frac{n_{correct}}{n_{classified}} \qquad (13)$$

where $n_{correct}$ is the number of records correctly classified to class $c_j$, and $n_{classified}$ is the number of records classified to class $c_j$.

$$\mathrm{Recall} = \frac{n_{correct}}{n_{c_j}} \qquad (14)$$

where $n_{correct}$ is the number of records correctly classified to class $c_j$, and $n_{c_j}$ is the number of records of class $c_j$.

$$\mathrm{F\text{-}measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (15)$$

Also, we used RMSD (Root-mean-square Deviation), NRMSD (Normalized Root-mean-square Deviation), and CV(RMSD) (Coefficient of Variation of the RMSD) to evaluate effectiveness of regression algorithms. These measures were computed as follows.

$$\mathrm{RMSD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \qquad (16)$$

where $n$ is the number of predicted records, $\hat{y}_i$ is the predicted value of the $i$-th record, and $y_i$ is its observed value.

$$\mathrm{NRMSD} = \frac{\mathrm{RMSD}}{\hat{y}_{max} - \hat{y}_{min}} \qquad (17)$$

where $\hat{y}_{max}$ is the maximal predicted value, and $\hat{y}_{min}$ is the minimal predicted value.

$$\mathrm{CV(RMSD)} = \frac{\mathrm{RMSD}}{\hat{y}_{mean}} \qquad (18)$$

where $\hat{y}_{mean}$ is the mean of the predicted values.
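
A minimal Python sketch of these three regression metrics (Eqs. (16)-(18)) is shown below; the predicted and observed values are illustrative.

```python
import math

def rmsd(predicted, observed):
    """Root-mean-square deviation between predicted and observed labels (Eq. 16)."""
    n = len(predicted)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)

def nrmsd(predicted, observed):
    """RMSD normalized by the range of the predicted values (Eq. 17)."""
    return rmsd(predicted, observed) / (max(predicted) - min(predicted))

def cv_rmsd(predicted, observed):
    """RMSD divided by the mean of the predicted values (Eq. 18)."""
    return rmsd(predicted, observed) / (sum(predicted) / len(predicted))

predicted = [2.0, 3.1, 4.2, 5.0]
observed = [2.2, 2.9, 4.0, 5.4]
print(rmsd(predicted, observed), nrmsd(predicted, observed), cv_rmsd(predicted, observed))
```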

However, these metrics only show the variations of accuracy; they cannot quantify the fluctuation degrees. Therefore, we introduced novel metrics to evaluate dirty-data impacts on algorithms. We defined the first metric as follows.

Definition 1

Given the values of a measure $M$ of an algorithm at error rates a%, (a+x)%, (a+2x)%,…, (a+bx)% (a ≥ 0, x > 0, b > 0), the sensibility of the algorithm on dirty data is computed as $|M_{a\%} - M_{(a+x)\%}| + |M_{(a+x)\%} - M_{(a+2x)\%}| + \cdots + |M_{(a+(b-1)x)\%} - M_{(a+bx)\%}|$.

This metric measures the sensibility of an algorithm to data quality. The larger its value is, the more sensitive the algorithm is to data quality. Therefore, it shows the fluctuation degree of an algorithm quantitatively. Here, we explain its computation with Figure 1 as an example.

Example 1

Since the values of the measure (Precision) of the decision tree algorithm at 0%, 10%,…, 50% missing rate are given, the sensibility is computed as $|M_{0\%} - M_{10\%}| + |M_{10\%} - M_{20\%}| + \cdots + |M_{40\%} - M_{50\%}|$. Thus, on the Iris dataset, it is |78.37%-84.16%| + |84.16%-78.08%| + |78.08%-74.36%| + |74.36%-64.99%| + |64.99%-58.71%| = 31.24%. On the Ecoli dataset, it is |63.47%-62.93%| + |62.93%-53.97%| + |53.97%-50.93%| + |50.93%-48.07%| + |48.07%-34.5%| = 28.97%. On the Car dataset, it is |81.33%-60.93%| + |60.93%-43.7%| + |43.7%-42.87%| + |42.87%-40.47%| + |40.47%-35.47%| = 45.86%. On the Chess dataset, it is |82.17%-78.17%| + |78.17%-76.53%| + |76.53%-75.77%| + |75.77%-75.9%| + |75.9%-75.57%| = 6.86%. And on the Adult dataset, it is |80.5%-75.27%| + |75.27%-71.3%| + |71.3%-72.93%| + |72.93%-71.53%| + |71.53%-67.23%| = 16.53%. Thus, the average sensibility is 25.89%.
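
The following Python sketch reproduces this computation; the helper name sensibility is ours, and the input values are the Precision figures quoted above for the Iris dataset.

```python
def sensibility(measure_values):
    """Sum of absolute differences between measures at consecutive error rates (Definition 1).

    `measure_values` are, e.g., Precision values at 0%, 10%, ..., 50% error rate.
    """
    return sum(abs(a - b) for a, b in zip(measure_values, measure_values[1:]))

# Precision of the decision tree on Iris at 0%..50% missing rate (from Example 1).
iris_precision = [78.37, 84.16, 78.08, 74.36, 64.99, 58.71]
print(round(sensibility(iris_precision), 2))  # 31.24
```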

Though the sensibility metric tells us the fluctuation degree of an algorithm, it does not tell us the error rate at which the algorithm becomes unacceptable. Motivated by this, we defined the second novel metric as follows.

Definition 2

Given the values of a measure $M$ of an algorithm at error rates a%, (a+x)%, (a+2x)%,…, (a+bx)% (a ≥ 0, x > 0, b > 0), and a number $\theta$ ($\theta$ > 0). If a larger value of $M$ indicates better accuracy and $M_{a\%} - M_{(a+ix)\%} \ge \theta$ for some $i$ (0 < i ≤ b), we take min{(a+(i-1)x)%} as the error-rate tolerance threshold. If $M_{a\%} - M_{(a+bx)\%} < \theta$, we take (a+bx)% as the threshold. If a smaller value of $M$ indicates better accuracy and $M_{(a+ix)\%} - M_{a\%} \ge \theta$ (0 < i ≤ b), we take min{(a+(i-1)x)%} as the threshold. If $M_{(a+bx)\%} - M_{a\%} < \theta$, we take (a+bx)% as the threshold.

This metric measures the dirty-data tolerability of an algorithm. The larger its value is, the higher the error tolerability of the algorithm is. Therefore, it shows the error rate up to which an algorithm remains acceptable. Here, we take Figure 1 as an example to explain the computation of this threshold.

Example 2

We know the values of the measure (Precision) of the decision tree algorithm at 0%, 10%,…, 50% missing rate, and set 10% as the value of $\theta$. On the Iris dataset, when the missing rate is 40%, $M_{0\%} - M_{40\%}$ = 78.37%-64.99% = 13.38% ≥ 10%, so we take 30% as the threshold. On the Ecoli dataset, when the missing rate is 30%, $M_{0\%} - M_{30\%}$ = 63.47%-50.93% = 12.54% ≥ 10%, so we take 20% as the threshold. On the Car dataset, when the missing rate is 10%, $M_{0\%} - M_{10\%}$ = 81.33%-60.93% = 20.4% ≥ 10%, so we take 0% as the threshold. On the Chess dataset, even when the missing rate is 50%, $M_{0\%} - M_{50\%}$ = 82.17%-75.57% = 6.6% < 10%, so we take 50% as the threshold. On the Adult dataset, when the missing rate is 50%, $M_{0\%} - M_{50\%}$ = 80.5%-67.23% = 13.27% ≥ 10%, so we take 40% as the threshold. Thus, the average threshold is 28%.
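
The following Python sketch reproduces this computation for measures where larger values indicate better accuracy (set larger_is_better=False for RMSD-style measures); the function name tolerance_threshold is ours, and the input values are the Iris figures quoted above.

```python
def tolerance_threshold(error_rates, measure_values, theta, larger_is_better=True):
    """Largest tested error rate at which the measure has not yet dropped (or risen,
    for error-type measures) by at least `theta` relative to the baseline (Definition 2)."""
    baseline = measure_values[0]
    for i in range(1, len(error_rates)):
        change = (baseline - measure_values[i]) if larger_is_better \
                 else (measure_values[i] - baseline)
        if change >= theta:
            return error_rates[i - 1]
    return error_rates[-1]

rates = [0, 10, 20, 30, 40, 50]                       # missing rate in %
iris_precision = [78.37, 84.16, 78.08, 74.36, 64.99, 58.71]
print(tolerance_threshold(rates, iris_precision, theta=10))  # 30, as in Example 2
```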

In addition, we used running time to evaluate the efficiency of all algorithms. We ran each test 5 times and reported logarithms of average time.

6.2 Evaluation on Classification Algorithms

Since various kinds of dirty data could affect the performance of classification algorithms, we varied error rates, including missing rate, inconsistent rate, and conflicting rate to evaluate the classification methods in Section 3.

Missing Inconsistent Conflicting
Algorithm P R F P R F P R F
Decision Tree 25.89 31.11 26.64 35.41 40.94 38.33 16.09 21.56 16.45
KNN 18.09 13.18 17.45 21.84 19.21 20.93 11.39 6.70 9.32
Naive Bayes 27.04 23.37 26.40 29.48 37.18 35.49 15.10 21.85 20.33
Bayesian Network 46.40 34.04 35.37 33.29 21.53 23.15 17.26 15.18 16.01
Logistic Regression 38.26 18.73 30.69 37.84 28.10 38.83 31.74 18.51 25.60
Random Forests 25.77 24.57 29.39 39.21 34.86 40.74 27.93 15.85 27.53
K-Means 31.06 27.80 32.08 31.83 32.21 35.63 23.79 21.86 25.17
LVQ 11.94 21.14 19.61 20.55 18.83 21.41 9.20 19.57 20.13
CLARANS 34.26 40.16 39.48 31.11 29.45 31.56 20.67 22.64 24.04
DBSCAN 15.89 22.88 17.16 20.40 10.39 12.34 18.64 9.55 16.10
BIRCH 32.58 44.56 32.90 24.32 22.48 19.40 15.16 22.44 16.52
CURE 38.68 32.71 39.23 28.81 32.90 32.67 32.74 29.11 32.62
Table 4: Sensibility of Classification and Clustering Algorithms (Unit: %)

Classification - Varying Missing Rate

To evaluate the impacts of missing data on classification algorithms, we randomly deleted values from the original datasets, varying the missing rate from 10% to 50%. We used 10-fold cross validation, and generated training data and testing data randomly. In the testing process, we imputed numerical missing values with the average value of the attribute and categorical ones with the most frequent value. Experimental results are depicted in Figures 1, 2, 3, 4, 5, and 6.
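
A minimal Python sketch of this error-injection and imputation step, under the assumption that "maximum values" for categorical attributes means the most frequent value, is shown below; the toy records are illustrative.

```python
import random
from collections import Counter

def inject_missing(rows, missing_rate, seed=0):
    """Randomly replace a `missing_rate` fraction of all cell values with None."""
    random.seed(seed)
    rows = [list(r) for r in rows]
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    for i, j in random.sample(cells, int(missing_rate * len(cells))):
        rows[i][j] = None
    return rows

def impute(rows):
    """Fill numerical missing values with the column mean and categorical ones
    with the most frequent value of the column."""
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        if all(isinstance(v, (int, float)) for v in observed):
            fill = sum(observed) / len(observed)
        else:
            fill = Counter(observed).most_common(1)[0][0]
        for r in rows:
            if r[j] is None:
                r[j] = fill
    return rows

data = [[5.1, "setosa"], [4.9, "setosa"], [6.3, "virginica"], [5.8, "virginica"]]
dirty = inject_missing(data, missing_rate=0.3)
print(impute(dirty))
```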

Based on the results, we had the following observations. First, for well-performing algorithms whose Precision/Recall/F-measure is larger than 80% on the original datasets, Precision, Recall, or F-measure becomes stable as the data size increases, except for Logistic Regression. The reason is that the amount of clean data is larger for a larger data size. Accordingly, the impacts of missing data on the algorithms are reduced. However, Logistic Regression establishes a regression function as its model, and the parameter computation of a regression function is more sensitive to missing data. Thus, when the data size rises, the amount of missing data becomes larger, which has a larger impact on Logistic Regression.

Second, as shown in Table 4, for Precision, the order is “Bayesian Network > Logistic Regression > Naive Bayes > Decision Tree > Random Forests > KNN”. For Recall, the order is “Bayesian Network > Decision Tree > Random Forests > Naive Bayes > Logistic Regression > KNN”. For F-measure, the order is “Bayesian Network > Logistic Regression > Random Forests > Decision Tree > Naive Bayes > KNN”. Thus, the least sensitive algorithm is KNN. This is because, as the missing rate rises, the increasing missing values may not affect the nearest neighbors. Even if the nearest neighbors are affected, they do not necessarily change the majority vote for the final class label. In addition, the most sensitive algorithm is Bayesian Network. The reason is that the increasing missing data could affect the computation of the posterior probabilities, which would directly impact classification results.

Third, as shown in Table 5, for Precision, the order is “Decision Tree > Naive Bayes > Random Forests > KNN > Logistic Regression > Bayesian Network”. For Recall, the order is “Random Forests > KNN > Naive Bayes > Logistic Regression > Decision Tree > Bayesian Network”. For F-measure, the order is “Decision Tree > Naive Bayes > Bayesian Network > KNN > Logistic Regression > Random Forests”. Therefore, for Precision and F-measure, the most incompleteness-tolerant algorithm is Decision Tree. This is because decision tree models only use the splitting attributes for classification. As the missing rate rises, the increasing missing data may not affect the splitting attributes. For Recall, the most incompleteness-tolerant algorithm is Random Forests. This is because the increasing missing values may not affect the splitting attributes. Even if they are impacted, there is little chance of an inaccurate classification since the final result is made by multiple base classifiers. For Precision and Recall, the least incompleteness-tolerant algorithm is Bayesian Network. This is because the increasing missing data would change the posterior probabilities, which could affect classification results directly. For F-measure, the least incompleteness-tolerant algorithm is Random Forests. The reason is that its F-measure on the original datasets (error rate 0%) is high, so when even a few missing values exist in the datasets, the F-measure drops a lot.

Fourth, as the data size increases, the running time of the algorithms fluctuates more. This is because, as the data size rises, the amount of missing data becomes larger, which introduces more uncertainty to the algorithms. Accordingly, the uncertainty of the running time increases.

Missing Inconsistent Conflicting
Algorithm P R F P R F P R F
Decision Tree 28 26 28 18 16 16 50 50 50
KNN 24 32 20 22 22 22 40 50 40
Naive Bayes 26 28 24 22 12 12 50 40 40
Bayesian Network 20 26 24 16 26 26 46 50 50
Logistic Regression 22 28 16 16 14 16 32 34 32
Random Forests 26 50 10 26 14 8 42 38 34
K-Means 38 32 32 28 22 22 44 38 38
LVQ 44 40 48 28 14 20 44 44 40
CLARANS 2 2 0 22 18 18 34 34 28
DBSCAN 30 40 30 32 44 34 36 50 36
BIRCH 24 20 24 20 26 26 50 34 38
CURE 18 18 16 20 18 16 32 34 24
Table 5: Error-Rate Tolerance of Classification and Clustering Algorithms (θ = 10%, Unit: %)
Missing Inconsistent Conflicting
Algorithm RMSD NRMSD CV(RMSD) RMSD NRMSD CV(RMSD) RMSD NRMSD CV(RMSD)
Least Square 0.662 0.066 0.192 1.204 0.090 0.278 1.056 0.054 0.284
Maximum Likelihood 1.356 0.034 4.466 2.384 0.046 1.546 1.534 0.060 3.914
Polynomial Regression 1.568 0.106 0.200 2.010 0.174 0.464 1.794 0.116 0.426
Stepwise Regression 1.338 0.104 3.466 1.616 0.102 5.996 0.890 0.074 2.748
Table 6: Sensibility of Regression Algorithms
Missing Inconsistent Conflicting
Algorithm RMSD NRMSD CV(RMSD) RMSD NRMSD CV(RMSD) RMSD NRMSD CV(RMSD)
Least Square 30 50 50 16 50 50 32 50 42
Maximum Likelihood 22 50 40 16 50 34 20 50 40
Polynomial Regression 14 40 20 12 50 22 12 40 26
Stepwise Regression 16 40 24 10 42 22 40 50 30
Table 7: Error-Rate Tolerance of Regression Algorithms (θ = 0.1, Unit: %)

Classification - Varying Inconsistent Rate

To evaluate the impacts of inconsistency on classification algorithms, we randomly injected inconsistent values into the original datasets according to consistency rules on the given data. The inconsistent rate was varied from 10% to 50%. We used 10-fold cross validation, and generated training data and testing data randomly. Experimental results are depicted in Figures 7, 8, 9, 10, 11, and 12.

Based on the results, we had the following observations. First, as shown in Table 4, for Precision, the order is “Random Forests > Logistic Regression > Decision Tree > Bayesian Network > Naive Bayes > KNN”. For Recall, the order is “Decision Tree > Naive Bayes > Random Forests > Logistic Regression > Bayesian Network > KNN”. For F-measure, the order is “Random Forests > Logistic Regression > Decision Tree > Naive Bayes > Bayesian Network > KNN”. Thus, the least sensitive algorithm is KNN. The reason is similar to that of the least sensitive algorithm when varying the missing rate. For Precision and F-measure, the most sensitive algorithm is Random Forests, and for Recall, the most sensitive algorithm is Decision Tree. These are due to the fact that, as the inconsistent rate increases, more and more incorrect values cover the correct ones in the decision tree training models, which leads to inaccurate classification results. Since the base classifiers in Random Forests are decision trees, the reason for Random Forests is the same as that for Decision Tree.

Second, as shown in Table 5, for Precision, the order is “Random Forests > KNN > Naive Bayes > Decision Tree > Bayesian Network > Logistic Regression”. For Recall, the order is “Bayesian Network > KNN > Decision Tree > Logistic Regression > Random Forests > Naive Bayes”. For F-measure, the order is “Bayesian Network > KNN > Decision Tree > Logistic Regression > Naive Bayes > Random Forests”. Therefore, for Precision, the most inconsistency-tolerant algorithm is Random Forests. The reason has been discussed in Section 6.2.1. For Recall and F-measure, the most inconsistency-tolerant algorithm is Bayesian Network. This is because inconsistent values contain both incorrect and correct ones. Hence, the incorrect values have little effect on the computation of the posterior probabilities. Accordingly, the classification results may not be affected. For Precision, the least inconsistency-tolerant algorithms are Bayesian Network and Logistic Regression. For Recall, the least inconsistency-tolerant algorithm is Naive Bayes. And for F-measure, the least inconsistency-tolerant algorithm is Random Forests. This is because the Precision/Recall/F-measure of these algorithms on the original datasets (error rate 0%) is high. When even a few inconsistent values are injected, the Precision/Recall/F-measure drops dramatically.

Third, the observation of running time on experiments varying missing rate was still true when the inconsistent rate was varied.

Classification - Varying Conflicting Rate

To evaluate the impacts of conflicting data on classification algorithms, we randomly injected conflicting values into the original datasets, varying the conflicting rate from 10% to 50%. We used 10-fold cross validation, and generated training and testing data randomly. Experimental results are depicted in Figures 13, 14, 15, 16, 17, and 18.

First, the observation of the relationship between the data size and algorithm stability on experiments varying missing rate was still true when the conflicting rate was varied.

Second, as shown in Table 4, for Precision, the order is “Logistic Regression > Random Forests > Bayesian Network > Decision Tree > Naive Bayes > KNN”. For Recall, the order is “Naive Bayes > Decision Tree > Logistic Regression > Random Forests > Bayesian Network > KNN”. For F-measure, the order is “Random Forests > Logistic Regression > Naive Bayes > Decision Tree > Bayesian Network > KNN”. Thus, the least sensitive algorithm is KNN. The reason is similar to that of the least sensitive algorithm when varying the missing rate. For Precision, the most sensitive algorithm is Logistic Regression. This is because the parameter computation of the regression function is easily affected by the increasing conflicting values, which causes an inaccurate logistic regression model. For Recall, the most sensitive algorithm is Naive Bayes. This is because the incorrect values among the increasing conflicting values affect the computation of the posterior probabilities in the Bayes theorem. For F-measure, the most sensitive algorithm is Random Forests. The reason is the same as that of the most sensitive algorithm when varying the inconsistent rate.

Third, as shown in Table 5, for Precision, the order is “Decision Tree > Naive Bayes > Bayesian Network > Random Forests > KNN > Logistic Regression”. For Recall, the order is “Decision Tree > KNN > Bayesian Network > Naive Bayes > Random Forests > Logistic Regression”. For F-measure, the order is “Decision Tree > Bayesian Network > KNN > Naive Bayes > Random Forests > Logistic Regression”. Therefore, the most conflict-tolerant algorithm is Decision Tree. The reason is similar to that of the most incompleteness-tolerant algorithm. The least conflict-tolerant algorithm is Logistic Regression. This is due to the fact that conflicting data have a strong effect on the parameter computation of logistic regression models.

Fourth, the observation of running time on experiments varying missing rate was still true when the conflicting rate was varied.

Discussion

In the classification experiments, we first found that dirty-data impacts are related to the error type and error rate. Thus, the error rate of each error type in the given data needs to be detected. Second, we observed that for algorithms whose Precision/Recall/F-measure is larger than 80% on the original datasets, Precision, Recall, or F-measure becomes stable as the data size rises, except for Logistic Regression. Since the parameter θ of the tolerance threshold was set to 10%, candidate algorithms whose Precision/Recall/F-measure is larger than 70% are acceptable. Third, we prefer to choose stable algorithms. Hence, Logistic Regression is suitable for smaller data sizes. Fourth, we compared the fluctuation degrees (sensibility) of the classification algorithms in our experiments. When dirty data exist, the algorithm with the smallest degree is the most stable. Fifth, beyond the tolerance threshold, the accuracy of the selected algorithm becomes unacceptable. Thus, the error rate of each type needs to be controlled within its threshold.

6.3 Evaluation on Clustering Algorithms

Since various types of dirty data could affect the performance of clustering algorithms, we varied error rates, involving missing rate, inconsistent rate, and conflicting rate to evaluate clustering approaches in Section 4.

Clustering - Varying Missing Rate

To evaluate missing-data impacts on clustering algorithms, we randomly deleted values from the original datasets, varying the missing rate from 10% to 50%. In the clustering process, we imputed numerical missing values with the average value of the attribute and categorical ones with the most frequent value. Experimental results are depicted in Figures 19, 20, 21, 22, 23, and 24.

Based on the results, we had the following observations. First, as shown in Table 4, for Precision, the order is “CURE > CLARANS > BIRCH > K-Means > DBSCAN > LVQ”. For Recall, the order is “BIRCH > CLARANS > CURE > K-Means > DBSCAN > LVQ”. For F-measure, the order is “CLARANS > CURE > BIRCH > K-Means > LVQ > DBSCAN”. Thus, for Precision and Recall, the least sensitive algorithm is LVQ. This is because LVQ is a supervised clustering algorithm based on marked labels. Hence, there is little chance for it to be affected by missing values. For F-measure, the least sensitive algorithm is DBSCAN. This is due to the fact that DBSCAN eliminates all noise points at the beginning of the algorithm, which makes it more resistant to missing values. For Precision, the most sensitive algorithm is CURE. This is because the location of the representative points in CURE is easily affected by missing values, which causes inaccurate clustering results. For Recall, the most sensitive algorithm is BIRCH. This is due to the fact that missing data could impact the construction of the CF tree in BIRCH, which directly leads to wrong clustering results. For F-measure, the most sensitive algorithm is CLARANS. This is because the computation of the cost difference in CLARANS is susceptible to missing values, which makes some points clustered incorrectly.

Second, as shown in Table 5, for Precision, the order is “LVQ > K-Means > DBSCAN > BIRCH > CURE > CLARANS”. For Recall, the order is “LVQ > DBSCAN > K-Means > BIRCH > CURE > CLARANS”. For F-measure, the order is “LVQ > K-Means > DBSCAN > BIRCH > CURE > CLARANS”. Therefore, the most incompleteness-tolerant algorithm is LVQ. This is because LVQ is a supervised clustering algorithm based on marked labels. Hence, there is little chance for it to be affected by missing values. The least incompleteness-tolerant algorithm is CLARANS. This is due to the fact that the computation of the cost difference in CLARANS is susceptible to missing data, which causes inaccurate clustering results.

Third, as the data size increases, the running time of the algorithms fluctuates more. This is because, as the data size rises, the amount of missing data becomes larger, which introduces more uncertainty to the algorithms. Accordingly, the uncertainty of the running time increases.

Clustering - Varying Inconsistent Rate

To evaluate inconsistent-data impacts on clustering algorithms, we randomly injected inconsistent values into the original datasets according to consistency rules on the given data. The inconsistent rate was varied from 10% to 50%. Experimental results are depicted in Figures 25, 26, 27, 28, 29, and 30.

Based on the results, we had the following observations. First, for well-performing algorithms (Precision/Recall/F-measure larger than 80% on the original datasets), as the data size increases, the Precision, Recall, or F-measure of the algorithms fluctuates more widely, except for DBSCAN. This is because the amount of inconsistent values becomes larger as the data size rises. The increasing incorrect data have more effect on the clustering process. However, DBSCAN discards noise points at the beginning of the algorithm. When the data size rises, the amount of clean data becomes larger. Accordingly, the proportion of eliminated points is reduced, which has less impact on DBSCAN.

Second, as shown in Table 4, for Precision, the order is “K-Means > CLARANS > CURE > BIRCH > LVQ > DBSCAN”. For Recall, the order is “CURE > K-Means > CLARANS > BIRCH > LVQ > DBSCAN”. For F-measure, the order is “K-Means > CURE > CLARANS > LVQ > BIRCH > DBSCAN”. Thus, the least sensitive algorithm is DBSCAN. The reason is similar to that of the least sensitive algorithm when varying the missing rate. For Precision and F-measure, the most sensitive algorithm is K-Means. This is due to the fact that the computation of the centroids is susceptible to incorrect values, which causes wrong clustering results. For Recall, the most sensitive algorithm is CURE. The reason is similar to that of the most sensitive algorithm when varying the missing rate.

Third, as shown in Table 5, for Precision, the order is “DBSCAN > K-Means > LVQ > CLARANS > BIRCH > CURE”. For Recall, the order is “DBSCAN > BIRCH > K-Means > CLARANS > CURE > LVQ”. For F-measure, the order is “DBSCAN > BIRCH > K-Means > LVQ > CLARANS > CURE”. Therefore, the most inconsistency-tolerant algorithm is DBSCAN. This is because DBSCAN eliminates all noise points at the beginning of the algorithm, which makes it more resistant to inconsistent data. For Precision, the least inconsistency-tolerant algorithms are BIRCH and CURE. For Recall, the least inconsistency-tolerant algorithm is LVQ. For F-measure, the least inconsistency-tolerant algorithm is CURE. These are due to the fact that the distance computation of these algorithms is susceptible to incorrect values, which causes inaccurate clustering results.

Fourth, the observation of running time on experiments varying missing rate was still true when the inconsistent rate was varied.

Clustering - Varying Conflicting Rate

To evaluate the impacts of conflicting data on clustering algorithms, we randomly injected conflicting values into the original datasets, varying the conflicting rate from 10% to 50%. Experimental results are depicted in Figures 31, 32, 33, 34, 35, and 36.

Based on the results, we had the following observations. First, as shown in Table 4, for Precision, the order is “CURE > K-Means > CLARANS > DBSCAN > BIRCH > LVQ”. For Recall, the order is “CURE > CLARANS > BIRCH > K-Means > LVQ > DBSCAN”. For F-measure, the order is “CURE > K-Means > CLARANS > LVQ > BIRCH > DBSCAN”. Thus, for Precision, the least sensitive algorithm is LVQ. The reason is similar to that of the least sensitive algorithm when varying the missing rate. For Recall and F-measure, the least sensitive algorithm is DBSCAN. The reason has been discussed in Section 6.3.1. The most sensitive algorithm is CURE. The reason is similar to that of the most sensitive algorithm when varying the missing rate.

Second, as shown in Table 5, for Precision, the order is “BIRCH > K-Means > LVQ > DBSCAN > CLARANS > CURE”. For Recall, the order is “DBSCAN > LVQ > K-Means > CLARANS > BIRCH > CURE”. For F-measure, the order is “LVQ > K-Means > BIRCH > DBSCAN > CLARANS > CURE”. Therefore, for Precision, the most conflict-tolerant algorithm is BIRCH. This is because conflicting data contain both correct and incorrect values, which makes the construction of the CF tree insusceptible to the incorrect ones. For Recall, the most conflict-tolerant algorithm is DBSCAN. The reason is similar to that of the most inconsistency-tolerant algorithm when varying the inconsistent rate. For F-measure, the most conflict-tolerant algorithm is LVQ. The reason is similar to that of the most incompleteness-tolerant algorithm when varying the missing rate. The least conflict-tolerant algorithm is CURE. This is due to the fact that the location of the representative points in CURE can easily be affected by conflicting values, which makes data points clustered inaccurately.

Third, the observation of running time on experiments varying missing rate was still true when the conflicting rate was varied.

Discussion

In the clustering experiments, we first found that dirty-data impacts are related to the error type and error rate. Thus, the error rate of each error type in the given data needs to be detected. Second, we observed that for algorithms whose Precision/Recall/F-measure is larger than 80% on the original datasets, Precision, Recall, or F-measure becomes unstable as the data size rises, except for DBSCAN. Since the parameter θ of the tolerance threshold was set to 10%, candidate algorithms whose Precision/Recall/F-measure is larger than 70% are acceptable. Third, we prefer to choose stable algorithms. Hence, DBSCAN is suitable for larger data sizes. Fourth, we compared the fluctuation degrees (sensibility) of the clustering algorithms in our experiments. When dirty data exist, the algorithm with the smallest degree is the most stable. Fifth, beyond the tolerance threshold, the accuracy of the selected algorithm becomes unacceptable. Thus, the error rate of each type needs to be controlled within its threshold.

6.4 Evaluation on Regression Algorithms

Since various kinds of dirty data could affect performance of regression algorithms, we varied error rates, including missing rate, inconsistent rate, and conflicting rate to evaluate different types of regression methods in Section 5.

Regression - Varying Missing Rate

To evaluate missing-data impacts on regression algorithms, we randomly deleted values from the original datasets, varying the missing rate from 10% to 50%. We used 10-fold cross validation and generated training data and testing data randomly. In the testing process, we imputed numerical missing values with the average value of the attribute and categorical ones with the most frequent value. Experimental results are depicted in Figures 37, 38, 39, and 40.

Based on the results, we had the following observations. First, as shown in Table 6, for RMSD, the order is “Polynomial Regression > Maximum Likelihood > Stepwise Regression > Least Square”. For NRMSD, the order is “Polynomial Regression > Stepwise Regression > Least Square > Maximum Likelihood”. For CV(RMSD), the order is “Maximum Likelihood > Stepwise Regression > Polynomial Regression > Least Square”. Thus, for RMSD and CV(RMSD), the least sensitive algorithm is Least Square, and for NRMSD, the least sensitive algorithm is Maximum Likelihood. These are due to the fact that the number of parameters in a linear regression model is small. Hence, there is little chance for the model training to be affected by missing values. For RMSD and NRMSD, the most sensitive algorithm is Polynomial Regression, and for CV(RMSD), the most sensitive algorithm is Maximum Likelihood. This is because these algorithms perform badly on some original datasets (error rate 0%). When missing data are injected, the uncertainty of the data increases, which introduces more uncertainty to the algorithms. Accordingly, the performance of the algorithms becomes worse.

Second, as shown in Table 7, for RMSD, the order is “Least Square > Maximum Likelihood > Stepwise Regression > Polynomial Regression”. For NRMSD, the order is “Least Square > Maximum Likelihood > Stepwise Regression > Polynomial Regression”. For CV(RMSD), the order is “Least Square > Maximum Likelihood > Stepwise Regression > Polynomial Regression”. Therefore, the most incompleteness-tolerant algorithm is Least Square. This is because the number of parameters in the least square linear regression model is small. Hence, there is little chance for it to be affected. The least incompleteness-tolerant algorithm is Polynomial Regression. This is due to the fact that there are many parameters in a polynomial regression model, which makes it susceptible to missing data.

Third, as the running time of an algorithm on the original datasets (error rate 0%) rises, the running time of the algorithm fluctuates more. This is because, as the running time rises, the uncertainty of the algorithm increases. Accordingly, the uncertainty of the running time increases.

Regression - Varying Inconsistent Rate

To evaluate inconsistent-data impacts on regression algorithms, we randomly injected inconsistent values into the original datasets according to consistency rules on the given data. The inconsistent rate was varied from 10% to 50%. We used 10-fold cross validation and generated training data and testing data randomly. Experimental results are depicted in Figures 41, 42, 43, and 44.

Based on the results, we had the following observations. First, as shown in Table 6, for RMSD, the order is “Maximum Likelihood > Polynomial Regression > Stepwise Regression > Least Square”. For NRMSD, the order is “Polynomial Regression > Stepwise Regression > Least Square > Maximum Likelihood”. For CV(RMSD), the order is “Stepwise Regression > Maximum Likelihood > Polynomial Regression > Least Square”. Thus, for RMSD and CV(RMSD), the least sensitive algorithm is Least Square, and for NRMSD, the least sensitive algorithm is Maximum Likelihood. The reason is similar to that of the least sensitive algorithm when varying the missing rate. For RMSD, the most sensitive algorithm is Maximum Likelihood. For NRMSD, the most sensitive algorithm is Polynomial Regression. And for CV(RMSD), the most sensitive algorithm is Stepwise Regression. These are due to their poor performance on some original datasets (error rate 0%). When inconsistent data are injected, the uncertainty of the data increases, which introduces more uncertainty to the algorithms. Accordingly, the algorithms perform worse.

Second, as shown in Table 7, for RMSD, the order is “Least Square > Maximum Likelihood > Polynomial Regression > Stepwise Regression”. For NRMSD, the order is “Least Square > Maximum Likelihood > Polynomial Regression > Stepwise Regression”. For CV(RMSD), the order is “Least Square > Maximum Likelihood > Polynomial Regression > Stepwise Regression”. Therefore, the most inconsistency-tolerant algorithm is Least Square. The reason is similar to that of the most incompleteness-tolerant algorithm when varying the missing rate. The least inconsistency-tolerant algorithm is Stepwise Regression. This is due to the fact that there are many independent variables to be tested in a stepwise regression model, which makes it easily affected by inconsistent values.

Third, the observation of running time on experiments varying missing rate was still true when the inconsistent rate was varied.

Regression - Varying Conflicting Rate

To evaluate the impacts of conflicting data on regression algorithms, we randomly injected conflicting values into the original datasets, varying the conflicting rate from 10% to 50%. We used 10-fold cross validation and generated training data and testing data randomly. Experimental results are depicted in Figures 45, 46, 47, and 48.

Based on the results, we had the following observations. First, as shown in Table 6, for RMSD, the order is “Polynomial Regression > Maximum Likelihood > Least Square > Stepwise Regression”. For NRMSD, the order is “Polynomial Regression > Stepwise Regression > Maximum Likelihood > Least Square”. For CV(RMSD), the order is “Maximum Likelihood > Stepwise Regression > Polynomial Regression > Least Square”. Thus, for RMSD, the least sensitive algorithm is Stepwise Regression. This is because there is a validation step in stepwise regression, which guarantees the regression accuracy. For NRMSD and CV(RMSD), the least sensitive algorithm is Least Square. The reason is similar to that of the least sensitive algorithm when varying the missing rate. For RMSD and NRMSD, the most sensitive algorithm is Polynomial Regression, and for CV(RMSD), the most sensitive algorithm is Maximum Likelihood. The reason is similar to that of the most sensitive algorithms when varying the missing rate.

Second, as shown in Table 7, for RMSD, the order is “Stepwise Regression > Least Square > Maximum Likelihood > Polynomial Regression”. For NRMSD, the order is “Least Square > Maximum Likelihood > Stepwise Regression > Polynomial Regression”. For CV(RMSD), the order is “Least Square > Maximum Likelihood > Stepwise Regression > Polynomial Regression”. Therefore, the most conflict-tolerant algorithms are Least Square, Maximum Likelihood, and Stepwise Regression. This is due to the fact that there are only a small number of parameters in the least square and maximum likelihood linear regression models, and in Stepwise Regression, the validation step helps guarantee the regression accuracy. The least conflict-tolerant algorithm is Polynomial Regression. The reason is similar to that of the least incompleteness-tolerant algorithm when varying the missing rate.

Third, the observation of running time on experiments varying missing rate was still true when the conflicting rate was varied.

Discussion

In the regression experiments, we first found that dirty-data impacts are related to the error type and error rate. Thus, the error rate of each error type in the given data needs to be detected. Second, we compared the fluctuation degrees (sensibility) of the regression algorithms in our experiments. When dirty data exist, the algorithm with the smallest degree is the most stable. Third, beyond the tolerance threshold, the accuracy of the selected algorithm becomes unacceptable. Thus, the error rate of each type needs to be controlled within its threshold.

7 Guidelines and Future Work

Based on the discussions, we give guidelines for algorithm selection and data cleaning.

Classification Guidelines. We suggest users select classification algorithm and clean dirty data according to the following steps.

First, users are suggested to detect error rates (e.g., missing rate, inconsistent rate, conflicting rate) from the given data.

Second, according to the given task requirements (e.g., good performance on Precision/Recall/F-measure), we suggest that users select candidate algorithms whose Precision/Recall/F-measure on the given data exceeds 70%; a sketch of this step, together with error-rate detection, follows these guidelines.

Third, if the given data size is small, we recommend Logistic Regression.

Fourth, according to the task requirements and the error type that accounts for the largest proportion, we suggest that users find the corresponding order and choose the least sensitive classification algorithm.

Finally, according to the selected algorithm, the task requirements, and the error rates of the given data, we suggest that users find the corresponding orders and clean each type of dirty data down to its acceptable error-rate threshold.
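A minimal sketch of the first two classification steps, assuming the data sit in a pandas DataFrame and that F-measure scores on the given data are already available (all names and numbers below are illustrative assumptions):

    import pandas as pd

    def missing_rate(df):
        """Share of cells that are missing (step 1, for the missing-data error type)."""
        return df.isna().sum().sum() / df.size

    def candidate_algorithms(f_measure_by_algo, cutoff=0.70):
        """Keep algorithms whose F-measure on the given data exceeds the cutoff (step 2)."""
        return [algo for algo, f in f_measure_by_algo.items() if f > cutoff]

    # Illustrative F-measure values only.
    scores = {"Decision Tree": 0.81, "KNN": 0.74, "Naive Bayes": 0.69,
              "Bayesian Network": 0.72, "Logistic Regression": 0.77, "Random Forests": 0.83}
    print(candidate_algorithms(scores))
    # ['Decision Tree', 'KNN', 'Bayesian Network', 'Logistic Regression', 'Random Forests']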

Clustering Guidelines. We suggest that users select a clustering algorithm and clean dirty data according to the following steps.

First, users should detect the error rates (e.g., missing rate, inconsistent rate, and conflicting rate) of the given data.

Second, according to the given task requirements (e.g., good performance on Precision/Recall/F-measure), we suggest that users select candidate algorithms whose Precision/Recall/F-measure on the given data exceeds 70%.

Third, if the given data size is large, we recommend DBSCAN.

Fourth, according to the task requirements and the error type that accounts for the largest proportion, we suggest that users find the corresponding order and choose the least sensitive clustering algorithm.

Finally, according to the selected algorithm, the task requirements, and the error rates of the given data, we suggest that users find the corresponding orders and clean each type of dirty data down to its acceptable error-rate threshold.

Regression Guidelines. We suggest that users select a regression algorithm and clean dirty data according to the following steps.

First, users should detect the error rates (e.g., missing rate, inconsistent rate, and conflicting rate) of the given data.

Second, according to the given task requirements (e.g., good performance on RMSD/NRMSD/CV(RMSD)), we suggest that users select candidate algorithms whose RMSD/CV(RMSD) on the given data is below 1.0, or whose NRMSD on the given data is below 0.5 (see the sketch after these guidelines).

Third, according to the task requirements and the error type that accounts for the largest proportion, we suggest that users find the corresponding order and choose the least sensitive regression algorithm.

Finally, according to the selected algorithm, the task requirements, and the error rates of the given data, we suggest that users find the corresponding orders and clean each type of dirty data down to its acceptable error-rate threshold.
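Analogously to the classification sketch above, the regression candidate-selection step might be expressed as follows; the thresholds are taken from the guideline, while the per-algorithm scores are illustrative assumptions:

    def regression_candidates(metrics_by_algo, rmsd_max=1.0, nrmsd_max=0.5):
        """Keep algorithms whose RMSD and CV(RMSD) are below 1.0, or whose NRMSD is below 0.5."""
        kept = []
        for algo, (rmsd, nrmsd, cv_rmsd) in metrics_by_algo.items():
            if (rmsd < rmsd_max and cv_rmsd < rmsd_max) or nrmsd < nrmsd_max:
                kept.append(algo)
        return kept

    # Illustrative (RMSD, NRMSD, CV(RMSD)) triples per algorithm.
    metrics = {"Least Square": (0.62, 0.18, 0.35),
               "Maximum Likelihood": (0.71, 0.22, 0.41),
               "Polynomial Regression": (1.35, 0.53, 0.88),
               "Stepwise Regression": (0.95, 0.31, 0.52)}
    print(regression_candidates(metrics))
    # ['Least Square', 'Maximum Likelihood', 'Stepwise Regression']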

In addition, this work opens many noteworthy avenues for future work, which are listed as follows.

For researchers and practitioners in fields related to data analytics and data mining. (1) Since dirty-data impacts on classification, clustering, and regression are valuable, their effects on other kinds of algorithms (e.g., association rule mining) need to be tested. (2) Dirty-data impacts are related to error type, error rate, data size, and algorithm performance on the original datasets. Hence, constructing a model with these parameters to predict dirty-data impacts is in demand.

For researchers and practitioners in data-quality and data-cleaning related fields. (1) Since the error-tolerance abilities of different algorithms on different error types differ, it is unnecessary to clean all dirty data before data mining and machine learning tasks. Instead, cleaning data to an appropriate error rate is suggested. However, which part of the dirty data should be repaired first is a challenging problem. (2) Since different users have different task requirements, how to clean data on demand needs a solution.

Figure 1: Results on Classification for Decision Tree Algorithm: Varying Missing Rate.
Figure 2: Results on Classification for KNN Algorithm: Varying Missing Rate.
Figure 3: Results on Classification for Naive Bayes Algorithm: Varying Missing Rate.
Figure 4: Results on Classification for Bayesian Network Algorithm: Varying Missing Rate.
Figure 5: Results on Classification for Logistic Regression Algorithm: Varying Missing Rate.
Figure 6: Results on Classification for Random Forests Algorithm: Varying Missing Rate.
Figure 7: Results on Classification for Decision Tree Algorithm: Varying Inconsistent Rate.
Figure 8: Results on Classification for KNN Algorithm: Varying Inconsistent Rate.
Figure 9: Results on Classification for Naive Bayes Algorithm: Varying Inconsistent Rate.
Figure 10: Results on Classification for Bayesian Network Algorithm: Varying Inconsistent Rate.
Figure 11: Results on Classification for Logistic Regression Algorithm: Varying Inconsistent Rate.
Figure 12: Results on Classification for Random Forests Algorithm: Varying Inconsistent Rate.
Figure 13: Results on Classification for Decision Tree Algorithm: Varying Conflicting Rate.
Figure 14: Results on Classification for KNN Algorithm: Varying Conflicting Rate.
Figure 15: Results on Classification for Naive Bayes Algorithm: Varying Conflicting Rate.
Figure 16: Results on Classification for Bayesian Network Algorithm: Varying Conflicting Rate.
Figure 17: Results on Classification for Logistic Regression Algorithm: Varying Conflicting Rate.
Figure 18: Results on Classification for Random Forests Algorithm: Varying Conflicting Rate.
Figure 19: Results on Clustering for K-Means Algorithm: Varying Missing Rate.
Figure 20: Results on Clustering for LVQ Algorithm: Varying Missing Rate.
Figure 21: Results on Clustering for CLARANS Algorithm: Varying Missing Rate.
Figure 22: Results on Clustering for DBSCAN Algorithm: Varying Missing Rate.
Figure 23: Results on Clustering for BIRCH Algorithm: Varying Missing Rate.
Figure 24: Results on Clustering for CURE Algorithm: Varying Missing Rate.
Figure 25: Results on Clustering for K-Means Algorithm: Varying Inconsistent Rate.
Figure 26: Results on Clustering for LVQ Algorithm: Varying Inconsistent Rate.
Figure 27: Results on Clustering for CLARANS Algorithm: Varying Inconsistent Rate.
Figure 28: Results on Clustering for DBSCAN Algorithm: Varying Inconsistent Rate.
Figure 29: Results on Clustering for BIRCH Algorithm: Varying Inconsistent Rate.
Figure 30: Results on Clustering for CURE Algorithm: Varying Inconsistent Rate.
Figure 31: Results on Clustering for K-Means Algorithm: Varying Conflicting Rate.
Figure 32: Results on Clustering for LVQ Algorithm: Varying Conflicting Rate.
Figure 33: Results on Clustering for CLARANS Algorithm: Varying Conflicting Rate.
Figure 34: Results on Clustering for DBSCAN Algorithm: Varying Conflicting Rate.
Figure 35: Results on Clustering for BIRCH Algorithm: Varying Conflicting Rate.
Figure 36: Results on Clustering for CURE Algorithm: Varying Conflicting Rate.
Figure 37: Results on Regression for Least Square Linear Regression: Varying Missing Rate.
Figure 38: Results on Regression for Maximum Likelihood Linear Regression: Varying Missing Rate.
Figure 39: Results on Regression for Polynomial Regression: Varying Missing Rate.
Figure 40: Results on Regression for Stepwise Regression: Varying Missing Rate.
Figure 41: Results on Regression for Least Square Linear Regression: Varying Inconsistent Rate.
Figure 42: Results on Regression for Maximum Likelihood Linear Regression: Varying Inconsistent Rate.
Figure 43: Results on Regression for Polynomial Regression: Varying Inconsistent Rate.
Figure 44: Results on Regression for Stepwise Regression: Varying Inconsistent Rate.
Figure 45: Results on Regression for Least Square Linear Regression: Varying Conflicting Rate.
Figure 46: Results on Regression for Maximum Likelihood Linear Regression: Varying Conflicting Rate.
Figure 47: Results on Regression for Polynomial Regression: Varying Conflicting Rate.
Figure 48: Results on Regression for Stepwise Regression: Varying Conflicting Rate.

Footnotes

  1. https://github.com/qizhixinhit/Dirty-dataImpacts
  2. http://archive.ics.uci.edu/ml/datasets.html

References

  1. G. Beskales, I. F. Ilyas, L. Golab, and A. Galiullin. On the relative trust between inconsistent data and inaccurate constraints. In ICDE, pages 541–552, 2013.
  2. X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458–469, 2013.
  3. X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. Katara: A data cleaning system powered by knowledge bases and crowdsourcing. In SIGMOD, pages 1247–1261, 2015.
  4. S. Hao, N. Tang, G. Li, and J. Li. Cleaning relations using knowledge bases. In ICDE, pages 933–944, 2017.
  5. J. Wang, T. Kraska, M. J. Franklin, and J. Feng. Crowder: Crowdsourcing entity resolution. PVLDB, 5(11):1483–1494, 2012.
  6. M. Dallachiesa, A. Ebaid, A. Eldawy, A. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. Nadeef: a commodity data cleaning system. In SIGMOD, pages 541–552, 2013.
  7. W. Fan and F. Geerts. Foundations of Data Quality Management. 2012.
  8. W. Fan and F. Geerts. Capturing missing tuples and missing values. In SIGMOD, pages 169–178, 2010.
  9. G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In PVLDB, pages 315–326, 2007.
  10. L. Getoor and A. Machanavajjhala. Entity resolution: theory, practice & open challenges. PVLDB, 5(12):2018–2019, 2012.
  11. F. Sidi, P. H. S. Panahy, L. S. Affendey, M. A. Jabar, H. Ibrahim, and A. Mustapha. Data quality: A survey of data quality dimensions. In Information Retrieval & Knowledge Management, pages 300–304, 2012.
  12. R. Caruana and A. Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In ICML, pages 161–168, 2006.
  13. R. Caruana, N. Karampatziakis, and A. Yessenalina. An empirical evaluation of supervised learning in high dimensions. In ICML, pages 96–103, 2008.
  14. K. O. Elish and M. O. Elish. Predicting defect-prone software modules using support vector machines. Journal of Systems and Software, 81(5):649–660, 2008.
  15. B. Ghotra, S. McIntosh, and A. E. Hassan. Revisiting the impact of classification techniques on the performance of defect prediction models. In ICSE, pages 789–800, 2015.
  16. J. R. Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986.
  17. Y. Prabhu and M. Varma. Fastxml: A fast, accurate and stable tree-classifier for extreme multi-label learning. In SIGKDD, pages 263–272, 2014.
  18. C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-weka: Combined selection and hyperparameter optimization of classification algorithms. In SIGKDD, pages 847–855, 2013.
  19. T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE transactions on pattern analysis and machine intelligence, 18(6):607–616, 1996.
  20. N. Begum, L. Ulanova, J. Wang, and E. Keogh. Accelerating dynamic time warping clustering with a novel admissible pruning strategy. In SIGKDD, pages 49–58, 2015.
  21. T. Bayes. A letter from the late reverend mr. thomas bayes, frs to john canton, ma and frs. Philosophical Transactions, 53:269–271, 1763.
  22. J. Pearl. Fusion, propagation, and structuring in belief networks. Artificial intelligence, 29(3):241–288, 1986.
  23. T. Wu, S. Sugawara, and K. Yamanishi. Decomposed normalized maximum likelihood codelength criterion for selecting hierarchical latent variable models. In SIGKDD, pages 1165–1174, 2017.
  24. D. McFadden et al. Conditional logit analysis of qualitative choice behavior. Frontiers in Econometrics, pages 105–142, 1972.
  25. H. Jain, Y. Prabhu, and M. Varma. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In SIGKDD, pages 935–944, 2016.
  26. S. Chang, W. Han, J. Tang, G. Qi, C. C. Aggarwal, and T. S. Huang. Heterogeneous network embedding via deep architectures. In SIGKDD, pages 119–128, 2015.
  27. Z. Cui, W. Chen, Y. He, and Y. Chen. Optimal action extraction for random forests and boosted trees. In SIGKDD, pages 179–188, 2015.
  28. S. Khanmohammadi, N. Adibeig, and S. Shanehbandy. An improved overlapping k-means clustering method for medical applications. Expert Systems with Applications, 67:12–18, 2017.
  29. K. Kirchner, J. Zec, and B. Delibašić. Facilitating data preprocessing by a generic framework: a proposal for clustering. Artificial Intelligence Review, 45(3):271–297, 2016.
  30. S. Wu, H. Chen, and X. Feng. Clustering algorithm for incomplete data sets with mixed numeric and categorical attributes. International Journal of Database Theory and Application, 6(5):95–104, 2013.
  31. H. Gulati and P. Singh. Clustering techniques in data mining: A comparison. In Computing for Sustainable Global Development, pages 410–415, 2015.
  32. J. MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297, 1967.
  33. T. Kohonen. Learning vector quantization. In Self-Organizing Maps, pages 175–189. 1995.
  34. R. T. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In PVLDB, pages 144–155, 1994.
  35. M. Ester, H. Kriegel, J. Sander, and X. Xu. Density-based spatial clustering of applications with noise. In SIGKDD, volume 240, 1996.
  36. T. Zhang, R. Ramakrishnan, and M. Livny. Birch: an efficient data clustering method for very large databases. In ACM Sigmod Record, volume 25, pages 103–114, 1996.
  37. S. Guha, R. Rastogi, and K. Shim. Cure: an efficient clustering algorithm for large databases. In ACM Sigmod Record, volume 27, pages 73–84, 1998.
  38. R. Silhavy, P. Silhavy, and Z. Prokopova. Analysis and selection of a regression model for the use case points method using a stepwise approach. Journal of Systems and Software, 125:1–14, 2017.
  39. S. Abraham, M. Raisee, G. Ghorbaniasl, F. Contino, and C. Lacor. A robust and efficient stepwise regression method for building sparse polynomial chaos expansions. Journal of Computational Physics, 332:461–474, 2017.
  40. E. Avdis and J. A. Wachter. Maximum likelihood estimation of the equity premium. Journal of Financial Economics, 125(3):589–609, 2017.
  41. L. Li and X. Zhang. Parsimonious tensor response regression. Journal of the American Statistical Association, 112(519):1131–1146, 2017.
  42. J. B. Ramsey. Tests for specification errors in classical linear least-squares regression analysis. Journal of the Royal Statistical Society, pages 350–371, 1969.
  43. P. McCullagh. Generalized linear models. European Journal of Operational Research, 16(3):285–292, 1984.
  44. A. R. Gallant and W. A. Fuller. Fitting segmented polynomial regression models whose join points have to be estimated. Journal of the American Statistical Association, 68(341):144–147, 1973.
  45. R. B. Bendel and A. A. Afifi. Comparison of stopping rules in forward “stepwise” regression. Journal of the American Statistical Association, 72(357):46–53, 1977.