Impacts of Dirty Data: An Experimental Evaluation
Abstract
Data quality issues have attracted widespread attention due to the negative impacts of dirty data on data mining and machine learning results. Understanding the relationship between data quality and the accuracy of results could help us select appropriate algorithms with data quality in mind and determine the share of data to clean. However, little research has explored this relationship. Motivated by this, this paper conducts an experimental comparison of the effects of missing, inconsistent, and conflicting data on classification, clustering, and regression algorithms. Based on the experimental findings, we provide guidelines for algorithm selection and data cleaning.
1 Introduction
Data quality has become a serious issue that cannot be overlooked in both the data mining and machine learning communities. We refer to data with quality problems as dirty data. Since dirty data affect the accuracy of a data mining or machine learning task (e.g., classification, clustering, or regression), we need to know the relationship between the quality of the input data set and the accuracy of the results. Based on such a relationship, we could select an appropriate algorithm with data quality issues in mind and determine the share of data to clean.
Due to the large collection of classification, clustering, and regression algorithms, it is difficult for users to decide which algorithm to adopt. Knowledge of the effects of data quality on algorithms is helpful for algorithm selection. Therefore, studies of the impacts of dirty data are in urgent demand.
Before a classification, clustering, or regression task, data cleaning is necessary to guarantee data quality. Various data cleaning approaches have been proposed, e.g., data repairing with integrity constraints [1, 2], knowledge-based cleaning systems [3, 4], and crowdsourced data cleaning [3, 5]. These methods improve data quality dramatically, but the costs of data cleaning remain high [6]. If we knew how dirty data affect the accuracy of results, we could clean data selectively according to accuracy requirements instead of cleaning the entire dirty data set. As a result, data cleaning costs would be reduced. Therefore, the study of the relationship between data quality and the accuracy of results is in demand.
Unfortunately, little research has been conducted to explore the specific impacts of dirty data on different algorithms. This paper aims to fill this gap, which brings the following challenges.

Due to the great number of classification, clustering, and regression algorithms, the first challenge is how to choose algorithms for experiments.

Since existing measures (e.g., Precision, Recall, F-measure) are unable to quantify the fluctuation degrees of results, they are insufficient to evaluate the impacts of dirty data on algorithms. Thus, the second challenge is how to define new metrics for evaluation.

Since there is no well-planned dirty data benchmark, we have to generate data sets with the error type, error rate, data size, etc. in mind. Therefore, the third challenge is how to design data sets to test the impacts of dirty data.
Facing these challenges, this paper selects sixteen classical algorithms from the data mining and machine learning communities. We make comprehensive analyses of the possible dirty-data impacts on these algorithms. Then, we evaluate the specific effects of different types of dirty data on different algorithms. Based on the experimental results, we provide suggestions for algorithm selection and data cleaning. In the research field, dirty data are classified into a variety of types [7], such as missing data, inconsistent data, and conflicting data. Most existing research focuses on improving data quality for these three kinds of dirty data [8, 9, 10]. Thus, this paper focuses on these three main types.
In summary, our contributions in this paper are listed as follows.

We conduct an experimental comparison of the effects of missing, inconsistent, and conflicting data on classification, clustering, and regression algorithms, respectively. To the best of our knowledge, this is the first paper that studies this issue.

We introduce two novel metrics, one measuring the sensibility and one measuring the error tolerability of an algorithm, to evaluate dirty-data impacts on algorithms.

Based on the evaluation results, we provide guidelines on algorithm selection and data cleaning for users. We also give suggestions for future work to researchers and practitioners.
The rest of this paper is organized as follows. Dimensions of data quality are reviewed in Section 2. Section 3 analyzes dirty-data impacts on six classical classification algorithms. We discuss the impacts of dirty data on six clustering methods in Section 4. Section 5 introduces dirty-data impacts on four regression approaches. Our experiments are described in Section 6, and our guidelines and suggestions are presented in Section 7.
2 Dimensions of Data Quality
Data quality has many dimensions [11]. In this paper, we focus on three basic dimensions: completeness, consistency, and entity identity [7]. The corresponding dirty data types for these dimensions are missing data, inconsistent data, and conflicting data. In this section, we introduce these three kinds of dirty data.
Missing data refer to values that are absent from databases. For example, in Table 1, the values of t1[Country] and t2[City] are missing data.
Table 1: An example data set with dirty data

      Student No.  Name    City  Country
t1    170302       Alice   NYC
t2    170302       Steven        FR
t3    170304       Bob     NYC   U.S.A
t4    170304       Bob     LA    U.S.A
Inconsistent data are identified as violations of consistency rules, which describe semantic constraints on data. For example, a consistency rule "[Student No.] → [Name]" on Table 1 means that Student No. determines Name. As the table shows, t1[Student No.] = t2[Student No.], but t1[Name] ≠ t2[Name]. Thus, the values of t1[Name] and t2[Name] are inconsistent with the rule.
Conflicting data refer to different values that describe the same attribute of the same entity. For example, in Table 1, both t3 and t4 describe Bob's information, but t3[City] and t4[City] are different. Thus, t3[City] and t4[City] are conflicting data.
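To make the three dirty data types concrete, the sketch below checks them programmatically on Table 1. This is our own illustrative Python; the column names and helper functions are ours and not part of any cleaning system cited above.

```python
# Illustrative sketch: detecting the three dirty data types of Section 2
# on Table 1. Tuple order follows the running example (t1..t4).

table = [
    {"student_no": "170302", "name": "Alice",  "city": "NYC",  "country": None},    # t1
    {"student_no": "170302", "name": "Steven", "city": None,   "country": "FR"},    # t2
    {"student_no": "170304", "name": "Bob",    "city": "NYC",  "country": "U.S.A"}, # t3
    {"student_no": "170304", "name": "Bob",    "city": "LA",   "country": "U.S.A"}, # t4
]

def missing_cells(rows):
    """Missing data: values absent from the database (None here)."""
    return [(i, a) for i, r in enumerate(rows) for a, v in r.items() if v is None]

def fd_violations(rows, lhs, rhs):
    """Inconsistent data: record pairs violating the rule [lhs] -> [rhs]."""
    return [(i, j) for i, r in enumerate(rows) for j, s in enumerate(rows)
            if i < j and r[lhs] == s[lhs] and r[rhs] != s[rhs]]

def conflicts(rows, entity_keys, attr):
    """Conflicting data: different attr values for the same entity."""
    return [(i, j) for i, r in enumerate(rows) for j, s in enumerate(rows)
            if i < j and all(r[k] == s[k] for k in entity_keys)
            and None not in (r[attr], s[attr]) and r[attr] != s[attr]]
```

On Table 1, `missing_cells` reports t1[Country] and t2[City], `fd_violations` with the rule [Student No.] → [Name] reports the pair (t1, t2), and `conflicts` on City for records sharing (Student No., Name) reports (t3, t4).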
3 Classification Algorithms
In this section, we analyze possible dirty-data impacts on six classical classification algorithms: Decision Tree, K-Nearest Neighbor Classifier, Naive Bayes, Bayesian Network, Logistic Regression, and Random Forests. We choose these algorithms since they are commonly used as competitive classification algorithms [12, 13, 14, 15].
For simplicity, we define the notations used throughout this paper in Table 2.
Table 2: Notations

notation   definition
A_i        the ith attribute
C          the class/target attribute
c_j        the jth class label
d          the number of attributes
k          the number of class labels
D_train    the training set of the given data
D_test     the test set of the given data
3.1 Decision Tree
The decision tree [16] is built by recursively splitting the training data on the attribute that optimizes a certain criterion. To determine the best split, a measure of node impurity is needed. Popular measures include the Gini index [17], information gain [18], and misclassification error [18].
Given a node t, the Gini index for t is defined as follows.

$Gini(t) = 1 - \sum_{j=1}^{k} [p(c_j \mid t)]^2$  (1)

where p(c_j | t) is the relative frequency of class c_j at node t.
Entropy at node t is defined as follows.

$Entropy(t) = -\sum_{j=1}^{k} p(c_j \mid t) \log_2 p(c_j \mid t)$  (2)

where p(c_j | t) is the relative frequency of class c_j at node t.
Suppose a parent node p is split into s partitions. Based on entropy, the information gain for the split is defined as follows.

$GAIN_{split} = Entropy(p) - \sum_{i=1}^{s} \frac{n_i}{n} Entropy(i)$  (3)

where i is the ith partition, n_i is the number of records in partition i, and n is the number of records at p.
The misclassification error at node t is defined as follows.

$Error(t) = 1 - \max_{j} p(c_j \mid t)$  (4)

where p(c_j | t) is the relative frequency of class c_j at node t.
In decision tree induction, the attribute with the minimum Gini index, maximum information gain, or minimum misclassification error is chosen as the split attribute first. With the induced decision tree, the records in D_test are classified.
When some dirty values exist in D_train, the incorrect data could affect the value of p(c_j | t). Since p(c_j | t) determines the measure of node impurity (e.g., Gini(t), Entropy(t), Error(t)), the split attribute might be poorly chosen. Thus, the poorly induced decision tree could cause an inaccurate classification result. When some values of (A_1, A_2, …, A_d) in D_test are dirty, these data might lead a record down a wrong branch of the decision tree, which results in a wrong class label.
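The impurity measures in Eqs. (1)-(4) can be sketched as follows. This is an illustrative Python sketch operating on lists of class labels at a node; the function names are ours.

```python
import math
from collections import Counter

def gini(labels):
    """Eq. (1): 1 - sum_j p(c_j|t)^2 for the records at node t."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Eq. (2): -sum_j p(c_j|t) * log2 p(c_j|t)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(parent_labels, partitions):
    """Eq. (3): Entropy(parent) minus the size-weighted entropy of the partitions."""
    n = len(parent_labels)
    return entropy(parent_labels) - sum(len(p) / n * entropy(p) for p in partitions)

def misclass_error(labels):
    """Eq. (4): 1 - max_j p(c_j|t)."""
    return 1.0 - max(Counter(labels).values()) / len(labels)
```

For a node holding two records of each of two classes, the Gini index is 0.5, the entropy is 1 bit, and the misclassification error is 0.5; a split into two pure partitions yields an information gain of 1.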
3.2 K-Nearest Neighbor Classifier
The K-nearest neighbor classifier [19] (KNN for brief) requires three things: the training set D_train, a distance metric (e.g., Euclidean distance [20]) to compute the distance between records, and the value of K, the number of nearest neighbors to retrieve. To classify a record in D_test, we first compute its distances to the training records. Then, its K nearest neighbors are identified. Finally, we use the class labels of the K nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote).
Given two records x and y, the Euclidean distance between them is defined as follows.

$dist(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$  (5)

where x_i is the ith attribute value of x, and y_i is the ith attribute value of y.
When some dirty data exist in (A_1, A_2, …, A_d) of D_train or D_test, the value of dist(x, y) may change. Accordingly, the nearest-neighbor list would be affected, which may lead to a wrong class label. When some values of C in D_train are dirty, these incorrect labels might affect the vote on class labels, which causes a wrong classification result.
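The procedure above can be sketched as follows (our own illustrative Python, not the paper's C++ implementation):

```python
import math
from collections import Counter

def euclidean(x, y):
    """Eq. (5): the Euclidean distance between two records."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(train, record, k):
    """train is a list of (features, label) pairs; the class label of
    record is decided by a majority vote over its k nearest neighbors."""
    neighbors = sorted(train, key=lambda t: euclidean(t[0], record))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```

A dirty attribute value perturbs `euclidean` and hence the neighbor list; a dirty training label perturbs the majority vote, matching the two failure modes discussed above.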
3.3 Bayes Classifiers
The Bayes classifier is a probabilistic framework for solving classification problems. It computes the posterior probability P(C | A_1, …, A_d) for all values of C according to the Bayes theorem, as follows.

$P(C \mid A_1, \ldots, A_d) = \frac{P(A_1, \ldots, A_d \mid C)\, P(C)}{P(A_1, \ldots, A_d)}$  (6)

The goal of Bayes classifiers is to choose the value of C that maximizes P(C | A_1, …, A_d), which is equivalent to maximizing P(A_1, …, A_d | C) P(C). Naive Bayes and Bayesian network are classical Bayes classifiers.
Naive Bayes
The Naive Bayes classifier [21] assumes independence among the attributes when the class is given, i.e., P(A_1, …, A_d | c_j) = Π_{i=1}^{d} P(A_i | c_j). Since P(A_i | c_j) for all A_i and c_j can be estimated from D_train, each record in D_test is classified to the class c_j that maximizes P(c_j) Π_{i=1}^{d} P(A_i | c_j).
When some dirty values exist in (A_1, A_2, …, A_d) of D_train or D_test, the incorrect data could affect the values of P(A_i | c_j), which leads to wrong values of P(c_j) Π_i P(A_i | c_j). Since the maximal posterior determines the final class label, the classification result would be impacted. When some values of C in D_train are dirty, the values of both P(c_j) and P(A_i | c_j) may change. Accordingly, the posterior probability could be affected, which causes an incorrect class label.
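A minimal sketch of Naive Bayes training and classification by counting (illustrative Python; no smoothing is applied, and the names are ours):

```python
from collections import Counter, defaultdict

def train_nb(records, labels):
    """Estimate P(c_j) and P(A_i = v | c_j) by counting (no smoothing)."""
    prior = Counter(labels)
    cond = defaultdict(Counter)            # (class, attribute index) -> value counts
    for rec, y in zip(records, labels):
        for i, v in enumerate(rec):
            cond[(y, i)][v] += 1
    return prior, cond, len(labels)

def classify_nb(model, rec):
    """Pick the class maximizing P(c_j) * prod_i P(A_i | c_j)."""
    prior, cond, n = model
    best, best_p = None, -1.0
    for c, pc in prior.items():
        p = pc / n
        for i, v in enumerate(rec):
            p *= cond[(c, i)][v] / pc      # conditional relative frequency
        if p > best_p:
            best, best_p = c, p
    return best
```

Since every dirty attribute or label value changes some count, it directly perturbs the estimated P(c_j) or P(A_i | c_j), as discussed above.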
Bayesian Network
A Bayesian network [22] is a directed acyclic graph (DAG for brief) paired with conditional probability tables (CPT for brief). In the DAG, a node is conditionally independent of its non-descendants if its parents are known.
When the Bayesian network structure is fixed, we estimate conditional probabilities based on D_train and learn the CPT. When the Bayesian network structure is unknown, we first estimate the structure using the minimum description length principle [23], and then learn the Bayesian network and CPT. Based on the learned model, we make Bayesian network inference and compute the maximal posterior probability to determine the class labels of D_test.
Given D_train, the description length of a Bayesian network B is defined as follows.

$DL(B \mid D_{train}) = f(\theta)\,|B| - \sum_{i=1}^{n} \log P_B(x_i)$  (7)

where |B| is the number of parameters of B, f(θ) is the number of bits for describing a parameter, and x_i is the ith instance in D_train.
When some dirty values exist in D_train, the dirty data may affect the value of DL(B | D_train). Accordingly, the learned Bayesian network and CPT would be incorrect, which leads to inaccurate inference based on the network. Since posterior probabilities are computed with inference, the maximal posterior probability could be impacted, which results in a wrong class label. When some values of (A_1, A_2, …, A_d) in D_test are dirty, the wrong values might affect the estimation of the maximal posterior probability, which leads to an incorrect classification result.
3.4 Logistic Regression
Logistic regression [24] is a binary classifier. In order to establish a regression function (the Sigmoid function [25]) as the classification boundary, we use optimization methods (e.g., the gradient ascent method [26]) to determine the best regression coefficients of the function based on D_train. Once the regression function is constructed, we use it to classify the records in D_test.
Given input data x_1, x_2, …, x_d, the input of the Sigmoid function is computed as follows.

$z = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d$  (8)

The Sigmoid function is defined as follows.

$\sigma(z) = \frac{1}{1 + e^{-z}}$  (9)
When some dirty values exist in (A_1, A_2, …, A_d) of D_train or D_test, the wrong values could affect the value of z. Accordingly, the learned Sigmoid function would change, which leads to inaccurate class labels for D_test. When some values of C in D_train are dirty, the incorrect class labels may mislead the establishment of the Sigmoid function. Based on the imprecise function, the classification of D_test might be affected.
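Eqs. (8) and (9) can be sketched together with per-record gradient ascent as follows (an illustrative Python sketch; the learning rate, epoch count, and function names are our own choices):

```python
import math

def sigmoid(z):
    """Eq. (9)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=1000):
    """Fit weights by per-record gradient ascent on the log-likelihood.
    w[0] is the intercept; w[1:] multiply the attributes (Eq. (8))."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = yi - sigmoid(z)          # gradient of the log-likelihood
            w[0] += lr * err
            for j, xj in enumerate(xi):
                w[j + 1] += lr * err * xj
    return w

def predict(w, xi):
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1 if sigmoid(z) >= 0.5 else 0
```

A dirty attribute value enters every update through `err * xj`, so even a few wrong values can shift the learned boundary, consistent with the sensitivity discussed above.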
3.5 Random Forests
In the training process, the random forests algorithm [27] constructs a set of base classifiers, each a decision tree. In the testing process, the class labels of D_test are predicted by aggregating the predictions made by the multiple base classifiers.
When some dirty values exist in D_train, the incorrect data might affect the selection of splitting attributes in the decision trees, which causes inaccurate decision tree induction and wrong predictions on D_test. When some values of (A_1, A_2, …, A_d) in D_test are dirty, these data might mislead the class labels of the corresponding records, which causes incorrect classification.
4 Clustering Algorithms
In this section, we discuss possible dirty-data impacts on six classical clustering algorithms: K-Means, LVQ, CLARANS, DBSCAN, BIRCH, and CURE. We choose these algorithms since they are commonly used as competitive clustering algorithms [28, 29, 30, 31].
4.1 Prototype-Based Clustering
Prototype-based clustering assumes that the clustering structure is portrayed by a group of prototypes. This kind of algorithm initializes the prototypes first, and then updates them iteratively. K-Means [32], learning vector quantization [33] (LVQ for brief), and CLARANS [34] are three classical clustering methods of this kind.
K-Means

The K-Means clustering approach first selects k points as the initial centroids. Then, we form k clusters by assigning every point to its closest centroid, and recompute the centroid of each cluster. The iterative process ends when the centroids no longer change.
When some dirty values exist, the incorrect data might affect the computation of the centroids. Accordingly, some points would be assigned to wrong clusters.
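The K-Means iteration described above can be sketched as follows (illustrative Python; taking the first k points as initial centroids is a simplification of random initialization):

```python
def kmeans(points, k, iters=100):
    """Assign every point to its closest centroid, recompute centroids,
    and stop when the centroids no longer change."""
    centroids = list(points[:k])           # simplified initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        new = [tuple(sum(v) / len(v) for v in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:               # centroids no longer change
            break
        centroids = new
    return centroids, clusters
```

Since every point contributes to its cluster's mean, a single dirty coordinate shifts a centroid and can flip the assignment of nearby points.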
LVQ
The LVQ clustering method assumes that there are class labels in the data samples, and these marked labels can assist clustering. Given a sample set D = {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)}, each x_j (1 ≤ j ≤ m) is a feature vector (x_{j1}; x_{j2}; …; x_{jn}) expressed with n attributes, and y_j is the class label of x_j. The goal of LVQ is to learn a group of n-dimensional prototype vectors {p_1, p_2, …, p_q}, where each vector denotes a cluster and carries a cluster label.
First, the LVQ algorithm initializes the prototype vectors. Then, the vectors are optimized in an iterative process. In each iteration, the algorithm randomly selects a marked training sample, and finds the prototype vector with the shortest distance to the selected sample. The prototype vector is updated: it is moved toward the sample if their labels are the same, and away from it otherwise.
When some values in x_j (j = 1, 2, …, m) of D are dirty, the wrong values could mislead the updating of the prototype vectors {p_1, p_2, …, p_q}. When some dirty values exist in y_j (j = 1, 2, …, m) of D, the incorrect class labels would directly affect the labels of {p_1, p_2, …, p_q}. When some values in p_i (i = 1, 2, …, q) of {p_1, p_2, …, p_q} are dirty, the wrong values might impact the distance computation, which leads to an incorrect cluster label.
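A minimal sketch of the LVQ update rule (illustrative Python; the attract/repel step follows the standard LVQ formulation, and the names are ours):

```python
import random

def lvq(samples, prototypes, proto_labels, lr=0.1, iters=100, seed=0):
    """samples: list of (features, label); prototypes: initial vectors,
    one per cluster, carrying labels proto_labels. Each iteration picks
    one random sample and updates its nearest prototype: moved toward
    the sample if the labels match, away from it otherwise."""
    rng = random.Random(seed)
    protos = [list(p) for p in prototypes]
    for _ in range(iters):
        x, y = rng.choice(samples)
        i = min(range(len(protos)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, protos[j])))
        sign = 1.0 if proto_labels[i] == y else -1.0
        protos[i] = [p + sign * lr * (a - p) for p, a in zip(protos[i], x)]
    return protos
```

A dirty feature value x_j drags the nearest prototype in the wrong direction, and a dirty label y_j flips the attract/repel sign, matching the two impacts described above.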
CLARANS
The CLARANS algorithm first selects k points as centroids. Then, we randomly choose a centroid as the current point, together with one of its neighbor points. We compute the cost difference between them. If the cost of the neighbor is less, we set the neighbor as the current point and select one of its neighbors. If not, we find another neighbor of the current point. The iteration ends when the given number of samplings is reached.
When some dirty values exist, the incorrect data might affect the computation of the cost difference. Accordingly, some points could be assigned to wrong clusters.
4.2 Density-Based Clustering

Density-based clustering locates regions of high density that are separated from one another by regions of low density. DBSCAN [35] is a basic density-based clustering algorithm. In DBSCAN, all noise points are discarded, and clustering is performed on the remaining points. First, we put an edge between all pairs of core points that are within Eps (a specified radius) of each other. Then, each connected component of core points is taken as a separate cluster, and each border point is assigned to one of the clusters of its associated core points.
When some values are dirty, the wrong values would impact the computation of density and point distances, since the density of a point is the number of points within Eps of it. Accordingly, some points might be assigned to incorrect clusters.
4.3 Hierarchical Clustering
Hierarchical clustering produces a set of nested clusters organized as a hierarchical tree. BIRCH [36] and CURE [37] are classical hierarchical algorithms.
BIRCH
The BIRCH algorithm introduces a clustering feature tree (CF tree for brief) to summarize the inherent clustering structure of the data. First, we scan the given data to build an initial in-memory CF tree. Then, we use an arbitrary clustering algorithm (e.g., an agglomerative hierarchical clustering algorithm) to cluster the leaf nodes of the CF tree. Finally, we scan the data again and assign the data points, using the cluster centroids found in the previous step as seeds.
When some dirty data exist, the incorrect values could affect the construction of the CF tree and the computation of the cluster centroids, which causes data points to be assigned to wrong clusters.
CURE
Instead of representing clusters by their centroids, CURE algorithm uses a collection of representative points. There are three steps in this method. The first step is initialization. At first, we take a small sample of the given data and cluster it in main memory using a hierarchical method in which clusters are merged when they have a close pair of points (e.g., MIN clustering method). Then, we select a small set of points from each cluster to be representative points. These points should be chosen to be as far from one another as possible. Lastly, we move each of the representative points a fixed fraction of the distance between its location and the centroid of its cluster.
The second step is merging clusters. We merge two clusters if they have a pair of representative points, one from each cluster, that are sufficiently close.
The third step is point assignment. Each point p is brought from secondary storage and compared with the representative points. We assign p to the cluster of the representative point that is closest to p.
When some values in the given data are dirty, the wrong values would affect the locations of the representative points and the computation of the distance between a representative point and the centroid of its cluster. Accordingly, data points might be clustered into incorrect clusters.
5 Regression Algorithms
In this section, we analyze possible dirty-data impacts on four classical regression algorithms: Least Square Linear Regression, Maximum Likelihood Linear Regression, Polynomial Regression, and Stepwise Regression. We choose these algorithms since they are commonly used as competitive regression algorithms [38, 39, 40, 41].
5.1 Linear Regression
Given a data set D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, each x_i = (x_{i1}; x_{i2}; …; x_{id}) (1 ≤ i ≤ n). Linear regression learns a linear model f(x_i) = w x_i + b, s.t. f(x_i) ≈ y_i. With the model, we can predict real-value labels as accurately as possible. The least square method [42] and the maximum likelihood method [43] are classical approaches to solve linear models.
Least Square Method
Least square linear regression minimizes the mean-square error to solve linear models. We compute the values of the parameters w and b as follows.

$(w^*, b^*) = \arg\min_{(w,b)} \sum_{i=1}^{n} (y_i - w x_i - b)^2$  (10)

After the parameters w and b are determined, a linear model is learned. Then, we predict the real-value labels in D_test with the model.
When some dirty values exist in D_train, the incorrect values could affect the values of w and b. Accordingly, the linear model would be affected, which leads to inaccurate prediction of the labels in D_test. When some data in (x_{i1}, x_{i2}, …, x_{id}) of D_test are dirty, the computation of f(x_i) might be affected.
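For a single feature, the minimization in Eq. (10) has the closed form sketched below (illustrative Python; the formulas are the standard closed-form solution for simple linear regression):

```python
def least_squares(xs, ys):
    """Closed-form solution minimizing sum_i (y_i - w*x_i - b)^2 for a
    single feature; returns (w, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    denom = sum(x * x for x in xs) - sum(xs) ** 2 / n
    w = sum(y * (x - x_bar) for x, y in zip(xs, ys)) / denom
    b = sum(y - w * x for x, y in zip(xs, ys)) / n
    return w, b
```

Both w and b are sums over all training records, so a single dirty x or y value shifts the fitted line for every prediction, which is why least squares is affected by even localized errors.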
Maximum Likelihood Method
Maximum likelihood linear regression computes the parameters with a likelihood function, which is defined as follows.

$L(w, b) = \prod_{i=1}^{n} p(y_i \mid x_i; w, b)$  (11)

The goal of the maximum likelihood method is to compute the w and b that maximize L(w, b). After the parameters w and b are determined, a linear model is learned. Then, we predict the real-value labels in D_test with the model.
When some data in D_train are dirty, the incorrect values could affect the computation of w and b. Accordingly, the linear model would be impacted, which leads to inaccurate prediction of the labels in D_test. When some dirty values exist in (x_{i1}, x_{i2}, …, x_{id}) of D_test, the prediction of f(x_i) might be affected.
5.2 Polynomial Regression
Polynomial regression [44] constructs a polynomial as a linear combination that converts a feature to higher-order features. For instance, the linear model y = w_1 x + b can be transformed as follows.

$y = w_1 x + w_2 x^2 + \cdots + w_p x^p + b$  (12)

According to D_train, the values of all the parameters can be computed. After the polynomial regression model is determined, we predict the real-value labels in D_test with the model.
When some dirty values exist in D_train, the wrong values could affect the computation of the parameters. Accordingly, the polynomial model would be impacted, which leads to inaccurate prediction of the labels in D_test. When some data in (x_{i1}, x_{i2}, …, x_{id}) of D_test are dirty, the prediction might be affected.
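The transformation above reduces polynomial regression to linear least squares on expanded features, which can be sketched as follows (illustrative Python; solving the normal equations with Gaussian elimination is our own choice):

```python
def poly_features(x, degree):
    """Map a single feature x to (x, x^2, ..., x^degree)."""
    return [x ** p for p in range(1, degree + 1)]

def fit_poly(xs, ys, degree):
    """Least squares on the expanded features via the normal equations
    (X^T X) w = X^T y, solved by Gaussian elimination with partial
    pivoting. Returns the coefficients [b, w_1, ..., w_degree]."""
    X = [[1.0] + poly_features(x, degree) for x in xs]
    m = degree + 1
    A = [[sum(r[i] * r[j] for r in X) for j in range(m)] for i in range(m)]
    v = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(m)]
    for i in range(m):                        # forward elimination
        p = max(range(i, m), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        v[i], v[p] = v[p], v[i]
        for r in range(i + 1, m):
            f = A[r][i] / A[i][i]
            for c in range(i, m):
                A[r][c] -= f * A[i][c]
            v[r] -= f * v[i]
    w = [0.0] * m
    for i in reversed(range(m)):              # back substitution
        w[i] = (v[i] - sum(A[i][c] * w[c] for c in range(i + 1, m))) / A[i][i]
    return w
```

Because dirty feature values are raised to higher powers before fitting, their impact is amplified in the expanded features, which is consistent with polynomial regression's sensitivity observed later in the experiments.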
5.3 Stepwise Regression
Stepwise regression [45] introduces independent variables one by one. The partial regression sum of squares of each introduced variable must be significant. After each new variable is added, the old variables in the regression model need to be tested one by one, and variables that are tested to be insignificant are deleted. When no more new variables can be introduced, the iteration stops.
When some data in D_train are dirty, the incorrect values could affect the computation of the partial regression sum of squares. Accordingly, the testing and selection of independent variables would be impacted, which leads to a biased regression model and inaccurate prediction of the labels in D_test. When some dirty values exist in (x_{i1}, x_{i2}, …, x_{id}) of D_test, the computation of the predictions might be affected.
6 Experimental Study
We evaluated the dirty-data impacts on the sixteen classical algorithms discussed in Sections 3, 4, and 5. All datasets and source codes are available.
6.1 Experimental Setting
Datasets We selected 13 typical data sets from the UCI public datasets, summarized in Table 3.
Table 3: Datasets used in the experiments

Name         Number of Attributes  Number of Records  Algorithm
Iris         4                     150                Classification / Clustering / Regression
Ecoli        8                     336                Classification
Car          6                     1728               Classification
Chess        36                    3196               Classification
Adult        14                    48842              Classification
Seeds        7                     210                Clustering
Abalone      8                     4177               Clustering
HTRU         9                     17898              Clustering
Activity     3                     67651              Clustering
Servo        4                     167                Regression
Housing      14                    506                Regression
Concrete     9                     1030               Regression
Solar Flare  10                    1389               Regression
Setup All experiments were conducted on a machine powered by two Intel(R) Xeon(R) E5-2609 v3 @ 1.90GHz CPUs and 32GB memory, under CentOS 7. All the algorithms were implemented in C++ and compiled with g++ 4.8.5.
Metrics First, we used the standard Precision, Recall, and F-measure to evaluate the effectiveness of the classification and clustering algorithms. These measures were computed as follows.

$Precision = \frac{n_{c_j}^{correct}}{n_{c_j}^{classified}}$  (13)

where n^{correct}_{c_j} is the number of records correctly classified to class c_j, and n^{classified}_{c_j} is the number of records classified to class c_j.

$Recall = \frac{n_{c_j}^{correct}}{n_{c_j}}$  (14)

where n^{correct}_{c_j} is the number of records correctly classified to class c_j, and n_{c_j} is the number of records of class c_j.

$F\text{-}measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$  (15)
Also, we used RMSD (Root-mean-square Deviation), NRMSD (Normalized Root-mean-square Deviation), and CV(RMSD) (Coefficient of Variation of the RMSD) to evaluate the effectiveness of the regression algorithms. These measures were computed as follows.

$RMSD = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}}$  (16)

where n is the number of predicted records, and \hat{y}_i is the predicted value of the ith record.

$NRMSD = \frac{RMSD}{\hat{y}_{max} - \hat{y}_{min}}$  (17)

where \hat{y}_{max} is the maximal predicted value, and \hat{y}_{min} is the minimal predicted value.

$CV(RMSD) = \frac{RMSD}{\bar{y}}$  (18)

where \bar{y} is the mean of the predicted values.
However, these metrics only show the variations of accuracy; they cannot measure the fluctuation degrees quantitatively. Therefore, we introduced novel metrics to evaluate dirty-data impacts on algorithms. We define the first metric as follows.
Definition 1

Given the values f(a%), f((a+x)%), f((a+2x)%), …, f((a+bx)%) of a measure f of an algorithm under a%, (a+x)%, (a+2x)%, …, (a+bx)% (a ≥ 0, x > 0, b > 0) error rates, the sensibility of the algorithm on dirty data is computed as |f(a%) - f((a+x)%)| + |f((a+x)%) - f((a+2x)%)| + … + |f((a+(b-1)x)%) - f((a+bx)%)|.
This metric measures the sensibility of an algorithm to data quality. The larger its value is, the more sensitive the algorithm is to data quality. Therefore, it shows the fluctuation degrees of algorithms quantitatively. Here, we explain its computation with Figure 1 as an example.
Example 1
Since the values of Precision of the decision tree algorithm with 0%, 10%, …, 50% missing rates are given, the sensibility is computed as |f(0%) - f(10%)| + |f(10%) - f(20%)| + … + |f(40%) - f(50%)|. Thus, on the Iris dataset, the sensibility is |78.37% - 84.16%| + |84.16% - 78.08%| + |78.08% - 74.36%| + |74.36% - 64.99%| + |64.99% - 58.71%| = 31.24%. On the Ecoli dataset, it is |63.47% - 62.93%| + |62.93% - 53.97%| + |53.97% - 50.93%| + |50.93% - 48.07%| + |48.07% - 34.5%| = 28.97%. On the Car dataset, it is |81.33% - 60.93%| + |60.93% - 43.7%| + |43.7% - 42.87%| + |42.87% - 40.47%| + |40.47% - 35.47%| = 45.86%. On the Chess dataset, it is |82.17% - 78.17%| + |78.17% - 76.53%| + |76.53% - 75.77%| + |75.77% - 75.9%| + |75.9% - 75.57%| = 6.86%. And on the Adult dataset, it is |80.5% - 75.27%| + |75.27% - 71.3%| + |71.3% - 72.93%| + |72.93% - 71.53%| + |71.53% - 67.23%| = 16.53%. Thus, the average sensibility is 25.89%.
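The computation in Example 1 can be sketched as follows (illustrative Python; the function name is ours):

```python
def sensibility(values):
    """Definition 1: the sum of absolute differences between the measure
    values at consecutive error rates."""
    return sum(abs(values[i] - values[i - 1]) for i in range(1, len(values)))
```

Feeding in the Precision values of the decision tree on Iris (78.37, 84.16, 78.08, 74.36, 64.99, 58.71) reproduces the 31.24% computed above, and the Chess values reproduce 6.86%.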
Though the sensibility tells the fluctuation degrees of algorithms, it cannot determine the error rate at which an algorithm becomes unacceptable. Motivated by this, we define the second novel metric as follows.
Definition 2
Given the values f(a%), f((a+x)%), f((a+2x)%), …, f((a+bx)%) of a measure f of an algorithm under a%, (a+x)%, (a+2x)%, …, (a+bx)% (a ≥ 0, x > 0, b > 0) error rates, and a threshold θ (θ > 0). If a larger value of f means better accuracy and f(a%) - f((a+ix)%) > θ for some i (0 < i ≤ b), we take (a+(i-1)x)% for the minimal such i as the tolerability. If f(a%) - f((a+ix)%) ≤ θ for all i, we take (a+bx)% as the tolerability. If a smaller value of f means better accuracy, the definition is symmetric, with f((a+ix)%) - f(a%) compared against θ.
This metric measures the dirty-data tolerability of an algorithm. The larger its value is, the higher the error tolerability of the algorithm is. Therefore, it shows the error rate up to which an algorithm remains acceptable. Here, we take Figure 1 as an example to explain the tolerability of an algorithm.
Example 2
We know the values of Precision of the decision tree algorithm with 0%, 10%, …, 50% missing rates, and set the threshold to 10%. On the Iris dataset, when the missing rate is 40%, 78.37% - 64.99% = 13.38% > 10%, so we take 30% as the tolerability. On the Ecoli dataset, when the missing rate is 30%, 63.47% - 50.93% = 12.54% > 10%, so we take 20% as the tolerability. On the Car dataset, when the missing rate is 10%, 81.33% - 60.93% = 20.4% > 10%, so we take 0% as the tolerability. On the Chess dataset, even when the missing rate is 50%, 82.17% - 75.57% = 6.6% ≤ 10%, so we take 50% as the tolerability. On the Adult dataset, when the missing rate is 50%, 80.5% - 67.23% = 13.27% > 10%, so we take 40% as the tolerability. Thus, the average tolerability is 28%.
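Similarly, the computation in Example 2 can be sketched as follows (illustrative Python; the function name is ours):

```python
def tolerability(rates, values, theta, larger_is_better=True):
    """Definition 2: the last error rate before the measure degrades by
    more than theta relative to the baseline values[0]; if it never
    does, the largest tested error rate."""
    for i in range(1, len(values)):
        drop = values[0] - values[i] if larger_is_better else values[i] - values[0]
        if drop > theta:
            return rates[i - 1]
    return rates[-1]
```

With rates (0, 10, …, 50) and a threshold of 10, the Iris Precision values yield 30, the Chess values 50, and the Adult values 40, as in Example 2.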
In addition, we used running time to evaluate the efficiency of all algorithms. We ran each test 5 times and reported logarithms of average time.
6.2 Evaluation on Classification Algorithms
Since various kinds of dirty data could affect the performance of classification algorithms, we varied the error rates, including the missing rate, inconsistent rate, and conflicting rate, to evaluate the classification methods of Section 3.
Table 4: Sensibility values (%) of the classification and clustering algorithms

                     Missing              Inconsistent         Conflicting
Algorithm            P      R      F      P      R      F      P      R      F
Decision Tree        25.89  31.11  26.64  35.41  40.94  38.33  16.09  21.56  16.45
KNN                  18.09  13.18  17.45  21.84  19.21  20.93  11.39  6.70   9.32
Naive Bayes          27.04  23.37  26.40  29.48  37.18  35.49  15.10  21.85  20.33
Bayesian Network     46.40  34.04  35.37  33.29  21.53  23.15  17.26  15.18  16.01
Logistic Regression  38.26  18.73  30.69  37.84  28.10  38.83  31.74  18.51  25.60
Random Forests       25.77  24.57  29.39  39.21  34.86  40.74  27.93  15.85  27.53
K-Means              31.06  27.80  32.08  31.83  32.21  35.63  23.79  21.86  25.17
LVQ                  11.94  21.14  19.61  20.55  18.83  21.41  9.20   19.57  20.13
CLARANS              34.26  40.16  39.48  31.11  29.45  31.56  20.67  22.64  24.04
DBSCAN               15.89  22.88  17.16  20.40  10.39  12.34  18.64  9.55   16.10
BIRCH                32.58  44.56  32.90  24.32  22.48  19.40  15.16  22.44  16.52
CURE                 38.68  32.71  39.23  28.81  32.90  32.67  32.74  29.11  32.62
Classification: Varying Missing Rate
To evaluate the impacts of missing data on classification algorithms, we deleted values from the original datasets randomly, varying the missing rate from 10% to 50%. We used 10-fold cross validation, and generated training data and testing data randomly. In the testing process, we imputed numerical missing values with the average values and categorical ones with the most frequent values. The experimental results are depicted in Figures 1, 2, 3, 4, 5, and 6.
Based on the results, we had the following observations. First, for well-performing algorithms whose Precision/Recall/F-measure is larger than 80% on the original datasets, as the data size increases, the Precision, Recall, or F-measure of the algorithms becomes stable, except for Logistic Regression. The reason is that the amount of clean data is larger for a larger data size. Accordingly, the impacts of missing data on the algorithms are reduced. However, Logistic Regression establishes a regression function as its model, and the parameter computation of a regression function is more sensitive to missing data. Thus, when the data size rises, the amount of missing data becomes larger, which has larger impacts on Logistic Regression.
Second, as shown in Table 4, for Precision, the order is "Bayesian Network > Logistic Regression > Naive Bayes > Decision Tree > Random Forests > KNN". For Recall, the order is "Bayesian Network > Decision Tree > Random Forests > Naive Bayes > Logistic Regression > KNN". For F-measure, the order is "Bayesian Network > Logistic Regression > Random Forests > Decision Tree > Naive Bayes > KNN". Thus, the least sensitive algorithm is KNN. This is because, as the missing rate rises, the increasing missing values may not affect the nearest neighbors. Even if the nearest neighbors are affected, they are not necessarily decisive in the vote for the final class label. In addition, the most sensitive algorithm is Bayesian Network. The reason is that the increasing missing data could affect the computation of the posterior probabilities, which would directly impact the classification results.
Third, as shown in Table 5, for Precision, the order is "Decision Tree > Naive Bayes > Random Forests > KNN > Logistic Regression > Bayesian Network". For Recall, the order is "Random Forests > KNN > Naive Bayes > Logistic Regression > Decision Tree > Bayesian Network". For F-measure, the order is "Decision Tree > Naive Bayes > Bayesian Network > KNN > Logistic Regression > Random Forests". Therefore, for Precision and F-measure, the most incompleteness-tolerant algorithm is Decision Tree. This is because decision tree models only use the splitting features for classification. As the missing rate rises, the increasing missing data may not affect the splitting features. For Recall, the most incompleteness-tolerant algorithm is Random Forests. This is because the increasing missing values may not affect the splitting attributes, and even if they are impacted, there is little chance of an inaccurate classification, since the final result is made by multiple base classifiers. For Precision and Recall, the least incompleteness-tolerant algorithm is Bayesian Network. This is because the increasing missing data would change the posterior probabilities, which could affect the classification results directly. For F-measure, the least incompleteness-tolerant algorithm is Random Forests. The reason is that its F-measure on the original datasets (error rate 0%) is high, so when even a few missing values exist in the datasets, the F-measure drops a lot.
Fourth, as the data size increases, the running time of the algorithms fluctuates more. This is because, as the data size rises, the amount of missing data becomes larger, which introduces more uncertainty into the algorithms. Accordingly, the uncertainty of the running time increases.
Table 5: Tolerability values (%) of the classification and clustering algorithms

                     Missing        Inconsistent   Conflicting
Algorithm            P    R    F    P    R    F    P    R    F
Decision Tree        28   26   28   18   16   16   50   50   50
KNN                  24   32   20   22   22   22   40   50   40
Naive Bayes          26   28   24   22   12   12   50   40   40
Bayesian Network     20   26   24   16   26   26   46   50   50
Logistic Regression  22   28   16   16   14   16   32   34   32
Random Forests       26   50   10   26   14   8    42   38   34
K-Means              38   32   32   28   22   22   44   38   38
LVQ                  44   40   48   28   14   20   44   44   40
CLARANS              2    2    0    22   18   18   34   34   28
DBSCAN               30   40   30   32   44   34   36   50   36
BIRCH                24   20   24   20   26   26   50   34   38
CURE                 18   18   16   20   18   16   32   34   24
Table 6: Fluctuation degrees of regression algorithms in RMSD, NRMSD, and CV(RMSD) under each error type (larger means more sensitive).

| Algorithm | Missing RMSD | Missing NRMSD | Missing CV(RMSD) | Inconsistent RMSD | Inconsistent NRMSD | Inconsistent CV(RMSD) | Conflicting RMSD | Conflicting NRMSD | Conflicting CV(RMSD) |
|---|---|---|---|---|---|---|---|---|---|
| Least Square | 0.662 | 0.066 | 0.192 | 1.204 | 0.090 | 0.278 | 1.056 | 0.054 | 0.284 |
| Maximum Likelihood | 1.356 | 0.034 | 4.466 | 2.384 | 0.046 | 1.546 | 1.534 | 0.060 | 3.914 |
| Polynomial Regression | 1.568 | 0.106 | 0.200 | 2.010 | 0.174 | 0.464 | 1.794 | 0.116 | 0.426 |
| Stepwise Regression | 1.338 | 0.104 | 3.466 | 1.616 | 0.102 | 5.996 | 0.890 | 0.074 | 2.748 |
Table 7: Error-tolerance thresholds (%) of regression algorithms for RMSD, NRMSD, and CV(RMSD) under each error type (higher means more tolerant).

| Algorithm | Missing RMSD | Missing NRMSD | Missing CV(RMSD) | Inconsistent RMSD | Inconsistent NRMSD | Inconsistent CV(RMSD) | Conflicting RMSD | Conflicting NRMSD | Conflicting CV(RMSD) |
|---|---|---|---|---|---|---|---|---|---|
| Least Square | 30 | 50 | 50 | 16 | 50 | 50 | 32 | 50 | 42 |
| Maximum Likelihood | 22 | 50 | 40 | 16 | 50 | 34 | 20 | 50 | 40 |
| Polynomial Regression | 14 | 40 | 20 | 12 | 50 | 22 | 12 | 40 | 26 |
| Stepwise Regression | 16 | 40 | 24 | 10 | 42 | 22 | 40 | 50 | 30 |
6.2.2 Classification: Varying Inconsistent Rate
To evaluate the impacts of inconsistency on classification algorithms, we randomly injected inconsistent values into the original datasets according to the consistency rules on the given data, varying the inconsistent rate from 10% to 50%. We used 10-fold cross validation and generated training and testing data randomly. Experimental results are depicted in Figures 7, 8, 9, 10, 11, and 12.
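The random injection step can be sketched as follows. This is a minimal illustration rather than the paper's exact protocol: `inject_inconsistent` and `corrupt` are hypothetical names, and a real run would draw rule-violating replacements from the dataset's consistency rules.

```python
import random

def inject_inconsistent(rows, rate, corrupt, seed=0):
    """Replace a `rate` share of cells with rule-violating values.

    `corrupt` maps a clean value to one that breaks a consistency
    rule (hypothetical hook; the paper derives these from its rules).
    """
    rng = random.Random(seed)
    rows = [list(r) for r in rows]            # copy; keep originals clean
    n_cols = len(rows[0])
    n_cells = len(rows) * n_cols
    for t in rng.sample(range(n_cells), int(rate * n_cells)):
        i, j = divmod(t, n_cols)
        rows[i][j] = corrupt(rows[i][j])
    return rows

# toy rule "zip determines city": corrupt() swaps in a mismatched city
dirty = inject_inconsistent([["10001", "NYC"], ["60601", "Chicago"]],
                            rate=0.5, corrupt=lambda v: "Mismatch")
```

The corrupted copy then feeds the same 10-fold cross-validation pipeline as the clean data.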
Based on the results, we had the following observations. First, as shown in Table 4, for Precision, the order is "Random Forests > Logistic Regression > Decision Tree > Bayesian Network > Naive Bayes > KNN". For Recall, the order is "Decision Tree > Naive Bayes > Random Forests > Logistic Regression > Bayesian Network > KNN". For F-measure, the order is "Random Forests > Logistic Regression > Decision Tree > Naive Bayes > Bayesian Network > KNN". Thus, the least sensitive algorithm is KNN; the reason is similar to that for the least sensitive algorithm when varying the missing rate. For Precision and F-measure, the most sensitive algorithm is Random Forests, and for Recall it is Decision Tree. This is because, as the inconsistent rate increases, more and more incorrect values cover the correct ones in the decision tree training models, which leads to inaccurate classification results; since the base classifiers in Random Forests are decision trees, the same reason applies to Random Forests.
Second, as shown in Table 5, for Precision, the order is "Random Forests > KNN > Naive Bayes > Decision Tree > Bayesian Network > Logistic Regression". For Recall, the order is "Bayesian Network > KNN > Decision Tree > Logistic Regression > Random Forests > Naive Bayes". For F-measure, the order is "Bayesian Network > KNN > Decision Tree > Logistic Regression > Naive Bayes > Random Forests". Therefore, for Precision, the most inconsistency-tolerant algorithm is Random Forests; the reason has been discussed in Section 6.2.1. For Recall and F-measure, the most inconsistency-tolerant algorithm is Bayesian Network. This is because inconsistent values contain both incorrect and correct ones, so the incorrect values have little effect on the computation of posterior probabilities, and the classification results may remain unaffected. For Precision, the least inconsistency-tolerant algorithms are Bayesian Network and Logistic Regression; for Recall, it is Naive Bayes; and for F-measure, it is Random Forests. This is because the Precision/Recall/F-measure of these algorithms on the original datasets (error rate 0%) is high, so even a few injected inconsistent values make it drop dramatically.
Third, the observation about running time made in the experiments varying the missing rate still held when the inconsistent rate was varied.
6.2.3 Classification: Varying Conflicting Rate
To evaluate the impacts of conflicting data on classification algorithms, we randomly injected conflicting values into the original datasets, varying the conflicting rate from 10% to 50%. We used 10-fold cross validation and generated training and testing data randomly. Experimental results are depicted in Figures 13, 14, 15, 16, 17, and 18.
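Conflicting data are commonly modeled as records that describe the same entity but disagree on an attribute value. Under that assumption (the paper's exact injection protocol may differ, and `inject_conflicts` is a hypothetical name), the injection can be sketched as:

```python
import random

def inject_conflicts(rows, rate, seed=0):
    """For a `rate` share of rows, append a duplicate record whose
    last attribute disagrees with the original record's value."""
    rng = random.Random(seed)
    out = [list(r) for r in rows]
    for i in rng.sample(range(len(rows)), int(rate * len(rows))):
        dup = list(rows[i])
        dup[-1] = "conflict_" + str(dup[-1])  # same entity, different value
        out.append(dup)
    return out

conflicted = inject_conflicts([["e1", "a"], ["e2", "b"],
                               ["e3", "c"], ["e4", "d"]], rate=0.5)
```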
First, the observation about the relationship between data size and algorithm stability made in the experiments varying the missing rate still held when the conflicting rate was varied.
Second, as shown in Table 4, for Precision, the order is "Logistic Regression > Random Forests > Bayesian Network > Decision Tree > Naive Bayes > KNN". For Recall, the order is "Naive Bayes > Decision Tree > Logistic Regression > Random Forests > Bayesian Network > KNN". For F-measure, the order is "Random Forests > Logistic Regression > Naive Bayes > Decision Tree > Bayesian Network > KNN". Thus, the least sensitive algorithm is KNN; the reason is similar to that for the least sensitive algorithm when varying the missing rate. For Precision, the most sensitive algorithm is Logistic Regression, because the parameter computation of the regression function is easily affected by the increasing conflicting values, which yields an inaccurate logistic regression model. For Recall, the most sensitive algorithm is Naive Bayes, because the incorrect values among the increasing conflicting values affect the computation of posterior probabilities in Bayes' theorem. For F-measure, the most sensitive algorithm is Random Forests, for the same reason as the most sensitive algorithm when varying the inconsistent rate.
Third, as shown in Table 5, for Precision, the order is "Decision Tree > Naive Bayes > Bayesian Network > Random Forests > KNN > Logistic Regression". For Recall, the order is "Decision Tree > KNN > Bayesian Network > Naive Bayes > Random Forests > Logistic Regression". For F-measure, the order is "Decision Tree > Bayesian Network > KNN > Naive Bayes > Random Forests > Logistic Regression". Therefore, the most conflict-tolerant algorithm is Decision Tree; the reason is similar to that for the most incompleteness-tolerant algorithm. The least conflict-tolerant algorithm is Logistic Regression, since conflicting data strongly affect the parameter computation of logistic regression models.
Fourth, the observation about running time made in the experiments varying the missing rate still held when the conflicting rate was varied.
6.2.4 Discussion
In the classification experiments, we first found that dirty-data impacts depend on both the error type and the error rate; thus, the error rate of each error type in the given data needs to be detected. Second, we observed that for algorithms whose Precision/Recall/F-measure is larger than 80% on the original datasets, the Precision, Recall, or F-measure becomes stable as data size rises, except for Logistic Regression. Since the tolerance parameter was set to 10%, candidate algorithms whose Precision/Recall/F-measure is larger than 70% are acceptable. Third, we prefer stable algorithms; hence, Logistic Regression is suitable for smaller data sizes. Fourth, we compared the fluctuation degrees of the classification algorithms in our experiments: when dirty data exist, the algorithm with the smallest fluctuation degree is the most stable. Fifth, beyond the tolerance threshold, the accuracy of the selected algorithm becomes unacceptable; thus, the error rate of each type needs to be controlled within its threshold.
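The fluctuation-degree comparison in the fourth point can be made concrete with a simple stability proxy. The sketch below scores each algorithm by the population standard deviation of its F-measure across the tested error rates (one reasonable reading of "fluctuation degree"; the paper does not spell out the formula, and the numbers are illustrative, not its measurements):

```python
from statistics import pstdev

def fluctuation_degree(scores):
    """Spread of a metric measured at error rates 10%..50%;
    a smaller value means a more stable algorithm."""
    return pstdev(scores)

# illustrative F-measure curves at rates 10%, 20%, ..., 50%
f_by_algo = {"Decision Tree": [0.85, 0.83, 0.80, 0.78, 0.74],
             "KNN":           [0.82, 0.81, 0.81, 0.80, 0.79]}
most_stable = min(f_by_algo, key=lambda a: fluctuation_degree(f_by_algo[a]))
```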
6.3 Evaluation on Clustering Algorithms
Since various types of dirty data could affect the performance of clustering algorithms, we varied the error rates, including the missing rate, inconsistent rate, and conflicting rate, to evaluate the clustering approaches described in Section 4.
6.3.1 Clustering: Varying Missing Rate
To evaluate the impacts of missing data on clustering algorithms, we randomly deleted values from the original datasets, varying the missing rate from 10% to 50%. In the clustering process, we imputed numerical missing values with the column averages and categorical ones with the most frequent values. Experimental results are depicted in Figures 19, 20, 21, 22, 23, and 24.
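The imputation policy can be sketched per column as follows. This is a minimal sketch assuming the categorical fill is the most frequent value; `impute_column` is our name, and real datasets would need per-column type metadata:

```python
from statistics import mean, mode

def impute_column(values):
    """Fill None entries: numeric columns get the column mean,
    categorical columns get the most frequent value."""
    present = [v for v in values if v is not None]
    numeric = isinstance(present[0], (int, float))
    fill = mean(present) if numeric else mode(present)
    return [fill if v is None else v for v in values]

impute_column([1.0, None, 3.0])        # -> [1.0, 2.0, 3.0]
impute_column(["a", "b", None, "a"])   # -> ["a", "b", "a", "a"]
```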
Based on the results, we had the following observations. First, as shown in Table 4, for Precision, the order is "CURE > CLARANS > BIRCH > K-Means > DBSCAN > LVQ". For Recall, the order is "BIRCH > CLARANS > CURE > K-Means > DBSCAN > LVQ". For F-measure, the order is "CLARANS > CURE > BIRCH > K-Means > LVQ > DBSCAN". Thus, for Precision and Recall, the least sensitive algorithm is LVQ. This is because LVQ is a supervised clustering algorithm based on marked labels, so there is little chance for it to be affected by missing values. For F-measure, the least sensitive algorithm is DBSCAN. This is because DBSCAN eliminates all noise points at the beginning of the algorithm, which makes it more resistant to missing values. For Precision, the most sensitive algorithm is CURE, because the location of the representative points in CURE is easily affected by missing values, which causes inaccurate clustering results. For Recall, the most sensitive algorithm is BIRCH, because missing data can impact the construction of the CF tree in BIRCH, which directly leads to wrong clustering results. For F-measure, the most sensitive algorithm is CLARANS, because the computation of the cost difference in CLARANS is susceptible to missing values, which causes some points to be clustered incorrectly.
Second, as shown in Table 5, for Precision, the order is "LVQ > K-Means > DBSCAN > BIRCH > CURE > CLARANS". For Recall, the order is "LVQ > DBSCAN > K-Means > BIRCH > CURE > CLARANS". For F-measure, the order is "LVQ > K-Means > DBSCAN > BIRCH > CURE > CLARANS". Therefore, the most incompleteness-tolerant algorithm is LVQ, since it is a supervised clustering algorithm based on marked labels and is therefore hardly affected by missing values. The least incompleteness-tolerant algorithm is CLARANS, since the computation of the cost difference in CLARANS is susceptible to missing data, which causes inaccurate clustering results.
Third, as data size increases, the running time of the algorithms fluctuates more. This is because a larger data size implies a larger amount of missing data, which introduces more uncertainty into the algorithms; accordingly, the uncertainty of the running time increases.
6.3.2 Clustering: Varying Inconsistent Rate
To evaluate the impacts of inconsistent data on clustering algorithms, we randomly injected inconsistent values into the original datasets according to the consistency rules on the given data, varying the inconsistent rate from 10% to 50%. Experimental results are depicted in Figures 25, 26, 27, 28, 29, and 30.
Based on the results, we had the following observations. First, for well-performing algorithms (Precision/Recall/F-measure larger than 80% on the original datasets), the Precision, Recall, or F-measure fluctuates more widely as data size increases, except for DBSCAN. This is because the amount of inconsistent values grows with data size, and the increasing incorrect data have more effect on the clustering process. DBSCAN, however, discards noise points at the beginning of the algorithm; when data size rises, the amount of clean data grows, so the proportion of eliminated points decreases, which lessens the impact on DBSCAN.
Second, as shown in Table 4, for Precision, the order is "K-Means > CLARANS > CURE > BIRCH > LVQ > DBSCAN". For Recall, the order is "CURE > K-Means > CLARANS > BIRCH > LVQ > DBSCAN". For F-measure, the order is "K-Means > CURE > CLARANS > LVQ > BIRCH > DBSCAN". Thus, the least sensitive algorithm is DBSCAN; the reason is similar to that for the least sensitive algorithm when varying the missing rate. For Precision and F-measure, the most sensitive algorithm is K-Means, since the computation of the centroids is susceptible to incorrect values, which causes wrong clustering results. For Recall, the most sensitive algorithm is CURE; the reason is similar to that for the most sensitive algorithm when varying the missing rate.
Third, as shown in Table 5, for Precision, the order is "DBSCAN > K-Means > LVQ > CLARANS > BIRCH > CURE". For Recall, the order is "DBSCAN > BIRCH > K-Means > CLARANS > CURE > LVQ". For F-measure, the order is "DBSCAN > BIRCH > K-Means > LVQ > CLARANS > CURE". Therefore, the most inconsistency-tolerant algorithm is DBSCAN, since it eliminates all noise points at the beginning of the algorithm, which makes it more resistant to inconsistent data. For Precision, the least inconsistency-tolerant algorithms are BIRCH and CURE; for Recall, it is LVQ; and for F-measure, it is CURE. This is because the distance computation of these algorithms is susceptible to incorrect values, which causes inaccurate clustering results.
Fourth, the observation about running time made in the experiments varying the missing rate still held when the inconsistent rate was varied.
6.3.3 Clustering: Varying Conflicting Rate
To evaluate the impacts of conflicting data on clustering algorithms, we randomly injected conflicting values into the original datasets, varying the conflicting rate from 10% to 50%. Experimental results are depicted in Figures 31, 32, 33, 34, 35, and 36.
Based on the results, we had the following observations. First, as shown in Table 4, for Precision, the order is "CURE > K-Means > CLARANS > DBSCAN > BIRCH > LVQ". For Recall, the order is "CURE > CLARANS > BIRCH > K-Means > LVQ > DBSCAN". For F-measure, the order is "CURE > K-Means > CLARANS > LVQ > BIRCH > DBSCAN". Thus, for Precision, the least sensitive algorithm is LVQ; the reason is similar to that for the least sensitive algorithm when varying the missing rate. For Recall and F-measure, the least sensitive algorithm is DBSCAN; the reason has been discussed in Section 6.3.1. The most sensitive algorithm is CURE; the reason is similar to that for the most sensitive algorithm when varying the missing rate.
Second, as shown in Table 5, for Precision, the order is "BIRCH > K-Means > LVQ > DBSCAN > CLARANS > CURE". For Recall, the order is "DBSCAN > LVQ > K-Means > CLARANS > BIRCH > CURE". For F-measure, the order is "LVQ > K-Means > BIRCH > DBSCAN > CLARANS > CURE". Therefore, for Precision, the most conflict-tolerant algorithm is BIRCH. This is because conflicting data contain both correct and incorrect values, which keeps the construction of the CF tree insusceptible to the incorrect ones. For Recall, the most conflict-tolerant algorithm is DBSCAN; the reason is similar to that for the most inconsistency-tolerant algorithm when varying the inconsistent rate. For F-measure, the most conflict-tolerant algorithm is LVQ; the reason is similar to that for the most incompleteness-tolerant algorithm when varying the missing rate. The least conflict-tolerant algorithm is CURE, since the location of the representative points in CURE can easily be affected by conflicting values, which makes data points clustered inaccurately.
Third, the observation about running time made in the experiments varying the missing rate still held when the conflicting rate was varied.
6.3.4 Discussion
In the clustering experiments, we first found that dirty-data impacts depend on both the error type and the error rate; thus, the error rate of each error type in the given data needs to be detected. Second, we observed that for algorithms whose Precision/Recall/F-measure is larger than 80% on the original datasets, the Precision, Recall, or F-measure becomes unstable as data size rises, except for DBSCAN. Since the tolerance parameter was set to 10%, candidate algorithms whose Precision/Recall/F-measure is larger than 70% are acceptable. Third, we prefer stable algorithms; hence, DBSCAN is suitable for larger data sizes. Fourth, we compared the fluctuation degrees of the clustering algorithms in our experiments: when dirty data exist, the algorithm with the smallest fluctuation degree is the most stable. Fifth, beyond the tolerance threshold, the accuracy of the selected algorithm becomes unacceptable; thus, the error rate of each type needs to be controlled within its threshold.
6.4 Evaluation on Regression Algorithms
Since various kinds of dirty data could affect the performance of regression algorithms, we varied the error rates, including the missing rate, inconsistent rate, and conflicting rate, to evaluate the different types of regression methods described in Section 5.
6.4.1 Regression: Varying Missing Rate
To evaluate the impacts of missing data on regression algorithms, we randomly deleted values from the original datasets, varying the missing rate from 10% to 50%. We used 10-fold cross validation and generated training and testing data randomly. In the testing process, we imputed numerical missing values with the column averages and categorical ones with the most frequent values. Experimental results are depicted in Figures 37, 38, 39, and 40.
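The three error measures used below can be computed as follows. This sketch assumes the common definitions, with NRMSD normalizing the RMSD by the observed range and CV(RMSD) normalizing it by the mean of the observations; the paper may normalize slightly differently:

```python
from math import sqrt
from statistics import mean

def rmsd(pred, true):
    """Root-mean-square deviation between predictions and observations."""
    return sqrt(mean((p - t) ** 2 for p, t in zip(pred, true)))

def nrmsd(pred, true):
    """RMSD normalized by the range of the observed values."""
    return rmsd(pred, true) / (max(true) - min(true))

def cv_rmsd(pred, true):
    """Coefficient of variation of the RMSD: RMSD over the mean observation."""
    return rmsd(pred, true) / mean(true)
```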
Based on the results, we had the following observations. First, as shown in Table 6, for RMSD, the order is "Polynomial Regression > Maximum Likelihood > Stepwise Regression > Least Square". For NRMSD, the order is "Polynomial Regression > Stepwise Regression > Least Square > Maximum Likelihood". For CV(RMSD), the order is "Maximum Likelihood > Stepwise Regression > Polynomial Regression > Least Square". Thus, for RMSD and CV(RMSD), the least sensitive algorithm is Least Square, and for NRMSD it is Maximum Likelihood. This is because the number of parameters in a linear regression model is small, so the model training is hardly affected by missing values. For RMSD and NRMSD, the most sensitive algorithm is Polynomial Regression, and for CV(RMSD) it is Maximum Likelihood. This is because these algorithms perform poorly on some original datasets (error rate 0%); when missing data are injected, the uncertainty of the data grows, which introduces more uncertainty into the algorithms, and their performance deteriorates accordingly.
Second, as shown in Table 7, for RMSD, NRMSD, and CV(RMSD) alike, the order is "Least Square > Maximum Likelihood > Stepwise Regression > Polynomial Regression". Therefore, the most incompleteness-tolerant algorithm is Least Square, since the number of parameters in a least-square linear regression model is small and is therefore hardly affected. The least incompleteness-tolerant algorithm is Polynomial Regression, since a polynomial regression model has many parameters, which makes it susceptible to missing data.
Third, the longer an algorithm's running time on the original datasets (error rate 0%), the more its running time fluctuates. This is because a longer running time implies more uncertainty in the algorithm; accordingly, the uncertainty of the running time increases.
6.4.2 Regression: Varying Inconsistent Rate
To evaluate the impacts of inconsistent data on regression algorithms, we randomly injected inconsistent values into the original datasets according to the consistency rules on the given data, varying the inconsistent rate from 10% to 50%. We used 10-fold cross validation and generated training and testing data randomly. Experimental results are depicted in Figures 41, 42, 43, and 44.
Based on the results, we had the following observations. First, as shown in Table 6, for RMSD, the order is "Maximum Likelihood > Polynomial Regression > Stepwise Regression > Least Square". For NRMSD, the order is "Polynomial Regression > Stepwise Regression > Least Square > Maximum Likelihood". For CV(RMSD), the order is "Stepwise Regression > Maximum Likelihood > Polynomial Regression > Least Square". Thus, for RMSD and CV(RMSD), the least sensitive algorithm is Least Square, and for NRMSD it is Maximum Likelihood; the reason is similar to that for the least sensitive algorithm when varying the missing rate. For RMSD, the most sensitive algorithm is Maximum Likelihood; for NRMSD, it is Polynomial Regression; and for CV(RMSD), it is Stepwise Regression. This is due to their poor performance on some original datasets (error rate 0%); when inconsistent data are injected, the uncertainty of the data grows, which introduces more uncertainty into the algorithms, and they perform worse accordingly.
Second, as shown in Table 7, for RMSD, NRMSD, and CV(RMSD) alike, the order is "Least Square > Maximum Likelihood > Polynomial Regression > Stepwise Regression". Therefore, the most inconsistency-tolerant algorithm is Least Square; the reason is similar to that for the most incompleteness-tolerant algorithm when varying the missing rate. The least inconsistency-tolerant algorithm is Stepwise Regression, since many independent variables are tested in a stepwise regression model, which makes it easily affected by inconsistent values.
Third, the observation about running time made in the experiments varying the missing rate still held when the inconsistent rate was varied.
6.4.3 Regression: Varying Conflicting Rate
To evaluate the impacts of conflicting data on regression algorithms, we randomly injected conflicting values into the original datasets, varying the conflicting rate from 10% to 50%. We used 10-fold cross validation and generated training and testing data randomly. Experimental results are depicted in Figures 45, 46, 47, and 48.
Based on the results, we had the following observations. First, as shown in Table 6, for RMSD, the order is "Polynomial Regression > Maximum Likelihood > Least Square > Stepwise Regression". For NRMSD, the order is "Polynomial Regression > Stepwise Regression > Maximum Likelihood > Least Square". For CV(RMSD), the order is "Maximum Likelihood > Stepwise Regression > Polynomial Regression > Least Square". Thus, for RMSD, the least sensitive algorithm is Stepwise Regression, because the validation step in stepwise regression guarantees the regression accuracy. For NRMSD and CV(RMSD), the least sensitive algorithm is Least Square; the reason is similar to that for the least sensitive algorithm when varying the missing rate. For RMSD and NRMSD, the most sensitive algorithm is Polynomial Regression, and for CV(RMSD) it is Maximum Likelihood; the reason is similar to that for the most sensitive algorithms when varying the missing rate.
Second, as shown in Table 7, for RMSD, the order is "Stepwise Regression > Least Square > Maximum Likelihood > Polynomial Regression". For NRMSD, the order is "Least Square > Maximum Likelihood > Stepwise Regression > Polynomial Regression". For CV(RMSD), the order is "Least Square > Maximum Likelihood > Stepwise Regression > Polynomial Regression". Therefore, the most conflict-tolerant algorithms are Least Square, Maximum Likelihood, and Stepwise Regression. This is because the least-square and maximum-likelihood linear regression models have few parameters, while in Stepwise Regression the validation step helps guarantee the regression accuracy. The least conflict-tolerant algorithm is Polynomial Regression; the reason is similar to that for the least incompleteness-tolerant algorithm when varying the missing rate.
Third, the observation about running time made in the experiments varying the missing rate still held when the conflicting rate was varied.
6.4.4 Discussion
In the regression experiments, we first found that dirty-data impacts depend on both the error type and the error rate; thus, the error rate of each error type in the given data needs to be detected. Second, we compared the fluctuation degrees of the regression algorithms in our experiments: when dirty data exist, the algorithm with the smallest fluctuation degree is the most stable. Third, beyond the tolerance threshold, the accuracy of the selected algorithm becomes unacceptable; thus, the error rate of each type needs to be controlled within its threshold.
7 Guidelines and Future Work
Based on the discussions, we give guidelines for algorithm selection and data cleaning.
Classification Guidelines. We suggest that users select a classification algorithm and clean dirty data according to the following steps.
First, users should detect the error rates (e.g., the missing rate, inconsistent rate, and conflicting rate) of the given data.
Second, according to the given task requirements (e.g., good performance on Precision/Recall/F-measure), we suggest that users select candidate algorithms whose Precision/Recall/F-measure on the given data exceeds 70%.
Third, if the given data size is small, we recommend Logistic Regression.
Fourth, according to the task requirements and the error type that accounts for the largest proportion, we suggest that users consult the corresponding order and choose the least sensitive classification algorithm.
Finally, according to the selected algorithm, the task requirements, and the error rates of the given data, we suggest that users consult the corresponding orders and clean each type of dirty data down to its tolerance threshold.
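The steps above can be sketched as a small selection routine. The candidate scores and the sensitivity order below are placeholders rather than the paper's full tables, and `select_algorithm` is a hypothetical name:

```python
def select_algorithm(candidates, dominant_error, sensitivity_orders,
                     threshold=0.70):
    """Keep algorithms whose baseline F-measure clears the threshold,
    then pick the one least sensitive to the dominant error type."""
    viable = [a for a, f in candidates.items() if f > threshold]
    order = sensitivity_orders[dominant_error]   # most -> least sensitive
    return max(viable, key=order.index)          # furthest down = least sensitive

choice = select_algorithm(
    {"Decision Tree": 0.86, "KNN": 0.84, "Random Forests": 0.66},
    dominant_error="inconsistent",
    sensitivity_orders={"inconsistent": [
        "Random Forests", "Logistic Regression", "Decision Tree",
        "Naive Bayes", "Bayesian Network", "KNN"]})
```

Here Random Forests is filtered out by the 70% bar, and KNN is chosen as the least inconsistency-sensitive of the survivors.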
Clustering Guidelines. We suggest that users select a clustering algorithm and clean dirty data according to the following steps.
First, users should detect the error rates (e.g., the missing rate, inconsistent rate, and conflicting rate) of the given data.
Second, according to the given task requirements (e.g., good performance on Precision/Recall/F-measure), we suggest that users select candidate algorithms whose Precision/Recall/F-measure on the given data exceeds 70%.
Third, if the given data size is large, we recommend DBSCAN.
Fourth, according to the task requirements and the error type that accounts for the largest proportion, we suggest that users consult the corresponding order and choose the least sensitive clustering algorithm.
Finally, according to the selected algorithm, the task requirements, and the error rates of the given data, we suggest that users consult the corresponding orders and clean each type of dirty data down to its tolerance threshold.
Regression Guidelines. We suggest that users select a regression algorithm and clean dirty data according to the following steps.
First, users should detect the error rates (e.g., the missing rate, inconsistent rate, and conflicting rate) of the given data.
Second, according to the given task requirements (e.g., good performance on RMSD/NRMSD/CV(RMSD)), we suggest that users select candidate algorithms whose RMSD/CV(RMSD) on the given data is lower than 1.0, or whose NRMSD is lower than 0.5.
Third, according to the task requirements and the error type that accounts for the largest proportion, we suggest that users consult the corresponding order and choose the least sensitive regression algorithm.
Finally, according to the selected algorithm, the task requirements, and the error rates of the given data, we suggest that users consult the corresponding orders and clean each type of dirty data down to its tolerance threshold.
In addition, this work opens many noteworthy avenues for future work, which are listed as follows.
For researchers and practitioners in fields related to data analytics and data mining: (1) since dirty-data impacts on classification, clustering, and regression are valuable, their effects on other kinds of algorithms (e.g., association rule mining) need to be tested; (2) dirty-data impacts are related to the error type, error rate, data size, and algorithm performance on the original datasets, so constructing a model with these parameters to predict dirty-data impacts is in demand.
For researchers and practitioners in data-quality and data-cleaning related fields: (1) since the error-tolerance abilities of different algorithms on different error types differ, it is unnecessary to clean all dirty data before data mining and machine learning tasks; instead, cleaning the data down to an appropriate rate (e.g., the tolerance threshold) is suggested, yet which part of the dirty data should be repaired first remains a challenging problem; (2) since different users have different task requirements, how to clean data on demand needs a solution.
Footnotes
1. https://github.com/qizhixinhit/DirtydataImpacts
2. http://archive.ics.uci.edu/ml/datasets.html
References
 G. Beskales, I. F. Ilyas, L. Golab, and A. Galiullin. On the relative trust between inconsistent data and inaccurate constraints. In ICDE, pages 541–552, 2013.
 X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458–469, 2013.
 X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. Katara: A data cleaning system powered by knowledge bases and crowdsourcing. In SIGMOD, pages 1247–1261, 2015.
 S. Hao, N. Tang, G. Li, and J. Li. Cleaning relations using knowledge bases. In ICDE, pages 933–944, 2017.
 J. Wang, T. Kraska, M. J. Franklin, and J. Feng. Crowder: Crowdsourcing entity resolution. PVLDB, 5(11):1483–1494, 2012.
 M. Dallachiesa, A. Ebaid, A. Eldawy, A. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. Nadeef: a commodity data cleaning system. In SIGMOD, pages 541–552, 2013.
 W. Fan and F. Geerts. Foundations of Data Quality Management. 2012.
 W. Fan and F. Geerts. Capturing missing tuples and missing values. In SIGMOD, pages 169–178, 2010.
 G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In PVLDB, pages 315–326, 2007.
 L. Getoor and A. Machanavajjhala. Entity resolution: theory, practice & open challenges. PVLDB, 5(12):2018–2019, 2012.
 F. Sidi, P. H. S. Panahy, L. S. Affendey, M. A. Jabar, H. Ibrahim, and A. Mustapha. Data quality: A survey of data quality dimensions. In Information Retrieval & Knowledge Management, pages 300–304, 2012.
 R. Caruana and A. NiculescuMizil. An empirical comparison of supervised learning algorithms. In ICML, pages 161–168, 2006.
 R. Caruana, N. Karampatziakis, and A. Yessenalina. An empirical evaluation of supervised learning in high dimensions. In ICML, pages 96–103, 2008.
 K. O. Elish and M. O. Elish. Predicting defectprone software modules using support vector machines. Journal of Systems and Software, 81(5):649–660, 2008.
 B. Ghotra, S. McIntosh, and A. E. Hassan. Revisiting the impact of classification techniques on the performance of defect prediction models. In ICSE, pages 789–800, 2015.
 J. R. Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986.
 Y. Prabhu and M. Varma. Fastxml: A fast, accurate and stable treeclassifier for extreme multilabel learning. In SIGKDD, pages 263–272, 2014.
 C. Thornton, F. Hutter, H. H. Hoos, and K. LeytonBrown. Autoweka: Combined selection and hyperparameter optimization of classification algorithms. In SIGKDD, pages 847–855, 2013.
 T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE transactions on pattern analysis and machine intelligence, 18(6):607–616, 1996.
 N. Begum, L. Ulanova, J. Wang, and E. Keogh. Accelerating dynamic time warping clustering with a novel admissible pruning strategy. In SIGKDD, pages 49–58, 2015.
 T. Bayes. A letter from the late reverend mr. thomas bayes, frs to john canton, ma and frs. Philosophical Transactions, 53:269–271, 1763.
 J. Pearl. Fusion, propagation, and structuring in belief networks. Artificial intelligence, 29(3):241–288, 1986.
 T. Wu, S. Sugawara, and K. Yamanishi. Decomposed normalized maximum likelihood codelength criterion for selecting hierarchical latent variable models. In SIGKDD, pages 1165–1174, 2017.
 D. McFadden et al. Conditional logit analysis of qualitative choice behavior. Frontiers in Econometrics, pages 105–142, 1972.
 H. Jain, Y. Prabhu, and M. Varma. Extreme multilabel loss functions for recommendation, tagging, ranking & other missing label applications. In SIGKDD, pages 935–944, 2016.
 S. Chang, W. Han, J. Tang, G. Qi, C. C. Aggarwal, and T. S. Huang. Heterogeneous network embedding via deep architectures. In SIGKDD, pages 119–128, 2015.
 Z. Cui, W. Chen, Y. He, and Y. Chen. Optimal action extraction for random forests and boosted trees. In SIGKDD, pages 179–188, 2015.
 S. Khanmohammadi, N. Adibeig, and S. Shanehbandy. An improved overlapping kmeans clustering method for medical applications. Expert Systems with Applications, 67:12–18, 2017.
 K. Kirchner, J. Zec, and B. Delibašić. Facilitating data preprocessing by a generic framework: a proposal for clustering. Artificial Intelligence Review, 45(3):271–297, 2016.
 S. Wu, H. Chen, and X. Feng. Clustering algorithm for incomplete data sets with mixed numeric and categorical attributes. International Journal of Database Theory and Application, 6(5):95–104, 2013.
 H. Gulati and P. Singh. Clustering techniques in data mining: A comparison. In Computing for Sustainable Global Development, pages 410–415, 2015.
 J. MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297, 1967.
 T. Kohonen. Learning vector quantization. In Self-Organizing Maps, pages 175–189. 1995.
 R. T. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In VLDB, pages 144–155, 1994.
 M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In SIGKDD, pages 226–231, 1996.
 T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In ACM SIGMOD Record, volume 25, pages 103–114, 1996.
 S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In ACM SIGMOD Record, volume 27, pages 73–84, 1998.
 R. Silhavy, P. Silhavy, and Z. Prokopova. Analysis and selection of a regression model for the use case points method using a stepwise approach. Journal of Systems and Software, 125:1–14, 2017.
 S. Abraham, M. Raisee, G. Ghorbaniasl, F. Contino, and C. Lacor. A robust and efficient stepwise regression method for building sparse polynomial chaos expansions. Journal of Computational Physics, 332:461–474, 2017.
 E. Avdis and J. A. Wachter. Maximum likelihood estimation of the equity premium. Journal of Financial Economics, 125(3):589–609, 2017.
 L. Li and X. Zhang. Parsimonious tensor response regression. Journal of the American Statistical Association, 112(519):1131–1146, 2017.
 J. B. Ramsey. Tests for specification errors in classical linear least-squares regression analysis. Journal of the Royal Statistical Society, pages 350–371, 1969.
 P. McCullagh. Generalized linear models. European Journal of Operational Research, 16(3):285–292, 1984.
 A. R. Gallant and W. A. Fuller. Fitting segmented polynomial regression models whose join points have to be estimated. Journal of the American Statistical Association, 68(341):144–147, 1973.
 R. B. Bendel and A. A. Afifi. Comparison of stopping rules in forward "stepwise" regression. Journal of the American Statistical Association, 72(357):46–53, 1977.