Missing Data Imputation for Classification Problems
Imputation of missing data is a common task in classification problems where the training feature matrix has missing entries. A widely used solution to this imputation problem is based on a lazy learning technique, the k-nearest neighbor (kNN) approach. However, most of the previous work on missing data does not take into account the presence of the class label in the classification problem. Also, existing kNN imputation methods use variants of Minkowski distance as a measure of distance, which does not work well with heterogeneous data. In this paper, we propose a novel iterative kNN imputation technique based on class weighted grey distance between the missing datum and all the training data. Grey distance works well in heterogeneous data with missing instances. The distance is weighted by Mutual Information (MI), which is a measure of feature relevance between the features and the class label. This ensures that the imputation of the training data is directed towards improving the classification performance. This class weighted grey kNN imputation algorithm demonstrates improved performance when compared to other kNN imputation algorithms, as well as standard imputation algorithms such as MICE and missForest, in imputation and classification problems. These problems are based on simulated scenarios and UCI datasets with various rates of missingness.
Choudhury and Kosorok
Francis Bach, David Blei and Bernhard Scholkopf
Keywords: Missing Data, K Nearest Neighbors, Grey Theory, Mutual Information, Classification Problem
Many of the commonly used classification algorithms such as Classification and Regression Trees (CART) (Breiman et al., 1984) and Random Forests (Breiman, 2001) do not have rigorous techniques for handling missing values in training data. Ignoring the datapoints with missing values and running the classification algorithm on complete cases only leads to loss of vital information (Little and Rubin, 2002). The occurrence of missing data is one of the biggest challenges for data scientists solving classification problems in real-world data (Duda et al., 2012). These datasets can come from any walk of life, ranging from medical data (Troyanskaya et al., 2001) and survey responses to equipment faults and limitations (Le Gruenwald, 2005). The reason for missingness can be human error in inputting data, incorrect measurements, non-response to surveys, etc. For example, an industrial database maintained by Honeywell, a company manufacturing and servicing complex equipment, has more than 50% missing data (Lakshminarayan et al., 1999) despite regulatory requirements for data collection. In wireless sensor networks, incomplete data is encountered due to sensor faults, local interference or power outage (Le Gruenwald, 2005). In medical fields, patient health care records are often a by-product of patient care activities rather than of an organized research protocol, which leads to significant loss of information (Cios and Moore, 2002). As a result, almost every patient record lacks some values and almost every attribute/feature has missing values. More than 40% of the datasets in the UCI Machine Learning Repository have missing values (Newman et al., 2008).
Classification problems are aimed at developing a classifier from training data, so that a new test observation can be correctly classified into one of the groups/classes. The class membership is assumed to be known for each observation of the training set whereas the corresponding attributes/features may have some missing values. The test dataset consists of new observations having the corresponding features but no class labels. The goal of the classification problem is to assign class labels to the test set (Alpaydin, 2009). In our problem setup, we assume that some of the features are missing at random (MAR) for the training as well as the test dataset. One approach to classification is ignoring the observations with missing values and building a classifier. This is only feasible when the missingness is insignificant, however, and it has been demonstrated that even with a 5% missingness, proper imputation increases the classification accuracy (Farhangfar et al., 2008). We focus on imputation of missing values in the training as well as the test dataset so as to improve the overall performance of the classifier on the test data. Our proposed method takes into account the class label during imputation of the training features, and this ensures an overall improvement in classification.
The work related to missing data imputation can be divided into two categories, single imputation and multiple imputation. Single imputation strategies provide a single value for the missing datum. The earliest single value imputation strategy was Mean Imputation (Little and Rubin, 2002) which ignores the input data distribution by imputing just one value for all missing instances of a feature. Other popular single imputation techniques are hot deck and cold deck imputation (Little and Rubin, 2002), C4.5 (Quinlan, 1993) and prediction based models (Schafer, 1997). C4.5 works well with discrete data but not with numerical data, which has to be discretized to apply the algorithm (Tsai et al., 2008). Prediction based models depend on the correct modelling of missing features and the relationship between them. Usually, incomplete datasets obtained from studies cannot be modelled accurately. The problem with single imputation techniques in general is they reduce the variance of the imputed dataset and cannot provide standard errors and confidence intervals for the missing data. They are also very case specific as they can meaningfully impute data only when the model is known or when the data is either numerical or discrete.
To solve the problems of single imputation, multiple imputation strategies generate several imputed datasets from which confidence intervals can be calculated. Multiple imputation is a process where several complete databases are created by imputing different values to reflect the uncertainty about the right values to impute (Rubin, 1977; Farhangfar et al., 2007). The earliest multiple imputation technique was the Expectation-Maximization (EM) Algorithm (Dempster et al., 1977). The EM Algorithm and its variants, such as EM with bootstrapping (Honaker et al., 2011), assume a parametric density function, which fails for features without a parametric density. A recent generalization of the EM Algorithm called Pattern Alternating Maximization with Lasso Penalty (MissPALasso) (Städler et al., 2014) has been applied to datasets with high dimensionality (more features than observations), but it also assumes normality. Bayesian multiple imputation algorithms have been applied only to multivariate normal samples (Li, 1988; Rubin and Schafer, 1990).
Regression Imputation (Gelman and Hill, 2006) is also a popular multiple imputation technique where each feature is imputed using other features as predictor variables for the regression model. Sequential Regression Multivariate Imputation (SRMI) improves upon this by fitting a sequence of regression models and drawing values from the corresponding predictive distributions (Raghunathan et al., 2001). Incremental Attribute Regression Imputation (IARI) constructs a sequence of regression models to iteratively impute the missing values and also uses the class label of each sample as a predictor variable (Stein and Kowalczyk, 2016). In Multivariate Imputation by Chained Equations (MICE), the conditional distribution of each missing feature must be specified given the other features (Buuren and Oudshoorn, 1999). It is assumed that the feature matrix has a full multivariate distribution from which the conditional distribution of each feature is derived. The full distribution need not be specified as long as the distribution of each feature is stated, a property called fully conditional specification (Buuren, 2007). MICE can handle mixed types of data. It has options for predictive mean matching, linear regression, binary and polytomous logistic regression, etc., and uses the Gibbs sampler to generate multiple imputations. However, for a given set of conditional distributions, a multivariate distribution may not exist (Buuren et al., 2006). The ideas of MICE and SRMI are combined in the MissForest approach (Stekhoven and Bühlmann, 2011), which fits a random forest on the missing feature, using the other features as covariates, and then predicts the missing values. This procedure is iterative and can handle mixed data, complex interactions, and high dimensions.
Machine Learning techniques such as Fuzzy K-Means (Sefidian and Daneshpour, 2019), Multilayer Perceptrons (MLP) (García-Laencina et al., 2013) and k-Nearest Neighbors (KNN) (Batista and Monard, 2002) are useful non-parametric approaches to imputation of missing data. Various machine learning algorithms such as KNN, Support Vector Machines (SVM) and decision trees have been used in imputation by framing the imputation problem as an optimization problem and solving it (Bertsimas et al., 2017). The Nearest Neighbor Imputation (NNI) approach is simple since there is no need to build a predictive model for the data. The basic KNN Imputation (KNNI) algorithm was first used for estimating missing values in DNA microarrays, with the contribution of the k nearest neighbors weighted by Euclidean distance (Troyanskaya et al., 2001). The sequential KNN method was proposed using cluster-based imputation (Kim et al., 2004), followed by an iterative variant of the KNN imputation (IKNN) (Brás and Menezes, 2007), both of which improve on KNNI. The Shelly Neighbors (SN) method improves the KNN rule by selecting only neighbors forming a shell around the missing datum, among the k closest neighbors (Zhang, 2011). The first significant work in improving KNN imputation for classification based problems uses a feature-weighted distance metric based on Mutual Information (MI) as a measure of closeness of a feature to the class label (García-Laencina et al., 2009). The method is called Mutual Information based k-Nearest Neighbor (MI-KNN) Imputation. However, the distance metric used is Euclidean distance, which does not perform well with mixed-type data (Huang and Lee, 2006). Alternatively, Grey Relational Analysis is shown to be more appropriate for capturing proximity between two instances with mixed data as well as missingness.
Based on this, a Grey KNN (GKNN) imputation approach was built using Grey distance instead of Euclidean distance, and it was shown to outperform traditional KNN imputation techniques (Huang and Lee, 2004; Zhang, 2012). A later variant weights the grey distance by the mutual information between features (a measure of inter-feature relevance); it is called the Feature Weighted Grey k-Nearest Neighbor (FWGKNN) method (Pan et al., 2015) and was shown to outperform IKNN, GKNN and Fuzzy K-Means Imputation (FKMI) (Li et al., 2004) in most settings. However, this method does not take into account each feature's association with the class label, which is crucial when dealing with classification problems. The FWGKNN method also assumes inter-dependency of features.
We propose a Class-weighted Grey k-Nearest Neighbor (CGKNN) imputation method where we calculate the MI of each feature with respect to the class label in the training dataset, use it for calculating the weighted Grey distance between the instances, and then find the k nearest neighbors of an instance with missing values. Using these k nearest neighbors, the missing value is imputed according to the weighted Grey distance. Our contributions can be summarized as follows:
We use the Mutual Information between each feature and the class variable to weight the Grey distance between instances in the feature matrix X. This metric is suited for tuning out features that are unnecessary for classification and then finding the nearest neighbors relevant for imputation.
We solve an imputation problem with no underlying assumptions on the structure of the feature matrix except that the data is missing completely at random (MCAR) or missing at random (MAR). Our method (CGKNN) is non-parametric in nature and does not assume any dependence between the individual features. This performs well even when the features are independent of each other.
The proposed CGKNN imputation method is well suited for classification problems where the training as well as the test datasets have missing values. The feature matrix can be mixed-type, i.e., have categorical and numeric data, so our method is suitable for mixed-data classification problems with missing values. Moreover, our approach takes much less time to initialize than the most similar alternative method, Feature Weighted Grey k-Nearest Neighbor (FWGKNN).
The remainder of this paper is organized as follows. In Section 2, we review the KNN imputation techniques used in previous work and then provide a detailed outline of our method. We also discuss how it can be extended to the test dataset for classification and derive the time complexity of our algorithm. In Section 3, we test our proposed method against 6 standard methods in simulation settings. We evaluate our imputation method (CGKNN) in different simulation settings with classification where we artificially introduce missingness. We compare it with the standard multiple imputation algorithms MICE and MissForest as well as the previous KNN based algorithms, Iterative KNNI (IKNN), Mutual Information based KNNI (MI-KNN), Grey KNNI (GKNN) and Feature-Weighted Grey KNNI (FWGKNN). In Section 4, we demonstrate how our algorithm performs in classification tasks involving 3 UCI Machine Learning Repository datasets. We also check for improvement of classification accuracy after imputation of the missing data. Our method gives the best classification performance out of all evaluated methods. We conclude with a discussion and scope for future work in Section 5.
In this section, we pose the missing data problem which is encountered in classification tasks. We introduce the nearest neighbor (NN) approach and previous work on variations of the KNN imputation approach. This is followed by the concepts of mutual information (MI) and grey relational analysis (GRA) used by our Class-weighted Grey k-Nearest Neighbor (CGKNN) imputation approach. We then formalize our imputation algorithm and calculate its time complexity.
2.1 Formulation of the Problem
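Let $X$ be an $n \times p$-dimensional dataset of $n$ independent observations with $p$ features/attributes and $Y$ a response variable of the class labels influenced by $X$. We assume no dependence structure between the features in $X$. Let $M$ be an $n \times p$-dimensional matrix indicating the missingness of corresponding values in the dataset $X$. In practice, we obtain a random sample of size $n$ of incomplete data associated with a population $(X, Y, M)$, called the training data (Hastie et al., 2009) used to train the classifier
$$ D = \{(x_i, y_i, m_i)\}_{i=1}^{n}, \tag{1} $$
where all the class labels in $y = (y_1, \dots, y_n)$ are observed, $x_i = (x_{i1}, \dots, x_{ip})$ represents the features of the $i$-th observation along with indicator variables $m_i = (m_{i1}, \dots, m_{ip})$ such that
$$ m_{ij} = \begin{cases} 1 & \text{if } x_{ij} \text{ is missing}, \\ 0 & \text{otherwise}. \end{cases} \tag{2} $$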
Without loss of generality, we can assume for each , the observation contains categorical features for and continuous features for such that . Let the -th categorical feature contain different values and the -th continuous variable representing the -th feature of , indexed by take values from a continuous set . For each of the categorical features, we can map the different values to the first natural numbers, such that .
In this setting, we can assume that satisfy the model
where is an unknown function mapping a -dimensional number (belonging to a subspace of ) to a discrete set representing the class labels and . We assume that contains values and thus the classification problem is based on classes.
The task of any classification algorithm is to use the training dataset to estimate , which is referred to as ‘training’ a classifier . Given a new set of observations, , called the test dataset (Hastie et al., 2009), the classifier predicts the corresponding class using Note that the test dataset can also contain missing values. Many classification algorithms have been shown to perform better in terms of classification accuracy after imputing the missing values in the feature matrix (Farhangfar et al., 2008; Luengo et al., 2012) and then training the classifier. In this paper, we propose a nearest neighbor based imputation algorithm which is used to impute the missing values in X and then train the classifier . The same algorithm can be extended to the test dataset and impute the missing values in .
In general, there are three different missing data mechanisms as defined in the statistical literature (Little and Rubin, 2002):
Missing Completely at Random (MCAR): When the missingness of $X$ does not depend on the missing or observed values of $X$. In other words, with $M$ denoting the missingness indicators defined in (2),
$$ P(M \mid X) = P(M). \tag{5} $$
Missing at Random (MAR): When the missingness of $X$ depends on the observed values of $X$ but not on the missing values of $X$. If we split the training dataset into two parts, observed $X_{obs}$ and missing $X_{mis}$, then
$$ P(M \mid X) = P(M \mid X_{obs}). \tag{6} $$
Not Missing at Random (NMAR): When the data is neither MCAR nor MAR, the missingness of $X$ depends on the missing values of $X$ itself. This sort of missingness is difficult to model as the observed values of $X$ give biased estimates of the missing values.
For our problem, we assume that the missing data mechanism of $X$ is either MCAR or MAR.
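To make the distinction between the two assumed mechanisms concrete, the following minimal sketch generates missingness masks under MCAR and under a simple MAR rule. The function names, the threshold rule and the use of a single always-observed "driver" feature are illustrative assumptions, not part of the method described in this paper.

```python
import random

def mcar_mask(n_rows, n_cols, rate, seed=0):
    """MCAR: every cell is missing with the same probability `rate`,
    independently of any data value."""
    rng = random.Random(seed)
    return [[rng.random() < rate for _ in range(n_cols)] for _ in range(n_rows)]

def mar_mask(rows, target_col, driver_col, threshold, rate, seed=0):
    """MAR (hypothetical rule): `target_col` may be missing only in rows whose
    fully observed `driver_col` exceeds `threshold`; the missingness never
    looks at the value that is being removed."""
    rng = random.Random(seed)
    mask = [[False] * len(row) for row in rows]
    for i, row in enumerate(rows):
        if row[driver_col] > threshold and rng.random() < rate:
            mask[i][target_col] = True
    return mask

rows = [[float(i), 10.0 * i] for i in range(10)]
print(mcar_mask(3, 2, rate=0.3))
print(mar_mask(rows, target_col=1, driver_col=0, threshold=4.0, rate=0.5))
```

Under the MAR rule the probability of a missing entry depends only on an observed feature, which is the situation covered by (6); an NMAR rule would instead threshold on the value of `target_col` itself.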
2.2 k-Nearest Neighbors (KNN) Imputation Algorithm
KNN is a lazy, instance-based learning algorithm and is one of the top 10 data mining algorithms (Wu et al., 2008). Instance-based learning is based on the principle that instances within a dataset will generally exist in close proximity to other instances that have similar properties (Aha et al., 1991). The KNN approach has been extended to imputation of missing data in various datasets (Troyanskaya et al., 2001). In general, KNN imputation is an appropriate choice when we have no prior knowledge about the distribution of the data. Given an incomplete instance, this method selects its k closest neighbours according to a distance metric, and estimates the missing data with the corresponding mean or mode. The mean rule is used to predict missing numerical features and the mode rule is used to predict missing categorical features. KNN imputation does not create explicit predictive models, because the training dataset is used as a lazy model. Also, this method can easily treat cases with multiple missing values.
Distance Metric for Mixed Data
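Let there be two input vectors, $x_a$ and $x_b$, whose features can be both continuous as well as categorical. The Heterogeneous Euclidean Overlap Metric (HEOM) (Batista and Monard, 2003), denoted as $d_H(x_a, x_b)$, is defined as
$$ d_H(x_a, x_b) = \sqrt{\sum_{j=1}^{p} d_j(x_{aj}, x_{bj})^2}, \tag{7} $$
$$ d_j(x_{aj}, x_{bj}) = \begin{cases} 1 & \text{if } x_{aj} \text{ or } x_{bj} \text{ is missing}, \\ d_O(x_{aj}, x_{bj}) & \text{if feature } j \text{ is categorical}, \\ d_N(x_{aj}, x_{bj}) & \text{if feature } j \text{ is continuous}, \end{cases} \tag{8} $$
$$ d_O(x_{aj}, x_{bj}) = \begin{cases} 0 & \text{if } x_{aj} = x_{bj}, \\ 1 & \text{otherwise}, \end{cases} \tag{9} $$
$$ d_N(x_{aj}, x_{bj}) = \frac{|x_{aj} - x_{bj}|}{\max(x_j) - \min(x_j)}, \tag{10} $$
where $\max(x_j)$ means the maximum value of observations of feature $j$ and $\min(x_j)$ means the minimum value of feature $j$ when it is quantitative. The per-feature distance $d_j$ ranges from 0 to 1 and also takes the value 1 when either of the observations is missing.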
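As an illustration of the HEOM described above, the sketch below computes the distance between two mixed-type instances in plain Python. Representing missing values by `None` and passing the per-feature ranges as a list are choices made for this example only.

```python
import math

def heom(a, b, is_categorical, ranges):
    """HEOM distance between two instances; None marks a missing value.

    is_categorical[j] -- True if feature j is categorical
    ranges[j]         -- max - min of feature j (ignored for categorical features)
    """
    total = 0.0
    for j, (u, v) in enumerate(zip(a, b)):
        if u is None or v is None:
            d = 1.0                       # maximal distance when a value is missing
        elif is_categorical[j]:
            d = 0.0 if u == v else 1.0    # overlap distance
        else:
            d = abs(u - v) / ranges[j] if ranges[j] > 0 else 0.0  # range-normalised
        total += d * d
    return math.sqrt(total)

a = [1.0, "red", None]
b = [3.0, "blue", 5.0]
print(heom(a, b, is_categorical=[False, True, False], ranges=[4.0, None, 10.0]))
```

In this example the three per-feature distances are 0.5, 1 and 1, so a missing entry contributes as much as a categorical mismatch.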
However, to effectively apply the KNN imputation approach, one challenging issue is choosing the optimal value of k, and another is selecting the neighbours. The optimal k-value can be selected using only the non-missing parts (Kim et al., 2004). This k-value estimating procedure treats some elements of the non-missing parts as artificial missing values, and finds the k-value that produces the best estimates for these artificial missing values. This method does not perform well when there are large amounts of missing data. In the proposed approach we determine this parameter using cross validation (Stone, 1974).
Suppose the -th input feature of is missing (i.e., from (2)) and has to be imputed. After the distances (calculated by the HEOM defined in (2.2.1)) from to all other training instances () are computed, its -nearest neighbours are chosen from the training set. In our notation, represents the set of -nearest neighbors of arranged in increasing order of its distance as defined by (7)-(10). The -closest cases are selected after instances with missing entries in the incomplete feature are imputed using mean or mode imputation, depending on the type of feature (Troyanskaya et al., 2001). Once its -nearest neighbours have been chosen, the unknown value is imputed by an estimate from the -th feature values of .
For continuous variables, the imputed value () is . One obvious refinement is to weight the contribution of each according to their distance to (Dudani, 1976), such that
For categorical or discrete variables, we choose among the discrete values of . using the values of the -th input features in . A popular choice is to impute the mode of to , where all neighbours have the same importance in the imputation stage (Troyanskaya et al., 2001). An improvement to this is assigning a weight to each , with closer neighbours having greater weights. The category of is chosen by the category whose weights sum up to the highest value in . Using an approach similar to a distance-weighted KNN classifier (Dudani, 1976), a suitable choice of is
Suppose the -th input feature has possible discrete values with be the number of samples from that belong to category . Now for each possible category, is calculated by
Then, the category imputed to is
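The imputation step for a single missing entry can be sketched as follows. The weights used here are plain inverse distances, which is an assumption made for illustration; the weighting schemes referred to in the text (Dudani, 1976) may differ, and the function name is hypothetical.

```python
from collections import defaultdict

def impute_from_neighbours(values, distances, categorical, eps=1e-12):
    """Impute one missing entry from the j-th feature values of the k neighbours.

    values    -- j-th feature values of the k nearest neighbours
    distances -- distances of those neighbours to the incomplete instance
    Closer neighbours receive larger (inverse-distance) weights.
    """
    weights = [1.0 / (d + eps) for d in distances]
    if categorical:
        score = defaultdict(float)
        for v, w in zip(values, weights):
            score[v] += w                  # accumulate weight per category
        return max(score, key=score.get)   # category with the largest total weight
    # Weighted mean for a continuous feature.
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

print(impute_from_neighbours([1.0, 3.0], [1.0, 1.0], categorical=False))
print(impute_from_neighbours(["a", "b", "b"], [0.1, 1.0, 1.0], categorical=True))
```

The second call shows why weighting matters: the unweighted mode would be "b", while the single much closer neighbour makes the weighted vote choose "a".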
Initialization: Given the training dataset , the missing values of categorical features are imputed by mode imputation and the missing values of the continuous features are imputed by mean imputation using the observed data. We call the initially imputed matrix .
Choosing k: We use this initially imputed matrix to calculate the optimum value of k using 10-fold cross validation (Stone, 1974) to minimize the misclassification rate of predicting the class labels. This is the k used for choosing the nearest neighbors.
Stopping Criterion: We stop at the iteration in which the stopping criterion is met. The stopping criterion we propose is
where is the chosen accuracy level.
2.3 Mutual Information (MI) for Classification
We can see that the above imputation algorithm does not consider the class label while computing the k-nearest neighbors. We can address this with a procedure in which the neighbourhood is selected by considering the relevance of each input feature for classification (García-Laencina et al., 2009). This relevance is measured by calculating the Mutual Information (MI) between each feature and the class variable.
Notion of MI
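The entropy of a random variable $X$, denoted $H(X)$, measures the uncertainty of the variable. If a discrete random variable $X$ takes values in an alphabet $\mathcal{X}$ with probability mass function $p(x) = P(X = x)$, $x \in \mathcal{X}$, then the entropy is defined by
$$ H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x). \tag{16} $$
Here the unit of entropy is the bit and the base of the logarithm is 2. The joint entropy of $X$ and $Y$ is defined as
$$ H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y), \tag{17} $$
where $p(x, y)$ is the joint probability mass function of $X$ and $Y$, both of them being discrete.
The conditional entropy quantifies the remaining uncertainty of $Y$ given $X$, and is given by
$$ H(Y \mid X) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y \mid x), \tag{18} $$
where $p(y \mid x)$ is the conditional probability mass function of $Y$ given $X$. The mathematical definition of MI quantifying the dependency of two random variables is given by
$$ I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}. \tag{19} $$
For continuous random variables, with $p$ now denoting densities, the definitions of entropy, conditional entropy and MI are
$$ H(X) = -\int p(x) \log p(x)\, dx, \tag{20} $$
$$ H(Y \mid X) = -\iint p(x, y) \log p(y \mid x)\, dx\, dy, \tag{21} $$
$$ I(X; Y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}\, dx\, dy. \tag{22} $$
The entropy and MI satisfy the following relationship
$$ I(X; Y) = H(Y) - H(Y \mid X), \tag{23} $$
which is the reduction of the uncertainty of $Y$ when $X$ is known (Kullback, 1997). The MI can also be rewritten as $I(X; Y) = H(X) + H(Y) - H(X, Y)$, where $H(X, Y)$ is the joint entropy of $X$ and $Y$. When the variables $X$ and $Y$ are independent, the MI of the two variables is zero. Compared to the Pearson correlation coefficient, which only measures linear relationships, MI can measure any relationship between variables (Kullback, 1997).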
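For discrete variables these quantities can be estimated directly from relative frequencies. The sketch below is a plug-in estimate using base-2 logarithms and the identity that writes MI as the sum of the two marginal entropies minus the joint entropy; the function names are chosen for this example.

```python
import math
from collections import Counter

def entropy(xs):
    """Plug-in entropy (in bits) of a sequence of discrete observations."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), with the joint entropy taken over pairs."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

x = [0, 0, 1, 1]
print(mutual_information(x, [0, 0, 1, 1]))  # y determined by x
print(mutual_information(x, [0, 1, 0, 1]))  # y independent of x in this sample
```

When one variable determines the other, the MI equals the entropy of that variable; when the empirical joint distribution factorises, the MI is zero.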
Computation of MI in Classification Problems
Consider the class label $Y$ for a $C$-class classification problem and let the number of observations in the $c$-th class be $n_c$ such that $\sum_{c=1}^{C} n_c = n$, with $n$ the number of training observations as mentioned in (1). In terms of classification problems, we are interested in finding the relevance of the $j$-th feature $X_j$ with the class label $Y$, which is measured by their Mutual Information (MI)
$$ I(X_j; Y) = H(Y) - H(Y \mid X_j). \tag{24} $$
In this equation, the entropy of the class variable $Y$ can be computed using (16) as
$$ H(Y) = -\sum_{c=1}^{C} P(Y = c) \log P(Y = c). \tag{25} $$
We can easily estimate $P(Y = c)$ by $n_c / n$. The estimation of $H(Y \mid X_j)$ is given by (18) when $X_j$ is discrete and by (21) when $X_j$ is continuous. For discrete feature variables, estimating the probabilities is straightforward by means of a histogram approximation (Kwak and Choi, 2002).
For continuous features, entropy estimation is not straightforward due to the problem of estimation of , where is discrete and is continuous. Note that we need to estimate the conditional density of at the classes represented by and not the joint density. We can use a Parzen window estimation approach to estimate (Kwak and Choi, 2002) given by
where is the window function and is smoothing parameter. Rectangular and Gaussian functions are suitable window functions (Duda et al., 2012) and if is selected appropriately, converges to (Kwak and Choi, 2002). We can calculate using the Parzen window approach
and then estimate from (21) by replacing the integral by summation over training observations and using to arrive at
Using the Parzen window approach, along with (25) and (29), we can calculate the Mutual Information from (24) between any feature and the class variable , which measures the relevance of the feature in classification. Using this, a weight is assigned to each feature in Mutual Information based KNNI (MI-KNN) (García-Laencina et al., 2009), such that
and then the distance between instances is calculated similar to (7):
where is as defined in (8). Using this feature relevance weighted distance, replacing with (from (31)) in (11)-(12), and following Algorithm 1, we obtain the MI-KNN imputation algorithm (García-Laencina et al., 2009).
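The estimation of MI between a continuous feature and the class label can be sketched with a Gaussian Parzen window as below. This is a minimal illustration in the spirit of Kwak and Choi (2002), not the authors' implementation: the fixed bandwidth `h`, the function names, and the normalisation of the MI values so that the feature weights sum to one are all assumptions of this example.

```python
import math

def gaussian_window(u, h):
    """One-dimensional Gaussian window with smoothing parameter h."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2 * math.pi))

def class_entropy(labels):
    """Entropy (bits) of the class variable, using class frequencies n_c / n."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return -sum((m / n) * math.log2(m / n) for m in counts.values())

def parzen_mi(x, labels, h=0.5):
    """Estimate I(X;Y) = H(Y) - H(Y|X) for continuous x and discrete labels."""
    n = len(x)
    classes = set(labels)
    cond_entropy = 0.0
    for xi in x:
        # Window mass contributed by the training points of each class at xi.
        mass = {c: 0.0 for c in classes}
        for xl, cl in zip(x, labels):
            mass[cl] += gaussian_window(xi - xl, h)
        total = sum(mass.values())
        for c in classes:
            p = mass[c] / total            # estimated P(Y = c | X = xi)
            if p > 0:
                cond_entropy -= p * math.log2(p) / n   # average over observations
    return class_entropy(labels) - cond_entropy

def mi_weights(columns, labels, h=0.5):
    """Normalise per-feature MI estimates into weights that sum to one."""
    mi = [max(parzen_mi(col, labels, h), 0.0) for col in columns]
    s = sum(mi)
    return [m / s for m in mi] if s > 0 else [1.0 / len(mi)] * len(mi)

labels = [0, 0, 0, 1, 1, 1]
informative = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
constant = [1.0] * 6
print(mi_weights([informative, constant], labels))
```

A feature that separates the classes receives almost all of the weight, while a feature that carries no class information receives a weight close to zero, so it barely influences the weighted distance.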
2.4 Grey Relational Analysis (GRA) based KNNI
Grey System Theory (GST) was developed to tackle systems with partially known and partially missing information (Deng, 1982). The system is called grey because missing data is represented by black and known data by white, and a grey system contains both. To obtain Grey-based k-nearest neighbors, our algorithm uses Grey Relational Analysis (GRA) to calculate the Grey distance between two instances. Grey distance measures the similarity of two instances and involves the Grey Relational Coefficient (GRC) and the Grey Relational Grade (GRG).
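Consider the setup in (1) where the training dataset has $n$ observations and $p$ features. The Grey Relational Coefficient (GRC) between two instances/observations $x_i$ and $x_l$, when the $j$-th feature is continuous and observed for both instances, is
$$ GRC(x_{ij}, x_{lj}) = \frac{\min_{l'} \min_{j'} |x_{ij'} - x_{l'j'}| + \rho \max_{l'} \max_{j'} |x_{ij'} - x_{l'j'}|}{|x_{ij} - x_{lj}| + \rho \max_{l'} \max_{j'} |x_{ij'} - x_{l'j'}|}, \tag{32} $$
where $\rho \in [0, 1]$ (usually $\rho = 0.5$ is taken (Deng, 1982)), $i, l = 1, \dots, n$ and $j = 1, \dots, p$; for a categorical feature $j$, $GRC(x_{ij}, x_{lj})$ is 1 if the two instances have the same value and 0 otherwise. If either $x_{ij}$ or $x_{lj}$ is missing, then $GRC(x_{ij}, x_{lj})$ is 0. The Grey Relational Grade (GRG) between the instances is defined as:
$$ GRG(x_i, x_l) = \frac{1}{p} \sum_{j=1}^{p} GRC(x_{ij}, x_{lj}), \tag{33} $$
where $l = 1, \dots, n$. We note that if $GRG(x_i, x_a)$ is larger than $GRG(x_i, x_b)$ then the difference between $x_i$ and $x_a$ is less than the difference between $x_i$ and $x_b$, which is the opposite of the Heterogeneous Euclidean Overlap Metric (HEOM) (7) defined in Section 2.2.1. The Grey Relational Grade satisfies the following axioms, which make it a distance metric (Deng, 1982):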
Normality: The value of $GRG(x_i, x_l)$ is between 0 and 1.
Dual Symmetry: Given only two observations $x_i$ and $x_l$ in the relational space, then $GRG(x_i, x_l) = GRG(x_l, x_i)$.
Wholeness: If 3 or more observations are made in the relational space then $GRG(x_i, x_l)$ is generally not equal to $GRG(x_l, x_i)$ for any $l$.
GRA is generally preferred over metrics such as the Heterogeneous Euclidean Overlap Metric (HEOM) for grey systems with missing data (Huang and Lee, 2004). Due to its normality, it gives a normalized measuring function for both missing/available and categorical/continuous data. Due to its wholeness, it also gives whole relational orders over the entire relational space. So if, instead of the HEOM distance in (7), we use the GRG to select the k-nearest neighbors and then proceed with the KNN imputation technique without using weights, we obtain Grey KNN (GKNN) Imputation (Zhang, 2012).
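The following sketch computes the GRG between one reference instance and a set of candidate instances, following the usual form of the GRC with distinguishing coefficient 0.5. It assumes continuous features have already been scaled to a common range, uses `None` for missing values, and takes the minimum and maximum differences over the continuous entries observed in both instances; these conventions and the function name are assumptions of this example.

```python
def grey_relational_grades(ref, others, is_categorical, rho=0.5):
    """GRG between `ref` and each instance in `others`; larger means closer.

    None marks a missing value, whose coefficient is set to 0.
    """
    # Absolute differences on continuous features observed in both instances.
    diffs = [
        [abs(r - v) if (r is not None and v is not None and not cat) else None
         for r, v, cat in zip(ref, row, is_categorical)]
        for row in others
    ]
    observed = [d for row in diffs for d in row if d is not None]
    d_min = min(observed) if observed else 0.0
    d_max = max(observed) if observed else 0.0
    grades = []
    for row, drow in zip(others, diffs):
        coeffs = []
        for j, (r, v) in enumerate(zip(ref, row)):
            if r is None or v is None:
                coeffs.append(0.0)                      # missing entry
            elif is_categorical[j]:
                coeffs.append(1.0 if r == v else 0.0)   # categorical match
            elif d_max == 0.0:
                coeffs.append(1.0)                      # all differences are zero
            else:
                coeffs.append((d_min + rho * d_max) / (drow[j] + rho * d_max))
        grades.append(sum(coeffs) / len(coeffs))        # average over features
    return grades

ref = [0.0, "a"]
others = [[0.0, "a"], [1.0, "b"], [0.5, None]]
print(grey_relational_grades(ref, others, is_categorical=[False, True]))
```

Sorting the candidates by decreasing grade then gives the Grey-based nearest neighbours; note that a missing entry lowers the grade rather than being skipped.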
2.5 Transformation of the Data
Before we apply our version of the algorithm, we transform the continuous features contained in the training dataset, since we deal with a wide variety of features whose ranges vary vastly. For example, the range of marks in a 10 point exam would be smaller than the range of marks for a 100 point exam, and both these marks may be in the same training dataset. The distance metric and subsequently the k-nearest neighbors would be biased unless the ranges of the continuous variables are normalized. In our algorithm, we transformed the $j$-th feature of observation $i$ as
$$ x_{ij} \leftarrow \frac{x_{ij} - \min_j}{\max_j - \min_j}, \tag{34} $$
where $\min_j = \min_i x_{ij}$ and $\max_j = \max_i x_{ij}$ are taken over the observed values of feature $j$. Thus (34) ensures all the continuous variables are between 0 and 1. Note that the distance metric associated with categorical variables (Euclidean or Grey-based) lies within 0 and 1 as well.
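A minimal sketch of this rescaling for one continuous feature is given below; missing entries (represented here by `None`, an assumption of this example) are left untouched and the minimum and maximum are taken over the observed values only.

```python
def min_max_scale(column):
    """Rescale the observed values of one continuous feature to [0, 1]."""
    observed = [v for v in column if v is not None]
    lo, hi = min(observed), max(observed)
    if hi == lo:
        # Constant feature: map every observed value to 0 to avoid dividing by zero.
        return [None if v is None else 0.0 for v in column]
    return [None if v is None else (v - lo) / (hi - lo) for v in column]

ten_point_exam = [2.0, None, 6.0, 4.0]
hundred_point_exam = [20.0, 55.0, None, 90.0]
print(min_max_scale(ten_point_exam))
print(min_max_scale(hundred_point_exam))
```

After the transformation both exam scores lie on the same scale, so neither dominates the distance computation.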
2.6 The Proposed Class-weighted Grey k-Nearest Neighbor (CGKNN) Algorithm
We consider the class weight associated with the -th attribute and use this to weigh the Grey Relational Gradient (GRG) between two observations and as follows
Since increases for closer neighbors unlike the other distance metrices, we use in section 2.2.2 and then measure the distance between instances to choose the -nearest neighbors, . From (11), we derive that the corresponding weights of are
Using these weights, we impute the continuous variables, and the new definition of in (12)-(14) is used to impute the categorical variables for our Class-weighted Grey KNN (CGKNN) Imputation Algorithm.
The overall framework for our proposed algorithm is formalized as Algorithm 2 below.
Initialization: We use the class labels in to split into . For each class , given , we pre-impute the missing values of categorical features by mode imputation and the missing values of the continuous features by mean imputation using the observed data in that class. We call the initially imputed matrix . Repeat this for . Fuse them to form .
Choosing k: We use this initially imputed matrix to calculate the optimum value of k using 10-fold cross validation (Stone, 1974) to minimize the misclassification rate of predicting the class labels. This is the k used for choosing the nearest neighbors.
Iterative Step: Consider the iteration number and class number . For each instance in the class which has a missing value, calculate the GRG of that instance with all other instances of the class . After sorting the GRG in descending order, the first observations are chosen to form the set of nearest neighbors . Using the weights as described in (36), the imputed matrix is obtained by imputing the missing continuous features (with corresponding ) using the steps in (11) and the missing categorical features using (12)-(14) with . This is repeated for each until all missing values are imputed to obtain . If the stopping criterion is not met, then the iteration on continues.
Stopping Criterion: We stop at the iteration in which the stopping criterion is met. The stopping criterion we propose is
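The initialization step of the algorithm above, in which each class is pre-imputed separately before the iterations start, can be sketched as follows. This is only an illustration of that one step under simple assumptions (rows as Python lists, `None` for missing values, a hypothetical function name); the iterative neighbour search and the stopping rule are not shown.

```python
from collections import Counter

def class_wise_preimpute(rows, labels, is_categorical):
    """Fill each None with the mean (continuous) or mode (categorical)
    of the observed values of that feature within the same class."""
    filled = [list(r) for r in rows]          # work on a copy
    for c in set(labels):
        idx = [i for i, y in enumerate(labels) if y == c]
        for j, cat in enumerate(is_categorical):
            observed = [rows[i][j] for i in idx if rows[i][j] is not None]
            if not observed:
                continue                      # nothing observed in this class
            if cat:
                fill = Counter(observed).most_common(1)[0][0]
            else:
                fill = sum(observed) / len(observed)
            for i in idx:
                if filled[i][j] is None:
                    filled[i][j] = fill
    return filled

rows = [[1.0, "a"], [None, "a"], [3.0, None], [10.0, "b"], [None, "b"]]
labels = [0, 0, 0, 1, 1]
print(class_wise_preimpute(rows, labels, is_categorical=[False, True]))
```

Pre-imputing within each class keeps the starting values of one class from being pulled towards the mean or mode of another, which matches the class-aware spirit of the method.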
2.7 Time Complexity of the Algorithm
Consider the setup (1) with observations, features and classes. The time complexity for calculating the in the biggest class containing (say) observations is and the average processing time for sorting the is . If we assume iterations are taken for the algorithm to converge, then the algorithm has a complexity of to impute an matrix. We do this for classes and thus the time complexity for imputing an matrix is . Now, generally whenever , which implies , and since was the biggest class. This gives rise to the inequality
We initially calculate the Mutual Information of each attribute with the class variable, which takes time along with the imputation of the mean/mode which again takes time and choosing an optimum which takes time if we assume values of are tested using 10-fold cross-validation. So our total complexity becomes which can be approximated to if the value of is large compared to . We note that this time complexity is similar to Grey KNNI (GKNN) and Feature-Weighted Grey KNNI (FWGKNN) but less than the complexity of Iterative KNNI (IKNN) and the Grey-Based Nearest Neighbor (GBNN) algorithm (Huang and Lee, 2004).
3 Simulation Studies
In this section we explore the performance of our proposed Class-weighted Grey KNN (CGKNN) algorithm in recovering missing entries and improving the classification accuracy, and we report on computational efficiency of the algorithm. We compare our method with 6 other well-established methods which are as follows:
MICE (Multiple Imputation using Chained Equations): Multiple imputation using Fully Conditional Specification (FCS) of the variables is implemented by the MICE algorithm developed by van Buuren and Oudshoorn (Buuren and Oudshoorn, 1999). Each variable has its own imputation model. We use the built-in imputation models for continuous data (predictive mean matching), binary data (logistic regression), unordered categorical data (polytomous logistic regression), and ordered categorical data (proportional odds), chosen according to the type of variable encountered during imputation.
MissForest: This is an iterative imputation method based on a random forest, developed by Stekhoven and Bühlmann (Stekhoven and Bühlmann, 2011). This non-parametric algorithm can handle both continuous and categorical data, and it also gives an out-of-bag estimate of the imputation error.
Iterative k-Nearest Neighbor imputation (IKNN): This method adopts the Euclidean distance to compute similarities between instances. Mean imputation is regarded as a preliminary estimate. The k closest objects are selected from the candidate data, which contain all instances except the one that is to be imputed. The complete instances are updated after the first imputation, and the iteration procedure is repeated until the convergence criterion is reached.
Mutual Information based k-Nearest Neighbor imputation (MI-KNN): This approach first measures the relevance of each feature in the classification problem, similar to the approach described in Section 2.3, and uses a weighted Euclidean distance to measure the distance between instances, with the mutual information as the weights (García-Laencina et al., 2009). Mean or mode imputation is used as a preliminary estimate. All imputed instances and all complete instances are considered to be known information for estimating missing values iteratively. The missing values are then imputed based on the weighted mean or mode of the k nearest neighbours.
Grey k-Nearest Neighbor imputation (GKNN): This approach is used to handle heterogeneous data (numerical and categorical data). It uses GRA to measure the relationships between instances and to find the nearest neighbours from which the missing values are estimated (Zhang, 2012). The dataset is divided into several parts based on the class label and imputation is performed on each of them. Mean or mode imputation is used as a preliminary estimate. The missing values are then imputed iteratively, based on the mean or mode of the nearest neighbors sorted by Grey distance.
Feature Weighted Grey k-Nearest Neighbor (FWGKNN): This approach employs Mutual Information (MI) to measure inter-feature relevance in the feature matrix. It then uses GRA to find the distance between instances and weights them by the inter-feature relevance (Pan et al., 2015). The difference between FWGKNN and our CGKNN algorithm is that in our algorithm the mutual information is computed between the class variable and the features, whereas in FWGKNN it is computed between the features themselves. Our approach is focused on classification relevance instead of inter-feature relevance.
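To make the role of the feature weights in a grey-distance neighbour search concrete, the following is a minimal sketch and not the authors' implementation. It assumes continuous features scaled to comparable ranges, the standard grey relational coefficient of Deng (1982) with a distinguishing coefficient `rho = 0.5`, and a weight vector `w` standing in for the normalized MI between each feature and the class label. The function names and the toy data are hypothetical.

```python
import numpy as np

def weighted_grg(x0, candidates, w, rho=0.5):
    """Weighted grey relational grade (GRG) between x0 and each candidate row.

    x0         : (p,) reference instance; NaN marks a missing feature
    candidates : (n, p) complete instances, assumed to come from the same class
    w          : (p,) non-negative feature weights (assumed: normalized MI with
                 the class label), with positive mass on the observed features
    rho        : distinguishing coefficient in (0, 1]
    """
    x0 = np.asarray(x0, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    w = np.asarray(w, dtype=float)
    observed = ~np.isnan(x0)                        # compare on observed features only
    diff = np.abs(candidates[:, observed] - x0[observed])
    d_min, d_max = diff.min(), diff.max()
    if d_max == 0:                                  # every candidate equals x0
        return np.ones(candidates.shape[0])
    # Grey relational coefficient for every (candidate, feature) pair.
    grc = (d_min + rho * d_max) / (diff + rho * d_max)
    w_obs = w[observed] / w[observed].sum()         # renormalize over observed features
    return grc @ w_obs                              # one GRG per candidate

def k_nearest_by_grg(x0, candidates, w, k, rho=0.5):
    """Indices of the k candidates with the largest GRG (most similar first)."""
    grg = weighted_grg(x0, candidates, w, rho)
    return np.argsort(-grg, kind="stable")[:k]

# Toy example (hypothetical values): the third feature of x0 is missing.
X = np.array([[0.10, 0.20, 0.9],
              [0.80, 0.90, 0.1],
              [0.15, 0.25, 0.5]])
x0 = np.array([0.10, 0.20, np.nan])
w = np.array([0.6, 0.3, 0.1])
neighbours = k_nearest_by_grg(x0, X, w, k=2)
# Row 0 matches x0 exactly on the observed features, so it is ranked first.
```

Under this sketch, replacing `w` with inter-feature relevance weights would give a neighbour ranking in the spirit of FWGKNN, while class-based weights correspond to the idea behind CGKNN.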
3.1 Performance Measure
We measure the performance of each algorithm according to the following metrics:
Accuracy of Prediction: The root mean square error (RMSE) is used to evaluate the precision of imputation as follows:
\[ \mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(e_i - \tilde{e}_i\right)^2}, \]
where $e_i$ is the true value, $\tilde{e}_i$ is the imputed value of the missing data, and $m$ denotes the number of missing values.
Classification accuracy (CA): After estimating the missing values, an incomplete dataset can be treated as a complete dataset. We used classification accuracy to evaluate all the imputation methods and to show the impact of imputation on the accuracy of classification as follows:
\[ \mathrm{CA} = \frac{1}{n}\sum_{i=1}^{n} I\left(EC_i = TC_i\right), \]
where $n$ denotes the number of instances in the dataset, $EC_i$ and $TC_i$ are the classification result for the $i$-th instance and the true class label, respectively, and $I(\cdot)$ is the indicator function.
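The two metrics can be computed in a few lines. The sketch below is illustrative only: the function names are hypothetical, and it assumes the RMSE is evaluated solely over the entries that were artificially removed, so that the true values are known.

```python
import numpy as np

def imputation_rmse(true_values, imputed_values):
    """Root mean square error over the entries that were missing."""
    true_values = np.asarray(true_values, dtype=float)
    imputed_values = np.asarray(imputed_values, dtype=float)
    return float(np.sqrt(np.mean((true_values - imputed_values) ** 2)))

def classification_accuracy(predicted_labels, true_labels):
    """Fraction of instances whose predicted label equals the true label."""
    predicted_labels = np.asarray(predicted_labels)
    true_labels = np.asarray(true_labels)
    return float(np.mean(predicted_labels == true_labels))

# Toy usage with made-up numbers: `mask` marks the entries that were removed.
X_true = np.array([[1.0, 2.0], [3.0, 4.0]])
X_imputed = np.array([[1.0, 2.5], [3.0, 4.0]])
mask = np.array([[False, True], [False, False]])
rmse = imputation_rmse(X_true[mask], X_imputed[mask])         # 0.5
acc = classification_accuracy([0, 1, 1, 0], [0, 1, 0, 0])     # 0.75
```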
3.2 Simulation Scenarios
Missing Completely at Random (MCAR) Example
We use an artificial example to demonstrate the effect of mutual information with the class variable while selecting the k-Nearest Neighbors. We took a separable example with four cubes drawn in a three dimensional space. Fig. 1 shows this artificial problem. Two cubes belong to class 1, and they are centered on and . The remaining two cubes are labeled with the class 2, being centered on and . In all the cubes, the radius is equal to 0.10, and they are composed of 100 samples which are uniformly distributed inside the cube. In this problem, the MI values between the three attributes and the target class are computed: 0.40 for , 0.28 for , and 0.21 for .
To this 3-dimensional, 2-class dataset we add 20 U[-1,1] variables. For these irrelevant variables, the MI between the feature and the class variable is almost 0. This lets us examine what happens when irrelevant attributes are added to the classification problem. We insert 10% and 20% of missing data into the feature that is most relevant according to MI. The missingness in that feature is generated completely at random, which means it does not depend on the variable values in the feature matrix.
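A data set of this kind can be simulated along the following lines. This is a sketch under stated assumptions rather than the exact setup used here: the cube centres are hypothetical placeholders, the "radius" of 0.10 is interpreted as the half side-length of each cube, and the missing entries are placed in the first feature at a 10% rate.

```python
import numpy as np

def make_cubes(centres, label, n_per_cube, half_width, rng):
    """Sample points uniformly inside axis-aligned cubes around each centre."""
    centres = np.asarray(centres, dtype=float)
    blocks = [c + rng.uniform(-half_width, half_width,
                              size=(n_per_cube, centres.shape[1]))
              for c in centres]
    X = np.vstack(blocks)
    y = np.full(X.shape[0], label)
    return X, y

def add_mcar(X, column, rate, rng):
    """Set a fixed fraction of one column to NaN, chosen uniformly at random."""
    X_missing = np.array(X, dtype=float, copy=True)
    n_missing = int(round(rate * X.shape[0]))
    rows = rng.choice(X.shape[0], size=n_missing, replace=False)
    X_missing[rows, column] = np.nan
    return X_missing

rng = np.random.default_rng(0)

# Hypothetical centres: placeholders, not the coordinates used in the paper.
X1, y1 = make_cubes([[0.2, 0.2, 0.2], [0.8, 0.8, 0.2]], 1, 100, 0.10, rng)
X2, y2 = make_cubes([[0.2, 0.8, 0.8], [0.8, 0.2, 0.8]], 2, 100, 0.10, rng)
X_relevant = np.vstack([X1, X2])
y = np.concatenate([y1, y2])

# 20 irrelevant U[-1, 1] features, independent of the class label.
noise = rng.uniform(-1.0, 1.0, size=(X_relevant.shape[0], 20))
X = np.hstack([X_relevant, noise])

# MCAR: the rows to blank are drawn without looking at any feature value.
X_mcar = add_mcar(X, column=0, rate=0.10, rng=rng)
```

Because the rows are drawn without reference to the data, the missingness mechanism is independent of both observed and unobserved values, which is what distinguishes MCAR from the MAR setting.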
The advantage of class weighting is clearer for higher percentages of missing values, as shown by the differences in Table 1. The class weighting procedure based on the MI concept discards the irrelevant features, and the selected neighbourhood for missing data estimation tends to provide reliable values for solving the classification task. We provide a detailed analysis of how all the 6 algorithms performed in this simulation setting with and classes in Table 1. Note that we used predictive mean matching as the imputation model for MICE.
We also calculated the classification accuracy of the Naive Bayes method on the non-imputed and imputed datasets with 10% and 20% missing data, using 10-fold cross-validation. The resulting improvement in accuracy is highest for our CGKNN algorithm in both cases, as shown in Table 2.
Missing at Random (MAR) Example
In this section, we illustrate how our method performs relative to the six other techniques. We simulate our data from the multivariate normal distribution and then artificially introduce missingness at random (MAR) by letting the probability of missingness depend on the observed values. We take the number of classes , the number of attributes and generate observations for each class. Specifically,
where stands for the th class, and ’s are randomly generated positive definite matrices using partial correlations (Joe, 2006). This simulation procedure ensures us that we do not have the same mean and variance for two different classes during simulation. Also, the missingness is induced using a logistic model on the missingness matrix . In real life, we often encounter covariates which are demographic in nature and thus non-missing. For this example, we assume and to be non-missing and the missingness of and to be dependent on these demographic, non-missing variables, for each class . Recall the missing matrix , which we modify to a layered 3D matrix with entries. We assume to be all and
where and are vectors of size chosen by us.
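The MAR mechanism described above can be sketched as follows. This is an illustration under assumptions, not the simulation code behind the reported tables: the choice of which columns are always observed, the logistic coefficients, the sample size, and the way the covariance matrix is generated (a simple `A A^T + pI` construction rather than the partial-correlation method of Joe, 2006) are all hypothetical, and only a single class is simulated for brevity.

```python
import numpy as np

def add_mar_logistic(X, observed_cols, target_cols, intercepts, slopes, rng):
    """Blank entries of target_cols with a probability that follows a logistic
    model in the always-observed columns (a MAR mechanism).

    intercepts : (len(target_cols),) array
    slopes     : (len(target_cols), len(observed_cols)) array
    """
    X_missing = np.array(X, dtype=float, copy=True)
    Z = X_missing[:, observed_cols]                 # fully observed covariates
    for j, col in enumerate(target_cols):
        logit = intercepts[j] + Z @ slopes[j]
        prob = 1.0 / (1.0 + np.exp(-logit))         # P(entry missing | Z)
        drop = rng.uniform(size=X_missing.shape[0]) < prob
        X_missing[drop, col] = np.nan
    return X_missing

rng = np.random.default_rng(1)
p = 5                                               # hypothetical number of attributes
A = rng.normal(size=(p, p))
cov = A @ A.T + p * np.eye(p)                       # positive definite by construction
mean = rng.normal(size=p)
X = rng.multivariate_normal(mean, cov, size=200)    # one class, 200 observations

# Hypothetical coefficients: columns 0 and 1 are always observed and drive
# the missingness of columns 2 and 3.
X_mar = add_mar_logistic(
    X,
    observed_cols=[0, 1],
    target_cols=[2, 3],
    intercepts=np.array([-2.0, -2.0]),
    slopes=np.array([[0.5, -0.3], [0.2, 0.4]]),
    rng=rng,
)
```

Since the probability of missingness depends only on columns that are never missing, the mechanism is MAR: conditional on the observed data, missingness carries no further information about the blanked values.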
We provide a detailed analysis of how all the 6 algorithms performed in this simulation setting with and classes in Table 3. Note that we used predictive mean matching as the imputation model for MICE.
We also calculated the classification accuracy of the Naive Bayes method on the non-imputed and imputed datasets with 10% and 20% missing data, using 10-fold cross-validation. The resulting improvement in accuracy is highest for our CGKNN algorithm in both cases, as shown in Table 4.
4 Applications to UCI Machine Learning Repository Datasets
We evaluate the effectiveness of our imputation algorithm on 3 datasets obtained from the UCI Machine Learning Repository (Newman et al., 2008): the Iris (Fisher's Iris Dataset), Voting and Hepatitis datasets, whose characteristics are summarized in Table 5.
|Dataset|Instances|Features|Classes|Feature type|% Missing Rate|
We then introduce 3 different rates of artificial missingness at random (MAR): 5%, 10% and 20%. Next we run each of the imputation algorithms and calculate the RMSE of imputation after each algorithm converges. For MICE, we used predictive mean matching for continuous variables and polytomous logistic regression for categorical variables. Table 6 shows that in almost all cases our CGKNN algorithm performs better than the other algorithms, particularly at higher percentages of missing values. MICE performs the worst in most cases, followed by MissForest, probably because neither takes into account any sort of feature relevance.
We use a Naive Bayes classifier on the Iris dataset with 5% - 20% missingness and see that our CGKNN algorithm outperforms the closest approach FWGKNN and also GKNN, when used as an imputation approach before the classifier. The CGKNN algorithm also converges quite fast with respect to classification accuracy as shown in Fig. 5.
Missing data is a classical drawback for most classification algorithms. However, most missing data imputation techniques have been developed without taking into account the class information, which is always present in a supervised machine learning problem. k-Nearest Neighbors is a good technique for imputation of missing data and has been shown to perform well against many other imputation procedures. We have proposed a method which not only takes into account the class information, but also uses a better metric to calculate the nearest neighbors in KNN imputation. Our Class-weighted Grey k-Nearest Neighbor (CGKNN) approach has the same time complexity as the previous grey-based algorithms and a lower complexity than some KNN imputation algorithms such as Grey-Based k-Nearest Neighbor (GBNN) and Iterative k-Nearest Neighbor (IKNN) imputation. We have shown that, as far as imputation is concerned, it outperforms all the other algorithms in simulated settings, as well as at high rates of missingness in actual (non-simulated) datasets. We also show that it improves the accuracy of classification more than other imputation procedures. We do not make any assumptions regarding the variables of the feature matrix and thus, for any classification problem, our method can be used to impute missing data in the feature matrix.
However, an open problem is the selection of k in our nearest neighbors approach; we have chosen it through cross-validation, which takes time. The reason k is difficult to predict is that we do not have anything against which to validate the true value of k in our datasets. A potential direction for future research is to select the value of k in a smart, effective manner without involving cross-validation. Our algorithm has not been theoretically proven to converge, although convergence has been observed empirically. Finding the rate of convergence of our CGKNN algorithm is a good theoretical problem to consider.
Another potentially interesting future research topic would be to extend this idea to regression problems, where the outcome is continuous instead of categorical. The imputation of the data matrix could be done with the help of information from the outcome, since the two are assumed to be related in a regression setting. We could also look into better methods than mutual information (MI) for measuring the relationship between the features and the class variable, and then use them as weights for the Grey distance. Another potential direction is to develop an algorithm which imputes and classifies simultaneously, thus yielding a better classification in a single step instead of performing imputation and classification at two different stages. This idea has already been explored in Learning Vector Quantization (LVQ) (Villmann et al., 2006) but can be vastly improved.
The most difficult challenge, however, is to find imputation techniques for data that are Not Missing at Random (NMAR). It is difficult to model this setting without making strong assumptions, and much development is still possible in that area. The main difficulty is to tackle the problem without assuming anything that may cause a bias, and that is not possible. Hopefully, new ideas will emerge in the future which will make the NMAR problem easier to handle.
We would like to acknowledge support for this project from the National Cancer Institute grant P01 CA142538.
- David W Aha, Dennis Kibler, and Marc K Albert. Instance-based learning algorithms. Machine Learning, 6(1):37–66, 1991.
- Ethem Alpaydin. Introduction to Machine Learning. MIT press, 2009.
- Gustavo EAPA Batista and Maria C Monard. A study of k-nearest neighbour as an imputation method. In Second International Conference on Hybrid Intelligent Systems, volume 87, pages 251–260, 2002.
- Gustavo EAPA Batista and Maria Carolina Monard. An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence, 17(5-6):519–533, 2003.
- Dimitris Bertsimas, Colin Pawlowski, and Ying Daisy Zhuo. From predictive methods to missing data imputation: An optimization approach. Journal of Machine Learning Research, 18(196):1–39, 2017.
- Lígia P Brás and José C Menezes. Improving cluster-based missing value estimation of DNA microarray data. Biomolecular Engineering, 24(2), 2007.
- Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
- Leo Breiman, Jerome H Friedman, Richard A Olshen, and Charles J Stone. Classification and Regression Trees. Wadsworth, Belmont, California, 1984.
- Stef van Buuren. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16(3):219–242, 2007.
- Stef van Buuren and Karin Oudshoorn. Flexible multivariate imputation by MICE. TNO Prevention and Health, 1999.
- Stef van Buuren, Jaap PL Brand, Catharina GM Groothuis-Oudshoorn, and Donald B Rubin. Fully conditional specification in multivariate imputation. Journal of Statistical Computation and Simulation, 76(12):1049–1064, 2006.
- Krzysztof J Cios and George W Moore. Uniqueness of medical data mining. Artificial Intelligence in Medicine, 26(1-2):1–24, 2002.
- Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.
- Ju-Long Deng. Control problems of grey systems. Systems & Control Letters, 1(5):288–294, 1982.
- Richard O Duda, Peter E Hart, and David G Stork. Pattern Classification. John Wiley & Sons, 2012.
- Sahibsingh A Dudani. The distance-weighted k-nearest-neighbor rule. IEEE Transactions on Systems, Man, and Cybernetics, SMC-6(4):325–327, 1976.
- Alireza Farhangfar, Lukasz A Kurgan, and Witold Pedrycz. A novel framework for imputation of missing values in databases. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 37(5):692–709, 2007.
- Alireza Farhangfar, Lukasz Kurgan, and Jennifer Dy. Impact of imputation of missing values on classification error for discrete data. Pattern Recognition, 41(12):3692–3705, 2008.
- Pedro J García-Laencina, José-Luis Sancho-Gómez, Aníbal R Figueiras-Vidal, and Michel Verleysen. K nearest neighbours with mutual information for simultaneous classification and missing data imputation. Neurocomputing, 72(7-9):1483–1493, 2009.
- Pedro J García-Laencina, José-Luis Sancho-Gómez, and Aníbal R Figueiras-Vidal. Classifying patterns with missing values using multi-task learning perceptrons. Expert Systems with Applications, 40(4):1333–1341, 2013.
- Andrew Gelman and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge university press, 2006.
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning. Springer Series in Statistics. Springer, New York, second edition, 2009. ISBN 978-0-387-84857-0. doi: 10.1007/978-0-387-84858-7. URL http://dx.doi.org/10.1007/978-0-387-84858-7. Data mining, inference, and prediction.
- James Honaker, Gary King, and Matthew Blackwell. Amelia II: A program for missing data. Journal of Statistical Software, 45(7):1–47, 2011.
- Chi-Chun Huang and Hahn-Ming Lee. A grey-based nearest neighbor approach for missing attribute value prediction. Applied Intelligence, 20(3):239–252, 2004.
- Chi-Chun Huang and Hahn-Ming Lee. An instance-based learning approach based on grey relational structure. Applied Intelligence, 25(3):243–251, 2006.
- Harry Joe. Generating random correlation matrices based on partial correlations. Journal of Multivariate Analysis, 97(10):2177–2189, 2006.
- Hyunsoo Kim, Gene H Golub, and Haesun Park. Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics, 21(2):187–198, 2004.
- Solomon Kullback. Information theory and statistics. Courier Corporation, 1997.
- Nojun Kwak and Chong-Ho Choi. Input feature selection by mutual information based on Parzen window. IEEE Transactions on Pattern Analysis & Machine Intelligence, 24(12):1667–1671, 2002.
- Kamakshi Lakshminarayan, Steven A Harp, and Tariq Sammad. Imputation of missing data in industrial databases. Applied Intelligence, 11(3):259–275, 1999.
- Mihail Halatchev and Le Gruenwald. Estimating missing values in related sensor data streams. In COMAD, 2005.
- Dan Li, Jitender Deogun, William Spaulding, and Bill Shuart. Towards missing data imputation: a study of fuzzy k-means clustering method. In International Conference on Rough Sets and Current Trends in Computing, pages 573–579. Springer, 2004.
- Kim-Hung Li. Imputation using Markov chains. Journal of Statistical Computation and Simulation, 30(1):57–79, 1988.
- Roderick JA Little and Donald B Rubin. Statistical Analysis with Missing Data, volume 2. John Wiley & Sons, 2002.
- Julián Luengo, Salvador García, and Francisco Herrera. On the choice of the best imputation methods for missing values considering three groups of classification methods. Knowledge and Information Systems, 32(1):77–108, 2012.
- David J Newman, Sett CLB Hettich, Cason L Blake, and Christopher J Merz. UCI repository of machine learning databases, 2008.
- Ruilin Pan, Tingsheng Yang, Jianhua Cao, Ke Lu, and Zhanchao Zhang. Missing data imputation by k nearest neighbours based on grey relational structure and mutual information. Applied Intelligence, 43(3):614–632, 2015.
- John R Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., 1993.
- Trivellore E Raghunathan, James M Lepkowski, John Van Hoewyk, and Peter Solenberger. A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology, 27(1):85–96, 2001.
- Donald B Rubin. Formalizing subjective notions about the effect of nonrespondents in sample surveys. Journal of the American Statistical Association, 72(359):538–543, 1977.
- Donald B Rubin and Joseph L Schafer. Efficiently creating multiple imputations for incomplete multivariate normal data. In Proceedings of the Statistical Computing Section of the American Statistical Association, volume 83, page 88. American Statistical Association, 1990.
- Joseph L Schafer. Analysis of incomplete multivariate data. Chapman and Hall, 1997.
- Amir M Sefidian and Negin Daneshpour. Missing value imputation using a novel grey based fuzzy c-means, mutual information based feature selection, and regression model. Expert Systems with Applications, 115:68–94, 2019.
- Nicolas Städler, Daniel J Stekhoven, and Peter Bühlmann. Pattern alternating maximization algorithm for missing data in high-dimensional problems. The Journal of Machine Learning Research, 15(1):1903–1928, 2014.
- Bas van Stein and Wojtek Kowalczyk. An incremental algorithm for repairing training sets with missing values. In International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, pages 175–186. Springer, 2016.
- Daniel J Stekhoven and Peter Bühlmann. MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112–118, 2011.
- Mervyn Stone. Cross-validation and multinomial prediction. Biometrika, 61(3):509–515, 1974.
- Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520–525, 2001.
- Cheng-Jung Tsai, Chien-I Lee, and Wei-Pang Yang. A discretization algorithm based on class-attribute contingency coefficient. Information Sciences, 178(3):714–731, 2008.
- Thomas Villmann, Frank-Michael Schleif, and Barbara Hammer. Comparison of relevance learning vector quantization with other metric adaptive classification methods. Neural Networks, 19(5):610–622, 2006.
- Xindong Wu, Vipin Kumar, J Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J McLachlan, Angus Ng, Bing Liu, Philip S Yu, et al. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1–37, 2008.
- Shichao Zhang. Shell-neighbor method and its application in missing data imputation. Applied Intelligence, 35(1):123–133, 2011.
- Shichao Zhang. Nearest neighbor selection for iteratively kNN imputation. Journal of Systems and Software, 85(11):2541–2552, 2012.