Clustering with Missing Features: A Penalized Dissimilarity Measure based approach

# Clustering with Missing Features: A Penalized Dissimilarity Measure based approach

###### Abstract

Many real-world clustering problems are plagued by incomplete data characterized by missing or absent features for some or all of the data instances. Traditional clustering methods cannot be directly applied to such data without preprocessing by imputation or marginalization techniques. In this article, we put forth the concept of Penalized Dissimilarity Measures which estimate the actual distance between two data points (the distance between them if they were to be fully observed) by adding a penalty to the distance due to the observed features common to both the instances. We then propose such a dissimilarity measure called the Feature Weighted Penalty based Dissimilarity (FWPD). Using the proposed dissimilarity measure, we also modify the traditional k-means clustering algorithm and the standard hierarchical agglomerative clustering techniques so as to make them directly applicable to datasets with missing features. We present time complexity analyses for these new techniques and also present a detailed analysis showing that the new FWPD based k-means algorithm converges to a local optimum within a finite number of iterations. We also report extensive experiments on various benchmark datasets showing that the proposed clustering techniques have generally better results compared to some of the popular imputation methods which are commonly used to handle such incomplete data. We append a possible extension of the proposed dissimilarity measure to the case of absent features (where the unobserved features are known to be undefined).


Shounak Datta

Supritam Bhattacharjee

Swagatam Das

Corresponding author phone number: +91-33-2575-2323

Electronics and Communication Sciences Unit, Indian Statistical Institute, 203, B. T. Road, Kolkata-700 108, India.

Salt Lake City, Block-LB, Plot No. 8, Sector - III, Kolkata - 700098, India.

Keywords: Missing Features, Penalized Dissimilarity Measure, k-means, Hierarchical Agglomerative Clustering, Absent Features

## 1 Introduction

### 1.1 Overview

In data analytics, clustering is a fundamental problem concerned with partitioning a given dataset into useful groups (called clusters) according to the relative similarity of its instances. Clustering algorithms attempt to partition a set of data instances (characterised by some features) into different clusters such that the member instances of any given cluster are akin to each other and are different from the members of the other clusters. The greater the similarity within a group and the dissimilarity between groups, the better the clustering obtained by a suitable algorithm.

Clustering techniques are of extensive use and are hence being constantly investigated in statistics, machine learning, and pattern recognition. Clustering algorithms find applications in various fields such as economics, marketing, electronic design, and space research. For example, clustering has been used to group related documents for web browsing, by banks to cluster the previous transactions of clients to identify suspicious (possibly fraudulent) behaviour, for formulating effective marketing strategies by clustering customers with similar behaviour, and in earthquake studies for identifying dangerous zones based on previous epicentre locations. However, when we analyze such real-world data, we may encounter incomplete data where some features of some of the data instances are missing. For example, web documents may have some expired hyper-links. Such missingness may be due to a variety of reasons such as data input errors, inaccurate measurement, equipment malfunction or limitations, measurement noise, or data corruption. This is known as unstructured missingness (Chan and Dunn, 1972; Rubin, 1976). Alternatively, not all the features may be defined for all the data instances in the dataset. This is termed structural missingness or absence of features (Chechik et al., 2008). For example, credit-card details may not be defined for non-credit clients of a bank.

Missing features have always been a challenge for researchers because traditional learning methods (which assume all data instances to be fully observed, i.e. all the features are observed) cannot be directly applied to such incomplete data without suitable preprocessing. When the rate of missingness is low, the data instances with missing values are ignored. This approach is known as marginalization. Marginalization cannot be applied to data with a larger proportion of missing values, as it may lead to the loss of a sizable amount of information. Therefore, sophisticated methods are required to fill in the vacancies in the data, so that traditional learning methods can be applied subsequently. This approach of filling in the missing values is called imputation. Inferences drawn from data with a still higher rate of missingness may be severely warped, despite the use of such sophisticated imputation methods (Acuña and Rodriguez, 2004).

### 1.2 Literature

The initial models for feature missingness are due to Rubin and Little (Rubin, 1976; Little and Rubin, 1987). They proposed a three-fold classification of missing data mechanisms, viz. Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). MCAR refers to the case where missingness is entirely haphazard, i.e. the likelihood of a feature being unobserved for a certain data instance depends neither on the observed nor on the unobserved characteristics of the instance. For example, in an annual income survey, a citizen may be unable to participate due to unrelated reasons such as traffic or schedule problems. MAR refers to the cases where the missingness is conditional on the observed features of an instance, but is independent of the unobserved features. For example, suppose that college-goers are less likely to report their income than office-goers, but whether a college-goer will report his/her income is independent of the actual income. MNAR is characterised by the dependence of the missingness on the unobserved features. For example, people who earn less are less likely to report their incomes in the annual income survey. Schafer & Graham (2002) and Zhang et al. (2012) have observed that MCAR is a special case of MAR and that MNAR can also be converted to MAR by appending a sufficient number of additional features. Therefore, most learning techniques are based on the validity of the MAR assumption.

A lot of research on the problem of learning with missing/absent features has been conducted over the past few decades, mostly focussing on imputation methods. Several works such as Little and Rubin (1987) and Schafer (1997) provide elaborate theories and analyses of missing data. Common imputation methods (Donders et al., 2006) involve filling the missing features of data instances with zeros (Zero Imputation (ZI)), or the means of the corresponding features over the entire dataset (Mean Imputation (MI)). Class Mean Imputation or Concept Mean Imputation (CMI) is a slight modification of MI that involves filling the missing features with the average of all observations having the same label as the instance being filled. Yet another common imputation method is k-Nearest Neighbour Imputation (kNNI) (Dixon, 1979), where the missing features of a data instance are filled in by the means of corresponding features over its k-Nearest Neighbours (kNN), on the observed subspace. Grzymala-Busse & Hu (2001) suggested different approaches for filling in the missing feature values, viz. selecting the most common feature value, selecting the most common value of the feature within the same class or concept, C4.5 based imputation, assigning all possible values of the feature, assigning all possible values of the feature restricted to the given concept or class, marginalization, treating missing attribute values as special values, the event-covering method, etc.

Rubin’s book on Multiple Imputation (MtI) (Rubin, 1987) proposes a technique where the missing values are imputed by a typically small (e.g. 5-10) number of simulated versions, depending on the percentage of missing data. This method of repeated imputation incorporates the uncertainty inherent in imputation. Techniques such as Markov Chain Monte Carlo (MCMC) (Gilks et al., 1996), which simulates random draws from nonstandard distributions via Markov chains, have been used for MtI by making a few independent estimates of the missing data from a predictive distribution; these estimates are then used for MtI (Chen, 2013; Horton and Lipsitz, 2001). However, imputation methods may introduce noise and create more problems than they solve, as documented by Little & Rubin (1987) and others (Barceló, 2008; Myrtveit et al., 2001; Dempster and Rubin, 1983).

Model-based methods that make use of the inter-relationships among the features to handle the missing data have shown vast improvements over traditional imputation approaches (Ahmad and Tresp, 1993; Wang and Rao, 2002a, b). These procedures are somewhat more efficient than MtI because they often achieve better estimates of the missing feature values. Generally, in incomplete datasets, Maximum Likelihood Estimation (MLE) is used when we can maximize the likelihood function. The likelihoods are separately calculated for cases with unobserved features and for cases with complete data on all features. Then, these two likelihoods are maximized together to obtain the estimates. Dempster & Rubin (1983) proposed the use of an iterative solution, based on the Expectation Maximization (EM) algorithm, when closed form solutions to the maximization of the likelihoods are not possible. Some more sophisticated techniques have been developed, especially by the bioinformatics community, to impute the missing values by exploiting the correlations between data. Troyanskaya et al. (2001) proposed a weighted variant of kNNI and also put forth the Singular Value Decomposition based Imputation (SVDI) technique, which performs regression based estimation of the missing values using the k most significant eigenvectors of the dataset. Two variants of the Least Squares Imputation (LSI) technique were proposed by Bo et al. (2004). Sehgal et al. (2005) further combined LSI with Non-Negative LSI (NNLSI) in the Collateral Missing Value Estimation (CMVE) technique. These methods have also been shown to vastly improve the results (Sehgal et al., 2004; Ouyang et al., 2004).

However, model-based approaches are often computationally expensive. Moreover, most imputation methods assume that the pattern of missingness is ignorable. When data are MCAR or MAR, it is said that the response mechanism is ignorable. In such cases, the reasons for missing data may be ignored and simpler models can be used for missing data analysis. Heitjan & Basu (1996) provided a thorough discussion of this topic. Therefore, most imputation techniques make sense only when the features are known to exist and to be missing at random (MAR/MCAR). On the other hand, MNAR has a non-ignorable response mechanism because the mechanism governing the missingness of data itself has to be modeled to deal with the missing data. Hence, other methods have to be developed to tackle incomplete data due to MNAR (Marlin, 2008). Also, as observed earlier, imputation may often lead to the introduction of noise and uncertainty in the data. In the analysis of incomplete data, both the mechanism and the extent of missingness are crucial in determining the methods to process them. Therefore, it also makes little sense to use imputation to handle structural missingness (where the unobserved features are known to be undefined).

In light of the observations made in the preceding paragraph, some learning methods avoid the inexact methods of imputation (as well as marginalization) altogether, while dealing with missingness. Krause & Polikar (2003) proposed a modification of the Learn++ incremental learning algorithm which can work around the need for imputation by using an ensemble of multiple classifiers learned on random subspaces of the dataset. Juszczak & Duin (2004) trained single class classifiers on each of the features (or combinations thereof) so that an inference about the class to which a particular data instance belongs can be drawn even when some of the features are missing, by combining the individual inferences drawn by each of the classifiers pertaining to the observed features. Random subspace learning was also used by Nanni et al. (2012) and compared with other approaches such as MI, MtI, etc. Chechik et al. (2008) used the geometrical insight of max-margin classification to formulate an objective function which was optimized to directly classify the incomplete data. This was extended to the max-margin regression case for software effort prediction with absent features in Zhang et al. (2012). Hathaway & Bezdek (2001) used the Partial Distance Strategy (PDS) of Dixon (1979) to extend the Fuzzy C-Means (FCM) clustering algorithm to cases with missing features. PDS scales up the observed distance, i.e. the distance between two data instances in their common observed subspace (the subspace consisting of the observed features common to both data instances), by the ratio of the total number of features (observed as well as unobserved) to the number of common observed features between them, to obtain an estimate of their distance in the fully observed space. However, the PDS may not always provide a good estimate of the actual distance, as the observed distance between two instances may be unrelated to the distance between them in the unobserved subspace. Wagstaff et al. (Wagstaff, 2004; Wagstaff and Laidler, 2005) suggested a k-means algorithm with Soft Constraints (KSC) where soft constraints determined by fully observed objects are introduced to facilitate the grouping of instances with missing features. It was applied as an alternative to imputation or marginalization techniques for astronomical data from the Sloan Digital Sky Survey, where missing values are also of significance and should not be filled in with imputations. Himmelspach & Conrad (2010) provided a good review of partitional clustering techniques for incomplete datasets, which mentions some other techniques that do not resort to imputation.

### 1.3 Motivation

One possible way to adapt supervised as well as unsupervised learning methods to problems with missingness is to modify the distance or dissimilarity measure underlying the learning method, so that the modified dissimilarity measure uses the common observed features to provide approximations of the distances between the data instances if they were to be fully observed. PDS is one such measure. Such approaches require neither marginalization nor imputation and are likely to yield better results than either of the two. For example, consider a dataset consisting of three points $x_1$, $x_2 = (1.8, 1)$ and $x_3 = (2, 2.5)$ in $\mathbb{R}^2$, and let $d_E(x_i, x_j)$ denote the Euclidean distance between any two fully observed points $x_i$ and $x_j$. Suppose that the first coordinate of the point $x_1$ is unobserved, resulting in the incomplete dataset $X = \{x_1 = (*, 2), x_2 = (1.8, 1), x_3 = (2, 2.5)\}$ ('*' denotes a missing value), on which learning must be carried out. Notice that this is a case of unstructured missingness (because the unobserved value is known to exist), as opposed to the structural missingness of Chechik et al. (2008). Using ZI, MI and 1NNI respectively, we obtain the following filled in datasets:

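$$\begin{aligned}
X_{ZI} &= \{\hat{x}_1 = (0, 2),\ x_2 = (1.8, 1),\ x_3 = (2, 2.5)\}, \\
X_{MI} &= \{\hat{x}_1 = (1.9, 2),\ x_2 = (1.8, 1),\ x_3 = (2, 2.5)\}, \\
\text{and } X_{1NNI} &= \{\hat{x}_1 = (2, 2),\ x_2 = (1.8, 1),\ x_3 = (2, 2.5)\},
\end{aligned}$$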

where $\hat{x}_1$ denotes an estimate of $x_1$. If PDS is used to estimate the corresponding distances in $X$, then the estimated distance $d_{PDS}(x_1, x_i)$ between $x_1$ and some other instance $x_i \in X$ is obtained by

$$d_{PDS}(x_1, x_i) = \sqrt{\frac{2}{1}(x_{1,2} - x_{i,2})^2},$$

where $x_{1,2}$ and $x_{i,2}$ respectively denote the 2nd features of $x_1$ and $x_i$, and 2 is the numerator of the multiplying factor due to the fact that $x_i \in \mathbb{R}^2$ and 1 is the denominator owing to the fact that only the 2nd feature is observed for both $x_1$ and $x_i$. Then, we get

$$d_{PDS}(x_1, x_2) = \sqrt{\frac{2}{1}(2 - 1)^2} = 1.41, \text{ and } d_{PDS}(x_1, x_3) = \sqrt{\frac{2}{1}(2 - 2.5)^2} = 0.71.$$

The improper estimates obtained by PDS are due to the fact that the distance in the common observed subspace does not reflect the distance in the unobserved subspace. This is the principal drawback of the PDS method, as discussed earlier. Since the observed distance between two data instances is essentially a lower bound on the Euclidean distance between them (if they were to be fully observed), adding a suitable penalty to this lower bound can result in a reasonable approximation of the actual distance. This new approach, which we call the Penalized Dissimilarity Measure (PDM), may be able to overcome the drawback which plagues PDS. Let the penalty between $x_1$ and $x_i$ be given by the ratio of the number of features which are unobserved for at least one of the two data instances and the total number of features in the entire dataset. Then, the dissimilarity $\delta'(x_1, x_i)$ between $x_1$ and some other $x_i \in X$ is

$$\delta'(x_1, x_i) = \sqrt{(x_{1,2} - x_{i,2})^2} + \frac{1}{2},$$

where the 1 in the numerator of the penalty term is due to the fact that the 1st feature of $x_1$ is unobserved. Therefore, the dissimilarities $\delta'(x_1, x_2)$ and $\delta'(x_1, x_3)$ are

$$\delta'(x_1, x_2) = \sqrt{(2 - 1)^2} + \frac{1}{2} = 1.5, \text{ and } \delta'(x_1, x_3) = \sqrt{(2 - 2.5)^2} + \frac{1}{2} = 1.$$
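The two estimates compared above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names `observed_distance`, `pds` and `pdm` are hypothetical, missing values are encoded as `None`, and the penalty is taken, as in the text, to be the fraction of features unobserved for at least one of the two points.

```python
import math

def observed_distance(a, b):
    """Euclidean distance over the features observed (not None) in both points."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)
                         if u is not None and v is not None))

def pds(a, b):
    """Partial Distance Strategy: scale the squared observed distance by
    (total features) / (common observed features). Assumes at least one
    common observed feature."""
    m = len(a)
    common = sum(u is not None and v is not None for u, v in zip(a, b))
    return math.sqrt(m / common) * observed_distance(a, b)

def pdm(a, b):
    """Penalized dissimilarity: observed distance plus the fraction of
    features that are unobserved for at least one of the two points."""
    m = len(a)
    unobserved = sum(u is None or v is None for u, v in zip(a, b))
    return observed_distance(a, b) + unobserved / m

# Incomplete dataset from the example; None marks the missing first feature of x1.
x1, x2, x3 = (None, 2.0), (1.8, 1.0), (2.0, 2.5)

for name, f in (("PDS", pds), ("PDM", pdm)):
    print(name, round(f(x1, x2), 2), round(f(x1, x3), 2))
```

The additive form of `pdm` is what yields the values 1.5 and 1 quoted for $\delta'$; note that it returns a non-zero value even when the observed distance is zero.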

The situation is illustrated in Figure 0(a). The reader should note that the points estimated using ZI, MI and 1NNI exist in the same 2-D Cartesian space to which $X$ is native. On the other hand, the points estimated by both PDS and PDM exist in their individual abstract spaces (likely distinct from the native 2-D space). Therefore, for the sake of easy comparison, we illustrate all the estimates together by superimposing both these abstract spaces on the native 2-D space so as to coincide at the points $x_2$ and $x_3$. It can be seen that the approach based on the PDM does not suffer from the drawback of PDS and is better able to preserve the relationship between the points. Moreover, it should be noted that there are two possible images for each of the estimates obtained by both PDS and PDM. Therefore, had the partially observed point instead been some other point with the first feature missing (giving rise to the same incomplete dataset $X$), PDS and PDM would still find reasonably good estimates (PDM being slightly better than PDS). This situation is also illustrated in Figure 0(b). In general,

1. ZI works well only for missing values in the vicinity of the origin and is also origin dependent;

2. MI works well only when the missing value is near the observed mean of the missing feature;

3. kNNI is reliant on the assumption that neighbours have similar features, but suffers from the drawbacks that missingness may give rise to erroneous neighbour selection and that the estimates are restricted to the range of observed values of the feature in question;

4. PDS suffers from the assumption that the common observed distances reflect the unobserved distances; and

5. none of these methods differentiate between identical incomplete points, i.e. two distinct points which both reduce to $(*, 2)$ once their first feature goes unobserved are not differentiated between.

However, the proposed idea of a PDM successfully steers clear of all these drawbacks (notice that the PDM between two such identical incomplete points is $0 + \frac{1}{2} = \frac{1}{2}$ rather than zero). Furthermore, such a PDM can also be easily applied to the case of absent features, by slightly modifying the penalty term (see Appendix A). This knowledge motivates us to propose a PDM which can be used to adapt traditional learning methods to problems with missing/absent features.
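For comparison, the three imputation baselines applied to the example dataset can be sketched as follows. This is a minimal illustration rather than a reference implementation: the helper names are hypothetical, missing values are encoded as `NaN`, and the kNNI variant assumes that every incomplete instance shares at least one observed feature with some instance that observes the feature being filled.

```python
import numpy as np

def zero_impute(X):
    """ZI: replace every missing value by zero."""
    return np.where(np.isnan(X), 0.0, X)

def mean_impute(X):
    """MI: replace every missing value by the observed mean of its feature."""
    return np.where(np.isnan(X), np.nanmean(X, axis=0), X)

def knn_impute(X, k=1):
    """kNNI: fill a missing feature with the mean of that feature over the k
    nearest neighbours, with distances taken in the common observed subspace."""
    obs = ~np.isnan(X)
    X_hat = X.copy()
    for i, l in zip(*np.where(~obs)):
        # Donors observe feature l and share at least one observed feature with x_i.
        donors = [j for j in range(len(X)) if obs[j, l] and (obs[i] & obs[j]).any()]
        dist = [np.sqrt(np.sum((X[i, obs[i] & obs[j]] - X[j, obs[i] & obs[j]]) ** 2))
                for j in donors]
        nearest = [donors[t] for t in np.argsort(dist)[:k]]
        X_hat[i, l] = X[nearest, l].mean()
    return X_hat

# Incomplete dataset of the running example; NaN marks the missing value.
X = np.array([[np.nan, 2.0], [1.8, 1.0], [2.0, 2.5]])
X_zi, X_mi, X_1nni = zero_impute(X), mean_impute(X), knn_impute(X, k=1)
print(X_zi[0], X_mi[0], X_1nni[0])
```

The first rows of the three results correspond to the estimates $\hat{x}_1$ in $X_{ZI}$, $X_{MI}$ and $X_{1NNI}$ respectively.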

### 1.4 Contribution

In this study, we propose a PDM called the Feature Weighted Penalty based Dissimilarity (FWPD) measure and use it to adapt common clustering algorithms to the missing data problem. The FWPD between two data instances is a weighted sum of two terms; the first term being the observed distance between the instances and the second being a penalty term. The penalty term is a sum of the penalties corresponding to each of the features which are missing from at least one of the data instances; each penalty being directly proportional to the probability of its corresponding feature being observed. The proposed weighting scheme is meant to limit the impact of each feature as per its availability in the observed dataset. This novel dissimilarity measure is incorporated into both partitional as well as Hierarchical Agglomerative Clustering (HAC) algorithms over data suffering from unstructured missingness. We formulate the k-means clustering problem for datasets with missing features based on the proposed FWPD and develop an algorithm to solve the new formulation. We prove that the proposed algorithm is guaranteed to converge to a locally optimal solution of the modified k-means optimization problem formulated with the FWPD measure. We also propose Single Linkage (SL), Average Linkage (AL), and Complete Linkage (CL) based HAC methods for datasets plagued by missingness, based on the proposed FWPD. Experiments are reported on diverse datasets and the results are compared with the popularly used imputation techniques. The comparative results indicate that our approach generally achieves better performance than the common imputation approaches used to handle incomplete data. Moreover, since this work presents an alternative to imputation and can be useful in scenarios where imputation is not practical (such as structural missingness), we append an extension of the proposed FWPD to the case of absent features. 
The principal difference between missing and absent features lies in the fact that the unobserved features are known to be undefined in the latter case, unlike the former. Therefore, while it makes sense to add penalties for features which are observed for only one of the data instances (as the very existence of such a feature sets the points apart), it makes little sense to add penalties for features which are undefined for both the data points. This is in contrast to problems with unstructured missingness where a feature missing from both the data instances is known to be defined for both points (which potentially have distinct values of this feature). Hence, it is essential to add penalties for features missing from both points in the case of missing features, but not in the case of absent features. We also show that the FWPD becomes a semi-metric in the case of structural missingness.

### 1.5 Organization

The rest of this paper is organized in the following way. In Section 2, we elaborate on the proposed FWPD measure. The next section (Section 3) presents a formulation of the k-means clustering problem which is directly applicable to datasets with missing features, based on the FWPD discussed in Section 2. This section also puts forth an algorithm to solve the optimization problem posed by this new formulation. The subsequent section (Section 4) covers the SL, AL, and CL based HAC algorithms which are formulated using FWPD to be directly applicable to incomplete datasets. Experimental results are presented in Section 5 and the relevant conclusions are drawn in Section 6. Subsequently, Appendix A deals with the extension of the proposed FWPD to the case of absent features (structural missingness).

## 2 Feature Weighted Penalty based Dissimilarity Measure for Datasets with Missing Features

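Let the dataset $X \subset \mathbb{R}^m$ consist of $n$ instances $x_i$ ($i \in \{1, 2, \cdots, n\}$), some of which have missing features. Let $\gamma_{x_i}$ denote the set of observed features for the data point $x_i$. Then, the set of all features $S = \bigcup_{i=1}^{n} \gamma_{x_i}$ and $|S| = m$. The set of features which are observed for all data instances in $X$ is defined as $\gamma_{obs} = \bigcap_{i=1}^{n} \gamma_{x_i}$. $|\gamma_{obs}|$ may or may not be non-zero. $\gamma_{miss} = S \setminus \gamma_{obs}$ is the set of features which are unobserved for at least one data point in $X$.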

###### Definition 1

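Taking any two data instances $x_i, x_j \in X$, the observed distance $d(x_i, x_j)$ between these two points (in the common observed subspace) is defined as

$$d(x_i, x_j) = \sqrt{\sum_{l \in \gamma_{x_i} \cap \gamma_{x_j}} (x_{i,l} - x_{j,l})^2}, \tag{1}$$

where $x_{i,s}$ denotes the $s$-th feature of the data instance $x_i$.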

###### Definition 2

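If both $x_i$ and $x_j$ were to be fully observed, the Euclidean distance $d_E(x_i, x_j)$ between $x_i$ and $x_j$ would be defined as

$$d_E(x_i, x_j) = \sqrt{\sum_{l \in S} (x_{i,l} - x_{j,l})^2}.$$

Now, since $(\gamma_{x_i} \cap \gamma_{x_j}) \subseteq S$ $\forall\, x_i, x_j \in X$, and $(x_{i,l} - x_{j,l})^2 \geq 0$ $\forall\, l \in S$, it follows that

$$d(x_i, x_j) \leq d_E(x_i, x_j) \quad \forall\, x_i, x_j \in X.$$

Therefore, to compensate for the distance in the unobserved subspace, we add a Feature Weighted Penalty (FWP) $p(x_i, x_j)$ (defined below) to $d(x_i, x_j)$.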

###### Definition 3

The FWP between $x_i$ and $x_j$ is defined as

$$p(x_i, x_j) = \frac{\sum_{l \in S \setminus (\gamma_{x_i} \cap \gamma_{x_j})} w_l}{\sum_{l' \in S} w_{l'}}, \tag{2}$$

where $w_s$ is the number of instances in $X$ having observed values of the feature $s$. It should be noted that FWP exacts a greater penalty for unobserved occurrences of those features which are observed for a large fraction of the data instances.

Then, the definition of the proposed FWPD follows.

###### Definition 4

The FWPD between $x_i$ and $x_j$ is

$$\delta(x_i, x_j) = (1 - \alpha) \times \frac{d(x_i, x_j)}{d_{max}} + \alpha \times p(x_i, x_j), \tag{3}$$

where $\alpha \in (0, 1]$ is a parameter which determines the relative importance between the two terms and $d_{max}$ is the maximum observed distance between any two points in $X$ in their respective common observed subspaces.
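Definitions 1, 3 and 4 translate directly into code. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function name `fwpd_matrix` and the three-point dataset are hypothetical, missing values are encoded as `NaN`, and the data are assumed to contain at least one pair of points with a non-zero observed distance so that $d_{max} > 0$.

```python
import numpy as np

def fwpd_matrix(X, alpha):
    """Pairwise FWPD of Equation (3) for a data matrix with NaN for missing values."""
    X = np.asarray(X, dtype=float)
    obs = ~np.isnan(X)                  # obs[i, l] is True when feature l of x_i is observed
    w = obs.sum(axis=0).astype(float)   # w_l: number of instances observing feature l
    n = X.shape[0]
    d = np.zeros((n, n))                # observed distances, Equation (1)
    p = np.zeros((n, n))                # feature weighted penalties, Equation (2)
    for i in range(n):
        for j in range(n):
            common = obs[i] & obs[j]
            d[i, j] = np.sqrt(np.sum((X[i, common] - X[j, common]) ** 2))
            p[i, j] = w[~common].sum() / w.sum()
    return (1 - alpha) * d / d.max() + alpha * p

# Hypothetical three-point dataset used only to exercise the function.
X_toy = [[1.0, 2.0, np.nan],
         [4.0, np.nan, 0.0],
         [1.0, 6.0, 3.0]]
D = fwpd_matrix(X_toy, alpha=0.5)
print(np.round(D, 3))
```

Because the penalty of a point with itself is non-zero whenever the point has missing features, the diagonal of `D` vanishes only for fully observed instances (here the third point).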

### 2.1 Properties of the proposed FWPD

In this subsection, we discuss some important properties of the proposed FWPD measure. The following theorem establishes these properties, and the subsequent discussion is concerned with the triangle inequality in the context of FWPD.

###### Theorem 1

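The FWPD measure satisfies the following important properties:

1. $\delta(x_i, x_i) \leq \delta(x_i, x_j)$ $\forall\, x_i, x_j \in X$,

2. $\delta(x_i, x_i) = 0$ iff $\gamma_{x_i} = S$, and

3. $\delta(x_i, x_j) = \delta(x_j, x_i)$ $\forall\, x_i, x_j \in X$.

Proof.

1. From Equations (1) and (3), it follows that

$$\delta(x_i, x_i) = \alpha \times p(x_i, x_i). \tag{4}$$

It also follows from Equation (2) that $p(x_i, x_i) \leq p(x_i, x_j)$ $\forall\, x_i, x_j \in X$. Therefore, $\delta(x_i, x_i) \leq \alpha \times p(x_i, x_j)$. Now, it follows from Equation (3) that $\alpha \times p(x_i, x_j) \leq \delta(x_i, x_j)$. Hence, we get $\delta(x_i, x_i) \leq \delta(x_i, x_j)$.

2. It is easy to see from Equation (2) that $p(x_i, x_i) = 0$ iff $\gamma_{x_i} = S$. Hence, it directly follows from Equation (4) that $\delta(x_i, x_i) = 0$ iff $\gamma_{x_i} = S$.

3. From Equation (3) we have

$$\begin{aligned}
\delta(x_i, x_j) &= (1 - \alpha) \times \frac{d(x_i, x_j)}{d_{max}} + \alpha \times p(x_i, x_j), \\
\text{and } \delta(x_j, x_i) &= (1 - \alpha) \times \frac{d(x_j, x_i)}{d_{max}} + \alpha \times p(x_j, x_i).
\end{aligned}$$

However, $d(x_i, x_j) = d(x_j, x_i)$ and $p(x_i, x_j) = p(x_j, x_i)$ (by definition). Therefore, it can be easily seen that $\delta(x_i, x_j) = \delta(x_j, x_i)$.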

The triangle inequality is an important criterion which lends some useful properties to the space induced by a dissimilarity measure. Therefore, the conditions under which FWPD satisfies the said criterion are investigated below. However, it should be stressed that the satisfaction of the said criterion is not essential for the functioning of the clustering techniques proposed in the subsequent text.

###### Definition 5

For any three data instances $x_i, x_j, x_k \in X$, the triangle inequality with respect to (w.r.t.) the FWPD measure is defined as

$$\delta(x_i, x_j) + \delta(x_j, x_k) \geq \delta(x_k, x_i). \tag{5}$$

The two following lemmas deal with the conditions under which Inequality (5) will hold.

###### Lemma 2

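For any three data points $x_i, x_j, x_k \in X$, Inequality (5) is satisfied when $(\gamma_{x_i} \cap \gamma_{x_j}) = (\gamma_{x_j} \cap \gamma_{x_k}) = (\gamma_{x_k} \cap \gamma_{x_i})$.

Proof. Let $p_{S'}$ denote the penalty corresponding to the subspace $S' \subseteq S$, i.e. $p_{S'} = \frac{\sum_{l \in S'} w_l}{\sum_{l' \in S} w_{l'}}$. Then, from Equation (3) we have

$$\begin{aligned}
\delta(x_i, x_j) &= (1 - \alpha) \times \frac{d(x_i, x_j)}{d_{max}} + \alpha \times p_{S \setminus (\gamma_{x_i} \cap \gamma_{x_j})}, \\
\delta(x_j, x_k) &= (1 - \alpha) \times \frac{d(x_j, x_k)}{d_{max}} + \alpha \times p_{S \setminus (\gamma_{x_j} \cap \gamma_{x_k})}, \\
\text{and } \delta(x_k, x_i) &= (1 - \alpha) \times \frac{d(x_k, x_i)}{d_{max}} + \alpha \times p_{S \setminus (\gamma_{x_k} \cap \gamma_{x_i})}.
\end{aligned}$$

Therefore, Inequality (5) can be rewritten as

$$\begin{aligned}
&(1 - \alpha) \times \frac{d(x_i, x_j)}{d_{max}} + \alpha \times p_{S \setminus (\gamma_{x_i} \cap \gamma_{x_j})} + (1 - \alpha) \times \frac{d(x_j, x_k)}{d_{max}} + \alpha \times p_{S \setminus (\gamma_{x_j} \cap \gamma_{x_k})} \\
&\quad \geq (1 - \alpha) \times \frac{d(x_k, x_i)}{d_{max}} + \alpha \times p_{S \setminus (\gamma_{x_k} \cap \gamma_{x_i})}.
\end{aligned} \tag{6}$$

Further simplifying (6), we get

$$\begin{aligned}
&\alpha \times \left( p_{(\gamma_{x_i} \cup \gamma_{x_k}) \setminus \gamma_{x_j}} + p_{(\gamma_{x_i} \cap \gamma_{x_k}) \setminus \gamma_{x_j}} + p_{\gamma_{x_j} \setminus (\gamma_{x_i} \cup \gamma_{x_k})} + p_{S \setminus (\gamma_{x_i} \cup \gamma_{x_j} \cup \gamma_{x_k})} \right) \\
&\quad \geq \frac{(1 - \alpha)}{d_{max}} \times \left( d(x_k, x_i) - \left( d(x_i, x_j) + d(x_j, x_k) \right) \right).
\end{aligned} \tag{7}$$

When $(\gamma_{x_i} \cap \gamma_{x_j}) = (\gamma_{x_j} \cap \gamma_{x_k}) = (\gamma_{x_k} \cap \gamma_{x_i})$, all three observed distances are measured in the same subspace, so that $d(x_i, x_j) + d(x_j, x_k) \geq d(x_k, x_i)$ and the Right Hand Side (RHS) of Inequality (7) is less than or equal to zero. Now, the Left Hand Side (LHS) of Inequality (7) is always greater than or equal to zero as $\alpha \in (0, 1]$ and $p_{S'} \geq 0$ $\forall\, S' \subseteq S$. Hence, LHS $\geq$ RHS, which completes the proof.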
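The step from (6) to (7) is stated without intermediate algebra. One way to verify it, offered here as a sketch in which the coefficient $c_l$ is an auxiliary symbol not used elsewhere in the text, is to count how often each feature $l \in S$ contributes its normalized weight to the three penalty terms of (6) once they are collected on the left:

$$\begin{aligned}
p(x_i, x_j) + p(x_j, x_k) - p(x_k, x_i) &= \sum_{l \in S} c_l \, \frac{w_l}{\sum_{l' \in S} w_{l'}}, \\
c_l &= \mathbb{1}[l \notin \gamma_{x_i} \cap \gamma_{x_j}] + \mathbb{1}[l \notin \gamma_{x_j} \cap \gamma_{x_k}] - \mathbb{1}[l \notin \gamma_{x_k} \cap \gamma_{x_i}], \\
c_l &= \begin{cases}
2, & l \in (\gamma_{x_i} \cap \gamma_{x_k}) \setminus \gamma_{x_j}, \\
1, & l \notin \gamma_{x_j} \text{ and } l \notin \gamma_{x_i} \cap \gamma_{x_k}, \\
1, & l \in \gamma_{x_j} \setminus (\gamma_{x_i} \cup \gamma_{x_k}), \\
0, & l \in \gamma_{x_j} \cap (\gamma_{x_i} \cup \gamma_{x_k}).
\end{cases}
\end{aligned}$$

Features in $(\gamma_{x_i} \cup \gamma_{x_k}) \setminus \gamma_{x_j}$ are counted once by the first penalty of (7) and those in $(\gamma_{x_i} \cap \gamma_{x_k}) \setminus \gamma_{x_j}$ once more by the second, which accounts for the coefficient 2; the features observed in none of the three points are covered by the fourth penalty. Moving the distance terms of (6) to the right-hand side then gives (7).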

###### Lemma 3

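If $|\gamma_{x_i} \cap \gamma_{x_j}| \rightarrow |S|$, $|\gamma_{x_j} \cap \gamma_{x_k}| \rightarrow |S|$ and $|\gamma_{x_k} \cap \gamma_{x_i}| \rightarrow |S|$, then Inequality (7) tends to be satisfied.

Proof. When $|\gamma_{x_i} \cap \gamma_{x_j}| \rightarrow |S|$, $|\gamma_{x_j} \cap \gamma_{x_k}| \rightarrow |S|$ and $|\gamma_{x_k} \cap \gamma_{x_i}| \rightarrow |S|$, then LHS $\rightarrow 0^{+}$ and RHS $\rightarrow \frac{(1 - \alpha)}{d_{max}} \times \left( d_E(x_k, x_i) - \left( d_E(x_i, x_j) + d_E(x_j, x_k) \right) \right)$ for the Inequality (7). As $d_E(x_i, x_j) + d_E(x_j, x_k) \geq d_E(x_k, x_i)$, the RHS tends to a non-positive value and Inequality (7) tends to be satisfied.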

The following definition and lemma deal with the value of the parameter $\alpha$ for which a relaxed form of the triangle inequality is satisfied for any three data instances in a dataset $X$.

###### Definition 6

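For any three data instances $x_i, x_j, x_k \in X$ let

$$\rho_{i,j,k} = p_{(\gamma_{x_i} \cup \gamma_{x_k}) \setminus \gamma_{x_j}} + p_{(\gamma_{x_i} \cap \gamma_{x_k}) \setminus \gamma_{x_j}} + p_{\gamma_{x_j} \setminus (\gamma_{x_i} \cup \gamma_{x_k})} + p_{S \setminus (\gamma_{x_i} \cup \gamma_{x_j} \cup \gamma_{x_k})}.$$

Then we define the quantity $P$ as

$$P = \min \{ \rho_{i,j,k} : x_i, x_j, x_k \in X, \rho_{i,j,k} > 0 \}.$$

###### Lemma 4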

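For any arbitrary constant $\epsilon$ satisfying $0 \leq \epsilon \leq P$, if $\alpha \geq (1 - \epsilon)$, then the following relaxed form of the triangle inequality

$$\delta(x_i, x_j) + \delta(x_j, x_k) \geq \delta(x_k, x_i) - \epsilon^2, \tag{8}$$

is satisfied for any $x_i, x_j, x_k \in X$.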

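Proof.

1. If $x_i$, $x_j$, and $x_k$ are all fully observed, then Inequality (5) holds. Hence Inequality (8) must also hold.

2. If $(\gamma_{x_i} \cap \gamma_{x_j} \cap \gamma_{x_k}) \neq S$, i.e. at least one of the data instances is not fully observed, and $\rho_{i,j,k} = 0$, then $(\gamma_{x_i} \cup \gamma_{x_k}) \setminus \gamma_{x_j} = \emptyset$, $(\gamma_{x_i} \cap \gamma_{x_k}) \setminus \gamma_{x_j} = \emptyset$, $\gamma_{x_j} \setminus (\gamma_{x_i} \cup \gamma_{x_k}) = \emptyset$, and $S \setminus (\gamma_{x_i} \cup \gamma_{x_j} \cup \gamma_{x_k}) = \emptyset$. This implies that $\gamma_{x_j} = S$, and $\gamma_{x_i} \cup \gamma_{x_k} = \gamma_{x_j}$. Consequently, $\gamma_{x_i} \cap \gamma_{x_j} = \gamma_{x_i}$, and $\gamma_{x_j} \cap \gamma_{x_k} = \gamma_{x_k}$. Let $d_{S'}(x_i, x_j)$ denote the distance between $x_i$ and $x_j$ in the subspace $S'$, i.e. $d_{S'}(x_i, x_j) = \sqrt{\sum_{l \in S'} (x_{i,l} - x_{j,l})^2}$. Therefore, with $S' = \gamma_{x_k} \cap \gamma_{x_i}$, we have $d(x_k, x_i) = d_{S'}(x_k, x_i)$. Now, by the triangle inequality in subspace $S'$, $d_{S'}(x_i, x_j) + d_{S'}(x_j, x_k) \geq d_{S'}(x_k, x_i)$. As $d(x_i, x_j) \geq d_{S'}(x_i, x_j)$ and $d(x_j, x_k) \geq d_{S'}(x_j, x_k)$, we have $d(x_i, x_j) + d(x_j, x_k) \geq d(x_k, x_i)$. Hence, Inequalities (5) and (8) are satisfied.

3. If $(\gamma_{x_i} \cap \gamma_{x_j} \cap \gamma_{x_k}) \neq S$ and $\rho_{i,j,k} \neq 0$, as $\alpha \geq (1 - \epsilon)$, LHS of Inequality (7) $\geq (1 - \epsilon) \times \rho_{i,j,k}$. Since $\rho_{i,j,k} \geq P$, we further get that LHS $\geq (1 - \epsilon) P$. Moreover, as $\frac{1}{d_{max}} \times \left( d(x_k, x_i) - \left( d(x_i, x_j) + d(x_j, x_k) \right) \right) \leq 1$, we get RHS of Inequality (7) $\leq \epsilon$. Therefore, LHS - RHS $\geq (1 - \epsilon) P - \epsilon$. Now, as Inequality (7) is obtained from Inequality (5) after some algebraic manipulation, it must hold that (LHS - RHS) of Inequality (7) = (LHS - RHS) of Inequality (5). Hence, we get $\delta(x_i, x_j) + \delta(x_j, x_k) - \delta(x_k, x_i) \geq (1 - \epsilon) P - \epsilon$, which, since $\epsilon \leq P$, can be simplified to obtain Inequality (8). This completes the proof.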

Let us now elucidate the proposed FWP (and consequently the proposed FWPD measure) by using the following example.

###### Example 1

Let $X \subset \mathbb{R}^3$ be a dataset consisting of $n = 5$ data points, each having three features ($S = \{1, 2, 3\}$), some of which (marked by '*') are unobserved. The dataset is presented below (along with the feature observation counts and the observed feature sets for each of the instances).

| Data Point | $x_{i,1}$ | $x_{i,2}$ | $x_{i,3}$ | $\gamma_{x_i}$ |
|---|---|---|---|---|
| $x_1$ | \* | 3 | 2 | $\{2, 3\}$ |
| $x_2$ | 1.2 | \* | 4 | $\{1, 3\}$ |
| $x_3$ | \* | 0 | 0.5 | $\{2, 3\}$ |
| $x_4$ | 2.1 | 3 | 1 | $\{1, 2, 3\}$ |
| $x_5$ | −2 | \* | \* | $\{1\}$ |
| Obs. Count | $w_1 = 3$ | $w_2 = 3$ | $w_3 = 4$ | - |

The pairwise observed distance matrix $A_d$ and the pairwise penalty matrix $A_p$ are as follows:

$$A_d = \begin{bmatrix} 0 & 2 & 3.35 & 1 & 0 \\ 2 & 0 & 3.5 & 3.13 & 3.2 \\ 3.35 & 3.5 & 0 & 3.04 & 0 \\ 1 & 3.13 & 3.04 & 0 & 4.1 \\ 0 & 3.2 & 0 & 4.1 & 0 \end{bmatrix} \text{ and } A_p = \begin{bmatrix} 0.3 & 0.6 & 0.3 & 0.3 & 1 \\ 0.6 & 0.3 & 0.6 & 0.3 & 0.7 \\ 0.3 & 0.6 & 0.3 & 0.3 & 1 \\ 0.3 & 0.3 & 0.3 & 0 & 0.7 \\ 1 & 0.7 & 1 & 0.7 & 0.7 \end{bmatrix}.$$

From $A_d$ it is observed that the maximum pairwise observed distance $d_{max} = 4.1$. Then, the normalized observed distance matrix $A_{\bar{d}}$ is

$$A_{\bar{d}} = \begin{bmatrix} 0 & 0.49 & 0.82 & 0.24 & 0 \\ 0.49 & 0 & 0.85 & 0.76 & 0.78 \\ 0.82 & 0.85 & 0 & 0.74 & 0 \\ 0.24 & 0.76 & 0.74 & 0 & 1 \\ 0 & 0.78 & 0 & 1 & 0 \end{bmatrix}.$$

For this dataset, $P = 0.3$. While it is not necessary, let us choose $\alpha = 0.7$. Using Equation (3) to calculate the FWPD matrix $A_\delta$, we get:

$$A_\delta = 0.3 \times A_{\bar{d}} + 0.7 \times A_p = \begin{bmatrix} 0.21 & 0.57 & 0.46 & 0.28 & 0.7 \\ 0.57 & 0.21 & 0.68 & 0.44 & 0.72 \\ 0.46 & 0.68 & 0.21 & 0.43 & 0.7 \\ 0.28 & 0.44 & 0.43 & 0 & 0.79 \\ 0.7 & 0.72 & 0.7 & 0.79 & 0.49 \end{bmatrix}.$$

It should be noted that in keeping with the properties of the FWPD described in Subsection 2.1, $A_\delta$ is a symmetric matrix with the diagonal elements being the smallest entries in their corresponding rows (and columns) and the diagonal element corresponding to the fully observed point $x_4$ being the only zero element. Moreover, it can be easily checked that the relaxed form of the triangle inequality, as given in Inequality (8), is always satisfied.
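The quantities of Example 1, and the claim about Inequality (8), can be checked numerically. The following sketch makes a few assumptions that are not part of the original text: missing values are encoded as `NaN`, $\rho_{i,j,k}$ is evaluated through the identity $\rho_{i,j,k} = p(x_i, x_j) + p(x_j, x_k) - p(x_k, x_i)$ (which follows from comparing the penalty terms of (6) and (7)), the triples are taken over distinct instances, and $\epsilon$ is set to $P$ with $\alpha = 1 - \epsilon$.

```python
import itertools
import numpy as np

nan = np.nan
X = np.array([[nan, 3.0, 2.0],
              [1.2, nan, 4.0],
              [nan, 0.0, 0.5],
              [2.1, 3.0, 1.0],
              [-2.0, nan, nan]])
obs = ~np.isnan(X)
w = obs.sum(axis=0)                      # observation counts w_1, w_2, w_3
n = len(X)

d = np.zeros((n, n))                     # observed distances, Equation (1)
pen = np.zeros((n, n), dtype=int)        # penalty numerators of Equation (2)
for i in range(n):
    for j in range(n):
        common = obs[i] & obs[j]
        d[i, j] = np.sqrt(np.sum((X[i, common] - X[j, common]) ** 2))
        pen[i, j] = w[~common].sum()

triples = list(itertools.permutations(range(n), 3))

# rho (times sum of weights) for every ordered triple of distinct instances.
rho = [pen[i, j] + pen[j, k] - pen[k, i] for i, j, k in triples]
P = min(r for r in rho if r > 0) / w.sum()

eps = P
alpha = 1 - eps
delta = (1 - alpha) * d / d.max() + alpha * pen / w.sum()   # Equation (3)

# Relaxed triangle inequality (8) over all triples.
relaxed_ok = all(delta[i, j] + delta[j, k] >= delta[k, i] - eps ** 2
                 for i, j, k in triples)
print(P, relaxed_ok)
print(np.round(delta, 2))
```

Keeping the penalty numerators as integers avoids floating-point noise when selecting the smallest strictly positive $\rho_{i,j,k}$.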

## 3 k-means Clustering for Datasets with Missing Features using the proposed FWPD

This section presents a reformulation of the k-means clustering problem for datasets with missing features, using the FWPD measure proposed in Section 2. The k-means problem (a term coined by MacQueen (1967)) deals with the partitioning of a set of $n$ data instances into $k$ clusters so as to minimize the sum of within-cluster dissimilarities. The standard heuristic algorithm to solve the k-means problem, referred to as the k-means algorithm, was first proposed by Lloyd in 1957 (Lloyd, 1982), and rediscovered by Forgy (1965). Starting with random assignments of each of the data instances to one of the $k$ clusters, the k-means algorithm functions by iteratively recalculating the $k$ cluster centroids and reassigning the data instances to the nearest cluster (the cluster corresponding to the nearest cluster centroid), in an alternating manner. Selim & Ismail (1984) showed that the k-means algorithm converges to a local optimum of the non-convex optimization problem posed by the k-means problem, when the dissimilarity used is the Euclidean distance between data points.

The formulation of the k-means problem for datasets with missing features using the proposed FWPD measure, referred to as the k-means-FWPD problem hereafter, differs from the standard k-means problem in two respects: the underlying dissimilarity measure is FWPD (instead of Euclidean distance), and a new constraint ensures that a cluster centroid has observable values for exactly those features which are observed for at least one of the points in its corresponding cluster. The k-means-FWPD problem of partitioning the dataset into $k$ clusters can therefore be formulated in the following way:

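$$
\begin{aligned}
\text{P: minimize } \; & f(U, Z) = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{i,j} \left( (1 - \alpha) \times \frac{d(x_i, z_j)}{d_{max}} + \alpha \times p(x_i, z_j) \right), && \text{(9a)} \\
\text{subject to } \; & \sum_{j=1}^{k} u_{i,j} = 1 \;\; \forall\, i \in \{1, 2, \cdots, n\}, && \text{(9b)} \\
& u_{i,j} \in \{0, 1\} \;\; \forall\, i \in \{1, 2, \cdots, n\},\; j \in \{1, 2, \cdots, k\}, && \text{(9c)} \\
\text{and } \; & \gamma_{z_j} = \bigcup_{x_i \in C_j} \gamma_{x_i} \;\; \forall\, j \in \{1, 2, \cdots, k\}, && \text{(9d)}
\end{aligned}
$$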

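where $U = [u_{i,j}]$ is the $n \times k$ real matrix of memberships, $d_{max}$ denotes the maximum observed distance between any two data points, $\gamma_{x_i}$ denotes the set of observed features for $x_i$, $C_j$ denotes the $j$-th cluster (corresponding to the centroid $z_j$), $Z = \{z_1, \cdots, z_k\}$, and it is said that $x_i \in C_j$ when $u_{i,j} = 1$.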

### 3.1 The k-means-FWPD Algorithm

To find a solution to the problem P, which is a non-convex program, we propose a Lloyd’s heuristic-like algorithm based on the FWPD (referred to as the k-means-FWPD algorithm), as follows:

1. Start with a random initial set of cluster assignments such that . Set .

2. For each cluster $C_j^t$ ($j = 1, 2, \cdots, k$), calculate the observed features of the cluster centroid $z_j^t$ as:

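$$
z^t_{j,l} = \begin{cases}
\dfrac{\sum_{x_i : l \in \gamma_{x_i}} u^t_{i,j} \times x_{i,l}}{\sum_{x_i : l \in \gamma_{x_i}} u^t_{i,j}}, & \forall\, l \in \bigcup_{x_i \in C^t_j} \gamma_{x_i}, \\[2ex]
z^{t-1}_{j,l}, & \forall\, l \in \gamma_{z^{t-1}_j} \setminus \bigcup_{x_i \in C^t_j} \gamma_{x_i}.
\end{cases} \tag{10}
$$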
3. Assign each data point $x_i$ to the cluster corresponding to its nearest (in terms of FWPD) centroid, i.e.

$$
u^{t+1}_{i,j} = \begin{cases}
1, & \text{if } z^t_j = \arg\min_{z \in \{z^t_1, \cdots, z^t_k\}} \delta(x_i, z), \\
0, & \text{otherwise.}
\end{cases}
$$

Set . If , then go to Step 4; otherwise go to Step 2.

4. Calculate the final cluster centroid set as:

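$$
z^*_{j,l} = \frac{\sum_{x_i : l \in \gamma_{x_i}} u^{t+1}_{i,j} \times x_{i,l}}{\sum_{x_i : l \in \gamma_{x_i}} u^{t+1}_{i,j}} \;\; \forall\, l \in \bigcup_{x_i \in C^{t+1}_j} \gamma_{x_i}. \tag{11}
$$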

Set .
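The four steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: unobserved features are encoded as NaN, the observed distance is taken to be the Euclidean distance over commonly observed features, the feature weights are taken to be the per-feature observation counts, the loop stops when the assignments no longer change, and a cluster that becomes empty simply keeps its previous centroid. The function names `fwpd`, `cluster_means` and `kmeans_fwpd` are hypothetical, and at least two data points are assumed.

```python
import numpy as np

def fwpd(x, z, w, d_max, alpha):
    """FWPD between point x and centroid z; NaN marks an unobserved feature."""
    common = ~np.isnan(x) & ~np.isnan(z)
    # Assumed observed distance: Euclidean over commonly observed features.
    d = np.sqrt(np.sum((x[common] - z[common]) ** 2))
    # Penalty: weight of features not observed in both, over the total weight.
    p = w[~common].sum() / w.sum()
    return (1.0 - alpha) * d / d_max + alpha * p

def cluster_means(X, labels, Z):
    """Overwrite Z[j] on features observed in cluster j; keep the rest as is."""
    obs = ~np.isnan(X)
    for j in range(Z.shape[0]):
        members = labels == j
        seen = obs[members].any(axis=0)  # features observed within cluster j
        if seen.any():
            Z[j, seen] = np.nanmean(X[members][:, seen], axis=0)
    return Z

def kmeans_fwpd(X, k, alpha=0.5, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = (~np.isnan(X)).sum(axis=0).astype(float)  # assumed feature weights
    # Largest observed distance between any two data points.
    d_max = max(np.sqrt(np.nansum((X[a] - X[b]) ** 2))
                for a in range(n) for b in range(a + 1, n))
    d_max = d_max if d_max > 0 else 1.0
    labels = rng.integers(k, size=n)              # Step 1: random assignment
    Z = np.full((k, m), np.nan)
    for _ in range(max_iter):
        Z = cluster_means(X, labels, Z)           # Step 2, Eq. (10)
        new_labels = np.array([                   # Step 3: nearest centroid
            int(np.argmin([fwpd(X[i], Z[j], w, d_max, alpha)
                           for j in range(k)]))
            for i in range(n)])
        if np.array_equal(new_labels, labels):    # assumed stopping rule
            break
        labels = new_labels
    # Step 4, Eq. (11): keep only features observed in the final clusters.
    Z_final = cluster_means(X, labels, np.full((k, m), np.nan))
    return labels, Z_final

X = np.array([[0.0, 0.1], [0.2, np.nan], [np.nan, 0.0],
              [5.0, 5.1], [5.2, np.nan], [np.nan, 4.9]])
labels, Z = kmeans_fwpd(X, k=2, alpha=0.25)
print(labels)
print(Z)
```

Because Step 4 recomputes the centroids from scratch, each returned centroid is observed on exactly the union of the features observed by its members, which is what constraint (9d) requires. Like any Lloyd-type heuristic, the result depends on the random initial assignment.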

### 3.2 Notions of Feasibility in Problem P

Let $\mathcal{U}$ and $\mathcal{Z}$ respectively denote the sets of all possible $U$ and $Z$. Unlike the traditional k-means problem, the entire $\mathcal{U} \times \mathcal{Z}$ space is not feasible for the Problem P. There exists a set of feasible $U$ for a given $Z$. Similarly, there exist sets of feasible and super-feasible $Z$ (a special subset of infeasible $Z$) for a given $U$. We formally define these notions in this subsection.

###### Definition 7

Given a cluster centroid set $\hat{Z}$, the set of feasible membership matrices $\mathcal{F}(\hat{Z})$ is given by

$$
\mathcal{F}(\hat{Z}) = \{U : u_{i,j} = 0 \;\; \forall\, j \in \{1, 2, \cdots, k\} \text{ such that } \gamma_{z_j} \subset \gamma_{x_i}\}.
$$
###### Definition 8

Given a membership matrix $\hat{U}$, the set of feasible cluster centroid sets $\mathcal{F}(\hat{U})$ can be defined as

$$
\mathcal{F}(\hat{U}) = \{Z : Z \text{ satisfies constraint (9d)}\}.
$$
###### Definition 9

Given a membership matrix $\hat{U}$, the set of super-feasible cluster centroid sets $\mathcal{S}(\hat{U})$ is

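$$
\mathcal{S}(\hat{U}) = \Big\{Z : \gamma_{z_j} \supseteq \bigcup_{x_i \in C_j} \gamma_{x_i} \;\; \forall\, j \in \{1, 2, \cdots, k\}\Big\}. \tag{12}
$$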
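The three notions reduce to set comparisons on observed-feature index sets. The sketch below is illustrative only: it encodes an unobserved feature as `None`, reads the feasibility of $Z$ in Definition 8 as the equality in constraint (9d), and reads $\subset$ in Definition 7 as a proper subset. All function names and the toy data are hypothetical.

```python
def observed(v):
    """Indices of the observed features of a point or centroid."""
    return {l for l, val in enumerate(v) if val is not None}

def clusters(U):
    """Member indices of each cluster, from a 0/1 membership matrix U."""
    k = len(U[0])
    return [[i for i, row in enumerate(U) if row[j] == 1] for j in range(k)]

def union_observed(X, members):
    """Union of the observed-feature sets of the given points."""
    out = set()
    for i in members:
        out |= observed(X[i])
    return out

def is_feasible_U(X, U, Z):
    # Definition 7: u_ij must be 0 whenever gamma_zj is a proper subset
    # of gamma_xi (assumed reading of the subset symbol).
    return all(U[i][j] == 0
               for i in range(len(X)) for j in range(len(Z))
               if observed(Z[j]) < observed(X[i]))

def is_feasible_Z(X, U, Z):
    # Definition 8, read as the equality in constraint (9d).
    return all(observed(Z[j]) == union_observed(X, C)
               for j, C in enumerate(clusters(U)))

def is_super_feasible_Z(X, U, Z):
    # Definition 9: each centroid observes at least the union of the
    # features observed by its members.
    return all(observed(Z[j]) >= union_observed(X, C)
               for j, C in enumerate(clusters(U)))

# Toy data: three points, three features, two clusters.
X = [[1.0, None, 3.0], [None, 2.0, 3.5], [0.5, 0.7, None]]
U = [[1, 0], [1, 0], [0, 1]]

Z_exact = [[1.0, 2.0, 3.25], [0.5, 0.7, None]]   # observes exactly the union
Z_extra = [[1.0, 2.0, 3.25], [0.5, 0.7, 9.0]]    # observes one feature more
Z_small = [[1.0, None, None], [0.5, 0.7, None]]  # first centroid observes too little

print(is_feasible_Z(X, U, Z_exact), is_super_feasible_Z(X, U, Z_exact))
print(is_feasible_Z(X, U, Z_extra), is_super_feasible_Z(X, U, Z_extra))
print(is_feasible_U(X, U, Z_exact), is_feasible_U(X, U, Z_small))
```

In this toy case `Z_extra` is super-feasible without being feasible, and `U` is infeasible for `Z_small` because the first point is assigned to a centroid that observes only a proper subset of its features.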

### 3.3 Partial Optimal Solutions

This subsection deals with the concept of partial optimal solutions of the problem P, to one of which the k-means-FWPD algorithm is shown to converge (prior to Step 4). The following definition formally presents the concept of a partial optimal solution.

###### Definition 10

A partial optimal solution of problem P, a point $(\tilde{U}, \tilde{Z})$, satisfies the following conditions (Wendel and Hurter Jr., 1976):

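$$
f(\tilde{U}, \tilde{Z}) \le f(U, \tilde{Z}) \;\; \forall\, U \in \mathcal{F}(\tilde{Z}), \quad \text{and} \quad f(\tilde{U}, \tilde{Z}) \le f(\tilde{U}, Z) \;\; \forall\, Z \in \mathcal{S}(\tilde{U}).
$$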

To obtain a partial optimal solution of P, the following two subproblems are defined:

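$$
\begin{aligned}
\text{P1: } & \text{Given } \hat{U} \in \mathcal{U}, \text{ minimize } f(\hat{U}, Z) \text{ over } Z \in \mathcal{S}(\hat{U}). \\
\text{P2: } & \text{Given } \hat{Z} \text{ satisfying (9d), minimize } f(U, \hat{Z}) \text{ over } U \in \mathcal{U}.
\end{aligned}
$$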

The following lemmas establish that Steps 2 and 3 of the k-means-FWPD algorithm respectively solve the problems P1 and P2 for a given iterate. The subsequent theorem shows that the k-means-FWPD algorithm converges to a partial optimal solution of P.

###### Lemma 5

For any $z_j$ satisfying (9d) and any $x_i \in C_j$, $p(x_i, z_j)$ is independent of the values of the observed features of $z_j$.

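• Given a $z_j$ satisfying (9d) and an $x_i \in C_j$, we know that $\gamma_{x_i} \subseteq \gamma_{z_j}$. Hence, we have $\gamma_{x_i} \cap \gamma_{z_j} = \gamma_{x_i}$. Consequently,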

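$$
p(x_i, z_j) = \frac{\sum_{l \in S \setminus (\gamma_{x_i} \cap \gamma_{z_j})} w_l}{\sum_{l' \in S} w_{l'}} = \frac{\sum_{l \in S \setminus \gamma_{x_i}} w_l}{\sum_{l' \in S} w_{l'}},
$$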

which is independent of the values of the features of $z_j$.
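A small numerical illustration of this argument is given below. It is a sketch with made-up feature weights and index sets: the hypothetical function `penalty` takes only the observed-feature sets and the weights as input, never the feature values, and yields the same result for any centroid whose observed set contains that of the point.

```python
def penalty(gamma_x, gamma_z, w):
    """Penalty p(x, z): weight of the features not observed in both x and z,
    divided by the total feature weight."""
    S = set(range(len(w)))
    not_common = S - (gamma_x & gamma_z)
    return sum(w[l] for l in not_common) / sum(w)

w = [3, 3, 4]        # illustrative feature weights (an assumption)
gamma_x = {0, 2}     # the point observes features 0 and 2

# Two centroids whose observed sets both contain gamma_x.
p_same = penalty(gamma_x, {0, 2}, w)
p_larger = penalty(gamma_x, {0, 1, 2}, w)

# The closed form from the proof: weight of the features missing from x.
p_closed = sum(w[l] for l in set(range(len(w))) - gamma_x) / sum(w)

print(p_same, p_larger, p_closed)
```

All three quantities coincide, since only feature 1 (weight 3 out of a total of 10) is unobserved for the point.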

###### Lemma 6

Given a $U^t \in \mathcal{U}$, the centroid matrix $Z^t$ calculated using Equation (10) is an optimal solution of the Problem P1.

• For a fixed , the objective function is minimized when . For a particular , we know from Lemma 5 that is independent of the values of the features of . Since an observed features of has no contribution to the observed distances, . For an observed feature