An Entropy-based Variable Feature Weighted Fuzzy k-Means Algorithm for High Dimensional Data
This paper presents a new fuzzy k-means algorithm for clustering high dimensional data in various subspaces. In high dimensional data, some features may be irrelevant, and the relevant ones may differ in their significance for the clustering. For a better clustering, it is crucial to incorporate the contribution of these features in the clustering process. To this end, we propose a new fuzzy k-means clustering algorithm in which the objective function of fuzzy k-means is modified using two different entropy terms. The first entropy term helps to minimize the within-cluster dispersion while maximizing the negative membership entropy, which governs the association of data points with clusters. The second entropy term controls the feature weights, since different features contribute with different weights to obtaining a better partition of the data. The efficacy of the proposed method is evaluated in terms of various clustering measures on multiple datasets and compared with various state-of-the-art methods.
Clustering is a method that organizes unlabelled input data points into clusters or groups such that data points within a cluster are more similar to each other than to those belonging to different clusters, i.e., it maximizes the intra-cluster similarity while minimizing the inter-cluster similarity. In the literature, clustering algorithms have been classified into various categories: hierarchical, density-based, grid-based, partition-based, and model-based clustering [bezdek, hierarchical, density, grid]. Among them, partition-based clustering is the most widely studied. The most popular partition-based k-means algorithms reported in the literature are well known for their performance in clustering large scale datasets; however, these algorithms are sensitive to the initial cluster centers [Ac, Ad, ball]. Ruspini [ruspini] and Bezdek [bezdek1980] have presented fuzzy variants of the k-means algorithm, where each data point can belong to multiple clusters with different membership degrees. However, the major problem with the k-means and fuzzy k-means algorithms remains the same, i.e., all features are assumed to be equally important during the clustering. Because of that, these algorithms are easily affected by outliers.
To overcome the above limitations, feature weighting techniques have been studied, which assign different weight values to the features based on their usefulness in identifying the clusters. Among partition-based methods, the k-means (KM) [km] and fuzzy c-means (FCM) [fcm] clustering algorithms have been explored the most due to their simplicity and data handling characteristics. Feature weighted clustering algorithms, i.e., weighted k-means (WKM) [wkm], entropy weighted k-means (EWKM) [ewkm], sparse k-means (SKM) [fs], weighted fuzzy c-means (WFCM) [wfcm], feature weighted FCM based on simultaneous clustering and attribute discrimination (SCAD) [scad], and feature reduction FCM [frfcm], have been studied in the past. WKM is an extension of KM in which each feature is assigned the same weight in all clusters. EWKM is an entropy weighted clustering method in which features are assigned distinct weights in the various clusters. SKM uses the $\ell_1$ norm in the objective function, which sets a feature weight to zero when it is smaller than a predefined threshold. To improve the learning performance of FCM, WFCM learns the feature weights through the gradient descent method. SCAD improves the FCM performance by clustering the data and discriminating between attributes simultaneously.
The variable weighting of features in clustering is an important research area in machine learning, in which the weight of each feature is determined automatically based on its significance in the clustering process [wkm, ewkm, fwk]. When learning the number of clusters (k), the differences between clusters are taken into account through the relevance of individual features in each cluster, which overcomes the limitation of the k-means and fuzzy k-means clustering algorithms, where all features are considered equally important during the clustering process. Specifically, for sparse and high dimensional data, the contribution of each feature is essential: some features may be irrelevant, and the remaining ones differ in their significance to the clustering process [tw, agglomerative].
For a better clustering, this paper presents a novel entropy-based variable feature weighted fuzzy k-means algorithm that clusters the data by considering different feature weights in each cluster. The features contribute differently to identifying the data points of a cluster during the clustering process. The difference in their contributions is formulated in terms of weights, which can be interpreted as the degree of certainty of a feature for the cluster. A decrease of the weight entropy in a cluster indicates an increase in the certainty of a subset of features with larger weights in determining the cluster. The membership entropy, in turn, makes the clustering insensitive to the initial cluster centers. In the proposed approach, we simultaneously minimize the within-cluster dispersion and maximize the negative data-point-to-cluster membership entropy and the negative weight entropy, which stimulates more features to contribute to the identification of a cluster.
The major contributions are briefly summarized as follows:
For handling high dimensional, sparse, and noisy data, the fuzzy k-means objective function is modified to increase its robustness by adding a fuzzy membership entropy term and a weight entropy term.
The fuzzy membership entropy helps to minimize the within-cluster dispersion and maximize the inter-cluster dispersion when identifying the data points in a cluster.
The feature weight entropy helps to stimulate more dimensions to contribute, with different weights, to the identification of data points in a cluster.
II Proposed Methodology
In this section, we present a novel entropy-based variable feature weighted fuzzy k-means clustering algorithm that considers the different contributions of each feature in each cluster. In the proposed approach, the weighted fuzzy k-means algorithm is combined with a membership entropy term and a feature weight entropy term simultaneously. As defined in (1), the first term measures the dissimilarity between the samples within clusters, the second term measures the membership entropy between samples and clusters during the clustering process, and the third term, i.e., the entropy of the feature weights, represents the degree of certainty of the features in the identification of a cluster.
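Let us consider a data matrix $X = [x_1, x_2, \ldots, x_N]^{\top} \in \mathbb{R}^{N \times M}$, where $M$ and $N$ are the number of features and the number of samples, respectively. Here, $x_i = (x_{i1}, \ldots, x_{iM})$ is the $i$-th sample in the data matrix. To group the data matrix into $k$ clusters, the following objective function is minimized:
$$F(U, Z, W) = \sum_{i=1}^{N}\sum_{j=1}^{k} u_{ij} \sum_{l=1}^{M} w_{jl} D_{ijl} + \gamma \sum_{i=1}^{N}\sum_{j=1}^{k} u_{ij} \log u_{ij} + \lambda \sum_{j=1}^{k}\sum_{l=1}^{M} w_{jl} \log w_{jl} \tag{1}$$
$$\text{subject to} \quad \sum_{j=1}^{k} u_{ij} = 1,\; u_{ij} \in [0, 1], \qquad \sum_{l=1}^{M} w_{jl} = 1,\; w_{jl} \in [0, 1],$$
where $D_{ijl} = (x_{il} - z_{jl})^2$ is the dissimilarity measure between the $i$-th sample and the $j$-th cluster on the $l$-th feature, $x_{il}$ is the value of the $l$-th feature of the $i$-th sample, $z_{jl}$ is the value of the $l$-th feature of the $j$-th cluster center, $U$ is the $N \times k$ fuzzy partition matrix, $u_{ij}$ is the membership degree of the $i$-th sample in the $j$-th cluster, $Z$ is the $k \times M$ matrix containing the cluster centers, $W$ is a $k \times M$ weight matrix, $w_{jl}$ is the weight of the $l$-th feature in the $j$-th cluster, and $\gamma$ and $\lambda$ are the input parameters used to control the fuzzy partition and the feature weights, respectively.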
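To make the three terms concrete, the following Python sketch evaluates an objective of the kind described above: a weighted within-cluster dispersion, plus a membership entropy term, plus a weight entropy term. It assumes squared differences as the per-feature dissimilarity; the names `objective`, `xlogx`, `gamma`, and `lam` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def xlogx(p):
    """Elementwise p * log(p), using the convention 0 * log(0) = 0."""
    p = np.asarray(p, dtype=float)
    safe = np.where(p > 0, p, 1.0)      # log(1) = 0 wherever p == 0
    return p * np.log(safe)

def objective(X, U, Z, W, gamma, lam):
    """Dispersion + gamma * sum(u log u) + lam * sum(w log w).

    X: (N, M) samples, U: (N, k) memberships,
    Z: (k, M) centers,  W: (k, M) feature weights.
    Assumption: per-feature dissimilarity is the squared difference.
    """
    D = (X[:, None, :] - Z[None, :, :]) ** 2          # shape (N, k, M)
    dispersion = np.einsum("ij,jl,ijl->", U, W, D)    # weighted within-cluster term
    membership_entropy = gamma * xlogx(U).sum()       # negative entropy of memberships
    weight_entropy = lam * xlogx(W).sum()             # negative entropy of weights
    return dispersion + membership_entropy + weight_entropy
```

With crisp memberships and one-hot weights both entropy terms vanish, so the value reduces to the plain weighted dispersion; softer memberships make the second term more negative.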
II-A Optimization Procedure
The objective function given in (1) is a constrained nonlinear optimization problem whose solution is unknown. The main aim is to minimize it with respect to the fuzzy partition matrix $U$, the cluster centers $Z$, and the feature weights $W$ using the alternating optimization method, in which two of the three are held fixed while the objective function is minimized with respect to the third. First, we fix $U$ and $W$ and minimize with respect to $Z$. The two remaining steps then update $W$ and $U$ in turn, each with the other two quantities held fixed.
First, for fixed $U$ and $W$, the objective function is minimized with respect to $Z$ as
From (3), the cluster center can be obtained using
Two cases arise from the above equation.
Case 1: If the weight $w_{jl}$ of feature $l$ in cluster $j$ is zero, the feature is totally irrelevant to the cluster. Hence, for any value of the center coordinate $z_{jl}$, this feature will not contribute to the overall weighted distance computation. Therefore, in this case, any value can be taken for $z_{jl}$.
Case 2: If $w_{jl} \neq 0$, the feature is relevant to the cluster, and (4) is written as
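As a worked version of this step, assume (this is an assumption for illustration, not a statement of the original equations) that the per-feature dissimilarity is the squared difference $(x_{il} - z_{jl})^2$, with $u_{ij}$ the membership of sample $i$ in cluster $j$, $w_{jl}$ the weight of feature $l$ in cluster $j$, and $N$ samples. Only the dispersion term of the objective $F$ depends on the centers, so the stationarity condition and the resulting update would read:

```latex
\frac{\partial F}{\partial z_{jl}}
  = -2\, w_{jl} \sum_{i=1}^{N} u_{ij}\,\bigl(x_{il} - z_{jl}\bigr) = 0
\quad\Longrightarrow\quad
z_{jl} = \frac{\sum_{i=1}^{N} u_{ij}\, x_{il}}{\sum_{i=1}^{N} u_{ij}}
\quad \text{if } w_{jl} \neq 0 .
```

When $w_{jl} = 0$ the condition holds for every value of $z_{jl}$, which is why Case 1 leaves the center coordinate free.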
Then, for given cluster centers $Z$, the constrained optimization problem in (1) is converted into an unconstrained minimization problem using the Lagrange multiplier technique as follows:
where the two multiplier vectors contain the Lagrange multipliers of the membership constraints and of the weight constraints, respectively. If the memberships, the weights, and the multipliers take their optimal values, then the gradient of the Lagrangian with respect to these variables vanishes.
As shown in (15) and (16), the feature weights within a cluster and the membership degrees of the data points depend on each other. This mutual dependency helps to find a better partition of the dataset, as reflected in the results and discussion section.
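The mutual dependence can be seen in a short derivation sketch. It assumes an objective $F$ of the form $\sum_{i}\sum_{j} u_{ij} \sum_{l} w_{jl} D_{ijl} + \gamma \sum_{i}\sum_{j} u_{ij}\log u_{ij} + \lambda \sum_{j}\sum_{l} w_{jl}\log w_{jl}$ with the constraints $\sum_j u_{ij} = 1$ and $\sum_l w_{jl} = 1$. The multiplier names $\alpha_i$ and $\beta_j$ are hypothetical, and the resulting closed forms are what this assumed objective yields rather than a transcription of (15) and (16):

```latex
\begin{aligned}
\mathcal{L} &= F + \sum_{i=1}^{N} \alpha_i \Bigl(\sum_{j=1}^{k} u_{ij} - 1\Bigr)
                 + \sum_{j=1}^{k} \beta_j \Bigl(\sum_{l=1}^{M} w_{jl} - 1\Bigr), \\[4pt]
\frac{\partial \mathcal{L}}{\partial w_{jl}}
  &= \sum_{i=1}^{N} u_{ij} D_{ijl} + \lambda\,(1 + \log w_{jl}) + \beta_j = 0
  \;\Longrightarrow\;
  w_{jl} = \frac{\exp\bigl(-\sum_{i} u_{ij} D_{ijl} / \lambda\bigr)}
                {\sum_{s=1}^{M} \exp\bigl(-\sum_{i} u_{ij} D_{ijs} / \lambda\bigr)}, \\[4pt]
\frac{\partial \mathcal{L}}{\partial u_{ij}}
  &= \sum_{l=1}^{M} w_{jl} D_{ijl} + \gamma\,(1 + \log u_{ij}) + \alpha_i = 0
  \;\Longrightarrow\;
  u_{ij} = \frac{\exp\bigl(-\sum_{l} w_{jl} D_{ijl} / \gamma\bigr)}
                {\sum_{t=1}^{k} \exp\bigl(-\sum_{l} w_{tl} D_{itl} / \gamma\bigr)} .
\end{aligned}
```

In this form the weights are a softmax over features that depends on the memberships, and the memberships are a softmax over clusters that depends on the weights, so each update feeds the other.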
II-B Parameter Selection
The choice of the control parameters in (15) and (16), the weight-entropy parameter $\lambda$ and the membership-entropy parameter $\gamma$, is essential to the performance of the proposed approach, since they reflect the importance of the second and third terms relative to the first term in (1). The parameter $\gamma$ controls the clustering process in two ways. First, when $\gamma$ is large, such that the first term in (1), i.e., the within-cluster dispersion, is small in comparison to the second term, the second term plays the crucial role in minimizing (1). The clustering process then tries to assign each data point to more than one cluster to make the second term more negative. When the membership degrees of a data point to all clusters are the same, the membership entropy is largest. Since the locations of the data points are fixed, the largest entropy is obtained by moving the cluster centers to the same location. Second, when $\gamma$ is small, the first term is large, and the minimization mainly reduces the within-cluster dispersion. The parameter $\lambda$, in turn, controls the feature weights. When $\lambda$ is positive, the weight given in (15) decreases as the dispersion of the corresponding feature within the cluster increases; a small dispersion yields a large weight, i.e., the feature is more important in that cluster. If $\lambda$ in (15) is too large, the third term dominates, and all features in a cluster become relevant and are assigned equal weights of $1/M$, where $M$ is the number of features. The parameters $\lambda$ and $\gamma$ in (15) and (16) are computed by (17) and (18) in iteration $t$ as
The two remaining parameters in (17) and (18) are constants, and the superscript $(t)$ denotes the value in iteration $t$. The presented approach is briefly described in Algorithm 1 as follows:
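The alternating scheme can be sketched in Python as follows. This is not a transcription of Algorithm 1: it assumes squared-difference dissimilarities and softmax-style updates of the kind that an entropy-regularized objective yields, and it keeps the two control parameters fixed instead of updating them by (17) and (18). The names `entropy_weighted_fkm`, `gamma`, `lam`, and `init_centers` are hypothetical.

```python
import numpy as np

def _softmax_neg(cost, temperature, axis):
    """Normalized exp(-cost / temperature) along `axis`, computed stably."""
    logits = -cost / temperature
    logits = logits - logits.max(axis=axis, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=axis, keepdims=True)

def entropy_weighted_fkm(X, k, gamma=1.0, lam=1.0, n_iter=50,
                         init_centers=None, seed=0):
    """Alternate center, weight, and membership updates for n_iter rounds."""
    X = np.asarray(X, dtype=float)
    N, M = X.shape
    rng = np.random.default_rng(seed)
    if init_centers is None:
        Z = X[rng.choice(N, size=k, replace=False)].copy()   # random samples as centers
    else:
        Z = np.array(init_centers, dtype=float)
    W = np.full((k, M), 1.0 / M)                              # start from uniform weights
    D = (X[:, None, :] - Z[None, :, :]) ** 2                  # (N, k, M)
    U = _softmax_neg((D * W[None, :, :]).sum(axis=2), gamma, axis=1)
    for _ in range(n_iter):
        # Centers: membership-weighted mean of the samples (U and W fixed).
        Z = (U.T @ X) / np.maximum(U.sum(axis=0), 1e-12)[:, None]
        D = (X[:, None, :] - Z[None, :, :]) ** 2
        # Weights: softmax over features of the per-feature dispersion (U and Z fixed).
        W = _softmax_neg((U[:, :, None] * D).sum(axis=0), lam, axis=1)
        # Memberships: softmax over clusters of the weighted distance (Z and W fixed).
        U = _softmax_neg((D * W[None, :, :]).sum(axis=2), gamma, axis=1)
    return U, Z, W
```

Calling `entropy_weighted_fkm(X, k=2)` returns the memberships, centers, and weights; a hard labeling is `U.argmax(axis=1)`. In this sketch, choosing a very large `lam` pushes every row of `W` toward equal weights of `1/M`, while a small `lam` concentrates the weight on the features with the lowest within-cluster dispersion.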
Table I: Description of the datasets
|Datasets||# of samples||# of features||# of clusters|
|Ovarian cancer [ovar]|| || ||2|
|Colon cancer [colon]|| || ||2|
|Ovarian cancer [ova]|| || ||2|
|Lung cancer [data]|| || || |
Table II: Clustering performance of the compared methods (AR, in %)
|Datasets||Measure||KM [km]||EWKM [ewkm]||AFKM [agglomerative]||FCM [fcm]||SCAD [scad]||Proposed Approach|
|Ovarian cancer [ovar]||AR||55.73||55.73||55.73||64.81||56.13||81.03|
|Colon cancer [colon]||AR||51.61||53.23||53.23||51.61||53.23||55.65|
|Ovarian cancer [ova]||AR||56.02||76.85||71.29||64.81||71.30||86.29|
|Lung cancer [data]||AR||49.75||62.44||51.72||55.56||59.61||64.04|
Table III: Cluster validity indices, reported as number of clusters (index value)
|Datasets||# of clusters||Index||AFKM [agglomerative]||FCM [fcm]||SCAD [scad]||Proposed Approach|
|Iris [Dua]||3||PC||3 (0.798)||3 (0.783)||3 (0.783)||3 (0.822)|
| || ||CE||3 (0.299)||3 (0.395)||3 (0.395)||3 (0.294)|
| || ||XB||3 (3.614)||3 (4.231)||3 (4.230)||3 (3.575)|
| || ||DI||3 (0.030)||3 (0.105)||3 (0.105)||3 (0.052)|
|Ionosphere [Dua]||2||PC||2 (0.500)||2 (0.651)||2 (0.651)||2 (0.729)|
| || ||CE||2 (0.693)||2 (0.521)||2 (0.521)||2 (0.394)|
| || ||XB||2 (2.061)||2 (3.156)||2 (3.157)||2 (3.176)|
| || ||DI||2 (0.081)||2 (0.071)||2 (0.071)||2 (0.082)|
|Ovarian cancer [ovar]||2||PC||2 (0.915)||2 (0.779)||2 (0.773)||2 (0.935)|
| || ||CE||2 (0.143)||2 (0.365)||2 (0.374)||2 (0.106)|
| || ||XB||2 (2.583)||2 (2.072)||2 (2.087)||2 (2.170)|
| || ||DI||2 (0.118)||2 (0.118)||2 (0.118)||2 (0.133)|
|Colon cancer [colon]||2||PC||2 (0.500)||2 (0.568)||2 (0.568)||2 (0.733)|
| || ||CE||2 (0.693)||2 (0.622)||2 (0.622)||2 (0.394)|
| || ||XB||2 (0.982)||2 (0.997)||2 (0.997)||2 (1.114)|
| || ||DI||2 (0.311)||2 (0.326)||2 (0.326)||2 (0.271)|
|Ovarian cancer [ova]||2||PC||2 (0.869)||2 (0.779)||2 (0.773)||2 (0.884)|
| || ||CE||2 (0.215)||2 (0.365)||2 (0.374)||2 (0.129)|
| || ||XB||2 (1.834)||2 (2.072)||2 (2.087)||2 (1.633)|
| || ||DI||2 (0.165)||2 (0.118)||2 (0.118)||2 (0.094)|
|Glioma [data]||4||PC||4 (0.465)||4 (0.252)||4 (0.263)||4 (0.637)|
| || ||CE||4 (0.926)||4 (1.382)||4 (1.360)||4 (0.664)|
| || ||XB||4 (0.607)||4 (0.406)||4 (0.405)||4 (0.910)|
| || ||DI||4 (0.434)||4 (0.474)||4 (0.397)||4 (0.412)|
|Lung cancer [data]||5||PC||5 (0.308)||5 (0.200)||5 (0.200)||5 (0.729)|
| || ||CE||5 (1.309)||5 (1.609)||5 (1.609)||5 (0.497)|
| || ||XB||5 (0.419)||5 (0.292)||5 (0.292)||5 (0.975)|
| || ||DI||5 (0.431)||5 (0.342)||5 (0.364)||5 (0.451)|
III Results and Discussion
The proposed approach is validated on the datasets described in Table I, and its efficacy is presented in terms of various clustering performance measures such as accuracy rate (AR), Rand index (RI) [rand], and normalized mutual information (NMI) [NMI]. These performance measures are mathematically written as follows:
$$AR = \frac{\sum_{i=1}^{k} n_i}{N}, \qquad RI = \frac{a + d}{a + b + c + d},$$
where $n_i$ denotes the number of data points correctly assigned to cluster $i$ and $N$ is the total number of data points. A large value of AR represents better clustering performance. The RI measures the similarity between two clustering partitions. Let $C$ be the set of actual clusters and $C'$ the set of clusters obtained by a clustering method. Over all pairs of points $(x_i, x_j)$, $a$ is the number of pairs that are in the same cluster in both $C$ and $C'$, $b$ is the number of pairs that are in the same cluster in $C$ but in different clusters in $C'$, $c$ is the number of pairs that are in different clusters in $C$ but in the same cluster in $C'$, and $d$ is the number of pairs that are in different clusters in both $C$ and $C'$. In clustering, NMI measures how much information the obtained cluster assignment contributes to making the correct classification decision, and its value always lies between 0 and 1.
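The two pair- and count-based measures can be computed with a few lines of plain Python. The sketch below makes one assumption that the text does not spell out: a point counts as "correctly obtained" when it belongs to the majority true class of its cluster. The function names `accuracy_rate` and `rand_index` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def accuracy_rate(labels_true, labels_pred):
    """Fraction of points that match the majority true class of their cluster."""
    correct = 0
    for cluster in set(labels_pred):
        members = [t for t, p in zip(labels_true, labels_pred) if p == cluster]
        correct += Counter(members).most_common(1)[0][1]   # size of the majority class
    return correct / len(labels_true)

def rand_index(labels_true, labels_pred):
    """Share of point pairs on which the two partitions agree (same/same or diff/diff)."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += same_true == same_pred
        total += 1
    return agree / total
```

Both functions are invariant to how the cluster labels are numbered, so a partition that matches the true classes up to relabeling scores 1.0 on each.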
The above clustering performance measures are supervised, i.e., the number of clusters k is assumed to be known. However, k is generally unknown in clustering. Therefore, to assess the effectiveness of the proposed approach, we have used different cluster validity indices, i.e., Partition Coefficient (PC) [pc], Xie and Beni's (XB) Index [xb], Classification Entropy (CE) [ce], and Dunn's Index (DI) [di]. These validity indices are mathematically written as follows:
The value of PC lies in the range $[1/k, 1]$: a PC closer to 1 represents the best partition, whereas the closer it is to $1/k$, the fuzzier the partition becomes. CE measures the fuzziness of the cluster partition similarly to PC; smaller values of CE indicate a partition closer to the optimal clustering. In XB, the numerator represents the compactness of the fuzzy partition and the denominator denotes the separation between clusters; smaller values of XB indicate a partition closer to the optimal clustering. DI measures the compactness of well-separated clusters. A large value of DI represents better clustering results.
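For reference, three of these indices can be computed as sketched below. The sketch uses the common textbook definitions (mean squared membership for PC, mean membership entropy with natural logarithm for CE, and compactness over `N` times the minimum squared center distance for XB, with fuzzifier `m = 2`); it is an assumption that these match the exact forms used in the experiments. The function names are hypothetical, and DI is omitted for brevity.

```python
import numpy as np

def partition_coefficient(U):
    """Mean of squared memberships; ranges from 1/k (fuzziest) to 1 (crisp)."""
    return float((U ** 2).sum() / U.shape[0])

def classification_entropy(U):
    """Mean membership entropy (natural log); 0 for a crisp partition."""
    safe = np.where(U > 0, U, 1.0)                 # avoids log(0); 0 * log(0) := 0
    return float(-(U * np.log(safe)).sum() / U.shape[0])

def xie_beni(X, U, Z, m=2.0):
    """Fuzzy compactness divided by N times the minimum squared center separation."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)          # (N, k)
    compactness = ((U ** m) * d2).sum()
    center_d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)   # (k, k)
    separation = center_d2[~np.eye(len(Z), dtype=bool)].min()        # off-diagonal minimum
    return float(compactness / (X.shape[0] * separation))
```

Under these definitions a completely uniform membership matrix gives PC equal to `1/k` and CE equal to `log(k)`, the fuzziest possible values.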
The proposed approach is validated on both low and high dimensional datasets and compared with various state-of-the-art methods, i.e., KM, EWKM, AFKM, FCM, and SCAD. The KM algorithm gives equal importance to all features in the clustering and is therefore unable to provide the optimal clusters. This problem is addressed by EWKM, which assigns different weights to the features in the clustering process. However, its weight assignment depends on the initial cluster centers: if the initial cluster centers change, the algorithm leads to a different clustering performance and may fail to converge. In AFKM, an entropy term is added to the cost function to obtain the optimal clusters in the dataset, but all features are treated equally in the clustering process. FCM produces a soft partition of the data but still gives equal importance to all features. SCAD considers different feature weights in different clusters but does not include the entropy term that controls the partition. In the proposed approach, the objective function is optimized with the weight entropy and the fuzzy partition entropy simultaneously. The weight entropy helps to assign relevant weights to the features in the clustering, whereas the fuzzy partition entropy helps to find the optimal number of partitions. The weight control and fuzzy partition control parameters in EWKM and AFKM are set heuristically. In SCAD, the weight control parameters are updated automatically in each iteration, but the fuzzy partition is not consistent. In the proposed approach, the weight control and fuzzy partition control parameters are updated automatically in each iteration, and a better fuzzy partition is obtained due to the relevant feature weight assignment during the clustering process.
As shown in Table II, the clustering performance of the proposed approach in terms of AR, RI, and NMI is better than that of the state-of-the-art methods for both high dimensional and non-sparse datasets; for example, on the Ovarian cancer [ovar] dataset the AR rises from at most 64.81 for the compared methods to 81.03. The cluster validity indices, i.e., PC, CE, XB, and DI, are also computed and compared with state-of-the-art methods, as given in Table III. The cluster validity measures show that the proposed approach is able to provide the optimal clustering result with improved clustering performance. As shown in Table IV, the computational complexity of the proposed approach is slightly higher than that of the state-of-the-art methods due to the second and third terms in the objective function. However, these terms are also what improves the clustering performance.
The weight contribution of the features in each cluster is also plotted for the IRIS and GLIOMA datasets, as shown in Figures 1 and 4. The features have different weight contributions in each cluster, which generalizes the approach. The fuzzy partition coefficients obtained by varying the number of clusters are shown in Figure 2, which shows that the proposed approach provides the better partition. The fuzzy partition control parameters are also plotted to show how they change from iteration to iteration. The results in Tables II and III are computed by running all algorithms in ten independent trials, and the average results are reported.
In this paper, an entropy-based variable feature weighted fuzzy k-means algorithm is presented for the clustering of high dimensional data with improved performance. In this approach, two different entropy terms are added to the objective function, which helps in identifying a better clustering structure of the data. The major advantage of the presented method is that its clustering performance is consistent, because it is insensitive to the initial cluster centers due to the assignment of different feature weights to each cluster in the clustering process. The performance is compared with state-of-the-art methods in terms of different clustering measures, which shows that the proposed approach partitions the data with improved performance. In the future, the correlation between features can be considered to minimize the effect of redundant features, and the proposed method can be extended to categorical or mixed attributes.