Selection of the Number of Clusters in Functional Data Analysis

Adriano Zanin Zambom · Julian A. Collazos · Ronaldo Dias

Abstract

Identifying the number of clusters in a dataset is one of the most difficult problems in clustering analysis. A choice of $K$ that correctly characterizes the features of the data is essential for building meaningful clusters. In this paper we tackle the problem of estimating the number of clusters in functional data analysis by introducing a new measure that can be used with different procedures in selecting the optimal $K$. The main idea is to use a combination of two test statistics, which measure the lack of parallelism and the mean distance between curves, to compute criteria such as the within and between cluster sums of squares. Simulations in challenging scenarios suggest that procedures using this measure can detect the correct number of clusters more frequently than existing methods in the literature. The application of the proposed method is illustrated on several real datasets.

Keywords:
parallelism test statistic · K-means algorithm · ANOVA · clustering

1 Introduction

Various methods have been proposed for clustering items in multidimensional spaces. The identification of homogeneous subsets of the data is a fundamental task in diverse statistical applications such as gene expression analysis (Sørlie et al. (2003), Ma et al. (2006), Nair et al. (2019)), finance (Wang et al. (2006), Lascio et al. (2018)), and medicine (McClelland and Kronmal (2002), Kim et al. (2014), Zhang et al. (2017)), just to cite a few. Selecting the number of clusters is an essential problem: it should be chosen so that objects in the same group have features as similar as possible, while objects in different groups have distinct features. In general, the number of clusters in a dataset is unknown; however, in some cases the choice can be motivated by the dataset itself. An area where finding the number of clusters and the clusters themselves can be especially challenging is functional data analysis, which requires working with spaces of infinite dimension. In this paper we introduce a new measure that can be used with several criteria to select the number of clusters in functional data analysis.

Several authors in the statistics literature have considered the problem of selecting the number of clusters in a dataset. The many criteria for finding the best (in some sense) number of clusters are a general guideline; the researcher still has to decide the appropriate definition of “clusters”, or the intended definition of separation between clusters, for each specific application (Hennig et al. (2015)). A comparison of early methods in the general multidimensional setting can be found in Milligan and Cooper (1985) and Hardy (1996). The well-known distance-based clustering algorithms, such as the K-means, quantify the closeness between observations by some distance measure. One of the classical approaches is the Caliński and Harabasz (1974) dendrite method, which selects the number of clusters that maximizes the ratio of the between to the within cluster sums of squares. Krzanowski and Lai (1985) suggest choosing the number of clusters $K$ that maximizes the ratio of successive scaled differences between the within sums of squares obtained with $K-1$ and with $K$ clusters. Similarly, Hartigan (1975) proposes minimizing a sequential measure of ratios between the within sums of squares. Tibshirani et al. (2001) introduce the interesting Gap statistic, which uses the output of any clustering algorithm and compares the expected within-cluster dispersion under a null distribution with the one computed from the data. Other methods include Kaufman and Rousseeuw (2009), Fischer (2011), Steinwart (2015), Hyrien and Baran (2016), Chakraborty and Das (2018), and references therein. In general, methods either perform clustering for different values of $K$ and select the one that minimizes/maximizes some criterion, or decide at each step whether a cluster should be split into two clusters.

Clustering functional data is not new in the statistics literature, see for instance Abraham et al. (2003), Tokushige et al. (2007), Bouveyron and Jacques (2011a), Febrero-Bande and de la Fuente (2012), Ieva et al. (2013a), Jacques and Preda (2013), Yamamoto and Terada (2014), Wang et al. (2014), Jacques and Preda (2014), García et al. (2015), Bouveyron et al. (2015) and references therein. Model-based clustering techniques, which assume that the data come from a mixture of probability distributions, usually select the number of clusters using the Bayesian Information Criterion (BIC). For instance, Ma et al. (2006) and Ma and Zhong (2008) use the BIC while performing clustering via the rejection-controlled EM algorithm initialized by K-means clustering for functional Gaussian mixture models. Bouveyron and Jacques (2011b) study clustering of time series in group-specific functional subspaces, while Giacofci et al. (2013) develop a clustering method for mixed-effects functional models in high dimensions, both using the BIC to select the number of clusters. Other authors who perform clustering in functional data analysis and use specially designed methods to select the number of clusters include Serban and Jiang (2012), in a multilevel functional clustering; James and Sugar (2003), who use the largest jump in the distortion function, defined as the average Mahalanobis distance between each spline coefficient in the functional clustering model and its closest cluster center; and Ieva et al. (2013b), who choose the optimal number of clusters using the so-called silhouette values (Kaufman and Rousseeuw (2009)).

A basic framework to select the number of clusters in functional datasets is to use well-established procedures with some distance measure between curves (and between curves and cluster centers). These distances can be used, for example, to compute the within and between cluster sums of squares (Caliński and Harabasz (1974), Krzanowski and Lai (1985), Hartigan (1975)). In functional data analysis, the $L_2$ distance between curves can be computed as the square root of the sum of the squared differences at each observed time point. However, the $L_2$ distance may not always be the best measure when clustering functional data. In this paper a new measure of similarity of curves is introduced, which combines two test statistics that assess the parallelism and the mean height of each curve. The test of parallelism is a novel concept adapted from Zambom and Akritas (2014) and Zambom et al. (2018). The proposed measure is then used to alter some existing approaches that attempt to minimize a distance-based dissimilarity measure of within and between clusters, as well as other criteria. The choice of the number of clusters using the proposed measure seems to enhance the stability and agreement of the modified procedures when compared to the same procedures using the $L_2$ version and to methods specially designed for functional data.

The remainder of the paper is organized as follows. Section 2 presents the proposed measure and how it can be used to adapt existing methods to select the number of clusters. Section 3 shows a comparative numerical simulation of the performance of the proposed method and other existing approaches in the literature, including some that use BIC or AIC. A data analysis of real datasets is presented in Section 4, and Section 5 provides some concluding remarks.

2 Methodology

Let $X_{ij}(t)$, $t \in \mathcal{T}$, denote the true functional data of the $j$-th observational unit, $j = 1, \dots, n_i$, in cluster $i$, $i = 1, \dots, K^*$, where $\mathcal{T}$ is a compact set in $\mathbb{R}$, usually representing time. Since functional data are in general discretely recorded and contaminated with measurement errors, let

$Y_{ij}(t_l) = X_{ij}(t_l) + \epsilon_{ijl}, \quad l = 1, \dots, T, \qquad (1)$

be the observed values of the $j$-th function at $t_1, \dots, t_T$, where $\epsilon_{ijl}$ is independent and identically distributed measurement error. This is a general formulation of functional data arising from $K^*$ clusters, which includes for example the model in Ma et al. (2006), who use $X_{ij}(t) = \mu_i(t) + b_{ij}$, where $\mu_i(t)$ is the mean curve of cluster $i$ and $b_{ij}$ is a Gaussian parallel deviation. The cluster centers in this case are the mean functions $\mu_i(t)$. For a more general notation we will denote the center of the $i$-th cluster by $c_i(t)$. Assuming that the true data generating mechanism is the one in model (1) but the curves are observed without the cluster labels, the objective is to correctly identify $K^*$, the true number of clusters.
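To fix ideas, the following minimal Python sketch simulates curves from the Ma et al. (2006) special case of model (1); the grid, the mean functions, the cluster sizes, and the noise levels are illustrative assumptions rather than settings used in the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 1.0, 100)  # common observation grid on the compact set T

    def simulate_cluster(mu, n_curves, sd_b=0.5, sd_eps=0.1):
        # Y(t_l) = mu(t_l) + b + eps_l: one Gaussian parallel deviation b per
        # curve plus i.i.d. measurement error, the Ma et al. (2006) case of (1)
        b = rng.normal(0.0, sd_b, size=(n_curves, 1))
        eps = rng.normal(0.0, sd_eps, size=(n_curves, t.size))
        return mu(t)[None, :] + b + eps

    # two clusters with sine and cosine mean curves (illustrative choices)
    curves = np.vstack([simulate_cluster(np.sin, 20), simulate_cluster(np.cos, 20)])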

The proposed measure to be used in selecting the number of clusters is based on a combination of two test statistics, which we derive using ideas similar to those in Zambom et al. (2018). These test statistics evaluate the parallelism and the average height between curves, so that our method aims at identifying clusters that are determined by curve shape and height. The first statistic checks whether the curve of an individual is parallel to a cluster center in the following way. For a curve $Y_j$ and a candidate cluster center $c_i$, compute the residuals

$R_{jl} = Y_j(t_l) - c_i(t_l), \quad l = 1, \dots, T.$

Consider $R_{j1}, \dots, R_{jT}$ as data from a one-way ANOVA design with $R_{jl}$ being the observation at “level” $t_l$. Augment each ANOVA factor level by including the $R_{jl'}$ corresponding to the $t_{l'}$, for $p$ odd, nearest grid points on either side of $t_l$, that is,

$\mathcal{W}_l = \{ R_{jl'} : |l' - l| \le (p-1)/2 \}.$

The choice of $p$ determines the size of each factor level in the augmented ANOVA; it can be chosen by running the test after randomly permuting the observed variables among the times $t_1, \dots, t_T$, in order to induce the validity of the null hypothesis, and selecting the $p$ whose empirical rejection rate best approximates the desired level of the test (Zambom and Akritas (2014), Zambom and Akritas (2015), Zambom and Kim (2017)).

If the curve is parallel to the cluster center, the residuals should have constant mean, and thus a test for parallelism is equivalent to testing $H_0: E(R_{jl}) = c$ for all $l = 1, \dots, T$, for a constant $c$. This test can be accomplished with an ANOVA F test of constant mean across the factor levels defined by $t_1, \dots, t_T$. Hence we propose using the one-way ANOVA test statistic

$F_j^{(i)} = \sqrt{T}\, (MST - MSE),$

where $MST = \frac{1}{T-1} \sum_{l=1}^{T} |\mathcal{W}_l| (\bar{R}_l - \bar{R})^2$, $MSE = \big( \sum_{l=1}^{T} (|\mathcal{W}_l| - 1) \big)^{-1} \sum_{l=1}^{T} \sum_{R \in \mathcal{W}_l} (R - \bar{R}_l)^2$, $\bar{R}_l$ is the average of the residuals in window $\mathcal{W}_l$, $\bar{R}$ is the overall average, and $MST$ and $MSE$ are the treatment mean sum of squares and error mean sum of squares, respectively, from an ANOVA whose factor levels are defined by the windows $\mathcal{W}_1, \dots, \mathcal{W}_T$. The form of this statistic is due to Akritas and Papadatos (2004), who showed that its distribution is asymptotically equivalent to the distribution of the well-known F statistic when the number of factor levels goes to infinity. In order to obtain a non-negative measure, we compute the standardized version

$A_j^{(i)} = \left| \frac{\sqrt{T}\,(MST - MSE)}{\hat{\sigma}} \right|,$

where $\hat{\sigma}$ is the estimated standard deviation of the test statistic $\sqrt{T}(MST - MSE)$. Large values of $A_j^{(i)}$ indicate low probability (small p-value) that the residuals have constant mean, which suggests that the curve is not parallel to that cluster center, and hence should not be assigned to that cluster.
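Continuing the Python sketch above, the parallelism statistic could be computed as follows; the truncated boundary windows and the plug-in estimate of $\hat{\sigma}$ are simplifying assumptions, with the exact expressions given in Zambom et al. (2018).

    def parallelism_stat(y, center, p=5):
        # residuals of the curve against a candidate cluster center
        r = np.asarray(y) - np.asarray(center)
        T, half = r.size, (p - 1) // 2
        # augmented ANOVA cells: the p nearest grid points around each t_l
        # (cells are truncated at the boundaries -- a simplification)
        cells = [r[max(0, l - half):l + half + 1] for l in range(T)]
        means = np.array([c.mean() for c in cells])
        grand = means.mean()
        mst = sum(len(c) * (m - grand) ** 2
                  for c, m in zip(cells, means)) / (T - 1)
        mse = sum(((c - c.mean()) ** 2).sum() for c in cells) \
            / sum(len(c) - 1 for c in cells)
        # crude plug-in for the standard deviation of sqrt(T)*(MST - MSE);
        # the exact estimator appears in Zambom et al. (2018)
        sigma_hat = np.sqrt(2.0 * p / (p - 1)) * mse
        return abs(np.sqrt(T) * (mst - mse) / sigma_hat)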

The second statistic is based on the t-test for differences in averages, which is defined as

$B_j^{(i)} = \left| \frac{\bar{Y}_j - \bar{c}_i}{\hat{\nu}} \right|,$

where $\bar{Y}_j = \frac{1}{T} \sum_{l=1}^{T} Y_j(t_l)$, $\bar{c}_i = \frac{1}{T} \sum_{l=1}^{T} c_i(t_l)$, and $\hat{\nu}^2$ is an unbiased estimator of the variance of $\bar{Y}_j - \bar{c}_i$. Again, large values of $B_j^{(i)}$ suggest that the curve does not belong to that cluster.
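A matching sketch of the mean-height statistic; the plug-in standard error below is an assumption standing in for the unbiased variance estimator referenced above.

    def height_stat(y, center):
        # t-type statistic comparing the average level of the curve with
        # that of the candidate cluster center
        diff = np.asarray(y) - np.asarray(center)
        se = diff.std(ddof=1) / np.sqrt(diff.size)  # plug-in standard error
        return abs(diff.mean()) / se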

Because in practice there is more than one possible definition of clusters, and in fact a true underlying number of clusters may not even exist, the choice of method to determine the best number of clusters should depend on the application at hand and the researcher's definition of clusters and expected separation. The proposed method can be used to modify a variety of procedures in the literature to select the number of clusters, hence adapting to the different requirements one may have when defining clusters and their meaningful structure in each application. The adaptation of some procedures in the literature with the aforementioned test statistics can be performed in the following way. Define the measure that combines the two test statistics by

$D_\lambda(Y_j, c_i) = \lambda\, A_j^{(i)} + (1 - \lambda)\, B_j^{(i)}, \quad \lambda \in [0, 1].$

The tuning parameter $\lambda$ determines how much weight should be given to the measure of parallelism and how much weight should be given to the average height distance between curves and cluster centers (see Section 2.1 for a discussion on how to select $\lambda$). Define the within-cluster sums as

$W_K = \sum_{i=1}^{K} \sum_{j \in C_i} D_\lambda(Y_j, c_i) \qquad (2)$

and the between-cluster sums as

$B_K = \sum_{i=1}^{K} n_i\, D_\lambda(c_i, \bar{c}), \qquad (3)$

where $\bar{c}$ is the overall center computed from all curves, $C_i$ denotes the set of indices of the curves assigned to the $i$-th cluster, and $n_i$ is the number of curves in the $i$-th cluster. Using the between and within sums defined in Equations (2) and (3), we modify the following methods for choosing the optimal number of clusters.
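Reusing the two statistics sketched earlier, the combined measure and the sums in Equations (2) and (3) could be assembled as below; taking pointwise means as cluster centers is an illustrative assumption.

    def D(y, center, lam, p=5):
        # combined dissimilarity: lam weights parallelism, 1 - lam mean height
        return lam * parallelism_stat(y, center, p) + (1 - lam) * height_stat(y, center)

    def within_between(curves, labels, lam):
        # W_K and B_K from Equations (2) and (3); labels is an integer array
        overall = curves.mean(axis=0)              # overall center (assumption)
        W = B = 0.0
        for g in np.unique(labels):
            members = curves[labels == g]
            center = members.mean(axis=0)          # cluster center (assumption)
            W += sum(D(y, center, lam) for y in members)
            B += len(members) * D(center, overall, lam)
        return W, B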

Calinski and Harabasz (Caliński and Harabasz (1974)) Let

$CH(K) = \frac{B_K / (K - 1)}{W_K / (n - K)}, \qquad (4)$

where $n$ is the total number of curves, and choose

$\hat{K} = \arg\max_K CH(K).$
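As an illustration, the modified Calinski and Harabasz choice could be computed as below; cluster_fn is a hypothetical placeholder for any clustering routine returning labels for given $K$ and $\lambda$, for instance the hypothesis-testing K-means of Zambom et al. (2018).

    def ch_choice(curves, cluster_fn, lam, k_max=8):
        n = len(curves)
        ch = {}
        for k in range(2, k_max + 1):
            labels = cluster_fn(curves, k, lam)
            W, B = within_between(curves, labels, lam)
            ch[k] = (B / (k - 1)) / (W / (n - k))  # criterion (4)
        return max(ch, key=ch.get)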

Krzanowski and Lai (Krzanowski and Lai (1985)) Let

$KL(K) = \left| \frac{DIFF_K}{DIFF_{K+1}} \right|, \qquad (5)$

where

$DIFF_K = (K-1)^{2/T}\, W_{K-1} - K^{2/T}\, W_K$

is the scaled difference between the within sums obtained with $K-1$ and with $K$ clusters (the exponent $2/T$ playing the role of the dimension scaling $2/p$ in Krzanowski and Lai (1985)), and choose

$\hat{K} = \arg\max_K KL(K).$

Hartigan (Hartigan (1975)) Let

$H(K) = \left( \frac{W_K}{W_{K+1}} - 1 \right) (n - K - 1), \qquad (6)$

and choose

$\hat{K} = \min\{ K \ge 1 : H(K) \le 10 \}.$

Kaufman and Rousseeuw (Kaufman and Rousseeuw (2009)) Define the Silhouette statistic as

$s(j) = \frac{b(j) - a(j)}{\max\{ a(j), b(j) \}}, \qquad (7)$

and choose

$\hat{K} = \arg\max_K \frac{1}{n} \sum_{j=1}^{n} s(j),$

where $a(j)$ is the average of $D_\lambda(Y_j, Y_{j'})$ for $j'$ corresponding to the data in the same cluster as curve $j$, and $b(j)$ is the average of $D_\lambda(Y_j, Y_{j'})$ for curves $j'$ in the nearest cluster, where nearest is defined as the cluster minimizing $b(j)$.
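A sketch of the corresponding silhouette computation, assuming the pairwise $D_\lambda$ values between curves have been precomputed into a matrix:

    def avg_silhouette(dmat, labels):
        # dmat[j, j'] holds D_lambda between curves j and j'; labels is an
        # integer numpy array of cluster memberships
        n = len(labels)
        s = np.zeros(n)
        for j in range(n):
            same = labels == labels[j]
            same[j] = False
            a = dmat[j, same].mean() if same.any() else 0.0
            b = min(dmat[j, labels == g].mean()
                    for g in np.unique(labels) if g != labels[j])
            s[j] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
        return s.mean()  # maximize this average over K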

Tibshirani, Walther and Hastie (Tibshirani et al. (2001)) Define the Gap statistic as

$Gap(K) = \frac{1}{B} \sum_{b=1}^{B} \log(W^*_{Kb}) - \log(W_K), \qquad (8)$

where $W^*_{Kb}$ is the within cluster sum for the $b$-th simulated uniform data set. Here, $B$ different uniform data sets, each with the same range as the original data, are generated, and the within cluster sum is calculated for different numbers of clusters. One approach would be to maximize $Gap(K)$. However, to avoid adding unnecessary clusters, an estimate $s_{K+1}$ of the standard deviation of $Gap(K+1)$ is computed and the optimal choice is

$\hat{K} = \min\{ K : Gap(K) \ge Gap(K+1) - s_{K+1} \}.$
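A compact sketch of the Gap rule; w_of is a hypothetical helper returning the within sum $W_K$ after clustering a dataset into $K$ groups, not a function from the paper.

    def gap_choice(curves, w_of, n_ref=10, k_max=8, rng=np.random.default_rng(1)):
        lo, hi = curves.min(axis=0), curves.max(axis=0)
        logW = np.array([np.log(w_of(curves, k)) for k in range(1, k_max + 1)])
        # reference sets: uniform draws over the range of the observed data
        logWs = np.array([[np.log(w_of(rng.uniform(lo, hi, curves.shape), k))
                           for k in range(1, k_max + 1)] for _ in range(n_ref)])
        gap = logWs.mean(axis=0) - logW
        s = logWs.std(axis=0, ddof=1) * np.sqrt(1.0 + 1.0 / n_ref)
        for k in range(k_max - 1):                # smallest K with
            if gap[k] >= gap[k + 1] - s[k + 1]:   # Gap(K) >= Gap(K+1) - s_{K+1}
                return k + 1
        return k_max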

James and Sugar (James and Sugar (2003)) Use the distortion function defined as

$\hat{d}_K = \frac{1}{n} \sum_{j=1}^{n} \min_{1 \le i \le K} D_\lambda(Y_j, c_i). \qquad (9)$

The distortion is the average measure $D_\lambda$ between each curve and its closest cluster center $c_i$, for $i = 1, \dots, K$. Choose

$\hat{K} = \arg\max_K \left( \hat{d}_K^{\,-\gamma} - \hat{d}_{K-1}^{\,-\gamma} \right),$

with $\hat{d}_0^{\,-\gamma} \equiv 0$ and a transformation power $\gamma > 0$.
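Once the distortions $\hat{d}_1, \dots, \hat{d}_{K_{max}}$ are available, the jump rule is a few lines of arithmetic; $\gamma = 1$ below is only a placeholder for the transformation power discussed by James and Sugar (2003).

    def jump_choice(distortions, gamma=1.0):
        d = np.asarray(distortions, dtype=float) ** (-gamma)  # index 0 holds d_1
        jumps = np.diff(np.concatenate(([0.0], d)))           # d_0^{-gamma} = 0
        return int(np.argmax(jumps)) + 1                      # K with largest jump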

Fischer (Fischer (2011)). Define

$\hat{K} = \arg\min_{1 \le K \le K_{max}} \left\{ \hat{\Delta}(\hat{\mathbf{c}}_K) + c\, pen(K) \right\}, \qquad (10)$

where $\hat{\mathbf{c}}_K$ is a minimizer of the criterion over $\mathcal{C}_K$, the set of all collections of $K$ cluster centers, that is, $\hat{\mathbf{c}}_K = \arg\min_{\mathbf{c} \in \mathcal{C}_K} \hat{\Delta}(\mathbf{c})$. The first term, for every $\mathbf{c} = (c_1, \dots, c_K)$, is defined as the empirical distortion

$\hat{\Delta}(\mathbf{c}) = \frac{1}{n} \sum_{j=1}^{n} \min_{1 \le i \le K} D_\lambda(Y_j, c_i).$

The second term in Equation (10) is a first approximation for the penalty shape, which depends on $K$ and tends to 0 at the rate $1/\sqrt{n}$. To determine the constant $c$, the slope heuristics (Birgé and Massart (2007)) is used. The penalty can be calibrated by two methods: the jump method (Djump), which consists in identifying an abrupt jump in the model complexity, and the data-driven slope estimation (DDSE) method, which computes the slope of the line relating the empirical contrast to the penalty shape for complex models. Both methods have been implemented in the R (www.r-project.org) package capushe (CAlibrating Penalty Using Slope HEuristics) by Baudry et al. (2012).

2.1 Choice of $\lambda$

The parameter $\lambda$ determines the weight given to the parallelism of the curves or to the difference in mean values of the curves. It is convenient to design an automatic procedure for choosing the weight $\lambda$, so that one does not have to rely on ad hoc choices that can differ from dataset to dataset and from one procedure to another. Note that by modifying the methods as described in Section 2 with the dependence on $\lambda$, each method depends on $\lambda$ and so does each selected $\hat{K}$; however, this was omitted from the description for ease of notation.

Because each method of selecting $K$ is computed using different combinations of $W_K$, $B_K$, and other measures, the choice of $\lambda$ may depend on the characteristics of each criterion and how it is used. As such, we propose selecting $\lambda$ and $K$ for each method by optimizing the criterion jointly over $\lambda$ and $K$. For example, when using Calinski and Harabasz's procedure, select the optimal pair by computing

$(\hat{\lambda}, \hat{K}) = \arg\max_{\lambda, K} CH_\lambda(K),$

where $CH_\lambda(K)$ denotes the criterion in (4) computed with $D_\lambda$, and selecting the $\hat{K}$ where the maximum is attained. Similar arguments can be used to select the optimal $K$ with optimal $\lambda$ for all the methods considered in this paper.
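A sketch of this joint optimization for the Calinski and Harabasz criterion, reusing within_between from Section 2; the grid over $\lambda$ and the clustering routine cluster_fn are assumptions.

    def select_lambda_and_k(curves, cluster_fn, k_max=8,
                            lam_grid=np.linspace(0.0, 1.0, 11)):
        n = len(curves)
        best_lam, best_k, best_ch = None, None, -np.inf
        for lam in lam_grid:
            for k in range(2, k_max + 1):
                labels = cluster_fn(curves, k, lam)  # clustering depends on lam
                W, B = within_between(curves, labels, lam)
                ch = (B / (k - 1)) / (W / (n - k))
                if ch > best_ch:
                    best_lam, best_k, best_ch = lam, k, ch
        return best_lam, best_k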

3 Numerical simulations

In this section we investigate the finite sample performance of the proposed method in selecting the number of clusters through several simulation scenarios. For comparison purposes, we evaluate the performance of the selection procedures using the proposed measure $D_\lambda$ as well as the $L_2$ distance. We also include the results of some competing methods in the literature that are designed specifically for functional data, namely funclust (Jacques and Preda (2013)), funHDDC (Bouveyron and Jacques (2011b)), curvclust (Giacofci et al. (2013)), funFEM (Bouveyron et al. (2015)), and MFclust (Ma et al. (2006)). These methods are readily available in the statistical software R.

We consider 4 scenarios with varying degrees of difficulty; in each scenario, the curves of each cluster are generated from a cluster-specific mean function contaminated with random noise.

Smoothed versions of one example of the simulated curves for the 4 scenarios considered are shown in Figure 1; the scenarios present challenges not only for the selection of the number of clusters, but also for the identification of the clusters and the affiliation of each curve. The simulation results for each scenario in this section are based on 100 Monte Carlo runs, considering candidate numbers of clusters $K = 1, \dots, 8$. For comparison purposes, the initial clustering for the $D_\lambda$ and $L_2$ reported results was based on the K-means clustering algorithm in Zambom et al. (2018). Table 1 shows the results of the simulations for Scenarios 1 to 4 when using the proposed measure $D_\lambda$ with $\lambda$ chosen as described in Section 2.1, Table 2 shows the results using $L_2$, and Table 3 shows the results of the methods specifically designed for functional data.

Figure 1: Functional data by clusters for (a) Scenario 1 ($K^* = 2$), (b) Scenario 2 ($K^* = 3$), (c) Scenario 3 ($K^* = 4$) and (d) Scenario 4 ($K^* = 5$).
Estimated number of clusters (columns: K = 1, ..., 8)

Scenario 1 (K* = 2)
Method        1     2     3     4     5     6     7     8
CH            0   100     0     0     0     0     0     0
KL            0    93     2     0     1     1     1     2
Hartigan      0   100     0     0     0     0     0     0
Silhouette    0   100     0     0     0     0     0     0
GAP           0    92     8     0     0     0     0     0
JUMP          0   100     0     0     0     0     0     0
DDSE          0    65     8    23     3     1     0     0
Djump         0    81     6     4     8     1     0     0

Scenario 2 (K* = 3)
Method        1     2     3     4     5     6     7     8
CH            0    23    74     3     0     0     0     0
KL            0     0    92     5     3     0     0     0
Hartigan      0    27    56    16     1     0     0     0
Silhouette    0     5    95     0     0     0     0     0
GAP          29    43    28     0     0     0     0     0
JUMP          0     0    97     2     1     0     0     0
DDSE          8    14    75     1     2     0     0     0
Djump         0     0    85     7     5     3     0     0

Scenario 3 (K* = 4)
Method        1     2     3     4     5     6     7     8
CH            0     0   100     0     0     0     0     0
KL            0     1     0    90     3     2     3     1
Hartigan      0    51     0    49     0     0     0     0
Silhouette    0   100     0     0     0     0     0     0
GAP          19    63     0    18     0     0     0     0
JUMP          0     0   100     0     0     0     0     0
DDSE          0     9     3    85     2     1     0     0
Djump         0     0     1    84     5    10     0     0

Scenario 4 (K* = 5)
Method        1     2     3     4     5     6     7     8
CH            0     0     0     0   100     0     0     0
KL            0     0     0     0   100     0     0     0
Hartigan      0     0     0     0   100     0     0     0
Silhouette    0     0     0     0   100     0     0     0
GAP           0     3    19    34    39     5     0     0
JUMP          0     0     0     0   100     0     0     0
DDSE          0     0     0     2    96     1     1     0
Djump         1     4     0     1    90     4     0     0

Table 1: Estimated number of clusters using methods in (4) - (10) with $D_\lambda$ for scenarios 1 through 4.
Estimated number of clusters (columns: K = 1, ..., 8)

Scenario 1 (K* = 2)
Method        1     2     3     4     5     6     7     8
CH            0    92     5     3     0     0     0     0
KL            0     9     6    26    19    16    13    11
Hartigan     97     0     2     1     0     0     0     0
Silhouette    0   100     0     0     0     0     0     0
GAP          24     0     7     9    26    33     1     0
JUMP         19    29     0    20    20    10     1     1
DDSE          0     4     7    39    27    21     2     0
Djump         1    57    12    22     8     0     0     0

Scenario 2 (K* = 3)
Method        1     2     3     4     5     6     7     8
CH            0    34     2     2     6    32    19     5
KL            0    11     2     6     8    24    22    27
Hartigan     63     1     1     5     5    10    10     2
Silhouette    0    98     0     0     0     2     0     0
GAP          76     0     2     5     6    10     1     0
JUMP         11     0     0     0    14    29    28    18
DDSE          2    47     7    12     8    21     3     0
Djump        25    57     7     7     3     1     0     0

Scenario 3 (K* = 4)
Method        1     2     3     4     5     6     7     8
CH            0     0     0    99     0     1     0     0
KL            0    36     0    63     0     0     1     0
Hartigan     86     6     0     0     1     2     5     0
Silhouette    0     1     0    99     0     0     0     0
GAP           0    47     0     5    25    21     2     0
JUMP          0     0     0   100     0     0     0     0
DDSE          0     0     0    77    11     6     6     0
Djump        90     0     0     9     1     0     0     0

Scenario 4 (K* = 5)
Method        1     2     3     4     5     6     7     8
CH            0   100     0     0     0     0     0     0
KL            0    76     0     0     0     5    12     7
Hartigan     64     0     0     0     0    16    19     1
Silhouette    0   100     0     0     0     0     0     0
GAP           0    29    46    23     2     0     0     0
JUMP          0    98     0     0     0     0     0     2
DDSE          0    17    46    27     7     3     0     0
Djump         7    28    51    10     2     2     0     0

Table 2: Estimated number of clusters using methods in (4) - (10) with $L_2$ for scenarios 1 through 4.
Estimated number of clusters (columns: K = 1, ..., 8)

Scenario 1 (K* = 2)
Method        1     2     3     4     5     6     7     8
funclust     26    44    24     5     1     0     0     0
funHDDC       0    41    56     1     2     0     0     0
curvclust     0     0     0     0     0     8    37     5
funFEM        0     0     0     7    39    46     8     0
MFclust       0     0     0     0     0     1    17    82

Scenario 2 (K* = 3)
Method        1     2     3     4     5     6     7     8
funclust      1    75    23     1     0     0     0     0
funHDDC       0    63    28     1     2     3     2     1
curvclust     0     0     0     0    22     7     8    63
funFEM        0     0     0     0     0    11    36    53
MFclust       0     0     1     0     1     7    21    71

Scenario 3 (K* = 4)
Method        1     2     3     4     5     6     7     8
funclust      0    62    37     1     0     0     0     0
funHDDC       0    49    42     1     0     1     1     6
curvclust     0     0     0     0     0    10    54    36
funFEM        0     0     0     0     0     0    24    76
MFclust       0     0     0     0     3     4    12    81

Scenario 4 (K* = 5)
Method        1     2     3     4     5     6     7     8
funclust      0    93     6     1     0     0     0     0
funHDDC       0     2    14     0     7     0     5    72
curvclust     0    44    35     0     1     0     0     0
funFEM        0     0     0     0     0     0     0   100
MFclust       0     0     0     0     1     5    31    63

Table 3: Estimated number of clusters using methods designed for functional data.

The results suggest that the proposed measure $D_\lambda$, in general, improves the accuracy of all the modified procedures in selecting the number of clusters when compared to the respective results with the $L_2$ distance. When using $L_2$, most methods have high variability in the sense that the selected number of clusters frequently varies across simulated datasets, even though the datasets come from the same underlying distributions. This high variability is probably due to the fact that the $L_2$ distance does not capture simultaneously the parallelism features and the mean difference of the curves, and thus the criterion used in each method fails to distinguish the correct optimal number of clusters. For example, Hartigan's and Gap's procedures using $L_2$ failed to select the correct number of clusters almost always in all simulation scenarios, while KL, JUMP, DDSE, and Djump only perform well in 1 out of the 4 scenarios. On the other hand, the results reported in Table 1 suggest that the proposed measure induces more stability in the selection of $K$, in the sense that throughout the simulated datasets each method tends to select the same optimal number of clusters, which is in the majority of cases correct. Only a few notable deviations from the correct number of clusters can be observed with the $D_\lambda$ measure. The Gap criterion in Scenarios 2 and 3 did not select the correct number of clusters in the majority of the simulations, and had high variability in Scenario 4. CH incorrectly selected $\hat{K} = 3$ in all simulations of Scenario 3, as did JUMP, while Silhouette incorrectly selected $\hat{K} = 2$. Finally, the results of the methods especially designed for functional data reported in Table 3 suggest that their performance is unsatisfactory for the challenging scenarios considered. Because their selection criterion is usually based on the BIC, it might have failed to identify the challenging characteristics of the curves by only computing the penalized likelihood.

4 Real data analysis

In this section we examine the effectiveness of using $D_\lambda$ and $L_2$, as well as the functional methods, for the selection of the number of clusters on six real datasets: the Growth, the Moisture wheat, the Canadian weather, the Kneading, the Poblenou, and the Fat spectrum datasets. These datasets have been studied in the literature of clustering methods for functional data. In the first four datasets the grouping is either known or artificially generated according to previous studies in the literature, while the last two datasets have no known groupings.

The Growth dataset, a Berkeley longitudinal study, consists of the heights of 54 girls and 39 boys measured at 31 stages from 1 to 18 years of age. This dataset is available in the R package fda. The Moisture wheat dataset consists of near-infrared reflectance spectra of 100 wheat samples, measured in 2 nm intervals from 1100 to 2500 nm, along with the samples' moisture content. Since these samples had no natural grouping, we artificially split them into two groups as in Ferraty and Vieu (Section 8.4.2): the curves whose moisture content is smaller than 15 are assigned to group 1 and the remaining curves to group 2. This data was also analyzed in Kalivas (1997), and is available in the R package fds. The Canadian weather dataset is also available in the R package fda and consists of the daily temperature at 35 different weather stations in Canada averaged over 1960-1994; it is grouped in 4 regions (referring to the climate zones Atlantic, Pacific, Continental, and Arctic). The Kneading dataset consists of the recorded resistance of dough during the kneading process over a time period of 480 seconds, with measurements every 2 seconds; the data is publicly available at http://math.univ-lille1.fr/preda/FDA/flours.txt. The curves form two groups: 50 flours that yielded good quality cookies and 40 flours that yielded low quality cookies. The last two datasets, whose groupings are not known, are: the Poblenou dataset, which measures the levels of NOx recorded every hour by a control station in Poblenou, Barcelona, Spain (available in the R package fda.usc); and the Fat spectrum dataset, which measures the absorbance at 100 wavelengths (from 852 to 1050 nm in steps of 2 nm) of fat content obtained by analytic chemical processing (available in the R package fds). The smoothed curves of the datasets considered are shown in Figures 2 and 3.

Figure 2: Functional datasets with known groupings: (a) Growth data, (b) Moisture wheat samples, (c) Canadian weather and (d) Kneading data.
Figure 3: Functional datasets with unknown groupings: (a) Poblenou data, (b) Fat spectrum data.

Because the selection of the number of clusters depends on the clustering algorithm at each candidate $K$, and the K-means clustering is initialized with random initial centers, the results shown in this section are based on 100 analyses of the same dataset, each of which has starting clusters randomly initialized for the K-means algorithm. The results of the analysis of the first four datasets with procedures using $D_\lambda$ and $L_2$, and with the methods designed for functional data, are shown in Tables 4, 5, and 6, respectively. The conclusions of this real data analysis are similar to those of the simulation studies, in that the proposed modification of the methods using $D_\lambda$ yields more stable selection of the optimal $K$, which is mostly accurate. The Gap method performed poorly with both $D_\lambda$ and $L_2$, followed by DDSE and Djump, which presented high variability. All methods performed poorly on the Canadian weather data ($K^* = 4$), both with $D_\lambda$ and $L_2$, probably due to the high proportion of the domain with overlapping curves. Interestingly, the methods designed for functional data performed poorly in all cases in selecting the number of clusters, as can be seen in Table 6. Overall, the results suggest that the proposed method is more accurate in selecting the true number of clusters.

Estimated number of clusters (columns: K = 1, ..., 8)

Growth (K* = 2)
Method        1     2     3     4     5     6     7     8
CH            0   100     0     0     0     0     0     0
KL            0    46     9     9     5     6    15    10
Hartigan      0   100     0     0     0     0     0     0
Silhouette    0   100     0     0     0     0     0     0
GAP          83    17     0     0     0     0     0     0
JUMP          0   100     0     0     0     0     0     0
DDSE          0   100     0     0     0     0     0     0
Djump         0    98     2     0     0     0     0     0

Moisture (K* = 2)
Method        1     2     3     4     5     6     7     8
CH            0   100     0     0     0     0     0     0
KL            0    43    19     7    18     2     0    11
Hartigan      0   100     0     0     0     0     0     0
Silhouette    0   100     0     0     0     0     0     0
GAP           0    52    48     0     0     0     0     0
JUMP          0   100     0     0     0     0     0     0
DDSE          0    34    28    23     3    12     0     0
Djump         1    56    33    10     0     0     0     0

Canadian Weather (K* = 4)
Method        1     2     3     4     5     6     7     8
CH            0    89     0    11     0     0     0     0
KL            0     0    12    24    29    17    13     5
Hartigan      0    94     6     0     0     0     0     0
Silhouette    0    92     8     0     0     0     0     0
GAP          85     0    15     0     0     0     0     0
JUMP          0     0     5    32    34    12     7    10
DDSE          0     1    17    42    29     8     3     0
Djump         3    28    21    25    22     1     0     0

Kneading (K* = 2)
Method        1     2     3     4     5     6     7     8
CH            0   100     0     0     0     0     0     0
KL            0    81     5     2     8     0     3     1
Hartigan      0   100     0     0     0     0     0     0
Silhouette    0   100     0     0     0     0     0     0
GAP           0    89    11     0     0     0     0     0
JUMP          0   100     0     0     0     0     0     0
DDSE          0    80     1    19     0     0     0     0
Djump         0    69     5    23     3     0     0     0

Table 4: Estimated number of clusters using methods in (4) - (10) with $D_\lambda$ for the datasets: Growth, Moisture, Canadian Weather, and Kneading.
Estimated number of clusters (columns: K = 1, ..., 8)

Growth (K* = 2)
Method        1     2     3     4     5     6     7     8
CH            0   100     0     0     0     0     0     0
KL            0    80     0     8     0     6     2     4
Hartigan     70    30     0     0     0     0     0     0
Silhouette    0   100     0     0     0     0     0     0
GAP         100     0     0     0     0     0     0     0
JUMP        100     0     0     0     0     0     0     0
DDSE          0    82     0    11     5     2     0     0
Djump         0    63     0    35     2     0     0     0

Moisture (K* = 2)
Method        1     2     3     4     5     6     7     8
CH            0   100     0     0     0     0     0     0
KL            0    30     4    25    15     9    11     6
Hartigan     77     0     3     9    10     1     0     0
Silhouette    0   100     0     0     0     0     0     0
GAP         100     0     0     0     0     0     0     0
JUMP          0    80     0     9     7     0     3     1
DDSE          0     3    25    41    23     4     4     0
Djump         2    14    24    35    23     2     0     0

Canadian Weather (K* = 4)
Method        1     2     3     4     5     6     7     8
CH            0     1     0     7    16     1    28    47
KL            0     4     7     7    45     2    24    11
Hartigan     72     4     6     9     9     0     0     0
Silhouette    0    34    63     0     3     0     0     0
GAP         100     0    32    25    31    12     0     0
JUMP          0     0     0     6    26     2    38    28
DDSE          0     2    10    37    50     1     0     0
Djump        15    10    10    25    40     0     0     0

Kneading (K* = 2)
Method        1     2     3     4     5     6     7     8
CH            0   100     0     0   100     0     0     0
KL            0    92     0     1     3     3     1     0
Hartigan     66    27     5     1     1     0     0     0
Silhouette    0   100     0     0   100     0     0     0
GAP           0    98     2     0     0     0     0     0
JUMP          0     0     0     1    13    25    34    27
DDSE          0    87     0     1     3     7     1     1
Djump         8    89     0     2     5     1     2     1

Table 5: Estimated number of clusters using methods in (4) - (10) with $L_2$ for the datasets: Growth, Moisture, Canadian Weather, and Kneading.
Estimated number of clusters (columns: K = 1, ..., 8)

Growth (K* = 2)
Method        1     2     3     4     5     6     7     8
funclust      0    91     9     0     0     0     0     0
funHDDC       0   100     0     0     0     0     0     0
curvclust     0     0     0   100     0     0     0     0
funFEM        0     0     0     0     0     0     0   100
MFclust       0     0     0     0     0     0     6    94

Moisture (K* = 2)
Method        1     2     3     4     5     6     7     8
funclust      0    83    14     3     0     0     0     0
funHDDC       0     0     0     0   100     0     0     0
curvclust     0     0     0     0     1     5    32    62
funFEM        0     0     0     0     0     0     0   100
MFclust       0     0     0     1     7    17    31    44

Canadian Weather (K* = 4)
Method        1     2     3     4     5     6     7     8
funclust      0    95     5     0     0     0     0     0
funHDDC       0   100     0     0     0     0     0     0
curvclust     0     0     0     0     0     0     3    97
funFEM        0     0     0   100     0     0     0     0
MFclust       0     0     0     0     5     5     0    90

Kneading (K* = 2)
Method        1     2     3     4     5     6     7     8
funclust      0    65    25     8     2     0     0     0
funHDDC       0     0     0     0     0     0   100     0
curvclust     0     0     0     0     0     0     3    97
funFEM        0     0     0     0     0   100     0     0
MFclust       0     0     0     0     0     3    20    77

Table 6: Estimated number of clusters using methods designed for functional data for the datasets: Growth, Moisture, Canadian Weather, and Kneading.

The results for the last two datasets, whose numbers of clusters are unknown, are shown in Tables 7, 8, and 9 for $D_\lambda$, $L_2$, and the functional methods, respectively. Although the real number of clusters is unknown, the proposed procedure yields a higher agreement across all the different methods, while the procedures using $L_2$ and the functional methods estimate the optimal number of clusters with little agreement.

Estimated number of clusters (columns: K = 1, ..., 8)

Poblenou
Method        1     2     3     4     5     6     7     8
CH            0   100     0     0     0     0     0     0
KL            0    13     5    17     9    14    30    12
Hartigan      0   100     0     0     0     0     0     0
Silhouette    0   100     0     0     0     0     0     0
GAP           0   100     0     0     0     0     0     0
JUMP          0   100     0     0     0     0     0     0
DDSE          5    95     0     0     0     0     0     0
Djump         0    97     1     0     2     0     0     0

Fat
Method        1     2     3     4     5     6     7     8
CH            0   100     0     0     0     0     0     0
KL            0     0    68     5     0     2    10    15
Hartigan      0     0   100     0     0     0     0     0
Silhouette    0   100     0     0     0     0     0     0
GAP           0   100     0     0     0     0     0     0
JUMP          0   100     0     0     0     0     0     0
DDSE          0     0    85     0     0     9     0     6
Djump         0    14    86     0     0     0     0     0

Table 7: Estimated number of clusters using methods in (4) - (10) with $D_\lambda$ for the datasets: Poblenou and Fat Spectrum.
Estimated number of clusters (columns: K = 1, ..., 8)

Poblenou
Method        1     2     3     4     5     6     7     8
CH            0   100     0     0     0     0     0     0
KL            0    77     0     2     3     3     4    11
Hartigan     67    16     8     7     1     1     0     0
Silhouette    0   100     0     0     0     0     0     0
GAP         100     0     0     0     0     0     0     0
JUMP          4     0     0     0     1    15    46    34
DDSE         21    76     0     0     0     3     0     0
Djump        19    81     0     0     0     0     0     0

Fat
Method        1     2     3     4     5     6     7     8
CH            0     0     0     0     0    11    42    47
KL            0     0    36     0     0    36    11    17
Hartigan     94     0     0     0     0     0     4     0
Silhouette    0   100     0     0     0     0     0     0
GAP         100     0     0     0     0     0     0     0
JUMP          0     0     0     0   100    63    11    26
DDSE         49     0     0     0     0    51     0     0
Djump       100     0     0     0     0     0     0     0

Table 8: Estimated number of clusters using methods in (4) - (10) with $L_2$ for the datasets: Poblenou and Fat Spectrum.
Estimated number of clusters (columns: K = 1, ..., 8)

Poblenou
Method        1     2     3     4     5     6     7     8
funclust      0    97     3     0     0     0     0     0
funHDDC       0   100     0     0     0     0     0     0
curvclust     0     0    97     3     0     0     0     0
funFEM        0     0     0     0   100     0     0     0
MFclust       0     0     0     0     0     0     2    98

Fat
Method        1     2     3     4     5     6     7     8
funclust      0    42    45     8     5     0     0     0
funHDDC       0     0     0     0     0     0     0   100
curvclust     0     0   100     0     0     0     0     0
funFEM        0     0     0     0     0     0     0   100
MFclust       0     0     0     0     0    55    26    19

Table 9: Estimated number of clusters using methods designed for functional data for the datasets: Poblenou and Fat Spectrum.

5 Conclusion

In this paper we considered the problem of selecting the number of clusters for functional datasets. We proposed a new measure $D_\lambda$ that combines two test statistics, one testing the difference in means of the curves and another testing their parallelism, inspired by the recent theory of high-dimensional one-way ANOVA. The combination of these tests was used to modify the criteria in existing procedures for selecting the optimal number of clusters. Results obtained in simulation studies illustrated the efficacy of the proposed measure compared to the use of $L_2$ and to other methods specially designed for functional data analysis. Overall, the proposed modification improved the accuracy of the procedures in selecting the correct number of clusters and increased the agreement among the different procedures when compared to $L_2$. The illustration of the proposed method on real datasets also showed higher agreement across procedures when using $D_\lambda$ than when using $L_2$ or the functional methods. In general, the proposed approach combines a measure of parallelism and of mean height difference between curves, and it outperformed the competitors considered in both the simulated and the real data analyses in this paper.

Footnotes

  1. email: adriano.zambom@csun.edu
  2. email: jualacco@gmail.com
  3. email: dias@ime.unicamp.br

References

  1. Abraham, C., Cornillon, P. A., Matzner-Løber, E., and Molinari, N. (2003). Unsupervised curve clustering using B-splines. Scandinavian Journal of Statistics, 30(3):581–595.
  2. Akritas, M. and Papadatos, N. (2004). Heteroscedastic one-way ANOVA and lack-of-fit tests. Journal of the American Statistical Association, 99:368–382.
  3. Baudry, J.-P., Maugis, C., and Michel, B. (2012). Slope heuristics: overview and implementation. Statistics and Computing, 22(2):455–470.
  4. Birgé, L. and Massart, P. (2007). Minimal penalties for Gaussian model selection. Probability Theory and Related Fields, 138(1-2):33–73.
  5. Bouveyron, C., Côme, E., Jacques, J., et al. (2015). The discriminative functional mixture model for a comparative analysis of bike sharing systems. The Annals of Applied Statistics, 9(4):1726–1760.
  6. Bouveyron, C. and Jacques, J. (2011a). Model-based clustering of time series in group-specific functional subspaces. Advances in Data Analysis and Classification, 5(4):281–300.
  7. Bouveyron, C. and Jacques, J. (2011b). Model-based clustering of time series in group-specific functional subspaces. Advances in Data Analysis and Classification, 5(4):281–300.
  8. Caliński, T. and Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics-theory and Methods, 3(1):1–27.
  9. Chakraborty, S. and Das, S. (2018). Simultaneous variable weighting and determining the number of clusters: a weighted Gaussian means algorithm. Statistics & Probability Letters, 137:148–156.
  10. Febrero-Bande, M. and de la Fuente, M. (2012). Statistical computing in functional data analysis: The R package fda.usc. Journal of Statistical Software, 51(4):1–28.
  11. Fischer, A. (2011). On the number of groups in clustering. Statistics & Probability Letters, 81(12):1771–1781.
  12. García, M. L. L., García-Ródenas, R., and Gómez, A. G. (2015). K-means algorithms for functional data. Neurocomputing, 151:231 – 245.
  13. Giacofci, M., Lambert-Lacroix, S., Marot, G., and Picard, F. (2013). Wavelet-based clustering for mixed-effects functional models in high dimension. Biometrics, 69(1):31–40.
  14. Hardy, A. (1996). On the number of clusters. Comput. Stat. Data Anal., 23(1):83–96.
  15. Hartigan, J. (1975). Clustering algorithms. John Wiley & Sons.
  16. Hennig, C., Meila, M., Murtagh, F., and Rocci, R. (2015). Handbook of cluster analysis. CRC Press.
  17. Hyrien, O. and Baran, A. (2016). Fast nonparametric density-based clustering of large datasets using a stochastic approximation mean-shift algorithm. J. Comput. Graph. Statist., 25(3):899–916.
  18. Ieva, F., Paganoni, A. M., Pigoli, D., and Vitelli, V. (2013a). Multivariate functional clustering for the morphological analysis of electrocardiograph curves. Journal of the Royal Statistical Society. Series C (Applied Statistics), 62(3):401–418.
  19. Ieva, F., Paganoni, A. M., Pigoli, D., and Vitelli, V. (2013b). Multivariate functional clustering for the morphological analysis of electrocardiograph curves. Journal of the Royal Statistical Society: Series C (Applied Statistics), 62(3):401–418.
  20. Jacques, J. and Preda, C. (2013). Funclust: A curves clustering method using functional random variables density approximation. Neurocomputing, 112:164–171.
  21. Jacques, J. and Preda, C. (2014). Model-based clustering for multivariate functional data. Computational Statistics & Data Analysis, 71:92 – 106.
  22. James, G. M. and Sugar, C. A. (2003). Clustering for sparsely sampled functional data. Journal of the American Statistical Association, 98(462):397–408.
  23. Kaufman, L. and Rousseeuw, P. J. (2009). Finding groups in data: an introduction to cluster analysis, volume 344. John Wiley & Sons.
  24. Kim, H., Luo, J., Kim, J., Chen, H., and Feuer, E. J. (2014). Clustering of trend data using joinpoint regression models. Statistics in Medicine, 33(23):4087–4103.
  25. Krzanowski, W. J. and Lai, Y. (1985). A criterion for determining the number of groups in a data set. Biometrics, 44:23–34.
  26. Lascio, F. D., Giammusso, D., and Puccetti, G. (2018). A clustering approach and a rule of thumb for risk aggregation. Journal of Banking & Finance, 96:236 – 248.
  27. Ma, P., Castillo-Davis, C. I., Zhong, W., and Liu, J. S. (2006). A data-driven clustering method for time course gene expression data. Nucleic Acids Research, 34(4):1261–1269.
  28. Ma, P. and Zhong, W. (2008). Penalized clustering of large-scale functional data with multiple covariates. Journal of the American Statistical Association, 103(482).
  29. McClelland, R. L. and Kronmal, R. A. (2002). Regression-based variable clustering for data reduction. Statistics in Medicine, 21(6):921–941.
  30. Milligan, G. W. and Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2):159–179.
  31. Nair, G. G., Liu, J. S., Russ, H. A., Tran, S., Saxton, M. S., Chen, R., Juang, C., Li, M.-l., Nguyen, V. Q., Giacometti, S., Puri, S., Xing, Y., Wang, Y., Szot, G. L., Oberholzer, J., Bhushan, A., and Hebrok, M. (2019). Recapitulating endocrine cell clustering in culture promotes maturation of human stem-cell-derived cells. Nature Cell Biology, 21(2):263–274.
  32. Serban, N. and Jiang, H. (2012). Multilevel functional clustering analysis. Biometrics, 68(3):805–814.
  33. Sørlie, T., Tibshirani, R., Parker, J., Hastie, T., Marron, J. S., Nobel, A., Deng, S., Johnsen, H., Pesich, R., Geisler, S., Demeter, J., Perou, C. M., Lønning, P. E., Brown, P. O., Børresen-Dale, A.-L., and Botstein, D. (2003). Repeated observation of breast tumor subtypes in independent gene expression data sets. Proceedings of the National Academy of Sciences, 100(14):8418–8423.
  34. Steinwart, I. (2015). Fully adaptive density-based clustering. Ann. Statist., 43(5):2132–2167.
  35. Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423.
  36. Tokushige, S., Yadohisa, H., and Inada, K. (2007). Crisp and fuzzy k-means clustering algorithms for multivariate functional data. Computational Statistics, 22(1):1–16.
  37. Wang, G., Lin, N., and Zhang, B. (2014). Functional k-means inverse regression. Computational Statistics & Data Analysis, 70:172 – 182.
  38. Wang, X., Smith, K., and Hyndman, R. (2006). Characteristic-based clustering for time series data. Data Mining and Knowledge Discovery, 13(3):335–364.
  39. Yamamoto, M. and Terada, Y. (2014). Functional factorial k-means analysis. Computational Statistics & Data Analysis, 79:133 – 148.
  40. Zambom, A. Z. and Akritas, M. G. (2014). Nonparametric lack-of-fit testing and consistent variable selection. Statistica Sinica, 24:1838–1858.
  41. Zambom, A. Z. and Akritas, M. G. (2015). Nonparametric significance testing and group variable selection. Journal of Multivariate Analysis, 133:51 – 60.
  42. Zambom, A. Z., Collazos, J. A., and Dias, R. (2018). Functional data clustering via hypothesis testing k-means. Computational Statistics, pages 1–23.
  43. Zambom, A. Z. and Kim, S. (2017). Lag selection and model validation in nonparametric autoregressive conditional heteroscedastic models. Journal of Statistical Planning and Inference, 186:13–27.
  44. Zhang, Z., Murtagh, F., Van Poucke, S., Lin, S., and Lan, P. (2017). Hierarchical cluster analysis in clinical research with heterogeneous study population: highlighting its visualization with R. Annals of Translational Medicine, 5(75).