
# Strong Consistency of Reduced K-means Clustering

## Abstract

Reduced $k$-means clustering is a method for clustering objects in a low-dimensional subspace. The advantage of this method is that both the clustering of objects and a low-dimensional subspace reflecting the cluster structure are obtained simultaneously. In this paper, the relationship between conventional $k$-means clustering and reduced $k$-means clustering is discussed. Conditions ensuring almost sure convergence of the estimator of reduced $k$-means clustering as the sample size increases unboundedly are presented. Results for a more general model that covers both conventional $k$-means clustering and reduced $k$-means clustering are also provided. Moreover, a new criterion and its consistent estimator are proposed to determine the optimal dimension number of a subspace, given the number of clusters.

Keywords: clustering, dimension reduction, $k$-means

## 1 Introduction

The aim of cluster analysis is the discovery of a finite number of homogeneous classes from data. In some cases, a cluster structure is considered to lie in a low-dimensional subspace of data, and the following procedure is applied:

Step 1.

Principal component analysis (PCA) is performed, and the first few components are obtained.

Step 2.

Conventional $k$-means clustering is performed for the principal scores on the first few principal components.

This two-step procedure is called “tandem clustering” by Arabie & Hubert (1994) and has been discouraged by several authors (e.g., Arabie & Hubert, 1994; Chang, 1983; De Soete & Carroll, 1994). Because the first few principal components of PCA do not necessarily reflect the cluster structure in data, the appropriate clustering result may not be obtained by using the tandem clustering approach. Figure 1 shows that the first two principal components do not reflect the cluster structure, and the clustering result of the tandem clustering is incorrect.
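The failure mode of tandem clustering can be reproduced on synthetic data. The following is a minimal NumPy sketch under stated assumptions: the two-cluster data set, the helper names `first_pc` and `kmeans_labels`, and all constants are illustrative and are not the data behind Figure 1. The cluster structure is placed on a low-variance axis, so the first principal component follows the high-variance noise axis instead.

```python
import numpy as np

def first_pc(X):
    """Unit vector of the first principal component of X (via SVD)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

def kmeans_labels(Y, k, n_iter=50, seed=0):
    """Plain Lloyd iterations; returns the cluster label of each row of Y."""
    rng = np.random.default_rng(seed)
    centers = Y[rng.choice(len(Y), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = Y[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(0)
n = 200
true_labels = np.repeat([0, 1], n // 2)
# Axis 0 carries the cluster structure; axis 1 is noise with larger variance.
signal = np.where(true_labels == 0, -2.0, 2.0) + 0.3 * rng.standard_normal(n)
noise = 10.0 * rng.standard_normal(n)
X = np.column_stack([signal, noise])

# Tandem clustering: Step 1 (PCA scores), then Step 2 (k-means on the scores).
pc1 = first_pc(X)
scores = ((X - X.mean(axis=0)) @ pc1)[:, None]
tandem_labels = kmeans_labels(scores, k=2)

# Agreement with the true partition, up to relabeling of the two clusters.
match = np.mean(tandem_labels == true_labels)
agreement = max(match, 1.0 - match)
print("first PC direction:", pc1, "agreement:", agreement)
```

In this construction the first principal component is almost parallel to the noise axis, so the tandem partition agrees with the true one only at roughly chance level.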

De Soete & Carroll (1994) proposed reduced $k$-means (RKM) clustering. RKM clustering simultaneously determines the clusters of objects on the basis of the $k$-means criterion and the subspace that is informative about the cluster structure in data on the basis of component analysis. In other words, for given data points $x_1,\dots,x_n$ in $\mathbb{R}^p$, a fixed number of clusters $k$, and a fixed subspace dimension $q$, RKM clustering is defined as the minimization of the following loss function:

$$\mathrm{RKM}_n := \frac{1}{n}\sum_{i=1}^{n}\min_{1\le j\le k}\|x_i - Af_j\|^2, \tag{1}$$

where $f_j\in\mathbb{R}^q$ ($j=1,\dots,k$) and $A$ is a $p\times q$ columnwise orthonormal matrix. For some clustering methods related to $k$-means clustering, several authors have discussed their statistical properties (e.g., Abraham et al., 2003; García-Escudero et al., 1999; Pollard, 1981; Pollard, 1982; von Luxburg et al., 2008). However, because RKM clustering was proposed in the framework of descriptive statistics, its statistical properties have not been discussed. When the data points are independently drawn from a population distribution $P$, the objective function is rewritten as

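$$\mathrm{RKM}(F,A,P_n) := \int \min_{f\in F}\|x-Af\|^2\,P_n(dx),$$

where $F$ is a set containing $k$ or fewer points in $\mathbb{R}^q$, and $P_n$ is the empirical measure obtained from the data. For each fixed $F$ and $A$, the strong law of large numbers (SLLN) shows that

$$\lim_{n\to\infty}\mathrm{RKM}(F,A,P_n) = \mathrm{RKM}(F,A,P) := \int \min_{f\in F}\|x-Af\|^2\,P(dx)\quad\text{a.s.}$$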

Thus, we wish to ensure that the global minimizer of $\mathrm{RKM}(\cdot,\cdot,P_n)$ converges almost surely to the global minimizers of $\mathrm{RKM}(\cdot,\cdot,P)$, which we call the population global minimizers.
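The pointwise convergence for fixed $F$ and $A$ can be illustrated numerically. The sketch below assumes a standard normal population on $\mathbb{R}^2$, $q=1$, $A=(1,0)^T$, and $F=\{-1,+1\}$; these choices and the helper name `rkm_loss` are illustrative assumptions. Under them the minimum over the two centers is $x_2^2+(|x_1|-1)^2$, so the population loss has the closed form $3-2\sqrt{2/\pi}$.

```python
import numpy as np

def rkm_loss(X, F, A):
    """Empirical RKM loss: mean over rows x of min_j ||x - A f_j||^2."""
    centers = F @ A.T  # (k, p): row j is the center A f_j mapped into R^p
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

# Fixed parameters (assumed for illustration): p = 2, q = 1, k = 2.
A = np.array([[1.0], [0.0]])   # p x q, orthonormal column
F = np.array([[-1.0], [1.0]])  # k x q, two centers on the line

# Closed form for x ~ N(0, I_2): E[x_2^2] + E[(|x_1| - 1)^2].
population_loss = 3.0 - 2.0 * np.sqrt(2.0 / np.pi)

rng = np.random.default_rng(0)
X = rng.standard_normal((200_000, 2))
for n in (100, 10_000, 200_000):
    print(n, rkm_loss(X[:n], F, A), population_loss)
```

The printed empirical losses approach the population value as $n$ grows. This is convergence for one fixed parameter only; the consistency result requires the uniform version over the parameter space.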

In this paper, the strong consistency of RKM under i.i.d. sampling is proven. For this purpose, the framework of the proof of the strong consistency of $k$-means clustering proposed by Pollard (1981) is used; in this framework, the existence and uniqueness of the population global minimizers are assumed for consistency, and conditions for the existence of the global minimizers are not discussed. For RKM clustering, the uniqueness of the population global minimizers cannot be assumed because RKM clustering has rotational indeterminacy. Therefore, a sufficient condition for the existence of the population global minimizers must be derived; it is also necessary to establish that the distance between the sample estimator and the set of global minimizers converges almost surely to zero as the sample size approaches infinity.

This paper is organized as follows. In Section 2, the original algorithm of RKM clustering and the visualization of the result are described. Then, the relationship between the conventional $k$-means clustering method and RKM clustering is presented. The notation and some properties of RKM, including the rotational indeterminacy, are introduced in Section 3. The uniform SLLN and the continuity of the objective function of RKM clustering are presented in Section 4. In Section 5, conditions for the existence of the population global minimizers are determined, and a theorem regarding the strong consistency of RKM clustering is stated. In Section 6, the main proof of the consistency theorem is explained. In Section 7, a new criterion and its consistent estimator are proposed to determine the optimal dimension number of a subspace, given the number of clusters. Moreover, the effectiveness of the criterion is illustrated through numerical experiments.

## 2 Reduced k-means clustering

### 2.1 Algorithm and visualization of reduced k-means clustering

Let $X=(x_1,\dots,x_n)^T$ be an $n\times p$ data matrix and $x_i$ ($i=1,\dots,n$) be the row vectors of $X$, where $n$ is the number of objects and $p$ is the number of variables. The number of clusters and the number of components to which the variables are reduced are denoted by $k$ and $q$, respectively. RKM clustering is defined as the minimization of the following criterion:

$$\mathrm{RKM}_n(A,F,U\mid k,q) := \|X-UFA^T\|_F^2 = \sum_{i=1}^{n}\min_{1\le j\le k}\|x_i-Af_j\|^2, \tag{2}$$

where $\|\cdot\|$ and $\|\cdot\|_F$ denote the usual Euclidean norm and the Frobenius norm, respectively, $U=(u_{ij})$ is an $n\times k$ binary membership matrix that specifies the cluster membership of each object, $A$ is a $p\times q$ column-wise orthonormal loading matrix, $F=(f_1,\dots,f_k)^T$ is a $k\times q$ centroid matrix, and $f_j$ is the centroid of the $j$th cluster for each $j=1,\dots,k$. For example, this problem can be solved by the following alternating least squares algorithm:

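Step 1.

First, initial values are chosen for $F$ and $U$.

Step 2.

The matrix $(UF)^TX$ is expressed by its singular value decomposition as $P\Lambda Q^T$, where $P$ is a $q\times q$ orthonormal matrix, $\Lambda$ is a $q\times q$ diagonal matrix, and $Q$ is a $p\times q$ columnwise orthonormal matrix. $A$ is updated by $QP^T$.

Step 3.

For each $i=1,\dots,n$ and each $j=1,\dots,k$, we update $u_{ij}$ by

$$u_{ij}=\begin{cases}1 & \text{if } \|A^Tx_i-f_j\|^2<\|A^Tx_i-f_{j'}\|^2 \text{ for each } j'\neq j,\\ 0 & \text{otherwise.}\end{cases}$$

Step 4.

$F$ is updated using $(U^TU)^{-1}U^TXA$.

Step 5.

Finally, the value of the function $\mathrm{RKM}_n$ for the present values of $A$, $F$, and $U$ is computed. When the present values have decreased the function value, $A$, $U$, and $F$ are updated again in accordance with Steps 2 to 4. Otherwise, the algorithm has converged.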
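The alternating least squares steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `reduced_kmeans`, the random initialization, the handling of empty clusters (the old centroid is kept), the stopping tolerance, and the demo data are all choices made here. The loading update is written as the orthogonal Procrustes solution obtained from the SVD of $(UF)^TX$.

```python
import numpy as np

def reduced_kmeans(X, k, q, n_iter=100, tol=1e-10, seed=0):
    """Alternating least squares sketch for min ||X - U F A^T||_F^2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Initialization: random memberships (the first k objects seed the k
    # clusters so none starts empty) and random low-dimensional centroids.
    labels = rng.integers(0, k, size=n)
    labels[:k] = np.arange(k)
    U = np.eye(k)[labels]             # n x k binary membership matrix
    F = rng.standard_normal((k, q))   # k x q centroid matrix
    history = []
    for _ in range(n_iter):
        # Loading update: SVD (UF)^T X = P S Q^T, then A = Q P^T.
        P, _, Qt = np.linalg.svd((U @ F).T @ X, full_matrices=False)
        A = Qt.T @ P.T                # p x q with orthonormal columns
        # Membership update: nearest centroid in the reduced space.
        Y = X @ A
        d2 = ((Y[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        U = np.eye(k)[labels]
        # Centroid update: cluster means of the projected points.
        counts = U.sum(axis=0)
        nonempty = counts > 0
        F[nonempty] = (U.T @ Y)[nonempty] / counts[nonempty][:, None]
        # Stop once the criterion no longer decreases.
        loss = float(np.linalg.norm(X - U @ F @ A.T, "fro") ** 2)
        history.append(loss)
        if len(history) > 1 and history[-2] - loss <= tol:
            break
    return A, F, U, history

# Demo data (assumed): three clusters living in a 2-D subspace of R^5.
rng = np.random.default_rng(1)
true_centers = np.array([[5.0, 0.0], [-5.0, 0.0], [0.0, 5.0]])
basis, _ = np.linalg.qr(rng.standard_normal((5, 2)))
labels_true = np.repeat(np.arange(3), 40)
X_demo = true_centers[labels_true] @ basis.T + 0.5 * rng.standard_normal((120, 5))

A_hat, F_hat, U_hat, history = reduced_kmeans(X_demo, k=3, q=2)
print("criterion values:", history)
```

Each of the three updates minimizes the criterion with the other two blocks held fixed, so the recorded criterion values are non-increasing. A single run may still stop at a local minimum, so in practice the routine would be restarted from several seeds and the best run kept.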

Other formulations and algorithms for RKM clustering have been presented by De Soete & Carroll (1994) and Timmerman et al. (2010).

The algorithms for RKM clustering monotonically decrease the function $\mathrm{RKM}_n$. Because $\mathrm{RKM}_n$ is non-negative and hence bounded below, the solution for each iteration converges to a local minimum point. Because of the binary constraint on $U$, the solutions of these algorithms are often only local minima. To mitigate this, many random starts should be used.

The objective function can be decomposed into two terms:

$$\mathrm{RKM}_n(A,F,U\mid k,q)=\|X-XAA^T\|_F^2+\|XA-UF\|_F^2. \tag{3}$$

The first term of equation (3) is the objective function of PCA, and the second term is the $k$-means criterion in a low-dimensional subspace. Thus, for optimal solutions $\hat A$, $\hat F$, and $\hat U$, we have $\hat F=(\hat U^T\hat U)^{-1}\hat U^TX\hat A$. Using the optimal solutions $\hat A$, $\hat F$, and $\hat U$, the low-dimensional representation of the objects and cluster centers can be obtained:

$$Y:=X\hat A \quad\text{and}\quad G:=(\hat U^T\hat U)^{-1}\hat U^TY. \tag{4}$$

Using $Y$ and $G$, a biplot reflecting the cluster structure can be presented. Figure 2 shows the biplot of the RKM clustering for the same data as that used in Figure 1.
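The decomposition of the objective into a PCA term and a $k$-means term holds for any column-wise orthonormal $A$, any membership matrix $U$, and any centroid matrix $F$, because the cross term vanishes. The following sketch checks this numerically on random inputs and computes the low-dimensional coordinates $Y$ and cluster centers $G$; the random data and the arbitrary (non-optimal) parameters are assumptions made for the check only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, q = 50, 6, 3, 2
X = rng.standard_normal((n, p))

# Arbitrary (not optimized) parameters, used only to check the identity.
A, _ = np.linalg.qr(rng.standard_normal((p, q)))  # p x q, orthonormal columns
labels = np.arange(n) % k                          # every cluster is non-empty
U = np.eye(k)[labels]                              # n x k binary memberships
F = rng.standard_normal((k, q))                    # k x q centroids

total = np.linalg.norm(X - U @ F @ A.T, "fro") ** 2
pca_term = np.linalg.norm(X - X @ A @ A.T, "fro") ** 2
kmeans_term = np.linalg.norm(X @ A - U @ F, "fro") ** 2
print(total, pca_term + kmeans_term)

# Low-dimensional object coordinates and cluster centers for a biplot.
Y = X @ A
G = np.linalg.solve(U.T @ U, U.T @ Y)  # equals (U^T U)^{-1} U^T Y
```

Since $U^TU$ is the diagonal matrix of cluster sizes, each row of `G` is simply the mean of the projected points in that cluster, and replacing `F` by `G` can only lower the second term.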

### 2.2 The relationship between the conventional k-means and the RKM clusterings

The objective function of the conventional $k$-means clustering method is given by

$$\mathrm{KM}_n(C,U\mid k):=\|X-UC\|_F^2, \tag{5}$$

where $C$ is a $k\times p$ cluster center matrix. $C$ is expressed by its singular value decomposition as $P\Sigma Q^T$, where $P$ is an orthonormal matrix, $\Sigma$ is a diagonal matrix, and $Q$ is a column-wise orthonormal matrix. The function $\mathrm{KM}_n$ can be expressed as

$$\|X-UC\|_F^2=\|X-UP\Sigma Q^T\|_F^2.$$

Considering $P\Sigma$ and $Q$ as a low-dimensional centroid matrix $F$ and a loading matrix $A$, respectively, the function $\mathrm{KM}_n$ is equivalent to the objective function of RKM, $\mathrm{RKM}_n$. Thus, RKM clustering includes the conventional $k$-means clustering analysis as a special case.
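The reparameterization can be checked directly. In the sketch below, a random center matrix $C$ is factored by a thin SVD, the product of the left singular vectors and singular values plays the role of the centroid matrix, and the right singular vectors play the role of the loading matrix. The random inputs and the variable names are assumptions for illustration; the reduced dimension here is $\min(k,p)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 40, 5, 3
X = rng.standard_normal((n, p))
U = np.eye(k)[np.arange(n) % k]    # n x k binary membership matrix
C = rng.standard_normal((k, p))    # k x p cluster centers in the full space

# Thin SVD: C = P diag(s) Q^T with P (k x r), Q (p x r), r = min(k, p).
P, s, Qt = np.linalg.svd(C, full_matrices=False)
F = P * s    # same as P @ np.diag(s): low-dimensional centroids, k x r
A = Qt.T     # loading matrix with orthonormal columns, p x r

km_value = np.linalg.norm(X - U @ C, "fro") ** 2
rkm_value = np.linalg.norm(X - U @ F @ A.T, "fro") ** 2
print(km_value, rkm_value)
```

The two criterion values coincide up to floating-point error, since $FA^T$ reproduces $C$ exactly.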

## 3 Preliminaries

Let $X_1,\dots,X_n$ be independent random variables, defined on a common probability space, with a common population distribution $P$ on $\mathbb{R}^p$; let $P_n$ be the empirical measure based on $X_1,\dots,X_n$. The parameter space, denoted by $\Xi_k$, consists of all pairs $(F,A)$ in which $F\subset\mathbb{R}^q$ contains $k$ or fewer points and $A$ is a $p\times q$ column-wise orthonormal matrix. $B_q(M)$ denotes the $q$-dimensional closed ball of radius $M$ centered at the origin. For each $M>0$, $\Theta_k^*(M)$ denotes the set of pairs $(F,A)\in\Xi_k$ with $F\subset B_q(M)$. Let $\phi$ be a non-negative non-decreasing function and $Q$ be a probability measure on $\mathbb{R}^p$. For each finite subset $F\subset\mathbb{R}^q$ and each column-wise orthonormal matrix $A$, the loss function of RKM with $\phi$ is defined by

$$\Phi(F,A,Q):=\int\min_{f\in F}\phi(\|x-Af\|)\,Q(dx).$$

Write

$$m_k(Q):=\inf_{(F,A)\in\Xi_k}\Phi(F,A,Q)\quad\text{and}\quad m_k^*(Q\mid M):=\inf_{(F,A)\in\Theta_k^*(M)}\Phi(F,A,Q).$$

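For $\theta=(F,A)\in\Xi_k$, both descriptions $\Phi(\theta,Q)$ and $\Phi(F,A,Q)$ are used. In addition, $\Theta'$ and $\Theta'_n$ denote the sets of parameters in $\Xi_k$ achieving $m_k(P)$ and $m_k(P_n)$, respectively. For each $M>0$, $\Theta^*$ and $\Theta^*_n$ denote the sets of parameters in $\Theta_k^*(M)$ achieving $m_k^*(P\mid M)$ and $m_k^*(P_n\mid M)$, respectively. The parameters and are used to emphasize that and are dependent on the index . One of the measurable estimators in $\Theta'_n$ will be denoted by $\hat\theta_n$ or $(\hat F_n,\hat A_n)$. Similarly, one of the measurable estimators in $\Theta^*_n$ will be denoted by $\hat\theta^*_n$ or $(\hat F^*_n,\hat A^*_n)$. For the existence of measurable estimators, see Section 6.7 of Pfanzagl (1996).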

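Distances between two matrices are measured by the Frobenius norm, and distances between finite sets of centers by the Hausdorff distance $d_H$, which is defined for finite subsets $A,B\subset\mathbb{R}^q$ as

$$d_H(A,B):=\max_{a\in A}\Big\{\min_{b\in B}\|a-b\|\Big\}.$$

Moreover, let $d$ be the product distance built from these two distances. In this paper, the distance between $\hat\theta_n$ and $\Theta'$ is defined as

$$d(\hat\theta_n,\Theta'):=\inf\{d(\hat\theta_n,\theta)\mid\theta\in\Theta'\}.$$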
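The set distance displayed above can be computed directly for small finite sets. The sketch below implements the formula exactly as written, that is, the maximum over points of the first set of the distance to the nearest point of the second set; the function name `d_hausdorff` and the example sets are assumptions for illustration.

```python
import numpy as np

def d_hausdorff(A, B):
    """max over a in A of min over b in B of ||a - b||, for finite point sets."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    # pairwise[i, j] = Euclidean distance between A[i] and B[j]
    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return pairwise.min(axis=1).max()

centers_a = [[0.0, 0.0], [1.0, 0.0]]
centers_b = [[0.0, 0.0]]
print(d_hausdorff(centers_a, centers_b))  # 1.0
print(d_hausdorff(centers_b, centers_a))  # 0.0
```

As the two printed values show, the formula as displayed depends on the order of its arguments; a symmetric variant would take the larger of the two directed values.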

To clarify the minimization procedures, the function $\phi$ must satisfy some regularity conditions. As proposed by Pollard (1981), it is assumed that $\phi$ is continuous and $\phi(0)=0$. Moreover, to control the growth of $\phi$, it is assumed that

$$\exists\lambda>0;\ \forall r>0;\ \phi(2r)\le\lambda\phi(r).$$

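For each $f\in\mathbb{R}^q$ and each column-wise orthonormal matrix $A$,

$$\begin{aligned}
\int\phi(\|x-Af\|)\,P(dx) &\le\int\phi(\|x\|+\|Af\|)\,P(dx)=\int\phi(\|x\|+\|f\|)\,P(dx)\\
&\le\int_{\|f\|>\|x\|}\phi(2\|f\|)\,P(dx)+\int_{\|f\|\le\|x\|}\phi(2\|x\|)\,P(dx)\\
&\le\phi(2\|f\|)+\lambda\int\phi(\|x\|)\,P(dx).
\end{aligned}$$

Therefore, as long as $\int\phi(\|x\|)\,P(dx)$ is finite, $\Phi(F,A,P)$ is also finite for each $F$ and each $A$.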

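Let $R$ be a $q\times q$ orthonormal matrix, i.e., $R^TR=I_q$. For each $f\in\mathbb{R}^q$ and each column-wise orthonormal matrix $A$,

$$\int\phi(\|x-Af\|)\,P(dx)=\int\phi(\|x-AR^TRf\|)\,P(dx).$$

It follows that $\Theta'$ is not a singleton when $\Theta'\neq\emptyset$, thus suggesting that RKM clustering has rotational indeterminacy.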
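The rotational indeterminacy can be seen numerically by replacing $A$ with $AR^T$ and each center $f$ with $Rf$. The sketch below evaluates the loss under an empirical measure with $\phi(r)=r^2$; the sample, the rotation angle, and the helper name `phi_loss` are assumptions for illustration.

```python
import numpy as np

def phi_loss(X, F, A, phi=lambda r: r ** 2):
    """Empirical loss: mean over rows x of min_j phi(||x - A f_j||)."""
    centers = F @ A.T  # (k, p): row j is A f_j
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return phi(dist).min(axis=1).mean()

rng = np.random.default_rng(0)
p, q, k = 4, 2, 3
X = rng.standard_normal((500, p))
A, _ = np.linalg.qr(rng.standard_normal((p, q)))  # p x q, orthonormal columns
F = rng.standard_normal((k, q))                    # rows are the centers f_j

angle = 0.7
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])    # q x q rotation, R^T R = I

A_rot = A @ R.T   # loading matrix A R^T
F_rot = F @ R.T   # rows are the rotated centers R f_j

print(phi_loss(X, F, A), phi_loss(X, F_rot, A_rot))
```

The two parameter pairs differ, yet the losses agree because $AR^TRf=Af$ for every center. This is why convergence is measured to the set of minimizers rather than to a single point.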

## 4 The uniform SLLN and the continuity of Φ(⋅,⋅,P)

###### Proposition 1.

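Let $M>0$ be an arbitrary number. Let $\mathcal{G}$ denote the class of all $P$-integrable functions on $\mathbb{R}^p$ of the form

$$g_{(F,A)}(x):=\min_{f\in F}\phi(\|x-Af\|),$$

where $(F,A)$ takes all values over $\Theta_k^*(M)$. Suppose that $\int\phi(\|x\|)\,P(dx)<\infty$. Then,

$$\lim_{n\to\infty}\sup_{g\in\mathcal{G}}\left|\int g(x)\,P_n(dx)-\int g(x)\,P(dx)\right|=0\quad\text{a.s.} \tag{6}$$

###### Proof.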

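DeHardt (1971) provided a sufficient condition for the uniform SLLN (6): for all $\epsilon>0$, there exists a finite class of functions $\mathcal{G}_\epsilon$ such that for each $g\in\mathcal{G}$, there exist $\dot g$ and $\bar g$ in $\mathcal{G}_\epsilon$ with $\dot g\le g\le\bar g$ and $\int\bar g(x)\,P(dx)-\int\dot g(x)\,P(dx)<\epsilon$.

An arbitrary $\epsilon>0$ is selected, and $S_{p\times q}(\sqrt q)$ denotes the surface of the sphere in $\mathbb{R}^{p\times q}$ of radius $\sqrt q$ centered at the origin. Note that every $p\times q$ column-wise orthonormal matrix has Frobenius norm $\sqrt q$ and hence lies on this sphere. To find such a finite class $\mathcal{G}_\epsilon$, $D_{\delta_1}$ is defined as a finite subset of $B_q(M)$ satisfying

$$\forall f\in B_q(M);\ \exists g\in D_{\delta_1};\ \|f-g\|<\delta_1$$

and $A_{p\times q,\delta_2}$ as a finite subset of $S_{p\times q}(\sqrt q)$ satisfying

$$\forall A\in S_{p\times q}(\sqrt q);\ \exists B\in A_{p\times q,\delta_2};\ \|A-B\|_F<\delta_2.$$

Take $\mathcal{G}_\epsilon$ as the finite class of functions of the form

$$\min_{f\in F'}\phi(\|x-A'f\|+\sqrt q\,\delta_1+M\delta_2)\quad\text{or}\quad\min_{f\in F'}\phi(\|x-A'f\|-\sqrt q\,\delta_1-M\delta_2),$$

where $F'$ ranges over the subsets of $D_{\delta_1}$ containing $k$ or fewer points, $A'$ ranges over $A_{p\times q,\delta_2}$, and $\phi(r)$ is defined as zero for all negative $r$.

For given $F=\{f_1,\dots,f_k\}$ and $A$, there exist $f_i'\in D_{\delta_1}$ with $\|f_i-f_i'\|<\delta_1$ for each $i$ and $A'\in A_{p\times q,\delta_2}$ with $\|A-A'\|_F<\delta_2$. Write $F'=\{f_1',\dots,f_k'\}$. Corresponding to each $g_{(F,A)}\in\mathcal{G}$, choose

$$\bar g_{(F,A)}:=\min_{f\in F'}\phi(\|x-A'f\|+\sqrt q\,\delta_1+M\delta_2)$$

and

$$\dot g_{(F,A)}:=\min_{f\in F'}\phi(\|x-A'f\|-\sqrt q\,\delta_1-M\delta_2).$$

Because $\phi$ is a monotone function and

$$\|x-A'f_i'\|-\sqrt q\,\delta_1-M\delta_2\le\|x-Af_i\|\le\|x-A'f_i'\|+\sqrt q\,\delta_1+M\delta_2$$

for each $x\in\mathbb{R}^p$ and each $i$, these functions ensure that $\dot g_{(F,A)}\le g_{(F,A)}\le\bar g_{(F,A)}$.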

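If $R$ is chosen sufficiently large,

$$\begin{aligned}
\int\big[\bar g_{(F,A)}(x)-\dot g_{(F,A)}(x)\big]P(dx)
&\le\int\sum_{i=1}^{k}\big[\phi(\|x-A'f_i'\|+\sqrt q\,\delta_1+M\delta_2)-\phi(\|x-A'f_i'\|-\sqrt q\,\delta_1-M\delta_2)\big]P(dx)\\
&\le k\sup_{\|x\|\le R}\ \sup_{f\in B_q(5M)}\ \sup_{A\in S_{p\times q}(\sqrt q)}\big[\phi(\|x-Af\|+\sqrt q\,\delta_1+M\delta_2)-\phi(\|x-Af\|-\sqrt q\,\delta_1-M\delta_2)\big]\\
&\quad+2k\lambda\int_{\|x\|\ge R}\phi(\|x\|)\,P(dx).
\end{aligned}$$

The second term is less than $\epsilon/2$ if $R$ is sufficiently large. Moreover, because $\phi$ is uniformly continuous on a bounded set, the first term can be made less than $\epsilon/2$ if $\delta_1$ and $\delta_2$ are sufficiently small. Thus, the uniform SLLN is proven. ∎

Similarly, the continuity of $\Phi(\cdot,\cdot,P)$ on $\Theta_k^*(M)$ can be proven.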

###### Proposition 2.

Let $M>0$ be an arbitrary number. Suppose that $\int\phi(\|x\|)\,P(dx)<\infty$. Then, $\Phi(\cdot,\cdot,P)$ is continuous on $\Theta_k^*(M)$.

###### Proof.

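If $(F,A),(G,B)\in\Theta_k^*(M)$ are selected such that the Hausdorff distance between $F$ and $G$ is less than $\delta_1$ and $\|A-B\|_F<\delta_2$, then for each $g\in G$, there exists $f(g)\in F$ with $\|f(g)-g\|<\delta_1$, and furthermore,

$$\begin{aligned}
\Phi(F,A,P)-\Phi(G,B,P)
&=\int\Big[\min_{f\in F}\phi(\|x-Af\|)-\min_{g\in G}\phi(\|x-Bg\|)\Big]P(dx)\\
&\le\int\max_{g\in G}\big[\phi(\|x-Af(g)\|)-\phi(\|x-Bg\|)\big]P(dx)\\
&\le\int\sum_{g\in G}\big[\phi(\|x-Bg\|+M\delta_2+\delta_1)-\phi(\|x-Bg\|)\big]P(dx)\\
&\le k\sup_{\|x\|\le R}\max_{g\in G}\big[\phi(\|x-Bg\|+M\delta_2+\delta_1)-\phi(\|x-Bg\|)\big]+2k\lambda\int_{\|x\|\ge R}\phi(\|x\|)\,P(dx).
\end{aligned} \tag{7}$$

When a sufficiently large $R$ and sufficiently small $\delta_1$ and $\delta_2$ are selected, the last bound is less than $\epsilon$. For each $f\in F$, there also exists $g(f)\in G$ with $\|f-g(f)\|<\delta_1$. Therefore, the other inequality necessary for the continuity is obtained by interchanging $(F,A)$ and $(G,B)$ in inequality (7). ∎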

## 5 The consistency theorem

### 5.1 The existence of the population global optimizers

The aim of this paper is to prove that, for a fixed measure $P$ satisfying some natural assumptions, the infimum distance between the (measurable) estimator $\hat\theta_n$ with $\Phi(\hat\theta_n,P_n)=m_k(P_n)$ and the parameters achieving $m_k(P)$ converges almost surely to $0$ as the sample size goes to infinity. However, there may be no such parameters. Thus, before the consistency theorem is stated, a sufficient condition for the existence of parameters achieving $m_k(P)$ in $\Xi_k$ is provided. The following proposition ensures the existence of such parameters. The proof and some details about the proposition are given in Appendix A.

###### Proposition 3.

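Suppose that $\int\phi(\|x\|)\,P(dx)<\infty$ and that $m_j(P)>m_k(P)$ for $j=1,2,\dots,k-1$. Then, $\Theta'\neq\emptyset$.

From Lemma 4 in Appendix A, there exists $M>0$ such that $F\subset B_q(5M)$ for all $(F,A)\in\Theta'$. Moreover, under the assumption of Proposition 3, the following identification condition can be proven:

$$\inf_{\theta\in\Theta_k^*(5M):\,d(\theta,\Theta')\ge\epsilon}\Phi(\theta,P)>\inf_{\theta\in\Theta'}\Phi(\theta,P)\quad\text{for all }\epsilon>0.$$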

The proof of the identification condition is also given in Appendix A. The identification condition is used in Section 6.

### 5.2 Strong consistency of reduced k-means clusterings

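If the parameter space is $\Theta_k^*(M)$, the strong consistency of RKM clustering can be proven. Note that since $\Theta_k^*(M)$ is compact, we have $\Theta^*\neq\emptyset$ and the identification condition:

$$\inf_{\theta\in\Theta^*_\epsilon(M)}\Phi(\theta,P)>\inf_{\theta\in\Theta^*}\Phi(\theta,P)\quad\text{for all }\epsilon>0,$$

where $\Theta^*_\epsilon(M):=\{\theta\in\Theta_k^*(M)\mid d(\theta,\Theta^*)\ge\epsilon\}$.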

###### Proposition 4.

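Suppose that $\int\phi(\|x\|)\,P(dx)<\infty$. Then, for each $M>0$,

$$\lim_{n\to\infty}d(\hat\theta^*_n,\Theta^*)=0\quad\text{a.s.},\quad\text{and}\quad\lim_{n\to\infty}m_k^*(P_n\mid M)=m_k^*(P\mid M)\quad\text{a.s.}$$

###### Proof.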

Given the uniform SLLN and the continuity of $\Phi(\cdot,\cdot,P)$, this proposition is proven by an argument similar to the proof of the following consistency theorem. ∎

In the study by Pollard (1981), the uniqueness of the parameter is also assumed for the strong consistency theorem. As discussed in Section 3, we cannot assume the uniqueness condition. Thus, the condition that $m_j(P)>m_k(P)$ for $j=1,2,\dots,k-1$ is assumed instead of the uniqueness condition.

This condition is equivalent to the distinctness condition that $F$ has $k$ distinct points for all $(F,A)\in\Theta'$. Indeed, suppose that there exists $(F,A)\in\Theta'$ such that $F$ has $k-1$ or fewer distinct points. Then, $m_{k-1}(P)\le\Phi(F,A,P)=m_k(P)$, which contradicts $m_{k-1}(P)>m_k(P)$. Thus, the condition that $m_j(P)>m_k(P)$ for $j=1,2,\dots,k-1$ implies the distinctness condition. Moreover, this condition is equivalent to $m_{k-1}(P)>m_k(P)$ since $m_j(P)\ge m_{k-1}(P)$ for each $j$ satisfying $j\le k-1$.

The following main theorem gives the sufficient condition for the strong consistency of the estimator of RKM clustering.

###### Theorem 1.

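Suppose that $\int\phi(\|x\|)\,P(dx)<\infty$ and that $m_j(P)>m_k(P)$ for $j=1,2,\dots,k-1$. Then, $\Theta'\neq\emptyset$,

$$\lim_{n\to\infty}d(\hat\theta_n,\Theta')=0\quad\text{a.s.},\quad\text{and}\quad\lim_{n\to\infty}m_k(P_n)=m_k(P)\quad\text{a.s.}$$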

## 6 Proof of Theorem 1

Because almost sure convergence is dealt with, null sets of elements $\omega$ exist for which the convergence does not hold. Hereafter, the sample space is understood as the set obtained by removing a suitable null set. In the first step of the proof, it is shown that, when $n$ is sufficiently large, the estimators of the cluster centers are contained within a compact ball that does not depend on $n$. For convenience, it is assumed that $\phi(r)\to\infty$ as $r\to\infty$. When $\phi$ is bounded, the proof is slightly more complicated.

First, we prove the following lemma.

###### Lemma 1.

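Suppose that $\int\phi(\|x\|)\,P(dx)<\infty$. Then, there exists $M>0$ such that

$$P\left(\bigcup_{n=1}^{\infty}\bigcap_{m=n}^{\infty}\big\{\omega\mid\forall(F_m,A_m)\in\Theta'_m;\ F_m(\omega)\cap B_q(M)\neq\emptyset\big\}\right)=1.$$

###### Proof.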

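Select an appropriate value $r>0$ such that the ball $B_p(r)$ has positive measure, i.e., $P(B_p(r))>0$. Let $M$ be sufficiently large to satisfy $M>r$ and

$$\phi(M-r)\,P(B_p(r))>\int\phi(\|x\|)\,P(dx). \tag{8}$$

From the definition of $\Theta'_n$, $\Phi(F_n,A_n,P_n)\le\Phi(F,A,P_n)$ for all $(F_n,A_n)\in\Theta'_n$, any set $F$ containing at most $k$ points, and any $A$. The set $F_0$ is chosen such that it consists only of the origin. Then, by the SLLN,

$$\Phi(F_0,A,P_n)=\int\phi(\|x\|)\,P_n(dx)\to\int\phi(\|x\|)\,P(dx)\quad\text{a.s.},$$

for each $A$.

Consider the complementary event, on which infinitely many $\Theta'_m$ contain some $(F_m,A_m)$ with $F_m(\omega)\cap B_q(M)=\emptyset$. By the axiom of choice, for an arbitrary $\omega$ in this event there exists a subsequence $\{n_l\}$ such that $(F_{n_l},A_{n_l})\in\Theta'_{n_l}$ and $F_{n_l}(\omega)\cap B_q(M)=\emptyset$. Thus,

$$\begin{aligned}
\limsup_{l}\Phi(F_{n_l},A_{n_l},P_{n_l})
&\ge\limsup_{l}\frac{1}{n_l}\sum_{i\in\{i\mid X_i\in B_p(r)\}}\min_{1\le j\le k}\phi(\|X_i-A_{n_l}f_j\|)\\
&\ge\limsup_{l}\frac{1}{n_l}\sum_{i\in\{i\mid X_i\in B_p(r)\}}\phi(M-r)\\
&=\phi(M-r)\limsup_{l}P_{n_l}(B_p(r))=\phi(M-r)\,P(B_p(r)).
\end{aligned}$$

On the other hand, $\limsup_{l}\Phi(F_{n_l},A_{n_l},P_{n_l})\le\int\phi(\|x\|)\,P(dx)$ because $\Phi(F_{n_l},A_{n_l},P_{n_l})\le\Phi(F_0,A,P_{n_l})$. Therefore, we have $\phi(M-r)\,P(B_p(r))\le\int\phi(\|x\|)\,P(dx)$, which contradicts (8). Therefore, the complementary event has probability zero, that is,

$$P\left(\bigcup_{n=1}^{\infty}\bigcap_{m=n}^{\infty}\big\{\omega\mid\forall(F_m,A_m)\in\Theta'_m;\ F_m(\omega)\cap B_q(M)\neq\emptyset\big\}\right)=1.$$

Without loss of generality, all $F_n$ can be assumed to contain at least one point of $B_q(M)$ when $n$ is sufficiently large. The next lemma shows that for sufficiently large $n$, there exists $M>0$ such that the closed ball $B_q(5M)$ contains all estimators of centers. When $k=1$, the next lemma is obviously satisfied.

From the results in Section 4 and using the same arguments as in the final part of this section, the conclusions of the theorem are proven when $k=1$.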

###### Lemma 2.

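Under the assumption of the theorem, there exists $M>0$ such that

$$P\left(\bigcup_{n=1}^{\infty}\bigcap_{m=n}^{\infty}\big\{\omega\mid\forall(F_m,A_m)\in\Theta'_m;\ F_m(\omega)\subset B_q(5M)\big\}\right)=1.$$

###### Proof.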

Choose $M$ sufficiently large to satisfy inequality (8) and

$$\lambda\int_{\|x\|\ge 2M}\phi(\|x\|)\,P(dx)<\epsilon, \tag{9}$$

where $\epsilon>0$ is selected to ensure $\epsilon+m_k(P)<m_{k-1}(P)$. Note that for .

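Suppose that $F_n$ contains at least one center outside $B_q(5M)$ and consider the effect on $\Phi(F_n,A_n,P_n)$ of deleting such outside centers from $F_n$ for all $(F_n,A_n)\in\Theta'_n$. From Lemma 1, all $F_n$ contain at least one center in $B_q(M)$ when $n$ is sufficiently large, say $f_1$. In the worst case, the cluster of $f_1$ should contain all sample points belonging to clusters outside $B_q(5M)$. Because these points must be outside $B_p(2M)$, the increment of $\Phi(F_n,A_n,P_n)$ due to the deletion of centers outside $B_q(5M)$ from $F_n$ would be at most

$$\begin{aligned}
\int_{\|x\|\ge 2M}\phi(\|x-Af_1\|)\,P_n(dx)
&\le\int_{\|x\|\ge 2M}\phi(\|x\|+\|f_1\|)\,P_n(dx)\\
&\le\int_{\|x\|\ge 2M}\phi(2\|x\|)\,P_n(dx)\\
&\le\lambda\int_{\|x\|\ge 2M}\phi(\|x\|)\,P_n(dx).
\end{aligned}$$

Denote the set obtained by deleting centers outside $B_q(5M)$ from $F_n$ by $F_n^*$. For each $A$, $(F_n^*,A)$ is contained in $\Theta^*_{k-1}(5M)$, and thus,

$$\Phi(F_n^*,A,P_n)\ge m^*_{k-1}(P_n\mid 5M)\ge m_{k-1}(P_n).$$

For each $x$ satisfying $\|x\|<2M$ and each $A$, we have

$$\|x-Af\|>3M\quad\text{for all } f\notin B_q(5M)$$

and

$$\|x-Ag\|<3M\quad\text{for all } g\in B_q(M).$$

Thus,

$$\int_{\|x\|<2M}\min_{f\in F_n}\phi(\|x-Af\|)\,P_n(dx)=\int_{\|x\|<2M}\min_{f\in F_n^*}\phi(\|x-Af\|)\,P_n(dx)$$

for all $A$. Note that

$$\lim_{n\to\infty}m^*_{k-1}(P_n\mid 5M)=m^*_{k-1}(P\mid 5M)\quad\text{a.s.}$$

by Proposition 4.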

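Consider the complementary event, on which infinitely many $\Theta'_m$ contain some $(F_m,A_m)$ with $F_m(\omega)\not\subset B_q(5M)$. By the axiom of choice, for an arbitrary $\omega$ in this event there exists a subsequence $\{n_i\}$ such that $(F_{n_i},A_{n_i})\in\Theta'_{n_i}$ and $F_{n_i}(\omega)\not\subset B_q(5M)$. For any $F$ with $k$ or fewer points and any $A$,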

 m∗k−1(P∣5M) ≤liminfiΦ(F∗ni,Ani,Pni)≤limsupiΦ(F∗ni,Ani,Pni) =limsupi[∫∥x∥<2Mminf∈Fniϕ(∥x−Anif∥)Pni(dx) +∫∥x∥≥2Mminf∈F∗niϕ(∥x−Anif∥)Pni(dx)] ≤limsupn[Φ(Fn,An,Pn)+λ∫∥x∥≥2Mϕ(∥x∥)Pn(dx)] ≤