# A notion of stability for k-means clustering

Thibaut Le Gouic (Aix Marseille Univ, CNRS, Centrale Marseille, I2M, Marseille, France) and Quentin Paris (National Research University Higher School of Economics; the study has been funded by the Russian Academic Excellence Project 5-100)

###### Abstract

In this paper, we define and study a new notion of stability for the $k$-means clustering scheme, building upon the field of quantization of a probability measure. We connect this definition of stability to a geometric feature of the underlying distribution of the data, named the absolute margin condition, inspired by recent works on the subject.

## 1 Introduction

Unsupervised classification consists in partitioning a data set into a series of groups (or clusters), each of which may then be regarded as a separate class of observations. This task, widely considered in data analysis, enables practitioners in many disciplines to get a first intuition about their data by identifying meaningful groups of observations. The tools available for unsupervised classification are various. Depending on the nature of the problem, one may rely on a model-based strategy, modeling the unknown distribution of the data as a mixture of known distributions with unknown parameters. Another approach, model-free, is embodied by the well-known $k$-means clustering scheme. This paper focuses on the stability of this clustering scheme.

### 1.1 Quantization and the k-means clustering scheme

The $k$-means clustering scheme prescribes to classify observations according to their distances to chosen representatives. This clustering scheme is strongly connected to the field of quantization of probability measures, and this paragraph briefly recalls how these concepts interact. Suppose our data are modeled by i.i.d. random variables $X_1,\dots,X_n$, taking their values in some metric space $(E,d)$, and with the same distribution $P$ as (and independent of) a generic random variable $X$. Let $k$ be an integer fixed in advance, representing the prescribed number of clusters, and define a $k$-points quantizer as any mapping $q:E\to E$ such that $|q(E)|=k$, where, for a set $A$, the notation $|A|$ refers to the number of elements in $A$. The integer $k$ is supposed fixed throughout the paper and all quantizers considered below are supposed to be $k$-points quantizers. Denoting $c_1,\dots,c_k$ the values taken by $q$, the sets $\{x\in E:q(x)=c_j\}$, $1\le j\le k$, partition the space $E$ into $k$ subsets (or cells) and each point $c_j$ (called indifferently a center, a centroid or a code point) stands as a representative of all points in its cell. Given a quantizer $q$, the associated data clusters are defined, for all $1\le j\le k$, by

$$C_j(q):=\{x\in E:q(x)=c_j\}\cap\{X_1,\dots,X_n\}.$$

The performance of this clustering scheme is naturally measured by the average square distance, with respect to $P$, of a point to its representative. In other words, the risk of $q$ (also referred to as its distortion) is defined by

$$R(q):=\int_E d(x,q(x))^2\,\mathrm{d}P(x). \tag{1.1}$$

Quantizers of special interest are nearest neighbor (NN) quantizers, i.e. quantizers $q$ such that, for all $x\in E$,

$$q(x)\in\operatorname*{arg\,min}_{c\in q(E)} d(x,c).$$

The interest in these quantizers relies on the straightforward observation that, for any quantizer $q$, an NN quantizer $q'$ such that $q'(E)=q(E)$ satisfies $R(q')\le R(q)$. Hence, attention may be restricted to NN quantizers and any optimal quantizer

$$q^\star\in\operatorname*{arg\,min}_{q} R(q), \tag{1.2}$$

(where $q$ ranges over all $k$-points quantizers) is necessarily an NN quantizer. We will denote by $Q_k$ the set of all $k$-points NN quantizers and, unless mentioned explicitly, all quantizers involved in the sequel will be considered as members of $Q_k$. For $q\in Q_k$, the value of its risk is entirely described by its image. Indeed, if $q$ takes values $c_1,\dots,c_k$, then

$$R(q)=\int_E \min_{1\le j\le k} d(x,c_j)^2\,\mathrm{d}P(x). \tag{1.3}$$

Denoting $c=\{c_1,\dots,c_k\}$, referred to as a codebook, we will often denote by $R(c)$ the right hand side of (1.3), with a slight abuse of notation.

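A few additional considerations, relative to NN quantizers, will be useful in the paper. Given a codebook $c=\{c_1,\dots,c_k\}$, denote by $V_j(c)$ the set of points in $E$ closer to $c_j$ than to any other $c_\ell$, that is

$$V_j(c):=\{x\in E:\ \forall \ell\in\{1,\dots,k\},\ d(x,c_j)\le d(x,c_\ell)\}.$$

These sets do not partition the space $E$ since, for $i\ne j$, the set $V_i(c)\cap V_j(c)$ is not necessarily empty. A Voronoi partition of $E$ relative to $c$ is any partition of $E$ into $k$ sets such that, up to relabeling, the $j$-th set is contained in $V_j(c)$ for all $1\le j\le k$. For instance, given $q\in Q_k$ with image $c$, the sets $\{x\in E:q(x)=c_j\}$, $1\le j\le k$, form a Voronoi partition relative to $c$. We call frontier of the Voronoi diagram generated by $c$ the set

$$F(c):=\bigcup_{i\ne j}V_i(c)\cap V_j(c). \tag{1.4}$$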

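Given an optimal quantizer $q^\star$ with image $c^\star=\{c^\star_1,\dots,c^\star_k\}$, a remarkable property, known as the center condition, states that for all $1\le j\le k$, and provided the support of $P$ contains at least $k$ points,

$$P(V_j(c^\star))>0\quad\text{and}\quad c^\star_j\in\operatorname*{arg\,min}_{c\in E}\int_{V_j(c^\star)}d(x,c)^2\,\mathrm{d}P(x). \tag{1.5}$$

From now on, the probability measure $P$ will be supposed to have a support of more than $k$ points.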

We end this subsection by mentioning that computing an optimal quantizer requires the knowledge of the distribution $P$. From a statistical point of view, when the only information available about $P$ consists in the sample $X_1,\dots,X_n$, reasonable quantizers are empirically optimal quantizers, i.e. NN quantizers associated to any codebook $\hat c$ satisfying

$$\hat c\in\operatorname*{arg\,min}_{c=\{c_1,\dots,c_k\}}\hat R(c)\quad\text{where}\quad\hat R(c)=\frac1n\sum_{i=1}^n\min_{1\le j\le k}d(X_i,c_j)^2. \tag{1.6}$$

In other words, empirically optimal quantizers minimize the risk associated to the empirical measure

$$P_n:=\frac1n\sum_{i=1}^n\delta_{X_i}.$$

The computation of empirically optimal centers is known to be a hard problem, due in particular to the non-convexity of $\hat R$, and is usually performed by Lloyd’s algorithm, for which convergence guarantees have been obtained recently by Lu and Zhou (2016) in the context where $P$ is a mixture of sub-gaussian distributions.
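As a rough illustration of (1.6), the following Python sketch evaluates the empirical risk of a codebook and runs a basic version of Lloyd's iterations in a Euclidean space. It is only a minimal sketch under stated assumptions: the helper names (`sq_dist`, `nearest`, `empirical_risk`, `lloyd`), the tie-breaking rule and the handling of empty cells are illustrative choices that do not come from the paper, and the iterations are only expected to reach a local minimum of $\hat R$.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def nearest(x, codebook):
    """Index of the center closest to x (assumption: ties go to the lowest index)."""
    return min(range(len(codebook)), key=lambda j: sq_dist(x, codebook[j]))

def empirical_risk(sample, codebook):
    """Empirical risk of (1.6): mean squared distance to the nearest center."""
    return sum(min(sq_dist(x, c) for c in codebook) for x in sample) / len(sample)

def lloyd(sample, codebook, n_iter=100):
    """Alternate nearest-neighbor assignment and centroid update."""
    codebook = [tuple(c) for c in codebook]
    for _ in range(n_iter):
        cells = [[] for _ in codebook]
        for x in sample:
            cells[nearest(x, codebook)].append(x)
        # Assumption: a center whose cell is empty is left where it is.
        updated = [
            tuple(sum(coords) / len(cell) for coords in zip(*cell)) if cell else c
            for c, cell in zip(codebook, cells)
        ]
        if updated == codebook:
            break
        codebook = updated
    return codebook

sample = [(0.0,), (1.0,), (10.0,), (11.0,)]
centers = lloyd(sample, [(0.0,), (10.0,)])
print(centers, empirical_risk(sample, centers))
```

Each iteration cannot increase the empirical risk, which is why the procedure is a natural heuristic for (1.6) even though it offers no global guarantee in general.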

### 1.2 Risk bounds

The performance of the $k$-means clustering scheme, based on the notion of risk, has been widely studied in the literature. Whenever $E$ is a separable Hilbert space, the existence of an optimal codebook, i.e. of $c^\star$ such that

$$R(c^\star)=R^\star:=\inf_{c=\{c_1,\dots,c_k\}}R(c),$$

is well established (see, e.g., Theorem 4.12 in Graf and Luschgy, 2000), provided $\mathbb{E}|X|^2<+\infty$. In this same context, works of Pollard (1981, 1982a) and Abaya and Wise (1984) imply that $R(\hat c)\to R^\star$ almost surely as $n$ goes to $+\infty$, where $\hat c$ is as in (1.6). The non-asymptotic performance of the $k$-means clustering scheme has also received a lot of attention and has been studied, for example, by Chou (1994); Linder et al. (1994); Bartlett et al. (1998); Linder (2000, 2001); Antos (2005); Antos et al. (2005) and Biau et al. (2008). For instance, Biau et al. (2008) prove that in a separable Hilbert space, and provided $|X|\le L$ almost surely, then

$$\mathbb{E}R(\hat c)-R^\star\le\frac{12kL^2}{\sqrt n},$$

for all $n\ge1$. A similar result is established in Cadre and Paris (2012), relaxing the hypothesis of bounded support by supposing only the existence of an exponential moment for $X$. In the context of a separable Hilbert space, Levrard (2015) establishes a stronger result under some conditions involving the quantity $p(t)$ defined as follows.

###### Definition 1.1 (Levrard, 2015).

Let $\mathcal{M}$ be the set of all $c^\star$ such that $R(c^\star)=R^\star$. For $t\ge0$, we define

$$p(t):=\sup_{c^\star\in\mathcal{M}}P\big(F(c^\star)^t\big), \tag{1.7}$$

where, for any set $A\subset E$, the notation $A^t$ stands for the $t$-neighborhood of $A$ in $E$ defined by $A^t=\{x\in E:d(x,A)\le t\}$, and where $F(c^\star)$ is defined in (1.4).

For any codebook $c$, $P(F(c)^t)$ corresponds to the probability mass of the frontier of the associated Voronoi diagram inflated by $t$ (see Figure 1). Under some slight restrictions, and supposing $p(t)$ does not increase too rapidly with $t$, it appears that the excess risk is of order $1/n$, as described below.

###### Theorem 1.2 (Proposition 2.1 and Theorem 3.1 in Levrard, 2015).

Suppose that $E$ is a (separable) Hilbert space. Denote

$$B=\inf_{c^\star\in\mathcal{M},\,i\ne j}|c^\star_i-c^\star_j|\quad\text{and}\quad p_{\min}=\inf_{c^\star\in\mathcal{M},\,1\le j\le k}P(V_j(c^\star)).$$

Suppose that $|X|\le L$ almost surely for some $L>0$. Then $B>0$ and $p_{\min}>0$.
Suppose in addition that there exists $r_0>0$ such that, for all $0<t\le r_0$,

$$p(t)\le\frac{B\,p_{\min}}{128L^2}\,t,$$

where $p(t)$ is as in (1.7). Then, for all $x>0$, and any $\hat c$ minimizing the empirical risk as in (1.6),

$$R(\hat c)-R^\star\le\frac{C(k+x)L^2}{n},$$

with probability at least $1-e^{-x}$, where $C$ denotes a constant depending on auxiliary (and explicit) characteristics of $P$.

### 1.3 Stability

For a quantizer $q$, the risk $R(q)$ describes the average square distance of a point $x$ to its representative $q(x)$ whenever $x$ is drawn from $P$. The risk of $q$ therefore characterizes an important feature of the clustering scheme based on $q$, and defining optimality of $q$ in terms of the value of its risk appears as a reasonable approach. However, an important though simple observation is that the excess risk $R(q)-R(q^\star)$, for an optimal quantizer $q^\star$, isn’t well suited to describe the geometric similarity between the clusterings based on $q$ and $q^\star$. For one thing, there might be several optimal codebooks. Also, even in the context where there is a unique optimal codebook, quite different configurations of centers may give rise to very similar values of the excess risk. This observation relates to the difference between estimating the optimal quantizer and learning to perform as well as the optimal quantizer, and is relevant in a more general context, as briefly discussed in Appendix B below. Basically, the idea of stability we are referring to consists in identifying situations where having centers $c$ with small excess risk guarantees that $c$ isn’t far from an optimal codebook $c^\star$, geometrically speaking. We formalize this idea below.

###### Definition 1.3.

Consider a function $F:Q_k\times Q_k\to\mathbb{R}_+$. The clustering problem discussed in subsections 1.1 and 1.2 is called $(F,\phi)$-stable if, for any optimal quantizer $q^\star$ and any auxiliary quantizer $q$,

$$F(q^\star,q)\le\phi\big(R(q)-R(q^\star)\big). \tag{1.8}$$

We say that the clustering problem is strongly stable for $F$ if $\phi$ is linear.

Note first that, for some chosen $F$, the notion of stability defined above characterizes a property of the underlying distribution $P$. Here, properties of the function $F$ are deliberately unspecified as, in practice, $F$ can be chosen in order to encode very different properties, of more or less geometric nature. An important property of this notion is that stable clustering problems are such that $\varepsilon$-minimizers of the risk are "close" (in the sense of $F$) to the optimal quantizer (see Corollary 2.5 below).

###### Remark 1.4.

The notion of stability described above differs from the notion of algorithm stability studied in Ben-David et al. (2006) and Ben-David et al. (2007). Their notion of stability is defined for a function $A$ (called algorithm) that maps any data set $\{X_1,\dots,X_n\}$ to a quantizer $A(\{X_1,\dots,X_n\})$. In this context, the stability of $A$ is defined by

$$\mathrm{Stab}(A,P)=\lim_{n\to\infty}\mathbb{E}\,D\big(A(\{X_1,\dots,X_n\}),A(\{Y_1,\dots,Y_n\})\big),$$

where the $X_i$’s and $Y_i$’s are i.i.d. random variables of common distribution $P$ and $D$ is a (pseudo-)metric on $Q_k$. Then, an algorithm $A$ is said to be stable for $P$ if $\mathrm{Stab}(A,P)=0$. According to this definition, any constant algorithm is stable. A notable difference is that our notion of stability includes a notion of consistency. Indeed, since is continuous (for a proper choice of the metric on $Q_k$), then our notion of stability measures (if and) at which rate whenever . Thus, we focus only on the behaviour of algorithms $A$ such that .

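A first rather obvious choice for $F$ is given by

$$F_1(q^\star,q):=\min_{\sigma}\max_{1\le j\le k}d\big(c^\star_j,c_{\sigma(j)}\big), \tag{1.9}$$

if $q^\star(E)=\{c^\star_1,\dots,c^\star_k\}$ and $q(E)=\{c_1,\dots,c_k\}$, and where the minimum is taken over all permutations $\sigma$ of $\{1,\dots,k\}$ (see Figure 3).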

###### Remark 1.5.

Note that $F_1(q^\star,q)$ does not always coincide with the Hausdorff distance $d_H(c^\star,c)$ between $c^\star=q^\star(E)$ and $c=q(E)$. Indeed, Figure 2 presents a configuration of codebooks $c^\star$ and $c$ that have small Hausdorff distance but define NN quantizers $q^\star$ and $q$ with large $F_1(q^\star,q)$. However, it may be seen that the inequality

$$d_H(c^\star,c)\le F_1(q^\star,q)$$

always holds and that, provided

$$d_H(c^\star,c)<\frac12\min_{i\ne j}|c^\star_i-c^\star_j|,$$

we obtain $d_H(c^\star,c)=F_1(q^\star,q)$. The proof of these statements is reported in Appendix A.1.
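To make the comparison between $F_1$ and the Hausdorff distance concrete, here is a small Python sketch computing both quantities for finite codebooks in a Euclidean space. The function names `f1` and `hausdorff` and the two example codebooks are hypothetical choices made for illustration (they are not the configuration of Figure 2), and the brute-force search over permutations is only practical for small $k$.

```python
import math
from itertools import permutations

def f1(c_star, c):
    """F_1 of (1.9): best matching of centers, measured by the largest gap."""
    k = len(c_star)
    return min(
        max(math.dist(c_star[j], c[sigma[j]]) for j in range(k))
        for sigma in permutations(range(k))
    )

def hausdorff(a, b):
    """Hausdorff distance between two finite sets of points."""
    a_to_b = max(min(math.dist(x, y) for y in b) for x in a)
    b_to_a = max(min(math.dist(x, y) for x in a) for y in b)
    return max(a_to_b, b_to_a)

# Illustrative codebooks: every center has a close neighbor in the other
# codebook, yet no one-to-one matching keeps all matched pairs close.
c_star = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0)]
c = [(0.0, 0.0), (5.0, 0.0), (5.1, 0.0)]
print(hausdorff(c_star, c), f1(c_star, c))
```

In this example the Hausdorff distance is about 0.1 while $F_1$ is about 4.9, because $F_1$ forces a bijection between the two codebooks whereas the Hausdorff distance does not.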

Whenever $E$ is Euclidean, it follows from the previous remark and Pollard (1982b) that, provided the optimal codebook $c^\star$ is unique,

$$F_1(q^\star,\hat q)\underset{n\to+\infty}{\longrightarrow}0,\quad\text{a.s.},$$

when $\hat q$ is any quantizer minimizing the empirical risk $\hat R$. In Levrard (2015), under the conditions of Theorem 1.2, it is proven that for any optimal quantizer $q^\star$, and any such that ,

$$F_1(q^\star,q)^2\le\frac{p_{\min}}{2}\big(R(q)-R(q^\star)\big),$$

provided which proves in this case (a local version of) the stability of the clustering scheme for (constants are defined in Theorem 1.2). In the same spirit, when and for a measure with bounded support, Rakhlin and Caponnetto (2007) show that as whenever and are optimal quantizers for empirical measures and whose supports differ by at most points. In addition, their Lemma 5.1 shows that, for with bounded support,

$$d_H(c^\star,c)\le C\,\Big(\mathbb{E}\Big[\big|\,|X-q(X)|^2-|X-q^\star(X)|^2\,\big|\Big]\Big)^{\frac{1}{d+2}},$$

for some constant . Note that, since , our main result (Theorem 2.3) improves this inequality under suitable conditions discussed below.

While $F_1$ captures distances between representatives of the two quantizers, it is however totally oblivious to the amount of wrongly classified points. From this point of view, a more interesting quantity is described by

$$F_2(q^\star,q):=\min_{\sigma}P\Big[\Big(\bigcup_{j=1}^kV_j(c^\star)\cap V_{\sigma(j)}(c)\Big)^c\Big], \tag{1.10}$$

where the minimum is taken over all permutations $\sigma$ of $\{1,\dots,k\}$ (see Figure 3). This quantity measures exactly the amount of points that are misclassified by $q$ compared to $q^\star$, with respect to $P$.

In the present paper, we study a related quantity, of geometric nature, defined simply as the average square distance between a quantizer $q$ and an optimal quantizer $q^\star$, i.e.

$$F(q^\star,q)^2:=\int_Ed\big(q(x),q^\star(x)\big)^2\,\mathrm{d}P(x). \tag{1.11}$$

As discussed later in the paper (see Subsection 2.2), this quantity may be seen as an intermediate between $F_1$ and $F_2$, incorporating both the notion of proximity of the centers and the amount of misclassified points. The general concern of the paper will be to establish conditions under which the clustering scheme is strongly stable for this function $F$.
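The quantities (1.10) and (1.11) can be approximated on a sample by replacing $P$ with the empirical measure. The Python sketch below does this in a Euclidean space. It is a minimal sketch under stated assumptions: the names `f2_empirical` and `f_sq_empirical` are hypothetical, ties on the frontier are broken by the lowest index (so sample points lying exactly on a frontier are not treated as in (1.10)), and the permutation search is brute force.

```python
from itertools import permutations

def sq_dist(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def nearest(x, codebook):
    """Index of the center closest to x (assumption: ties go to the lowest index)."""
    return min(range(len(codebook)), key=lambda j: sq_dist(x, codebook[j]))

def f2_empirical(sample, c_star, c):
    """Empirical analogue of (1.10): smallest fraction of points whose
    cells disagree, over all relabelings of the centers of c."""
    pairs = [(nearest(x, c_star), nearest(x, c)) for x in sample]
    return min(
        sum(sigma[i] != j for i, j in pairs) / len(sample)
        for sigma in permutations(range(len(c_star)))
    )

def f_sq_empirical(sample, c_star, c):
    """Empirical analogue of (1.11): mean squared distance between the
    two representatives of each sample point."""
    return sum(
        sq_dist(c_star[nearest(x, c_star)], c[nearest(x, c)]) for x in sample
    ) / len(sample)

sample = [(-2.0,), (-1.0,), (0.4,), (1.0,), (2.0,)]
c_star = [(-1.5,), (1.5,)]
c = [(1.0,), (0.0,)]
print(f2_empirical(sample, c_star, c), f_sq_empirical(sample, c_star, c))
```

Note that `f_sq_empirical` needs no matching of labels, since it only compares the two representatives attached to each point.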

## 2 Stability results

In this section, we present our main results. In the sequel, we restrict ourselves to the case where $E$ is a (separable) Hilbert space with scalar product $\langle\cdot,\cdot\rangle$ and associated norm $|\cdot|$. For any $E$-valued random variable $Z$, we’ll denote

$$\|Z\|^2:=\mathbb{E}|Z|^2,$$

for brevity.

### 2.1 Absolute margin condition

We first address the issue of characterizing the stability of the clustering scheme in terms of the function $F$ defined in (1.11). The next definition plays a central role in our main result. Recall that $X$ denotes a generic random variable with distribution $P$.

###### Definition 2.1 (Absolute margin condition).

Suppose that $\mathbb{E}|X|^2<+\infty$ and let $q^\star$ be an optimal $k$-points quantizer of $P$. For $\lambda\ge0$, define

$$A(\lambda)=\{x\in E:\ q^\star\big(x+\lambda(x-q^\star(x))\big)=q^\star(x)\}.$$

Then, $P$ is said to satisfy the absolute margin condition with parameter $\lambda_0>0$ if both the following conditions hold:

1. $P(A(\lambda_0))=1$.

2. For any random variable $Y$ such that , the map

$$q\in Q_k\mapsto\|Y-q(Y)\|^2$$

has a unique minimizer.

The second condition means that every probability measure, in a neighborhood of $P$, has a unique optimal $k$-points quantizer. Note that and that for . Letting , the first point of this definition states that the neighborhood of the frontier is of probability zero (see Figure 4). The next remark discusses the geometry of the set $A(\lambda_0)$, involved in the previous definition, in comparison with the sets $F(c)^t$ used in Definition 1.1. In particular, it follows from the following remark that, for appropriate $t_1$ and $t_2$, the set $A(\lambda_0)$ satisfies

$$F(c)^{t_1}\subset\big(E\setminus A(\lambda_0)\big)\subset F(c)^{t_2}.$$

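The first point of Definition 2.1 can be probed numerically: for a given codebook and a sample, one can count the points $x$ for which the shifted point $x+\lambda(x-q(x))$ stays in the cell of $x$. The Python sketch below does this in a Euclidean space. It is only an illustration under stated assumptions: the names `in_margin_set` and `margin_mass` are hypothetical, ties are broken by the lowest index, and the codebook passed in is not checked for optimality, whereas the definition is stated for an optimal quantizer $q^\star$.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def nearest(x, codebook):
    """Index of the center closest to x (assumption: ties go to the lowest index)."""
    return min(range(len(codebook)), key=lambda j: sq_dist(x, codebook[j]))

def in_margin_set(x, codebook, lam):
    """True if x belongs to A(lam): moving x away from its center by a
    factor lam does not change the cell it falls in."""
    j = nearest(x, codebook)
    center = codebook[j]
    x_lam = tuple(xi + lam * (xi - ci) for xi, ci in zip(x, center))
    return nearest(x_lam, codebook) == j

def margin_mass(sample, codebook, lam):
    """Fraction of the sample lying in A(lam), i.e. P_n(A(lam))."""
    return sum(in_margin_set(x, codebook, lam) for x in sample) / len(sample)

codebook = [(-1.0,), (1.0,)]
sample = [(-1.5,), (-0.4,), (0.6,), (2.0,)]
print(margin_mass(sample, codebook, 0.0), margin_mass(sample, codebook, 1.0))
```

A value of `margin_mass` equal to 1 for some positive `lam` is the empirical counterpart of the first point of the absolute margin condition with that parameter.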
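###### Remark 2.2.

Let $q\in Q_k$ with image $c=\{c_1,\dots,c_k\}$. Denote

$$m(c)=\min_{i\ne j}|c_i-c_j|\quad\text{and}\quad M(c)=\max_{i\ne j}|c_i-c_j|.$$

For all $\lambda$ and $t$, let

$$A(\lambda):=\{x\in E:\ q\big(x+\lambda(x-q(x))\big)=q(x)\}\quad\text{and}\quad B(t):=E\setminus F(c)^t.$$

Then the following statements hold.

1. For all $0<t<M(c)/2$,

$$B(t)\subset A\Big(\frac{2t}{M(c)-2t}\Big).$$

2. For all $\lambda>0$,

$$A(\lambda)\subset B\Big(\frac{m(c)\,\lambda}{2(1+\lambda)}\Big).$$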

We are now in a position to state the main result of this paper.

###### Theorem 2.3.

Suppose that $\mathbb{E}|X|^2<+\infty$. Let $q^\star$ be an optimal quantizer for $P$ and suppose that $P$ satisfies the absolute margin condition 2.1 with parameter $\lambda_0>0$. Then, for any $q\in Q_k$, it holds that

$$F(q^\star,q)^2\le\frac{1+\lambda_0}{\lambda_0}\big(R(q)-R(q^\star)\big).$$

###### Remark 2.4.

The above theorem states that the clustering scheme is strongly stable for $F^2$ provided the absolute margin condition holds. Here, we briefly argue that this result is optimal in the sense that strong stability requires, in general, that both hypotheses of the absolute margin condition 2.1 hold.

1. The following example shows that the first point of the absolute margin condition cannot be dropped. Take $P$ uniform on and fix . Then the first point of the absolute margin condition is clearly not satisfied. The codebook

$$c^\star=\{(-1/2,0),(1/2,0)\}$$

defines the unique optimal quantizer. For $\varepsilon>0$, consider now

$$c_\varepsilon=\{(-1/2,\varepsilon),(1/2,-\varepsilon)\}.$$

Then it can be checked through straightforward computations that and that , so that there exists no $\lambda>0$ for which the inequality

$$F(q^\star,q_\varepsilon)^2\le\frac{1+\lambda}{\lambda}\big(R(q_\varepsilon)-R(q^\star)\big)$$

holds for all $\varepsilon>0$.

2. If the optimal quantizer of $P$ is not unique, then the result clearly cannot hold. However, this uniqueness property does not suffice. To illustrate this statement, suppose $P$ is defined by where is uniform on and is uniform on . For $k=2$, the codebook

$$c^\star=\{(0,1),(0,-1)\}$$

defines the unique optimal quantizer for $P$. The distribution $P$ satisfies the first point of the absolute margin condition for any $\lambda_0$, but fails to satisfy the second point for large $\lambda_0$. In particular, it follows from details in the proof of Theorem 2.3 that the desired inequality cannot hold for large $\lambda_0$.

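An interesting consequence of Theorem 2.3 holds in the context of empirical measures, for which the absolute margin condition always holds. Consider a sample $X_1,\dots,X_n$ composed of i.i.d. variables with distribution $P$ and let

$$P_n=\frac1n\sum_{i=1}^n\delta_{X_i}.$$

The next result ensures that an $\varepsilon$-empirical risk minimizer (i.e. a quantizer $q_\varepsilon$ whose empirical risk exceeds the minimal empirical risk by at most $\varepsilon$) is at a squared distance (in terms of $F$) at most $(1+\lambda_n)\varepsilon/\lambda_n$ from an empirical risk minimizer, for some $\lambda_n$ depending only on $P_n$.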

###### Corollary 2.5.

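Let $\varepsilon>0$. Let $P_n$ be the empirical measure of a measure $P$, associated with the sample $X_1,\dots,X_n$. Suppose $P_n$ has a unique optimal quantizer $\hat q$. Then $P_n$ satisfies the absolute margin condition for some $\lambda_n>0$. In addition, if $q_\varepsilon$ satisfies

$$\frac1n\sum_{i=1}^n|X_i-q_\varepsilon(X_i)|^2\le\varepsilon+\frac1n\sum_{i=1}^n|X_i-\hat q(X_i)|^2,$$

then

$$F(\hat q,q_\varepsilon)^2\le\frac{1+\lambda_n}{\lambda_n}\,\varepsilon.$$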

The last result follows easily from Theorem 4.2 in Graf and Luschgy (2000) (stating that , for , and thus for some ) and from Theorem 2.3. The proof is therefore omitted for brevity. The interpretation of this corollary is that any algorithm producing a quantizer $q$ with small empirical risk will automatically be such that $F(\hat q,q)$ is small (again, provided uniqueness of $\hat q$) if $\lambda_n$ is large. The parameter $\lambda_n$, defined by the absolute margin condition, thus provides a key feature for the stability of the $k$-means algorithm. A nice property of the previous result is that $\lambda_n$ is of course independent of the $\varepsilon$-minimizer $q_\varepsilon$. However, an important remaining question, of large practical value, is to lower bound $\lambda_n$ with large probability in order to assess the size of the coefficient $(1+\lambda_n)/\lambda_n$. This is left for future research.

### 2.2 Comparing notions of stability

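This subsection describes some relationships between the function $F$ involved in our main result and the two functions $F_1$ and $F_2$ mentioned earlier in Section 1.3. Below, we restrict attention to the case where there is a unique optimal quantizer $q^\star$. Comparing $F$ and $F_2$ can be done straightforwardly. Let

$$m=\inf_{i\ne j}|c^\star_i-c^\star_j|\quad\text{and}\quad M=\sup_{i\ne j}|c^\star_i-c^\star_j|.$$

Observe that, for $F_1(q^\star,q)$ small enough, the permutation reaching the minimum in the definitions of $F_1$ and $F_2$ is the same and can be assumed to be the identity without loss of generality. Then, it follows that, for $F_1(q^\star,q)$ small enough,

$$\begin{aligned}F(q^\star,q)^2&=\sum_{i,j=1}^kP\big(V_i(c^\star)\cap V_j(c)\big)\,|c^\star_i-c_j|^2\\&\le\sum_{i=1}^kP\big(V_i(c^\star)\cap V_i(c)\big)\,|c^\star_i-c_i|^2+\sum_{i\ne j}P\big(V_i(c^\star)\cap V_j(c)\big)\,\big(|c^\star_j-c_j|+M\big)^2\\&\le F_1(q^\star,q)^2+F_2(q^\star,q)\,\big(F_1(q^\star,q)+M\big)^2,\end{aligned}$$

and similarly, when $F_1(q^\star,q)\le m$,

$$\begin{aligned}F(q^\star,q)^2&\ge\sum_{i=1}^kP\big(V_i(c^\star)\cap V_i(c)\big)\,|c^\star_i-c_i|^2+\sum_{i\ne j}P\big(V_i(c^\star)\cap V_j(c)\big)\,\big(m-|c^\star_j-c_j|\big)^2\\&\ge F_2(q^\star,q)\,\big(m-F_1(q^\star,q)\big)^2.\end{aligned}$$

These two inequalities imply that $F^2$ and $F_2$ are comparable whenever $F_1$ is small enough.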
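The two bounds above can be checked numerically with $P$ replaced by the empirical measure of a sample. The Python sketch below computes the lower bound, $F^2$ and the upper bound in a Euclidean space. It is a minimal sketch under stated assumptions: the name `comparison_bounds` is hypothetical, it requires $k\ge2$, ties are broken by the lowest index, and the bounds are only expected to hold when the same permutation is optimal for both $F_1$ and $F_2$, as assumed in the derivation.

```python
import math
from itertools import permutations

def nearest(x, codebook):
    """Index of the center closest to x (assumption: ties go to the lowest index)."""
    return min(range(len(codebook)), key=lambda j: math.dist(x, codebook[j]))

def comparison_bounds(sample, c_star, c):
    """Return (lower, f_sq, upper), where f_sq is the empirical F^2 and
    lower/upper are the two bounds of this subsection."""
    k, n = len(c_star), len(sample)
    perms = list(permutations(range(k)))
    # Pair (cell index under c_star, cell index under c) for each point.
    pairs = [(nearest(x, c_star), nearest(x, c)) for x in sample]
    f1 = min(max(math.dist(c_star[j], c[s[j]]) for j in range(k)) for s in perms)
    f2 = min(sum(s[i] != j for i, j in pairs) / n for s in perms)
    f_sq = sum(math.dist(c_star[i], c[j]) ** 2 for i, j in pairs) / n
    gaps = [math.dist(a, b) for a in c_star for b in c_star if a != b]
    m, big_m = min(gaps), max(gaps)
    # The lower bound is only claimed when F_1 <= m; otherwise use 0.
    lower = f2 * (m - f1) ** 2 if f1 <= m else 0.0
    upper = f1 ** 2 + f2 * (f1 + big_m) ** 2
    return lower, f_sq, upper

# Deterministic pseudo-sample in the square [-1, 1]^2 (illustrative only).
sample = [((i * 37 % 101) / 50 - 1, (i * 53 % 97) / 50 - 1) for i in range(200)]
c_star = [(-0.5, 0.0), (0.5, 0.0)]
c = [(-0.5, 0.1), (0.55, 0.0)]
print(comparison_bounds(sample, c_star, c))
```

Here `c` is a small perturbation of `c_star`, so the identity matching is optimal for both $F_1$ and $F_2$ and the printed triple should be nondecreasing.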

Comparing $F$ and $F_1$ requires more effort, although one inequality is also quite straightforward. Recall the notation $p_{\min}$ from Theorem 1.2. Suppose again that the optimal permutation in the definition of $F_1$ is the identity. Then, remark that , implies , for all . Thus, in this case,

$$\begin{aligned}F(q^\star,q)^2&=\mathbb{E}|q^\star(X)-q(X)|^2\\&=\sum_{i,j=1}^kP\big(V_i(c^\star)\cap V_j(c)\big)\,|c^\star_i-c_j|^2\\&\ge\sum_{i=1}^k\sum_{j=1}^kP\big(V_i(c^\star)\cap V_j(c)\big)\,|c^\star_i-c_i|^2\\&\ge p_{\min}\,F_1(q^\star,q)^2.\end{aligned}$$

In view of providing a more detailed result, we define the function $p^\star$, similar in nature to the function $p$ introduced by Levrard (2015) and defined in Definition 1.1.

###### Definition 2.6.

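For a metric space $(E,d)$ and a probability measure $P$ on $E$, let $X$ be a random variable of law $P$. Denote by $q^\star$ an optimal quantizer of $P$ with image $c^\star=\{c^\star_1,\dots,c^\star_k\}$ and by $\partial V_i(c^\star)$ the frontier of the Voronoi cell associated to $c^\star_i$. Then, for all $t\ge0$, we let

$$p^\star(t):=P\Big(\bigcup_{i=1}^k\big\{m\,d\big(X,\partial V_i(c^\star)\big)\le2\,d\big(X,q^\star(X)\big)\,t+2t^2\big\}\Big),$$

where $m=\inf_{i\ne j}|c^\star_i-c^\star_j|$.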

While $p(t)$ corresponds to the probability of the $t$-inflated frontier of the Voronoi cells (defined in Definition 1.1), $p^\star(t)$ corresponds to a similar object in which the inflation of the frontier gets larger as the points go further from their representative in the codebook $c^\star$. These two functions can thus differ significantly, in general. However, since for such that , it follows that

$$p(t)\le p^\star(2t),$$

whenever . And when the probability measure $P$ has its support in a ball of diameter $R$, it can be readily seen that for all $t\ge0$

$$p^\star(t)\le p\big(m^{-1}[2Rt+2t^2]\big).$$

If the support of $P$ is not contained in a ball, the comparison is not as straightforward.

We can now state the last comparison inequality.

###### Proposition 2.7.

Under the same setting as in Definition 2.6,

$$F(q^\star,q)^2\le F_1(q^\star,q)^2+p^\star\big(F_1(q^\star,q)\big)\,\big(M+F_1(q^\star,q)\big)^2.$$

A consequence of this proposition and of the result of Levrard (2015) recalled in Theorem 1.2 is the following.

###### Corollary 2.8.

Under the conditions of Theorem 1.2,

$$F(q^\star,\hat q)^2=O\Big(\frac1n\Big)+p^\star\Big(O\Big(\frac{1}{\sqrt n}\Big)\Big),$$

for any empirical risk minimizer $\hat q$.

## 3 Proofs

This section gathers the proofs of the main results of the paper. Additional proofs are postponed to the appendices.

### 3.1 Proof of Theorem 2.3

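Recall that $E$ is a Hilbert space with scalar product $\langle\cdot,\cdot\rangle$ and norm $|\cdot|$, and that, for an $E$-valued random variable $Z$ with square integrable norm, we denote $\|Z\|^2=\mathbb{E}|Z|^2$ for brevity. For $x\in E$ and $\lambda>0$, set

$$x_\lambda=x+\lambda\big(x-q^\star(x)\big).$$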

As $E$ is a Hilbert space, we have, for all $y,z\in E$ and all $t\in[0,1]$,

$$|ty+(1-t)z|^2=t|y|^2+(1-t)|z|^2-t(1-t)|y-z|^2.$$

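Now for all $x\in E$, any quantizer $q\in Q_k$ and any $\lambda>0$, using the previous identity with $t=1/(1+\lambda)$, $y=x_\lambda-q(x)$ and $z=q^\star(x)-q(x)$, so that $ty+(1-t)z=x-q(x)$ and $|y-z|=(1+\lambda)|x-q^\star(x)|$, it follows that

$$\begin{aligned}|q^\star(x)-q(x)|^2&=\frac{1+\lambda}{\lambda}\big(|x-q(x)|^2-|x-q^\star(x)|^2\big)+\frac{|x_\lambda-q^\star(x)|^2-|x_\lambda-q(x)|^2}{\lambda}\\&\le\frac{1+\lambda}{\lambda}\big(|x-q(x)|^2-|x-q^\star(x)|^2\big)+\frac{|x_\lambda-q^\star(x)|^2-|x_\lambda-q(x_\lambda)|^2}{\lambda},\end{aligned}$$

where the last inequality follows from the fact that $q$ is a nearest neighbor quantizer. Integrating this inequality with respect to $P$, we obtain

$$F(q^\star,q)^2\le\frac{1+\lambda}{\lambda}\big(R(q)-R(q^\star)\big)+\frac1\lambda\,c_q(\lambda), \tag{3.1}$$

where we have denoted

$$c_q(\lambda):=\|X_\lambda-q^\star(X)\|^2-\|X_\lambda-q(X_\lambda)\|^2.$$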
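Inequality (3.1) only uses the Hilbert identity and the nearest neighbor property of $q$, so it can be checked numerically on any sample by taking $P$ to be the empirical measure. The Python sketch below does this in a Euclidean space. It is only an illustration: the names `nn` and `check_31` and the example codebooks are hypothetical, and the codebook playing the role of $c^\star$ is not required to be optimal for the check to go through.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def nn(x, codebook):
    """Nearest center of the codebook (assumption: ties go to the first one)."""
    return min(codebook, key=lambda c: sq_dist(x, c))

def check_31(sample, c_star, c, lam):
    """Return (lhs, rhs) of (3.1) for the empirical measure of `sample`."""
    n = len(sample)
    lhs = risk_gap = c_q = 0.0
    for x in sample:
        q_star_x, q_x = nn(x, c_star), nn(x, c)
        # x_lambda = x + lam * (x - q_star(x))
        x_lam = tuple(xi + lam * (xi - si) for xi, si in zip(x, q_star_x))
        lhs += sq_dist(q_star_x, q_x) / n
        risk_gap += (sq_dist(x, q_x) - sq_dist(x, q_star_x)) / n
        c_q += (sq_dist(x_lam, q_star_x) - sq_dist(x_lam, nn(x_lam, c))) / n
    return lhs, (1 + lam) / lam * risk_gap + c_q / lam

# Deterministic pseudo-sample in the square [-1, 1]^2 (illustrative only).
sample = [((i * 37 % 101) / 50 - 1, (i * 53 % 97) / 50 - 1) for i in range(200)]
c_star = [(-0.5, 0.0), (0.5, 0.0)]
c = [(-0.3, 0.4), (0.6, -0.2)]
for lam in (0.5, 1.0, 3.0):
    print(lam, check_31(sample, c_star, c, lam))
```

For every value of `lam` the first entry of the returned pair should not exceed the second, up to floating-point rounding.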

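Observe that $\lambda\mapsto c_q(\lambda)$ is continuous. Now, define

$$c_\infty(\lambda):=\sup_qc_q(\lambda),$$

where the supremum is taken over all $k$-points quantizers $q$. Since $q^\star$ is a nearest neighbor quantizer, $c_{q^\star}(\lambda)\ge0$, so the function $c_\infty$ obviously satisfies $c_\infty(\lambda)\ge0$ for all $\lambda\ge0$. To prove the theorem, in view of (3.1), we will show that $c_\infty(\lambda_0)=0$ whenever $P$ satisfies the absolute margin condition with parameter $\lambda_0$. To that aim, we provide two auxiliary results.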

###### Lemma 3.1.

Suppose there exists such that . For all , denote any quantizer such that and denote an optimal quantizer of the law of . Suppose the absolute margin condition holds for . Then, for all , there exists such that for all , if , then

$$q^\star=q_\lambda.$$

###### Proof of Lemma 3.1.

The main idea of the proof is that, since the Voronoi cells are well separated (the inflated borders have probability zero), a quantizer that is close enough to the optimal one shares its Voronoi cells (on the support of $P$). The centroid condition then requires the centers of such a quantizer to be the centroids of its cells for it to be optimal.

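Set and . Suppose without loss of generality that the optimal permutation in the definition of $F_1$ is the identity. The assumption $P(A(\lambda_0))=1$ implies that, with probability one, for each $j\ne i$, on the event $\{q^\star(X)=c^\star_i\}$, the inequality $|X_{\lambda_0}-c^\star_i|\le|X_{\lambda_0}-c^\star_j|$ holds, or equivalently

$$2(1+\lambda_0)\langle X-c^\star_i,c^\star_j-c^\star_i\rangle\le|c^\star_i-c^\star_j|^2. \tag{3.2}$$

However, on the same event,

$$\begin{aligned}|X_{\lambda_0}-c_i|^2&=(1+\lambda_0)^2|X-c^\star_i|^2+|c^\star_i-c_i|^2+2(1+\lambda_0)\langle X-c^\star_i,c^\star_i-c_i\rangle,\\|X_{\lambda_0}-c_j|^2&=(1+\lambda_0)^2|X-c^\star_i|^2+|c^\star_i-c_j|^2+2(1+\lambda_0)\langle X-c^\star_i,c^\star_i-c_j\rangle,\end{aligned}$$

so that $|X_{\lambda_0}-c_i|^2\le|X_{\lambda_0}-c_j|^2$ if

$$2(1+\lambda_0)\langle X-c^\star_i,c_j-c_i\rangle\le|c^\star_i-c_j|^2-|c^\star_i-c_i|^2.$$

Since (3.2) holds, for all , there exists therefore such that, if , then for all ,

$$|X_\lambda-c_i|^2<|X_\lambda-c_j|^2,$$

on the event $\{q^\star(X)=c^\star_i\}$. As a result,

$$P\Big(\bigcup_{i=1}^k\{q^\star(X)=c^\star_i\}\cap\{q_\lambda(X_\lambda)=c_i\}\Big)=1.$$

This means that $q^\star$ and $q_\lambda$ share the same cells on the support of . Thus,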

 ∥Xλ−qλ(X)∥2 =(1+λ)2k∑i=1E1{q</