Machine Learning Friendly Set Version of Johnson-Lindenstrauss Lemma


Mieczysław A. Kłopotek (klopotek@ipipan.waw.pl)
Institute of Computer Science of the Polish Academy of Sciences
ul. Jana Kazimierza 5, 01-248 Warszawa Poland
Abstract

In this paper we make a novel use of the Johnson-Lindenstrauss Lemma. The Lemma has an existential form saying that there exists a JL transformation of the data points into a lower dimensional space such that all of them fall into a predefined error range $\delta$.

We formulate in this paper a theorem stating that we can choose the target dimensionality in a random projection type JL linear transformation in such a way that, with probability $1-\epsilon$, all of them fall into the predefined error range $\delta$, for any user-predefined failure probability $\epsilon$.

This result is important for applications such as data clustering, where we want to have an a priori dimensionality reducing transformation instead of trying out a (large) number of them, as with the traditional Johnson-Lindenstrauss Lemma. In particular, we take a closer look at the $k$-means algorithm and prove that a good solution in the projected space is also a good solution in the original space. Furthermore, under proper assumptions local optima in the original space are also ones in the projected space. We also define conditions under which the clusterability property of the original space is transmitted to the projected space, so that special case algorithms for the original space are also applicable in the projected space.

Keywords: Johnson-Lindenstrauss Lemma, random projection, sample distortion, dimensionality reduction, linear JL transform, $k$-means algorithm, clusterability retention

1 Introduction

Dimensionality reduction plays an important role in many areas of data processing, and especially in machine learning (cluster analysis, classifier learning, model validation, data visualisation etc.).

Usually it is associated with manifold learning, that is, a belief that the data lie in fact in a low dimensional subspace that needs to be identified and the data projected onto it, so that the number of degrees of freedom is reduced and, as a consequence, sample sizes can also be smaller without loss of reliability. Relevant techniques include reduced $k$-means [17], PCA (Principal Component Analysis), Kernel PCA, LLE (Locally Linear Embedding), LEM (Laplacian Eigenmaps), MDS (Metric Multidimensional Scaling), Isomap and SDE (Semidefinite Embedding), just to mention a few.

But there exists still another way of approaching dimensionality reduction problems, in particular when such an intrinsic subspace where the data is located cannot be identified. The problem of choosing the subspace has been circumvented by several authors by so-called random projection, applicable in particular to highly dimensional spaces (tens of thousands of dimensions) and correspondingly large data sets (of at least hundreds of points).

The starting point here is the Johnson-Lindenstrauss Lemma [13]. Roughly speaking, it states that there exists a linear mapping (footnote 1: the JL Lemma speaks about a general transformation, but many researchers look just for linear ones) from a higher dimensional space into a sufficiently high dimensional subspace that will approximately preserve the distances between points, as needed e.g. by the $k$-means algorithm [4].

To be more formal, consider a set of $m$ objects. An object $i$ may have a representation $x_i\in\mathbb{R}^n$. Then the set of these representations will be denoted by $Q$. An object $i$ may also have a representation $x'_i\in\mathbb{R}^{n'}$, in a different space. Then the set of these representations will be denoted by $Q'$.

With this notation let us state:

Theorem 1.

(Johnson-Lindenstrauss) Let . Let be a set of objects and - a set of points representing them in , and let , where is a sufficiently large constant (e.g.20). There exists a Lipschitz mapping such that for all

$$(1-\delta)\|u-v\|^2\le\|f(u)-f(v)\|^2\le(1+\delta)\|u-v\|^2 \tag{1}$$

A number of proofs and applications of this theorem have been proposed which in fact do not prove the theorem as such but rather create a probabilistic version of it, like e.g. [11, 1, 3, 12, 14, 10]. For an overview of Johnson-Lindenstrauss Lemma variants see e.g. [15].

Essentially the idea behind these probabilistic proofs is as follows: It is proven that the probability of reconstructing the length of a random vector from a projection onto a subspace within reasonable error bounds is high.

One then inverts the thinking and states that the probability of reconstructing the length of a given vector from a projection onto a (uniformly selected) random subspace within reasonable error bounds is high.

But uniform sampling of high dimensional subspaces is a hard task. So instead vectors with random coordinates are sampled from the original $n$-dimensional space and one uses them as a coordinate system in the $n'$-dimensional subspace, which is a much simpler process. One hopes that the sampled vectors will be orthogonal (and hence the coordinate system will be orthogonal), which in the case of vectors with thousands of coordinates is reasonable. That means we create a matrix of $n'$ rows and $n$ columns as follows: for each row we sample $n$ numbers from a normal distribution, forming a row vector, and we normalize it to unit length. The normalized vector becomes the corresponding row of the matrix. Then for any data point in the original space its random projection is obtained by multiplying this matrix by the data point vector.

Then the mapping we seek is the projection multiplied by a suitable factor.
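The construction described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names are hypothetical, the standard normal distribution is an assumption (the text only says "normal"), and the scaling factor $\sqrt{n/n'}$ is inferred from the factor $n/n'$ that multiplies squared distances in the inequalities stated later.

```python
import numpy as np

def random_projection_matrix(n, n_proj, rng):
    """Sample n_proj rows of n normal numbers and normalize each row to unit length."""
    rows = rng.standard_normal((n_proj, n))   # assumption: N(0, 1) entries
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)

def project(X, M):
    """Project the rows of X (m x n) with M (n_proj x n) and rescale.

    Assumption: the 'suitable factor' is sqrt(n / n_proj), so that squared
    lengths are multiplied by n / n_proj.
    """
    n_proj, n = M.shape
    return np.sqrt(n / n_proj) * (X @ M.T)

rng = np.random.default_rng(0)
M = random_projection_matrix(n=2000, n_proj=200, rng=rng)
X = rng.standard_normal((10, 2000))           # 10 synthetic data points
X_proj = project(X, M)
```

Note that the rows of `M` are only approximately orthogonal, which matches the hope expressed in the text rather than an exact orthogonal projection.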

It is claimed afterwards that this mapping is distance-preserving not only for a single vector, but also for large sets of points with some, usually very small, probability, as Dasgupta and Gupta [11] maintain. By applying the above process many times, one can finally get the mapping that is needed. That is, each time we sample a subspace from the space of subspaces and check whether the condition expressed by equation (1) holds for all the points, and if not, we sample again, with the reasonable hope that we will get the subspace of interest after a finite number of steps with the probability that we assume.

In this paper we explore the following flaw of the mentioned approach: If we want to apply, for example, a $k$-means clustering algorithm, we are in fact not interested in resampling the subspaces in order to find a convenient one so that the distances are sufficiently preserved. Computing the distances between the points in the projected space over and over again may turn out to be much more expensive than computing distances during $k$-means clustering (if ) in the original space. In fact we are primarily interested in clustering data. But we do not have any criterion for the $k$-means algorithm (and in fact for any other clustering algorithm) that would say that this particular subspace is the right one, e.g. via minimization of the $k$-means criterion.

Therefore, we rather seek a scheme that will allow us to say that by a certain random sampling we have already found the subspace that we sought with a sufficiently high probability. As far as we know, this is the first time such a problem has been posed.

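To formulate claims concerning $k$-means, we need to introduce additional notation. Let us denote with $C$ a partition of the set of objects into $k$ clusters $\{C_1,\dots,C_k\}$. For any object $i$ let $C(i)$ denote the cluster to which $i$ belongs. For any set of objects $C_j$ let $\mu(C_j)=\frac{1}{|C_j|}\sum_{i\in C_j}x_i$ and $\mu'(C_j)=\frac{1}{|C_j|}\sum_{i\in C_j}x'_i$.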

Under this notation the $k$-means cost function may be written as

$$J(Q,C)=\sum_{i\in Q}\|x_i-\mu(C(i))\|^2 \tag{2}$$

$$J(Q',C)=\sum_{i\in Q}\|x'_i-\mu'(C(i))\|^2 \tag{3}$$

for the sets $Q$ and $Q'$.
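As a small illustration of the cost functions (2) and (3), the following sketch computes the $k$-means cost of a given partition. The function name `kmeans_cost` and the toy data are hypothetical; the same function applies to the original and the projected representations.

```python
import numpy as np

def kmeans_cost(X, labels):
    """Sum of squared distances of each point to the centre of its own cluster."""
    labels = np.asarray(labels)
    cost = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        centre = members.mean(axis=0)          # cluster centre mu(C_j)
        cost += np.sum((members - centre) ** 2)
    return float(cost)

# Toy data: two clusters of two points each.
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
labels = [0, 0, 1, 1]
cost = kmeans_cost(X, labels)                  # 1 + 1 + 4 + 4 = 10
```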

Our contribution is as follows:

• We formulate and prove a set version of JL Lemma - see Theorem 6.

• Based on it we demonstrate that a good solution to the $k$-means problem in the projected space is also a good one in the original space - see Theorem 2.

• We show that local $k$-means minima in the original and the projected spaces match under proper conditions - see Theorems 3, 4.

• We demonstrate that a perfect $k$-means algorithm in the projected space is a constant factor approximation of the global optimum in the original space - see Theorem 5.

• We prove that the projection preserves several clusterability properties - see Theorems 7, 8, 9, 10 and 11.

For $k$-means in particular we make the following claim:

Theorem 2.

Let be a set of representatives of objects from in an -dimensional orthogonal coordinate system . Let , . and let

$$n'\ge 2\,\frac{-\ln\epsilon+2\ln(m)}{-\ln(1+\delta)+\delta} \tag{4}$$

Let be a randomly selected (via sampling from a normal distribution) -dimensional orthogonal coordinate system. Let the set consist of objects such that for each , is a projection of onto . If is a partition of , then

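$$(1-\delta)J(Q,C)\le\frac{n}{n'}J(Q',C)\le(1+\delta)J(Q,C) \tag{5}$$

holds with probability of at least $1-\epsilon$.

Note that the inequality (5) can be rewritten as

$$\left(1-\frac{\delta}{1+\delta}\right)J(Q',C)\le\frac{n'}{n}J(Q,C)\le\left(1+\frac{\delta}{1-\delta}\right)J(Q',C)$$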
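For the reader's convenience, the rewriting step can be spelled out as follows (a short worked derivation, assuming $0<\delta<1$ so that both divisors are positive). Dividing the right-hand part of (5) by $1+\delta$ and the left-hand part by $1-\delta$ gives

$$\frac{1}{1+\delta}\cdot\frac{n}{n'}J(Q',C)\le J(Q,C)\le\frac{1}{1-\delta}\cdot\frac{n}{n'}J(Q',C).$$

Multiplying by $\frac{n'}{n}$ and using $\frac{1}{1+\delta}=1-\frac{\delta}{1+\delta}$ and $\frac{1}{1-\delta}=1+\frac{\delta}{1-\delta}$ yields the rewritten form.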

Furthermore

Theorem 3.

Under the assumptions and notation of Theorem 2, if the partition constitutes a local minimum of over (in the original space) and if for any two clusters times half of the distance between their centres is the gap between these clusters, where , and

$$\delta\le\frac{1-\left(1-\frac{g}{2}\right)^2}{\left(1-\frac{g}{2}\right)^2+(1+2p)} \tag{6}$$

($p$ to be defined later by inequality (14)) then this same partition is (in the projected space) also a local minimum of $J(Q',C)$ over $C$, with probability of at least $1-\epsilon$.

Theorem 4.

Under the assumptions and notation of Theorem 2, if the clustering constitutes a local minimum of over (in the projected space) and if for any two clusters times the distance between their centres is the gap between these clusters, where , and

$$\frac{\delta}{1-\delta}\le\frac{1-\alpha^2}{(1+2p)+\alpha^2} \tag{7}$$

then the very same partition is also (in the original space) a local minimum of $J(Q,C)$ over $C$, with probability of at least $1-\epsilon$.

Theorem 5.

Under the assumptions and notation of Theorem 2, if $C_G$ denotes the clustering reaching the global optimum in the original space, and $C'_G$ denotes the clustering reaching the global optimum in the projected space, then

$$\frac{n}{n'}J(Q',C'_G)\le(1+\delta)J(Q,C_G) \tag{8}$$

with probability of at least $1-\epsilon$.

That is, the perfect $k$-means algorithm in the projected space is a constant factor approximation of the $k$-means optimum in the original space.

We postpone the proofs of Theorems 2-5 until Section 3, as we first need to derive the basic Theorem 6 in Section 2, which is essentially based on the results reported by Dasgupta and Gupta [11].

Let us however stress at this point the significance of these theorems. Earlier forms of the JL lemma required sampling of the coordinates over and over again, with quite a low success rate, until a mapping is found fitting the error constraints (footnote 2: in passing, a similar result is claimed in Lemma 5.3 of http://math.mit.edu/~bandeira/2015_18.S096_5_Johnson_Lindenstrauss.pdf, though without an explicit proof). In our theorems, we need only one sampling in order to achieve the required success probability of selecting a suitable subspace to perform $k$-means. In Section 5 we illustrate this advantage by some numerical simulation results, showing at the same time the impact of various parameters of the Johnson-Lindenstrauss Lemma on the dimensionality of the projected space. In Section 6 we recall the corresponding results of other authors.

In Section 4 we demonstrate an additional advantage of our version of JL lemma consisting in preservation of various clusterability criteria.

Section 7 contains some concluding remarks.

2 Derivation of the Set-Friendly Johnson-Lindenstrauss Lemma

Let us present the process of seeking the mapping from Theorem 1 in a more detailed manner, so that we can then switch to our target of selecting the size of the subspace guaranteeing that the projected distances preserve their proportionality in the required range.

Let us consider first a single vector $x=(x_1,\dots,x_n)$ of $n$ independent random variables drawn from the normal distribution with mean 0 and variance 1. Let $x'=(x_1,\dots,x_{n'})$, where $n'<n$, be its projection onto the first $n'$ coordinates.

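Dasgupta and Gupta [11] in their Lemma 2.2 demonstrated that for a positive $\beta$

• if $\beta<1$ then

$$\Pr\left(\|x'\|^2\le\beta\frac{n'}{n}\|x\|^2\right)\le\beta^{\frac{n'}{2}}\left(1+\frac{n'(1-\beta)}{n-n'}\right)^{\frac{n-n'}{2}} \tag{9}$$

• if $\beta>1$ then

$$\Pr\left(\|x'\|^2\ge\beta\frac{n'}{n}\|x\|^2\right)\le\beta^{\frac{n'}{2}}\left(1+\frac{n'(1-\beta)}{n-n'}\right)^{\frac{n-n'}{2}} \tag{10}$$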

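Now imagine we want to keep the error of the squared length of $x$ bounded within a range of $\pm\delta$ (relative error) upon projection, where $\delta\in(0,1)$. Substituting $\beta=1-\delta$ into (9) and $\beta=1+\delta$ into (10), we get the probability

$$\Pr\left((1-\delta)\|x\|^2\le\frac{n}{n'}\|x'\|^2\le(1+\delta)\|x\|^2\right)\ge 1-(1-\delta)^{\frac{n'}{2}}\left(1+\frac{n'\delta}{n-n'}\right)^{\frac{n-n'}{2}}-(1+\delta)^{\frac{n'}{2}}\left(1-\frac{n'\delta}{n-n'}\right)^{\frac{n-n'}{2}}$$

This implies

$$\Pr\left((1-\delta)\|x\|^2\le\frac{n}{n'}\|x'\|^2\le(1+\delta)\|x\|^2\right)\ge 1-2\max\left((1-\delta)^{\frac{n'}{2}}\left(1+\frac{n'\delta}{n-n'}\right)^{\frac{n-n'}{2}},\ (1+\delta)^{\frac{n'}{2}}\left(1-\frac{n'\delta}{n-n'}\right)^{\frac{n-n'}{2}}\right)$$

$$=1-2\max_{\delta^*\in\{-\delta,+\delta\}}\left((1-\delta^*)^{\frac{n'}{2}}\left(1+\frac{\delta^*n'}{n-n'}\right)^{\frac{n-n'}{2}}\right)$$

The same holds if we scale the vector $x$.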
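To get a feeling for the magnitude of this single-vector bound, it can be evaluated numerically. The following is a minimal sketch; the function name is hypothetical, the parameter values are arbitrary, and the computation is done in log space only to avoid overflow of the large powers.

```python
import math

def single_vector_failure_bound(n, n_proj, delta):
    """Upper bound on the probability that one vector leaves the (1 +/- delta) range.

    Sum of the two terms obtained from (9) and (10) with beta = 1 - delta and
    beta = 1 + delta. Requires delta * n_proj < n - n_proj.
    """
    def term(d):
        # log of (1 - d)^(n'/2) * (1 + d * n' / (n - n'))^((n - n')/2)
        log_t = (0.5 * n_proj * math.log(1 - d)
                 + 0.5 * (n - n_proj) * math.log1p(d * n_proj / (n - n_proj)))
        return math.exp(log_t)
    return term(delta) + term(-delta)

bound_500 = single_vector_failure_bound(n=10000, n_proj=500, delta=0.2)
bound_1000 = single_vector_failure_bound(n=10000, n_proj=1000, delta=0.2)
```

In this example, doubling the target dimensionality makes the bound much smaller, which is the effect exploited in the remainder of the derivation.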

Now if we have a sample consisting of $m$ points in space, without however a guarantee that coordinates are independent between the vectors, then we want the probability that the squared distances between all of them lie within the relative range $\pm\delta$ to be at least

$$1-\epsilon\le 1-\binom{m}{2}\left(1-\Pr\left((1-\delta)\|x\|^2\le\frac{n}{n'}\|x'\|^2\le(1+\delta)\|x\|^2\right)\right) \tag{11}$$

for some failure probability term $\epsilon$ (footnote 3: we speak about a success if all the projected data points lie within the range defined by formula (1). Otherwise we speak about failure, even if only one data point lies outside this range).

To achieve this, it is sufficient that the following holds:

$$\epsilon\ge 2\binom{m}{2}\max_{\delta^*\in\{-\delta,+\delta\}}\left((1-\delta^*)^{\frac{n'}{2}}\left(1+\frac{\delta^*n'}{n-n'}\right)^{\frac{n-n'}{2}}\right)$$

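Taking logarithms

$$\ln\epsilon\ge\ln(m(m-1))+\max_{\delta^*\in\{-\delta,+\delta\}}\left(\frac{n'}{2}\ln(1-\delta^*)+\frac{n-n'}{2}\ln\left(1+\frac{\delta^*n'}{n-n'}\right)\right)$$

$$\ln\epsilon-\ln(m(m-1))\ge\max_{\delta^*\in\{-\delta,+\delta\}}\left(\frac{n'}{2}\ln(1-\delta^*)+\frac{n-n'}{2}\ln\left(1+\frac{\delta^*n'}{n-n'}\right)\right)$$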

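We know that $\ln(1+x)<x$ for $x>-1$ and $x\neq0$ (footnote 4: please recall at this point the Taylor expansion $\ln(1+x)=x-\frac{x^2}{2}+\frac{x^3}{3}-\dots$, which converges in the range (-1,1) and hence implies $\ln(1+x)<x$ for $x\in(-1,0)\cup(0,1)$, as we will refer to it when discussing the difference to JL theorems of other authors), hence the above holds if

$$\ln\epsilon-\ln(m(m-1))\ge\max_{\delta^*\in\{-\delta,+\delta\}}\left(\frac{n'}{2}\ln(1-\delta^*)+\frac{n-n'}{2}\cdot\frac{\delta^*n'}{n-n'}\right)$$

$$\ln\epsilon-\ln(m(m-1))\ge\max_{\delta^*\in\{-\delta,+\delta\}}\left(\frac{n'}{2}\ln(1-\delta^*)+\frac{1}{2}\delta^*n'\right)=\frac{n'}{2}\max_{\delta^*\in\{-\delta,+\delta\}}\left(\ln(1-\delta^*)+\delta^*\right)$$

Recall that we also have $\ln(1-x)+x<0$ for $x<1$ and $x\neq0$, therefore

$$\max_{\delta^*\in\{-\delta,+\delta\}}\left(2\,\frac{\ln\epsilon-\ln(m(m-1))}{\ln(1-\delta^*)+\delta^*}\right)\le n'$$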

So finally, realizing that $0<-\ln(1+\delta)+\delta\le-\ln(1-\delta)-\delta$, and that $\ln(m(m-1))<2\ln(m)$, we get as a sufficient condition (footnote 5: we substituted the denominator with a smaller positive number and the numerator with a larger positive number, so that the fraction value increases and a higher $n'$ will be required than actually needed)

$$n'\ge 2\,\frac{-\ln\epsilon+2\ln(m)}{-\ln(1+\delta)+\delta}$$

Note that this expression does not depend on $n$, that is, the number of dimensions in the projection is chosen independently of the original number of dimensions (footnote 6: in passing, a similar result is claimed in Lemma 5.3 of http://math.mit.edu/~bandeira/2015_18.S096_5_Johnson_Lindenstrauss.pdf, though without an explicit proof. They propose that in order to get a failure rate below . In fact when we substitute , both formulas are the same. However, usage of allows for control of failure rate in the other theorems in this paper, while does not make this possibility obvious. Also fixing versus fixing impacts disadvantageously the growth rate of with .).
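The sufficient condition derived above is easy to evaluate. The sketch below (with a hypothetical function name and arbitrarily chosen example parameters) computes the smallest integer target dimensionality satisfying formula (4); as noted in the text, the original dimensionality $n$ does not enter the computation.

```python
import math

def min_projected_dim(m, delta, epsilon):
    """Smallest integer n' with n' >= 2 (-ln eps + 2 ln m) / (-ln(1 + delta) + delta)."""
    numerator = -math.log(epsilon) + 2 * math.log(m)
    denominator = -math.log1p(delta) + delta   # positive for delta > 0
    return math.ceil(2 * numerator / denominator)

# Example: m = 1000 points, relative error 10%, failure probability 5%.
n_proj = min_projected_dim(m=1000, delta=0.1, epsilon=0.05)

# Growth with the sample size m is only logarithmic.
dims_by_m = {m: min_projected_dim(m, 0.1, 0.05) for m in (10**2, 10**4, 10**6)}
```

For small $\delta$ the denominator behaves like $\delta^2/2$, so the required dimensionality grows roughly like $1/\delta^2$, which makes a tight error range expensive.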

So we are ready to formulate the major finding of this paper:

Theorem 6.

Let , . Let be a set of points in an -dimensional orthogonal coordinate system and let (as in formula (4))

$$n'\ge 2\,\frac{-\ln\epsilon+2\ln(m)}{-\ln(1+\delta)+\delta}$$

Let be a randomly selected (via sampling from a normal distribution) -dimensional orthogonal coordinate system. For each let be its projection onto . Then for all pairs

$$(1-\delta)\|u-v\|^2\le\frac{n}{n'}\|u'-v'\|^2\le(1+\delta)\|u-v\|^2 \tag{12}$$

holds with probability of at least $1-\epsilon$.
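The claim of Theorem 6 can be probed empirically. The following sketch is an illustration under stated assumptions, not part of the proof: it uses synthetic Gaussian data, arbitrary parameter values, hypothetical function names, and the normalized-Gaussian-row construction of the projection matrix (whose rows are only approximately orthogonal). It draws a single projection per trial and records how often inequality (12) holds for all pairs at once.

```python
import math
import numpy as np

def min_projected_dim(m, delta, epsilon):
    """Smallest integer n' satisfying formula (4)."""
    return math.ceil(2 * (-math.log(epsilon) + 2 * math.log(m))
                     / (-math.log1p(delta) + delta))

def sq_dists(A):
    """Squared Euclidean distances between all distinct pairs of rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    d = np.sum(diff ** 2, axis=-1)
    return d[np.triu_indices(len(A), k=1)]

def one_shot_success(X, n_proj, delta, rng):
    """Draw one random projection and test inequality (12) for all pairs."""
    n = X.shape[1]
    rows = rng.standard_normal((n_proj, n))
    M = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    scaled = (n / n_proj) * sq_dists(X @ M.T)
    orig = sq_dists(X)
    return bool(np.all((1 - delta) * orig <= scaled)
                and np.all(scaled <= (1 + delta) * orig))

rng = np.random.default_rng(0)
m, n, delta, epsilon = 20, 2000, 0.3, 0.1
X = rng.standard_normal((m, n))                 # synthetic data set
n_proj = min_projected_dim(m, delta, epsilon)
trials = 20
success_rate = sum(one_shot_success(X, n_proj, delta, rng)
                   for _ in range(trials)) / trials
```

The observed success rate is expected to be at least $1-\epsilon$; because the bound is derived through several relaxations, it will typically be noticeably higher.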

3 Proofs of theorems 2-5

The permissible error will surely depend on the target application. Let us consider the context of $k$-means. First we claim for $k$-means that the JL Lemma applies not only to data points but also to cluster centres.

Lemma 1.

Let , . Let be a set of representatives of elements of in an -dimensional orthogonal coordinate system and let the inequality (4) hold. Let be a randomly selected (via sampling from a normal distribution) -dimensional orthogonal coordinate system. For each let be its projection onto . Let be a partition of . Then for all data points

$$(1-\delta)\|x_i-\mu(C(i))\|^2\le\frac{n}{n'}\|x'_i-\mu'(C(i))\|^2\le(1+\delta)\|x_i-\mu(C(i))\|^2 \tag{13}$$

holds with probability of at least $1-\epsilon$.

Proof.

As we know, data points under $k$-means are assigned to clusters having the closest cluster centre. On the other hand the cluster centre is the average of all the data point representatives in the cluster.

Hence the cluster element $i$ has the squared distance to its cluster centre amounting to

$$\|x_i-\mu(C(i))\|^2=\frac{1}{|C(i)|}\sum_{j\in C(i)}\|x_i-x_j\|^2$$

But according to Theorem 6, inequality (12) holds for every pair of data points $x_i$, $x_j$.

Hence

$$(1-\delta)\|x_i-\mu(C(i))\|^2\le\frac{n}{n'}\|x'_i-\mu'(C(i))\|^2\le(1+\delta)\|x_i-\mu(C(i))\|^2$$

Note that here $\mu'(C(i))$ is not the projective image of $\mu(C(i))$, but rather the centre of projected images of cluster elements.

Lemma 1 permits us to prove Theorem 2.

Proof.

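(Theorem 2) According to formula (13):

$$(1-\delta)\|x_i-\mu(C(i))\|^2\le\frac{n}{n'}\|x'_i-\mu'(C(i))\|^2\le(1+\delta)\|x_i-\mu(C(i))\|^2$$

Hence

$$\sum_{i\in Q}(1-\delta)\|x_i-\mu(C(i))\|^2\le\sum_{i\in Q}\frac{n}{n'}\|x'_i-\mu'(C(i))\|^2\le\sum_{i\in Q}(1+\delta)\|x_i-\mu(C(i))\|^2$$

$$(1-\delta)\sum_{i\in Q}\|x_i-\mu(C(i))\|^2\le\sum_{i\in Q}\frac{n}{n'}\|x'_i-\mu'(C(i))\|^2\le(1+\delta)\sum_{i\in Q}\|x_i-\mu(C(i))\|^2$$

Based on defining equations (2) and (3) we get the formula (5)

$$(1-\delta)J(Q,C)\le\frac{n}{n'}J(Q',C)\le(1+\delta)J(Q,C)$$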

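Let us now investigate the distance between the centres of two clusters, say $C_1$, $C_2$. Let their cardinalities amount to $m_1$, $m_2$ respectively. Denote $C_{12}=C_1\cup C_2$. Consequently $m_{12}=|C_{12}|=m_1+m_2$. For a set $C_j$ let $VAR(C_j)=\frac{1}{|C_j|}\sum_{i\in C_j}\|x_i-\mu(C_j)\|^2$ and $VAR'(C_j)=\frac{1}{|C_j|}\sum_{i\in C_j}\|x'_i-\mu'(C_j)\|^2$.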

Therefore

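$$\begin{aligned}
VAR(C_{12})&=\frac{1}{|C_{12}|}\sum_{i\in C_{12}}\|x_i-\mu(C_{12})\|^2\\
&=\frac{1}{|C_{12}|}\left(\sum_{i\in C_1}\|x_i-\mu(C_{12})\|^2+\sum_{i\in C_2}\|x_i-\mu(C_{12})\|^2\right)
\end{aligned}$$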

By inserting a zero

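$$\begin{aligned}
&=\frac{1}{|C_{12}|}\left(\sum_{i\in C_1}\|x_i-\mu(C_1)+\mu(C_1)-\mu(C_{12})\|^2+\sum_{i\in C_2}\|x_i-\mu(C_{12})\|^2\right)\\
&=\frac{1}{|C_{12}|}\left(\sum_{i\in C_1}\Big(\|x_i-\mu(C_1)\|^2+\|\mu(C_1)-\mu(C_{12})\|^2+2(x_i-\mu(C_1))^T(\mu(C_1)-\mu(C_{12}))\Big)+\sum_{i\in C_2}\|x_i-\mu(C_{12})\|^2\right)\\
&=\frac{1}{|C_{12}|}\left(\sum_{i\in C_1}\|x_i-\mu(C_1)\|^2+\sum_{i\in C_1}\|\mu(C_1)-\mu(C_{12})\|^2+2\Big(\sum_{i\in C_1}x_i-\sum_{i\in C_1}\mu(C_1)\Big)^T(\mu(C_1)-\mu(C_{12}))+\sum_{i\in C_2}\|x_i-\mu(C_{12})\|^2\right)\\
&=\frac{1}{|C_{12}|}\left(\sum_{i\in C_1}\|x_i-\mu(C_1)\|^2+|C_1|\,\|\mu(C_1)-\mu(C_{12})\|^2+2\big(|C_1|\mu(C_1)-|C_1|\mu(C_1)\big)^T(\mu(C_1)-\mu(C_{12}))+\sum_{i\in C_2}\|x_i-\mu(C_{12})\|^2\right)\\
&=\frac{1}{|C_{12}|}\left(VAR(C_1)|C_1|+|C_1|\,\|\mu(C_1)-\mu(C_{12})\|^2+\sum_{i\in C_2}\|x_i-\mu(C_{12})\|^2\right)
\end{aligned}$$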

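Via the same reasoning applied to the second cluster we get:

$$\begin{aligned}
&=\frac{1}{|C_{12}|}\Big(\big(VAR(C_1)|C_1|+|C_1|\,\|\mu(C_1)-\mu(C_{12})\|^2\big)+\big(VAR(C_2)|C_2|+|C_2|\,\|\mu(C_2)-\mu(C_{12})\|^2\big)\Big)\\
&=\frac{1}{|C_{12}|}\Big(VAR(C_1)|C_1|+VAR(C_2)|C_2|+|C_1|\,\|\mu(C_1)-\mu(C_{12})\|^2+|C_2|\,\|\mu(C_2)-\mu(C_{12})\|^2\Big)
\end{aligned}$$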

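As $\mu(C_{12})$ is apparently the weighted average of the two cluster centres, that is $\mu(C_{12})=\frac{|C_1|}{|C_{12}|}\mu(C_1)+\frac{|C_2|}{|C_{12}|}\mu(C_2)$, we get

$$\begin{aligned}
&=\frac{1}{|C_{12}|}\left(VAR(C_1)|C_1|+VAR(C_2)|C_2|+|C_1|\left\|\mu(C_1)-\frac{|C_1|}{|C_{12}|}\mu(C_1)-\frac{|C_2|}{|C_{12}|}\mu(C_2)\right\|^2+|C_2|\left\|\mu(C_2)-\frac{|C_1|}{|C_{12}|}\mu(C_1)-\frac{|C_2|}{|C_{12}|}\mu(C_2)\right\|^2\right)\\
&=\frac{1}{|C_{12}|}\left(VAR(C_1)|C_1|+VAR(C_2)|C_2|+|C_1|\left\|\frac{|C_2|}{|C_{12}|}\mu(C_1)-\frac{|C_2|}{|C_{12}|}\mu(C_2)\right\|^2+|C_2|\left\|-\frac{|C_1|}{|C_{12}|}\mu(C_1)+\frac{|C_1|}{|C_{12}|}\mu(C_2)\right\|^2\right)\\
&=\frac{1}{|C_{12}|}\left(VAR(C_1)|C_1|+VAR(C_2)|C_2|+\frac{|C_1||C_2|^2+|C_1|^2|C_2|}{|C_{12}|^2}\|\mu(C_1)-\mu(C_2)\|^2\right)
\end{aligned}$$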

hence

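$$VAR(C_{12})=\frac{1}{|C_{12}|}\left(VAR(C_1)|C_1|+VAR(C_2)|C_2|+\frac{|C_1||C_2|}{|C_{12}|}\|\mu(C_1)-\mu(C_2)\|^2\right)$$

$$VAR(C_{12})\cdot m_{12}=VAR(C_1)\cdot m_1+VAR(C_2)\cdot m_2+\frac{m_1\cdot m_2}{m_{12}}\cdot\|\mu(C_1)-\mu(C_2)\|^2$$

which implies

$$VAR(C_{12})\cdot\frac{m_{12}^2}{m_1\cdot m_2}=VAR(C_1)\cdot\frac{m_{12}}{m_2}+VAR(C_2)\cdot\frac{m_{12}}{m_1}+\|\mu(C_1)-\mu(C_2)\|^2$$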
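The variance decomposition derived above can be checked numerically on synthetic data. The sketch below uses hypothetical names and arbitrary random clusters; `var` implements the mean squared distance to the centre, which is the quantity denoted VAR in the derivation.

```python
import numpy as np

def var(X):
    """Mean squared distance of the rows of X to their centre."""
    return float(np.mean(np.sum((X - X.mean(axis=0)) ** 2, axis=1)))

rng = np.random.default_rng(1)
C1 = rng.standard_normal((30, 5))              # first cluster, m1 = 30
C2 = rng.standard_normal((50, 5)) + 4.0        # second cluster, m2 = 50, shifted
C12 = np.vstack([C1, C2])
m1, m2, m12 = len(C1), len(C2), len(C12)

centre_gap_sq = float(np.sum((C1.mean(axis=0) - C2.mean(axis=0)) ** 2))
lhs = var(C12) * m12
rhs = var(C1) * m1 + var(C2) * m2 + m1 * m2 / m12 * centre_gap_sq
```

Both sides agree up to floating point rounding, independently of how balanced the two clusters are.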

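According to Lemma 1, applied to the set $C_{12}$ as a cluster,

$$\begin{aligned}
&(1-\delta)\left(VAR(C_1)\cdot\frac{m_{12}}{m_2}+VAR(C_2)\cdot\frac{m_{12}}{m_1}+\|\mu(C_1)-\mu(C_2)\|^2\right)\\
&\le\frac{n}{n'}\left(VAR'(C_1)\cdot\frac{m_{12}}{m_2}+VAR'(C_2)\cdot\frac{m_{12}}{m_1}+\|\mu'(C_1)-\mu'(C_2)\|^2\right)\\
&\le(1+\delta)\left(VAR(C_1)\cdot\frac{m_{12}}{m_2}+VAR(C_2)\cdot\frac{m_{12}}{m_1}+\|\mu(C_1)-\mu(C_2)\|^2\right)
\end{aligned}$$

and with respect to $C_1$, $C_2$ combined

$$\begin{aligned}
&(1-\delta)\left(VAR(C_1)\cdot\frac{m_{12}}{m_2}+VAR(C_2)\cdot\frac{m_{12}}{m_1}\right)\\
&\le\frac{n}{n'}\left(VAR'(C_1)\cdot\frac{m_{12}}{m_2}+VAR'(C_2)\cdot\frac{m_{12}}{m_1}\right)\\
&\le(1+\delta)\left(VAR(C_1)\cdot\frac{m_{12}}{m_2}+VAR(C_2)\cdot\frac{m_{12}}{m_1}\right)
\end{aligned}$$

Subtracting the bounds of the second inequality from those of the first, these two last inequalities mean that

$$\begin{aligned}
&-2\delta\left(VAR(C_1)\cdot\frac{m_{12}}{m_2}+VAR(C_2)\cdot\frac{m_{12}}{m_1}\right)+(1-\delta)\|\mu(C_1)-\mu(C_2)\|^2\\
&\le\frac{n}{n'}\|\mu'(C_1)-\mu'(C_2)\|^2\\
&\le 2\delta\left(VAR(C_1)\cdot\frac{m_{12}}{m_2}+VAR(C_2)\cdot\frac{m_{12}}{m_1}\right)+(1+\delta)\|\mu(C_1)-\mu(C_2)\|^2
\end{aligned}$$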

Let us assume that the quotient

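$$\frac{VAR(C_1)\cdot\frac{m_{12}}{m_2}+VAR(C_2)\cdot\frac{m_{12}}{m_1}}{\|\mu(C_1)-\mu(C_2)\|^2}\le p \tag{14}$$

where $p$ is some positive number. So we have in effect

$$(1-\delta(1+2p))\|\mu(C_1)-\mu(C_2)\|^2\le\frac{n}{n'}\|\mu'(C_1)-\mu'(C_2)\|^2\le(1+\delta(1+2p))\|\mu(C_1)-\mu(C_2)\|^2$$

Under balanced ball-shaped clusters $p$ does not exceed 1. So we have shown the following lemma: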

Lemma 2.

Under the assumptions of the preceding lemmas, for any two clusters $C_1$, $C_2$

$$(1-\delta(1+2p))\|\mu(C_1)-\mu(C_2)\|^2\le\frac{n}{n'}\|\mu'(C_1)-\mu'(C_2)\|^2\le(1+\delta(1+2p))\|\mu(C_1)-\mu(C_2)\|^2 \tag{15}$$

where $p$ depends on the degree of balance between the clusters and on the cluster shape, holds with probability of at least $1-\epsilon$.
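To make the role of the quotient in (14) concrete, the following sketch computes it for two small clusters and derives the resulting distortion factors of the squared centre distance from (15). Function names and the toy data are hypothetical; the smallest admissible $p$ for a given pair of clusters is taken to be the value of the quotient itself.

```python
import numpy as np

def var(X):
    """Mean squared distance of the rows of X to their centre."""
    return float(np.mean(np.sum((X - X.mean(axis=0)) ** 2, axis=1)))

def quotient_14(C1, C2):
    """Left-hand side of inequality (14) for clusters given as arrays of points."""
    m1, m2 = len(C1), len(C2)
    m12 = m1 + m2
    gap_sq = float(np.sum((C1.mean(axis=0) - C2.mean(axis=0)) ** 2))
    return (var(C1) * m12 / m2 + var(C2) * m12 / m1) / gap_sq

def centre_distance_factors(delta, p):
    """Lower and upper factors of inequality (15)."""
    return 1 - delta * (1 + 2 * p), 1 + delta * (1 + 2 * p)

C1 = np.array([[0.0, 0.0], [2.0, 0.0]])        # centre (1, 0), VAR = 1
C2 = np.array([[10.0, 0.0], [12.0, 0.0]])      # centre (11, 0), VAR = 1
p = quotient_14(C1, C2)                        # (2 + 2) / 100 = 0.04
lower, upper = centre_distance_factors(delta=0.1, p=p)
```

For well separated, compact clusters the quotient is small, so the distortion of centre distances is close to the distortion $\delta$ of point distances.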

Now let us consider the choice of $\delta$ in such a way that with high probability no data point will be classified into some other cluster. We claim the following:

Lemma 3.

Consider two clusters . Let , . Let be a set of points in an -dimensional orthogonal coordinate system and let the inequality (4) hold. Let be a randomly selected (via sampling from a normal distribution) -dimensional orthogonal coordinate system. For each let be its projection onto . For two clusters , obtained via -means, in the original space let be their centres and be centres to the correspondings sets of projected cluster members. Furthermore let be the distance of the first cluster centre to the common border of both clusters and let the closest point of the first cluster to this border be at the distance of from its cluster centre as projected on the line connecting both cluster centres, where .
Then all projected points of the first cluster are (each) closer to the centre of the set of projected points of the first than to the centre of the set of projected points of the second if

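$$\delta\le\frac{1-\left(1-\frac{g}{2}\right)^2}{\left(1-\frac{g}{2}\right)^2+(1+2p)}=\frac{1-\alpha^2}{(1+2p)+\alpha^2} \tag{16}$$

where $\alpha=1-\frac{g}{2}$, with probability of at least $1-\epsilon$.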
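The bound (16) is simple to evaluate for given cluster parameters. The sketch below is illustrative only: the function name is hypothetical, and the relation $\alpha=1-g/2$ is read off from the equality of the two expressions in (16).

```python
def max_delta_for_gap(g, p):
    """Largest delta permitted by inequality (16) for relative gap g and quotient bound p."""
    alpha = 1 - g / 2                          # assumption: alpha = 1 - g/2, as implied by (16)
    return (1 - alpha ** 2) / ((1 + 2 * p) + alpha ** 2)

# Example: gap equal to half of the distance between centres (g = 1), p = 1.
delta_max = max_delta_for_gap(g=1.0, p=1.0)    # 0.75 / 3.25
no_gap = max_delta_for_gap(g=0.0, p=1.0)       # clusters touch: no error is tolerated
```

The permissible error grows with the gap $g$ and shrinks with $p$, so poorly separated or unbalanced clusters require a smaller $\delta$ and hence, by formula (4), a higher target dimensionality.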

Proof.

Consider a data point $x$ "close" to the border between the two neighbouring clusters, on the line connecting the cluster centres, belonging to the first cluster, at a distance $\alpha d$ from its cluster centre, where $d$ is the distance of the first cluster centre to the border. The squared distance between cluster centres, under projection, can be "reduced" by the factor $1-\delta(1+2p)$ (beside the factor $\frac{n'}{n}$ which is common to all the points), whereas the squared distance of $x$ to its cluster centre may be "increased" by the factor $1+\delta$. This implies a relationship between the factor $\alpha$ and the error $\delta$.

If $x'$ should not cross the border between the clusters, the following needs to hold:

$$\|x'-\mu'_1\|\le\frac{1}{2}\|\mu'_2-\mu'_1\| \tag{17}$$

which implies:

 nn′∥x′−