Semi-Supervised Cluster Extraction via a Compressive Sensing Approach

# Semi-Supervised Cluster Extraction via a Compressive Sensing Approach

Ming-Jun Lai111mjlai@uga.edu. Department of Mathematics, University of Georgia, Athens, GA 30602. This research is partially supported by the National Science Foundation under the grant #DMS 1521537.    Daniel Mckenzie 222mckenzie@math.uga.edu. Department of Mathematics, University of Georgia, Athens, GA 30602. The financial assistance of the National Research Foundation of South Africa (NRF) towards this research is hereby acknowledged. Opinions expressed and conclusions arrived at, are those of the author and not necessarily to be attributed to the NRF.
###### Abstract

We use techniques from compressive sensing to design a local clustering algorithm by treating the cluster indicator vector as a sparse solution to a linear system whose coefficient matrix is the graph Laplacian. If the graph is drawn from the Stochastic Block Model we are able to prove that the fraction of misclassified vertices goes to zero as the size of the graph increases. Numerical experiments on simulated and real-life graphs demonstrate the effectiveness and speed of our approach. Finally, we explore the application of our algorithm to semi-supervised learning.

## 1 Introduction

Finding clusters is a problem of primary interest when analyzing networks. This is because vertices which are in the same cluster can reasonably be assumed to have some latent similarity. Thus, clustering techniques can be used to find communities in social networks [25, 45] functionally similar molecules in protein-protein interaction networks [33], or deduce political affiliation from a network of blogs connected by hyperlinks [3]. Moreover, even data sets which are not presented as graphs can profitably be studied by first creating an auxiliary graph (such as a -nearest-neighbors graph) and then applying graph clustering techniques. This has been successfully applied to image segmentation [42], natural language processing [19] and differentiating types of breast cancer [21].

We shall informally think of a cluster as a subset of vertices, with many edges between vertices in , and few edges to the rest of the graph, . For a toy example, consider the college football network of Girvan and Newman [25], represented in Figure 1. The vertices of this network correspond to the colleges fielding (American) football teams that played in NCAA Division 1A in Fall 2000. Two vertices are connected by an edge if they played against one another during the regular season. As can be seen from either the graph or the adjacency matrix, this graph contains clusters. In this case, the underlying similarity responsible for the clusters are the conferences to which the teams belong. Despite the simplicity of this graph, it exhibits two subtle clustering related phenomena. The first is the presence of background vertices, illustrated in black. These correspond to the five independent schools - Central Florida, Connecticut, Navy, Notre Dame and Utah State. These schools do not belong to any conference, and thus should not be placed into any cluster. The second is the presence of clusters at multiple scales. For example, the cluster corresponding to the South Eastern Conference (shown in red) could be further divided into two equally sized sub-clusters, both of which form cliques. In the context of this problem, this would reveal further valuable information, as the two sub-clusters correspond to the East and West Divisions of this Conference. Hence it is of practical importance to have clustering algorithms which can be set to find clusters of different sizes, and which are not forced to assign background vertices to a cluster.

Of course many real-world graphs of interest today are significantly larger than the college football network. For truly massive graphs it can be computationally intractable to partition the entire vertex set into clusters. Moreover, if one is only interested in the cluster containing several vertices of interest, this is unnecessary. Thus, in the last decade or so, there has been intensive research into local clustering algorithms (see, for example, [43, 27, 31, 38]) loosely defined to be algorithms with complexity proportional to the size of the cluster, not the whole graph.

In this paper we introduce a two-step local clustering algorithm, drawing on ideas from the signal processing field of compressive sensing. Our algorithm, which we call Semi-Supervised Cluster Pursuit (SSCP), is computationally efficient, provably accurate, able to find clusters at multiple scales and is not confounded by the presence of background vertices. We prove that for graphs drawn from the Stochastic Block Model (SBM) our algorithm misclassifies at most vertices, where is the size of the cluster of interest. We further show that, under certain assumptions on the parameters of the SBM, SSCP runs in operations. Finally we verify, via extensive experimentation on real and artificial graphs, that the performance of our algorithm is comparable, and some cases exceeds, that of many state-of-the-art algorithms. In the interest of reproducibility, we make all our code available at: danielmckenzie.github.io.

The rest of this paper is laid out as follows. In the remainder of §1, we introduce some notation and review the existing literature. In §2 we introduce the SSCP algorithm and include a brief overview of the theory of Compressive Sensing. Most of the technical work of this paper is in §3, where we prove the weak consistency of SSCP. We relegate several particularly technical results to an appendix. Finally in §4 we provide extensive numerical experiments.

### 1.1 Notation and Definitions

We restrict our attention to finite, simple, undirected graphs , possibly with edge weights. We identify the vertex set with the integers and denote an edge between vertices and as . The (possibly weighted) adjacency matrix of will be denoted as . By we mean the degree of the -th vertex, computed as . For quantities such as (and later and ) that are indexed by , let and similarly . Denote by the diagonal matrix whose entry is .

###### Definition 1.1 (Laplacians of graphs).

The normalized, random walk Laplacian is defined as . We shall simply refer to it as the Laplacian. The signless Laplacian is defined as while the normalized, symmetric Laplacian is: .

For any , we denote by the induced sub-graph with vertices and edges all with . For any we define an indicator vector by if and otherwise. will always denote the cardinality of . For any matrix , by we mean the submatrix of consisting of the columns for all . Suppose for every we have a probabilistic model of graphs on vertices containing a cluster , for example the stochastic block model introduced in the next section. Let be any algorithm for graph clustering problem with output . We say that is weakly consistent if

 P[∣∣C#ΔC(n)∣∣/∣∣C(n)∣∣≤o(1)]=1−o(1),

where for any two sets and , denotes their symmetric difference. Note this is analogous to the almost exact recovery condition for partitioned clustering given in [1]. See [27] for a slightly different formulation of this problem.

### 1.2 Random Graphs

In order to study how well our algorithm performs, it is useful to have a statistical model of graph with latent clusters. The model we shall use in this paper is the Stochastic Block Model (SBM). As pointed out elsewhere (for example in [1]), the SBM strikes a good balance between theoretical tractability and realistically modelling real-world networks.

###### Definition 1.2.

Let be a vector of positive integers, and let be a symmetric matrix with for all . We say a graph is drawn from (and shall write ) if there exists a latent partition with such that any vertices and are connected by an edge with probability , and all edges are inserted independently.

In [1] and elsewhere, a slightly more general definition is given where it is only required that the expected value of is , but the above shall suffice for our purposes. In the special case where all the are equal, for all and for all we say that is drawn from the Symmetric Stochastic Block Model, and write . In this case the clusters are all of size . We will also use a simpler model of random graph, the Erdős - Rènyi (ER) graph.

###### Definition 1.3.

We say is drawn from (written ) if for .

Note that if then for all . We shall use this simple observation repeatedly.

###### Remark 1.4.

Certainly, the Stochastic Block Model is not the only model of random graph studied with regards to clustering. In [32], Lancichinetti, Fortunato and Radicchi proposed a set of models designed to display certain phenomena — such as overlapping communities and a wide range of degrees — that are observed in real-world networks. In [5], random graphs are generated using a preferential attachment rule, generating a power-law degree distribution, which is often empirically observed in real-world networks. It would be an interesting topic for future research to investigate how our algorithm applies to such models.

### 1.3 Some Existing Related Work

Local community detection algorithms (also known as Cluster Extraction algorithms in the statistics literature) seek to find a “good” cluster given a set of seed vertices . In the computer science literature it is usually required that (this is the case for HKGrow and Losp++) while in the statistics literature this is not always the case (see ESSC). If desired, this procedure can be iterated a (possibly predefined, possibly data-determined) number of times, finding clusters while not requiring that they cover the vertex set. Depending on the algorithm, the may be allowed to overlap. The set of vertices not assigned a cluster is referred to as the background vertices. That is, . We review several such algorithms here.

##### The Extraction of Statistically Significant Communities (Essc) algorithm

The key insight behind this approach is to view communities as fixed points of the update rule:

 S(B):={u∈V: u is strongly connected to B} where B⊂V (1)

In [46] the idea of a vertex being strongly connected to a set is formalized as a procedure analogous to a statistical -test. Precisely, denote by the graph under consideration, and let denote the number of edges between a vertex and a set of vertices . Assume a null-model for graphs, on the same vertex set and with the same degree sequence, but without any a priori cluster structure. Let be a random variable denoting the number of edges between and for graphs drawn from . If the probability of being larger than the value is smaller than some threshold value (usually taken to be ) then say that is strongly connected to . Thus (1) can be written as:

 S(B)={u∈V: P[^d(u:B)≥d0(u:B)]≤α}. (2)

The authors in [46] show that, if is taken to be the configuration model, then is approximately a binomial random variable, hence the probability in the update rule can be easily computed. The algorithm is initialized with a set of seed vertices consisting of the highest degree vertex and its neighbors. The update rule (2) is then used: , until or a maximum number of iterations is reached. This resulting cluster is then removed and the process may be repeated, terminating when the empty set is returned as a fixed point of the update rule (2). No theoretical guarantee of success is given in [46], but experimental results suggest that the algorithm works well.

##### The HKGrow algorithm

This algorithm, introduced in [31], is part of a family of cluster extraction algorithms known as diffusion methods. HKGrow is based on the idea that if one unit of heat is initially distributed over a small set of seed vertices, and then allowed to spread over the graph via the heat equation, it will concentrate in the cluster containing the seed vertices. More formally, for any seed set , let and define , for an appropriate value of to be specified by the user. Normalize by degree: , and let be a permutation of such that . HKGrow returns the cluster defined as where

 k∗=arg max{Cond({j1,…,jk}) for% k=1,…,n} (3)

For any subset of vertices , denotes its conductance, defined as follows. Let denote the boundary of and let denote its volume, then . From work of Chung [15] it is known that if is contained in a set of low conductance then will be of similarly low conductance. Experimental results provided in [31] verify this, and show that the performance of HKGrow is on par with the Pagerank diffusion method of [4].

##### The Losp++ algorithm

This algorithm is a representative of the family of Local Spectral Methods (see also LEMON [34] and LOSP [26]). LOSP++, introduced in [27], works as follows. Given a set of seed vertices , first extract a subgraph from which is very likely to contain the community which contains . Let denote the adjacency matrix of and denote by the random walk transition matrix . Define and let denote the distribution of the -th step of the random walk with initial distribution . For small values of and , to be fixed by the user, construct the matrix . Now let denote the solution to the linear programming problem:

 argmin ∥y∥1 such that: y∈range(Vkd), y≥0, yi≥1/|S| for all i∈S.

For a user specified size parameter , define to be the set of indices of the largest entries in . In [27] both theoretical and experimental arguments that will be a low conductance cluster containing are given.

There are certainly other algorithms that fall under the local community detection/ cluster extraction umbrella, such as Nibble [43], algorithms which seek to optimize a local modularity score [48] and locally-biased spectral methods [38].

#### 1.3.1 Fundamental Bounds for Recovery

Recent work of Abbe, Sandon and others has culminated in a theoretical bound beyond which it is impossible to detect cluster membership in the SBM with accuracy better than that of a random guess:

###### Theorem 1.5 (See [2]).

Exact recovery in the is solvable if and not solvable if Moreover, when exact recovery is possible, there exist efficient algorithms to do so.

There exist analogous statements for graphs drawn from the non-symmetric block model. This motivates us to consider values of and of the form in our theoretical analysis of SSCP (see §3) although our current analysis requires an additional factor in , where is any function of such that . In our numerical experiments, we take . Removing this extra factor is an interesting problem for future research.

## 2 The SSCP Algorithm

Our algorithm was inspired by a serendipitous observation that the problem of determining the indicator vector, , of a cluster can be rephrased as a compressive sensing problem. Before elaborating on this, let us briefly review some of the pertinent results of this field of signal processing.

### 2.1 Compressive Sensing

Candés, Donoho and their collaborators in [20, 11, 12] initialized the study of compressive sensing, which offers theoretical analysis and algorithmic tools for solving the minimization problem:

 argmin∥Φx−y∥2 subject to ∥x∥0≤s (4)

In the case where with , making the linear system underdetermined. For any , define . The matrix is typically referred to as a sensing matrix. There are many algorithms (e.g. [6, 7, 10, 23]) to solve problem (4), but the one we shall focus on is the SubspacePursuit algorithm introduced in [18]:

Here and are thresholding operators:

 Ls(v):={i∈[n]: vi among s % largest-in-magnitude entries in v} Hs(v)i:={vi if i∈Ls(v)0 otherwise

In quantifying when (4) has a unique solution, the following constant is often used (see [22])

###### Definition 2.1.

The Restricted Isometry Constant (-RIC) of , written , is defined to be the smallest value of such that, for all with , we have:

 (1−δ)∥x∥22≤∥Φx∥22≤(1+δ)∥x∥22.

If we often say that has the Restricted Isometry Property (RIP).

###### Lemma 2.2.

For any with one can easily check that .

###### Proof.

This follows most easily from an alternative characterization of (see Chpt. 6 of [22]): that is, Indeed, we have

 δs(ΦΩ)=maxS′⊂Ω,|S′|≤s∥Φ⊤S′ΦS′−I∥2≤maxS⊂[n],|S|≤s∥Φ⊤SΦS−I∥2=δs(Φ)

One of the reasons for the remarkable usefulness of compressive sensing is its robustness to error, both additive (i.e. in ) and multiplicative (i.e. in ). More precisely, suppose that a signal is acquired, but that we do not know the sensing matrix precisely. Instead, we have access only to , for some small perturbation . This models the scenario where a sensing matrix is designed, and then implemented in hardware (for example as an MRI coil) where a certain amount of error becomes unavoidable. Suppose further that there is a small amount of noise in the measurement process, so that the signal we actually receive is . Can one hope to approximate a sparse vector from well, given only ? This question is answered in the affirmative by several authors, starting with the work of [28]. For SubspacePursuit, we have the following result of Li:

###### Theorem 2.3.

Let , , and be as above and suppose that . For any , let . Define the following constants:

 ϵy:=∥e∥2/∥^y∥2 and ϵsΦ=∥M∥s2/∥^Φ∥s2

where for any matrix , . Define further:

 ρ=√2δ23s(1+δ23s)1−δ23s and τ=(√2+2)δ3s√1−δ23s(1−δ3s)(1−ρ)+2√2+1(1−δ3s)(1−ρ)

Assume and let be the output of SubspacePursuit applied to problem (4) after iterations. Then:

 ∥x∗−xm∥2∥x∗∥2≤ρm+τ√1+δs1−ϵsΦ(ϵsΦ+ϵy).
###### Proof.

This is Corollary 1 in [35]. Note that our convention on hats is different to theirs — our is their , hence our is their and so on. ∎

### 2.2 Cluster Extraction as Compressive Sensing

The eigenvectors of the Laplacian are the key ingredient in Spectral Clustering algorithms. The following theorem is usually used in theoretical justifications of their success:

###### Theorem 2.4.

Let denote the connected components of a graph . Then the cluster indicator vectors form a basis for the kernel of .

###### Proof.

See proposition 4 of [36]. ∎

Now suppose that has clusters . By definition, clusters have few edges between them, and so it is useful to write as the union of two edge-disjoint subgraphs, defined as follows: let have only in-cluster edges, and let consist only of the out-of-cluster edges, and . Denote by and (resp. and ) the adjacency matrix and Laplacian of (resp. ). Similarly, (resp. ) shall denote the degree of the vertex in the graph (resp. ). Note that are now the connected components of , and so for all .

As we have and . For future reference, define . It is not the case that , but we shall show in §3 that with . Without loss of generality assume that and denote . Let (resp. , and ) denote the -th column of (resp. , and ). Then:

 0=Lin1C1=[ℓin1,Lin−1][11C1∖{1}]=ℓin1+L%in−11C1∖{1} (5)

or in other words, is a solution to the linear system . This system is underdetermined, but crucially . That is, as long as is not too large, is sparse. Thus we may hope to recover exactly by solving the problem:

 argmin{∥Lin−1x+ℓin1∥2 subject to ∥x∥0≤n1−1}.

Of course we do not have access to . Instead, we have , a noisy version of . However, given that , we may hope to use the results of §2.1, particularly Theorem 2.3, to show that if is the solution to:

 argmin{∥L−1x+ℓ1∥2 subject to ∥x∥0≤n1−1} (6)

then . Unfortunately problem (6) turns out to be poorly conditioned, as . Thus, we propose a two-stage approach. In the first stage (Algorithm 2) we determine a superset of size while in the second stage (Algorithm 3) we extract from by solving a compressive sensing problem to find a vector supported on the complement of in . Specifically, observe that if , then . It follows that:

 LinΩ1Ω=LinΩ(1C1+1Ω∖C1)=0+LinΩ1Ω∖C1⇒LinΩ1Ω∖C1=LinΩ1Ω. (7)

Equivalently, if then is the solution to

 argmin{∥LinΩx−yin∥2: ∥x∥0≤ϵn1} (8)

This problem is better conditioned, as we shall show that . Clearly once is known, we can find as . In §3, we shall show that if we replace and with and and let denote the solution to:

 argmin{∥LΩx−y∥2: ∥x∥0≤ϵn1} (9)

Then and . We now describe our algorithm in pseudocode. In line 3 of Algorithm 2, denotes the thresholding operator defined as

 ~Ls(v)={i∈[n]: vi among s largest % entries in v}.
###### Remark 2.5.

Several comments on the parameters of Algorithms 2 and 3 are in order. A natural choice of is , in which case is simply the (non-negative) support of . If is known, then setting in Algorithm 2 and in Algorithm 4 is natural, as . In practice, the size of is only approximately known, and we have found greater success with setting to be an upper bound on the expected size of , while setting and . This allows ClusterPursuit to explore a greater range of cluster sizes, as is between and for any , hence is between and . That can be taken to be will follow from the proof of Theorem 3.15. In practice, we set .

## 3 Theoretical Analysis

In this section we prove that SSCP is weakly consistent for the SSBM. Without loss of generality we assume we are trying to extract . Our main result is:

###### Theorem 3.1.

Let with for such that as , for constant and . Let be a set of vertices drawn uniformly at random from , where is independent of . Fix any , set , and . Let denote the output of SSCP run with these inputs. Then:

 P⎡⎢⎣∣∣C1ΔC#1∣∣|C1|≤o(1)⎤⎥⎦=1−o(1).
###### Proof.

In Theorem 3.8 we show that Algorithm 2 returns an containing a fraction of the vertices of with probability . Theorem 3.15 will then show that given such an , ClusterPursuit will output a cluster such that with probability , completing the proof. ∎

Henceforth, when an event happens with probability , we shall say it happens almost surely, or a.s.. Note that if a finite collection of events happen almost surely, then their intersection also occurs almost surely. We shall use this observation repeatedly.

### 3.1 Concentration in Erdős - Rènyi Graphs

The proof of Theorem 3.1 relies on two concentration phenomena in Erdős - Rènyi graphs. The first is that the maximum and minimum degrees of an Erdős - Rènyi graph are within a small deviation of their expected value, a.s. The second is that the second eigenvalue of the Laplacian of an ER graph is within an term of its expected value, a.s.

###### Theorem 3.2 (see [8, 9]).

Let with . There exist a function satisfying and such that

 dmax(G)=(1+ηΔ(b))blogn+o(1)≤2blog(n)+o(1) a% .s.
###### Theorem 3.3 (see [24], Theorem 3.4 (ii)).

If with where , then and a.s.

###### Theorem 3.4.

Suppose that with where . Then we have almost surely (1) ; (2) for ; and (3) for all .

###### Proof.

See Theorems 3 and 4 in [16]. In their notation, . Their results refer to , but one can easily show that and have the same spectrum. ∎

### 3.2 Reducing from the SBM to the ER model

Let and be as in §2.2. If then consists of disjoint i.i.d graphs, . The graph is not an Erdős - Rènyi graph, as there is probability of it containing an edge between two vertices in the same cluster (because we have removed them). However, we can profitably think of as a subgraph of some . In particular, any upper bounds on the degrees of vertices in are automatically bounds on the degrees in . Thus, we have the following corollaries of Theorems 3.3 and 3.2:

If with , a.s.

###### Proof.

Consider as a subgraph of and apply Theorem 3.2

###### Corollary 3.6.

If with and where , then and a.s.

###### Proof.

If then , where . Clearly:

By Theorem 3.3, a.s. Note that the are i.i.d random variables, and since we are taking a maximum over of them, it follows that a.s. Moreover, as , . The proof for is similar. ∎

###### Corollary 3.7.

with where , and . Recall that . Then a.s.

###### Proof.

First of all, it is clear that for any , . From Corollaries 3.5 and 3.6 we have:

 doutmaxdinmin ≤2blogn+o(1)(1−o(1))ωlog(n0)=2blogn+o(1)(1−o(1))ω(log(n)−log(k)) as n=kn0 =2b+o(1)(1−o(1))ω(1−o(1))=o(1) since k=O(1)% and ω→∞

### 3.3 Reliably Finding Supersets

Let denote the output of Algorithm 2, run with inputs as in Theorem 3.1. Further, let denote the “missed” indices, and denote the “bad” indices (i.e. vertices in that are not in ). Let , in which case , as by construction . We prove that :

###### Theorem 3.8.

Let with , with and . Let with for some constant . For any , if is the output of Algorithm 2, with inputs , and , then .

###### Proof.

As in line 3 of Algorithm 2, define , where . Observe:

 ((L+)Tℓ+j)i=⟨ℓ+i,ℓ+j⟩=(1di+1dj)Aij+n∑k=1AikAkjd2k. (11)

By the definition of the thresholding operator , we must have for every and . We sum first over and then sum over to have

 (ϵ+u)n0vi≤∑j∈Wvj and (ϵ+u)n0∑i∈Uvi≤un0∑j∈Wvj,

respectively. It follows that:

 ∑i∈Uvi≤uϵ+u∑j∈Wvj≤∑j∈Wvj. (12)

Looking ahead, we shall show that if inequality (12) holds then . Now:

From equation (11) we deduce that . Moreover:

 n∑k=1AikAkjd2k≥1d2maxn∑k=1AikAkj≥1d2max∑k∈C1AikAkj

and so:

 ∑i∈Uvi≥1d2max∑i∈U∑j∈Γ∑k∈C1AikAkj (13)

The triple sum above is precisely the number of length two paths from to contained in the Erdős - Rènyi graph . In [14] a neat formula for this quantity, which they call it , is given. Specifically, they show that for any family of graphs such that for we have and for , then for any :

 ∣∣e2(X,Y)−p2n|X||Y|∣∣=o(p2n3)

As the aforementioned condition on the eigenvalues of holds for a.s. (see Theorem 3.4) we conclude that

 ∑i∈U∑j∈Γ∑k∈C1AikAkj =e2(U,Γ)≥p2n0|U||Γ|−o(p2n30)%a.s. =(ω2log2(n0)n20)n0(un0)(gn0)−o(ω2log2(n0)n20n30)