Connectivity in Random Annulus Graphs and the Geometric Block Model

Sainyam Galhotra  Arya Mazumdar  Soumyabrata Pal  Barna Saha College of Information and Computer Sciences, University of Massachusetts Amherst, Amherst, MA 01003, {sainyam,arya,spal,barna}@cs.umass.edu. This research is supported by NSF awards CCF 1642550, CCF 1464310 and CAREER awards 1642658 and 1652303.
Abstract

Random geometric graphs are the simplest, and perhaps the earliest, random graph model of spatial networks, introduced by Gilbert in 1961. In the most basic setting, a random geometric graph has $n$ vertices. Each vertex of the graph is assigned a real number in $[0,1]$ randomly and uniformly. There is an edge between two vertices if the corresponding two random numbers differ by at most $r$ (to mitigate the boundary effect, we consider the Lee distance here, $d_L(u,v) = \min\{|u-v|, 1-|u-v|\}$). It is well known that the connectivity threshold regime for random geometric graphs is at $r \approx \frac{\log n}{n}$. In particular, if $r = \frac{a\log n}{n}$, then a random geometric graph is connected with high probability if and only if $a > 1$. Consider a random geometric graph with $r = \frac{(1+\epsilon)\log n}{n}$ for any $\epsilon > 0$, which satisfies the connectivity requirement, and delete half of its edges, namely those which have distance at most $\frac{\log n}{2n}$. It is natural to believe that the resultant graph will be disconnected. Surprisingly, we show that the graph still remains connected!

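Formally, generalizing random geometric graphs, we define a random annulus graph with $n$ vertices. Each vertex of the graph is assigned a real number in $[0,1]$ randomly and uniformly as before. There is an edge between two vertices if the Lee distance between the corresponding two random numbers is between $r_1$ and $r_2$, $0 \le r_1 < r_2$. Let us assume $r_1 = \frac{b\log n}{n}$ and $r_2 = \frac{a\log n}{n}$, $0 \le b < a$. We show that this graph is connected with high probability if and only if $a - b > \frac{1}{2}$ and $a > 1$. That is, the graph with connectivity interval $[0, \frac{0.99\log n}{n}]$ is not connected, but the graph with connectivity interval $[\frac{0.50\log n}{n}, \frac{(1+\epsilon)\log n}{n}]$ is.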

This result is then used in finding the recovery threshold of the geometric block model, a random graph model for real-world communities that is based on random geometric graphs and is analogous to the popular stochastic block model. The geometric block model is a basic random graph model for spatial communities and, as shown in an earlier work, is a better model than the stochastic block model for a variety of community detection problems, especially as it captures correlated edge formation. In a geometric block model with two equally sized communities, each community is represented by a random geometric graph with radius $r_s$, whereas the inter-cluster edges are formed according to a radius $r_d$, $r_d < r_s$. In the regime where $r_s = \frac{a\log n}{n}$ and $r_d = \frac{b\log n}{n}$, we study the necessary and sufficient conditions (on $a, b$) for exact recovery of the two parts from the graph. To show sufficiency, we provide an efficient algorithm for recovery of the partition, a significant improvement on previously available results. The analysis of the algorithm crucially uses the aforementioned curious threshold phenomenon of random annulus graphs. The necessary condition that we provide is also the first nontrivial lower bound on this problem.

1 Introduction

Models of random graphs are ubiquitous, with Erdős-Rényi graphs (Erdős and Rényi, 1959; Gilbert, 1959) at the forefront. Studies of the properties of random graphs have led to many fundamental theoretical observations as well as many engineering applications. The introduction of random geometric graphs (RGG) shortly followed that of Erdős-Rényi graphs (Gilbert, 1961), and they constitute the first and simplest model of spatial networks. The one-dimensional RGG is defined in the following way. It is a graph with $n$ vertices. Each vertex $u$ is assigned a random number $X_u$ selected randomly and uniformly from $[0,1]$. Two vertices $u$ and $v$ are connected by an edge if and only if $d_L(X_u, X_v) \le r$. This model can straightforwardly be extended to the higher dimensional case, where vertices are assigned random (possibly non-uniform) points from some closed compact subset of the Euclidean space and two vertices have an edge between them if the inner product between the corresponding vectors is higher than some prescribed value. Random geometric graphs have several desirable properties that model real human social networks, such as vertices with high modularity and the degree associativity property (high degree nodes tend to connect). This has led RGGs to be used as models of disease outbreak in social networks (Eubank et al., 2004) and flow of opinions (Zhang et al., 2014). RGGs are also a popular model for wireless (ad-hoc) communication networks (Dettmann and Georgiou, 2016; Haenggi et al., 2009). Recent works on RGGs also include hypothesis testing between an Erdős-Rényi graph and a random geometric graph (Bubeck et al., 2016).

Threshold properties of random graphs (especially Erdős-Rényi graphs) have been at the center of much theoretical interest, and in particular it is known that many graph properties exhibit sharp phase transition phenomena (Friedgut and Kalai, 1996). Random geometric graphs also exhibit similar threshold properties; in particular, the connectivity threshold for RGGs is known to be at $r = \frac{\log n}{n}$ for the one-dimensional model defined above (Penrose, 2003).

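Consider the random geometric graph ${\rm RGG}(n,r)$ defined above with $r = \frac{a\log n}{n}$ (the base of the logarithm is $e$ unless otherwise mentioned). The threshold property says that ${\rm RGG}(n, \frac{a\log n}{n})$ is connected with high probability if and only if $a > 1$ (that is, ${\rm RGG}(n, \frac{(1+\epsilon)\log n}{n})$ is connected for any $\epsilon > 0$; we will ignore this $\epsilon$ and simply refer to the connectivity threshold as $\frac{\log n}{n}$). Now let us consider a variant of ${\rm RGG}(n, \frac{a\log n}{n})$ where, instead of two points having an edge when $d_L(u,v) \le \frac{a\log n}{n}$, two points have an edge between them if and only if $\frac{\log n}{2n} \le d_L(u,v) \le \frac{a\log n}{n}$. Clearly this graph has fewer edges than ${\rm RGG}(n, \frac{a\log n}{n})$. Is this graph still connected? Surprisingly, we show that the above modified graph remains connected as long as $a > 1$. Note that, on the other hand, ${\rm RGG}(n, \frac{a\log n}{n})$ is not connected for any $a < 1$.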

To formalize this point, let us define a mild generalization of RGG, called the random annulus graph (RAG). Given two numbers $r_1 \le r_2$, the random annulus graph ${\rm RAG}(n,[r_1,r_2])$ is a random graph with $n$ vertices. Each vertex $u$ is assigned a random number $X_u$ selected randomly and uniformly from $[0,1]$. Two vertices $u$ and $v$ are connected by an edge if and only if $r_1 \le d_L(X_u,X_v) \le r_2$. Therefore, for $r_1 = 0$, the random annulus graph ${\rm RAG}(n,[0,r_2])$ is simply the random geometric graph ${\rm RGG}(n,r_2)$. Random annulus graphs have been previously mentioned in (Dettmann and Georgiou, 2016). The interval $[r_1,r_2]$ is called the connectivity interval in RAG.

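Now consider a ${\rm RAG}(n,[r_1,r_2])$ when $r_1 = \frac{b\log n}{n}$ and $r_2 = \frac{a\log n}{n}$. We show that when $0 \le b < a$, the random annulus graph ${\rm RAG}(n,[\frac{b\log n}{n},\frac{a\log n}{n}])$ is connected with high probability if and only if $a - b > \frac{1}{2}$ and $a > 1$. This means the graphs ${\rm RAG}(n,[0,\frac{0.99\log n}{n}])$ and ${\rm RAG}(n,[\frac{0.50\log n}{n},\frac{0.99\log n}{n}])$ are not connected with high probability, whereas ${\rm RAG}(n,[\frac{0.50\log n}{n},\frac{(1+\epsilon)\log n}{n}])$ is connected. Note that, since we are using the Lee distance (or geodesic distance), this fact cannot be intuitively justified by boundary effects. For a depiction of the connectivity regime for the random annulus graph see Figure 1.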

Can we explain this seemingly curious shift in connectivity interval, when one goes from to ? Compare the RAG with the . The former can be thought of as being obtained by deleting all the ‘short-distance’ edges from the latter. It turns out that the ‘long-distance’ edges are sufficient to maintain connectivity, because they can connect points over multiple hops in the graph. This intriguing observation of the long-edge phenomenon in random geometric graphs is one of the major contributions of this paper.

We are motivated to study the threshold phenomena of random annulus graphs, because it appears naturally in the analysis of the geometric block model (GBM) (Galhotra et al., 2018). The geometric block model is a probabilistic generative model of communities in a variety of networks and is a spatial analogue to the popular stochastic block model (SBM) (Holland et al., 1983; Dyer and Frieze, 1989; Decelle et al., 2011; Abbe and Sandon, 2015; Abbe et al., 2016; Hajek et al., 2015; Chin et al., 2015; Mossel et al., 2015; Agarwal et al., 2017). The SBM generalizes the Erdős-Rényi graphs in the following way. Consider a graph , where is a disjoint union of clusters denoted by The edges of the graph are drawn randomly: there is an edge between and with probability Given the adjacency matrix of such a graph, the task is to find exactly (or approximately) the partition of .

This model has been incredibly popular both in theoretical and practical domains of community detection. Recent theoretical works focus on characterizing the sharp threshold for recovering the partition in the SBM. For example, when there are only two communities of exactly equal sizes, the inter-cluster edge probability is $\frac{b\log n}{n}$ and the intra-cluster edge probability is $\frac{a\log n}{n}$, it is known that exact recovery is possible if and only if $\sqrt{a} - \sqrt{b} > \sqrt{2}$ (Abbe et al., 2016; Mossel et al., 2015). The regime of the probabilities being $\Theta(\frac{\log n}{n})$ has been put forward as one of the most interesting ones, because in an Erdős-Rényi random graph this is the threshold for graph connectivity (Bollobás, 1998). Note that the results are not only of theoretical interest: many real-world networks exhibit a “sparsely connected” community feature (Leskovec et al., 2008), and any efficient recovery algorithm for sparse SBM has many potential applications.

While SBM is a popular model (because of its apparent simplicity), many aspects of real social networks, such as the “transitivity rule” (‘friends having common friends’) inherent to many social and other community structures, are not accounted for in SBM. To address this, in a previous work, we proposed a random graph community detection model analogous to the stochastic block model, which we call the geometric block model (GBM) (Galhotra et al., 2018). The GBM depends on the basic definition of the random geometric graph in the same way the SBM depends on Erdős-Rényi graphs. The two-cluster GBM with vertex set $V = V_1 \sqcup V_2$ is a random graph defined in the following way. Suppose $r_d < r_s$ are two real numbers. For each vertex $u \in V$, randomly and independently choose a number $X_u \in [0,1]$ according to the uniform distribution. There will be an edge between $u$ and $v$ if and only if,

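\[
\begin{aligned}
d_L(X_u,X_v) &\le r_s \quad \text{when } u,v\in V_1 \text{ or } u,v\in V_2,\\
d_L(X_u,X_v) &\le r_d \quad \text{when } u\in V_1, v\in V_2 \text{ or } u\in V_2, v\in V_1.
\end{aligned}
\]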

Let us denote this random graph as ${\rm GBM}(r_s, r_d)$. Given this graph, the main problem of community detection is to find the parts $V_1$ and $V_2$.

It has been shown in (Galhotra et al., 2018) that GBM accurately represents (more so than SBM) many real-world networks such as Amazon product metadata or DBLP citation and collaboration networks. A simple motif-counting algorithm was also provided in (Galhotra et al., 2018), that provably works well in the sparse regime of and real sparse networks. In contrast, simple motif counting can easily be seen to perform badly in SBM, which gives some further validation of GBM for real networks, where such algorithms are often very effective.

Motivated by the SBM literature, we here also look at GBM in the connectivity regime, i.e., when $r_s = \frac{a\log n}{n}$ and $r_d = \frac{b\log n}{n}$. Our first contribution in this part is a lower bound showing that it is impossible to recover the parts from ${\rm GBM}(\frac{a\log n}{n}, \frac{b\log n}{n})$ when $a - b < \frac{1}{2}$ or $a < 1$. We also provide a simple, intuitive, and efficient triangle-counting algorithm that leads to a significant improvement over the algorithm proposed in (Galhotra et al., 2018). The algorithm simply counts the number of common neighbors for each pair of vertices in the GBM connected by an edge, and deletes some edges when the number of common neighbors is below some threshold. In the next step of the algorithm we just find connected components in the redacted graph. The relation between $a$ and $b$ that defines a sufficient condition of recovery in ${\rm GBM}(\frac{a\log n}{n}, \frac{b\log n}{n})$ has been derived as one of the main results of this paper (see Theorem 2).

To analyze the algorithm proposed, we need to crucially use the results obtained for the connectivity of random annulus graphs. Indeed, we need to go beyond the scenario for RAG when the connectivity interval can be disjoint, such as where .

It is possible to generalize the GBM to include different distributions, different metric spaces, multiple parts and higher dimensions. It is also possible to construct other types of spatial block models, such as the one recently put forward in (Sankararaman and Baccelli, 2018), which rely on the random dot product graphs (Young and Scheinerman, 2007). In (Sankararaman and Baccelli, 2018), edges are drawn between vertices randomly and independently as a function of the distance between the corresponding vertex random variables. In contrast, in GBM edges are drawn deterministically given the vertex random variables, and edges are dependent unconditionally. Sankararaman and Baccelli (2018) also consider the recovery scenario where, in addition to the graph, the values of the vertex random variables are provided. In GBM, we only observe the graph. In particular, it will become clear later from our derivations that if we are given the corresponding random variables (locations) for the vertices in addition to the graph, then recovery of the partitions in is possible if and only if .

The paper is organized as follows. In Section 2, we provide the main results of the paper formally. In Section 3, the sharp connectivity phase transition results for random annulus graphs are proven. In Section 4, a lower bound for the geometric block model as well as the main recovery algorithm and its analysis are presented.

2 Notations and Main Results

We start this section by formally defining the random geometric graphs.

Definition 1.

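A random geometric graph ${\rm RGG}(n,r)$ on $n$ vertices has parameters $n$ and a real number $r \in [0,1]$. It is defined by assigning a number $X_i \in [0,1]$ to vertex $i$, $1 \le i \le n$, where the $X_i$ are independent and identical random variables uniformly distributed in $[0,1]$. There will be an edge between vertices $i$ and $j$, $i \ne j$, if and only if $d_L(X_i,X_j) \le r$, where $d_L(x,y) = \min\{|x-y|, 1-|x-y|\}$ is the Lee distance.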

One can think of the random variables $X_i$, $1 \le i \le n$, as uniformly distributed on the perimeter of a circle with radius $\frac{1}{2\pi}$ and the distance $d_L$ as the geodesic distance. It will be helpful to consider vertices as just random points on $[0,1]$. Note that every point has a natural left direction (if we think of them as points on a circle then this is the counterclockwise direction) and a right direction. As a shorthand, for any two vertices $u, v$, let $d(u,v)$ denote $d_L(X_u,X_v)$, where $X_u, X_v$ are the random values corresponding to the vertices $u, v$ respectively. We can extend this notion to denote the distance $d(u,x)$ between a vertex $u$ (or the embedding of that vertex in $[0,1]$) and a point $x \in [0,1]$ naturally.

We define the random annulus graph, a mild generalization of the random geometric graphs.

Definition 2.

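A random annulus graph ${\rm RAG}(n,[r_1,r_2])$ on $n$ vertices has parameters $n$ and a pair of real numbers $0 \le r_1 \le r_2$. It is defined by assigning a number $X_i \in [0,1]$ to vertex $i$, $1 \le i \le n$, where the $X_i$s are independent and identical random variables uniformly distributed in $[0,1]$. There will be an edge between vertices $i$ and $j$, $i \ne j$, if and only if $r_1 \le d_L(X_i,X_j) \le r_2$.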
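Definition 2 is easy to simulate, which can help build intuition before the formal results. The following is a minimal Python sketch, not part of the paper: the function names `lee_distance`, `random_annulus_graph` and `is_connected` are hypothetical, positions are assumed to lie in $[0,1)$ with the wrap-around (Lee) distance, and the quadratic-time edge construction is chosen for clarity rather than speed.

```python
import random


def lee_distance(x, y):
    """Circular (Lee) distance between two points of [0, 1)."""
    d = abs(x - y)
    return min(d, 1.0 - d)


def random_annulus_graph(n, r1, r2, rng):
    """Sample n uniform positions and build the adjacency lists of a
    random annulus graph with connectivity interval [r1, r2]."""
    xs = [rng.random() for _ in range(n)]
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            # Edge rule of Definition 2: r1 <= d_L(X_u, X_v) <= r2.
            if r1 <= lee_distance(xs[u], xs[v]) <= r2:
                adj[u].append(v)
                adj[v].append(u)
    return xs, adj


def is_connected(adj):
    """Depth-first search from vertex 0; True if every vertex is reached."""
    if not adj:
        return True
    seen = {0}
    stack = [0]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(adj)
```

One way to use the sketch is to set `unit = math.log(n) / n`, call `random_annulus_graph(n, b * unit, a * unit, random.Random(seed))` for several seeds, and record how often `is_connected` returns `True` as $a$ and $b$ vary. Finite-$n$ simulations only suggest the asymptotic behaviour and do not replace the proofs.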

Our main result regarding random annulus graphs is given in the following theorem. The base of the logarithm is $e$ here and everywhere else in the paper unless otherwise mentioned.

Theorem 1 (Connectivity threshold of random annulus graphs).

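The ${\rm RAG}(n,[\frac{b\log n}{n},\frac{a\log n}{n}])$ is connected with probability $1-o(1)$ if $a > 1$ and $a - b > 0.5$. On the other hand, the ${\rm RAG}(n,[\frac{b\log n}{n},\frac{a\log n}{n}])$ is not connected with probability $1-o(1)$ if $a < 1$ or $a - b < 0.5$.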

The regime of connectivity is depicted in Fig. 1. For the special case of $b = 0$, the result was known (Muthukrishnan and Pandurangan, 2005; Penrose, 2003). However, note that the case of $b > 0$ is neither a straightforward generalization (i.e., the connectivity region is not defined by $a - b > 1$) nor intuitive.

As a corollary of Theorem 1, we are able to show connectivity regimes for more complicated models such as the one derived in Corollaries 1 and 2. These are useful in analyzing the recovery algorithm for the geometric block model that we define next.

Definition 3.

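Given $V = V_1 \sqcup V_2$ with $|V_1| = |V_2| = \frac{n}{2}$, choose a random variable $X_u$ uniformly distributed in $[0,1]$ for all $u \in V$. The geometric block model ${\rm GBM}(r_s, r_d)$ with parameters $r_s > r_d$ is a random graph where an edge exists between vertices $u$ and $v$ if and only if,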

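\[
\begin{aligned}
d_L(X_u,X_v) &\le r_s \quad \text{when } u,v\in V_1 \text{ or } u,v\in V_2,\\
d_L(X_u,X_v) &\le r_d \quad \text{when } u\in V_1, v\in V_2 \text{ or } u\in V_2, v\in V_1.
\end{aligned}
\]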
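The edge rule of Definition 3 can be sampled directly. Below is a minimal Python sketch under stated assumptions: the names `lee_distance` and `geometric_block_model` are hypothetical, the first half of the vertex indices is taken as $V_1$ and the rest as $V_2$ purely for convenience, and the construction is quadratic in $n$.

```python
import random


def lee_distance(x, y):
    """Circular (Lee) distance between two points of [0, 1)."""
    d = abs(x - y)
    return min(d, 1.0 - d)


def geometric_block_model(n, r_s, r_d, rng):
    """Sample a two-cluster geometric block model on n vertices.

    Returns (labels, positions, edges); labels[u] is 0 for V1 and 1 for V2,
    and edges is a set of pairs (u, v) with u < v.
    """
    half = n // 2
    labels = [0] * half + [1] * (n - half)
    xs = [rng.random() for _ in range(n)]
    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            # Same cluster: radius r_s; different clusters: radius r_d.
            radius = r_s if labels[u] == labels[v] else r_d
            if lee_distance(xs[u], xs[v]) <= radius:
                edges.add((u, v))
    return labels, xs, edges
```

In the regime studied here one would call it with `r_s = a * math.log(n) / n` and `r_d = b * math.log(n) / n`. A recovery algorithm is only given `edges`; the labels and positions are returned so that a simulation can check the output.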

Given a geometric random graph our main objective is to recover the partition (i.e., and ). As a consequence of the connectivity lower bound on RAG, we are able to show that recovery of the partition is not possible with high probability in whenever or (see, Theorem 5). Another consequence of the random annulus graph results is that we show that if in addition to a GBM graph, all the locations of the vertices are also provided, then recovery is possible if and only if or (formal statement in Theorem 6).

Coming back to the actual recovery problem, our main contribution for GBM is to provide a simple efficient triangle counting algorithm that performs well in the sparse regime (see, Algorithm 4.2). The algorithm goes over all the edges of the graph and counts the number of common neighbors for each pair of vertices in the graph. Based on this count, the algorithm then deletes some of the edges. The next and final step is to compute connected components in the remaining graph. If the redacted graph has exactly two components, then the algorithm returns them as the two clusters. The main result here can be summarized as below.
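The three steps just described (count common neighbors, delete edges, compute connected components) can be sketched in Python as follows. This is an illustration only and not the paper's Algorithm 4.2: the function name `recover_by_triangle_counting` is hypothetical, and the deletion rule used here, a single cutoff `threshold` below which an edge is removed, is an assumption standing in for the actual thresholds, which are not reproduced in this sketch.

```python
def recover_by_triangle_counting(n, edges, threshold):
    """Return the two clusters as sorted vertex lists, or None if the
    pruned graph does not have exactly two connected components."""
    nbrs = [set() for _ in range(n)]
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)

    # Step 1 and 2: keep an edge only if its endpoints share at least
    # `threshold` common neighbors (assumed rule, see the lead-in).
    kept = [[] for _ in range(n)]
    for u, v in edges:
        if len(nbrs[u] & nbrs[v]) >= threshold:
            kept[u].append(v)
            kept[v].append(u)

    # Step 3: connected components of the pruned graph by iterative DFS.
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        members = [start]
        stack = [start]
        while stack:
            u = stack.pop()
            for v in kept[u]:
                if not seen[v]:
                    seen[v] = True
                    members.append(v)
                    stack.append(v)
        components.append(sorted(members))

    return components if len(components) == 2 else None
```

Storing neighborhoods as sets makes each common-neighbor count a single set intersection, so the total work is roughly the sum of the endpoint degrees over all edges.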

Theorem 2 (Recovery algorithm for GBM).

Suppose we have the graph generated according to . Define,

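\[
\begin{aligned}
t_1 &= \min\Big\{t : (2b+t)\log\frac{2b+t}{2b} - t > 1\Big\},\\
t_2 &= \min\Big\{t : (2b-t)\log\frac{2b-t}{2b} + t > 1\Big\},\\
\theta_1 &= \max\Big\{\theta : \frac{1}{2}\Big((4b+2t_1)\log\frac{4b+2t_1}{2a-\theta} + 2a - \theta - 4b - 2t_1\Big) > 1 \text{ and } 0 \le \theta \le 2a - 4b - 2t_1\Big\},\\
\theta_2 &= \min\Big\{\theta : \frac{1}{2}\Big((4b-2t_2)\log\frac{4b-2t_2}{2a-\theta} + 2a - \theta - 4b + 2t_2\Big) > 1 \text{ and } a \ge \theta \ge \max\{2b,\, 2a-4b+2t_2\}\Big\}.
\end{aligned}
\]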

Then there exists an efficient algorithm which will recover the correct partition in the GBM with probability if OR .
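The quantities $t_1$ and $t_2$ have no closed form, but they are easy to approximate numerically. The Python sketch below assumes the reading $t_1 = \inf\{t : (2b+t)\log\frac{2b+t}{2b} - t > 1\}$ and $t_2 = \inf\{t : (2b-t)\log\frac{2b-t}{2b} + t > 1\}$ with natural logarithms; both left-hand sides are increasing in $t$, so bisection applies. The helper names `rate`, `t1` and `t2` are hypothetical, and the convention $0\log 0 = 0$ at $t = 2b$ is an assumption of the sketch.

```python
import math


def rate(mean, value):
    """value*log(value/mean) - value + mean, with 0*log(0) taken as 0."""
    if value <= 0.0:
        return mean
    return value * math.log(value / mean) - value + mean


def t1(b, tol=1e-12):
    """Approximate inf{t >= 0 : (2b+t) log((2b+t)/(2b)) - t > 1}."""
    def f(t):
        return rate(2.0 * b, 2.0 * b + t)

    lo, hi = 0.0, 1.0
    while f(hi) <= 1.0:  # f is increasing and unbounded, so this stops
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 1.0:
            hi = mid
        else:
            lo = mid
    return hi


def t2(b, tol=1e-12):
    """Approximate inf{t in [0, 2b] : (2b-t) log((2b-t)/(2b)) + t > 1}.

    Returns None when the expression never exceeds 1 on [0, 2b],
    which happens when 2b <= 1.
    """
    def g(t):
        return rate(2.0 * b, 2.0 * b - t)

    lo, hi = 0.0, 2.0 * b
    if g(hi) <= 1.0:
        return None
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 1.0:
            hi = mid
        else:
            lo = mid
    return hi
```

The same bisection pattern could be applied to $\theta_1$ and $\theta_2$ once $t_1$ and $t_2$ are known, taking care of the side constraints on $\theta$.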

Some examples of parameters for which the proposed algorithm (Algorithm 4.2) successfully recovers the partition are given in Table 1.

The theorem relies on the concentrations of count of common neighbors (or other motifs) for pairs of vertices that are in the same cluster, versus the pairs that are in different clusters. It also crucially relies on Corollary 1 for the final result. If instead of Corollary 1, the result of Corollary 2 is used, then the results of this theorem can be slightly improved, but we omit that statement for clarity (and relative obviousness).

3 Random Annulus Graphs

In this section we prove Theorem 1. We prove the sufficient condition first and defer the necessary condition to the end of this section.

3.1 Sufficient condition for connectivity of RAG

Theorem 3.

The random annulus graph ${\rm RAG}(n,[\frac{b\log n}{n},\frac{a\log n}{n}])$ is connected with probability $1-o(1)$ if $a > 1$ and $a - b > 0.5$.

To prove this theorem we use two main technical lemmas that show two different events happen with high probability simultaneously.

Lemma 1.

A set of vertices is called a cover of , if for any point in there exists a vertex such that . A is a union of cycles such that every cycle forms a cover of (see Figure 2) as long as and with probability .

We consider this lemma to be a main technical contribution of this paper, and a large part of this section is dedicated to its proof. The lemma also shows that ‘long edges’ are able to connect vertices over multiple hops. Note that the statement of Lemma 1 would be easier to prove if the condition were $a - b > 1$. In that case what we prove is that every vertex has neighbors (in the RAG) in both the left and right directions. To see this, for each vertex $u$, assign two indicator $\{0,1\}$-random variables $A_u^l$ and $A_u^r$, with $A_u^l = 1$ if and only if there is no node $x$ to the left of node $u$ such that $d(u,x) \in [\frac{b\log n}{n}, \frac{a\log n}{n}]$. Similarly, let $A_u^r = 1$ if and only if there is no node $x$ to the right of node $u$ such that $d(u,x) \in [\frac{b\log n}{n}, \frac{a\log n}{n}]$. Now define $A = \sum_u (A_u^l + A_u^r)$. We have,

\[
\Pr(A_u^l = 1) = \Pr(A_u^r = 1) = \Big(1 - \frac{(a-b)\log n}{n}\Big)^{n-1},
\]

and,

\[
\mathbb{E}[A] = 2n\Big(1 - \frac{(a-b)\log n}{n}\Big)^{n-1} \le 2n^{1-(a-b)}.
\]

If $a - b > 1$ then $\mathbb{E}[A] = o(1)$, which implies, by invoking Markov's inequality, that with high probability every node will have neighbors (connected by an edge in the RAG) on either side. This results in the interesting conclusion that every vertex will lie in a cycle that covers $[0,1]$. This is true for every vertex, hence the graph is simply a union of cycles each of which is a cover of $[0,1]$. The main technical challenge is to show that this conclusion remains valid even when $a - b > 0.5$, which is proved after we describe the other components of the result in this section.
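For readers who want the first-moment step written out, here is a short derivation sketch. It assumes natural logarithms and uses the shorthand $x = \frac{(a-b)\log n}{n}$, a symbol introduced here only for illustration, together with the standard bound $1 - x \le e^{-x}$.

```latex
\begin{align*}
\mathbb{E}[A] &= 2n\,(1-x)^{n-1} \le 2n\,e^{-x(n-1)}
              = 2n^{1-(a-b)}\,e^{x} = (1+o(1))\,2n^{1-(a-b)},\\
\Pr(A \ge 1) &\le \mathbb{E}[A] \longrightarrow 0
              \quad\text{as } n \to \infty, \text{ whenever } a-b > 1.
\end{align*}
```

Since $A$ is a nonnegative integer, $A = 0$ with probability $1 - o(1)$, which is exactly the statement that no vertex lacks a neighbor on either side.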

Lemma 2.

Set two real numbers and . In an , with probability there exists a vertex and nodes to the right of such that and nodes to the right of such that , for . The arrangement of the vertices is shown in Figure 3.

With the help of these two lemmas, we are in a position to prove Theorem 3. The proofs of the two lemmas are given immediately after the proof of the theorem.

Proof of Theorem 3.

We have shown that the two events mentioned in Lemmas 1 and 2 happen with high probability. Therefore they simultaneously happen under the condition and . Now we will show that these events together imply that the graph is connected. To see this, consider the vertices and that satisfy the conditions of Lemma 2. We can observe that each vertex has an edge with and , . This is because (see Figure 3 for a depiction)

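\[
\begin{aligned}
d(u_i,v_i) &\ge \big(i(a-b)+b-(2i-1)\epsilon\big)\frac{\log n}{n} - \big(i(a-b)-(2i-1)\epsilon\big)\frac{\log n}{n} = \frac{b\log n}{n} \quad\text{and}\\
d(u_i,v_i) &\le \big(i(a-b)+b-(2i-2)\epsilon\big)\frac{\log n}{n} - \big(i(a-b)-2i\epsilon\big)\frac{\log n}{n} = \frac{(b+2\epsilon)\log n}{n}.
\end{aligned}
\]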

Similarly,

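\[
\begin{aligned}
d(u_{i-1},v_i) &\ge \big(i(a-b)+b-(2i-1)\epsilon\big)\frac{\log n}{n} - \big((i-1)(a-b)-(2i-3)\epsilon\big)\frac{\log n}{n} = \frac{(a-2\epsilon)\log n}{n} \quad\text{and}\\
d(u_{i-1},v_i) &\le \big(i(a-b)+b-(2i-2)\epsilon\big)\frac{\log n}{n} - \big((i-1)(a-b)-2(i-1)\epsilon\big)\frac{\log n}{n} = \frac{a\log n}{n}.
\end{aligned}
\]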

This implies that is connected to and for all . The first event implies that the connected components are cycles spanning the entire line . Now consider two such disconnected components, one of which consists of the nodes and . There must exist a node in the other component (cycle) such that is on the right of and . If , (see Figure 4). When , we can calculate the distance between and as

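\[
d(t,v_i) \ge \big(i(a-b)+b-(2i-1)\epsilon\big)\frac{\log n}{n} - \big(i(a-b)-(2i-1)\epsilon\big)\frac{\log n}{n} = \frac{b\log n}{n}
\]

and

\[
d(t,v_i) \le \big(i(a-b)+b-(2i-2)\epsilon\big)\frac{\log n}{n} - \big((i-1)(a-b)-(2i-2)\epsilon\big)\frac{\log n}{n} = \frac{a\log n}{n}.
\]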

Therefore is connected to when If then is already connected to . Therefore the two components (cycles) in question are connected. This is true for all cycles and hence there is only a single component in the entire graph. Indeed, if we consider the cycles to be disjoint super-nodes, then we have shown that there must be a star configuration. ∎

We will now provide the proof of Lemma 2.

Proof of Lemma 2.

Recall that we want to show that there exists a node and nodes to the right of such that and exactly nodes to the right of such that , for and is a constant less than (see Figure 3 for a depiction). Let be an indicator -random variable for every node which is if satisfies the above conditions and otherwise. We will show with high probability.

We have,

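\[
\begin{aligned}
\Pr(A_u=1) &= n(n-1)\cdots(n-(2k-1))\Big(\frac{\epsilon\log n}{n}\Big)^{2k}\Big(1-\frac{2k\epsilon\log n}{n}\Big)^{n-2k}\\
&= c_0\, n^{-2k\epsilon}(\epsilon\log n)^{2k}\prod_{i=0}^{2k-1}(1-i/n)\\
&= c_1\, n^{-2k\epsilon}(\epsilon\log n)^{2k}
\end{aligned}
\]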

where $c_0, c_1$ are just absolute constants independent of $n$ (recall $k$ is a constant). Hence,

\[
\sum_u \mathbb{E}A_u = c_1\, n^{1-2k\epsilon}(\epsilon\log n)^{2k} \ge 1
\]

as long as . Now, in order to prove with high probability, we will show that the variance of is bounded from above. This calculation is very similar to the one in the proof of Theorem 4. Recall that if is a sum of indicator random variables, we must have

\[
{\rm Var}(A) \le \mathbb{E}[A] + \sum_{u\ne v}{\rm Cov}(A_u,A_v) = \mathbb{E}[A] + \sum_{u\ne v}\big(\Pr(A_u=1\cap A_v=1) - \Pr(A_u=1)\Pr(A_v=1)\big).
\]

Now first consider the case when vertices and are at a distance of at least apart (happens with probability ). Then the region in that is within distance from both and is the empty-set. In this case, where is a constant.

In all other cases, . Therefore,

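\[
\Pr(A_u=1\cap A_v=1) = \Big(1-\frac{4(a+b)\log n}{n}\Big)c_2\, n^{-4k\epsilon}(\epsilon\log n)^{4k} + \frac{4(a+b)\log n}{n}\,c_1\, n^{-2k\epsilon}(\epsilon\log n)^{2k}
\]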

and

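\[
\begin{aligned}
{\rm Var}(A) &\le c_1\, n^{1-2k\epsilon}(\epsilon\log n)^{2k} + c_3\, n^{1-2k\epsilon}(\log n)^{2k+1}\\
&\le c_4\, n^{1-2k\epsilon}(\log n)^{2k+1}
\end{aligned}
\]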

where $c_3, c_4$ are constants. Again invoking Chebyshev’s inequality, with probability at least $1 - \frac{1}{\log n}$,

\[
A > c_1\, n^{1-2k\epsilon}(\epsilon\log n)^{2k} - \sqrt{c_4\, n^{1-2k\epsilon}(\log n)^{2k+2}}.
\]
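The Chebyshev step can be made explicit as follows. This is a sketch that takes the displayed bounds on $\mathbb{E}[A]$ and ${\rm Var}(A)$ as given and assumes $2k\epsilon < 1$, so that $n^{1-2k\epsilon}$ grows polynomially in $n$; the deviation level $\sqrt{{\rm Var}(A)\log n}$ is the choice that matches the square-root term in the bound on $A$.

```latex
\begin{align*}
\Pr\Big(|A-\mathbb{E}[A]| \ge \sqrt{\mathrm{Var}(A)\log n}\Big)
  &\le \frac{\mathrm{Var}(A)}{\mathrm{Var}(A)\log n} = \frac{1}{\log n},\\
\sqrt{\mathrm{Var}(A)\log n}
  &\le \sqrt{c_4\, n^{1-2k\epsilon}(\log n)^{2k+2}}
   = o\big(n^{1-2k\epsilon}(\epsilon\log n)^{2k}\big).
\end{align*}
```

The deviation is therefore of lower order than $\mathbb{E}[A]$, so $A \ge 1$ for all sufficiently large $n$ on an event of probability at least $1 - \frac{1}{\log n}$.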

It remains to prove Lemma 1.

Proof of Lemma 1.

The proof of this lemma is somewhat easily explained if we consider a weaker result (a stronger condition) with . Let us first briefly describe this case.

Consider a node and assume without loss of generality that the position of is (i.e. ). Associate four indicator -random variables which take the value of if and only if there does not exist any node such that

The intervals representing these random variables are shown in Figure 5.

Notice that and therefore . This means that for and , . Hence there exist vertices in all the regions described above for every node with high probability.

Now, and being zero implies that either there is a vertex in or there exists two vertices in and respectively (see, Figure 5). In the second case, is connected to and is connected to . Therefore has nodes on left () and right () and is connected to both of them through one hop in the graph.

Similarly, and being zero implies that either there exists a vertex in or again will have vertices on left and right and will be connected to them. So , when all the four are zero together, the only exceptional case is when there are nodes in and . But in that case has direct neighbors on both its left and right. We can conclude that every vertex is connected to a vertex on its right and a vertex on its left such that and ; therefore every vertex is part of a cycle that covers .

We can now extend this proof to the case when

Let be large number to be chosen specifically later. Consider a node and assume that the position of is . Now consider the three different regions (as defined below) each divided into patches (intervals) of size in the following way:

where . Note that any vertex in is connected to . See, Figure 6 for a depiction.

Consider a -indicator random variable that is if and only if there does not exist any node in some region that consists of patches amongst the ones described above. Notice that if the patches do not overlap then the total size of patches is and if they do overlap, then the total size of the patches is going to be less than . Since there are possible regions that consists of patches,

 ∑uEXu ≤n(4L2L−1)(1−min{2c+1−12c(a−b)logn