
# Exact alignment recovery for correlated Erdős-Rényi graphs

Daniel Cullina, Negar Kiyavash
###### Abstract

We consider the problem of perfectly recovering the vertex correspondence between two correlated Erdős-Rényi (ER) graphs on the same vertex set. The correspondence between the vertices can be obscured by randomly permuting the vertex labels of one of the graphs. We determine the information-theoretic threshold for exact recovery, i.e. the conditions under which the entire vertex correspondence can be correctly recovered given unbounded computational resources.

Graph alignment is the problem of finding a matching between the vertices of two graphs that matches, or aligns, many edges of the first graph with edges of the second graph. Alignment is a generalization of graph isomorphism recovery to non-isomorphic graphs. Graph alignment can be applied in the deanonymization of social networks, the analysis of protein interaction networks, and computer vision. Narayanan and Shmatikov successfully deanonymized an anonymized social network dataset by graph alignment with a publicly available network [1]. In order to make privacy guarantees in this setting, it is necessary to understand the conditions under which graph alignment recovery is possible.

We consider graph alignment for a randomized graph-pair model. This generation procedure creates a “planted” alignment: there is a ground-truth relationship between the vertices of the two graphs. Pedarsani and Grossglauser [2] were the first to approach the problem of finding information-theoretic conditions for alignment recovery. They established conditions under which exact recovery of the planted alignment is possible. The authors improved on these conditions and also established conditions under which exact recovery is impossible [3]. In this paper, we close the gap between these results and establish the precise threshold for exact recovery in sparse graphs. As a special case, we recover a result of Wright [4] about the conditions under which an Erdős-Rényi graph has a trivial automorphism group.

## I Model

### I-A The alignment recovery problem

We consider the following problem. There are two correlated graphs $G_a$ and $G_b$, both on the vertex set $[n]$. By correlation we mean that for each vertex pair $e$, the presence or absence of $e$ in $G_a$, or equivalently the indicator variable $G_a(e)$, provides some information about $G_b(e)$. The true vertex labels of $G_a$ are removed and replaced with meaningless labels. We model this by applying a uniformly random permutation $\Pi$ to map the vertices of $G_a$ to the vertices of its anonymized version. The anonymized graph is $G_c$, where for all $i,j$, $\{i,j\} \in E(G_a)$ if and only if $\{\Pi(i),\Pi(j)\} \in E(G_c)$. The original vertex labels of $G_b$ are preserved and $G_c$ and $G_b$ are revealed. We would like to know under what conditions it is possible to discover the true correspondence between the vertices of $G_c$ and the vertices of $G_b$. In other words, when can the random permutation $\Pi$ be exactly recovered with high probability?

In this context, an achievability result demonstrates the existence of an algorithm or estimator that exactly recovers $\Pi$ with high probability. A converse result is an upper bound on the probability of exact recovery that applies to any estimator.
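To make the anonymization step concrete, here is a minimal Python sketch (illustrative only, not the authors' code; function names are our own) that relabels the vertices of a graph by a uniformly random permutation and shows that the inverse permutation undoes the relabeling:

```python
import random

def anonymize(edges, n, seed=None):
    """Relabel the vertices 0..n-1 of a graph by a uniformly random
    permutation pi, returning the relabeled edge set and pi.
    The edge {i, j} becomes {pi[i], pi[j]}."""
    rng = random.Random(seed)
    pi = list(range(n))
    rng.shuffle(pi)
    relabeled = {frozenset((pi[i], pi[j])) for i, j in map(tuple, edges)}
    return relabeled, pi

def invert(pi):
    """Inverse permutation: invert(pi)[pi[i]] == i."""
    inv = [0] * len(pi)
    for i, v in enumerate(pi):
        inv[v] = i
    return inv
```

The recovery problem of this section amounts to finding `pi` given only the relabeled edge set and a correlated copy of the original graph.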

### I-B Correlated Erdős-Rényi graphs

To fully specify this problem, we need to define a joint distribution over $G_a$ and $G_b$. In this paper, we will focus on Erdős-Rényi (ER) graphs. We discuss some of the advantages and drawbacks of this model in Section II-F.

We will generate correlated Erdős-Rényi graphs as follows. Let $G_a$ and $G_b$ be graphs on the vertex set $[n]$. We will think of the pair $(G_a,G_b)$ as a single function with codomain $[2]\times[2]$: $(G_a,G_b) : \binom{[n]}{2} \to [2]\times[2]$. The random variables $(G_a,G_b)(e)$, $e \in \binom{[n]}{2}$, are i.i.d. and

$$(G_a,G_b)(e)=\begin{cases}(1,1)&\text{w.p. }p_{11}\\(1,0)&\text{w.p. }p_{10}\\(0,1)&\text{w.p. }p_{01}\\(0,0)&\text{w.p. }p_{00}.\end{cases}$$

Call this distribution $ER(n,p)$, where $p=(p_{00},p_{01},p_{10},p_{11})$. Note that the marginal distributions of $G_a$ and $G_b$ are Erdős-Rényi and so is the distribution of the intersection graph $G_a\wedge G_b$: $G_a\sim ER(n,p_{10}+p_{11})$, $G_b\sim ER(n,p_{01}+p_{11})$, and $G_a\wedge G_b\sim ER(n,p_{11})$.

When $p_{11} > (p_{10}+p_{11})(p_{01}+p_{11})$, we say that the graphs $G_a$ and $G_b$ have positive correlation. Observe that

$$p_{11}-(p_{10}+p_{11})(p_{01}+p_{11})=p_{11}p_{00}-p_{01}p_{10},$$

so $p_{11}p_{00} > p_{01}p_{10}$ is an equivalent, more symmetric condition for positive correlation.
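The joint distribution above is easy to sample, and the algebraic identity relating the two forms of the correlation condition can be checked numerically. The following sketch is illustrative (it assumes the $(p_{00},p_{01},p_{10},p_{11})$ parameterization of this section and is not the authors' code):

```python
import itertools
import random

def sample_correlated_er(n, p, seed=None):
    """Sample (Ea, Eb): edge sets of a correlated ER pair on vertices 0..n-1.
    p = (p00, p01, p10, p11) is the joint label distribution of each pair,
    drawn i.i.d. across vertex pairs."""
    p00, p01, p10, p11 = p
    assert abs(p00 + p01 + p10 + p11 - 1.0) < 1e-9
    rng = random.Random(seed)
    Ea, Eb = set(), set()
    for e in itertools.combinations(range(n), 2):
        u = rng.random()
        # Partition [0, 1) into intervals of length p11, p10, p01, p00.
        if u < p11:
            Ea.add(e); Eb.add(e)
        elif u < p11 + p10:
            Ea.add(e)
        elif u < p11 + p10 + p01:
            Eb.add(e)
    return Ea, Eb

def correlation_gap(p):
    """p11 - (p10 + p11)(p01 + p11), which equals p11*p00 - p01*p10."""
    p00, p01, p10, p11 = p
    return p11 - (p10 + p11) * (p01 + p11)
```

Positive correlation corresponds to `correlation_gap(p) > 0`.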

### I-C Results

All of the results concern the following setting. We have $(G_a,G_b)\sim ER(n,p)$, $\Pi$ is a uniformly random permutation of $[n]$ independent of $(G_a,G_b)$, and $G_c$ is the anonymization of $G_a$ by $\Pi$ as described in Section I-A. Our main result is the following.

###### Theorem 1.

Let $p$ satisfy the conditions

$$p_{01}+p_{10} \le O\left(\frac{1}{\log n}\right) \qquad (1)$$
$$p_{11} \le O\left(\frac{1}{\log n}\right) \qquad (2)$$
$$\frac{p_{01}p_{10}}{p_{11}p_{00}} \le O\left(\frac{1}{(\log n)^3}\right) \qquad (3)$$
$$p_{11} \ge \frac{\log n+\omega(1)}{n}. \qquad (4)$$

Then there is an estimator for $\Pi$ given $(G_c,G_b)$ that is correct with probability $1-o(1)$.

Together, conditions (1) and (2) force $G_a$ and $G_b$ to be mildly sparse. Condition (3) requires $G_a$ and $G_b$ to have nonnegligible positive correlation.

There is a matching converse bound.

###### Theorem 2 (Converse bound).

If $p$ satisfies

$$p_{11}\le\frac{\log n-\omega(1)}{n},$$

then any estimator for $\Pi$ given $(G_c,G_b)$ is correct with probability $o(1)$.

Theorem 2 was originally proved by the authors in [3]. The proof is short compared to that of Theorem 1 and is included in Section V.

A second achievability theorem applies without conditions (1), (2), and (3). This requires condition (4) to be strengthened.

###### Theorem 3.

If $p$ satisfies

$$\frac{2\log n+\omega(1)}{n}\le\left(\sqrt{p_{11}p_{00}}-\sqrt{p_{01}p_{10}}\right)^2,$$

then there is an estimator for $\Pi$ given $(G_c,G_b)$ that is correct with probability $1-o(1)$.

Theorem 3 was also originally proved in [3]. In this paper, it appears as an intermediate step in the proof of Theorem 1.

## II Preliminaries

### II-A Notation

Throughout, we use capital letters for random objects and lower case letters for fixed objects.

For a graph $g$, let $V(g)$ and $E(g)$ be the node and edge sets respectively. Let $[n]$ denote the set $\{0,1,\dots,n-1\}$, so in particular $[2]=\{0,1\}$. All of the $n$-vertex graphs that we consider will have vertex set $[n]$. We will always think of a permutation $\pi$ as a bijective function $\pi:[n]\to[n]$. The set of permutations of $[n]$ under the binary operation of function composition forms the group $S_n$.

We denote the collection of all two-element subsets of $[n]$ by $\binom{[n]}{2}$. The edge set of a graph on $[n]$ is a subset of $\binom{[n]}{2}$.

Represent a labeled graph on the vertex set $[n]$ by its edge indicator function $g:\binom{[n]}{2}\to[2]$. The group $S_n$ has an action on the set of graphs. We can write the action of the permutation $\pi$ on the graph $g$ as the composition of functions $g\circ l(\pi)$, where $l(\pi)$ is the lifted version of $\pi$:

$$l(\pi) : \binom{[n]}{2}\to\binom{[n]}{2}, \qquad \{i,j\}\mapsto\{\pi(i),\pi(j)\}.$$

Thus $(g\circ l(\pi))(\{i,j\})=g(\{\pi(i),\pi(j)\})$. Whenever there is only a single permutation $\pi$ under consideration, we will follow the convention $\tau=l(\pi)$.

For a generating function $a(z)=\sum_i a_iz^i$ in the formal variable $z$, $[z^j]$ is the coefficient extraction operator:

$$[z^j]a(z)=[z^j]\sum_i a_iz^i=a_j.$$

When $z$ is a matrix of numbers or formal variables and $k$ is a matrix of numbers, both indexed by $S\times T$, we use the notation

$$z^k=\prod_{i\in S}\prod_{j\in T}z_{i,j}^{k_{i,j}}$$

for compactness.

### II-B Graph statistics

Recall that we consider a graph on $[n]$ to be a $[2]$-labeling of the set of vertex pairs $\binom{[n]}{2}$. The following quantities have clear interpretations for graphs, but we define them more generally for reasons that will become apparent shortly.

###### Definition 1.

For a set $S$ and a pair of binary labelings $f_a,f_b : S\to[2]$, define the type

$$\mu(f_a,f_b)\in\mathbb{N}^{[2]\times[2]},\qquad \mu(f_a,f_b)_{ij}=\sum_{e\in S}1\{(f_a,f_b)(e)=(i,j)\}.$$

The Hamming distance between $f_a$ and $f_b$ is

$$\Delta(f_a,f_b)=\sum_{e\in S}1\{f_a(e)\neq f_b(e)\}=\mu(f_a,f_b)_{01}+\mu(f_a,f_b)_{10}.$$

For a permutation $\tau : S\to S$, define

$$\delta(\tau;f_a,f_b)=\frac{1}{2}\left(\Delta(f_a\circ\tau,f_b)-\Delta(f_a,f_b)\right).$$

In the particular case of graphs (where $S=\binom{[n]}{2}$ and $f_a,f_b$ are edge indicator functions $g_a,g_b$), $\Delta(g_a,g_b)$ is the size of the symmetric difference of the edge sets, $|E(g_a)\,\triangle\,E(g_b)|$. The quantity $\delta$ is central to both our converse and our achievability arguments (as well as the achievability proof of Pedarsani and Grossglauser [2]). When $g_a$ and $g_b$ are graphs on $[n]$ and $\pi$ is a permutation of $[n]$, $\delta(l(\pi);g_a,g_b)$ is the difference in matching quality between the permutation $\pi$ and the identity permutation.
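The type, Hamming distance, and $\delta$ are all mechanical to compute. In the illustrative Python sketch below (our own names, not from the paper), a labeling is a dict from elements of $S$ to $\{0,1\}$ and a permutation $\tau$ is a dict from $S$ to $S$:

```python
def mu(fa, fb):
    """Type matrix: m[i][j] counts elements e with (fa[e], fb[e]) == (i, j)."""
    m = [[0, 0], [0, 0]]
    for e in fa:
        m[fa[e]][fb[e]] += 1
    return m

def hamming(fa, fb):
    """Delta(fa, fb): number of elements where the labelings disagree."""
    return sum(1 for e in fa if fa[e] != fb[e])

def delta(tau, fa, fb):
    """delta(tau; fa, fb) = (Delta(fa o tau, fb) - Delta(fa, fb)) / 2."""
    fa_tau = {e: fa[tau[e]] for e in fa}
    return (hamming(fa_tau, fb) - hamming(fa, fb)) / 2
```

One can check on small examples that the difference of types `mu(fa o tau, fb) - mu(fa, fb)` always has the two-parameter checkerboard form described next.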

###### Lemma II.1.

Let $f_a,f_b : S\to[2]$ and let $\tau$ be a permutation of $S$. Then there is some $i\in\mathbb{Z}$ such that

$$\mu(f_a\circ\tau,f_b)-\mu(f_a,f_b)=\begin{pmatrix}-i&i\\i&-i\end{pmatrix}$$

and $i=\delta(\tau;f_a,f_b)$.

###### Proof.

Let $k=\mu(f_a,f_b)$ and $k'=\mu(f_a\circ\tau,f_b)$. Let $\mathbf{1}$ be the vector of all ones. We have $k'\mathbf{1}=k\mathbf{1}$ because both vectors give the distribution of symbols in $f_a$, which is unchanged by composition with $\tau$. Similarly $\mathbf{1}^{\mathsf T}k'=\mathbf{1}^{\mathsf T}k$ because both give the distribution of symbols in $f_b$. The matrix $k'-k$ has integer entries and zero row and column sums, so it must have the claimed form for some value of $i$. Finally,

$$i=\frac{1}{2}\left((k'_{01}+k'_{10})-(k_{01}+k_{10})\right)=\frac{1}{2}\left(\Delta(f_a\circ\tau,f_b)-\Delta(f_a,f_b)\right)=\delta(\tau;f_a,f_b).\qquad\blacksquare$$

### II-C MAP estimation

The maximum a posteriori (MAP) estimator for this problem can be derived as follows. In the following lemma we will be careful to distinguish graph-valued random variables from fixed graphs: we name the former with upper-case letters and the latter with lower-case.

###### Lemma II.2.

Let $(G_a,G_b)\sim ER(n,p)$, let $\Pi$ be a uniformly random permutation of $[n]$ independent of $(G_a,G_b)$, and let $G_c$ be such that $G_a=G_c\circ l(\Pi)$. Then

$$P[\Pi=\pi\,|\,(G_c,G_b)=(g_c,g_b)]\propto\left(\frac{p_{10}p_{01}}{p_{11}p_{00}}\right)^{i}$$

where $i=\frac{1}{2}\Delta(g_c\circ l(\pi),g_b)$.

###### Proof.
$$\begin{aligned}
P[\Pi=\pi\,|\,(G_c,G_b)=(g_c,g_b)] &\overset{(a)}{\propto} P[\Pi=\pi,(G_c,G_b)=(g_c,g_b)]\\
&\overset{(b)}{=} P[\Pi=\pi,(G_a,G_b)=(g_c\circ l(\pi),g_b)]\\
&\overset{(c)}{\propto} P[(G_a,G_b)=(g_c\circ l(\pi),g_b)]\\
&\overset{(d)}{=} p^{\mu(g_c\circ l(\pi),g_b)}\\
&\overset{(e)}{\propto} p^{\mu(g_c\circ l(\pi),g_b)-\mu(g_c,g_b)}\left(\frac{p_{01}p_{10}}{p_{00}p_{11}}\right)^{\frac{1}{2}\Delta(g_c,g_b)}\\
&\overset{(f)}{=}\left(\frac{p_{01}p_{10}}{p_{00}p_{11}}\right)^{\frac{1}{2}\Delta(g_c\circ l(\pi),g_b)}
\end{aligned}$$

where in $(a)$ we multiply by the constant $1/P[(G_c,G_b)=(g_c,g_b)]$, in $(b)$ we apply the relationship $G_a=G_c\circ l(\Pi)$, and in $(c)$ we use the independence of $\Pi$ from $(G_a,G_b)$ and the uniformity of $\Pi$. In $(d)$ we apply the definition of the $ER(n,p)$ distribution, in $(e)$ we divide by a constant that does not depend on $\pi$, and in $(f)$ we use Lemma II.1. ∎

Thus the MAP estimator is

$$\hat\Pi=\operatorname*{arg\,max}_{\hat\pi}P[\Pi=\hat\pi\,|\,(G_c,G_b)=(g_c,g_b)]=\operatorname*{arg\,min}_{\hat\pi}\Delta(G_c\circ l(\hat\pi),G_b),$$

where maximizing the posterior is equivalent to minimizing the Hamming distance because $p_{01}p_{10}<p_{00}p_{11}$ under positive correlation. The permutation $\Pi$ achieves an alignment score of $\Delta(G_c\circ l(\Pi),G_b)=\Delta(G_a,G_b)$. Although $\Pi$ is unknown to the estimator, we can analyze its success by considering the distribution of

$$\Delta(G_a\circ l(\Pi^{-1}\circ\hat\pi),G_b)-\Delta(G_a,G_b)=\delta(l(\Pi^{-1}\circ\hat\pi);G_a,G_b).$$

Let

$$\begin{aligned}Q&=\{\pi\in S_n:\Delta(G_a\circ l(\pi),G_b)\le\Delta(G_a,G_b)\}\\&=\{\pi\in S_n:\delta(l(\pi);G_a,G_b)\le0\},\end{aligned}$$

the set of permutations that give alignments of $G_a$ and $G_b$ that are at least as good as the true permutation. The identity permutation achieves $\delta(l(\mathrm{id});G_a,G_b)=0$, so it is always in $Q$ by definition.

Let $P_s$ be the probability of success of the MAP estimator conditioned on the generation of the graph pair $(G_a,G_b)$. When the identity is not a minimizer of $\Delta(G_a\circ l(\pi),G_b)$, i.e. there is some $\pi$ such that $\delta(l(\pi);G_a,G_b)<0$, $P_s=0$. When the identity achieves the minimum, every element of $Q$ is also a minimizer and $P_s=1/|Q|$.

The converse argument uses the fact that the overall probability of success is at most $E[1/|Q|]$.

The achievability arguments use the fact that the overall probability of error is at most

$$P[|Q|\ge2]\le E[|Q|-1]$$

or equivalently

$$P\left[\bigvee_{\pi\neq\mathrm{id}}(\pi\in Q)\right]\le\sum_{\pi\neq\mathrm{id}}P[\pi\in Q].$$

Here we have applied linearity of expectation to the indicators of the events $\{\pi\in Q\}$, so the inequality is the union bound on these events (equivalently, Markov's inequality applied to $|Q|-1$).
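For very small $n$, the minimum-distance rule can be run exhaustively. The sketch below is illustrative only (real instances are far too large for a factorial-time search, and the names are our own):

```python
import itertools

def relabel(edges, pi):
    """Apply the lifted permutation: {i, j} -> {pi[i], pi[j]}."""
    return {frozenset((pi[a], pi[b])) for a, b in map(tuple, edges)}

def min_distance_align(Ec, Eb, n):
    """Exhaustively search S_n for a relabeling of Ec minimizing the
    symmetric difference with Eb (the MAP rule under positive correlation)."""
    best_pi, best_d = None, None
    for pi in itertools.permutations(range(n)):
        d = len(relabel(Ec, pi) ^ Eb)
        if best_d is None or d < best_d:
            best_pi, best_d = pi, d
    return best_pi, best_d
```

In the perfectly correlated case the minimum distance is zero, and any minimizer is the composition of the hidden permutation with an automorphism of the graph; the set of all permutations achieving distance at most that of the truth plays the role of $Q$ above.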

### II-D Cycle decomposition and the nontrivial region

The cycle decompositions of the permutations $\pi$ and $l(\pi)$ play a crucial role in the distribution of $\delta(l(\pi);G_a,G_b)$. For a vertex set $[n]$ and a fixed $\pi$, define $\tilde S$, the nontrivial region of the graph, to be the vertex pairs that are not fixed points of $l(\pi)$, i.e. $\tilde S=\{e\in\binom{[n]}{2}:l(\pi)(e)\neq e\}$. We will mark quantities and random variables associated with the nontrivial region with tildes.

Recall that $n$ is the number of vertices and let $\tilde n$ be the number of vertices that are not fixed points of $\pi$. Let $t=\binom{n}{2}$, let $\tilde t=|\tilde S|$, and let $t_\ell$ be the number of vertex pairs in cycles of $l(\pi)$ of length $\ell$. Then $\tilde t=\sum_{\ell\ge2}t_\ell$.

The expected value of $\delta(\tau;G_a,G_b)$ depends only on the size of the nontrivial region.

###### Lemma II.3.

$E[\delta(\tau;G_a,G_b)]=\tilde t\,(p_{00}p_{11}-p_{01}p_{10})$.

###### Proof.

Let $\tilde S$ be the nontrivial region of $\tau$. Using the alternative expression for $\delta$ from Lemma II.1, and noting that for $e\in\tilde S$ the random variables $G_a(\tau(e))$ and $G_b(e)$ are independent while for $e\notin\tilde S$ the two terms cancel, we have

$$\begin{aligned}
E[\delta(\tau;G_a,G_b)]&=E[\mu(G_a,G_b)_{11}-\mu(G_a\circ\tau,G_b)_{11}]\\
&=\sum_{e\in S}P(G_a(e)=G_b(e)=1)-P(G_a(\tau(e))=G_b(e)=1)\\
&=\sum_{e\in\tilde S}p_{11}-(p_{10}+p_{11})(p_{01}+p_{11})\\
&=\tilde t\,(p_{00}p_{11}-p_{01}p_{10}).\qquad\blacksquare
\end{aligned}$$
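This expectation formula can be checked by exact enumeration when $S$ is small. The illustrative sketch below computes $E[\delta(\tau;\cdot,\cdot)]$ by summing over all labelings of $S$ with their i.i.d. weights and compares it to $\tilde t\,(p_{00}p_{11}-p_{01}p_{10})$ (names and interfaces are our own):

```python
import itertools

def expected_delta(tau, p):
    """Exact E[delta(tau; Fa, Fb)] for i.i.d. labels with joint pmf
    p = (p00, p01, p10, p11); tau is a dict e -> tau(e) on the set S."""
    p00, p01, p10, p11 = p
    S = sorted(tau)
    prob = {(0, 0): p00, (0, 1): p01, (1, 0): p10, (1, 1): p11}
    total = 0.0
    for fa_vals in itertools.product((0, 1), repeat=len(S)):
        fa = dict(zip(S, fa_vals))
        for fb_vals in itertools.product((0, 1), repeat=len(S)):
            fb = dict(zip(S, fb_vals))
            weight = 1.0
            for e in S:
                weight *= prob[(fa[e], fb[e])]
            d_id = sum(fa[e] != fb[e] for e in S)
            d_tau = sum(fa[tau[e]] != fb[e] for e in S)
            total += weight * (d_tau - d_id) / 2
    return total
```

Fixed points of `tau` contribute nothing, so only the size of the nontrivial region matters, exactly as the computation above shows.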

Let $M=\mu(G_a,G_b)_{11}$, which is the number of edges in $G_a\wedge G_b$. Let $\tilde M$ be the number of edges of $G_a\wedge G_b$ in the nontrivial region, i.e. $\tilde M=|\{e\in\tilde S:G_a(e)=G_b(e)=1\}|$. When $(G_a,G_b)\sim ER(n,p)$, the events $\{G_a(e)=G_b(e)=1\}$ for $e\in\binom{[n]}{2}$ are independent and occur with probability $p_{11}$, so both $M$ and $\tilde M$ have binomial distributions. Conditioned on $M$, $\tilde M$ has a hypergeometric distribution.

We use the following notation for binomial and hypergeometric distributions. Each of these probability distributions models drawing $a$ items from a pool of $n$ items, of which $b$ are marked. If we take the $a$ samples without replacement, the number of marked items drawn has the hypergeometric distribution $\mathrm{Hyp}(a,b,n)$. If we take the $a$ samples with replacement, the number of marked items drawn has a binomial distribution $\mathrm{Bin}(a,b,n)$. Thus

$$M\sim\mathrm{Bin}(t,p_{11},1)\qquad(5)$$
$$\tilde M\sim\mathrm{Bin}(\tilde t,p_{11},1)\qquad(6)$$
$$\tilde M\,|\,M=m\sim\mathrm{Hyp}(\tilde t,m,t).\qquad(7)$$

Hypergeometric and binomial random variables have the following probability generating functions:

$$\mathrm{Hyp}(a,b,n;z)\triangleq\frac{[x^ay^b](1+x+y+xyz)^n}{[x^ay^b](1+x)^n(1+y)^n}$$
$$\mathrm{Bin}(a,b,n;z)\triangleq\left(1-\frac{b}{n}+\frac{b}{n}z\right)^a.$$

Observe that $\mathrm{Hyp}(a,b,n;z)$ is symmetric in $a$ and $b$. Additionally $\mathrm{Hyp}(a,b,n;z)=z^a\,\mathrm{Hyp}(a,n-b,n;z^{-1})$ because the number of marked balls that are drawn is equal to the number of draws minus the number of unmarked balls drawn. For the same reason, $\mathrm{Bin}(a,b,n;z)=z^a\,\mathrm{Bin}(a,n-b,n;z^{-1})$.
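These identities are easy to sanity-check numerically from the pmfs. The sketch below is illustrative and uses the three-parameter notation of this section, so `bin_gf(a, b, n, z)` has success probability $b/n$:

```python
from math import comb

def hyp_gf(a, b, n, z):
    """Probability generating function of Hyp(a, b, n)."""
    pmf = [comb(b, k) * comb(n - b, a - k) / comb(n, a) for k in range(a + 1)]
    return sum(p * z ** k for k, p in enumerate(pmf))

def bin_gf(a, b, n, z):
    """Bin(a, b, n; z) = (1 - b/n + (b/n) z)^a."""
    return (1 - b / n + (b / n) * z) ** a
```

The symmetry in $a$ and $b$, the complement identity, and the domination $\mathrm{Hyp}(a,b,n;z)\le\mathrm{Bin}(a,b,n;z)$ for $z\ge1$ (Lemma III.3 below) can all be verified at sample parameter values.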

### II-E Proof outline

Both of our achievability proofs have the following broad structure.

• Use a union bound over the non-identity permutations and estimate $P[\delta(l(\pi);G_a,G_b)\le0]$, where $\pi$ is fixed and $(G_a,G_b)$ are random.

• For a fixed $\pi$, examine the cycle decomposition of $l(\pi)$ and relate $\delta(l(\pi);G_a,G_b)$ to $\delta(\tau;G_a,G_b)$, where $\tau$ has the same number of fixed points as $l(\pi)$ but all of its nontrivial cycles have length two. This is summarized in Theorem 4.

• Use large deviations methods to bound the lower tail of $\delta(\tau;G_a,G_b)$. This is done in Theorem 5.

Our first achievability result, Theorem 3, comes from applying Theorem 5 in a direct way. This requires no additional assumptions on $p$ but does not match the converse bound when $G_a\wedge G_b$ is sparse. If $G_a\wedge G_b$ has no edges, every permutation is in $Q$ and the union bound is extremely loose. When $p_{11}=c\frac{\log n}{n}$, the probability that $G_a\wedge G_b$ has no edges is

$$(1-p_{11})^t\approx\exp(-tp_{11})=\exp\left(-\frac{c}{2}(n-1)\log n\right).$$

When $c<2$, this probability is larger than $1/n!$, so the union bound on the error probability becomes larger than one.

To overcome this, in the proof of Theorem 6 we condition on $M$ before applying Theorem 5. It is more difficult to apply Theorem 5 to the conditional distribution. In particular, $\tilde M$, the number of edges of the intersection graph in nontrivial cycles of $l(\pi)$, now has a hypergeometric distribution $\mathrm{Hyp}(\tilde t,m,t)$ rather than a binomial distribution $\mathrm{Bin}(\tilde t,p_{11},1)$. One way to analyze the tail of a hypergeometric random variable is to look at the binomial random variable with the same mean and number of samples. This idea is formalized in Lemma III.3. Moving from $\mathrm{Hyp}(\tilde t,m,t)$ to $\mathrm{Bin}(\tilde t,m,t)$ would effectively undo our conditioning on $M$. For the most important values of $\tilde t$ and the typical values of $m$, we have $m\le\tilde t$. Thus we exploit the symmetry of the hypergeometric distribution ($\mathrm{Hyp}(\tilde t,m,t;z)=\mathrm{Hyp}(m,\tilde t,t;z)$) and replace $\mathrm{Hyp}(\tilde t,m,t)$ with $\mathrm{Bin}(m,\tilde t,t)$, which is more concentrated than $\mathrm{Bin}(\tilde t,m,t)$.

### II-F Related Work

In the perfect correlation limit, i.e. $p_{01}=p_{10}=0$, we have $G_a=G_b$. In this case, the size of the automorphism group of $G_a$ determines whether it is possible to recover the permutation applied to $G_a$. This is because the composition of an automorphism with the true matching gives another matching with no errors. Whenever the automorphism group of $G_a$ is nontrivial, it is impossible to exactly recover the permutation with high probability. We will return to this idea in Section V in the proof of Theorem 2, the converse counterpart of Theorem 1. Wright established that for $p\ge\frac{\log n+\omega(1)}{n}$, the automorphism group of an $ER(n,p)$ graph is trivial with probability $1-o(1)$ and that for $p\le\frac{\log n-\omega(1)}{n}$, it is nontrivial with probability $1-o(1)$ [4]. In fact, he proved a somewhat stronger statement about the growth rate of the number of unlabeled graphs that implies this fact about automorphism groups. Thus our Theorem 1 and Theorem 2 extend Wright’s results. Bollobás later provided a more combinatorial proof of this automorphism group threshold function [5]. The methods we use are closer to those of Bollobás.

Some practical recovery algorithms start by attempting to locate a few seeds. From these seeds, the graph matching is iteratively extended. Algorithms for the latter step can scale efficiently. Narayanan and Shmatikov were the first to apply this method [1]. They evaluated their performance empirically on graphs derived from social networks.

More recently, there has been some work evaluating the performance of this type of algorithm on graph inputs from random models. Yartseva and Grossglauser examined a simple percolation algorithm for growing a graph matching [6]. They find a sharp threshold for the number of initial seeds required to ensure that the final graph matching includes every vertex. The intersection of the graphs $G_a$ and $G_b$ plays an important role in the analysis of this algorithm. Kazemi et al. extended this work and investigated the performance of a more sophisticated percolation algorithm [7].

If the networks being aligned correspond to two distinct online services, it is unlikely that the user populations of the services are identical. Kazemi et al. investigate alignment recovery of correlated graphs on overlapping but not identical vertex sets [8]. They determine that the information-theoretic penalty for imperfect overlap between the vertex sets of and is relatively mild. This regime is an important test of the robustness of alignment procedures.

### II-G Subsampling model

Pedarsani and Grossglauser [2] introduced the following generative model for correlated Erdős-Rényi (ER) graphs. Essentially the same model was used in [9, 10]. Let $G$ be an ER graph on $[n]$ with edge probability $r$. Let $G_a$ and $G_b$ be independent random subgraphs of $G$ such that each edge of $G$ appears in $G_a$ and in $G_b$ with probabilities $s_a$ and $s_b$ respectively. We will refer to this as the subsampling model. The $r$, $s_a$, and $s_b$ parameters control the level of correlation between the graphs. This model is equivalent to $ER(n,p)$ with

$$p_{11}=rs_as_b\qquad p_{10}=rs_a(1-s_b)\qquad p_{01}=r(1-s_a)s_b\qquad p_{00}=1-r(s_a+s_b-s_as_b).$$

Solving for $r$ from the above definitions, we obtain

$$r=\frac{(p_{10}+p_{11})(p_{01}+p_{11})}{p_{11}}=p_{11}+p_{10}+p_{01}+\frac{p_{10}p_{01}}{p_{11}}.\qquad(8)$$

Observe that when $G_a$ and $G_b$ are independent, we have $r=1$. This reveals that the subsampling model is capable of representing any correlated Erdős-Rényi distribution with nonnegative correlation. From (8), we see that $r\le1$ is equivalent to $p_{11}\ge(p_{10}+p_{11})(p_{01}+p_{11})$, i.e. $p_{11}p_{00}\ge p_{01}p_{10}$.
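The change of parameters is mechanical, and equation (8) can be verified directly. A short illustrative sketch (our own function names):

```python
def subsampling_to_p(r, sa, sb):
    """Map subsampling parameters (r, sa, sb) to p = (p00, p01, p10, p11)."""
    p11 = r * sa * sb
    p10 = r * sa * (1 - sb)
    p01 = r * (1 - sa) * sb
    p00 = 1 - r * (sa + sb - sa * sb)
    return p00, p01, p10, p11

def r_from_p(p):
    """Recover r via equation (8)."""
    p00, p01, p10, p11 = p
    return (p10 + p11) * (p01 + p11) / p11
```

Round-tripping through `r_from_p(subsampling_to_p(r, sa, sb))` recovers `r`, and plugging in an independent joint distribution gives $r=1$.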

## III Graphs and cyclic sequences

Let $w$ be a matrix of formal variables indexed by $[2]\times[2]$:

$$w=\begin{pmatrix}w_{00}&w_{01}\\w_{10}&w_{11}\end{pmatrix}$$

and let $z$ be a single formal variable. For a set $S$ and a permutation $\tau : S\to S$, define the generating function

$$A_{S,\tau}(w,z)=\sum_{g\in[2]^S}\sum_{h\in[2]^S}z^{\delta(\tau;g,h)}w^{\mu(g,h)}.$$

When $(G_a,G_b)\sim ER(n,p)$ and $S=\binom{[n]}{2}$, $A_{S,\tau}$ captures the joint distribution of $\mu(G_a,G_b)$ and $\delta(\tau;G_a,G_b)$:

$$P[\mu(G_a,G_b)=k,\;\delta(\tau;G_a,G_b)=i]=[w^kz^i]A_{S,\tau}(p\odot w,z)$$

where $p\odot w$ is the element-wise product of the matrices $p$ and $w$. This follows immediately from the definition of the $ER(n,p)$ distribution.

### III-A Generating functions

###### Definition 2.

Let $S$ be a finite index set and let $\tau : S\to S$ be a permutation consisting of a single cycle of length $\ell=|S|$. A cyclic $k$-ary sequence is a pair $(\tau,g)$ where $g\in[k]^S$.

Let $\sigma$ be a permutation of $[\ell]$ with a single cycle. For any such choices of $S$ and $\tau$, the sets of cyclic sequences obtained are isomorphic, so we can define the generating function

$$a_\ell(w,z)=A_{[\ell],\sigma}(w,z).$$
###### Lemma III.1.

Let $\tau : S\to S$ be a permutation. Let $t_\ell$ be the number of cycles of length $\ell$ in $\tau$. Then $\sum_\ell\ell t_\ell=|S|$ and

$$A_{S,\tau}(w,z)=\prod_{\ell\in\mathbb{N}}a_\ell(w,z)^{t_\ell}.$$
###### Proof.

Let $\gamma(x,y,c)=\frac{1}{2}\left(1\{y\neq c\}-1\{x\neq c\}\right)$, so

$$\delta(\tau;g,h)=\sum_{e\in S}\gamma(g(e),g(\tau(e)),h(e)).$$

Let $T$ be the partition of $S$ from the cycle decomposition of $\tau$. Then we have an alternate expression for $A_{S,\tau}(w,z)$:

$$\begin{aligned}
A_{S,\tau}(w,z)&=\sum_{g\in[2]^S}\sum_{h\in[2]^S}\prod_{e\in S}z^{\gamma(g(e),g(\tau(e)),h(e))}w_{g(e),h(e)}\\
&=\sum_{g\in[2]^S}\sum_{h\in[2]^S}\prod_{S_i\in T}\prod_{e\in S_i}z^{\gamma(g(e),g(\tau(e)),h(e))}w_{g(e),h(e)}\\
&\overset{(a)}{=}\prod_{S_i\in T}\sum_{g\in[2]^{S_i}}\sum_{h\in[2]^{S_i}}\prod_{e\in S_i}z^{\gamma(g(e),g(\tau(e)),h(e))}w_{g(e),h(e)}\\
&=\prod_{S_i\in T}a_{|S_i|}(w,z).
\end{aligned}$$

In $(a)$, we use the fact that $e\in S_i$ implies $\tau(e)\in S_i$. ∎

For $\ell=1$, the generating function is very simple. There are 4 possible pairs of cyclic binary sequences of length one. A cycle of length one in a permutation is a fixed point, so these cyclic sequences are unchanged by the application of $\tau$ and $\delta$ is zero for each of them. Thus $a_1(w,z)=w_{00}+w_{01}+w_{10}+w_{11}$.

We define

$$\tilde A_{S,\tau}(w,z)=\prod_{\ell\ge2}a_\ell(w,z)^{t_\ell}.$$

Just as $A_{S,\tau}$ captures the joint distribution of $\mu(G_a,G_b)$ and $\delta(\tau;G_a,G_b)$, $\tilde A_{S,\tau}$ captures the joint distribution of the type over the nontrivial region and $\delta(\tau;G_a,G_b)$. Because $z$ does not appear in $a_1(w,z)$, we have

$$[z^i]A_{S,\tau}(w,z)=a_1(w,z)^{t_1}[z^i]\tilde A_{S,\tau}(w,z).$$

This implies that $\delta(\tau;G_a,G_b)$ and $M$ are conditionally independent given $\tilde M$.

### III-B Nontrivial cycles

For $\ell=2$, there are 16 possible pairs of sequences. There are only 4 pairs for which $\delta\neq0$: the cases where $g$ and $h$ are each either $01$ or $10$. In the two cases where $g=h$, $\delta=1$, $\mu_{00}=1$, and $\mu_{11}=1$. In the two cases where $g\neq h$, $\delta=-1$, $\mu_{01}=1$, and $\mu_{10}=1$. Thus

$$a_2(w,z)=(w_{00}+w_{01}+w_{10}+w_{11})^2+2w_{00}w_{11}(z-1)+2w_{01}w_{10}(z^{-1}-1).\qquad(9)$$
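Equation (9) can be confirmed by brute force over the 16 pairs. An illustrative sketch (our own names), with $\tau$ the swap permutation on two positions:

```python
from itertools import product

def a2_direct(w, z):
    """Sum z^delta * (w-monomial) over all pairs of binary sequences of
    length two, with tau the swap permutation."""
    total = 0.0
    for g in product((0, 1), repeat=2):
        for h in product((0, 1), repeat=2):
            g_tau = (g[1], g[0])
            d = (sum(x != y for x, y in zip(g_tau, h))
                 - sum(x != y for x, y in zip(g, h))) // 2
            mono = w[g[0]][h[0]] * w[g[1]][h[1]]
            total += z ** d * mono
    return total

def a2_closed(w, z):
    """Right-hand side of equation (9)."""
    s = w[0][0] + w[0][1] + w[1][0] + w[1][1]
    return s * s + 2 * w[0][0] * w[1][1] * (z - 1) + 2 * w[0][1] * w[1][0] * (1 / z - 1)
```

The two expressions agree for any numeric substitution of $w$ and $z>0$.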

The following theorem relates longer cycles to cycles of length two.

###### Theorem 4.

Let $w\in\mathbb{R}_{\ge0}^{[2]\times[2]}$ and $0<z\le1$. Then for all $\ell\ge2$, $a_\ell(w,z)\le a_2(w,z)^{\ell/2}$.

The proofs of Theorem 4 and several intermediate lemmas are in Appendix A.

### III-C Tail bounds from generating functions

The following lemma is a well known inequality that we will apply in the proof of Theorem 5 and in several other places.

###### Lemma III.2.

For a generating function $g(z)=\sum_ig_iz^i$ where $g_i\ge0$ for all $i$ and a real number $0<z_*\le1$,

$$\sum_{i\le j}[z^i]g(z)\le z_*^{-j}g(z_*).$$
###### Proof.
$$\sum_{i\le j}[z^i]g(z)=\sum_{i\le j}g_i\le\sum_ig_iz_*^{i-j}=z_*^{-j}g(z_*).\qquad\blacksquare$$
###### Theorem 5.

For $w\in\mathbb{R}_{\ge0}^{[2]\times[2]}$ with $w_{01}w_{10}\le w_{00}w_{11}$,

$$\sum_{i\le0}[z^i]\tilde A_{S,\tau}(w,z)\le\left((w_{00}+w_{01}+w_{10}+w_{11})^2-2\left(\sqrt{w_{00}w_{11}}-\sqrt{w_{01}w_{10}}\right)^2\right)^{\tilde t/2}.$$
###### Proof.

For all $0<z_1\le1$, we have

$$\begin{aligned}
\sum_{i\le0}[z^i]\tilde A_{S,\tau}(w,z)&=\sum_{i\le0}[z^i]\prod_{\ell\ge2}a_\ell(w,z)^{t_\ell}\\
&\overset{(a)}{\le}\prod_{\ell\ge2}a_\ell(w,z_1)^{t_\ell}\\
&\overset{(b)}{\le}a_2(w,z_1)^{\tilde t/2}\qquad(10)
\end{aligned}$$

where $(a)$ follows from Lemma III.2 and $(b)$ follows from Theorem 4 and $\sum_{\ell\ge2}\ell t_\ell=\tilde t$.

By (9), $a_2(w,z_1)=u^2+2v$ where

$$u=w_{00}+w_{01}+w_{10}+w_{11}\qquad v=w_{00}w_{11}(z_1-1)+w_{01}w_{10}(z_1^{-1}-1).$$

We would like to choose $z_1$ to minimize $v$. The optimal choice is $z_1=\sqrt{w_{01}w_{10}/(w_{00}w_{11})}$, which satisfies $z_1\le1$ whenever $w_{01}w_{10}\le w_{00}w_{11}$, so Lemma III.2 applies. Substituting it into the expression for $v$, we obtain

$$\begin{aligned}
\min_{z_1}v&=w_{00}w_{11}z_1-w_{00}w_{11}-w_{01}w_{10}+w_{01}w_{10}z_1^{-1}\\
&=2\sqrt{w_{00}w_{01}w_{10}w_{11}}-w_{00}w_{11}-w_{01}w_{10}\\
&=-\left(\sqrt{w_{00}w_{11}}-\sqrt{w_{01}w_{10}}\right)^2.\qquad(11)
\end{aligned}$$

Combining (11) with $a_2(w,z_1)=u^2+2v$ and (10) gives the claimed bound. ∎
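The optimization over $z_1$ in (11) is elementary calculus, and it is easy to spot-check numerically. An illustrative sketch (our own names):

```python
from math import sqrt

def v_of(w, z):
    """v = w00*w11*(z - 1) + w01*w10*(1/z - 1)."""
    return w[0][0] * w[1][1] * (z - 1) + w[0][1] * w[1][0] * (1 / z - 1)

def v_min(w):
    """Closed form (11) for min over z > 0 of v, attained at
    z = sqrt(w01*w10 / (w00*w11))."""
    return -(sqrt(w[0][0] * w[1][1]) - sqrt(w[0][1] * w[1][0])) ** 2
```

At the stationary point the closed form matches, and other values of $z$ give larger $v$; when $w_{01}w_{10}\le w_{00}w_{11}$ the minimizer satisfies $z\le1$, as required by Lemma III.2.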

### III-D Hypergeometric and binomial generating functions

Chvátal provided an upper bound on the tail probabilities of a hypergeometric random variable [11]. The following lemma is essentially a translation of that bound into the language of generating functions.

###### Lemma III.3.

For all $a,b,n\in\mathbb{N}$ with $a\le n$ and $b\le n$, and all $z\ge1$,

$$\mathrm{Hyp}(a,b,n;z)\le\mathrm{Bin}(a,b,n;z).$$
###### Proof.

First, we have

$$\begin{aligned}
\binom{n}{a}\binom{n}{b}\mathrm{Hyp}(a,b,n;z)&=[x^ay^b](1+x+y+xyz)^n\\
&=[x^ay^b]\left((1+x)(1+y)+xy(z-1)\right)^n\\
&=\sum_\ell\binom{n}{\ell}[x^ay^b]\left((1+x)(1+y)\right)^{n-\ell}\left(xy(z-1)\right)^\ell\\
&=\sum_\ell\binom{n}{\ell}[x^{a-\ell}](1+x)^{n-\ell}\,[y^{b-\ell}](1+y)^{n-\ell}\,(z-1)^\ell\\
&=\sum_\ell\binom{n}{\ell}\binom{n-\ell}{a-\ell}\binom{n-\ell}{b-\ell}(z-1)^\ell\\
&=\binom{n}{a}\binom{n}{b}\sum_\ell\frac{\binom{a}{\ell}\binom{b}{\ell}}{\binom{n}{\ell}}(z-1)^\ell.
\end{aligned}$$

Observe that

$$\frac{\binom{b}{\ell}}{\binom{n}{\ell}}\cdot\frac{n^\ell}{b^\ell}=\prod_{i=0}^{\ell-1}\frac{(b-i)\,n}{(n-i)\,b}$$