An efficient reconciliation algorithm for social networks
People today typically use multiple online social networks (Facebook, Twitter, Google+, LinkedIn, etc.). Each online network represents a subset of their “real” ego-networks. An interesting and challenging problem is to reconcile these online networks, that is, to identify all the accounts belonging to the same individual. Besides providing a richer understanding of social dynamics, the problem has a number of practical applications. At first sight, this problem appears algorithmically challenging. Fortunately, a small fraction of individuals explicitly link their accounts across multiple networks; our work leverages these connections to identify a very large fraction of the network.
Our main contributions are to mathematically formalize the problem for the first time, and to design a simple, local, and efficient parallel algorithm to solve it. We are able to prove strong theoretical guarantees on the algorithm’s performance on well-established network models (Random Graphs, Preferential Attachment). We also experimentally confirm the effectiveness of the algorithm on synthetic and real social network data sets.
The advent of online social networks has generated a renaissance in the study of social behaviors and in the understanding of the topology of social interactions. For the first time, it has become possible to analyze networks and social phenomena on a world-wide scale and to design large-scale experiments on them. This new evolution in social science has been the center of much attention, but has also attracted a lot of critiques; in particular, a longstanding problem in the study of online social networks is to understand the similarity between them and “real” underlying social networks .
This question is particularly challenging because online social networks are often just a realization of a subset of real social networks. For example, Facebook “friends” are a good representation of the personal acquaintances of a user, but probably a poor representation of her working contacts, while LinkedIn is a good representation of work contacts but not a very good representation of personal relationships. Therefore, analyzing social behaviors in any of these networks has the drawback that the results would only be partial. Furthermore, even if certain behavior can be observed in several networks, there are still serious problems because there is no systematic way to combine the behavior of a specific user across different social networks and because some social relationships will not appear in any social network. For these reasons, identifying all the accounts belonging to the same individual across different social services is a fundamental step in the study of social science.
Interestingly, the problem also has very important practical implications. First, having a deeper understanding of the characteristics of a user across different networks helps to construct a better portrait of her, which can be used to serve personalized content or advertisements. In addition, having information about connections of a user across multiple networks would make it easier to construct tools such as “friend suggestion” or “people you may want to follow”.
The problem of identifying users across online social networks (also referred to as the social network reconciliation problem) has been studied extensively using machine learning techniques; several heuristics have been proposed to tackle it. However, to the best of our knowledge, it has not yet been studied formally and no rigorous results have been proved for it. One of the contributions of our work is to give a formal definition of the problem, which is a precursor to mathematical analysis. Such a definition requires two key components: A model of the “true” underlying social network, and a model for how each online social network is formed as a subset of this network. We discuss details of our models in Section 3.
Another possible reason for the lack of mathematical analysis is that natural definitions of the problem are discouragingly similar to the graph isomorphism problem. (Footnote 1: In graph theory, an isomorphism between two graphs $G$ and $H$ is a bijection, $f$, between the vertex sets of $G$ and $H$ such that any two vertices $u$ and $v$ of $G$ are adjacent in $G$ if and only if $f(u)$ and $f(v)$ are adjacent in $H$. The graph isomorphism problem is: Given two graphs $G$ and $H$, find an isomorphism between them or determine that there is no isomorphism.) The graph isomorphism problem is considered very hard, and no polynomial-time algorithms are known for it. In addition, at first sight the social network reconciliation problem seems even harder because we are not looking just for isomorphism but for similar structures, as distinct social networks are not identical. Fortunately, when reconciling social networks, we have two advantages over general graph isomorphism: First, real social networks are not the adversarially designed graphs which are hard instances of graph isomorphism, and second, a small fraction of social network users explicitly link their accounts across multiple networks.
The main goal of this paper is to design an algorithm with provable guarantees that is simple, parallelizable and robust to malicious users. For real applications, this last property is fundamental, and often underestimated by machine learning models. (Footnote 2: Approaches based largely on features of a user (such as her profile) and her neighbors can easily be tricked by a malicious user, who can create a profile locally identical to that of the attacked user.) In fact, the threat of malicious users is so prominent that large social networks (Twitter, Google+, Facebook) have introduced the notion of ‘verification’ for celebrities.
Our first contribution is to give a formal model for the graph reconciliation problem that captures the hardness of the problem and the notion of an initial set of trusted links identifying users across different networks. Intuitively, our model postulates the existence of a true underlying graph, then randomly generates 2 realizations of it which are perturbations of the initial graph, and a set of trusted links for some users. Given this model, our next significant contribution is to design a simple, parallelizable algorithm (based on similar intuition to the algorithm in ) and to prove formally that our algorithm solves the graph reconciliation problem if the underlying graph is generated by well established network models. It is important to note that our algorithm relies on graph structure and the initial set of links of users across different networks in such a way that in order to circumvent it, an attacker must be able to have a lot of friends in common with the user under attack. Thus it is more resilient to attack than much of the previous work on this topic. Finally, we note that any mathematical model is, by necessity, a simplification of reality, and hence it is important to empirically validate the effectiveness of our approach when the assumptions of our models are not satisfied. In Section 5, we measure the performance of our algorithm on several synthetic and “real” data sets.
We also remark that for various applications, it may be possible to improve on the performance of our algorithm by adding heuristics based on domain-specific knowledge. For example, we later discuss identifying common Wikipedia articles across languages; in this setting, machine translation of article titles can provide an additional useful signal. However, an important message of this paper is that a simple, efficient and scalable algorithm that does not take any domain-specific information into account can achieve excellent results for mathematically sound reasons.
2 Related Work
The problem of identifying Internet users was introduced to identify users across different chat groups or web sessions in [24, 27]. Both papers are based on similar intuition, using writing style (stylography features) and a few semantic features to identify users. The social network reconciliation problem was introduced more recently by Zafarani and Liu in . The main intuition behind their paper is that users tend to use similar usernames across multiple social networks, and even when different, search engines find the corresponding names. To improve on these first naive approaches, several machine learning models were developed [3, 17, 20, 25, 28], all of which collect several features of the users (name, location, image, connections topology), based on which they try to identify users across networks. These techniques may be very fragile with respect to malicious users, as it is not hard to create a fake profile with similar characteristics. Furthermore, they get lower precision experimentally than our algorithm achieves. However, we note that these techniques can be combined with ours, both to validate / increase the number of initial trusted links, and to further improve the performance of our algorithm.
A different approach was studied in , where the authors infer missing attributes of a user in an online social network from the attribute information provided by other users in the network. To achieve their results, they retrieve communities, identify the main attribute of a community and then spread this attribute to all the users in the community. Though it is interesting, this approach suffers from the same limitations as the learning techniques discussed above.
Recently, Henderson et al.  studied which are the most important features to identify a node in a social network, focusing only on graph structure information. They analyzed several features of each ego-network, and also added the notion of recursive features on nodes at distance larger than 1 from a specific node. It is interesting to notice that their recursive features are more resilient to attack by malicious users, although they can be easily circumvented by the attacker typically assumed in the social network security literature , who can create arbitrarily many nodes.
The problem of reconciling social networks is closely connected to the problem of de-anonymizing social networks. Backstrom et al. introduced the problem of deanonymizing social networks in . In their paper, they present 2 main techniques: An active attack (nodes are added to the network before the network is anonymized), and a second passive one. Our setting is similar to that described in their passive attack. In this setting the authors are able to design a heuristic with good experimental results; though their technique is very interesting, it is somewhat elaborate and does not have a provable guarantee.
In the context of de-anonymizing social networks, the work of Narayanan and Shmatikov  is closely related. Their algorithm is similar in spirit to ours; they look at the number of common neighbors and other statistics, and then they keep all the links above a specific threshold. There are two main differences between our work and theirs. First, we formulate the problem and the algorithm mathematically and we are able to prove theoretical guarantees for our algorithm. Second, to improve the precision of their algorithm in  the authors construct a scoring function that is expensive to compute. In fact the complexity of their algorithm is $O((E_1 + E_2)\Delta_1 \Delta_2)$, where $E_1$ and $E_2$ are the number of edges in the two graphs and $\Delta_1$ and $\Delta_2$ are the maximum degrees in the 2 graphs. Thus their algorithm would be too slow to run on Twitter and Facebook, for example; Twitter has more than 200M users, several of whom have degree more than 20M, and Facebook has more than 1B users, with several users of degree 5K. Instead, in our work we are able to show that a very simple technique based on degree bucketing combined with the number of common neighbors suffices to obtain strong theoretical guarantees and good experimental results. In this way we designed an algorithm with sequential complexity that can be run in MapReduce rounds. In this context, our paper can be seen as the first really scalable algorithm for network de-anonymization with theoretical guarantees. Further, we also obtain considerably higher precision experimentally, though a perfect comparison across different datasets is not possible. The different contexts are also important: In de-anonymization, the precision of 72% they report corresponds to a significant violation of user privacy. In contrast, we focus on the benefits to users of linking accounts; in a user-facing application, suggesting an account with a 28% chance of error is unlikely to be acceptable.
Finally, independently from our work, Yartseva and Grossglauser  recently studied a very similar model focusing only on networks generated by the Erdős-Rényi random graph model.
3 Model and Algorithm
In this section, we first describe our formal model and its parameters. We then describe our algorithm and discuss the intuition behind it.
Recall that a formal definition of the user identification problem requires first a model for the “true” underlying social network $G(V, E)$ that captures relationships between people. However, we cannot directly observe this network; instead, we consider two imperfect realizations or copies $G_1(V, E_1)$ and $G_2(V, E_2)$ with $E_1, E_2 \subseteq E$. Second, we need a model for how edges of $E$ are selected for the two copies $G_1$ and $G_2$. This model must capture the fact that users do not necessarily replicate their entire personal networks on any social networking service, but only a subset.
Any such mathematical models are necessarily imperfect descriptions of reality, and as models become more ‘realistic’, they become more mathematically intractable. In this paper, we consider certain well-studied models, and provide complete proofs. It is possible to generalize our mathematical techniques to some variants of these models; for instance, with small probability, the two copies could have new “noise” edges not present in the original network , or vertices could be deleted in the copies. We do not fully analyze these as the generalizations require tedious calculations without adding new insights. Our experimental results of Section 5 show that the algorithm performs well even in real networks where the formal mathematical assumptions are not satisfied.
For the underlying social network, our main focus is on the preferential attachment model , which is historically the most cited model for social networks. Though the model does not capture some features of real social networks, the key properties we use for our analysis are those common to online social networks, such as a skewed degree distribution and the fact that nodes have distinct neighbors, including some long-range / random connections not shared with those immediately around them [13, 15]. In the experimental section we will also consider different models, as well as real social networks, as our underlying real networks.
For the two imperfect copies of the underlying network we assume that $G_1$ (respectively $G_2$) is created by selecting each edge of the original graph $G$ independently with a fixed probability $s_1$ (resp. $s_2$). (See Figure 1.) In the real world, edges/relationships are not selected truly independently, but this serves as a reasonable approximation for observed networks. In fact, a similar model has been previously considered by , which also produced experimental evidence from an email network to support the independent random selection of edges. Another plausible mechanism for edge creation in social networks is the cascade model, in which nodes are more likely to join a new network if more of their friends have joined it. Experimentally, we show that our algorithm performs even better in the cascade model than in the independent edge deletion model.
These two models are interesting both theoretically and practically . Nevertheless, in some cases the analyzed social networks may differ in their scopes, and so the group of friends that a user has in one social network can differ greatly from the group of friends that the same user has in the other network. To capture this scenario in the experimental section, we also consider the Affiliation Network model  (in which users participate in a number of communities) as the underlying social network. For each of the two copies $G_1$ and $G_2$, and for each community, we keep or delete all the edges inside the community with constant probability. This highly correlated edge deletion process captures the fact that a user’s personal friends might be connected to her on one network, while her work colleagues are connected on the second network. We defer the detailed description of this experiment to Section 5.
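The community-level deletion process described above can be sketched as follows. This is a minimal illustration, not the authors' experimental code: the function name `community_copy`, the representation of communities as sets of member ids, and the rule that an edge belonging to several communities survives if at least one of them is kept are all assumptions made for the example.

```python
import random
from itertools import combinations

def community_copy(communities, keep_prob, rng):
    """Build one copy of the network by keeping or dropping whole communities.

    communities: iterable of sets of node ids (hypothetical representation).
    keep_prob:   probability that all edges inside a community are kept.
    Returns a set of undirected edges (a, b) with a < b.
    """
    edges = set()
    for members in communities:
        # One coin toss per community: its internal edges live or die together.
        if rng.random() < keep_prob:
            edges.update(combinations(sorted(members), 2))
    return edges

# Toy data: a "friends" community and a "colleagues" community sharing node 3.
communities = [{1, 2, 3}, {3, 4, 5}]
rng = random.Random(0)
copy_1 = community_copy(communities, keep_prob=0.5, rng=rng)
copy_2 = community_copy(communities, keep_prob=0.5, rng=rng)
```

Because the coin is tossed per community rather than per edge, the two copies can end up with almost disjoint neighborhoods for the same user, which is exactly the scenario this model is meant to stress.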
Recall that the user identification problem, given only the graph information, is intractable in general graphs. Even the special case where both copies coincide with the underlying graph (that is, no edges have been deleted) is equivalent to the well-studied Graph Isomorphism problem, for which no polynomial-time algorithm is known. Fortunately, in reality, there are additional sources of information which allow one to identify a subset of nodes across the two networks: For example, people can use the same email address to sign up on multiple websites. Users often explicitly connect their network accounts, for instance by posting a link to their Facebook profile page on Google+ or Twitter and vice versa. To model this, we assume that there is a set of users/nodes explicitly linked across the two networks. More formally, there is a linking probability $l$ (typically, $l$ is a small constant) and each node in $V$ is linked across the networks independently with probability $l$. (In real networks, nodes may be linked with differing probabilities, but high-degree nodes / celebrities may be more likely to connect their accounts and engage in cross-network promotions; this would be more likely to help our algorithm, since low-degree nodes are less valuable as seeds because they help identify only a small number of neighbors. In the experiments of , the authors explicitly consider high-degree nodes as seeds in the real-world experiments.)
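To make the generative model of this section concrete, the following sketch draws two copies of an underlying edge set by independent edge selection and then picks the seed links. It is only an illustration under stated assumptions: the function names, the parameter names `keep_prob` and `link_prob`, and the toy graph are hypothetical, and nodes keep the same id in both copies so that a seed link is simply a node mapped to itself.

```python
import random

def sample_edges(edges, keep_prob, rng):
    """Keep each edge of the underlying graph independently with keep_prob."""
    return {e for e in sorted(edges) if rng.random() < keep_prob}

def sample_seed_links(nodes, link_prob, rng):
    """Link each node across the two copies independently with link_prob."""
    return {v: v for v in sorted(nodes) if rng.random() < link_prob}

rng = random.Random(0)
# Toy underlying network: a 6-cycle plus one chord (hypothetical data).
underlying = {(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5), (0, 3)}
copy_1 = sample_edges(underlying, keep_prob=0.75, rng=rng)
copy_2 = sample_edges(underlying, keep_prob=0.75, rng=rng)
seeds = sample_seed_links(range(6), link_prob=0.3, rng=rng)
```

In a real deployment the two copies would of course carry unrelated account ids; the identity mapping is used here only because it doubles as ground truth when checking a reconciliation algorithm.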
In Section 3.2 below, we present a natural algorithm to solve the user identification problem with a set of linked nodes, and discuss some of its properties. Then, in Section 4, we prove that this algorithm performs well on several well-established network models. In Section 5, we show that the algorithm also works very well in practice, by examining its performance on real-world networks.
3.2 The Algorithm
To solve the user identification problem, we design a local distributed algorithm that uses only structural information about the graphs to expand the initial set of links into a mapping/identification of a large fraction of the nodes in the two networks.
Before describing the algorithm, we introduce a useful definition.
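A pair of nodes $(u_1, u_2)$ with $u_1 \in G_1$, $u_2 \in G_2$ is said to be a similarity witness for a pair $(v_1, v_2)$ with $v_1 \in G_1$, $v_2 \in G_2$ if $u_1 \in N_1(v_1)$, $u_2 \in N_2(v_2)$ and $u_1$ has been linked to / identified with $u_2$.
Here, $N_1(v_1)$ denotes the neighborhood of $v_1$ in $G_1$, and similarly $N_2(v_2)$ denotes the neighborhood of $v_2$ in $G_2$.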
Roughly speaking, in each phase of the algorithm, every pair of nodes (one from each network) computes a similarity score that is equal to the number of similarity witnesses they have. We then create a link between two nodes $v_1$ and $v_2$ if $v_2$ is the node in $G_2$ with maximum similarity score to $v_1$ and vice versa. We then use the newly generated set of links as input to the next phase of the algorithm.
A possible risk of this algorithm is that in early phases, when few nodes in the network have been linked, low-degree nodes could be mis-matched. To avoid this (improving precision), in the $i$-th phase, we only allow nodes of degree roughly $D/2^i$ and above to be matched, where $D$ is a parameter related to the largest node degree. Thus, in the first phase, we match only the nodes of very high degree, and in subsequent phases, we gradually decrease the degree threshold required for matching. In the experimental section we will show in fact that this simple step is very effective, reducing the error rate by more than . We summarize the algorithm, which we call User-Matching, as follows:
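Input: $G_1(V, E_1)$, $G_2(V, E_2)$; $L$, a set of initial identification links across the networks; $D$, the maximum degree in the graph; a minimum matching score $T$; and a specified number of iterations $k$.
Output: A larger set of identification links across the networks.
Algorithm:
For $i = 1, \dots, k$
  For $j = \log D, \dots, 1$
    For all the pairs $(u, v)$ with $u \in G_1$ and $v \in G_2$ and such that $d_{G_1}(u) \geq 2^j$ and $d_{G_2}(v) \geq 2^j$
      Assign to $(u, v)$ a score equal to the number of similarity witnesses between $u$ and $v$
    If $(u, v)$ is the pair with highest score in which either $u$ or $v$ appear and the score is above $T$, add $(u, v)$ to $L$.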
Where is the degree of node in . Note that the internal for loop can be implemented efficiently with consecutive rounds of MapReduce, so the total running time would consist of MapReductions. In the experiments, we note that even for a small constant ( or ), the algorithm returns very interesting results. The optimal choice of threshold depends on the desired precision/recall tradeoff; higher choices of T improve precision, but in our experiments, we note that or is sufficient for very high precision.
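The phase structure described in this section can be sketched in a few lines of sequential Python. This is a hedged illustration rather than the authors' implementation: it assumes degree thresholds that are powers of two, decreasing from the largest degree down to 1; it treats the minimum matching score as inclusive (a pair needs at least `min_score` witnesses); and it skips any node whose best score is tied between several candidates. The function and parameter names are hypothetical.

```python
import math
from collections import Counter

def user_matching(adj1, adj2, seeds, min_score=2, iterations=1):
    """Expand seed links using similarity witnesses and degree bucketing.

    adj1, adj2: dict node -> set of neighbours (symmetric adjacency).
    seeds:      dict mapping a node of the first graph to its known match.
    """
    links = dict(seeds)
    degrees = [len(n) for n in adj1.values()] + [len(n) for n in adj2.values()]
    max_deg = max(degrees + [1])
    for _ in range(iterations):
        # Degree buckets: only high-degree nodes may be matched at first.
        for j in range(int(math.log2(max_deg)), -1, -1):
            threshold = 2 ** j
            linked2 = set(links.values())
            scores = Counter()
            # Every linked pair (u1, u2) is a witness for each pair of
            # unlinked neighbours (v1, v2) that pass the degree threshold.
            for u1, u2 in links.items():
                for v1 in adj1.get(u1, ()):
                    if v1 in links or len(adj1.get(v1, ())) < threshold:
                        continue
                    for v2 in adj2.get(u2, ()):
                        if v2 in linked2 or len(adj2.get(v2, ())) < threshold:
                            continue
                        scores[(v1, v2)] += 1
            best1, best2 = Counter(), Counter()
            for (v1, v2), s in scores.items():
                best1[v1] = max(best1[v1], s)
                best2[v2] = max(best2[v2], s)
            ties1, ties2 = Counter(), Counter()
            for (v1, v2), s in scores.items():
                ties1[v1] += s == best1[v1]
                ties2[v2] += s == best2[v2]
            # Link pairs that are each other's unique best candidate.
            for (v1, v2), s in scores.items():
                if (s >= min_score and s == best1[v1] and s == best2[v2]
                        and ties1[v1] == 1 and ties2[v2] == 1):
                    links[v1] = v2
    return links
```

The witness counting only ever touches neighbours of already-linked pairs, which is what makes the step local and easy to express as a join followed by a per-node maximum in a MapReduce-style system.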
4 Theoretical Results
In this section we formally analyze the performance of our algorithm on two network models. In particular, we explain why our simple algorithm should be effective on real social networks. The core idea of the proofs and the algorithm is to leverage high degree nodes to discover all the possible mappings between users. In fact, as we show here theoretically and later experimentally, high degree nodes are easy to detect. Once we are able to detect the high degree nodes, most low degree nodes can be identified using this information.
We start with the Erdős-Rényi (Random) Graph model  to warm up with the proofs, and to explore the intuition behind the algorithm. Then we move to our main theoretical results, for the Preferential Attachment Model. For simplicity of exposition, we assume throughout this section that the two edge-selection probabilities are equal, $s_1 = s_2 = s$; this does not change the proofs in any material detail.
4.1 Warm up: Random Graphs
In this section, we prove that if the underlying ‘true’ network is a random graph generated from the Erdős-Rényi model (also known as $G(n, p)$), our algorithm identifies almost all nodes in the network with high probability.
Formally, in the $G(n, p)$ model, we start with an empty graph on $n$ nodes, and insert each of the $\binom{n}{2}$ possible edges independently with probability $p$. We assume that ; in fact, any constant probability results in graphs which are much denser than any social network. (Footnote 3: In fact, the proof works even with , but it requires more care. However, when $p$ is too close to 1, $G$ is close to a clique and all nodes have near-identical neighborhoods, making it impossible to distinguish between them.) Let $G$ be a graph generated by this process; given this underlying network $G$, we now construct two partial realizations $G_1, G_2$ as described by our model of Section 3.
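We note that the probability a specific edge exists in $G_1$ or $G_2$ is $ps$, where $s$ is the edge-selection probability. Also, if $nps$ is less than $(1 - \varepsilon) \log n$ for $\varepsilon > 0$, the graphs $G_1$ and $G_2$ are not connected w.h.p. . Therefore, we assume that $nps > c \log n$ for some constant $c$.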
In the following we identify the nodes in $G_1$ with $u_1, \dots, u_n$ and the nodes in $G_2$ with $v_1, \dots, v_n$, where nodes $u_i$ and $v_i$ correspond to the same node $i$ in $G$. Throughout, $s$ denotes the edge-selection probability and $l$ the linking probability. In the first phase, the expected number of similarity witnesses for a pair $(u_i, v_i)$ is $(n-1) p s^2 l$. This follows because the expected number of neighbors of $i$ in $G$ is $(n-1)p$, the probability that the edge to a given neighbor survives in both $G_1$ and $G_2$ is $s^2$, and the probability that it is initially linked is $l$. On the other hand, the expected number of similarity witnesses for a pair $(u_i, v_j)$, with $i \neq j$, is $(n-2) p^2 s^2 l$; the additional factor of $p$ is because a given other node must have an edge to both $i$ and $j$, which occurs with probability $p^2$. Thus, there is a factor of $p$ difference between the expected number of similarity witnesses a node $u_i$ has with its true match $v_i$ and with some other node $v_j$, with $j \neq i$. The main intuition is that this factor of $p$ is enough to ensure the correctness of the algorithm. We prove this by separately considering two cases: If $p$ is sufficiently large, the expected number of similarity witnesses is large, and we can apply a concentration bound. On the other hand, if $p$ is small, $np^2$ is so small that the expected number of similarity witnesses is almost negligible.
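The counting argument above is easy to check numerically. The helper below simply evaluates the two expectations as they are derived in the text, namely $(n-1) p s^2 l$ for a true match and $(n-2) p^2 s^2 l$ for a false one; the function name and the sample parameter values are assumptions chosen for illustration only.

```python
def expected_witnesses(n, p, s, l):
    """Expected first-phase similarity witnesses in the G(n, p) setting.

    n: number of nodes, p: edge probability of the underlying graph,
    s: probability an edge survives in each copy, l: linking probability.
    Returns (expectation for a true pair, expectation for a false pair).
    """
    true_pair = (n - 1) * p * s ** 2 * l
    false_pair = (n - 2) * p ** 2 * s ** 2 * l
    return true_pair, false_pair

# Hypothetical parameters: a sparse graph with a small fraction of seeds.
true_pair, false_pair = expected_witnesses(n=1001, p=0.01, s=0.5, l=0.1)
gap = true_pair / false_pair  # roughly 1 / p
```

For sparse graphs the ratio between the two expectations is close to $1/p$, which is the gap that the two cases of the analysis exploit.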
We start by proving that in the first case there is always a gap between a real and a false match.
If (that is, ), w.h.p. the number of first-phase similarity witnesses between and is at least . The number of first-phase similarity witnesses between and , with is w.h.p. at most .
We prove both parts of the lemma using Chernoff Bounds (see, for instance, ).
Consider a pair for node . Let be a r.v. such that if node and , and if , where is the initial seed of links across and . Then, we have . If , the Chernoff bound implies that . That is,
Now, taking the union bound over the nodes in , w.h.p. every node has the desired number of first-phase similarity witnesses with its copy.
To prove the second part, suppose w.l.o.g. that we are considering the number of first-phase similarity witnesses between and , with . Let if node and , and if . If , the Chernoff bound implies that . That is,
where the last inequality comes from the fact that . Taking the union bound over all unordered pairs of nodes gives the fact that w.h.p., every pair of different nodes does not have too many similarity witnesses. ∎
The theorem above implies that when is sufficiently large, there is a gap between the number of similarity witnesses of pairs of nodes that correspond to the same node and a pair of nodes that do not correspond to the same node. Thus the first-phase similarity witnesses are enough to completely distinguish between the correct copy of a node and possible incorrect matches.
It remains only to consider the case when is smaller than the bound required for Theorem 1. This requires the following useful lemma.
Let be a Bernoulli random variable, which is 1 with probability at most , and 0 otherwise. In independent trials, let denote the outcome of the th trial, and let : If is , the probability that is greater than is at most .
The probability that is at most 2 is given by: . Using the Taylor series expansion for , this is at most . ∎
When we run our algorithm on a graph drawn from $G(n, p)$, we set the minimum matching threshold to be 3.
If , w.h.p., algorithm User-Matching never incorrectly matches nodes and with .
Suppose for contradiction the algorithm does incorrectly match two such nodes, and consider the first time this occurs. We use Lemma 2 above. Let denote the event that the vertex is a similarity witness for and .
In order for to have occurred, we must have in and in and . The probability that is therefore at most . Note that each is independent of the others, and that there are such events. As is , the conditions of Lemma 2 apply, and hence the probability that more than 2 such events occur is at most . But is , and hence this event occurs with probability at most . Taking the union bound over all unordered pairs of nodes gives the fact that w.h.p., not more than 2 similarity witnesses can exist for any such pair. But since the minimum matching threshold for our algorithm is , the algorithm does not incorrectly match this pair, contradicting our original assumption. ∎
Having proved that our algorithm does not make errors, we now show that it identifies most of the graph.
Our algorithm identifies fraction of the nodes w.h.p.
Note that the probability that a node is identified is by the Chernoff bound because in expectation it has similarity witnesses. So in expectation, we identify fraction of the nodes. Furthermore, by applying the method of bounded difference  (each node affects the final result at most by ), we get that the result holds also with high probability. ∎
4.2 Preferential Attachment
[PA model]. Let $G_n^m$, $m$ being a fixed parameter, be defined inductively as follows:
$G_1^m$ consists of a single vertex with $m$ self-loops.
$G_n^m$ is built from $G_{n-1}^m$ by adding a new node $u$ together with $m$ edges $e_1 = (u, v_1), \dots, e_m = (u, v_m)$ inserted one after the other in this order. Let $M_i$ be the sum of the degrees of all the nodes when the edge $e_i$ is added. The endpoint $v_i$ is selected with probability $\deg(v_i)/(M_i + 1)$, with the exception of node $u$, which is selected with probability $(\deg(u) + 1)/(M_i + 1)$.
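A small generator helps to make the inductive definition tangible. The sketch below is one possible reading of it, under the assumption that each endpoint is chosen with probability proportional to its current degree and that the newly added node is counted with one extra unit of weight; the function name and the list-of-endpoints trick used for degree-proportional sampling are choices made for this example, not something prescribed by the text.

```python
import random

def preferential_attachment(n, m, rng):
    """Generate the edge list of a preferential attachment multigraph.

    Node 0 starts with m self-loops; every later node u adds m edges one
    after the other. Self-loops and parallel edges are allowed.
    """
    # Each edge contributes both endpoints, so a node of degree d appears
    # d times and uniform sampling from the list is degree-proportional.
    endpoints = [0] * (2 * m)
    edges = [(0, 0)] * m
    for u in range(1, n):
        for _ in range(m):
            # One extra slot gives the new node u weight deg(u) + 1.
            r = rng.randrange(len(endpoints) + 1)
            target = u if r == len(endpoints) else endpoints[r]
            edges.append((u, target))
            endpoints.extend((u, target))
    return edges

rng = random.Random(0)
edges = preferential_attachment(n=200, m=4, rng=rng)
```

Sampling from the flat endpoint list avoids recomputing degree sums at every step, so generating a graph with $n$ nodes takes time linear in the number of edges.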
The PA model is the most celebrated model for social networks. Unlike the Erdős-Rényi model, in which all nodes have roughly the same degree, PA graphs have a degree distribution that more accurately resembles the skewed degree distribution seen in real social networks. Though more evolved models of social networks have been recently introduced, we focus on the PA model here because it clearly illustrates why our algorithm works in practice. Note that the power-law distribution of the model complicates our proofs, as the overwhelming majority of nodes only have constant degree (), and so we can no longer simply apply concentration bounds to obtain results that hold w.h.p. For a (small) constant fraction of the nodes $u$, there does not exist any node $v$ such that $v \in N_1(u)$ and $v \in N_2(u)$; we cannot hope to identify these nodes, as they have no neighbors “in common” on the two networks. In fact, if and , roughly of nodes of “true” degree will be in this situation. Therefore, to be able to identify a reasonable fraction of the nodes, $m$ needs to be at least a reasonably large constant; this is not a serious constraint, as the median friend count on Facebook, for instance, is over 100. In our experimental section, we show that our algorithm is effective even with smaller $m$.
We now outline our overall approach to identify nodes across two PA graphs. In Lemma 11, we argue that for the nodes of very high degree, their neighborhoods are different enough that we can apply concentration arguments and uniquely identify them. For nodes of intermediate degree () and less, we argue in Lemma 10 that two distinct nodes of such degree are very unlikely to have more than 8 neighbors in common. Thus, running our algorithm with a minimum matching threshold of 9 guarantees that there are no mistakes. Finally, we prove in Lemma 12 that when we run the algorithm iteratively, the high-degree nodes help us identify many other nodes, these nodes together with the high-degree nodes in turn help us identify more, and so on: Eventually, the probability that any given node is unidentified is less than a small constant, which implies that we correctly identify a large fraction of the nodes.
Interestingly, we notice in our experiments that on real networks, the algorithm has the same behavior as on PA graphs. In fact, as we will discuss later, the algorithm is always able to identify high-degree/important nodes and then, using this information, identify the rest of the graph.
Technical Results: The first of the three main lemmas we need, Lemma 11, states that we can identify all of the high-degree nodes correctly. To prove this, we need a few technical results. These results say that all nodes of high degree join the network early, and continue to increase their degree significantly throughout the process; this helps us show that high-degree nodes do not share too many neighbors.
4.2.1 High degree nodes are early-birds
Here we will prove formally that the nodes of degree join the network very early; this will be useful to show that two high degree nodes do not share too many neighbors.
Let be the preferential attachment graph obtained after steps. Then for any node inserted after time , for any constant , with high probability, where is the degree of nodes at time .
It is possible to prove that such nodes have expected constant degree, but unfortunately, it is not trivial to get a high probability result from this observation because of the inherent dependencies that are present in the preferential attachment model. For this reason we will not prove the statement directly, but we will take a short detour inspired by the proof in . In particular we will first break the interval into a constant number of small intervals. Then we will show that in each interval the degree of will increase by at most with high probability. Thus we will be able to conclude that at the end of the process the total degree of is at most (recall that we only have a constant number of intervals).
As mentioned above, we analyze the evolution of the degree of in the interval to by splitting this interval into a constant number of segments of length , for some constant to be fixed later. Now we can focus on what happens to the degree of in the interval if , for some constant and . Note that if we can prove that , for some constant with probability , we can then finish the proof by the arguments presented in the previous paragraph.
In order to prove this, we will take a small detour to avoid the dependencies in the preferential attachment model. More specifically, we will first show that this is true with high probability for a random variable for which it is easy to get the concentration result. Then we will prove that the random variable stochastically dominates the increase in the degree of . Thus the result will follow.
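For reference, concentration steps of this kind typically rely on the standard multiplicative Chernoff bounds, stated here in a common textbook form. The symbols $X$, $\mu$, $\delta$ and $R$ are generic placeholders and are not the specific quantities or constants used in this proof.

```latex
% X = X_1 + ... + X_k is a sum of independent 0/1 random variables, \mu = E[X].
\Pr[X \ge (1+\delta)\mu] \le \exp\!\left(-\frac{\delta^{2}\mu}{2+\delta}\right) \quad (\delta > 0),
\qquad
\Pr[X \le (1-\delta)\mu] \le \exp\!\left(-\frac{\delta^{2}\mu}{2}\right) \quad (0 < \delta < 1),
\qquad
\Pr[X \ge R] \le 2^{-R} \quad (R \ge 6\mu).
```

These bounds require independence, which the degree process itself does not have; stochastic domination by independent coin tosses is what allows them to be transferred to the degree increase of a node.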
Now, let us define as the number of heads that we get when we toss a coin that comes up heads with probability for times, for some constant . It is possible to see that:
Now we fix and we use the Chernoff bound to get the following result:
So we know that the value of is bounded by with probability . Now, note that until the degree of is less than the probability that increases its degree in the next step is stochastically dominated by the probability that we get a head when we toss a coin that comes up heads with probability . To conclude our argument we study the probability that reaches degree precisely at time . Note that until time has degree smaller than and so it is dominated by the coin. But we already know that when we toss such a coin at most times the probability of getting heads is in . Thus for any the probability that reaches degree at time is . Thus by using the union bound on all the possible , will get to degree with probability .
At this point we can finish the proof by taking the union bound over all the segments (recall that there is a constant number of them) and over the number of nodes, and we get that all the nodes that join the network after time have degree that is upper bounded by for some constant with probability . ∎
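The statement above can be explored empirically. The following is a minimal sketch, assuming a simplified preferential attachment process in which each arriving node draws `m` endpoints proportionally to the current degrees (no self-loops after the first node). The function names `preferential_attachment` and `max_degree_after`, and all parameters, are illustrative and not taken from the paper.

```python
import random


def preferential_attachment(n, m, seed=0):
    """Simulate a simplified preferential attachment multigraph.

    Node t arrives at step t and attaches m edges to earlier nodes,
    each chosen with probability proportional to its current degree.
    Returns (degree, edges).
    """
    rng = random.Random(seed)
    # One entry per edge endpoint: a uniform draw from this list is a
    # degree-proportional draw over nodes.
    endpoints = [0, 0]               # node 0 starts with a self-loop
    degree = [2] + [0] * (n - 1)
    edges = [(0, 0)]
    for t in range(1, n):
        # Draw all m targets before adding t's own endpoints, so t
        # never attaches to itself in this simplified variant.
        targets = [rng.choice(endpoints) for _ in range(m)]
        for u in targets:
            edges.append((t, u))
            endpoints.extend((t, u))
            degree[t] += 1
            degree[u] += 1
    return degree, edges


def max_degree_after(degree, cutoff):
    """Largest final degree among nodes that arrived at step >= cutoff."""
    return max(degree[cutoff:])


if __name__ == "__main__":
    degree, _ = preferential_attachment(20000, 4, seed=1)
    print("max degree overall:", max(degree))
    print("max degree, late arrivals:", max_degree_after(degree, 10000))
```

Comparing the two printed values for different cutoffs gives a rough feel for how strongly arrival time limits the final degree; it is a sanity check, not a substitute for the high-probability argument.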
4.2.2 The rich get richer
In this section we study another fundamental property of the preferential attachment model, which is that nodes with degree bigger than continue to increase their degree significantly throughout the process. More formally:
Let be the preferential attachment graph obtained after steps. Then with high probability for any node of degree and for any fixed constant , a fraction of the neighbors of joined the graph after time .
By Lemma 5 above, we know that joined the network before time for any fixed constant . Now we consider two cases. In the first, , in which case the statement is true because the final degree is bigger than . Otherwise, we have that ; in this case the probability that increases its degree at every time step after dominates the probability that a toss of a biased coin that comes up heads with probability comes up heads. Now consider the random variable that counts the number of heads when we toss a coin that lands heads with probability for times. The expected value of is:
Thus using the Chernoff bound:
Thus with probability is bigger than but, as mentioned before, the increase in the degree of stochastically dominates . Thus taking the union bound on all the possible we get that the statement holds with probability equal to . Thus the claim follows. ∎
4.2.3 First-mover advantage
Let be the preferential attachment graph obtained after steps. Then with high probability all the nodes that join the network before time have degree at least .
To prove this theorem we will use some results from , but first we need to introduce another model equivalent to preferential attachment. In this new process, instead of constructing , we first construct and then we collapse the vertices to construct the first vertex, the vertices between to construct the second vertex, and so on. It is not hard to see that this new model is equivalent to preferential attachment. Now we can prove our technical theorem.
Now we can state two useful equations from the proof of Lemma 6 in . Consider the model . Let , where is the degree of a node inserted at time at time . Then we have:
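The collapsing construction can be sketched in a few lines. This is a minimal illustration, assuming the common convention in which each fine-grained vertex adds a single edge and may attach to itself (a self-loop); the names `single_edge_pa` and `collapsed_pa` are hypothetical, and vertices are 0-indexed here.

```python
import random


def single_edge_pa(num_vertices, rng):
    """Build the one-edge-per-vertex process on num_vertices vertices.

    Vertex t adds one edge whose target is drawn proportionally to
    degree, counting t's own outgoing half-edge (so a self-loop is
    possible). Returns a list of (source, target) pairs.
    """
    endpoints = []
    edges = []
    for t in range(num_vertices):
        endpoints.append(t)              # t's own half-edge
        target = rng.choice(endpoints)   # may be t itself
        endpoints.append(target)
        edges.append((t, target))
    return edges


def collapsed_pa(n, m, seed=0):
    """Build the m-edge graph on n nodes by collapsing blocks of m vertices.

    Fine vertices 0..m-1 become node 0, m..2m-1 become node 1, and so on.
    """
    rng = random.Random(seed)
    fine_edges = single_edge_pa(n * m, rng)
    return [(u // m, v // m) for u, v in fine_edges]


if __name__ == "__main__":
    for edge in collapsed_pa(5, 2, seed=3):
        print(edge)
```

After collapsing, every node has exactly `m` outgoing edges and each edge points to a node that arrived no later than its source, which is the structure the degree bounds are stated for.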
From the same paper we also have that if , we can derive from equation (23) that
From 1 we can derive that:
Thus we get that:
where we assumed that .
So now by union bounding on the first nodes we obtain that with high probability in all the nodes have degree bigger than . But this implies in turn the statement of the theorem by construction of . ∎
4.2.4 Handling product of generalized harmonic
Let and be constants greater than . Then:
Recall that . Then
Completing the Proof: We now use the technical lemmas above to complete our proof for the preferential attachment model.
For a node with degree , the probability that it is incident to a node arriving at time is at most w.h.p.
If node arrives after , the probability that is adjacent to is at most the given value, since there are edges existing in the graph already, and we take the union bound over the edges incident to . If arrives before , let denote the time at which arrives. From Lemma 6 of , the degree of at t is at most w.h.p. But there are edges already in the graph at this time, and since has edges incident to it, the probability that one of them is incident to is at most . ∎
W.h.p, for any pair of nodes of degree , .
From Lemma 7, nodes and must have arrived after time . Let be constants such that and for some constant . We first show that the probability that any two nodes with degree less than and arriving before time have or more common neighbors between and is at most . This implies that, setting to , nodes and have at most 2 neighbors between and , at most 2 between and , and at most 2 between and , for a total of 6 overall. Similarly, we show that and have at most 2 neighbors arriving before , which completes the lemma.
From Lemma 9 above, the probability that a node arriving at time is incident to and is at most . (The events are not independent, but they are negatively correlated.) The probability that 3 nodes are all incident to both and , then, is at most . Therefore, for a fixed , the probability that some 3 nodes are adjacent to and is at most:
There are at most choices for each of and ; taking the union bound, the probability that any pair , have 3 or more neighbors in common is at most . ∎
So, by setting the matching threshold to , the algorithm never makes errors; we now prove that it actually detects a lot of good new links.
The algorithm successfully identifies any node of degree .
For any node of degree , the expected number of similarity witnesses it has with its copy during the first phase is ; using the Chernoff Bound, the probability that the number is less than of its expectation is at most . Therefore, with very high probability, every node of degree has at least first-phase similarity witnesses with its copy.
On the other hand, how many similarity witnesses can node have with a copy of a different node ? Fix , and first consider potential similarity witnesses that arrive before time ; later, we consider those that arrive after this time. From Lemma 6, we have . Even if all of these neighbors of are also incident to , the expected number of similarity witnesses for is at most . Now consider the neighbors of that arrive after time . Each of these nodes is a neighbor of with probability . But , and hence each of the neighbors of is a neighbor of with probability . Therefore, the expected number of similarity witnesses for among these nodes is at most . Therefore, the total expected number of similarity witnesses is at most . Again using the Chernoff Bound, the probability that this is at least is at most , which is at most .
To conclude, we showed that with very high probability, a high-degree node has at least first-phase similarity witnesses with its copy, and has fewer than this number of witnesses to the copy of any other node. Therefore, our algorithm correctly identifies all high-degree nodes. ∎
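The counting that this lemma reasons about can be made concrete. The sketch below assumes the usual notion of a similarity witness: a pair of already linked nodes, one adjacent to `u` in the first graph and its counterpart adjacent to `v` in the second. The function name `count_witnesses` and the dictionary-based graph representation are illustrative choices, not the paper's implementation.

```python
def count_witnesses(adj1, adj2, links, u):
    """Count similarity witnesses between u (in graph 1) and each candidate.

    adj1, adj2: dict mapping node -> set of neighbours.
    links: dict mapping an identified node of graph 1 to its copy in graph 2.
    Returns a dict mapping each candidate v in graph 2 to its witness count.
    """
    scores = {}
    for w1 in adj1.get(u, ()):
        w2 = links.get(w1)
        if w2 is None:
            continue                      # neighbour not identified yet
        for v in adj2.get(w2, ()):
            scores[v] = scores.get(v, 0) + 1
    return scores


if __name__ == "__main__":
    adj1 = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
    adj2 = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A"}, "D": {"B"}}
    links = {"b": "B", "c": "C"}
    print(count_witnesses(adj1, adj2, links, "a"))
```

In this toy input the true copy `A` collects two witnesses while the unrelated node `D` collects one, which mirrors the gap between a node's score with its own copy and with any other node.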
From the two preceding lemmas, we identify all the high-degree nodes, and make no mistakes on the low-degree nodes. It therefore remains only to show that we have a good success probability for the low-degree nodes. In the lemma below, we show this when . We note that one still obtains good results even with a higher or lower value of , but it appears difficult to obtain a simple closed-form expression for the fraction of identified nodes. For ease of exposition, we present the case of here, but the proof generalizes in the obvious way.
Suppose . Then, w.h.p., we successfully identify at least of the nodes.
We have already seen that all high-degree nodes (those arriving before time ) are identified in the first phase of the algorithm. Note also that it never makes a mistake; it therefore remains only to identify the lower-degree nodes. We describe a sequence of iterations in which we bound the probability of failing to identify other nodes.
Consider a sequence of roughly iterations, in each of which we analyze nodes. In particular, iteration contains all nodes that arrived between time and time . We argue inductively that after iteration , w.h.p. the fraction of nodes belonging to this iteration that are not identified is less than , and the total fraction of degree incident to unidentified nodes is less than . Since this is true for each , we obtain the lemma.
The base case of nodes arriving before has already been handled. Now, note that during any iteration, the total degree incident to nodes of this iteration is at most . Thus, when each node of this iteration arrives, the probability that any of its edges is incident to another node of this iteration is less than .
Consider any of the edges incident to a given node of this iteration. For each edge, we say it is good if it survives in both copies of the graph, and is incident to an identified node from a previous iteration. Thus, the probability that an edge is good is at least . Since , the expected number of good edges is greater than 20. The node will be identified if at least of its edges are good; applying the Chernoff bound, the probability that a given node is unidentified is at most .
Since this is true for each node of this iteration, regardless of the outcomes for previous nodes of the iteration, we can apply concentration inequalities even though the events are not independent. In particular, the number of identified nodes stochastically dominates the number of successes in independent Bernoulli trials with probability (see, for example, Theorem 1.2.17 of ). Again applying the Chernoff Bound, the probability that the fraction of unidentified nodes exceeds is at most , which is negligible. To complete the induction, we need to show that the fraction of total degree incident to unidentified nodes is at most . To observe this, note that the increase in degree is ; the unidentified fraction increases if the new nodes are unidentified (but we have seen the expected contribution here is at most ), or if the “other” endpoint is another node of this iteration (at most ), or if the “other” endpoint is an unidentified node (in expectation, at most ). Again, a simple concentration argument completes the proof. ∎
In this section we analyze the performance of our algorithm in different experimental settings. The main goal of this section is to answer the following eight questions:
Are our theorems robust? Do our results depend on the constants that we use or are they more general?
How does the algorithm scale on very large graphs?
Does our algorithm work only for an underlying “real” network generated by a random process such as Preferential Attachment, or does it work for real social networks?
How does the algorithm perform when the two networks to be matched are not generated by independently deleting edges, but by a different process like a cascade model?
How does the algorithm perform when the two networks to be matched have different scopes? Is the algorithm robust to highly correlated edge deletion?
Does our model capture reality well? In more realistic scenarios, with distinct but similar graphs, does the algorithm perform acceptably?
How does our algorithm perform when the network is under attack? Can it still have high precision? Is it easy for an adversary to trick our algorithm?
How important is it to bucket nodes by degree? How big is the impact on the algorithm’s precision? How does our algorithm compare with a simple algorithm that just counts the number of common neighbors?
To answer these eight questions, we designed different experiments using different publicly available data sets. These experiments are increasingly challenging for our algorithm, which performs well in all cases, showing its robustness. Before entering into the details of the experiments, we describe briefly the basic datasets used in the paper. We use synthetic random graphs generated by the Preferential Attachment , Affiliation Network , and RMAT  processes; we also consider an early snapshot of the Facebook graph , a snapshot of DBLP , the email network of Enron , a snapshot of Gowalla  (a social network with location information), and Wikipedia in two languages . In Table 1 we report some general statistics on the networks.
|Network||Number of nodes||Number of edges|
|French Wikipedia ||4,362,736||141,311,515|
|German Wikipedia ||2,851,252||81,467,497|
Robustness of our Theorems: To answer the first question, we use as an underlying graph the preferential attachment graph described above, with 1,000,000 nodes and . We analyze the performance of our algorithm when we delete edges with probability and with different seed link probabilities. The main goal of this experiment is to show that the values of needed in our proof are only required for the calculations; the algorithm is effective even with much lower values. With the specified parameters, for the majority of nodes, the expected number of neighbors in the intersection of both graphs is . Nevertheless, as shown in Figure 2, our algorithm performs remarkably well, making zero errors regardless of the seed link probability. Further, it recovers almost the entire graph. Unsurprisingly, lowering the threshold for our algorithm increases recall, but it is interesting to note that in this setting, it does not affect precision at all.
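The experimental setup of independent edge deletion plus random seed links can be sketched as follows. This is a minimal illustration under stated assumptions: `make_copies`, `survive_p` and `seed_p` are hypothetical names, and node identifiers are kept identical in both copies so that the ground truth is the identity map (a real experiment would typically relabel one copy).

```python
import random


def make_copies(edges, nodes, survive_p, seed_p, seed=0):
    """Generate two noisy copies of a graph and a set of seed links.

    Each edge survives independently in each copy with probability
    survive_p; each node is revealed as a seed link with probability
    seed_p. Returns (copy1, copy2, seeds).
    """
    rng = random.Random(seed)
    copy1 = [e for e in edges if rng.random() < survive_p]
    copy2 = [e for e in edges if rng.random() < survive_p]
    seeds = {v: v for v in nodes if rng.random() < seed_p}
    return copy1, copy2, seeds


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
    c1, c2, seeds = make_copies(edges, range(4), survive_p=0.5, seed_p=0.25)
    print(len(c1), len(c2), sorted(seeds))
```

An edge is present in both copies with probability `survive_p ** 2`, which is why the expected number of common neighbours of a node shrinks quickly as the survival probability drops.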
Efficiency of our algorithms: Here we tested our algorithms with datasets of increasing size. In particular we generate 3 synthetic random graphs of increasing size using the RMAT random model. Then we use the three graphs as the underlying “real” networks and we generate graphs from them with edges surviving with probability . Finally we analyze the running time of our algorithm with seed link probability equal to . As shown in Table 2, using the same amount of resources, the running time of the algorithm increases by at most a factor between the smallest and the largest graph.
|Network||Number of nodes||Relative running time|
Robustness to other models of the underlying graph: For our third question, we move away from synthetic graphs, and consider the snapshots of Facebook and the Enron email networks as our initial underlying networks. For Facebook, edges survive either with probability or , and we analyze performance of our algorithm with different seed link probabilities. For Enron, which is a much sparser network, we delete the edges with probability and analyze performance of our algorithm with seed link probability equal to . The main goal of these experiments is to show that our algorithm performs well outside the boundary of our theoretical results, even when the underlying network is not generated by a random model.
In the first experiment with Facebook, when edges survive with probability , there are nodes with degree at least in both networks. (Note that we can only detect nodes that have degree at least 1 in both networks.) In the second, with edges surviving with probability , there are nodes with this property. In this case, the results are also very strong; see Table 3. Roughly of nodes have extremely low degree , and so our algorithm cannot obtain recall as high as in the previous setting. However, we identify a very large fraction of the roughly nodes with degree above , and the precision is still remarkably good; in all cases, the error is well under . Table 3 presents the full results for the harder case, with edge survival probability . With edge survival probability (not shown in the table), performance is even better: at threshold and the lowest seed link probability of , we correctly identify nodes and incorrectly identify , an error rate of well under .
In the case of Enron, the original email network is very sparse, with an average degree of approximately 20; this means that each copy has average degree roughly 10, which is much sparser than real social networks. Of the 36,692 original nodes, only 21,624 exist in the intersection of the two copies; over 18,000 of these have degree , and the average degree is just over . Still, with matching threshold , we identify almost all the nodes of degree and above, and even in this very sparse graph, the error rate among newly identified nodes is .
|Pr||Threshold 5||Threshold 4||Threshold 2|
|Pr||Threshold 5||Threshold 4||Threshold 3|
Robustness to different deletion models: We now turn our attention to the fourth question: How much do our results depend on the process by which the two copies are generated? To answer this, we analyze a different model where we generate the two copies of the underlying graph using the Independent Cascade Model of . More specifically, we construct a graph starting from one seed node in the underlying social network and we add to the graph the neighbors of the node with probability . Subsequently, every time we add a node, we consider all its neighbors and add each of them independently with probability (note that we can try to add a node to the graph multiple times).
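The cascade sampling just described can be sketched as a breadth-first process. This is a minimal illustration: the names `cascade_sample` and `induced_edges` are hypothetical, and taking the induced subgraph on the sampled nodes as the resulting copy is an assumption, since the text does not spell out which edges are kept.

```python
import random
from collections import deque


def cascade_sample(adj, start, p, seed=0):
    """Sample a node set by a cascade from start.

    Every time a node is added, each of its neighbours is added
    independently with probability p; a node can be tried several
    times, once per already-added neighbour.
    """
    rng = random.Random(seed)
    added = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in added and rng.random() < p:
                added.add(v)
                queue.append(v)
    return added


def induced_edges(adj, nodes):
    """Edges of the underlying graph with both endpoints in nodes (assumed copy)."""
    return {(u, v) for u in nodes for v in adj[u] if v in nodes and u < v}


if __name__ == "__main__":
    adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    nodes = cascade_sample(adj, start=0, p=0.5, seed=4)
    print(sorted(nodes), sorted(induced_edges(adj, nodes)))
```

Running the function twice with different seeds yields the two copies; unlike independent edge deletion, the two node sets then differ and are each connected around the starting node.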
The results in this cascade model are extremely good; in fact, for both Facebook and Enron we have errors; as shown for Facebook in Figure 3, we are able to identify almost all the nodes in the intersection of the two graphs (even at seed link prob. , we identify ).
Robustness to correlated edge deletion: We now analyze one of the most challenging scenarios for our algorithm where, independently in the two realizations of the social network, we delete all or none of the edges in a community. For this purpose, we consider the Affiliation Networks model  as the underlying real network. In this model, a bipartite graph of users and interests is constructed using a preferential attachment-like process and then two users are connected in the network if and only if they share an interest (for the model details, refer to ). To generate the two copies in our experiment, we delete the interests independently in each copy with probability , and then we generate the graph using only the surviving interests. Note that in this setting, the same node in the two realizations can have very different neighbors. Still, our algorithm has very high precision and recall, as shown in Table 4.
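The copy-generation step for this experiment can be sketched as follows. The sketch takes the bipartite user-interest structure as given (it does not implement the Affiliation Networks generative process itself), and the names `copy_from_interests`, `memberships` and `delete_p` are illustrative.

```python
import random
from itertools import combinations


def copy_from_interests(memberships, delete_p, rng):
    """Build one copy of the user graph from the surviving interests.

    memberships: dict mapping interest -> set of users sharing it.
    Each interest is deleted independently with probability delete_p;
    two users are connected iff they share a surviving interest.
    """
    edges = set()
    for interest in sorted(memberships):
        if rng.random() < delete_p:
            continue                      # whole community removed at once
        for u, v in combinations(sorted(memberships[interest]), 2):
            edges.add((u, v))
    return edges


if __name__ == "__main__":
    memberships = {"chess": {1, 2, 3}, "golf": {3, 4}, "jazz": {1, 4, 5}}
    rng = random.Random(0)
    copy1 = copy_from_interests(memberships, 0.25, rng)
    copy2 = copy_from_interests(memberships, 0.25, rng)
    print(sorted(copy1 & copy2))
```

Because a deleted interest removes all edges of a community together, the surviving edges of a node are highly correlated, which is what makes this setting harder than independent edge deletion.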
|Pr||Threshold 4||Threshold 3||Threshold 2|
Real world scenarios: Now we move to the most challenging case, where the two graphs are no longer generated by a mathematical process that makes 2 imperfect copies of the same underlying network. For this purpose, we conduct two types of experiments. First, we use the DBLP and the Gowalla datasets in which each edge is annotated with a time, and construct 2 networks by taking edges in disjoint time intervals. Then we consider the French- and German-language Wikipedia link graph.
From the co-authorship graph of DBLP, the first network is generated by considering only the publications written in even years, and the second is generated by considering only the publications written in odd years. Gowalla is a social network where each user could also check in to a location (each check-in has an associated timestamp). Using this information we generate two Gowalla graphs; in the first graph, we have an edge between two nodes if and only if they are friends and they check in to approximately the same location in an odd month. In the second, we have an edge between two nodes if and only if they are friends and they check in to approximately the same location in an even month.
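The even/odd split can be sketched for time-annotated edges as below. The input format (triples of two endpoints and an integer period such as a publication year or a month number) and the name `split_by_parity` are assumptions made for illustration; the Gowalla location-proximity test is not modelled here.

```python
def split_by_parity(timed_edges):
    """Split time-annotated edges into two graphs by parity of the period.

    timed_edges: iterable of (u, v, period) with an integer period.
    Returns (even_edges, odd_edges) as sets of undirected edges.
    """
    even, odd = set(), set()
    for u, v, period in timed_edges:
        edge = (min(u, v), max(u, v))     # normalise undirected edges
        if period % 2 == 0:
            even.add(edge)
        else:
            odd.add(edge)
    return even, odd


if __name__ == "__main__":
    coauthorships = [("ann", "bob", 2008), ("bob", "ann", 2009),
                     ("ann", "cat", 2010), ("bob", "dan", 2011)]
    even, odd = split_by_parity(coauthorships)
    print(sorted(even))
    print(sorted(odd))
```

Only pairs that collaborate in both an even and an odd period appear as an edge in both graphs, so the two networks overlap without being noisy copies of one another.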
Note that for both DBLP and Gowalla, the two constructed graphs have a different set of nodes and edges, with correlations different from the previous independent deletion models. Nevertheless we will see that the intersection is big enough to retrieve a good part of the networks.
In DBLP, there are nodes in the intersection of the two graphs, but the considerable majority of them have extremely low degree. Over K have degree less than in the intersection of the two graphs, and so again we cannot hope for extremely high recall. However, we do find considerably more nodes than in the input set. We start with a probability of seed links, resulting in seeds; however, note that most of these have extremely low degree, and hence are not very useful. As shown in Table 5, we have nearly nodes identified, with an error rate of under . Note that we identify over half the nodes of degree at least , and a considerably larger fraction of those with higher degree. We include a plot showing precision and recall for nodes of various degrees (Figure 4).
For Gowalla, there are 38103 nodes in the intersection of the two graphs, of which over 32K have degree . We start with 3800 seeds, of which most are low-degree and hence not useful. We identify over 4000 of the (nearly 6000) nodes of degree above 5, with an error rate of . See Table 5 and Figure 4 for more details.
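Figures such as the number of newly identified nodes and the error rate among them can be computed as in the sketch below. It assumes a ground-truth mapping is available and that the error rate is measured only over links that were not given as seeds; the name `evaluate_new_links` is hypothetical.

```python
def evaluate_new_links(links, seeds, truth):
    """Score the links found beyond the seed set.

    links: dict of all output links (graph-1 node -> graph-2 node).
    seeds: dict of the links that were given as input.
    truth: dict with the correct counterpart of each graph-1 node.
    Returns (good, bad, error_rate) over the newly identified links.
    """
    new = {u: v for u, v in links.items() if u not in seeds}
    good = sum(1 for u, v in new.items() if truth.get(u) == v)
    bad = len(new) - good
    error_rate = bad / len(new) if new else 0.0
    return good, bad, error_rate


if __name__ == "__main__":
    truth = {1: 1, 2: 2, 3: 3, 5: 5, 6: 6}
    seeds = {1: 1}
    links = {1: 1, 2: 2, 3: 4, 5: 5}
    print(evaluate_new_links(links, seeds, truth))
```

Excluding the seeds from the denominator keeps the error rate from looking artificially low when many seed links are supplied.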
Finally, for a still more challenging scenario, we consider a case where the two networks do not share a common source, yet may have some similarity in their structure. In particular, we consider the case of the French- and German-language Wikipedia sites, which have 4.36M and 2.85M nodes respectively. Wikipedia also maintains a set of inter-language links, which connect corresponding articles in a pair of languages; for French and German, there are 531710 links, corresponding to only of the French articles. The relatively small number of links illustrates the extent of the difference between the French and German networks. Starting with of the inter-language links as seeds, we are able to nearly triple the number of links (including finding a number of new links not in the input inter-language set), with an error rate of in new links. However, some of these mistakes are due to human errors in Wikipedia’s inter-language links, while others match French articles to closely related German ones; for instance, we link the French article for Lee Harvey Oswald (the assassin of President Kennedy) to the German article on the assassination.
|Pr||Threshold 5||Threshold 4||Threshold 2|
|Pr||Threshold 5||Threshold 4||Threshold 2|
|Pr||Threshold 5||Threshold 3|
Robustness to attack: We now turn our attention to a very challenging question: what is the performance of our algorithm when the network is under attack? In order to answer this question, we again consider the Facebook network as the underlying social network, and from it we generate two realizations with edge probability . Then, in order to simulate an attack, in each network for each node we create a malicious copy of it, , and for each node connected to in the network (that is, ), we add the edge independently with probability . Note that this is a very strong attack model (it assumes that users will accept a friend request from a ’fake’ friend with probability 0.5), and is designed to circumvent our matching algorithm. Nevertheless when we run our algorithm with seed link probability equal to , and with threshold equal to we notice that we are still able to align a very large fraction of the two networks with just a few errors ( correct matches and wrong matches, out of possible good matches).
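The attack model can be sketched as follows. This is a minimal illustration in which a malicious copy of node `v` is represented by the tuple `("fake", v)`; that naming, the function name `add_malicious_copies`, and the adjacency-dictionary representation are assumptions made for the example.

```python
import random


def add_malicious_copies(adj, accept_p=0.5, seed=0):
    """Return a new adjacency dict with one malicious copy per node.

    adj: dict mapping node -> set of neighbours (symmetric).
    For every node v and every neighbour u of v, the edge between u
    and the malicious copy of v is added independently with
    probability accept_p.
    """
    rng = random.Random(seed)
    attacked = {v: set(nbrs) for v, nbrs in adj.items()}
    for v in sorted(adj):
        fake = ("fake", v)
        attacked[fake] = set()
        for u in sorted(adj[v]):
            if rng.random() < accept_p:
                attacked[fake].add(u)
                attacked[u].add(fake)
    return attacked


if __name__ == "__main__":
    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
    attacked = add_malicious_copies(adj, accept_p=0.5, seed=2)
    for node in [("fake", 0), ("fake", 1), ("fake", 2)]:
        print(node, sorted(attacked[node]))
```

Each malicious copy ends up with roughly half of the neighbourhood of the node it imitates, so it competes directly with the genuine node for similarity witnesses.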
Importance of degree bucketing, comparison with straightforward algorithm: We now consider our last question: How important is it to bucket nodes by degree? How big is the impact on the algorithm’s precision? How does our algorithm compare with a straightforward algorithm that just counts the number of common neighbors? To answer this question, we run a few experiments. First, we consider the Facebook graph with edge survival probability and seed link probability , and we repeat the experiments without using the degree bucketing and with threshold equal to 1. In this case we observe that the number of bad matches increases by a factor of without any significant change in the number of good matches.
Then we consider two other interesting scenarios: How does this simple algorithm perform on Facebook under attack? And how does it perform on matching Wikipedia pages? Those two experiments show two weaknesses of this simple algorithm. More precisely, in the first case the simple algorithm obtains precision but its recall is very low. It is indeed able to reconstruct less than half of the number of matches found by our algorithm (22346 vs 46955). On the other hand, the second setting shows that the precision of this simple algorithm can be very low. Specifically, the error rate of the algorithm is , while our algorithm has error rate only . In this second setting (for Wikipedia) the recall is also very low, less than ; there are 71854 correct matches, of which most (53174) are seed links, and 7216 wrong matches.
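The contrast between a plain common-neighbor matcher and a degree-bucketed one can be sketched as below. This is not the authors' implementation: the mutual-best rule, the tie handling (ties are skipped), the power-of-two buckets processed from high to low degree, and the names `best_unique`, `match_round` and `reconcile` are all assumptions made for illustration.

```python
def best_unique(cands):
    """Return the candidate with the strictly highest score, or None on a tie."""
    ranked = sorted(cands.items(), key=lambda kv: -kv[1])
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def match_round(adj1, adj2, links, threshold, min_degree=1):
    """One pass: link mutual-best pairs among nodes of degree >= min_degree."""
    taken = set(links.values())
    by_u, by_v = {}, {}
    for u, nbrs in adj1.items():
        if u in links or len(nbrs) < min_degree:
            continue
        for w1 in nbrs:
            w2 = links.get(w1)
            if w2 is None:
                continue                  # neighbour not linked: no witness
            for v in adj2.get(w2, ()):
                if v in taken or len(adj2.get(v, ())) < min_degree:
                    continue
                score = by_u.setdefault(u, {}).get(v, 0) + 1
                by_u[u][v] = score
                by_v.setdefault(v, {})[u] = score
    new_links = {}
    for u, cands in by_u.items():
        v = best_unique(cands)
        if v is not None and cands[v] >= threshold and best_unique(by_v[v]) == u:
            new_links[u] = v
    return new_links


def reconcile(adj1, adj2, seeds, threshold, bucketed=True, rounds=2):
    """Expand seed links; with bucketed=False this is the plain baseline."""
    links = dict(seeds)
    degrees = [len(n) for n in list(adj1.values()) + list(adj2.values())]
    max_deg = max(degrees, default=0)
    if bucketed:
        top = max_deg.bit_length() - 1    # floor(log2(max_deg)) for max_deg >= 1
        buckets = [2 ** j for j in range(top, -1, -1)]
    else:
        buckets = [1]
    for _ in range(rounds):
        for min_degree in buckets:
            links.update(match_round(adj1, adj2, links, threshold, min_degree))
    return links


if __name__ == "__main__":
    graph = {0: {2, 3}, 1: {2}, 2: {0, 1, 3}, 3: {0, 2}}
    print(reconcile(graph, graph, {0: 0, 1: 1}, threshold=2))
```

Processing high-degree buckets first means that low-degree nodes are only considered once more of their neighbors are already linked, which is one plausible reason the bucketed variant makes fewer wrong matches at the same threshold.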
In this paper, we present the first provably good algorithm for social network reconciliation. We show that in well-studied models of social networks, we can identify almost the entire network, with no errors. Surprisingly, the perfect precision of our algorithm holds even experimentally in synthetic networks. For the more realistic data sets, we still identify a very large fraction of the nodes with very low error rates. Interesting directions for future work include extending our theoretical results to more network models and validating the algorithm on different and more realistic data sets.
We would like to thank Jon Kleinberg for useful discussions and Zoltán Gyöngyi for suggesting the problem.
-  Dblp. http://dblp.uni-trier.de/xml/.
-  Wikipedia dumps. http://dumps.wikimedia.org/.
-  F. Abel, N. Henze, E. Herder, and D. Krause. Interweaving public user profiles on the web. In UMAP 2010, pages 16–27.
-  L. Backstrom, C. Dwork, and J. Kleinberg. Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In WWW 2007, pages 181–190.
-  A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
-  B. Bollobás and O. Riordan. The diameter of a scale-free random graph. Combinatorica, 24(1):5–34, 2004.
-  D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-mat: A recursive model for graph mining. In SDM 2004.
-  E. Cho, S. A. Myers, and J. Leskovec. Friendship and mobility: User movement in location-based social networks. In KDD 2011, pages 1082–1090.
-  C. Cooper and A. Frieze. The cover time of the preferential attachment graph. Journal of Combinatorial Theory Series B, 97(2):269–290, 2007.
-  D. P. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, 2009.
-  P. Erdős and A. Rényi. On Random Graphs I. Publ. Math. Debrecen, 6:290–297, 1959.
-  J. Goldenberg, B. Libai, and E. Muller. Talk of the network: Complex systems look at the underlying process of word-of-mouth. Marketing Letters 2001, pages 211–223.
-  M. Granovetter. The strength of weak ties: A network theory revisited. Sociological Theory, 1:201–233, 1983.
-  K. Henderson, B. Gallagher, L. Li, L. Akoglu, T. Eliassi-Rad, H. Tong, and C. Faloutsos. It’s who you know: graph mining using recursive structural features. In KDD 2011, pages 663–671.
-  J. Kleinberg. Navigation in a small world. Nature, 406(6798):845, 2000.
-  B. Klimt and Y. Yang. Introducing the Enron corpus. In CEAS conference 2004.
-  S. Labitzke, I. Taranu, and H. Hartenstein. What your friends tell others about you: Low cost linkability of social network profiles. In ACM Social Network Mining and Analysis 2011, pages 51–60.
-  S. Lattanzi. Algorithms and models for social networks. PhD thesis, Sapienza, 2011.
-  S. Lattanzi and D. Sivakumar. Affiliation networks. In STOC 2009, pages 427–434.
-  A. Malhotra, L. Totti, W. M. Jr., P. Kumaraguru, and V. Almeida. Studying user footprints in different online social networks. In CSOSN 2012, pages 1065–1070.
-  A. Müller and D. Stoyan. Comparison Methods for Stochastic Models and Risks. Wiley, 2002.
-  A. Mislove, B. Viswanath, P. K. Gummadi, and P. Druschel. You are who you know: inferring user profiles in online social networks. In WSDM 2010, pages 251–260.
-  A. Narayanan and V. Shmatikov. De-anonymizing social networks. In S&P (Oakland) 2009, pages 111–125.
-  J. Novak, P. Raghavan, and A. Tomkins. Anti-aliasing on the web. In WWW 2004, pages 30–39.
-  A. Nunes, P. Calado, and B. Martins. Resolving user identities over social networks through supervised learning and rich similarity features. In SAC 2012, pages 728–729.
-  P. Pedarsani and M. Grossglauser. On the privacy of anonymized networks. In KDD 2011, pages 1235–1243.
-  J. R. Rao and P. Rohatgi. Can pseudonymity really guarantee privacy? In USENIX 2000, pages 85–96.
-  M. Rowe and F. Ciravegna. Harnessing the social web: The science of identity disambiguation. In Web Science Conference 2010.
-  G. Schoenebeck. Potential networks, contagious communities, and understanding social network structure. In WWW 2013, pages 1123–1132.
-  B. Viswanath, A. Mislove, M. Cha, and K. P. Gummadi. On the evolution of user interaction in facebook. In WOSN 2009, pages 37–42.
-  L. Yartseva and M. Grossglauser. On the performance of percolation graph matching. In COSN 2013, pages 119–130.
-  H. Yu, M. Kaminsky, P. B. Gibbons, and A. D. Flaxman. Sybilguard: defending against sybil attacks via social networks. IEEE/ACM Trans. Netw. 16(3), pages 267–278.
-  R. Zafarani and H. Liu. Connecting corresponding identities across communities. In ICWSM 2009, pages 354–357.