Models of on-line social networks


Anthony Bonato Department of Mathematics
Ryerson University
Toronto, ON
Canada, M5B 2K3
abonato@ryerson.ca
Noor Hadi Department of Mathematics
Wilfrid Laurier University
Waterloo, ON
Canada, N2L 3C5
hadi4130@wlu.ca
Paul Horn Department of Mathematics and Computer Science
Emory University
Atlanta, GA
U.S.A., 30322
phorn@mathcs.emory.edu
Paweł Prałat Department of Mathematics and Statistics
Dalhousie University
Halifax NS
Canada, B3H 3J5
pralat@mathstat.dal.ca
 and  Changping Wang Department of Mathematics
Ryerson University
Toronto, ON
Canada, M5B 2K3
cpwang@ryerson.ca
Abstract.

We present a deterministic model for on-line social networks (OSNs) based on transitivity and local knowledge in social interactions. In the Iterated Local Transitivity (ILT) model, at each time-step and for every existing node $x$, a new node appears which joins to the closed neighbour set of $x$. The ILT model provably satisfies a number of both local and global properties that were observed in OSNs and other real-world complex networks, such as a densification power law, decreasing average distance, and higher clustering than in random graphs with the same average degree. Experimental studies of social networks demonstrate poor expansion properties as a consequence of the existence of communities with a low number of inter-community edges. Bounds on the spectral gap for both the adjacency and normalized Laplacian matrices are proved for graphs arising from the ILT model, indicating such bad expansion properties. The cop and domination numbers are shown to remain the same as those of the graph $G_0$ from the initial time-step, and the automorphism group of $G_0$ is a subgroup of the automorphism group of graphs generated at all later time-steps. A randomized version of the ILT model is presented, which exhibits a tuneable densification power law exponent, and maintains several properties of the deterministic model.

Key words and phrases:
complex networks, on-line social networks, transitivity, densification power law, average distance, clustering coefficient, spectral gap, bad expansion, normalized Laplacian
1991 Mathematics Subject Classification:
05C75, 68R10, 91D30
The authors gratefully acknowledge support from NSERC and MITACS

1. Introduction

On-line social networks (OSNs) such as Facebook, MySpace, Twitter, and Flickr have become increasingly popular in recent years. In OSNs, nodes represent people on-line, and edges correspond to a friendship relation between them. In these complex real-world networks with sometimes millions of nodes and edges, new nodes and edges dynamically appear over time. Parallel with their popularity among the general public is an increasing interest within the mathematical and general scientific community in the properties of on-line social networks, both in gathering data and statistics about the networks and in finding models simulating their evolution. Data about social interactions in on-line networks is more readily accessible and measurable than in off-line social networks, which suggests a need for rigorous models capturing their evolutionary properties.

The small world property of social networks, introduced by Watts and Strogatz [37], is a central notion in the study of complex networks, and has roots in the work of Milgram [31] on short paths of friends connecting strangers. The small world property posits low average distance (or diameter) and high clustering, and has been observed in a wide variety of complex networks.

An increasing number of studies have focused on the small world and other complex network properties in OSNs. Adamic et al. [1] provided an early study of an on-line social network at Stanford University, and found that the network has the small world property. Correlation between friendship and geographic location was found by Liben-Nowell et al. [30] using data from LiveJournal. Kumar et al. [27] studied the evolution of the on-line networks Flickr and Yahoo!360. They found (among other things) that the average distance between users actually decreases over time, and that these networks exhibit power-law degree distributions. Golder et al. [23] analyzed the Facebook network by studying the messaging pattern between friends with a sample of 4.2 million users. They also found a power law degree distribution and the small world property. Similar results were found in [2] which studied Cyworld, MySpace, and Orkut, and in [32] which examined data collected from four on-line social networks: Flickr, YouTube, LiveJournal, and Orkut. Power laws for both the in- and out-degree distributions, low diameter, and high clustering coefficient were reported in the Twitter friendship graph by Java et al. [24]. In [25], geographic growth patterns and distinct classes of users were investigated in Twitter. For further background on complex networks and their models, see the books [6, 9, 12, 15].

Recent work by Leskovec et al. [28] underscores the importance of two additional properties of complex networks above and beyond more traditionally studied phenomena such as the small world property. A graph $G$ with $e_t$ edges and $n_t$ nodes satisfies a densification power law if there is a constant $a \in (1,2)$ such that $e_t$ is proportional to $n_t^a$. In particular, the average degree grows to infinity with the order of the network (in contrast to, say, the preferential attachment model, which generates graphs with constant average degree). In [28], densification power laws were reported in several real-world networks such as a physics citation graph and the internet graph at the level of autonomous systems. Another striking property found in such networks (and also in on-line social networks; see [27]) is that distances in the networks (measured by either diameter or average distance) decrease with time. The usual models such as preferential attachment or copying models have logarithmically or sublogarithmically growing diameters and average distances with time. Various models (such as the Forest Fire [28] and Kronecker multiplication [29] models) were proposed simulating power law degree distribution, densification power laws, and decreasing distances.

We present a new model, called Iterated Local Transitivity (ILT), for OSNs and other complex networks which dynamically simulates many of their properties. The present article is the full version of the proceedings paper [8]. Although modelling has been done extensively for other complex networks such as the web graph (see [6]), models of OSNs have only recently been introduced (such as those in [14, 27, 30]). The central idea behind the ILT model is what sociologists call transitivity: if $u$ is a friend of $v$, and $v$ is a friend of $w$, then $u$ is a friend of $w$ (see, for example, [18, 36, 38]). In its simplest form, transitivity gives rise to the notion of cloning, where $u$ is joined to all of the neighbours of $v$. In the ILT model, given some initial graph as a starting point, nodes are repeatedly added over time which clone each node, so that the new nodes form an independent set. The ILT model not only incorporates transitivity, but uses only local knowledge in its evolution, in that a new node only joins to neighbours of an existing node. Local knowledge is an important feature of social and complex networks, where nodes have only limited influence on the network topology. We stress that our approach is mathematical rather than empirical; indeed, the ILT model (apart from its potential use by computer and social scientists as a simplified model for OSNs) should be of theoretical interest in its own right.

Variants of cloning were considered earlier in duplication models for protein-protein interactions [4, 5, 11, 35], and in copying models for the web graph [7, 26]. There are several differences between the duplication and copying models and the ILT model. For one, duplication models are difficult to analyze due to their rich dependence structure. While the ILT model displays a dependency structure, determinism makes it more amenable to analysis. The ILT model may be viewed as a simplified snapshot of the duplication model, where all nodes are cloned in a given time-step, rather than duplicating nodes one-by-one over time. Cloning all nodes at each time-step as in the ILT model leads to densification and high clustering, along with bad expansion properties (as we describe in Subsection 1.2).

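We finish the introduction with some asymptotic notation. Let $f$ and $g$ be functions whose domain is some fixed subset of $\mathbb{R}$. We write $f \in O(g)$ if

$$\lim_{x \to \infty} \frac{f(x)}{g(x)}$$

exists and is finite. We will abuse notation and write $f = O(g)$. We write $f = \Omega(g)$ if $g = O(f)$, and $f = \Theta(g)$ if $f = O(g)$ and $f = \Omega(g)$. If $\lim_{x \to \infty} f(x)/g(x) = 0$, then $f = o(g)$ (or $g = \omega(f)$). So if $f = o(1)$, then $f$ tends to $0$.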

1.1. The ILT Model

We now give a precise formulation of the model. The ILT model generates finite, simple, undirected graphs $(G_t : t \geq 0)$. Time-step $t$, for $t \geq 1$, is defined to be the transition between $G_{t-1}$ and $G_t$. (Note that a directed graph model will be considered in the sequel. See also Section 3.) The only parameter of the model is the initial graph $G_0$, which is any fixed finite connected graph. Assume that for a fixed $t \geq 0$, the graph $G_t$ has been constructed. To form $G_{t+1}$, for each node $x \in V(G_t)$, add its clone $x'$, such that $x'$ is joined to $x$ and all of its neighbours at time $t$. Note that the set of new nodes at time $t+1$ forms an independent set of cardinality $|V(G_t)|$. See Figure 1 for the graphs generated from a cycle over the first few time-steps.

Figure 1. The evolution of the ILT model starting from a cycle, over the first few time-steps.

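We write $\deg_t(x)$ for the degree of a node $x$ at time $t$, $n_t$ for the order of $G_t$, and $e_t$ for its number of edges. It is straightforward to see that $n_t = 2^t n_0$. Given a node $x$ at time $t$, let $x'$ be its clone. The elementary but important recurrences governing the degrees of nodes are given as

$$\deg_{t+1}(x) = 2\deg_t(x) + 1, \tag{1.1}$$
$$\deg_{t+1}(x') = \deg_t(x) + 1. \tag{1.2}$$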
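The cloning step and the degree recurrences (1.1) and (1.2) can be checked on a small example. The following is a minimal Python sketch and not part of the original analysis; the integer labelling (in a graph of order `n`, the clone of node `x` is `n + x`), the helper names `ilt_step` and `cycle`, and the choice of a 4-cycle as initial graph are assumptions made purely for illustration.

```python
def ilt_step(adj):
    """Apply one ILT time-step to a graph stored as a list of neighbour sets.

    Labelling assumption: in a graph of order n, the clone of node x is n + x.
    """
    n = len(adj)
    new = [set(nbrs) for nbrs in adj] + [set() for _ in range(n)]
    for x in range(n):
        clone = n + x
        # The clone joins x and every neighbour of x at the current time.
        for y in adj[x] | {x}:
            new[clone].add(y)
            new[y].add(clone)
    return new


def cycle(n):
    """Cycle on n >= 3 nodes, used here as a hypothetical initial graph."""
    return [{(i - 1) % n, (i + 1) % n} for i in range(n)]


g0 = cycle(4)
g1 = ilt_step(g0)
n0 = len(g0)

# The order doubles at each time-step.
assert len(g1) == 2 * n0
for x in range(n0):
    # (1.1): x gains its own clone and the clone of each neighbour.
    assert len(g1[x]) == 2 * len(g0[x]) + 1
    # (1.2): the clone is joined to x and to the neighbours of x.
    assert len(g1[n0 + x]) == len(g0[x]) + 1
# The new nodes form an independent set.
new_nodes = set(range(n0, 2 * n0))
assert all(not (g1[c] & new_nodes) for c in new_nodes)
```

Storing the graph as a list of neighbour sets keeps the clone rule a direct transcription of the definition; any other adjacency representation would serve equally well.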

1.2. Main Results

We state our main results on the ILT model, with proofs deferred to the next section. We give rigorous proofs that the ILT model generates graphs satisfying a densification power law and in many cases decreasing average distance (properties shared by the Forest Fire [28] and Kronecker multiplication [29] models). A randomized version of the ILT model is introduced with tuneable densification power law exponent. Properties of the ILT model not shown in the models of [28, 29] are higher clustering than in random graphs with the same average degree, and smaller spectral gaps for both their normalized Laplacian and adjacency matrices than in random graphs. Further, the cop and domination numbers are shown to remain the same as those of the initial graph $G_0$, and the automorphism group of $G_0$ is a subgroup of the automorphism group of graphs generated at all later times. The ILT model does not, however (unlike the models of [28, 29]), generate graphs with a power law degree distribution. The number of nodes in the ILT model grows exponentially with time (as in the Kronecker multiplication model, but unlike in the Forest Fire model).

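We first demonstrate that the model exhibits a densification power law. Define the volume of $G_t$ by

$$\mathrm{vol}(G_t) = \sum_{x \in V(G_t)} \deg_t(x) = 2e_t.$$

Theorem 1.1.

For $t > 0$, the average degree of $G_t$ equals

$$\left(\frac{3}{2}\right)^t \left(\frac{\mathrm{vol}(G_0)}{n_0} + 2\right) - 2.$$

Note that Theorem 1.1 supplies a densification power law with exponent $a = \log 3 / \log 2 \approx 1.58$. We think that the densification power law makes the ILT model realistic, especially in light of real-world data mined from complex networks (see [28]).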
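As a sanity check on the densification claim, the average degree of simulated graphs can be compared with the closed form $(3/2)^t(\mathrm{vol}(G_0)/n_0 + 2) - 2$, which follows from the recurrences (1.1) and (1.2). The sketch below is illustrative only; the helper names, the integer labelling of clones, and the 4-cycle starting graph are assumptions and do not come from the paper.

```python
from fractions import Fraction


def ilt_step(adj):
    """One ILT time-step; the clone of node x in a graph of order n is n + x."""
    n = len(adj)
    new = [set(nbrs) for nbrs in adj] + [set() for _ in range(n)]
    for x in range(n):
        for y in adj[x] | {x}:
            new[n + x].add(y)
            new[y].add(n + x)
    return new


def cycle(n):
    """Cycle on n >= 3 nodes (hypothetical initial graph)."""
    return [{(i - 1) % n, (i + 1) % n} for i in range(n)]


def average_degree(adj):
    """Exact average degree vol(G)/n as a rational number."""
    return Fraction(sum(len(nbrs) for nbrs in adj), len(adj))


def closed_form(vol0, n0, t):
    """Average degree at time t predicted from vol(G_0) and n_0."""
    return Fraction(3, 2) ** t * (Fraction(vol0, n0) + 2) - 2


g = cycle(4)
vol0, n0 = sum(len(nbrs) for nbrs in g), len(g)
for t in range(6):
    assert average_degree(g) == closed_form(vol0, n0, t)
    g = ilt_step(g)
```

Since $n_t = 2^t n_0$ and $e_t$ grows like $3^t$, the number of edges grows like $n_t^{\log_2 3}$, which is where the exponent comes from. Exact rational arithmetic is used so that the comparison does not depend on floating-point tolerances.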

We study the average distances and clustering coefficient of the model as time tends to infinity. Define the Wiener index of $G_t$ as

$$W(G_t) = \frac{1}{2} \sum_{x, y \in V(G_t)} d(x,y).$$

The Wiener index may be used to define the average distance of $G_t$ as

$$L(G_t) = \frac{W(G_t)}{\binom{n_t}{2}}.$$

We will compute the average distance by deriving first the Wiener index. Define the ultimate average distance of $G_0$ as

$$UL(G_0) = \lim_{t \to \infty} L(G_t),$$

assuming the limit exists. Note that the ultimate average distance is a new graph parameter. We provide an exact value for $L(G_t)$ and compute the ultimate average distance for any initial graph $G_0$.

Theorem 1.2.


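  1. For $t > 0$,

    $$W(G_t) = 4^t \left( W(G_0) + (e_0 + n_0)\left(1 - \left(\frac{3}{4}\right)^t\right) \right).$$

  2. For $t > 0$,

    $$L(G_t) = \frac{2 \cdot 4^t \left( W(G_0) + (e_0 + n_0)\left(1 - \left(\frac{3}{4}\right)^t\right) \right)}{4^t n_0^2 - 2^t n_0}.$$

  3. For all graphs $G_0$,

    $$UL(G_0) = \frac{2\left(W(G_0) + e_0 + n_0\right)}{n_0^2}.$$

    Further, $UL(G_0) < L(G_0)$ if and only if $W(G_0) > (n_0 - 1)(e_0 + n_0)$.

Note that the average distance of $G_t$ is bounded above by $\mathrm{diam}(G_0) + 1$ (in fact, by $\mathrm{diam}(G_0)$ in all cases except cliques). Further, the condition in (3) for $UL(G_0) < L(G_0)$ holds for large cycles and paths. Hence, for many initial graphs $G_0$ the average distance decreases, a property observed in OSNs and other complex networks (see [27, 28]).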
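The behaviour of the Wiener index can be explored numerically. The sketch below computes $W(G_t)$ by breadth-first search and compares it with the closed form $4^t W(G_0) + (e_0 + n_0)(4^t - 3^t)$, and then compares the limiting value $2(W(G_0) + e_0 + n_0)/n_0^2$ with the initial average distance. It is a hedged illustration: the function names, the clone labelling, and the choice of a 20-cycle are assumptions, not material from the paper.

```python
from collections import deque
from fractions import Fraction


def ilt_step(adj):
    """One ILT time-step; the clone of node x in a graph of order n is n + x."""
    n = len(adj)
    new = [set(nbrs) for nbrs in adj] + [set() for _ in range(n)]
    for x in range(n):
        for y in adj[x] | {x}:
            new[n + x].add(y)
            new[y].add(n + x)
    return new


def cycle(n):
    """Cycle on n >= 3 nodes (hypothetical initial graph)."""
    return [{(i - 1) % n, (i + 1) % n} for i in range(n)]


def wiener_index(adj):
    """Sum of distances over unordered pairs of a connected graph."""
    total = 0
    for source in range(len(adj)):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2  # every unordered pair was counted twice


def average_distance(adj):
    n = len(adj)
    return Fraction(2 * wiener_index(adj), n * (n - 1))


g0 = cycle(20)
n0 = len(g0)
e0 = sum(len(nbrs) for nbrs in g0) // 2
w0 = wiener_index(g0)

g = g0
for t in range(4):
    assert wiener_index(g) == 4**t * w0 + (e0 + n0) * (4**t - 3**t)
    g = ilt_step(g)

# Limiting average distance versus the initial one for this starting graph.
ultimate = Fraction(2 * (w0 + e0 + n0), n0 * n0)
assert w0 > (n0 - 1) * (e0 + n0)
assert ultimate < average_distance(g0)
```

For a short cycle such as the 4-cycle the inequality $W(G_0) > (n_0 - 1)(e_0 + n_0)$ fails, so a longer cycle is used here to land in the decreasing regime.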

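Let $N_t(x)$ be the neighbour set of $x$ at time $t$, let $G_t \upharpoonright N_t(x)$ be the subgraph induced by $N_t(x)$ in $G_t$, and let $e(x,t)$ be the number of edges in $G_t \upharpoonright N_t(x)$. For a node $x \in V(G_t)$ with degree at least $2$ define

$$c_t(x) = \frac{e(x,t)}{\binom{\deg_t(x)}{2}}.$$

By convention $c_t(x) = 0$ if the degree of $x$ is at most $1$. The clustering coefficient of $G_t$ is

$$C(G_t) = \frac{1}{n_t} \sum_{x \in V(G_t)} c_t(x).$$

The clustering coefficient of the graph at time $t$ generated by the ILT model is estimated and shown to tend to $0$ more slowly than that of a random graph with the same average degree.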

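Theorem 1.3.

$$\Omega\left(\left(\frac{7}{8}\right)^t t^{-2}\right) = C(G_t) = O\left(\left(\frac{7}{8}\right)^t t^{2}\right).$$

Observe that $C(G_t)$ tends to $0$ as $t \to \infty$. If we let $n_t = n$ (so $t \sim \log_2 n$), then this gives that

$$C(G_t) = n^{\log_2(7/8) + o(1)}.$$

In contrast, for a random graph $G(n,p)$ with comparable average degree

$$pn = \Theta\left((3/2)^{\log_2 n}\right) = \Theta\left(n^{\log_2(3/2)}\right)$$

as $G_t$, the clustering coefficient is $p = \Theta(n^{\log_2(3/4)})$, which tends to zero much faster than $C(G_t)$. (For a discussion of the clustering coefficient of $G(n,p)$, see Chapter 2 of [6].)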

Social networks often organize into separate clusters in which the intra-cluster links are significantly higher than the number of inter-cluster links. In particular, social networks contain communities (characteristic of social organization), where tightly knit groups correspond to the clusters [21]. As a result, social networks possess bad expansion properties realized by small gaps between their first and second eigenvalues [17]. We find that the ILT model has bad expansion properties as indicated by the spectral gap of both its normalized Laplacian and adjacency matrices.

For regular graphs, the eigenvalues of the adjacency matrix are related to several important graph properties, such as in the expander mixing lemma. The normalized Laplacian of a graph, introduced by Chung [10], relates to important graph properties even in the case where the underlying graph is not regular (as is the case in the ILT model). Let $A$ denote the adjacency matrix and $D$ denote the diagonal degree matrix of a graph $G$. Then the normalized Laplacian of $G$ is

$$\mathcal{L} = I - D^{-1/2} A D^{-1/2}.$$

Let $0 = \lambda_0 \leq \lambda_1 \leq \cdots \leq \lambda_{n-1} \leq 2$ denote the eigenvalues of $\mathcal{L}$. The spectral gap of the normalized Laplacian is

$$\lambda = \max\{|\lambda_1 - 1|, |\lambda_{n-1} - 1|\}.$$

Chung, Lu, and Vu [13] observe that, for random power law graphs with some parameters (effectively in the case that for some constant and all integers ), that where is the average degree.

For the graphs $G_t$ generated by the ILT model, we observe that the spectrum behaves quite differently and, in fact, the spectral gap has a constant order. The following theorem suggests a significant spectral difference between graphs generated by the ILT model and random graphs. Define $\lambda(G_t)$ to be the spectral gap of the normalized Laplacian of $G_t$.

Theorem 1.4.

For $t \geq 1$, $\lambda(G_t) > \frac{1}{2}$.

Theorem 1.4 represents a drastic departure from the good expansion found in random graphs, where $\lambda = o(1)$ [10, 13, 19], and from the preferential attachment model [22]. If $G_0$ has bad expansion properties, and has $\lambda_1 < 1/2$ (and thus, $\lambda > 1/2$) then, in fact, this trend of bad expansion continues as shown by the following theorem.

Theorem 1.5.

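Suppose $G_0$ has at least two nodes, and for $t > 0$ let $\lambda_1(t)$ be the second eigenvalue of $G_t$. Then we have that

$$\lambda_1(t) < \lambda_1(0).$$

Note that Theorem 1.5 implies that $\lambda_1(1) < \lambda_1(0)$ and this implies that the sequence $\{\lambda_1(t) : t \geq 0\}$ is strictly decreasing. This follows since $G_{t+1}$ is constructed from $G_t$ in the same manner as $G_1$ is constructed from $G_0$. If $G_0$ is $K_1$, then there is no second eigenvalue, but $G_1$ is $K_2$. Hence, in this case, the theorem implies that $\{\lambda_1(t) : t \geq 1\}$ is strictly decreasing.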

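Let $\rho_0(t) \geq |\rho_1(t)| \geq \cdots$ denote the eigenvalues of the adjacency matrix of $G_t$, ordered by decreasing absolute value. If $A$ is the adjacency matrix of $G_t$, then the adjacency matrix of $G_{t+1}$ is

$$\begin{pmatrix} A & A + I \\ A + I & 0 \end{pmatrix},$$

where $I$ is the identity matrix of order $n_t$. We note the following recurrence for the eigenvalues of the adjacency matrix of $G_{t+1}$.

Theorem 1.6.

If $\rho$ is an eigenvalue of the adjacency matrix of $G_t$, then

$$\frac{\rho \pm \sqrt{\rho^2 + 4(\rho + 1)^2}}{2}$$

are eigenvalues of the adjacency matrix of $G_{t+1}$.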
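The eigenvalue recurrence can be verified directly on a small case. If $v$ is an eigenvector of $A$ with eigenvalue $\rho$, one natural candidate eigenvector of the block matrix with blocks $A$, $A + I$, $A + I$, $0$ is $(v, \beta v)$ with $\beta = (\rho + 1)/\mu$, where $\mu$ solves $\mu^2 - \rho\mu - (\rho + 1)^2 = 0$. The pure-Python sketch below checks this for the 4-cycle; the helper names and the choice of example are assumptions for illustration, and the construction presumes $\mu \neq 0$.

```python
import math


def ilt_adjacency(a):
    """Block matrix [[A, A + I], [A + I, 0]] for a square 0/1 matrix A."""
    n = len(a)
    a_plus_i = [[a[i][j] + (1 if i == j else 0) for j in range(n)]
                for i in range(n)]
    top = [a[i] + a_plus_i[i] for i in range(n)]
    bottom = [a_plus_i[i] + [0] * n for i in range(n)]
    return top + bottom


def matvec(m, v):
    return [sum(r * c for r, c in zip(row, v)) for row in m]


def lifted_eigenpairs(rho, v):
    """Candidate eigenpairs (mu, (v, beta v)) of the block matrix.

    Assumes both roots mu are nonzero so that beta is well defined.
    """
    disc = math.sqrt(rho * rho + 4 * (rho + 1) ** 2)
    pairs = []
    for sign in (1, -1):
        mu = (rho + sign * disc) / 2
        beta = (rho + 1) / mu
        pairs.append((mu, list(v) + [beta * x for x in v]))
    return pairs


def is_eigenpair(m, mu, w, tol=1e-9):
    return all(math.isclose(x, mu * y, abs_tol=tol)
               for x, y in zip(matvec(m, w), w))


# Adjacency matrix of the 4-cycle and two of its eigenpairs.
c4 = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
next_matrix = ilt_adjacency(c4)
for rho, v in [(2, [1, 1, 1, 1]), (-2, [1, -1, 1, -1])]:
    assert is_eigenpair(c4, rho, v)
    for mu, w in lifted_eigenpairs(rho, v):
        assert is_eigenpair(next_matrix, mu, w)
```

The check avoids any numerical eigensolver: it only multiplies the block matrix by the candidate vector and compares with the predicted eigenvalue.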

We leave the reader to check that the eigenvectors of $G_{t+1}$ can be written in terms of the eigenvectors of $G_t$. As in the Laplacian case, we show that there is a small spectral gap of the adjacency matrix.

Theorem 1.7.

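Let $\rho_0(t) \geq |\rho_1(t)| \geq \cdots$ denote the eigenvalues of the adjacency matrix of $G_t$. Then

$$\frac{\rho_0(t)}{|\rho_1(t)|} = \Theta(1).$$

That is, $\rho_0(t) \leq c|\rho_1(t)|$ for some constant $c > 0$. Theorem 1.7 is in contrast to the fact that in $G(n,p)$ random graphs, $|\rho_1| = o(\rho_0)$ (see [10]).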

In a graph $G$, a set $S$ of nodes is a dominating set if every node not in $S$ has a neighbour in $S$. The domination number of $G$, written $\gamma(G)$, is the minimum cardinality of a dominating set in $G$. We use $S_t$ to represent a dominating set in $G_t$, where each node not in $S_t$ is joined to some node of $S_t$. A graph parameter bounded above by the domination number is the so-called cop (or search) number of a graph. In Cops and Robbers, there are two players, a set of $k$ cops (or searchers) $\mathcal{C}$, where $k > 0$ is a fixed integer, and the robber $\mathcal{R}$. The cops begin the game by occupying a set of nodes of a simple, undirected, and finite graph $G$. While the game may be played on a disconnected graph, without loss of generality, assume that $G$ is connected (since the game is played independently on each component and the number of cops required is the sum over all components). The cops and robber move in rounds indexed by non-negative integers. Each round consists of a cop’s move followed by a robber’s move. More than one cop is allowed to occupy a node, and the players may pass; that is, remain on their current nodes. A move in a given round for a cop or the robber consists of a pass or moving to an adjacent node; each cop may move or pass in a round. The players know each other's current locations; that is, the game is played with perfect information. The cops win and the game ends if at least one of the cops can eventually occupy the same node as the robber; otherwise, $\mathcal{R}$ wins. As placing a cop on each node guarantees that the cops win, we may define the cop number, written $c(G)$, which is the minimum cardinality of the set of cops needed to win on $G$. While this node pursuit game played with one cop was introduced in [33, 34], the cop number was first introduced in [3]. For a survey of results on Cops and Robbers, see [20].

We prove that the domination and cop numbers of $G_t$ depend only on the initial graph $G_0$. Theorem 1.8 shows that even as the graph becomes large as $t$ progresses, the same number of nodes needed at time $0$ to dominate the graph will be needed at time $t$.

Theorem 1.8.

For all $t \geq 0$,

$$\gamma(G_t) = \gamma(G_0),$$

and

$$c(G_t) = c(G_0).$$

In Theorem 1.8, we prove that the cop number remains the same for $G_t$. This implies that no matter how large the graph $G_t$ becomes, the robber can be captured by the same number of cops used at time $t = 0$. In terms of OSNs, Theorem 1.8 suggests that users in the network can easily spread and track information (such as gossip) no matter how large the graph becomes.
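The invariance of the domination number can be observed by brute force on small graphs. The sketch below is an illustration only: it enumerates node subsets by increasing size, so it is practical just for graphs with a few dozen nodes, and the helper names, clone labelling, and starting graphs are assumptions rather than anything taken from the paper. The cop number is not computed here.

```python
from itertools import combinations


def ilt_step(adj):
    """One ILT time-step; the clone of node x in a graph of order n is n + x."""
    n = len(adj)
    new = [set(nbrs) for nbrs in adj] + [set() for _ in range(n)]
    for x in range(n):
        for y in adj[x] | {x}:
            new[n + x].add(y)
            new[y].add(n + x)
    return new


def cycle(n):
    """Cycle on n >= 3 nodes (hypothetical initial graph)."""
    return [{(i - 1) % n, (i + 1) % n} for i in range(n)]


def path(n):
    """Path on n >= 1 nodes (hypothetical initial graph)."""
    return [{j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)]


def domination_number(adj):
    """Smallest size of a dominating set, found by exhaustive search."""
    n = len(adj)
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            covered = set(subset)
            for x in subset:
                covered |= adj[x]
            if len(covered) == n:
                return k
    return 0  # only reached for the empty graph


for g0 in (cycle(4), path(7)):
    gamma0 = domination_number(g0)
    g = g0
    for _ in range(2):
        g = ilt_step(g)
        assert domination_number(g) == gamma0
```

The intuition matches the argument one would expect: a dominating set of the old graph also dominates every clone, because each clone is joined to the closed neighbourhood of its parent.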

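An automorphism of a graph $G$ is an isomorphism from $G$ to itself; the set of all automorphisms forms a group under the operation of composition, written $\mathrm{Aut}(G)$. We say that an automorphism $f_t \in \mathrm{Aut}(G_t)$ extends to $f_{t+1} \in \mathrm{Aut}(G_{t+1})$ if

$$f_{t+1} \upharpoonright V(G_t) = f_t;$$

that is, the restriction of the map $f_{t+1}$ to $V(G_t)$ equals $f_t$. We show that symmetries from time $t$ are preserved at time $t+1$. This provides further evidence that the ILT model retains a memory of the initial graph from time $0$.

Theorem 1.9.

For all $t \geq 0$, $\mathrm{Aut}(G_0)$ embeds in $\mathrm{Aut}(G_t)$.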
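One natural way to extend an automorphism $f$ of $G_t$ to $G_{t+1}$ is to send the clone of $x$ to the clone of $f(x)$. The sketch below checks this extension on a small example. It is a hedged illustration: whether this is exactly the extension used in the paper's proof is an assumption, as are the helper names and the clone labelling.

```python
def ilt_step(adj):
    """One ILT time-step; the clone of node x in a graph of order n is n + x."""
    n = len(adj)
    new = [set(nbrs) for nbrs in adj] + [set() for _ in range(n)]
    for x in range(n):
        for y in adj[x] | {x}:
            new[n + x].add(y)
            new[y].add(n + x)
    return new


def cycle(n):
    """Cycle on n >= 3 nodes (hypothetical initial graph)."""
    return [{(i - 1) % n, (i + 1) % n} for i in range(n)]


def is_automorphism(adj, f):
    """f is a list with f[x] the image of x; check it preserves adjacency."""
    n = len(adj)
    if sorted(f) != list(range(n)):
        return False  # not a permutation of the node set
    return all({f[y] for y in adj[x]} == adj[f[x]] for x in range(n))


def extend(f):
    """Extend f to the next time-step by mapping clone(x) to clone(f(x))."""
    n = len(f)
    return list(f) + [n + f[x] for x in range(n)]


g = cycle(4)
f = [1, 2, 3, 0]  # rotation of the 4-cycle
for _ in range(3):
    assert is_automorphism(g, f)
    g, f = ilt_step(g), extend(f)
assert is_automorphism(g, f)
# The extension restricts to the original map on the old nodes.
assert f[:4] == [1, 2, 3, 0]
```

Because distinct automorphisms of $G_0$ extend to maps that already differ on $V(G_0)$, the extension is injective, which is consistent with an embedding of automorphism groups.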

As shown in Theorem 1.1, the ILT model has a fixed densification exponent equalling $\log 3/\log 2$. We consider a randomized version of the model which allows for this exponent to become tuneable. To motivate the model, in OSNs some new users are friends outside of the OSN. Such users immediately seek each other out as they join the OSN and become friends there. The stochastic model ILT($p$) is defined as follows. Define $H_0$ to be $K_1$. A sequence $(H_t : t \geq 0)$ of graphs is generated so that for all $t$, $H_t$ is an induced subgraph of $H_{t+1}$. At time $t+1$, first clone all the nodes of $H_t$ as in the deterministic ILT model. Let $n = 2^t$ be the number of new nodes added at time $t+1$. (Note that $n$ is a function of $t$ and is not a new parameter.) To form $H_{t+1}$, add edges independently between the new nodes with probability $p = p(n)$. Hence, the new nodes form a random graph $G(n,p)$.
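The randomized step can be sketched as a deterministic cloning step followed by independent coin flips among the new nodes. The code below is a simplified illustration under stated assumptions: the edge probability is passed as a fixed number `p` rather than as a function of the number of new nodes, the process starts from a single node, and the names `ilt_p_step` and `is_induced_prefix` are hypothetical.

```python
import random


def ilt_step(adj):
    """One ILT time-step; the clone of node x in a graph of order n is n + x."""
    n = len(adj)
    new = [set(nbrs) for nbrs in adj] + [set() for _ in range(n)]
    for x in range(n):
        for y in adj[x] | {x}:
            new[n + x].add(y)
            new[y].add(n + x)
    return new


def ilt_p_step(adj, p, rng):
    """Clone every node, then join each pair of new nodes with probability p."""
    n = len(adj)
    new = ilt_step(adj)
    for i in range(n, 2 * n):
        for j in range(i + 1, 2 * n):
            if rng.random() < p:
                new[i].add(j)
                new[j].add(i)
    return new


def is_induced_prefix(old, new):
    """Check that old is the subgraph of new induced by nodes 0..len(old)-1."""
    n = len(old)
    return all({y for y in new[x] if y < n} == old[x] for x in range(n))


rng = random.Random(2024)  # fixed seed so the run is reproducible
h = [set()]                # a single node as the starting graph
for _ in range(6):
    nxt = ilt_p_step(h, 0.3, rng)
    assert len(nxt) == 2 * len(h)
    assert is_induced_prefix(h, nxt)
    h = nxt
```

With `p = 0` the step reduces to the deterministic model, and with `p = 1` the new nodes form a clique, which gives two easy boundary checks.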

Several properties of the ILT model are inherited by the ILT() model. For example, as we are adding edges to the graphs generated by the ILT model, the average distance may only decrease, and the clustering coefficient may only increase. The following theorem proves that ILT() generates graphs following a densification power law with exponent where . For a positive integer representing time, we say that an event holds asymptotically almost surely (a.a.s.) if the probability that it holds tends to as tends to infinity.

Theorem 1.10.

Let , and define

(1.3)

Then a.a.s.

Hence, by choosing an appropriate , the densification power law exponent in graphs generated by the ILT() model may achieve any value in the interval . We also prove that for the normalized Laplacian, the ILT() model maintains a large spectral gap.

Theorem 1.11.

A.a.s.

2. Proofs of Results

This section is devoted to the proofs of the theorems outlined in Section 1.

2.1. Proof of Theorem 1.1

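We now consider the number of edges and average degree of $G_t$, and prove the following densification power law for the ILT model. Define the volume of $G_t$ by

$$\mathrm{vol}(G_t) = \sum_{x \in V(G_t)} \deg_t(x) = 2e_t.$$

The proof of Theorem 1.1 follows directly from the following Lemma 2.1, since the average degree of $G_t$ is $\mathrm{vol}(G_t)/n_t$.

Lemma 2.1.

For $t > 0$,

$$\mathrm{vol}(G_t) = 3^t\,\mathrm{vol}(G_0) + 2n_0(3^t - 2^t).$$

In particular,

$$e_t = 3^t(e_0 + n_0) - n_t.$$

Proof.

By (1.1) and (1.2) we have that

$$\mathrm{vol}(G_{t+1}) = \sum_{x \in V(G_t)} \left(2\deg_t(x) + 1\right) + \sum_{x \in V(G_t)} \left(\deg_t(x) + 1\right) = 3\,\mathrm{vol}(G_t) + 2n_t. \tag{2.1}$$

Hence by (2.1) for $t > 0$,

$$\mathrm{vol}(G_t) = 3\,\mathrm{vol}(G_{t-1}) + 2n_{t-1} = 3^t\,\mathrm{vol}(G_0) + 2n_0 \sum_{i=0}^{t-1} 3^i 2^{t-1-i} = 3^t\,\mathrm{vol}(G_0) + 2n_0(3^t - 2^t),$$

where the third equality follows by summing a geometric series. ∎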

2.2. Proof of Theorem 1.2

When computing distances in the ILT model, the following lemma is helpful.

Lemma 2.2.

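Let $x$ and $y$ be distinct nodes in $G_t$ with $t \geq 0$. Then

$$d_{t+1}(x', y) = d_{t+1}(x, y') = d_{t+1}(x, y) = d_t(x, y),$$

and

$$d_{t+1}(x', y') = \begin{cases} d_t(x,y) & \text{if } xy \notin E(G_t), \\ d_t(x,y) + 1 = 2 & \text{if } xy \in E(G_t). \end{cases}$$

Proof.

We prove that $d_{t+1}(x,y) = d_t(x,y)$. The proofs of the other equalities are analogous and so omitted. Since in the ILT model we do not delete any edges, the distance cannot increase after a “cloning” step occurs. Hence, $d_{t+1}(x,y) \leq d_t(x,y)$. Now suppose for a contradiction that there is a path $P'$ connecting $x$ and $y$ in $G_{t+1}$ with length $k < d_t(x,y)$. Hence, $P'$ contains nodes not in $G_t$. Choose such a $P'$ with the least number of nodes, say $s > 0$, not in $G_t$. Let $z'$ be a node of $P'$ not in $G_t$, and let the neighbours of $z'$ in $P'$ be $u$ and $v$. Then $z \in V(G_t)$ is joined to $u$ and $v$. Form the path $Q'$ by replacing $z'$ by $z$. But then $Q'$ has length $k$ and has $s - 1$ many nodes not in $G_t$, which supplies a contradiction. ∎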

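We now turn to the proof of Theorem 1.2. We only prove item (1), noting that items (2) and (3) follow from (1) by computation. We derive a recurrence for $W(G_t)$ as follows. To compute $W(G_{t+1})$, there are five cases to consider: distances within $G_t$, and distances of the forms $d_{t+1}(x, y')$, $d_{t+1}(x', y)$, $d_{t+1}(x, x')$, and $d_{t+1}(x', y')$. The first three cases contribute $3W(G_t)$ by Lemma 2.2. The 4th case contributes $n_t$. The final case contributes $W(G_t) + e_t$ (the term $e_t$ comes from the fact that each edge $xy$ contributes $d_t(x,y) + 1$).

Thus,

$$W(G_{t+1}) = 4W(G_t) + e_t + n_t = 4W(G_t) + 3^t(e_0 + n_0).$$

Hence,

$$W(G_t) = 4^t W(G_0) + \sum_{i=0}^{t-1} 4^i \, 3^{t-1-i}(e_0 + n_0) = 4^t \left( W(G_0) + (e_0 + n_0)\left(1 - \left(\frac{3}{4}\right)^t\right) \right).$$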

Diameters are constant in the ILT model. We record this as a strong indication of the (ultra) small world property in the model.

Lemma 2.3.

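For all graphs $G_0$ different from a clique,

$$\mathrm{diam}(G_t) = \mathrm{diam}(G_0),$$

and $\mathrm{diam}(G_t) = \mathrm{diam}(G_0) + 1$ when $G_0$ is a clique.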

Proof.

This follows directly from Lemma 2.2. ∎

2.3. Proof of Theorem 1.3

We introduce the following dependency structure that will help us classify the degrees of nodes. Given a node $x \in V(G_0)$ we define its descendant tree at time $t$, written $T(x,t)$, to be a rooted binary tree with root $x$, and whose leaves are all of the descendants of $x$ at time $t$. To define the $(k+1)$th row of $T(x,t)$, let $y$ be a node in the $k$th row ($y$ corresponds to a node in $G_k$). Then $y$ has exactly two descendants on row $k+1$: $y$ itself and $y'$. In this way, we may identify the nodes of $G_t$ with a length $t$ binary sequence corresponding to the descendants of $x$, using the convention that a clone is labelled $1$. We refer to such a sequence as the binary sequence for $x$ at time $t$. We need the following technical lemma.

Lemma 2.4.

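Let $S(x,k,t)$ be the nodes of $T(x,t)$ with exactly $k$ many $0$’s in their binary sequence at time $t$. Then for all $y \in S(x,k,t)$,

$$2^k(\deg_0(x) + 1) + t - k - 1 \leq \deg_t(y) \leq 2^k(\deg_0(x) + t - k + 1) - 1.$$

Proof.

The degree $\deg_t(y)$ is minimized when $y$ is identified with the binary sequence beginning with $k$ many $0$’s: $(0, \ldots, 0, 1, \ldots, 1)$. In this case,

$$\deg_t(y) = 2^k(\deg_0(x) + 1) - 1 + (t - k).$$

The degree $\deg_t(y)$ is maximized when $y$ is identified with the sequence with the $k$ many $0$’s at the end of the sequence: $(1, \ldots, 1, 0, \ldots, 0)$. Then

$$\deg_t(y) = 2^k(\deg_0(x) + t - k + 1) - 1.$$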
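The degree bounds can be tested against simulated graphs. Applying (1.1) for each $0$ and (1.2) for each $1$ in the binary sequence gives the lower bound $2^k(\deg_0(x) + 1) + t - k - 1$ and the upper bound $2^k(\deg_0(x) + t - k + 1) - 1$ for a descendant of $x$ with $k$ many $0$’s. The sketch below relies on an assumed labelling (the clone of node `v` in a graph of order `n` is `n + v`), under which the binary sequence of a node can be read off the binary digits of `v // n0`; the helper names are hypothetical.

```python
def ilt_step(adj):
    """One ILT time-step; the clone of node x in a graph of order n is n + x."""
    n = len(adj)
    new = [set(nbrs) for nbrs in adj] + [set() for _ in range(n)]
    for x in range(n):
        for y in adj[x] | {x}:
            new[n + x].add(y)
            new[y].add(n + x)
    return new


def cycle(n):
    """Cycle on n >= 3 nodes (hypothetical initial graph)."""
    return [{(i - 1) % n, (i + 1) % n} for i in range(n)]


def path(n):
    """Path on n >= 1 nodes (hypothetical initial graph)."""
    return [{j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)]


def degree_bounds_hold(g0, t):
    """Check the lower and upper degree bounds for every node of G_t."""
    n0 = len(g0)
    g = g0
    for _ in range(t):
        g = ilt_step(g)
    for v, nbrs in enumerate(g):
        root, bits = v % n0, v // n0
        # Each set bit records a step at which the descendant was a clone.
        k = t - bin(bits).count("1")  # number of 0's in the binary sequence
        d0 = len(g0[root])
        lower = 2**k * (d0 + 1) + t - k - 1
        upper = 2**k * (d0 + t - k + 1) - 1
        if not lower <= len(nbrs) <= upper:
            return False
    return True


assert all(degree_bounds_hold(cycle(4), t) for t in range(6))
assert all(degree_bounds_hold(path(5), t) for t in range(6))
```

The two extreme sequences correspond to doing all the doubling steps first (smallest degree) or last (largest degree), which is why the bounds differ by roughly a factor of $t$.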

It can be shown (using Lemma 2.4) that the number of nodes of degree at least at time denoted by satisfies

Indeed, when a vertex is identified with the binary sequence with many ’s, then the degree is at least . We have such sequences. On the other hand, if the binary sequence has many ’s, then the corresponding vertex has degree smaller than . In particular, for and therefore, the degree distribution of does not follow a power law. Since nodes have degree around the degree distribution has “binomial-type” behaviour. As an example of the degree distribution of a graph generated by the ILT model, see Figure 2.

Figure 2. A log-log plot of the degree distribution for with

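We now prove the following lemma. Recall that $e(x,t)$ is the number of edges in $G_t \upharpoonright N_t(x)$.

Lemma 2.5.

For all $y \in V(G_t)$ with $k$ $0$’s in their binary sequence, we have that

$$\Omega(3^k) = e(y,t) = O(3^k t^2).$$

We note that the constants hidden in the $O(\cdot)$ and $\Omega(\cdot)$ notations (both in the statement of the lemma and in the proof below) do not depend on $k$ nor $t$.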

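Proof of Lemma 2.5. For $x \in V(G_t)$ we have that

$$e(x, t+1) = e(x,t) + \deg_t(x) + \sum_{y \in N_t(x)} \left(1 + \deg_{G_t \upharpoonright N_t(x)}(y)\right) = 3e(x,t) + 2\deg_t(x).$$

For $x'$ we have that

$$e(x', t+1) = e(x,t) + \deg_t(x).$$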
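The two neighbourhood recurrences, $e(x,t+1) = 3e(x,t) + 2\deg_t(x)$ for an existing node and $e(x',t+1) = e(x,t) + \deg_t(x)$ for its clone, can be confirmed by direct counting. The following sketch is illustrative; the helper names and the clone labelling are assumptions, and the starting graph is taken one step into the process so that neighbourhoods already contain edges.

```python
def ilt_step(adj):
    """One ILT time-step; the clone of node x in a graph of order n is n + x."""
    n = len(adj)
    new = [set(nbrs) for nbrs in adj] + [set() for _ in range(n)]
    for x in range(n):
        for y in adj[x] | {x}:
            new[n + x].add(y)
            new[y].add(n + x)
    return new


def cycle(n):
    """Cycle on n >= 3 nodes (hypothetical initial graph)."""
    return [{(i - 1) % n, (i + 1) % n} for i in range(n)]


def neighbourhood_edges(adj, x):
    """Number of edges in the subgraph induced by the neighbours of x."""
    nbrs = adj[x]
    # Each edge inside the neighbourhood is seen from both of its ends.
    return sum(len(adj[y] & nbrs) for y in nbrs) // 2


def recurrences_hold(g):
    """Check both recurrences across one ILT time-step applied to g."""
    nxt = ilt_step(g)
    n = len(g)
    for x in range(n):
        e, d = neighbourhood_edges(g, x), len(g[x])
        if neighbourhood_edges(nxt, x) != 3 * e + 2 * d:
            return False
        if neighbourhood_edges(nxt, n + x) != e + d:
            return False
    return True


g = cycle(4)
for _ in range(4):
    assert recurrences_hold(g)
    g = ilt_step(g)
```

The factor $3$ in the first recurrence is what drives the $3^k$ growth of $e(y,t)$ along the $0$ steps of a binary sequence.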

Since there are many ’s and is always positive for all initial graphs , and the lower bound follows.

For the upper bound, a general binary sequence corresponding to is of the form

with the ’s in positions (). Consider a path in the descendant tree from the root of the tree to node . By Lemma 2.4, the node on the path in the th row () has (at time ) degree .

Hence, the number of edges we estimate is until the th row, increases to in the next row, and increases to in the th row. By induction, we have that

We now prove our result on clustering coefficients.

Proof of Theorem 1.3. For with many ’s in its binary sequence, by Lemmas 2.4 and 2.5 we have that

and

Hence, since we have nodes with many ’s in its binary sequence,

In a similar fashion, it follows that

2.4. Proofs of Theorems 1.4, 1.5, 1.6, and 1.7

We present proofs of the spectral properties of the ILT model. For ease of notation, let

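Proof of Theorem 1.4. We use the expander mixing lemma for the normalized Laplacian (see [10]). For sets of nodes $X$ and $Y$ we use the notation $\mathrm{vol}(X)$ for the volume of $X$ (that is, the sum of the degrees of the nodes in $X$), $\bar{X}$ for the complement of $X$, and $e(X,Y)$ for the number of edges with one end in each of $X$ and $Y$.

Lemma 2.6.

For all sets $X \subseteq V(G)$,

$$\left| e(X,X) - \frac{(\mathrm{vol}(X))^2}{\mathrm{vol}(G)} \right| \leq \lambda \, \frac{\mathrm{vol}(X)\,\mathrm{vol}(\bar{X})}{\mathrm{vol}(G)}.$$

We observe that $G_t$ contains an independent set (that is, a set of nodes with no edges) with volume $\mathrm{vol}(G_{t-1}) + n_{t-1}$. Let $X$ denote this set, that is, the new nodes added at time $t$. Then by (2.1) it follows that

$$\mathrm{vol}(\bar{X}) = \mathrm{vol}(G_t) - \mathrm{vol}(X) = 2\,\mathrm{vol}(G_{t-1}) + n_{t-1}.$$

Since $X$ is independent, Lemma 2.6 implies that

$$\lambda(G_t) \geq \frac{\mathrm{vol}(X)}{\mathrm{vol}(\bar{X})} = \frac{\mathrm{vol}(G_{t-1}) + n_{t-1}}{2\,\mathrm{vol}(G_{t-1}) + n_{t-1}} > \frac{1}{2}.$$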
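The volume computation behind this bound is easy to reproduce: the new nodes at time $t$ have total degree $\mathrm{vol}(G_{t-1}) + n_{t-1}$, the old nodes have total degree $2\,\mathrm{vol}(G_{t-1}) + n_{t-1}$, and the ratio of the two exceeds $1/2$. The sketch below checks these identities on simulated graphs; it does not compute any eigenvalues, and the helper names and clone labelling are assumptions for illustration.

```python
from fractions import Fraction


def ilt_step(adj):
    """One ILT time-step; the clone of node x in a graph of order n is n + x."""
    n = len(adj)
    new = [set(nbrs) for nbrs in adj] + [set() for _ in range(n)]
    for x in range(n):
        for y in adj[x] | {x}:
            new[n + x].add(y)
            new[y].add(n + x)
    return new


def cycle(n):
    """Cycle on n >= 3 nodes (hypothetical initial graph)."""
    return [{(i - 1) % n, (i + 1) % n} for i in range(n)]


def volume(adj, nodes):
    """Sum of the degrees of the given nodes."""
    return sum(len(adj[x]) for x in nodes)


def new_node_volume_ratio(prev):
    """vol(X) / vol(complement of X), where X is the set of fresh clones."""
    n = len(prev)
    g = ilt_step(prev)
    vol_prev = volume(prev, range(n))
    vol_new = volume(g, range(n, 2 * n))
    vol_old = volume(g, range(n))
    # Identities used in the lower bound on the spectral gap.
    assert vol_new == vol_prev + n
    assert vol_old == 2 * vol_prev + n
    return Fraction(vol_new, vol_old)


g = cycle(4)
for _ in range(5):
    assert new_node_volume_ratio(g) > Fraction(1, 2)
    g = ilt_step(g)
```

The ratio approaches $1/2$ from above as the average degree grows, so this argument alone yields a constant lower bound rather than one that improves with $t$.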

Proof of Theorem 1.5. Before we proceed with the proof of Theorem 1.5, we begin by stating some notation and a lemma. For a given node , we let denote the node in that is a descendant of. Given , we define

and for , we set

We use the following lemma, for which the proofs of items (1) and (2) follow from Lemma 2.1. The final item contains a standard form of the Rayleigh quotient characterization of the second eigenvalue; see [10].

Lemma 2.7.


  1. For ,

  2. For

  3. Define

    Then

    (2.2)

Note that in item (3), is a function of Now let be the harmonic eigenvector for so that

and

Furthermore, we choose scaled so that . This is the standard version of the Raleigh quotient for the normalized Laplacian from [10], so such a exists so long as has at least two eigenvalues, which it does by our assumption that . Our strategy in proving the theorem is to show that lifting to provides an effective bound on the second eigenvalue of using the form of the Raleigh quotient given in (2.2).

Define by Then note that

By Lemma 2.7 (1) and (2) it follows that

By Lemma 2.1 and proceeding as above, noting that , we have that

where is the average degree of , and the last inequality follows from the Cauchy-Schwarz inequality.

By we have that