SUBGRAPHS IN PREFERENTIAL ATTACHMENT MODELS

Abstract

We consider subgraph counts in general preferential attachment models with power-law degree exponent $\tau > 2$. For every subgraph $H$, we find the scaling of the expected number of copies of $H$ as a power of the number of vertices. We prove these results by defining an optimization problem that finds the optimal subgraph structure in terms of the indices of the vertices that together span it, and by using the representation of the preferential attachment model as a Pólya urn model.

1 Introduction

The degree distribution of many real-world networks can be approximated by a power-law distribution [faloutsos1999, vazquez2002]. For most networks the degree exponent $\tau$ was found to be between 2 and 3, so that the degree distribution has infinite variance. Another important property of networks is their subgraph counts, also referred to as motif counts. In many real-world networks, several subgraphs were found to appear more frequently than other subgraphs [milo2002]. Which type of subgraph appears most frequently varies between networks, and the most frequently occurring subgraphs are believed to be correlated with the function of the network [milo2004, milo2002, wuchty2003]. The triangle is the most studied subgraph, because it makes it possible to compute the clustering coefficient of the network, which expresses the fraction of connected neighbors of a vertex.

To investigate which subgraphs occur more frequently than expected in a given network, the subgraph count in that network is usually compared to the subgraph count in a random graph null model [gao2017, maugis2017, milo2004, onnela2005]. Several random graph models could potentially serve as null models. In practice, the null model is frequently obtained by randomly switching edges while preserving the degrees. This model, however, is not mathematically tractable for $\tau \in (2,3)$, so that simulations are required to estimate the subgraph count in such networks [marcus2012, wernicke2006].

Several other null models for simple, scale-free networks exist, such as the configuration model [bollobas1980], the rank-1 inhomogeneous random graph [boguna2003, chung2002], the preferential attachment model [albert1999] or hyperbolic random graphs [krioukov2010]. When the degree exponent satisfies $\tau < 3$, the configuration model results in a network with many multiple edges and self-loops [hofstad2009, Chapter 7], so that it is no longer a null model for simple networks. A possible solution is to merge all multiple edges of the configuration model, and consider the erased configuration model instead [britton2006]. This model is mathematically tractable, and subgraph counts for this model were derived in [hofstad2017d].

In this paper, we analyze subgraph counts for a different random graph null model, the preferential attachment model. The preferential attachment model was first introduced by Albert and Barabási [albert1999, albert2002]. In their original work, they described a growing random graph model where a new vertex appears at each time step. The new vertex connects with a fixed number of existing vertices chosen with probability proportional to the degrees. This original Barabási-Albert model has been generalized over the last years, generating the broad class of random graphs called preferential attachment models (PAMs).

The original Barabási-Albert model is known to produce a power-law degree distribution with $\tau = 3$ [albert1999]. Often, a modification is considered, where edges are attached to vertices with probability proportional to the degree plus a constant $\delta$. The constant $\delta$ makes it possible to obtain different values for the power-law exponent $\tau$. For $\delta = 0$, we retrieve the original Barabási-Albert model.

In the present paper we focus on the case where the number $m$ of edges per new vertex is fixed, and our results hold for any value of $\delta > -m$. Taking $\delta \in (-m, 0)$ results in $\tau \in (2,3)$, as observed in many real-world networks. An important difference between the preferential attachment model and most other random graph null models is that edges can be interpreted as directed. Thus, it allows us to study directed subgraphs. This is a major advantage of the PAM over other random graph null models, since most real-world network subgraphs, for example in biological networks, are directed as well [milo2004, shen-orr2002].

1.1 Literature on subgraphs in PAMs

We now briefly summarize existing results on specific subgraph counts in preferential attachment models. The triangle is the most studied subgraph, because it makes it possible to investigate clustering in the preferential attachment model. Bollobás and Riordan [bollobas2003] prove that for any integer-valued function $T(t)$ there exists a PAM with $T(t)$ triangles, where $t$ denotes the number of vertices in the PAM. They further show that the clustering coefficient in the Barabási-Albert model is of order $(\log t)^2/t$, while the expected number of triangles is of order $(\log t)^3$ and, more generally, the expected number of cycles of length $l$ scales as $(\log t)^l$.

Eggmann and Noble [eggmann2011] consider $\delta > 0$, so that $\tau > 3$, and investigate the number of subgraphs for $m = 1$ (so subtrees), and for $m \geq 2$ they study the number of triangles and the clustering coefficient. They observe that the expected number of triangles is of order $\log t$ while the clustering coefficient is of order $\log t / t$, which differs from the results in [bollobas2003]. Our result on general subgraphs for any value of $\delta$ in Theorem 2.2 explains this difference (in particular, we refer to (2.1)).

In a series of papers [prokhorenkova2013, prokhorenkova2016a, prokhorenkova2016] Prokhorenkova et al. proved results on the clustering coefficient and the number of triangles for a broad class of PAMs, assuming general properties on the attachment probabilities. These attachment probabilities are in a form that increases the probability of creating a triangle. They prove that in this setting the number of triangles is of order $t$, while the clustering coefficient behaves differently depending on the exact attachment probabilities.

1.2 Our contribution

For every directed subgraph, we obtain the scaling of the expected number of such subgraphs in the PAM, generalizing the above results on triangles, cycles and subtrees. Furthermore, we identify the most likely degrees of vertices participating in such subgraphs, which shows that subgraphs in the PAM are typically formed between vertices with degrees of a specific order of magnitude. The order of magnitude of these degrees can be found using an optimization problem. For general subgraphs, our results provide the scaling of the expected number of such subgraphs in the network size $t$. For the triangle subgraph, we obtain precise asymptotic results on the subgraph count, which allows us to study clustering in the PAM.

We use the interpretation of the PAM as a Pólya urn graph. This interpretation allows us to view the edges as being present independently, so that we are able to obtain the probability that a subgraph is present on a specific set of vertices.

1.3 Organization of the paper

We first describe the specific PAM we study in Section 1.4. After that, we present our result on the scaling of the number of subgraphs in the PAM and the exact asymptotics of the number of triangles in Section 2. Section 3 then provides an important ingredient for the proof of the scaling of the expected number of subgraphs: a lemma that describes the probability that a specific subgraph is present on a subset of vertices. After that, we prove our main results in Section 4 and Section 5. Finally, Section 7 gives the conclusions and the discussion of our results.

1.4 Model

As mentioned in Section 1, different versions of PAMs exist. Here we define the specific PAM we consider, which is a modification of [berger2014, Model 3]:

Definition 1.1 (Sequential PAM).

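Fix $m \geq 1$, $\delta > -m$. Then $(\mathrm{PA}_t(m,\delta))_{t \geq 1}$ is a sequence of random graphs defined as follows:

  • for $t = 1$, $\mathrm{PA}_1(m,\delta)$ consists of a single vertex with no edges;

  • for $t = 2$, $\mathrm{PA}_2(m,\delta)$ consists of two vertices with $m$ edges between them;

  • for $t \geq 3$, $\mathrm{PA}_t(m,\delta)$ is constructed recursively as follows: conditioning on the graph at time $t-1$, we add a vertex $t$ to the graph, with $m$ new edges. Edges start from vertex $t$ and, for $j = 1, \ldots, m$, they are attached sequentially to vertices $E_{t,1}, \ldots, E_{t,m}$ chosen with the following probability:

    $$\mathbb{P}\big(E_{t,j} = i \mid \mathrm{PA}_{t-1,j-1}(m,\delta)\big) = \begin{cases} \dfrac{D_i(t-1,j-1)+\delta}{2m(t-2)+(j-1)+(t-1)\delta} & \text{if } i < t, \\ 0 & \text{if } i = t. \end{cases} \tag{1.1}$$

In (1.1), $D_i(t-1)$ denotes the degree of $i$ in $\mathrm{PA}_{t-1}(m,\delta)$, while $D_i(t-1,j-1)$ denotes the degree of vertex $i$ after the first $j-1$ edges of vertex $t$ have been attached. Here we assume that $D_i(t-1,0) = D_i(t-1)$.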

To keep notation light, we write $\mathrm{PA}_t$ instead of $\mathrm{PA}_t(m,\delta)$ throughout the rest of the paper. The term $2m(t-2)+(j-1)$ in the denominator of (1.1) describes the total degree of the first $t-1$ vertices when $t-1$ vertices are present and $j-1$ edges of vertex $t$ have been attached. The term $(t-1)\delta$ in the denominator comes from the fact that there are $t-1$ vertices to which an edge can attach. Note that we do not allow for self-loops, but we do allow for multiple edges.
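The recursive construction of Definition 1.1 can be sketched in a few lines of Python. This is a minimal illustration, not an optimized simulator: it assumes that the attachment weight of an older vertex $i$ is $D_i(t-1,j-1)+\delta$, as described above, and the function name `simulate_sequential_pam` and the edge representation are choices made for this example only.

```python
import random


def simulate_sequential_pam(t, m, delta, seed=None):
    """Sketch of the sequential PAM up to time t (t >= 2).

    Returns (edges, degree). Each edge is a tuple (new_vertex, old_vertex, label)
    with label in 1..m; degree[i] is the final degree of vertex i (index 0 unused).
    Assumption: an older vertex i is chosen with weight degree[i] + delta.
    """
    if t < 2 or m < 1 or delta <= -m:
        raise ValueError("need t >= 2, m >= 1 and delta > -m")
    rng = random.Random(seed)
    degree = [0] * (t + 1)
    # Time 2: two vertices joined by m edges.
    edges = [(2, 1, j) for j in range(1, m + 1)]
    degree[1] = degree[2] = m
    for v in range(3, t + 1):
        older = list(range(1, v))
        for j in range(1, m + 1):
            # Every older vertex has degree >= m > -delta, so weights are positive.
            weights = [degree[i] + delta for i in older]
            target = rng.choices(older, weights=weights, k=1)[0]
            edges.append((v, target, j))
            # The degree is updated before the next edge of v is attached.
            degree[target] += 1
        # No self-loops: v only contributes its m outgoing edges to its own degree.
        degree[v] = m
    return edges, degree
```

The quadratic running time is acceptable for small experiments; for large $t$ one would replace the explicit weight list by a more efficient sampling structure.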

The PAM of Definition 1.1 generates a random graph where the asymptotic degree sequence is close to a power law [hofstad2018+, Lemma 4.7], where the degree exponent $\tau$ satisfies

$$\tau = 3 + \frac{\delta}{m}. \tag{1.2}$$

Labeled subgraphs.

As mentioned before, the PAM in Definition 1.1 is a multigraph, i.e., any pair of vertices may be connected by several edges. One could erase multiple edges in order to obtain a simple graph, similarly to [britton2006] for the configuration model. In the PAM in Definition 1.1 there are at most $m$ edges between any pair of vertices, so that the effect of erasing multiple edges is small, unlike in the configuration model. We do not erase edges, so that we may count a subgraph on the same set of vertices multiple times. Not erasing edges has the advantage that we do not modify the law of the graph, so that we can directly use known results on the PAM.

Figure 1: Two labeled triangles.

More precisely, to count the number of subgraphs, we analyze labeled subgraphs, i.e., subgraphs where the edges are specified. In Figure 1 we give the example of two labeled triangles on three vertices , one consisting of edges and the other one of edges . As it turns out, the probability of two labeled subgraphs being defined by the same vertices and different edges is independent of the choice of the edges. For a more precise explanation, we refer to Section 3.1.

Notation.

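We use $\xrightarrow{\mathbb{P}}$ for convergence in probability. We say that a sequence of events $(\mathcal{E}_n)_{n \geq 1}$ happens with high probability (w.h.p.) if $\lim_{n\to\infty} \mathbb{P}(\mathcal{E}_n) = 1$. Furthermore, we write $f(n) = o(g(n))$ if $\lim_{n\to\infty} f(n)/g(n) = 0$, and $f(n) = O(g(n))$ if $|f(n)|/g(n)$ is uniformly bounded, where $(g(n))_{n \geq 1}$ is nonnegative. We say that $X_n = O_{\mathbb{P}}(g(n))$ for a sequence of random variables $(X_n)_{n \geq 1}$ if $|X_n|/g(n)$ is a tight sequence of random variables, and $X_n = o_{\mathbb{P}}(g(n))$ if $X_n/g(n) \xrightarrow{\mathbb{P}} 0$. We further use the notation $[k] = \{1, \ldots, k\}$.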

2 Main results

In this section, we present our results on the number of directed subgraphs in the preferential attachment model. We first define subgraphs in more detail. Let $H = (V_H, E_H)$ be a connected, directed graph. Let $\pi \colon V_H \to \{1, \ldots, |V_H|\}$ be a one-to-one mapping of the vertices of $H$ to $1, \ldots, |V_H|$. In the PAM, vertices arrive one by one. We let $\pi$ correspond to the order in which the vertices in $H$ have appeared in the PAM, that is, $\pi(i) < \pi(j)$ if vertex $i$ was created before vertex $j$. Thus, the pair $(H, \pi)$ is a directed graph, together with a prescription of the order in which the vertices of $H$ have arrived. We call the pair $(H, \pi)$ an ordered subgraph.

In the PAM, every edge starts from a newer vertex and attaches to an older vertex, never the other way around. This puts constraints on the types of subgraphs that can be formed. We call the ordered subgraphs that can be formed in the PAM attainable. The following definition describes all attainable subgraphs:

Definition 2.1 (Attainable subgraphs).

Let $(H, \pi)$ be an ordered subgraph with adjacency matrix $A_\pi(H)$, where the rows and columns of the adjacency matrix are permuted by $\pi$. We say that $(H, \pi)$ is attainable if $A_\pi(H)$ defines a directed acyclic graph, where all out-degrees are less than or equal to $m$.
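As an illustration, attainability can be checked mechanically. The sketch below rests on one reading of the definition, labeled here as an assumption: an ordered subgraph is attainable when every edge points from a vertex to a strictly older vertex under $\pi$ (which rules out directed cycles) and no vertex has out-degree larger than $m$. The function name `is_attainable` and the input format are hypothetical.

```python
from collections import Counter


def is_attainable(edges, pi, m):
    """Check attainability of an ordered subgraph (sketch, see assumptions above).

    edges: list of (source, target) pairs, the edge starts at `source`.
    pi:    dict mapping each vertex of H to its arrival rank (1 = oldest).
    m:     number of edges each new vertex of the PAM sends out.
    """
    vertices = {v for edge in edges for v in edge}
    ranks = sorted(pi.values())
    if set(pi) != vertices or ranks != list(range(1, len(vertices) + 1)):
        raise ValueError("pi must be a bijection from V_H onto 1..|V_H|")
    # Assumption: edges must go from a newer vertex to a strictly older one.
    if any(pi[src] <= pi[dst] for src, dst in edges):
        return False
    out_degree = Counter(src for src, _ in edges)
    return all(d <= m for d in out_degree.values())
```

For example, the triangle with edges `(3, 1)`, `(3, 2)`, `(2, 1)` and the identity ordering passes this check for $m = 2$ but not for $m = 1$, because the youngest vertex needs two outgoing edges.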

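We now investigate how many of these subgraphs are typically present in the PAM. We introduce the optimization problem

$$B(H,\pi) = \max_{s \in \{0,1,\ldots,k\}} \Big[ -s + \sum_{i=s+1}^{k} \beta\big(\pi^{-1}(i)\big) \Big], \qquad \beta(v) = \frac{\tau-2}{\tau-1}\Big(d_H^{(\mathrm{in})}(v) - d_H^{(\mathrm{out})}(v)\Big) - d_H^{(\mathrm{in})}(v), \tag{2.1}$$

where $k$ is the number of vertices of $H$, and $d_H^{(\mathrm{in})}$ and $d_H^{(\mathrm{out})}$ denote respectively the in- and the out-degree in the subgraph $H$. Let $N_t(H,\pi)$ denote the number of times the connected graph $H$ with ordering $\pi$ occurs as a subgraph of a PAM of size $t$. The following theorem studies the scaling of the expected number of directed subgraphs in the PAM, and relates it to the optimization problem (2.1):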

Theorem 2.2.

Let $H$ be a directed subgraph on $k$ vertices with ordering $\pi$ such that $(H,\pi)$ is attainable and there are $r$ different optimizers to (2.1). Then, there exist $0 < C_1 \leq C_2 < \infty$ such that

$$C_1 \leq \lim_{t\to\infty} \frac{\mathbb{E}[N_t(H,\pi)]}{t^{k+B(H,\pi)} \log^{r-1}(t)} \leq C_2. \tag{2.2}$$

Theorem 2.2 gives the asymptotic scaling of the number of subgraphs where the order in which the vertices appeared in the PAM is known. The total number of copies of $H$ for any ordering, $N_t(H)$, can then easily be obtained from Theorem 2.2:

Corollary 2.3.

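Let $H$ be a directed subgraph on $k$ vertices with $\Pi \neq \emptyset$ the set of orderings $\pi$ such that $(H,\pi)$ is attainable. Let

$$B(H) = \max_{\pi \in \Pi} B(H,\pi), \tag{2.3}$$

and let $r^*$ be the largest number of different optimizers to (2.1) among all $\pi \in \Pi$ that maximize (2.3). Then, there exist $0 < C_1 \leq C_2 < \infty$ such that

$$C_1 \leq \lim_{t\to\infty} \frac{\mathbb{E}[N_t(H)]}{t^{k+B(H)} \log^{r^*-1}(t)} \leq C_2. \tag{2.4}$$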

Note that from Corollary 2.3 it is also possible to obtain the undirected number of subgraphs in a PAM, by summing the number of all possible directed subgraphs that create some undirected subgraph when the directions of the edges are removed.

Figure 2 (panels (a)-(d)): Order of magnitude of $N_t(H)$ for all attainable connected directed graphs on 3 vertices and for $2 < \tau < 3$. Vertices with degree proportional to a constant are light pink, vertices with free degrees are bright red, and vertices of degree proportional to $t^{1/(\tau-1)}$ are dark red.

Interpretation of the optimization problem.

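The optimization problem in (2.1) has an intuitive explanation. Assume that $\pi$ is the identity mapping, so that vertex 1 is the oldest vertex of $H$, vertex 2 the second oldest and so on. We show in Section 3.2 that the probability that an attainable subgraph is present on vertices with indices $u_1 < u_2 < \cdots < u_k$ scales as

$$\prod_{i=1}^{k} u_i^{\beta(i)}, \tag{2.5}$$

with $\beta(i)$ as in (2.1). Thus, if $u_i \propto t^{\alpha_i}$ for all $i \in [k]$, for some $\alpha_i \in [0,1]$, then the probability that the subgraph is present scales as $t^{\sum_{i} \alpha_i \beta(i)}$. The number of vertices with index proportional to $t^{\alpha_i}$ scales as $t^{\alpha_i}$. Therefore, heuristically, the number of times subgraph $H$ occurs on vertices with indices proportional to $(t^{\alpha_i})_{i \in [k]}$ such that $\alpha_1 \leq \alpha_2 \leq \cdots \leq \alpha_k$ scales as

$$t^{\sum_{i=1}^{k} (\beta(i)+1)\alpha_i}. \tag{2.6}$$

Because the exponent is linear in $\alpha_i$, the exponent is maximized for $\alpha_i \in \{0,1\}$ for all $i$. Because of the extra constraint $\alpha_1 \leq \alpha_2 \leq \cdots \leq \alpha_k$, which arises from the ordering of the vertices in the PAM, the maximal value of the exponent is $k + B(H,\pi)$. This suggests that the number of subgraphs scales as $t^{k+B(H,\pi)}$.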

Thus, the optimization problem finds the most likely configuration of a subgraph in terms of the indices of the vertices involved. If the optimum is unique, the number of subgraphs is maximized by subgraphs occurring on one set of very specific vertex indices. For example, when the maximum contribution is $\alpha_i = 0$, this means that vertices with constant index, the oldest vertices of the PAM, are most likely to be a member of subgraph $H$ at position $i$. When $\alpha_i = 1$ is the optimal contribution, vertices with index proportional to $t$, the newest vertices, are most likely to be a member of subgraph $H$ at position $i$. When the optimum is not unique, several maximizers contribute equally to the number of subgraphs, which introduces the extra logarithmic factors in (2.2).
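The optimization over $s$ is small enough to evaluate by brute force. The following sketch does so in exact rational arithmetic, under the assumption (made for this illustration) that the per-vertex exponent is $\beta(v) = \frac{\tau-2}{\tau-1}\big(d^{(\mathrm{in})}(v) - d^{(\mathrm{out})}(v)\big) - d^{(\mathrm{in})}(v)$ and that the objective is $-s + \sum_{i>s} \beta(\pi^{-1}(i))$. The names `beta` and `solve_B` are hypothetical.

```python
from fractions import Fraction


def beta(d_in, d_out, tau):
    """Assumed per-vertex exponent: chi * (d_in - d_out) - d_in."""
    chi = (tau - 2) / (tau - 1)
    return chi * (d_in - d_out) - d_in


def solve_B(edges, pi, tau):
    """Return (B, optimizers) for an ordered subgraph.

    edges: list of (source, target) pairs, source being the newer vertex.
    pi:    dict vertex -> arrival rank (1 = oldest).
    tau:   degree exponent, given as an int, Fraction or string such as "5/2".
    """
    tau = Fraction(tau)
    k = len(pi)
    by_rank = {rank: v for v, rank in pi.items()}
    d_in = {v: 0 for v in pi}
    d_out = {v: 0 for v in pi}
    for src, dst in edges:
        d_out[src] += 1
        d_in[dst] += 1
    values = []
    for s in range(k + 1):
        # Vertices with rank <= s have constant index, the others index of order t.
        tail = sum(beta(d_in[by_rank[i]], d_out[by_rank[i]], tau)
                   for i in range(s + 1, k + 1))
        values.append(-s + tail)
    best = max(values)
    return best, [s for s, value in enumerate(values) if value == best]


# Triangle with the identity ordering: edges point from newer to older vertices.
triangle = [(2, 1), (3, 1), (3, 2)]
identity = {1: 1, 2: 2, 3: 3}
```

Under these assumptions, `solve_B(triangle, identity, 3)` has every $s \in \{0,1,2,3\}$ as an optimizer with $k + B = 0$, which matches the $(\log t)^3$ scaling of triangles for $\tau = 3$ quoted in Section 1.1. For $\tau = 5/2$ only $s \in \{1, 2\}$ remain optimal and $k + B = 1/3$.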

Figure 3 (panels (a)-(w); in panels (a)-(d), (l) and (m) the order of magnitude depends on $\tau$): Order of magnitude of $N_t(H)$ for all attainable connected directed graphs on 4 vertices and for $2 < \tau < 3$. Vertices with degree proportional to a constant are light pink, vertices with free degrees are bright red, and vertices of degree proportional to $t^{1/(\tau-1)}$ are dark red. Vertices where the optimizer depends on $\tau$ are gray.

Most likely degrees.

As mentioned above, the optimization problem (2.1) finds the most likely orders of magnitude of the indices of the vertices. When the optimum is unique, the optimum is attained by some vertices of constant index, and some vertices with index proportional to $t$. The vertices of constant index have degrees proportional to $t^{1/(\tau-1)}$ with high probability [hofstad2009], whereas the vertices with index proportional to $t$ have degrees proportional to a constant. When the optimum is not unique, the indices of the vertices may have any range, so that the degrees of these vertices in the optimal subgraph structures range between 1 and $t^{1/(\tau-1)}$. Thus, the optimization problem (2.1) also finds the optimal subgraph structure in terms of its degrees. The most likely degrees of all directed connected subgraphs on 3 and 4 vertices resulting from Corollary 2.3 and the asymptotic number of such subgraphs for $2 < \tau < 3$ are visualized in Figures 2 and 3. For some subgraphs, the optimum of (2.1) is attained by the same $s$, and therefore the same most likely degrees, for all $\tau \in (2,3)$, while for other subgraphs the optimum may change with $\tau$.

One such example is the complete graph of size 4. For the directed complete graph, there is only one attainable ordering satisfying Definition 2.1, so we take the vertices of $H$ to be labeled with this ordering. For $\tau < 5/2$, the optimizer of (2.1) is given by $s = 3$ with optimal value $-3 - 3(\tau-2)/(\tau-1)$, whereas for $\tau > 5/2$ it is given by $s = 4$ and optimal value $-4$. Thus, for $\tau < 5/2$ a complete graph of size four typically contains three hub vertices of degree proportional to $t^{1/(\tau-1)}$ and one vertex of constant degree, and the number of such subgraphs scales as $t^{(5-2\tau)/(\tau-1)}$, whereas for $\tau > 5/2$ the optimal structure contains four hub vertices instead and the number of such subgraphs scales as a constant.

Fluctuations of the number of subgraphs.

In Theorem 2.2 we investigate the expected number of subgraphs, which explains the average number of subgraphs over many PAM realizations. Another interesting question is how the number of subgraphs is distributed within a single PAM realization. In this paper, we mainly focus on the expected value of the number of subgraphs, but here we argue that the limiting distribution of the rescaled number of subgraphs may be quite different for different subgraphs.

In Section 3.2 we show that by viewing the PAM as a Pólya urn graph, we can associate a sequence of independent random variables $(\psi_v)_{v \in [t]}$ to the vertices of the PAM, where $\psi_v$ has a Beta distribution with parameters depending on $m$, $\delta$ and $v$. Once we condition on $\psi_1, \ldots, \psi_t$, the edge statuses of the graph are independent of each other. Furthermore, the degree of a vertex $v$ depends on the index $v$ and $\psi_v$. The higher $\psi_v$ is, the higher $D_v(t)$ is. Thus, we can interpret $\psi_v$ as a hidden weight associated to the vertex $v$.

Using this representation of the PAM we can view the PAM as a random graph model with two sources of randomness: the randomness of the $\psi$-variables, and then the randomness of the independent edge statuses determined by the $\psi$-variables. Therefore, we can define two levels of concentration for the number of ordered subgraphs $N_t(H,\pi)$. Denote by $\mathbb{E}_\psi[N_t(H,\pi)] = \mathbb{E}[N_t(H,\pi) \mid \psi_1, \ldots, \psi_t]$. Furthermore, let $N_{t,\psi}(H,\pi)$ denote the number of ordered subgraphs conditionally on $\psi_1, \ldots, \psi_t$. Then, the ordered subgraph $(H,\pi)$ can be in the following three classes of subgraphs:

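  • Concentrated: $N_{t,\psi}(H,\pi)$ is concentrated around its conditional expectation $\mathbb{E}_\psi[N_t(H,\pi)]$, i.e., as $t \to \infty$,

    $$\frac{N_{t,\psi}(H,\pi)}{\mathbb{E}_\psi[N_t(H,\pi)]} \xrightarrow{\mathbb{P}} 1, \tag{2.7}$$

    and as $t \to \infty$,

    $$\frac{\mathbb{E}_\psi[N_t(H,\pi)]}{\mathbb{E}[N_t(H,\pi)]} \xrightarrow{\mathbb{P}} 1. \tag{2.8}$$
  • Only conditionally concentrated: condition (2.7) holds, and as $t \to \infty$

    $$\frac{\mathbb{E}_\psi[N_t(H,\pi)]}{\mathbb{E}[N_t(H,\pi)]} \xrightarrow{d} X \tag{2.9}$$

    for some random variable $X$.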

  • Non-concentrated: condition (2.7) does not hold.

For example, it is easy to see that the number of subgraphs as shown in Figure 2(d) satisfies , so that it is a subgraph that belongs to the class of concentrated subgraphs. Below we argue that the triangle belongs to the class of only conditionally concentrated subgraphs. The following proposition gives a criterion for the conditional convergence in (2.7):

Proposition 2.4 (Criterion for conditional convergence).

Consider a subgraph $(H,\pi)$ such that $\mathbb{E}[N_t(H,\pi)] \to \infty$ as $t \to \infty$. Denote by $\hat{\mathcal{H}}$ the set of all possible subgraphs composed by two distinct copies of $(H,\pi)$ with at least one edge in common. Then, as $t \to \infty$,

$$\sum_{\hat{H} \in \hat{\mathcal{H}}} \mathbb{E}[N_t(\hat{H})] = o\big(\mathbb{E}[N_t(H,\pi)]^2\big) \quad \Longrightarrow \quad \frac{N_{t,\psi}(H,\pi)}{\mathbb{E}_\psi[N_t(H,\pi)]} \xrightarrow{\mathbb{P}} 1. \tag{2.10}$$

Proposition 2.4 gives a simple criterion for conditional convergence for a subgraph $(H,\pi)$, and it is proved in Section 6. The condition in (2.10) is simple to evaluate in practice. We denote the subgraphs consisting of two overlapping copies of $(H,\pi)$ sharing at least one edge by $\hat{H}_1, \ldots, \hat{H}_r$. To identify the order of magnitude of $\mathbb{E}[N_t(\hat{H}_i)]$, we apply Corollary 2.3 to $\hat{H}_i$ or, in other words, we apply Theorem 2.2 to all possible orderings of $\hat{H}_i$. Once we have the orders of magnitude of $\mathbb{E}[N_t(\hat{H}_i, \hat{\pi})]$ for all orderings $\hat{\pi}$ and for all $i$, it is immediate to see whether the hypothesis of Proposition 2.4 is satisfied.

There are subgraphs where the condition in Proposition 2.4 does not hold. For example, merging two copies of the subgraph of Figure 3(q) as in Figure 4 violates the condition in Proposition 2.4. We show in Section 6 that this subgraph is in the class of non-concentrated subgraphs with probability close to one.

Figure 4: The order of magnitude of this subgraph containing two merged copies of the subgraph of Figure 3(q) is , so that the condition in Proposition 2.4 is not satisfied for the subgraph in Figure 3(q).

2.1 Exact constants: triangles

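Theorem 2.2 allows us to identify the order of magnitude of the expected number of subgraphs in the PAM. In particular, for a subgraph $H$ with ordering $\pi$, it ensures the existence of two constants $0 < C_1 \leq C_2 < \infty$ as in (2.2). A more detailed analysis is necessary to prove a stronger result than Theorem 2.2 of the type

$$\mathbb{E}[N_t(H,\pi)] = C\, t^{k+B(H,\pi)} \log^{r-1}(t)\,(1+o(1)),$$

for some constant $0 < C < \infty$. In other words, given an ordered subgraph $(H,\pi)$, we want to identify the constant $C > 0$ such that

$$\lim_{t\to\infty} \frac{\mathbb{E}[N_t(H,\pi)]}{t^{k+B(H,\pi)} \log^{r-1}(t)} = C. \tag{2.11}$$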

We prove (2.11) for triangles to show the difficulties in the evaluation of the precise constant for general subgraphs. The following theorem provides the detailed scaling of the expected number of triangles:

Theorem 2.5 (Phase transition for the number of triangles).

Let and be parameters for . Denote the number of labeled triangles in by . Then, as ,

  1. if , then

  2. if , then

  3. if , then

Theorem  2.5 in the case coincides with  [bollobas2003, Theorem 14]. For we retrieve the result in  [eggmann2011, Proposition 4.3], noticing that the additive constant in the attachment probabilities in the Móri model considered in  [eggmann2011] coincides with (1.1) for .

The proof of Theorem 2.5 in Section 5 shows that to identify the constant in  (2.11) we need to evaluate the precise expectations involving the attachment probabilities of edges. The equivalent formulation of PAM given in Section 3.1 simplifies the calculations, but it is still necessary to evaluate rather complicated expectations involving products of several terms as in (3.10). For a more detailed discussion, we refer to Remark 5.1.

The distribution of the number of triangles.

Theorem 2.5 shows the behavior of the expected number of triangles. The distribution of the number of triangles across various PAM realizations is another object of interest. We prove the following result for the number of triangles :

Corollary 2.6 (Conditional concentration of triangles).

For , the number of triangles is conditionally concentrated in the sense of (2.7).

Corollary 2.6 is a direct consequence of Proposition 2.4 and the atlas of the orders of magnitude of all possible realizations of the subgraphs consisting of two triangles sharing one or two edges, presented in Figure 5. Figure 6 shows a density approximation of the number of triangles obtained by simulations. These figures suggest that the rescaled number of triangles converges to a random limit, since the width of the density plots does not decrease in $t$. Thus, while the number of triangles concentrates conditionally, it does not seem to converge to a constant when taking the random $\psi$-variables into account. This would put the triangle subgraph in the class of only conditionally concentrated subgraphs. Proving this and identifying the limiting random variable of the number of triangles is an interesting open question.

Figure 5 (panels (a)-(n)): Order of magnitude of $N_t(H)$ for all merged triangles on 4 vertices and for $2 < \tau < 3$. Vertices with degree proportional to a constant are light pink, vertices with free degrees are bright red, and vertices of degree proportional to $t^{1/(\tau-1)}$ are dark red. Vertices where the optimizer depends on $\tau$ are gray.

Figure 6 (panels (a)-(c)): Density approximation of the number of triangles in realizations of the preferential attachment model with and various values of .

3 The probability of a subgraph being present

In this section, we prove the main ingredient for the proof of Theorem 2.2: the probability of a subgraph being present on a given set of vertices. The most difficult part of evaluating the probability of a subgraph being present in $\mathrm{PA}_t$ is that the PAM is constructed recursively. We consider triangles as an example. We write the event that a labeled triangle is present on vertices $u < v < w$ as $\{u \overset{j_1}{\leftarrow} v,\ u \overset{j_2}{\leftarrow} w,\ v \overset{j_3}{\leftarrow} w\}$, where $\{u \overset{j}{\leftarrow} v\}$ denotes the event that the $j$-th edge of vertex $v$ is attached to vertex $u$. Notice that in this way we express precisely which edges we consider in the triangle construction. Then,

$$\mathbb{P}\big(u \overset{j_1}{\leftarrow} v,\ u \overset{j_2}{\leftarrow} w,\ v \overset{j_3}{\leftarrow} w\big) = \mathbb{E}\Big[\mathbb{1}_{\{u \overset{j_1}{\leftarrow} v\}}\, \mathbb{1}_{\{u \overset{j_2}{\leftarrow} w\}}\, \mathbb{1}_{\{v \overset{j_3}{\leftarrow} w\}}\Big]. \tag{3.1}$$

In (3.1), the indicator functions are not independent, so evaluating the expectation on the right-hand side of (3.1) is not easy. A possible solution is to rescale with an appropriate constant to obtain a martingale, and then recursively use the conditional expectation. For a detailed explanation of this, we refer to [Bol01, Szym05] and [hofstad2009, Section 8.3]. This method is hardly tractable due to the complexity of the constants appearing (see Remark 5.1 for a more detailed explanation).

We use a different approach to evaluate the expectation in (3.1), using the interpretation of the PAM as a Pólya urn graph and focusing mainly on the age (the indices) of the vertices, and not on precise constants. We give lower and upper bounds on the probability of having a finite number of edges present in the graph, as formulated in the following lemma:

Lemma 3.1 (Probability of finite set of labeled edges).

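Fix $\ell \in \mathbb{N}$. For vertices $u_1, \ldots, u_\ell$ and $v_1, \ldots, v_\ell$ and edge labels $j_1, \ldots, j_\ell \in [m]$, consider the corresponding finite set of $\ell$ distinct labeled edges $M_\ell(u,v,j) = \{(u_l, v_l, j_l) \colon l \in [\ell]\}$. Assume that the subgraph defined by the set $M_\ell(u,v,j)$ is attainable in the sense of Definition 2.1. Define $\chi = (m+\delta)/(2m+\delta)$. Then:

  1. There exist two constants $0 < c_1 \leq c_2 < \infty$ such that

    $$c_1 \prod_{l=1}^{\ell} u_l^{\chi-1} v_l^{-\chi} \leq \mathbb{P}\big(M_\ell(u,v,j) \subseteq E(\mathrm{PA}_t)\big) \leq c_2 \prod_{l=1}^{\ell} u_l^{\chi-1} v_l^{-\chi}. \tag{3.2}$$
  2. Define the set

    $$J(u,v) = \big\{ (j_1, \ldots, j_\ell) \in [m]^\ell \colon M_\ell(u,v,j) \subseteq E(\mathrm{PA}_t) \big\}. \tag{3.3}$$

    Then, there exist two constants $0 < \hat{c}_1 \leq \hat{c}_2 < \infty$ such that

    $$\hat{c}_1 \prod_{l=1}^{\ell} u_l^{\chi-1} v_l^{-\chi} \leq \mathbb{E}\big[|J(u,v)|\big] \leq \hat{c}_2 \prod_{l=1}^{\ell} u_l^{\chi-1} v_l^{-\chi}. \tag{3.4}$$

Formula (3.2) in the above lemma bounds the probability that a subgraph is present on vertices $u_1, \ldots, u_\ell$ and $v_1, \ldots, v_\ell$ such that the $j_l$-th edge from $v_l$ connects to $u_l$. Notice that (3.2) is independent of the precise edge labels $j_1, \ldots, j_\ell$. To be able to count all subgraphs, and not only subgraphs where the edge labels have been specified, (3.4) bounds the expected number of times a specific subgraph is present on vertices $u_1, \ldots, u_\ell$ and $v_1, \ldots, v_\ell$. This number is given exactly by the number of elements in the set $J(u,v)$ as in (3.3). Note that the expectation in (3.4) may be larger than one, due to the fact that the PAM is a multigraph.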

Lemma 3.1 gives a bound on the probability of presence of distinct edges in the graph as a function of the indices of the endpoints of the edges. Because of the old-get-richer effect in the PAM, the index of a vertex is an indicator of its degree. Lemma 3.1 is a stronger result than [DSvdH, Corollary 2.3], which gives an upper bound of the form in (3.2) only for self-avoiding paths.

The proof of Lemma 3.1 is based on the interpretation of the PAM in Definition 1.1 as an urn experiment as proposed in [berger2014]. We now introduce urn schemes and state the preliminary results we need for the proof of Lemma 3.1, which is given in Section 3.2.
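To see how an edge-wise bound turns into one exponent per vertex index, the following sketch collects the powers of each index. It assumes, for illustration only, that the bound in (3.2) has the product form $\prod_l u_l^{\chi-1} v_l^{-\chi}$ with $\chi = (m+\delta)/(2m+\delta)$, where each labeled edge goes from $v_l$ to the older vertex $u_l$. The function name `index_exponents` is hypothetical, and $\delta$ should be passed as an integer or a `Fraction` to keep the arithmetic exact.

```python
from collections import defaultdict
from fractions import Fraction


def index_exponents(edges, m, delta):
    """Exponent of each vertex index in the assumed product bound.

    edges: list of (u, v) pairs with u < v; the edge starts at v and ends at u.
    Returns a dict vertex -> exponent, so the bound reads prod_w w ** exponent[w].
    """
    chi = Fraction(m + delta) / Fraction(2 * m + delta)
    exponent = defaultdict(Fraction)   # missing vertices start at exponent 0
    for u, v in edges:
        if not u < v:
            raise ValueError("each edge must point to a strictly older vertex")
        exponent[u] += chi - 1         # receiving (older) endpoint
        exponent[v] += -chi            # sending (newer) endpoint
    return dict(exponent)
```

For a triangle on vertices $1 < 2 < 3$ with $m = 2$ and $\delta = 0$, so that $\chi = 1/2$, every vertex receives exponent $-1$. A vertex with in-degree $d^{(\mathrm{in})}$ and out-degree $d^{(\mathrm{out})}$ in the subgraph always collects the exponent $(\chi - 1)\, d^{(\mathrm{in})} - \chi\, d^{(\mathrm{out})}$.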

3.1 Pólya urn graph

An urn scheme consists of an urn with blue balls and red balls. At every time step, we draw a ball from the urn and we replace it by two balls of the same color. We start with $B_0$ blue balls and $R_0$ red balls. We consider two weight functions

$$W_b(k) = a_b + k, \qquad W_r(k) = a_r + k. \tag{3.5}$$

Conditionally on the number of blue balls $B_n$ and red balls $R_n$, at time $n$ the probability of drawing a blue ball is equal to

$$\frac{W_b(B_n)}{W_b(B_n) + W_r(R_n)}.$$

The evolution of the number of balls obeys [hofstad2018+, Theorem 4.2]

$$\mathbb{P}(B_n = B_0 + k) = \mathbb{E}\big[\mathbb{P}(\mathrm{Bin}(n, \psi) = k)\big], \tag{3.6}$$

where $\psi$ has a Beta distribution with parameters $B_0 + a_b$ and $R_0 + a_r$. In other words, the number of blue balls (equivalently, of red balls) drawn is given by a Binomial distribution with a random probability of success $\psi$ (equivalently, $1 - \psi$). Sometimes we call the random variable $\psi$ the intensity or strength of the blue balls in the urn. We can also see the urn process as two different urns, one containing only blue balls and the other only red balls, where we choose an urn proportionally to the number of balls in the urns. In this case, the result is the same, but we can say that $\psi$ is the strength of the blue balls urn and $1 - \psi$ is the strength of the red balls urn.
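A short simulation makes the urn dynamics concrete. The sketch assumes affine weight functions of the form $W_b(k) = a_b + k$ and $W_r(k) = a_r + k$ (an assumption of this example), and the function name `run_urn` is hypothetical. Under the mixed-binomial description above, the expected fraction of blue draws equals the mean of the Beta variable, $(B_0 + a_b)/(B_0 + a_b + R_0 + a_r)$, which can be checked by averaging over many runs.

```python
import random


def run_urn(n, b0, r0, a_b, a_r, rng):
    """Run n draws of the two-color urn and return (blue, red) ball counts.

    Assumed weights: a_b + blue for the blue balls and a_r + red for the red balls;
    both must stay positive.
    """
    blue, red = b0, r0
    for _ in range(n):
        w_blue = a_b + blue
        w_red = a_r + red
        # Draw blue with probability proportional to its weight, then add one
        # ball of the drawn color (the drawn ball is replaced by two balls).
        if rng.random() < w_blue / (w_blue + w_red):
            blue += 1
        else:
            red += 1
    return blue, red


def mean_blue_fraction(n, runs, b0, r0, a_b, a_r, seed=0):
    """Average fraction of blue draws over independent runs of the urn."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        blue, _ = run_urn(n, b0, r0, a_b, a_r, rng)
        total += (blue - b0) / n
    return total / runs
```

With $B_0 = R_0 = 1$, $a_b = 0.5$ and $a_r = 1.5$, the empirical average should be close to $1.5/4 = 0.375$, while individual runs fluctuate strongly because the limiting fraction is itself random.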

The sequential model $\mathrm{PA}_t$ can be interpreted as an experiment with $t$ urns, where the number of balls in each urn represents the degree of a vertex in the graph. First, we introduce a random graph model:

Definition 3.2 (Pólya urn graph).

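Fix $m \geq 1$ and $\delta > -m$. Let $t \in \mathbb{N}$ be the size of the graph. Let $\psi_1 = 1$, and consider $\psi_2, \ldots, \psi_t$ independent random variables, where

$$\psi_k \sim \mathrm{Beta}\big(m + \delta,\ m(2k-3) + (k-1)\delta\big). \tag{3.7}$$

Define

$$\varphi_j = \psi_j \prod_{i=j+1}^{t} (1 - \psi_i), \qquad S_k = \sum_{j=1}^{k} \varphi_j, \qquad I_k = [S_{k-1}, S_k). \tag{3.8}$$

Conditioning on $\psi_1, \ldots, \psi_t$, let $\{U_{k,j}\}_{k = 2, \ldots, t;\ j = 1, \ldots, m}$ be independent random variables, with $U_{k,j}$ uniformly distributed on $[0, S_{k-1}]$. Then, the corresponding Pólya urn graph $\mathrm{PU}_t$ is the graph of size $t$ where, for $u < v$, the number of edges between $u$ and $v$ is equal to the number of variables $U_{v,j}$ in $I_u$, for $j = 1, \ldots, m$ (multiple edges are allowed).

The two sequences of graphs $(\mathrm{PA}_t)_{t \geq 1}$ and $(\mathrm{PU}_t)_{t \geq 1}$ have the same distribution [berger2014, Theorem 2.1], [hofstad2018+, Chapter 4]. The Beta distributions in Definition 3.2 come from the Pólya urn interpretation of the sequential model, using urns with affine weight functions.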
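The construction of the Pólya urn graph can be sketched as follows. The Beta parameters $\big(m+\delta,\ m(2k-3)+(k-1)\delta\big)$, the product form of the positions $S_k = \prod_{h>k}(1-\psi_h)$ and the uniform variables on $[0, S_{k-1}]$ are assumptions of this illustration, chosen to be consistent with the urn interpretation described above. The function name `polya_urn_graph` is hypothetical.

```python
import bisect
import random


def polya_urn_graph(t, m, delta, seed=None):
    """Sketch of a Polya urn graph on vertices 1..t.

    Returns (edges, S), where edges contains tuples (v, u, j) meaning that the
    j-th edge of vertex v attaches to the older vertex u, and S[k] is the
    position of vertex k (S[0] = 0, S[t] = 1).
    """
    if t < 2 or m < 1 or delta <= -m:
        raise ValueError("need t >= 2, m >= 1 and delta > -m")
    rng = random.Random(seed)
    psi = [0.0, 1.0]                     # index 0 unused, psi_1 = 1
    for k in range(2, t + 1):
        # Assumed parameters of the Beta variable attached to vertex k.
        psi.append(rng.betavariate(m + delta, m * (2 * k - 3) + (k - 1) * delta))
    S = [0.0] * (t + 1)
    S[t] = 1.0
    for k in range(t - 1, 0, -1):
        S[k] = S[k + 1] * (1.0 - psi[k + 1])
    edges = []
    for v in range(2, t + 1):
        for j in range(1, m + 1):
            x = rng.uniform(0.0, S[v - 1])
            # Vertex u owns [S[u-1], S[u]); bisect finds the first u with S[u] > x.
            u = bisect.bisect_right(S, x, 0, v)
            u = min(u, v - 1)            # guard for the boundary case x == S[v-1]
            edges.append((v, u, j))
    return edges, S
```

Given the positions, the edges are drawn independently, which is the property exploited in the rest of this section. Comparing degree sequences of this graph with those of a direct simulation of Definition 1.1 is a simple way to test such an implementation.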

The formulation in Definition 3.2 in terms of urn experiments allows us to investigate the presence of subgraphs more easily than the formulation given in Definition 1.1, since the dependent random variables in (3.1) are replaced by a product of independent random variables. We now state two lemmas that are the main ingredients for proving Lemma 3.1:

Lemma 3.3 (Attachment probabilities).

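Consider $\mathrm{PU}_t$ as in Definition 3.2. Then,

  1. for $k \in [t]$,

    $$S_k = \prod_{h=k+1}^{t} (1 - \psi_h); \tag{3.9}$$
  2. conditioning on $\psi_1, \ldots, \psi_t$, the probability that the $j$-th edge of $v$ is attached to $u < v$ is equal to

    $$\mathbb{P}\big(U_{v,j} \in I_u \mid \psi_1, \ldots, \psi_t\big) = \psi_u \frac{S_u}{S_{v-1}}. \tag{3.10}$$

The proof of Lemma 3.3 follows from Definition 3.2, and the fact that $S_k$ as in (3.8) can be written as in (3.9) (see the proof of [berger2014, Theorem 2.1]).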

Before proving Lemma 3.1, we state a second result on the concentration of the positions in the urn graph . In particular, it shows that these positions concentrate around deterministic values:

Lemma 3.4 (Position concentration in ).

Consider a Pólya urn graph as in Definition 3.2. Let . Then, for every there exists such that, for every ,

(3.11)

and, for large enough,

(3.12)

As a consequence, as ,

(3.13)

The proof of Lemma 3.4 is given in [berger2014, Lemma 3.1].

3.2 Proof of Lemma 3.1

We now prove Lemma 3.1, starting with the proof of (3.2). Fix . In the proof, we denote simply by to keep notation light. We use the fact that the Pólya urn graph and have the same distribution and evaluate . We consider distinct labeled edges, so we can use (