Degree Growth in Directed Networks

Degree Growth Rates and Index Estimation in a Directed Preferential Attachment Model

Tiandong Wang and Sidney I. Resnick
Abstract.

Preferential attachment is widely used to model power-law behavior of degree distributions in both directed and undirected networks. In a directed preferential attachment model, despite the well-known marginal power-law degree distributions, not much investigation has been done on the joint behavior of the in- and out-degree growth. Also, statistical estimates of the marginal tail exponent of the power-law degree distribution often use the Hill estimator as one of the key summary statistics, even though no theoretical justification has been given. This paper focuses on convergence of the joint empirical measure for in- and out-degrees and proves the consistency of the Hill estimator. To do this, we first derive the asymptotic behavior of the joint degree sequences by embedding the in- and out-degrees of a fixed node into a pair of switched birth processes with immigration and then establish the convergence of the joint tail empirical measure. From these steps, the consistency of the Hill estimators is obtained.

This work was supported by Army MURI grant W911NF-12-1-0385 to Cornell University.

MSC Classes: 60G70, 60B10, 60G55, 60G57, 05C80, 62E20.
Keywords: Hill estimators, power laws, preferential attachment, birth processes with immigration.

1. Introduction.

The preferential attachment model generates a growing sequence of random graphs based on the assumption that popular nodes with large degrees attract more edges. Nodes and edges are added to the graph following probabilistic rules. Such a mechanism provides a basis for studying the evolution of social networks, collaborator and citation networks, as well as recommender networks, and is applicable to both directed and undirected graphs. Mathematical formulations of the undirected preferential attachment model are available in [2, 7, 22], and those of the directed model can be found in [13, 3]. This paper only considers the directed model where at each stage, a new node is born and either it points to one of the existing nodes or one of the existing nodes attaches to the new node. Degree growth in the undirected case is investigated in [1, 27].

Empirical studies on social network data often reveal that in- and out-degree distributions marginally follow power laws. Theoretically, this is also true for linear preferential attachment models, which makes preferential attachment appealing in network modeling; see [3, 12, 13] for references. Also, the empirical joint degree frequency converges to the probability mass function (pmf) of a pair of limit random variables that are jointly regularly varying (cf. [13, 26, 20, 19]). However, questions related to joint degree growth and index estimation remain unresolved. In this paper, we focus on three main problems:

  1. For a fixed node in a linear preferential attachment graph, what is the joint behavior of in- and out-degree as the graph size grows?

  2. What are the convergence properties of the tail empirical joint measure of in- and out-degrees indexed by node?

  3. When estimating the marginal power-law indices of in- and out-degree, can we use the Hill estimator as a consistent estimator?

What is the justification for interest in Hill estimation of power-law indices for network data? Repositories of large network datasets such as KONECT (http://konect.uni-koblenz.de/, [14]) provide summary statistics for all the archived network datasets and among the summary statistics are estimates of degree indices computed with Hill estimators, despite the fact that evidence for Hill estimator consistency is scant for network data [27].

Another justification comes from robust parameter estimation methods for network models based on extreme value techniques. In [23], we couple the Hill estimation of marginal degree distribution tail indices with a minimum distance threshold selection method introduced in [4] and compare this method with the parametric estimation approaches used in [24]. Hill estimation is more robust against modeling errors and data corruption. Therefore, an affirmative answer to the third question helps justify all of these inference methodologies.

In the directed case, consistency of the two marginal Hill estimators results from resolving the first two questions since, in a similar vein to [27], we consider the Hill estimator as a functional of the marginal tail empirical measure. Convergence of the marginal tail empirical measures then yields consistency of the Hill estimators by a mapping argument.

To answer the first question about degree behavior of fixed nodes as graph size grows, we mimic in- and out-degree growth of a fixed node using pairs of switched birth processes with immigration (SBI processes). The SBI processes use Bernoulli switching between pairs of independent birth processes with immigration (BI processes). We embed the directed network growth model into a sequence of paired SBI processes. Whenever a new node is added to the network, a new pair of SBI processes is initiated. Using convergence results for BI processes (cf. [17, Chapter 5.11], [21, 27]), we give the joint limits of the in- and out-degrees of a fixed node as well as the joint maximal degree growth. Proving the convergence of the tail empirical joint measure in the second question requires showing concentration results for degree counts compared with expected degree counts. With embedding techniques, we prove the limit distribution of the empirical joint degree frequencies in a way that is different from the one used in [20], and then justify the concentration results.

Our paper is structured as follows. In the rest of this section, we review background on the tail empirical measure and Hill estimator. Section 2 sets up the linear preferential attachment model and formulates the power-law phenomena in network degree distributions. Section 3 summarizes facts about BI processes and introduces the SBI process, which is the foundation of the embedding technique. We analyze the joint in- and out-degree growth in Section 4 by embedding it into a sequence of paired SBI processes and derive convergence results of the in- and out-degrees for a fixed node. Results on the convergence of the joint empirical measure are given in Section 5 and the consistency of Hill estimators for both in- and out-degrees is proved in Section 6. Useful concentration results are collected in Section 7.

1.1. Background

Our approach treats the Hill estimator as a functional of the tail empirical measure, so we start with the necessary background and review standard results (cf. [18, Chapter 3.3.5 and 6.1.4]).

1.1.1. Non-standard regular variation.

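Let $M_+([0,\infty]^2\setminus\{\mathbf{0}\})$ be the set of Radon measures on $[0,\infty]^2\setminus\{\mathbf{0}\}$. Then a random vector $(X,Y)\ge \mathbf{0}$ is non-standard regularly varying on $[0,\infty]^2\setminus\{\mathbf{0}\}$ if there exist scaling functions $a_1(t)\uparrow\infty$, $a_2(t)\uparrow\infty$ such that as $t\to\infty$,

\[
t\,P\left[\left(\frac{X}{a_1(t)},\frac{Y}{a_2(t)}\right)\in\cdot\,\right]\stackrel{v}{\longrightarrow}\nu(\cdot), \tag{1.1}
\]

where $\nu(\cdot)$ is called the limit or tail measure [19, 20], and “$\stackrel{v}{\longrightarrow}$” denotes the vague convergence of measures in $M_+([0,\infty]^2\setminus\{\mathbf{0}\})$. The phrasing in (1.1) implies the marginal distributions have regularly varying tails.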

1.1.2. Hill Estimator

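For $x\in(0,\infty]$, define the measure $\epsilon_x(\cdot)$ on Borel subsets $A$ of $(0,\infty]$ by

\[
\epsilon_x(A)=\begin{cases}1, & x\in A,\\ 0, & x\notin A.\end{cases}
\]

Let $M_+((0,\infty])$ be the set of non-negative Radon measures on $(0,\infty]$. A point measure $m$ is an element of $M_+((0,\infty])$ of the form

\[
m=\sum_{i}\epsilon_{x_i}. \tag{1.2}
\]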

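For $\{X_n, n\ge 1\}$ iid and non-negative with common regularly varying distribution tail $\bar F\in RV_{-\alpha}$, $\alpha>0$, there exists a sequence $\{b(n)\}$ satisfying $P[X_1>b(n)]\sim 1/n$, such that for any $k_n\to\infty$, $k_n/n\to 0$,

\[
\frac{1}{k_n}\sum_{i=1}^{n}\epsilon_{X_i/b(n/k_n)}\Rightarrow\nu_\alpha \quad\text{in } M_+((0,\infty]), \tag{1.3}
\]

where the limit measure $\nu_\alpha$ satisfies $\nu_\alpha(y,\infty]=y^{-\alpha}$, $y>0$.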

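Define the Hill estimator $H_{k_n,n}$ based on $k_n$ upper order statistics of $\{X_1,\ldots,X_n\}$ as [10]

\[
H_{k_n,n}:=\frac{1}{k_n}\sum_{i=1}^{k_n}\log\frac{X_{(i)}}{X_{(k_n+1)}}, \tag{1.4}
\]

where $X_{(1)}\ge X_{(2)}\ge\cdots\ge X_{(n)}$ are order statistics of $\{X_i: 1\le i\le n\}$. In the iid case there are many proofs of consistency [15, 16, 9, 6, 5]: For $k_n\to\infty$, $k_n/n\to 0$, we have

\[
H_{k_n,n}\stackrel{P}{\longrightarrow}\frac{1}{\alpha}\quad\text{as } n\to\infty. \tag{1.5}
\]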

The treatment in [18, Theorem 4.2] approaches consistency by showing (1.5) follows from (1.3) and we follow this approach for the network context where the iid case is inapplicable.
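As a small numerical illustration of the iid consistency statement, the following sketch computes the Hill estimator from the upper order statistics of a simulated Pareto sample. The function names, the sample size, the choice $k=500$ and the tail index $\alpha=2$ are illustrative assumptions and are not taken from the paper; the iid Pareto setting is the classical case, not the network setting studied later.

```python
import math
import random


def hill_estimator(data, k):
    """Hill estimator built from the k upper order statistics of `data`."""
    if not 1 <= k < len(data):
        raise ValueError("k must satisfy 1 <= k < len(data)")
    xs = sorted(data, reverse=True)   # X_(1) >= X_(2) >= ... >= X_(n)
    threshold = xs[k]                 # X_(k+1) in 1-based indexing
    if threshold <= 0:
        raise ValueError("the (k+1)-th largest observation must be positive")
    return sum(math.log(x / threshold) for x in xs[:k]) / k


def pareto_sample(n, alpha, rng):
    """Inverse-transform sampling with P[X > x] = x^(-alpha), x >= 1."""
    return [(1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]


rng = random.Random(0)                # fixed seed for reproducibility
alpha = 2.0                           # hypothetical tail index
sample = pareto_sample(20000, alpha, rng)
h = hill_estimator(sample, k=500)
print(f"Hill estimate {h:.3f} versus 1/alpha = {1 / alpha:.3f}")
```

In practice the estimate is sensitive to the choice of $k$, which is why threshold selection methods such as the minimum distance method of [4] are used alongside it.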

1.1.3. Node degrees.

The next section constructs a directed preferential attachment model, and gives behavior of , the in- and out-degrees of node at the th stage of construction. These degrees when scaled by appropriate powers of (see (4.12)) have limits and Theorem 5.4 shows that the degree sequences have a joint tail empirical measure

(1.6)

that converges weakly to some limit measure in , where are appropriate power law scaling functions and is some intermediate sequence such that

It also follows from (1.6) that for some tail indices , , and intermediate sequence ,

(1.7)
(1.8)

This leads to consistency of the Hill estimator for and .

2. Preferential Attachment Models.

2.1. Model setup.

Consider $\{G(n), n\ge 1\}$, a growing sequence of preferential attachment graphs. The graph $G(n)$ consists of $n$ nodes, denoted by $V(n):=\{1,2,\ldots,n\}$, and $n$ directed edges; the set of edges of $G(n)$ consisting of ordered pairs of nodes in $V(n)$ is denoted by $E(n)$. The initial graph $G(1)$ consists of one node, labeled node 1, with a self loop. Thus node 1 has in- and out-degrees both equal to 1. For $n\ge 1$, we obtain a new graph $G(n+1)$ by appending a new node $n+1$ and a new directed edge to the existing graph $G(n)$ according to probabilistic rules described below. For $v\in V(n)$, $(D^{in}_n(v), D^{out}_n(v))$ are the in- and out-degree of node $v$ in $G(n)$. The direction of the new edge in $G(n+1)$ is determined by flipping a 2-sided coin, which has probabilities $\alpha$ and $\gamma$, such that $\alpha+\gamma=1$. Given $G(n)$ and two positive parameters $\delta_{in},\delta_{out}>0$ (not necessarily equal):

  • If the coin comes up heads with probability $\alpha$, direct the new edge from the new node $n+1$ to the existing node $v\in V(n)$ with probability depending on the in-degree of $v$ in $G(n)$:

    \[
    P[\text{choose } v\in V(n)]=\frac{D^{in}_n(v)+\delta_{in}}{(1+\delta_{in})\,n}. \tag{2.1}
    \]
  • If the coin comes up tails with probability $\gamma$, direct the new edge from an existing node $v\in V(n)$ to the new node $n+1$, with probability depending on the out-degree of $v$ in $G(n)$:

    \[
    P[\text{choose } v\in V(n)]=\frac{D^{out}_n(v)+\delta_{out}}{(1+\delta_{out})\,n}. \tag{2.2}
    \]

We refer to the two scenarios as the $\alpha$- and $\gamma$-schemes, respectively.
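The growth rule just described can be sketched in a few lines of simulation code. This is only an illustration under stated assumptions: the attachment weights are taken to be linear, namely in-degree plus $\delta_{in}$ when the new node points to an existing node and out-degree plus $\delta_{out}$ when an existing node points to the new node, the coin probabilities are $\alpha$ and $1-\alpha$, and all function and variable names are hypothetical.

```python
import random


def simulate_directed_pa(n, alpha, delta_in, delta_out, seed=0):
    """Grow a graph to n nodes; return lists of in- and out-degrees.

    Assumes linear attachment weights (degree + delta) and delta > 0.
    """
    rng = random.Random(seed)
    in_deg, out_deg = [1], [1]            # initial node with a self loop
    for m in range(1, n):                 # current graph has m nodes
        if rng.random() < alpha:
            # New node points to an existing node chosen by in-degree.
            weights = [d + delta_in for d in in_deg]
            w = rng.choices(range(m), weights=weights)[0]
            in_deg[w] += 1
            in_deg.append(0)
            out_deg.append(1)
        else:
            # An existing node, chosen by out-degree, points to the new node.
            weights = [d + delta_out for d in out_deg]
            w = rng.choices(range(m), weights=weights)[0]
            out_deg[w] += 1
            in_deg.append(1)
            out_deg.append(0)
    return in_deg, out_deg


in_deg, out_deg = simulate_directed_pa(2000, alpha=0.4, delta_in=1.0, delta_out=1.0)
print(max(in_deg), max(out_deg))
```

Each step adds exactly one node and one edge, so after $n$ steps both degree sequences sum to $n$. The quadratic running time is acceptable for a sketch; a production simulation would use a faster weighted sampler.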

2.1.1. Model construction.

One way to formally construct the model which helps with proofs is by using independent exponential random variables (r.v.’s). Define derived parameters

(2.3)

and for , we will recursively define what corresponds to the in- and out-degree sequences as random elements of ,

(2.4)

with initialization

(2.5)

corresponding to assuming has a single node with a self loop. For , the recursive definition of uses the variables

(2.6)
(2.7)

and relies on competitions from exponential alarm clocks based on , a sequence of iid standard exponential r.v.’s. Assuming has been given, requires and the variables which are independent of (which can be checked recursively) and we define

Conditionally on , use the to create a competition between exponentially distributed alarm clocks. For and , define choice variables

So is the index of the minimum of indicating the winner of the competition. Also, for , define the Bernoulli random variable

and given , we have

(2.8)

This increments the -st pair by if and the -th pair by (0,1) if ; the first case corresponds to an increase of in-degree and the second case to an increase of out-degree. The recursion also assigns to pair either or depending on the case. This construction expresses as a function of and something independent, namely and therefore the process is an -valued Markov chain. Also, because of the initialization (2.5), a simple induction argument applied to (2.8) gives the sum of the components satisfies

(2.9)

Then using (2.3), (2.9) and standard calculations with exponential r.v.’s, we have for ,

P
(2.10)
and likewise
P
(2.11)

These probabilities agree with the attachment probabilities (2.1), (2.2) in - and -schemes, respectively.
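The mechanism behind this agreement is the standard fact that, among independent exponential clocks, the clock with rate $r_i$ rings first with probability $r_i/\sum_j r_j$. The sketch below checks this numerically on a three-node graph. The specific clock rates used here, $c_1(\text{in-degree}+\delta_{in})$ and $c_2(\text{out-degree}+\delta_{out})$ with $c_1=\alpha/(1+\delta_{in})$ and $c_2=\gamma/(1+\delta_{out})$, are an assumption made for illustration, as are all names and parameter values.

```python
import random


def clock_competition(in_deg, out_deg, alpha, delta_in, delta_out, rng):
    """Return (b, v): b = 1 if an in-degree clock rings first, v = winning node (0-based)."""
    gamma = 1.0 - alpha
    c1 = alpha / (1.0 + delta_in)     # assumed form of the derived parameters
    c2 = gamma / (1.0 + delta_out)
    best = None
    for v in range(len(in_deg)):
        # A standard exponential divided by a rate is exponential with that rate.
        t_in = rng.expovariate(1.0) / (c1 * (in_deg[v] + delta_in))
        t_out = rng.expovariate(1.0) / (c2 * (out_deg[v] + delta_out))
        for t, b in ((t_in, 1), (t_out, 0)):
            if best is None or t < best[0]:
                best = (t, b, v)
    return best[1], best[2]


rng = random.Random(1)
in_deg, out_deg = [2, 1, 0], [1, 1, 1]    # a possible graph with 3 nodes and 3 edges
alpha, delta_in, delta_out = 0.3, 1.0, 2.0
trials = 40000
results = [clock_competition(in_deg, out_deg, alpha, delta_in, delta_out, rng)
           for _ in range(trials)]

n = len(in_deg)
freq_in_side = sum(b for b, _ in results) / trials
freq_node0_in = sum(1 for b, v in results if b == 1 and v == 0) / trials
target_node0 = alpha * (in_deg[0] + delta_in) / ((1.0 + delta_in) * n)
print(freq_in_side, alpha, freq_node0_in, target_node0)
```

With these rates the in-degree clocks have total rate $\alpha n$ and the out-degree clocks total rate $\gamma n$, so an in-degree clock wins with probability $\alpha$, and a given node wins on the in-degree side with probability $\alpha$ times its linear attachment probability.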

2.2. Power-law tails.

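Suppose $G(n)$ is a random graph generated by the dynamics above after $n$ steps. Let $N_{ij}(n)$ be the number of nodes in $G(n)$ with in-degree $i$ and out-degree $j$, i.e.

\[
N_{ij}(n):=\sum_{v\in V(n)}\mathbf{1}_{\{(D^{in}_n(v),\,D^{out}_n(v))=(i,j)\}}, \tag{2.12}
\]

then $N^{in}_i(n):=\sum_{j\ge 0}N_{ij}(n)$ and $N^{in}_{>i}(n):=\sum_{k>i}N^{in}_k(n)$ are the number of nodes in $G(n)$ with in-degree equal to $i$ and strictly greater than $i$, respectively. A similar definition also applies to out-degrees: $N^{out}_j(n)$ and $N^{out}_{>j}(n)$.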

It is shown in [3, Theorem 3.2] using concentration inequalities and martingale methods that for as ,

(2.13)

where is a probability mass function (pmf) and [26, 20, 19] show that is jointly regularly varying and so is the associated joint measure. The analytical form of is given in [3], but later in Section 5.1, we give another proof using Section 4’s embedding technique.

From [3, Theorem 3.1], the scaled marginal degree counts and , , also converge:

(2.14)
(2.15)
(2.16)

Both and are pmf’s and the asymptotic form follows from Stirling’s formula:

Let and be the complementary cdf’s and by Scheffé’s lemma as well as [22, Equation (8.4.6)], we have

(2.17)
(2.18)

so again by Stirling’s formula we get from (2.17) and (2.18) that

In other words, the marginal tail distributions of the asymptotic in- and out-degree sequences in a directed linear preferential attachment model are asymptotic to power laws with tail indices and , respectively.

3. Preliminaries: Switched Birth Immigration Processes.

In this section, we introduce a pair of switched birth immigration processes (SBI processes). This lays the foundation for Section 4, where we embed the in- and out-degree sequences of a fixed network node into a pair of SBI processes and derive the asymptotic limit of the degree growth.

3.1. Birth immigration processes.

We start with a brief review of the birth immigration process. A linear birth process with immigration (BI process), $\{BI(t): t\ge 0\}$, having lifetime parameter $\lambda>0$ and immigration parameter $\theta\ge 0$ is a continuous time Markov process with state space $\mathbb{N}=\{0,1,2,\ldots\}$ and transition rate

\[
q_{k,k+1}=\lambda k+\theta,\qquad k\ge 0.
\]

When $\theta=0$ there is no immigration and the BI process becomes a pure birth process; in such cases, the process usually starts from 1.

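For $\theta>0$, the BI process starting from 0 can be constructed from a Poisson process and an independent family of iid linear birth processes [21]. Suppose that $N_\theta(t)$ is the counting function of homogeneous Poisson points $0<\tau_1<\tau_2<\cdots$ with rate $\theta$ and independent of this Poisson process we have independent copies of a linear birth process $\{\zeta_i(t): t\ge 0\}_{i\ge 1}$ with parameter $\lambda>0$ and $\zeta_i(0)=1$ for $i\ge 1$. The BI process is a shot noise process with $BI(0)=0$ and for $t\ge 0$,

\[
BI(t):=\sum_{i=1}^{\infty}\zeta_i(t-\tau_i)\mathbf{1}_{\{t\ge\tau_i\}}=\sum_{i=1}^{N_\theta(t)}\zeta_i(t-\tau_i). \tag{3.1}
\]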

Theorem 3.1 modifies slightly the statement of [21, Theorem 5] summarizing the asymptotic behavior of the BI process. This is also reviewed in [27].

Theorem 3.1.

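For $\{BI(t): t\ge 0\}$ as in (3.1), we have as $t\to\infty$,

\[
e^{-\lambda t}BI(t)\stackrel{a.s.}{\longrightarrow}\sum_{i=1}^{\infty}W_ie^{-\lambda\tau_i}=:\sigma, \tag{3.2}
\]

where $\{W_i: i\ge 1\}$ are independent unit exponential random variables satisfying a.s. for each $i\ge 1$,

\[
W_i=\lim_{t\to\infty}e^{-\lambda t}\zeta_i(t).
\]

The random variable $\sigma$ in (3.2) is a.s. finite and has a Gamma density given by

\[
f_\sigma(x)=\frac{1}{\Gamma(\theta/\lambda)}\,x^{\theta/\lambda-1}e^{-x},\qquad x>0.
\]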

Remark 3.2.

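The form of $\sigma$ in (3.2) and its Gamma density are justified in [21, 27]. For a BI process $\{BI'(t): t\ge 0\}$ with $BI'(0)=1$, modifying the representation in (3.1) gives

\[
BI'(t)=\zeta_0(t)+\sum_{i=1}^{\infty}\zeta_i(t-\tau_i)\mathbf{1}_{\{t\ge\tau_i\}},
\]

where $\zeta_0$ is a further independent copy of the linear birth process with $\zeta_0(0)=1$. Therefore, $e^{-\lambda t}BI'(t)\stackrel{a.s.}{\longrightarrow}\sigma'$ where $\sigma'$ has a Gamma density given by $f_{\sigma'}(x)=x^{\theta/\lambda}e^{-x}/\Gamma(\theta/\lambda+1)$, $x>0$.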
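The scaling in Theorem 3.1 can be checked by simulation. The sketch below assumes the standard linear BI transition rate, a jump from $k$ to $k+1$ at rate $\lambda k+\theta$, and compares the sample mean of $e^{-\lambda t}BI(t)$, started from 0, with the exact value $(\theta/\lambda)(1-e^{-\lambda t})$ obtained from the mean equation $m'(t)=\lambda m(t)+\theta$. This value approaches $\theta/\lambda$, the mean of a Gamma variable with shape $\theta/\lambda$ and unit scale. Function names and parameter values are illustrative assumptions.

```python
import math
import random


def simulate_bi(t_end, lam, theta, start, rng):
    """Simulate a linear birth process with immigration up to time t_end.

    Assumed dynamics: from state k the process jumps to k + 1 at rate lam * k + theta.
    Returns the state at time t_end.
    """
    t, k = 0.0, start
    while True:
        rate = lam * k + theta
        if rate <= 0.0:
            return k                      # no further jumps are possible
        t += rng.expovariate(rate)        # exponential holding time in state k
        if t > t_end:
            return k
        k += 1


rng = random.Random(2)
lam, theta, t_end, reps = 1.0, 2.0, 3.0, 4000
scaled = [math.exp(-lam * t_end) * simulate_bi(t_end, lam, theta, 0, rng)
          for _ in range(reps)]
mean_scaled = sum(scaled) / reps
exact_mean = (theta / lam) * (1.0 - math.exp(-lam * t_end))
print(f"sample mean {mean_scaled:.3f}, exact mean {exact_mean:.3f}")
```

Only the first moment is compared here; checking the full Gamma limit would require a longer time horizon and a distributional test.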

3.2. Switched birth immigration processes.

A switched birth immigration (SBI) process uses a Bernoulli choice variable to choose randomly from two independent BI processes with the same linear transition rates with one starting from at and the other starting from . A pair of SBI processes takes two SBI processes which are linked through the same Bernoulli choice variable.

Process
0 1 1 0
Rate
Table 1. Ingredients for a pair of switched BI processes.

Suppose that is a Bernoulli switching random variable with

and , , , are four independent BI processes (also independent of ) with , and transition rates

See Table 1 for quick reminders. Then we construct a pair of SBI processes using five independent ingredients:

(3.3)

We then consider the convergence of the pair of SBI processes, , as . Write a Gamma random variable with density , and , as . Then from Theorem 3.1, Remark 3.2 and (3.3), we have with , , , being four independent Gamma random variables and , , , , as ,

(3.4)

Also, has joint density

(3.5)

4. Embedding Process.

In order to prove the weak convergence of the sequence of empirical measures in (1.6), we need to embed the in- and out-degree sequences into a process constructed from pairs of SBI processes, as specified in Section 3. The embedding idea was proposed in [1] and has been used in [27] to model two different undirected linear preferential attachment models.

4.1. Embedding.

Here we discuss how to embed the directed network growth model into a process constructed from an infinite sequence of SBI pairs.

4.1.1. Directed network model and SBI processes.

The building blocks of the embedding procedure are an infinite family of independent BI processes

defined on the same probability space and satisfying:

  1. , and for each .

  2. Any process labeled with an is a BI process with transition rates

    and any process labeled with an is a BI process with transition rates

    These hold for when and for .

On , define

and the -algebra so that is strong Markov with respect to Set and define the stopping time with respect to as

(4.1)

Then is the minimum of two independent exponential r.v.’s with means

From (2.3), we have

Let so that . Also, let be index of the -pair that jumps first at which in this case is . However, note that determines which one of and will jump at , and is independent of by the property of independent exponential r.v.’s (cf. [17, Exercise 4.45(a)]). In addition, we also have , that is, measurable with respect to .

Now use the independent quantities to define a pair of SBI processes as in (3.3). Let and

Define the -algebra

so that is strong Markov with respect to . Also, let