Extreme Value Estimation

Are Extreme Value Estimation Methods Useful for Network Data?

Phyllis Wan Phyllis Wan
Department of Statistics
1255 Amsterdam Avenue, MC 4690
Columbia University
New York, NY 10027
Tiandong Wang Tiandong Wang
School of Operations Research and Information Engineering
Cornell University
Ithaca, NY 14853
Richard A. Davis Prof. Richard A. Davis
Department of Statistics
1255 Amsterdam Avenue, MC 4690
Room 1004 SSW
Columbia University
New York, NY 10027
 and  Sidney I. Resnick Prof. Sidney I. Resnick
School of Operations Research and Information Engineering
Cornell University
Ithaca, NY 14853

Preferential attachment is an appealing edge generating mechanism for modeling social networks. It provides both an intuitive description of network growth and an explanation for the observed power laws in degree distributions. However, there are often limitations in fitting parametric network models to data due to the complex nature of real-world networks. In this paper, we consider a semi-parametric estimation approach by looking at only the nodes with large in- or out-degrees of the network. This method examines the tail behavior of both the marginal and joint degree distributions and is based on extreme value theory. We compare it with the existing parametric approaches and demonstrate how it can provide more robust estimates of parameters associated with the network when the data are corrupted or when the model is misspecified.

Key words and phrases:
power laws, multivariate heavy-tailed statistics, preferential attachment, regular variation, estimation
Research of the four authors was partially supported by Army MURI grant W911NF-12-1-0385 to Cornell University.

1. Introduction

Empirical studies [18] suggest that the distribution of in- and out-degrees of the nodes of many social networks have Pareto-like tails. The indices of these distributions control the likelihood of nodes with large degrees appearing in the data. Some social network models, such as preferential attachment, theoretically exhibit these heavy-tailed characteristics. This paper estimates heavy tail parameters using semi-parametric extreme value (EV) methods and compares such EV estimates with model-based likelihood methods. The EV estimates only rely on the upper tail of the degree distributions so one might expect these estimates to be robust against model error or data corruption.

Preferential attachment (PA) describes the growth of a network where edges and nodes are added over time based on probabilistic rules that assume existing nodes with large degrees attract more edges. This property is attractive for modeling social networks due to intuitive appeal and ability to produce power-law networks with degrees matched to data [9, 23, 16, 17, 3]. Elementary descriptions of the preferential attachment model can be found in [10] while more mathematical treatments are available in [9, 23, 1]. Also see [15] for a statistical survey of methods for network data and [11] for inference for an undirected model.

The linear preferential attachment model has received most attention. Marginal degree power laws were established in [16, 17, 3], while joint power-law behavior, also know n as joint regular variation, was proved in [22, 21, 26] for the directed linear PA model. Given observed network data, [25] proposed parametric inference procedures for the model in two data scenarios. For the case where the history of network growth is available, the MLE estimators of model parameters were derived and shown to be strongly consistent, asymptotically normal and efficient. For the case where only a snapshot of the network is available at a single time point, the estimators based on moment methods as well as an approximation to the likelihood were shown to be strongly consistent. The loss of efficiency relative to full MLE was surprisingly mild.

The drawback of these two methods is that they are model-based and sensitive to model error. To overcome this lack of robustness, this paper describes an EV inference method applied to a single snapshot of a network and where possible, compares the EV method to model-based MLE methods. The EV method is based on estimates of in- and out-degree tail indices, and , using a combination of the Hill estimator [13, 20] coupled with a minimum distance thereshold selection method [5]. We also describe estimation of model parameters using the joint tail distribution of in- and out-degrees relying on the asymptotic angular measure [20, page 173] density obtained after standardizing [20, page 203] the data.

If the data are generated by the linear PA model, the EV estimators can be applied to estimate the parameters of the model and compared with MLE estimates and not surprisingly, the EV estimates exhibit larger variance. However, if there is model error or data corruption, the EV estimates more than hold their own and we illustrate the comparison in two ways:

  • The data is corrupted; linear PA data ha ve edges randomly deleted or added. The EV approach reliably recovers the original preferential attachment parameters while parametric methods degrade considerably.

  • The data comes from a misspecified model, namely a directed edge superstar model [2] but is analyzed as if it comes from the linear PA model. The EV method gives good estimates for superstar model tail indices and outperforms MLE based on a misspecified linear PA model if the probability of attaching to the superstar is significant.

The rest of the paper is structured as follows. Section 2 formulates the power-law phenomena in network degree distributions along with joint dependency in the in- and out- degrees. We describe two network models which exhibit such heavy tail properties, the linear PA and the superstar linear PA models. The EV inference method for networks is described in Section 3 where we discuss its use for estimating the parameters of the linear PA model. Section 4 gives EV estimation results for simulated data from the linear PA model. Since the generating model is correctly specified, we use the previous parametric methods as benchmarks for comparison in Section 4.1. Section 4.2 analyzes network data generated from the linear PA model but corrupted by random edge addition or deletion. Pretending ignorance of the perturbation, we compare the performance of the extreme value method with the MLE and snapshot methods to recover the original model. In Section 4.3, we use our EV inference approach on data from the directed superstar model and attempt to to recover the tail properties of the degree distributions. A concluding Section 5 summarizes the discussion and reasons why EV methods have their place. Appendices give proofs and a fuller discussion of MLE and the snapshot method for linear PA models abstracted from [25].

2. Networks and Heavy-Tailed Degree Distributions

2.1. General discussion.

We begin with a general discussion of power laws and networks. Let denote a directed network, where is the set of nodes, is the set of edges, and is the number of edges. Let denote the number of nodes in and be the number of nodes with in-degree and out-degree . The marginal counts of nodes with in-degree and out-degree are given by

respectively. For many network data sets, log-log plots of the in- and out-degree distributions, i.e., plots of vs.  and vs. , appear to be linear and generative models of network growth seek to reflect this. Consider models such that the empirical degree frequency converges almost surely,


where is a bivariate probability mass function (pmf). The network exhibits power-law behavior if


for some positive constants . Let be a fictitious random vector with joint pmf , then

In the linear PA model, the joint distribution of satisfies non-standard regular variation. Let be the set of Borel measures on that are finite on sets bounded away from the origin. Then is non-standard regularly varying on means that as ,


where is called the limit or tail measure [19, 7, 14]. Using the power transformation with , the vector becomes standard regularly varying, i.e.,


where with . With this standardization, the transformed measure is directly estimable from data [20].

In the following we describe two classes of preferential attachment models that generate networks with power-law degree distributions.

2.2. The linear preferential attachment (linear PA) model.

The directed linear PA model [3, 17] constructs a growing sequence of directed random graphs ’s whose dynamics depend on five nonnegative parameters , and , where and . To avoid degenerate situations, assume that each of the numbers is strictly smaller than 1.

We start with an arbitrary initial finite directed graph with at least one node and edges. Given an existing graph , a new graph is obtained by adding a single edge to so that the graph contains edges for all . Let and denote the in- and out-degree of in , that is, the number of edges pointing into and out of , respectively. We allow three scenarios of edge creation, which are activated by flipping a 3-sided coin with probabilities and . More formally, let be an iid sequence of trinomial random variables with cells labelled and cell probabilities . Then the graph is obtained from as follows.

  • If (with probability ), append to a new node and an edge leading from to an existing node . Choose the existing node with probability depending on its in-degree in :

  • If (with probability ), add a directed edge to with and and the existing nodes are chosen independently from the nodes of with probabilities

  • If (with probability ), append to a new node and an edge leading from the existing node to the new node . Choose the existing node with probability


For convenience we call these scenarios the -, - and -schemes. Note that this construction allows for the possibility of multiple edges between two nodes and self loops. This linear preferential attachment model can be simulated efficiently using the method described in [25, Algorithm 1] and linked to http://www.orie.cornell.edu/orie/research/groups/multheavytail/software.cfm.

It is shown in [22, 21, 26] that the empirical degree distribution

and the marginals satisfy (2.2) and (2.3), where the tail indices are


Furthermore, the joint regular variation condition (2.5) is satisfied by the limit degree distribution and the limit measure [22] or its density [26] can be explicitly derived. We shall use this property for parameter estimation in Section 3.

2.3. The superstar linear Pa model.

The key feature of the superstar linear PA model that distinguishes it from the standard linear PA model is the existence of a superstar node, to which a large proportion of nodes attach. A new parameter represents the attachment probability. The -, - and -schemes of the linear PA model are still in action. However, for the - and -schemes, an outgoing edge will attach to the superstar node with probability , while with probability it will attach to a non-superstar node according to the original linear PA rules.

For simplicity, the network is initialized with two nodes where node is the superstar node. We assume at the first step, there is an edge pointing from so . Again each graph contains edges for all . Let

so that is the set of edges in that do not point to the superstar. Let and denote the number of nodes and edges in the non-superstar subgraph of , respectively.

The model is specified through the parameter set . Let be another iid sequence of Bernoulli random variables where

The Markovian graph evolution from to is modified from the linear PA model as follows.

  • If (with probability ), append to a new node and an edge leading from to an existing node .

    • If (with probability ), , the superstar node;

    • If (with probability ), is chosen according to the linear PA rule (2.6) applied to .

  • If (with probability ), add a directed edge to where

    • If (with probability ), and is chosen with probability (2.6) applied to ;

    • If (with probability ), are chosen with probability (2.7) applied to .

  • If (with probability ), append to a new node and an edge leading from the existing node to , where is chosen with probability (2.8) applied to .

If we use and to denote the number of non-superstar nodes that have in-degree and out-degree , respectively, then Theorem 2.1 shows that almost surely where the limits are deterministic constants that decay like power laws.

Theorem 2.1.

Let be the in- and out-degree counts of the non-superstar nodes of the superstar model. There exists constants and such that as ,


  1. As ,


    where is a positive constant and

  2. As ,


    where is a positive constant and


The proof of Theorem 2.1 is provided in Appendix B.

3. Estimation Using Extreme Value Theory

In this section, we consider network parameter estimation using extreme value theory. Given a graph at a fixed timestamp, the data available for estimates are the in- and out-degrees for each node denoted by , . Let be the empirical distribution of this data on . Then from (2.1), almost surely converges weakly to a limit distribution on which is the measure corresponding to the mass function . Let be the Dirac measure concentrating on and we have from (2.1),


3.1. Estimating tail indices; Hill estimation.

We review tail index estimation of ( is similar) using the Hill estimator [13, 20] applied to in-degree data , . From (2.2), the marginal of , called is regularly varying with index . From Karamata’s theorem can be expressed as a function of [8, page 69],


The Hill estimator of replaces with the marginal of the empirical distribution in (3.1) of in-degrees, called , and with in (3.2). Let be the decreasing order statistics of , . The resulting estimator is

With iid data, if we assume and , then the Hill estimator is consistent. Of course, our network data is not iid but Hill estimation still works in practice. Consistency for an undirected graph is proven in [27] but for directed graphs, this is an unresolved issue.

To select in practice, [5] proposed computing the Kolmogorov-Smirnov (KS) distance between the empirical distribution of the upper observations and the power-law distribution with index :

Then the optimal is the one that minimizes the KS distance

and the tail index is estimated by . This estimator performs well if the thresholded portion comes from a Pareto tail and also seems effective in a variety of non-iid scenarios. It is widely used by data repositories of large network datasets such as KONECT (http://konect.uni-koblenz.de/) [18] and is realized in the R-package poweRlaw [12].

We refer to the above procedure as the minimum distance method in estimating for network data. There are two issues when applying this method. First, the data is node-based and not collected from independent repeated sampling. Secondly, degree counts are discrete and do not exactly comply with the Pareto assumption made in the minimum distance method. Our analysis shows that even if we ignore these two issues, the tail estimates are still reasonably good.

3.2. Estimating dependency between in- and out-degrees

If the limiting random vector corresponding to in (2.1) is jointly regularly varying and satisfies (2.5), we may apply a polar coordinate transformation, for example, with the -norm,

where . Then, with respect to in (3.1), the conditional distribution of given converges weakly (see, for example, [20, p. 173]),

where is the angular measure and describes the asymptotic dependence of the standardized pair . Since for large , and for large , , it is plausible that for and large . Skeptics may check [20, p. 307] for a more precise argument and recall is the empirical measure defined in (3.1).

Based on observed degrees , how does this work in practice? First is replaced by estimated from Section 3.1. Then the distribution is estimated via the empirical distribution of the sample angles for which exceeds some large threshold . This is the POT (Peaks Over Threshold) methodology commonly employed in extreme value theory [6].

In the cases where the network model is known, may be specified in closed form. For the linear PA model, has a density that is an explicit function of the linear PA parameters [22]. After estimating and by the minimum distance method, the remaining parameters can then be estimated by an approximate likelihood method that we now explain.

3.3. EV estimation for the linear PA model

From (2.9),

so that the linear PA model may be parameterized by . To construct the EV estimates, begin by computing the minimum distance estimates of the in- and out-degree indices. The parameter , which represents the proportion of edges connected between existing nodes, is estimated by .

From (2.5), given converges weakly as to the distribution of a random variable [22, Section 4.1.2], whose pdf is given by ()


By replacing with their estimated values , , and and setting the density (3.3) can be viewed as a profile likelihood function (based on a single observation ) of the unknown parameter , which we denote by

Given the degrees , can be computed by maximizing the profile likelihood based on the observations for which for a large threshold . That is,


where is typically chosen as the -th largest ’s for a suitable . This estimation procedure is sometimes referred to as the “independence estimating equations” (IEEs) method [4, 24], in which the dependence between observations is ignored. This technique is often used when the joint distribution of the data is unknown or intractable. Finally, using the constraint, , we estimate by .

4. Estimation results

In this section, we demonstrate the estimation of the linear PA and related models through the EV method described in Section 3.3. In Section 4.1, data are simulated from the standard linear PA model and used to estimate the true parameters of the underlying model. Section 4.2 considers data generated from the linear PA model but corrupted by random addition or deletion of edges. Our goal is to estimate the parameters of the original linear PA model. In Section 4.3, we simulate data from the superstar linear PA model and attempt to use the standard linear PA estimation to recover the degree distributions.

Throughout the section, the EV method is compared with two parametric estimation approaches for the linear PA model, namely the MLE and snapshot (SN) methods, proposed in [25]. For a given network, when the network history is available, that is, each edge is marked with the timestamp of its creation, MLE estimates are directly computable. In the case where only a snapshot of the network is given at a single point in time (i.e., the timestamp information for the creation of the edges is unavailable), we have an estimation procedure combining elements of method of moments with an approximation to the likelihood. A brief summary of the MLE and SN methods is in Appendix A and desirable properties of these estimators are in [25].

Note that a main difference between the MLE, SN and EV methods lies in the amount of data utilized. The MLE approach requires the entire growth history of the network while the SN method uses only a single snapshot of the network. The EV method, on the other hand, requires only a subset of a snapshot of the network; only those degree counts of nodes with large in- or out-degrees. When the underlying model is true, MLE is certainly the most efficient, but also hinges on having a complete data set. As we shall see, in the case where the model is misspecified, the EV method provides an attractive and reliable alternative.

4.1. Estimation for the linear PA model

4.1.1. Comparison of EV with MLE and SN

Figure 4.1. Boxplots of biases for estimates of using EV, MLE and SN methods. Panels (a)–(c) correspond to the case where and (d)–(f) are for , holding constant.

Figure 4.1 presents biases for estimates of using EV, MLE, and SN methods on data simulated from the linear PA model.

We held constant and varied so that the true values of were also varying. For each set of parameter values , 200 independent replications of a linear PA network with edges were simulated and the true values of were computed by (2.9). We estimated by the minimum distance method , MLE and the one-snapshot methods applied to the parametric model (cf. Section A), denoted by and , respectively. With , is calculated by (3.4) using .

As seen here, for simulated data from a known model, MLE outperforms other estimation procedures. The EV procedure tends to have much larger variance than both MLE and SN with slightly more bias. This is not surprising as the performance of the EV estimators is dependent on the quality of the following approximations:

  1. The number of edges in the network, , should be sufficiently large to ensure a close approximation of to the limit joint pmf .

  2. The choice of thresholds must guarantee the quality of the EV estimates for the indices and the limiting angular distribution. The thresholding means estimates are based on only a small fraction of the data and hence have large uncertainty.

  3. The parameter used to transform the in- and out-degrees to standard regular variation is estimated and thus subject to estimation error which propagates throughout the remaining estimation procedures.

4.1.2. Sensitivity analysis.

We explore how sensitive EV estimates are to choice of , the threshold for the approximation to the limiting angular density in (3.4). Equivalently, we consider varying , the number of tail observations included in the estimation.

For the sensitivity analysis, 50 linear PA networks with edges and parameter set

or equivalently,

are generated. We use to calculate the EV estimates for . The performances of across different value s of are demonstrated by the blue boxplots in Figure 4.2(a).

Figure 4.2. (a) Boxplots of biases of and for different and over 50 replications, where . (b) Linearly interpolated trajectories of biases of and from 10 randomly picked realizations.

We see that the biases of remain small until increases to 300, and for larger values of , considerably underestimates .

We note that the angular components , are also power-lawed. As an attempt to select the optimal value of , we apply the minimum distance method to the ’s and use the selected threshold, , as the truncation threshold. The boxplot of for the 50 simulated networks are represented by the horizontal boxplot in Figure 4.2(a). The EV estimator with respect to this threshold for each simulation, denoted by , is shown by the red boxplot and plotted at , the mean of . Overall, varies between 300 and 1500 and results in an underestimated .

In Figure 4.2(b), we randomly choose 10 realizations (among the 50 replications) and plot the linearly interpolated trajectories of , based on different values of . Black points are the estimation results using fixed thresholds and red ones are determined by using the minimum distance method. Black and red points denoted by the same symbol belong to the same realization. Comparison among estimation results for different values of reveals that choosing a fixed threshold outperforms selecting a using the minimum distance method, as it produces estimates with smaller biases and variances.

4.2. Data corrupted by random edge addition/deletion.

PA models are designed to describe human interaction in social networks but what if data collected from a network is corrupted or usual behavior is changed? Corruption could be due to collection error and atypical behavior could result from users hiding their network presence or trolls acting as provocateurs. In such circumstances, the task is to unmask data corruption or atypical behavior and recover the parameters associated with the original preferential attachment rules.

In the following, we consider network data that are generated from the linear PA model but corrupted by random addition or deletion of edges. For such corrupted data, we attempt to recover the original model and compare the performances of MLE, SN, and EV methods.

4.2.1. Randomly adding edges.

We consider a network generating algorithm with linear PA rules but also a possibility of adding random edges. Let denote the graph at time . We assume that the edge set can be decomposed into two disjoint subsets: , where is the set of edges resulting from PA rules, and is the set of those resulting from random attachments. This can be viewed as an interpolation of the PA network and the Erdös-Rényi random graph.

More specifically, consider the following network growth. Given , is formed by creating a new edge where:

  1. With probability , two nodes are chosen randomly (allowing repetition) from and an edge is created connecting them. The possibility of a self loop is allowed.

  2. With probability , a new edge is created according to the preferential attachment scheme on .

The question of interest is, if we are unaware of the perturbation effect and pretend the data from this model are coming from the linear PA model, can we recover the PA parameters? To investigate, we generate networks of edges with parameter values

For each network, the original PA model is fitted using the MLE, SN and EV methods, respectively. The angular MLE in (3.4) in the extreme value estimation is performed based on tail observations. In order to compare these estimators, we repeat the experiment 200 times for each value of and obtain 200 sets of estimated parameters for each method. Figure 4.3 summarizes the estimated values for for different values of . The mean estimates are marked by crosses and the and empirical quantiles are marked by the bars. The true value of parameters are shown as the horizontal lines.

While all parameters deviate from the true value as increases and the network becomes more “noisy”, the EV estimates for exhibit smaller bias than the MLE and SN methods (Figure 4.3 (a) and (b)). All three methods give underestimated probabilities (Figure 4.3 (c) and (d)). This is because the perturbation step (1) creates more edges between existing nodes and consequently inflates the estimated value of .

Also note that the mean EV estimates of stay close to the theoretical values for all choices of ; see Figure 4.3 (e) and (f). The MLE and SN estimates of , which are computed from the corresponding estimates for , show strong bias as increases. In this case, the EV method is robust for estimating the PA parameters and recovering the tail indices from the original model.

Figure 4.3. Mean estimates and and empirical quantiles of (a) ; (b) ; (c) ; (d) ; (e) ; (f) , using MLE (black), SN (red) and EV (blue) methods over 200 replications, where and . For the EV method, 500 tail observations were used to obtain .

4.2.2. Randomly deleting edges.

We now consider the scenario where a network is generated from the linear PA model, but a random proportion of edges are deleted at the final time. We do this by generating and then deleting edges by sampling without replacement. For the simulation, we generated networks with parameter values

Again, for each value of , the experiment is repeated 200 times and the resulting parameter plots are shown in Figure 4.4 using the same format as for Figure 4.3. For the EV method, 100 tail observations were used to compute an .

Surprisingly, for all six parameters considered, MLE estimates stay almost unchanged for different values of while SN and EV estimates underestimate and overestimate , with increasing magnitudes of biases as increases. For tail estimates, the minimum distance method still gives reasonable results (though with larger variances), whereas the SN method keeps underestimating and .

The performance of MLE in this case is surprisingly competitive. This is intriguing and in ongoing work, we will think about why this is the case.

Figure 4.4. Mean estimates and and empirical quantiles of (a) ; (b) ; (c) ; (d) ; (e) ; (f) , using MLE (black), SN (red) and EV (blue) methods over 50 replications, where and . For the EV method, 100 tail observations were used to compute .

4.3. Superstar model.

In this section, we consider network data generated from the superstar model. We compare the accuracy of tail index estimates under parametric methods applied to the linear PA model with extreme value estimates applied directly to data.

Networks are simulated from the superstar model with the following parameter values:

The MLE estimates of the tail indices based on (2.9), , are compared to the EV estimates calculated directly from the node degree data, . According to Theorem 2.1, the theoretical marginal tail indices for and , , based on a superstar PA model are given by (2.11), (2.13). This experiment is repeated 50 times and Table 1 records the mean estimates for over these 50 replications.

(2.43, 2.29) (2.11, 2.31) (2.24 2.25)
(2.51, 2.29) (2.03, 2.33) (2.28 2.20)
(2.61, 2.29) (1.97, 2.34) (2.35 2.18)
(2.71, 2.29) (1.91, 2.36) (2.43 2.18)
(2.84, 2.29) (1.86, 2.38) (2.51 2.15)
Table 1. Mean estimates for using both MLE and minimum distance methods, with .

As increases and the influence of the superstar node becomes more profound, the MLE method does not give an accurate estimate of tail indices, while the EV method stays more robust. However, when becomes too large, the in-degrees of non-superstar nodes will be greatly restricted, which increases the finite sample bias in the EV estimates.

Note that the theoretical indices in Table 1 are for the in- and out-degrees of the non-superstar nodes. In the EV methods, the inclusion of the superstar node can severely bias the estimation of . Let be some intermediate sequence such that and as and use to denote the upper order statistics of . Then the corresponding Hill estimator is


From the construction of the superstar model, we know that the superstar node likely has the largest in-degree, which is approximately equal to for large . Hence, the first term in (4.1) goes to 0, as long as