Non-Uniform Distribution of Nodes in the Spatial Preferential Attachment Model
The spatial preferential attachment (SPA) is a model for complex networks. In the SPA model, nodes are embedded in a metric space, and each node has a sphere of influence whose size increases if the node gains an in-link, and otherwise decreases with time. In this paper, we study the behaviour of the SPA model when the distribution of the nodes is non-uniform. Specifically, the space is divided into dense and sparse regions, where it is assumed that the dense regions correspond to coherent communities. We prove precise theoretical results regarding the degree of a node, the number of common neighbours, and the average out-degree in a region. Moreover, we show how these theoretically derived results about the graph properties of the model can be used to formulate a reliable estimator for the distance between certain pairs of nodes, and to estimate the density of the region containing a given node.
Keywords: Spatial Random Graphs, Spatial Preferential Attachment Model, Preferential Attachment, Complex Networks, Web Graph, Co-citation, Common Neighbours
1 Introduction
There has been a great deal of recent interest in modelling complex networks, a result of the increasing connectedness of our world. The hyperlinked structure of the Web, citation patterns, friendship relationships, and the spread of infectious disease are seemingly disparate linked data sets that are fundamentally very similar in nature.
Many models of complex networks have a common weakness: the ‘uniformity’ of the nodes; other than link structure there is no way to distinguish the nodes. One family of models which overcomes this deficiency is the family of spatial (or geometric) models, wherein the nodes are embedded in a metric space. A node’s position—especially in relation to the others—has real-world meaning: the character of the node is encoded in its location. Similar nodes are closer in the space than dissimilar nodes. This metric space has many potential meanings: in communication networks, perhaps physical distance; in a friendship graph, an interest space; in the World Wide Web, a topic space. As an illustration, a node representing a webpage on pet food would be closer in the metric space to one on general pet care than to one on travel.
The Spatial Preferential Attachment (SPA) Model, designed as a model for the World Wide Web, is one such spatial model. As its name suggests, the SPA Model combines geometry and preferential attachment. What sets the SPA Model apart is its use of ‘spheres of influence’ to accomplish preferential attachment: the greater the degree of a node, the larger its sphere of influence, and hence the higher the likelihood that the node gains more neighbours. The SPA model produces scale-free networks, which exhibit many of the characteristics of real-life networks (see [1, 4]). In , it was shown that the SPA model gave the best fit, in terms of graph structure, for a series of social networks derived from Facebook.
As the motivation behind spatial models is the ‘second layer of meaning’—the character of the nodes as represented by their positions in the metric space—we hope to uncover this layer through examination of the link structure. In particular, estimating the distance between nodes in the metric space forms the basis for two important link mining tasks: finding entities that are similar—represented by nodes that are close together in the metric space—and finding communities—represented by spatial clusters of nodes in the metric space. We show how a theoretical analysis of a spatial model can lead to reliable tools to extract the ‘second layer of meaning’.
The majority of spatial models to this point have used a uniform random distribution of nodes in the space. However, this assumption fails to capture an essential aspect of the real-world networks these models represent. On a basic level, if the metric space represents actual physical space and the nodes represent people, then people cluster in cities and towns rather than being uniformly spread across the land. More abstractly, there are more webpages on a popular topic, corresponding to a small area of our metric space, than on a more obscure topic. The development of spatial network models therefore naturally begins to incorporate varying densities of node distribution: ‘clumps’ of higher or lower density, as well as gradually changing densities, are both possibilities. One of the more important goals is community recognition: the discovery and quantification of characteristically (semantically) similar nodes.
In this work we generalize the SPA model to an inhomogeneous distribution of nodes within the space. We assume distinct regions of different densities, where the dense regions are the ‘clusters’. We find that the local regions behave almost as if generated by independent SPA models of parameters derived from the densities. Many earlier results from the SPA Model then translate easily to this inhomogeneous version and we begin the process of uncovering the geometry using link analysis.
In the remainder of this section, we first review related work, and then we give a formal definition of the SPA model. In Section 2 we state our main theoretical results. In particular, we give the typical behaviour of the in-degree of a node, and use this to derive a relationship between spatial distance and number of common neighbours of a pair of nodes. The proofs of the theorems are given in Section 4.
In Section 3 we verify the asymptotic results from Section 2 through a simulation of the SPA model to generate large graphs. Specifically, we show how the relationship between spatial distance and common neighbours can be used to devise a distance estimator which gives precise results. We also use the theoretical results to estimate the local density around a node. Our simulations show that these estimators give reliable results on the simulated data.
1.1 Background and Related Work
Efforts to extract node information through link analysis began with a heuristic quantification of entity similarity: numerical values, obtained from the graph structure, indicating the relatedness of two nodes. Early simple measures of entity similarity, such as the Jaccard coefficient , gave way to iterative graph theoretic measures, in which two objects are similar if they are related to similar objects, such as SimRank . Many such measures also incorporate co-citation, the number of common neighbours of two nodes, as proposed in the context of bibliographic research in an early paper by Small . In , the authors make inferences on the social space for nodes in a social network, using Bayesian methods and maximum likelihood.
Generative spatial models were proposed in a more general setting, where the main objective was to generate graphs with properties that correspond to those observed in real-life networks. Different approaches were explored, for example in  using thresholds, or in [5, 6] using a geometric variant of the preferential attachment. Graph properties of this model were analyzed by Jordan in ; follow-up work on this model can be found in . In , a non-uniform distribution of the points in space is considered. In , Jacob and Mörters propose a probabilistic spatial model where the link probability is a function decreasing with distance. The setting is general, and includes the SPA model as a special case. Follow-up work on this model can be found in .
The SPA model was first proposed in  as a model for the World Wide Web. In  and , it was proved that the SPA model produces graphs with certain graph properties that correspond to those observed in real-life networks. The authors’ previous paper, , used common neighbours to explore the underlying geometry of the SPA model and quantify node similarity based on distance in the space. However, the distribution of nodes in space was assumed to be uniform. The approach used in this paper is similar to that in , but we investigate the complications that arise when the distribution is non-uniform, which is clearly a more realistic setting.
An earlier version of this work, containing no proofs, was presented at the workshop WAW 2013. An extended abstract can be found in .
1.2 The Inhomogeneous SPA Model
We begin with a brief description of our inhomogeneous SPA model. The model presented here is a generalization of the SPA model introduced in , the main difference being that we allow for an inhomogeneous distribution of nodes in the space.
Let be the unit hypercube in , equipped with the torus metric derived from the Euclidean norm, or any equivalent metric. The nodes of the graphs produced by the SPA model are points in chosen via an -dimensional point process. Most generally, the process is given by a probability density function ; is a measurable function such that . Precisely, for any measurable set and any such that ,
In fact, we will restrict ourselves to probability functions that are locally constant. Precisely, we assume that the space is divided into equal sized hypercubes, where is a constant natural number. Each hypercube is of the form (), where . Note that any density function can be approximated by such a locally constant function, so that this restriction is justified.
To keep notation as simple as possible, we assume that each hypercube is labelled , . Let be the density of , so the density function has value on . For any node , let be the hypercube containing , and let be the density of . Clearly, every hypercube has volume . Then the probability that a node , introduced at time , falls in equals , and the expected number of points in equals . It is easy to see that . Thus we may model the point process as follows: at each time step , one of the regions is chosen as the destination of ; region is chosen with probability . Then, a location for is chosen uniformly at random from the chosen region .
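The two-stage sampling just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `sample_point`, the dictionary representation of the densities, and the 2 x 2 example layout are assumptions made for the example. It relies only on the facts stated above, namely that every cell has the same volume and that a cell is chosen with probability equal to its density times its volume.

```python
import random


def sample_point(densities, k, rng):
    """Draw one node position from a locally constant density on [0, 1)^m.

    densities: dict mapping a cell index tuple (i_1, ..., i_m), 0 <= i_j < k,
               to the density of that cell (hypothetical representation).
    Returns (position, cell).
    """
    cells = list(densities)
    m = len(cells[0])
    cell_volume = k ** (-m)
    # Probability that the new node lands in a cell = density * volume.
    weights = [densities[c] * cell_volume for c in cells]
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("densities must sum to k**m")
    cell = rng.choices(cells, weights=weights, k=1)[0]
    # Uniform location inside the chosen cell.
    position = tuple((i + rng.random()) / k for i in cell)
    return position, cell


# Illustrative 2 x 2 layout (m = 2, k = 2); the densities sum to k**m = 4.
layout = {(0, 0): 1.5, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 1.5}
rng = random.Random(1)
samples = [sample_point(layout, 2, rng) for _ in range(20000)]
dense_share = sum(cell[0] == cell[1] for _, cell in samples) / len(samples)
print(round(dense_share, 2))  # expected to be close to 2 * 1.5 / 4 = 0.75
```

The check that the weights sum to one mirrors the normalisation of the density function; a layout that violates it is rejected rather than silently renormalised.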
The SPA model generates stochastic sequences for graphs ; for each , , where is an edge set, and is a node set. The in-degree of a node at time is given by . Likewise the out-degree is given by . The sphere of influence of a node at time is defined as the ball, centred at , with total volume
where are given parameters. If , then and so . We impose the additional restriction that ; this avoids regions becoming too dense. This property will be always assumed. The generation of a SPA model graph begins at time with being the null graph. At each time step (defined to be the transition from to ), a node is chosen from according to the given spatial distribution, and added to to form . Next, independently, for each node such that , a directed link is created with probability , being another parameter of the model.
Let be the distance from to the boundary of . Let be the radius of the sphere of influence of node at time . So if , then is completely contained in at time . We see that
where is the volume of the unit ball; for example, in 2-dimensions with the Euclidean metric, .
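A single run of the generation process can be sketched as follows for the 2-dimensional case. The displayed formula for the volume of the sphere of influence is not reproduced in this text, so the sketch assumes the form used in the original SPA model, $\min\{(A_1 \deg^-(v,t) + A_2)/t,\ 1\}$. That formula, the function names, and the parameter values in the usage lines are assumptions for illustration only. The radius is recovered from the volume through the area of the unit disc, $\pi$.

```python
import math
import random


def torus_distance(a, b):
    """Euclidean distance on the unit torus."""
    return math.sqrt(sum(min(abs(x - y), 1.0 - abs(x - y)) ** 2
                         for x, y in zip(a, b)))


def simulate_spa(n, sample_position, p, a1, a2, rng):
    """Grow a 2-dimensional SPA graph for n steps.

    sample_position(rng) returns the location of the next node, so any
    spatial distribution (uniform or not) can be plugged in.
    Returns (positions, in_neighbours, out_degrees).
    """
    positions, in_nbrs, out_deg = [], [], []
    for t in range(1, n + 1):
        new_pos = sample_position(rng)
        targets = []
        for u, pos_u in enumerate(positions):
            # ASSUMED volume formula (original SPA model), capped at 1.
            volume = min((a1 * len(in_nbrs[u]) + a2) / t, 1.0)
            radius = math.sqrt(volume / math.pi)  # unit disc has area pi
            # Link to u with probability p if the new node lies in its sphere.
            if torus_distance(new_pos, pos_u) <= radius and rng.random() < p:
                targets.append(u)
        new_index = len(positions)
        for u in targets:
            in_nbrs[u].append(new_index)
        positions.append(new_pos)
        in_nbrs.append([])
        out_deg.append(len(targets))
    return positions, in_nbrs, out_deg


# Usage with a uniform distribution and hypothetical parameter values.
rng = random.Random(0)
positions, in_nbrs, out_deg = simulate_spa(
    400, lambda r: (r.random(), r.random()), p=0.6, a1=0.7, a2=2.0, rng=rng)
```

The quadratic scan over existing nodes keeps the sketch short; a spatial index would be a natural replacement for larger graphs.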
As is typical in random graph theory, we shall consider only asymptotic properties of $G_n$ as $n \to \infty$. We say that an event in a probability space holds asymptotically almost surely (a.a.s.) if its probability tends to one as $n$ goes to infinity. We emphasize that the notations $o(\cdot)$ and $O(\cdot)$ refer to functions of $n$, not necessarily positive, whose growth is bounded. Since we aim for results that hold a.a.s., we will always assume that $n$ is large enough.
2 Graph properties of the SPA model
In this section we investigate typical properties of graphs produced by the inhomogeneous SPA model, aiming to use the results to infer the spatial distances between the nodes. A central observation is that in the inhomogeneous SPA model with a locally constant density function, the probability of an edge forming from a new node to an existing node at time equals
In the analysis of the original SPA model from , we find that spheres of influence of nodes that are born early typically shrink rapidly, while nodes born late start with small spheres of influence. A node would have to be quite close to the boundary of its region with another one for the effect of any other region to be felt. With this assumption, the expression for the link probability is very similar to that of the link probability of the original SPA model. Therefore, it seems reasonable to expect that the graph formed by nodes in a region with local density behaves like an independent SPA model of density . Our results will show that this expectation is justified and can be made rigorous.
To be specific, assume that nodes in the SPA model do not arrive at fixed, discrete, time instances , but instead arrive according to a homogeneous Poisson process with rate 1. (This will not significantly change the analysis but is a convenient assumption.) Then, the process inside a region with density will behave like the SPA model with the same parameters , and , but with points arriving according to a Poisson process with rate . This means that in each time interval we expect points to arrive, and the expected time interval between arrivals equals . If we use to denote the -th node arriving, then the arrival time of is approximately , and thus the volume of the sphere of influence of an existing node at the time that is born equals
Thus, in the analysis of the degree of an individual node, we expect a node in the inhomogeneous SPA model to behave like a node in the original SPA model with parameters , instead of , , where the degree of node at time in the inhomogeneous SPA model corresponds to the degree of a node at time in the corresponding SPA model. The following theorems show that this is indeed the case.
Theorem 2.1. Let be any function tending to infinity together with , and let . The following holds with probability . For every node for which
and for which
it holds that for all values of such that ,
Times and are defined as follows:
Condition (1) on ensures that at time , is completely contained in (deterministically). In fact, due to the additional multiplicative factor of , is some distance removed from the boundary of . The expression for is chosen so that at this time node has a.a.s. at least neighbours. Likewise, is chosen such that at this time a.a.s. the sphere of influence has shrunk so that its radius is sufficiently smaller than , again with some extra room to spare. The implication of this theorem is that once a node accumulates at least neighbours and its sphere of influence has shrunk so that it does not intersect neighbouring regions, its behaviour can be predicted with high probability until the end of the process, and is completely governed by its region, and no others. In particular, it follows that from time onwards the sphere of influence is completely contained in .
For most vertices, the moment when they first achieve neighbours (,) will come before the moment that their sphere of influence has shrunk so that it is well contained in the region (). Indeed, consider a vertex of degree at least for which this is not the case. Let be the moment when the vertex reaches in-degree . By definition, the sphere of influence of at this time has a radius of influence of order . If , then the radius is , and the probability that is this close to the border is also . The only vertices for which potentially the radius at time could be fairly large are those vertices for which . Thus, these are the oldest vertices. These vertices do have high degree, but their spheres of influence still tend to shrink over time, so most of their edges will be acquired after time , that is, when their sphere of influence has shrunk to be contained in the region.
We can use the results on the degree to show that each graph induced by one of the regions has a power law degree distribution. Let denote the number of nodes of degree at time in the region . The proof of the following result is a straightforward adaptation of the differential equations method used to prove the counterpart result for the uniform model (see ). Since this theorem is not needed to prove the main result of this paper, the proof is omitted here.
Theorem 2.2. A.a.s. the graph induced by the nodes in region has a power law degree distribution with coefficient . Precisely, a.a.s. for any there exists a constant such that for any ,
Moreover, a.a.s. the entire graph generated by the inhomogeneous SPA model has a degree distribution whose tail follows a power law with coefficient .
The number of edges also validates our hypothesis that a region of a certain density behaves almost as a uniform SPA model with adjusted parameters. In the original SPA model with parameters and replaced by , and , the average out-degree is approximately , as per [1, Theorem 1.3]. The following theorem shows that the subgraph induced by one of the regions has the equivalent expected number of edges. This theorem also shows that a.a.s. the number of edges that cross the boundary of a region is of smaller order than the number of edges completely contained in that region. Thus, almost all edges have both endpoints in the same region.
Theorem 2.3. A.a.s., for all regions of density , . Moreover,
Here we see that we need the condition . If , then the number of edges would grow superlinearly.
Our ultimate goal is to derive the pairwise distances between the nodes in the metric space through an analysis of the graph. The following theorem, obtained using the approach of , provides an important tool. Namely, it links the number of common in-neighbours of a pair of nodes to their (metric) distance. Using this theorem, we can then infer the distance from the number of common in-neighbours.
The theorem distinguishes three cases. If and are relatively far from each other, then they will have no common neighbours. If the nodes are very close, then the number of common neighbours is approximately equal to a fraction of the degree of the node of smallest degree. The third case provides a ‘sweet spot’ where the number of common neighbours is a direct function of the metric distance and the degrees of the nodes. For any two nodes and , let denote the number of common in-neighbours of and at time .
Theorem 2.4. Let be any function tending to infinity together with , and let . The following holds a.a.s. Let and be nodes of final degrees and such that , and . Let and let , and assume that
Let be the distance between and in the metric space. Then, we have the following result about the number of common in-neighbours of and :
Note that, if , then we have a precise asymptotic formula for . If and are approximately equal, then the formula only states that .
3 Reconstruction of Geometry
We set out to discover the character of nodes in a network purely through link structure, and to quantify the similarities. Spatial models allow us a convenient definition of similarity: distances between nodes. In examining the SPA model, the number of common neighbours allows us to uncover a good approximation of pairwise distances, a first step in the reconstruction of the geometry.
Description of Model Used:
For simulations, we use an inhomogeneous SPA model that we call a diagonal layout, which has 4 ‘clusters’ of identical high density, with . In the diagonal layout, and the 4 regions , , are dense, with the others sparse. We will use ‘dense region’ and ‘sparse region’ to denote the union of all regions with densities and , respectively. For ease of notation, we note that , so . Thus it is enough to provide the value of only. In Figure 1 we see an example of the diagonal layout with nodes and edges, and we also see evidence that the densest region does dominate the power law degree distribution. The yellow line is the prediction for the degree distribution with the power law exponent based on the maximum density, as in Theorem 2.2.
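Because the densities of all regions must sum to the number of regions, fixing the dense density determines the sparse one. The sketch below builds such a layout. The grid size $k = 4$ is an assumption (it is consistent with the densities 1.6 and 0.8 quoted later in this section), and the function name is hypothetical.

```python
def diagonal_layout(k=4, rho_dense=1.6):
    """Densities for a k x k grid whose diagonal cells are dense.

    The densities must sum to k * k (the density integrates to 1 and each
    cell has area 1 / k**2), which fixes the sparse density.
    """
    n_cells = k * k
    rho_sparse = (n_cells - k * rho_dense) / (n_cells - k)
    return {(i, j): (rho_dense if i == j else rho_sparse)
            for i in range(k) for j in range(k)}


layout = diagonal_layout()
print(round(layout[(0, 1)], 6))  # 0.8
```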
Our estimator for the distance is derived from Case 3 of Theorem 2.4, and in particular Equation (4), ignoring the error term. This leads to the following formula for the estimated distance . For a pair of nodes with and , , whose distance is such that Case 3 applies, this estimate is given by:
Since a relationship between the spatial distance and the number of common neighbours exists only in Case 3 of Theorem 2.4, we try to eliminate pairs to which one of the other cases applies. Pairs which are in Case 1 are very close, and for such pairs, the expected number of common neighbours is . In an attempt to avoid this case, we filter out all pairs where the number of common neighbours is greater than . Pairs that are in Case 2 are so far apart that their spheres of influence overlap for a very short time, if at all. We try to avoid this case by eliminating pairs with 10 or fewer common neighbours.
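The filtering step can be sketched as below. The lower cut-off of 10 common neighbours is taken from the text. The upper cut-off is not reproduced here, so it is left as a caller-supplied function `upper` of the two in-degrees, which is a hypothetical interface and not the authors' rule.

```python
from itertools import combinations


def filter_pairs(in_nbrs, min_common=10, upper=None):
    """Keep pairs whose common in-neighbour count suggests Case 3.

    in_nbrs[v] is an iterable of the in-neighbours of node v.
    upper(deg_u, deg_v) is an optional, user-chosen upper cut-off
    (hypothetical; the paper's exact threshold is not reproduced here).
    Returns a list of (u, v, common) triples.
    """
    nbr_sets = [set(s) for s in in_nbrs]
    kept = []
    for u, v in combinations(range(len(nbr_sets)), 2):
        common = len(nbr_sets[u] & nbr_sets[v])
        if common <= min_common:
            continue  # 10 or fewer common neighbours: likely too far apart
        if upper is not None and common > upper(len(nbr_sets[u]), len(nbr_sets[v])):
            continue  # too many common neighbours: likely too close
        kept.append((u, v, common))
    return kept


# Toy data: nodes 0 and 1 share 15 in-neighbours, node 2 shares only 5.
toy = [list(range(10, 30)), list(range(15, 40)), list(range(25, 30))]
toy += [[] for _ in range(37)]
print(filter_pairs(toy))  # [(0, 1, 15)]
```

Scanning all pairs is quadratic in the number of nodes; for large graphs one would restrict attention to pairs that share at least one in-neighbour.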
To see the effect of the non-uniform density, we first apply the original estimator to our diagonal layout. In other words, we define the estimated distance as in Equation (5), but taking for all , and we are applying this estimator to the points obtained from a non-uniform distribution. The motivation of this experiment is that, when applying our techniques to real-life data, we are not likely to know the local density of a node. Figure 2 (left side) gives the estimated versus real distance for a graph with nodes, generated via the SPA model from the diagonal layout with parameters , , , . After filtering as described above, 2,270 pairs are left.
The figure shows that the approach of assuming uniform density leads to a consistent overestimate of the distance for the nodes. This may seem counterintuitive. The trouble lies with the estimator’s assumption about a node’s age, which is based on its final in-degree. A node in has more neighbours than is expected when one assumes uniform density, and thus the node is thought to be much older than it actually is. This confounds the distance estimator.
Using the same simulation results, we now apply the estimator from Equation (5), and use our knowledge about . The figure on the right in Figure 2 shows the estimated distance using Equation (5) vs. actual node distance. The results indicate that our new estimator is significantly more accurate in predicting distances for the pairs of nodes in the dense region.
Let us mention that the estimation for pairs in the sparse region is still not accurate, while the estimation for cross-border pairs appears to be even worse. This is likely caused by the fact that nodes that are involved in cross-border pairs, and in sparse region pairs, and that have enough common neighbours to qualify to be included, are likely the older nodes, i.e. they are born near the beginning of the process. For such nodes, in the early stages there is likely some overlap between their sphere of influence and the bordering, dense regions. Thus, the degree likely does not follow the prediction from Theorem 2.1, which, in turn, affects the performance of the distance estimator for those pairs.
Better performance for cross-border pairs could possibly be obtained by using a linear combination of the densities in Equation (5). However, we will see in what follows that better performance for all pairs occurs when we use the data itself to estimate the density. Also, we point out that the pairs in the dense region constitute the large majority of all pairs. Moreover, the dense regions are those that are most likely to correspond to communities of interest. Therefore, accurate prediction for pairs in these regions is most important.
Estimating the density:
In real-world situations, we cannot assume to know the density of the region containing a given node. In fact, the density of the region containing a node is an important part of the ‘second layer of meaning’ which we aim to extract from the graph. Here we will show that our theoretical results give us a tool for estimating the local density around a node, using only its neighbourhood. We also apply our distance estimator once more, this time using the estimated density for our formula.
Using the theoretical results obtained from the previous section, we can estimate the density of the region containing a given node from the average out-degree of the in-neighbours of . As per Theorem 2.3, the average out-degree in is approximately
If we have a large enough set of nodes from the same region, then we can use the formula above to estimate the density of the region. Consider a node , and make two assumptions: (i) almost all neighbours of are contained in , and (ii) the neighbours of form a representative sample of all nodes of . Simulations show that these assumptions are justified and allow us to make an estimate for . Assumption (i) is additionally justified by the second part of Theorem 2.3, which states that the number of edges crossing the border is negligible compared to the total number of edges.
Set to be the average out-degree of the in-neighbours of . Specifically,
Given our assumptions, an estimator for the density in , denoted by , can be derived from this average out-degree, using Equation (6):
where is the set of in-neighbours of .
The left side of Figure 3 shows a histogram of the values of for our simulated graph. Displayed are the results for nodes with . The graph is obtained from the SPA model where points have the previously described diagonal layout, with density in the dense region, and consequently density in the sparse region. For these parameters, Equation (6) gives a theoretical value of 5.85 for if node lies in the dense region, and a value of 1.45 if lies in the sparse region.
We see in Figure 3 (left side) that the values of in the dense region are quite accurate, with peaks occurring around the calculated value of 5.85. For the sparse region, the peaks occur around 2.5, giving an estimate for the average out-degree which is higher than expected. Likely, this is caused by nodes in the sparse region that are located close to the border, and thus are likely to have neighbours in the dense region. Such nodes also tend to have high degree, and our condition on the minimum degree favours the ‘rich’ sparse region nodes.
Figure 3 (right side) gives a histogram of the estimated densities of the nodes. For nodes in the dense region, the true value is 1.6, and we see a good estimation of this value for these nodes. For nodes in the sparse region, the true value is 0.8, while the peak of the estimated densities occurs around 1.15, and almost all values are greater than 0.8. Again, this is likely caused by nodes whose sphere of influence overlapped with the dense region.
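The density estimate can be sketched numerically. Equation (6) is not reproduced in this text, so the code assumes the form $p A_2 \rho / (1 - p A_1 \rho)$, obtained by substituting $\rho A_1$ and $\rho A_2$ into the average out-degree $p A_2 / (1 - p A_1)$ of the original SPA model. The parameter values $p = 0.6$, $A_1 = 0.7$, $A_2 = 2$ are hypothetical; they were chosen because, under this assumed form, they reproduce the theoretical values 5.85 and 1.45 quoted above. Function names are illustrative.

```python
def average_out_degree_of_in_neighbours(in_nbrs, out_deg, v):
    """Average out-degree over the in-neighbours of node v."""
    nbrs = in_nbrs[v]
    if not nbrs:
        raise ValueError("node has no in-neighbours")
    return sum(out_deg[u] for u in nbrs) / len(nbrs)


def expected_out_degree(rho, p, a1, a2):
    """ASSUMED form of the average out-degree in a region of density rho."""
    return p * a2 * rho / (1.0 - p * a1 * rho)


def estimate_density(avg_out, p, a1, a2):
    """Invert expected_out_degree: solve d = p*a2*rho / (1 - p*a1*rho) for rho."""
    return avg_out / (p * (a2 + a1 * avg_out))


P, A1, A2 = 0.6, 0.7, 2.0  # hypothetical parameter values
print(round(expected_out_degree(1.6, P, A1, A2), 2))  # 5.85
print(round(expected_out_degree(0.8, P, A1, A2), 2))  # 1.45
print(round(estimate_density(2.5, P, A1, A2), 2))     # 1.11
```

Under these assumptions, an observed average out-degree of 2.5, the sparse-region peak reported above, maps to a density of about 1.11, which is in line with the reported peak near 1.15.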
To obtain better performance for nodes in the sparse region, we propose to base our estimated density for a node in the sparse region only on the out-degree of neighbours of of low in-degree; such neighbours are young and so the sphere of influence of had shrunk, and thus was more likely to be fully inside the sparse region, when the neighbours were born. To obtain density estimates for nodes with small in-degree, we can take the second neighbourhood to compute the average out-degree. Nodes with small in-degrees are young, so even second neighbours are likely to be close. We plan to explore these possibilities in future work.
Finally, we use , and known values of all other parameters, to calculate the distance between the nodes based on the number of common neighbours, Equation (5), using the same simulation results as those we used earlier. Here we use the calculated density of the node of higher degree in the distance formula. (Using the lower degree node gives similar results.) The results are seen in Figure 4.
The figure shows that there is very good agreement between actual and estimated distances. In fact, we see that the agreement is greatly improved for the cross-border pairs, and also better for the sparse region pairs. This can be understood as follows. The distance estimator is derived indirectly from Theorem 2.1, which predicts the approximate degree of a node throughout the process, based on its final degree and the density of its region. For nodes in the sparse region which have a sizeable number of neighbours in the dense region, the degree will be larger than predicted using this method, but the density estimator will also predict a higher density. So the estimated density is a better indicator of the behaviour of the degree than the real density, and thus the distance estimator gives better performance. This indicates that this last variation of the distance estimator is the most robust against local fluctuations in density. Thus we have a good prognosis for the applicability of the estimator to real data, where such fluctuations are to be expected.
4 Proofs
In this section, we give the proofs of the main theorems. Our results all refer to typical behaviour of the random SPA model process, and are asymptotic in , the number of vertices. We will sometimes use the stronger notion of w.e.p. in place of the more commonly used a.a.s., since it simplifies some of our proofs. We say that an event holds with extreme probability (w.e.p.), if it holds with probability at least as , where is any function tending to infinity together with . Thus, if we consider a polynomial number of events that each holds w.e.p., then w.e.p. (and hence also a.a.s.) all events hold.
First we state and prove a theorem that bounds the in-degree of any node, regardless of its distance to the boundary.
Theorem 4.1. Let be any function tending to infinity together with . The expected in-degree at time of a node born at time satisfies
Moreover, for any node born at time we have
In order to simplify calculations, we make the following substitution:
It follows immediately from the definition of the process that and for
We couple with another random variable so that for . Random variable is defined as follows: and for
Finding the conditional expectation,
Taking expectations, we get
and, since ,
If , we have
This shows the upper bound.
For the lower bound, we first observe that for all nodes , . Thus the node links to with probability at least
Using this, we can use the exact same approach to bound the expectation of from below. This gives the lower bound. ∎
Proof of Theorem 2.1
Here we show that, once a vertex has reached an in-degree of and its area of influence is well contained within the region, its degree can be closely predicted with high probability. We will be using the following version of the Chernoff bound, as seen in e.g. [10, p. 27, Corollary 2.3].
Let be a random variable that can be expressed as a sum of independent random indicator variables, , where with (possibly) different . If , then
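The displayed inequality (7) is not reproduced in this text. A commonly quoted form of this Chernoff bound, assumed here, is $P(|X - E X| \ge \varepsilon E X) \le 2\exp(-\varepsilon^2 E X / 3)$ for $0 < \varepsilon \le 3/2$. The following Monte Carlo sketch compares that assumed bound with the empirical tail of a sum of independent indicators with different success probabilities; the probabilities, sample size, and function names are arbitrary choices for illustration.

```python
import math
import random


def chernoff_bound(mean, eps):
    """ASSUMED form of the bound: 2 * exp(-eps^2 * mean / 3), 0 < eps <= 3/2."""
    return 2.0 * math.exp(-eps ** 2 * mean / 3.0)


def empirical_tail(probs, eps, trials, rng):
    """Fraction of trials in which |X - E X| >= eps * E X."""
    mean = sum(probs)
    hits = 0
    for _ in range(trials):
        # X is a sum of independent indicators with success probabilities probs.
        x = sum(rng.random() < q for q in probs)
        if abs(x - mean) >= eps * mean:
            hits += 1
    return hits / trials


probs = [0.1 + 0.4 * i / 199 for i in range(200)]  # E X = 60
rng = random.Random(0)
tail = empirical_tail(probs, 0.3, 2000, rng)
bound = chernoff_bound(sum(probs), 0.3)
print(tail <= bound)  # True
```

The bound is far from tight in this example, which is typical: in the proofs it is the exponential decay in the mean, not the constant, that matters.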
Let us start with the following key lemma.
Lemma 4.3. Let be any function tending to infinity together with , and let . For a given node , suppose that and that
Then, with probability , for every value of , ,
Let . Our goal is to estimate . We will show that the upper bound holds; the lower bound can be obtained by using an analogous, symmetric, argument. Note that the assumption on implies that .
We use the following stopping time
Note that if , then the in-degree of remained bounded as required during the entire time interval . Hence, in order to prove the bound, we need to show that with probability we have .
Suppose that . Note that for up to and including time-step , the random variable is (deterministically) bounded from above. Moreover, it is straightforward to see that this upper bound, together with the assumption on (note the additional multiplicative term), implies that for all . Hence, the number of new neighbours accumulated during this phase of the process, , can be (stochastically) bounded from above by the sum of independent indicator random variables , where
Clearly, since ,
Since , the in-degree of at time failed the desired condition, which implies that
using again that it is assumed that . It follows from the Chernoff bound (7) that
where . The maximum value of corresponds to and so
So . Therefore, the probability that is at most and the proof is finished. ∎
Proof of Theorem 2.1.
Let be a function going to infinity with , and let . Let be a vertex with final degree , let , and assume that . Let be the first time that the in-degree of exceeds , and be the first time that the radius of influence . Moreover, let be the first time that the two events hold. Finally, let . We obtain from Lemma 4.3 that, with probability ,
for . It follows that the degree tends to grow but the sphere of influence tends to shrink between and , and thus that the conditions of Lemma 4.3 again hold at time . We can now keep applying the same lemma for times , , , , using the final value as the initial one for the next period, to get the statement for all values of from up to and including time . Precisely, for , let . Then by Lemma 4.3, we have for that , where . Since we apply the lemma times (for a given vertex ), the following statement holds with probability from time on: for any , we have that
It remains to make sure that the accumulated multiplicative error term is still only . For that, let us note that
since grows faster than . A symmetric argument can be used to show a lower bound for the error term and so the result holds.
It follows that we have the desired behaviour from time . Precisely, for times , we have that
where . As , we need to consider two cases. Suppose first that . Setting and , we obtain that
Therefore, for large enough , we have that . Suppose then that . By definition,
and, since ,
Again, for large enough , we have that . In either case, . As a result, we obtain that, for ,