Analyzing complex networks through correlations in centrality measurements
Many real world systems can be expressed as complex networks of interconnected nodes. It is frequently important to be able to quantify the relative importance of the various nodes in the network, a task accomplished by defining some centrality measures, with different centrality definitions stressing different aspects of the network. It is interesting to know to what extent these different centrality definitions are related for different networks. In this work, we study the correlation between pairs of a set of centrality measures for different real world networks and two network models. We show that the centralities are in general correlated, but with stronger correlations for network models than for real networks. We also show that the strength of the correlation of each pair of centralities varies from network to network. Taking this fact into account, we propose the use of a centrality correlation profile, consisting of the values of the correlation coefficients between all pairs of centralities of interest, as a way to characterize networks. Using the yeast protein interaction network as an example we show also that the centrality correlation profile can be used to assess the adequacy of a network model as a representation of a given real network.
Important aspects of many systems can be represented by complex networks Barabasi02 (); Dorogovtsev02 (); Newman03 (); Newman10 (); Costa11 (). As the nodes frequently differ with respect to their essentiality, since the beginning of the study of networks the quantification of the importance of their nodes has been receiving attention, as can be seen, e.g. in Refs. Freeman78 (); Bolland88 (); Friedkin91 (); Faust97 (); Jeong01 (); Borgatti06 (); Manimaran09 (); Wang11 (). There are different ways to quantify the importance, or centrality, of a node, and therefore a large number of measures used for this purpose, with new centrality measures being constantly proposed for use in new applications or to achieve better results in old ones, see e.g., Refs. Bonacich87 (); Stephenson89 (); Rothenberg95 (); Valente98 (); Newman05 (); Estrada05 (); Borgatti05 (); Costa07 (); Estrada08 (); Guimera05 (); Koschutzki05 (); Koschutzki08 (); Kitsak10 (); Pauls12 (); Avrachenkov13 (); Molinero13 (); Campiteli13 (); Benzi13 ().
Although based on different definitions, the various node centralities are, in real networks, correlated: nodes that are important according to one definition are frequently also important according to others. For example, nodes with high degree also have high closeness centrality Bolland88 (). Some papers have already analysed those correlations, while others perform a correlation analysis when proposing new centralities Bolland88 (); Valente08 (); Ochab12 (); Pauls12 (); Avrachenkov13 (); Iyer13 (); Campiteli13 (); Benzi13 (). Nonetheless, there are nodes with a high value for one centrality and a low value for another, and the correlations are not the same for all networks, as will be shown below.
In this work we systematically study the correlations between all pairs of a set of centralities using some real-world networks and two network models. We find that the correlations are generally strong, but there are marked differences between correlations in real networks and in models, and also among different real networks. We therefore suggest the use of such correlations as a way to characterize networks.
This paper is organized as follows. In Section II we present the considered network centralities and the real networks and models used. In Section III we present scatter plots of the centrality values for some pairs of centralities (Section III.1) and show that the relations are mostly power law; taking into account this power-law behavior, we present Pearson correlations for the logarithms of the centralities, which will give us a measure of how closely the centralities are related by a power law. Next (Section III.2) we present the concept of centrality correlation profile to characterize networks. This is followed by showing (Section III.3) that this centrality correlation profile can be used to distinguish the real networks from their randomly rewired counterparts, as well as from the network models. Finally (Section III.4) we propose the use of this profile to evaluate models for a given real network, using as an example a model for protein-protein interaction networks.
II Basic concepts and datasets
We are considering only undirected, unweighted networks without multiple edges or self connections. In this case, the network can be represented by a symmetric $N \times N$ adjacency matrix $A$ whose elements $a_{ij}$ (where $N$ is the number of nodes) are $1$ if nodes $i$ and $j$ are connected and $0$ otherwise. Some networks are in fact weighted, as described below, but we disregard the weights. We also drop self and multiple connections when present. When a network has more than one connected component, we consider only the nodes in the largest component.
An important concept is that of shortest paths. A path is a sequence of nodes where each two subsequent nodes are directly connected and no node is repeated in the path. A shortest path between nodes $i$ and $j$ is a path starting at node $i$, ending at node $j$ and with the smallest possible number of intermediate nodes in the path.
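The preprocessing described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`simplify`, `largest_component`, `adjacency_matrix`) and the toy edge list are hypothetical and chosen only for this example.

```python
from collections import deque

def simplify(edges):
    """Adjacency dict of the simple undirected graph induced by `edges`.

    Self connections are dropped; repeated edges collapse because the
    neighbors of each node are stored in a set. Weights are never read.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def largest_component(adj):
    """Restrict `adj` to the nodes of its largest connected component."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search collects the component containing `start`.
        comp, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    queue.append(v)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return {u: adj[u] & best for u in best}

def adjacency_matrix(adj):
    """Symmetric 0/1 adjacency matrix (list of lists) and the node order used."""
    nodes = sorted(adj)
    index = {u: k for k, u in enumerate(nodes)}
    A = [[0] * len(nodes) for _ in nodes]
    for u in nodes:
        for v in adj[u]:
            A[index[u]][index[v]] = 1
    return nodes, A

# Toy input (hypothetical): a triangle with a self connection on node 3,
# a repeated edge 1-2, and a separate two-node component.
edges = [(1, 2), (2, 3), (3, 1), (3, 3), (1, 2), (4, 5)]
nodes, A = adjacency_matrix(largest_component(simplify(edges)))
```

In this toy input only the triangle survives, so `A` is the 3 x 3 matrix with zeros on the diagonal and ones elsewhere.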
II.1 Node centralities
There is a large number of centrality measures in the literature. For brevity, we will work with some of them, including the most used ones, instead of trying to be comprehensive. Although other centralities can be important in many applications, the methodology employed here could be, if needed, easily extended to include other centralities. Furthermore, we use PCA (see Sec. III.2) to automatically compensate for possible redundancies among the centralities, and the results show that using other centralities would not contribute significantly for the considered networks (as enough discrimination is already achieved). It is plausible that other centralities could be necessary for a different dataset.
Degree centrality quantifies the importance of a node by counting its number of connections. Using the elements $a_{ij}$ of the adjacency matrix, the degree of node $i$, represented as $k_i$, is computed as
$$k_i = \sum_{j=1}^{N} a_{ij}.$$
Here we use a normalized degree centrality, given by dividing the degree by the maximum possible degree:
$$c^{D}_i = \frac{k_i}{N-1}.$$
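As a small illustration of the normalization, the sketch below computes the normalized degree from a 0/1 adjacency matrix stored as a list of lists. The function name and the toy star graph are assumptions made for this example.

```python
def degree_centrality(A):
    """Normalized degree: row sum of the 0/1 adjacency matrix divided by N - 1."""
    n = len(A)
    return [sum(row) / (n - 1) for row in A]

# Toy star graph (hypothetical): node 0 is connected to nodes 1, 2 and 3.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
centrality = degree_centrality(star)
```

The hub reaches the maximum value of 1, while each leaf has one connection out of three possible ones.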
Just counting the number of connections, as done in the degree centrality, can give a distorted view of the importance of a node, because it does not quantify the importance of its neighbors. In principle, the importance of the neighbors should be considered when assessing the importance of a node. If $x_i$ is the importance of node $i$, we can compute it in a self-consistent way through
$$x_i = \frac{1}{\lambda} \sum_{j=1}^{N} a_{ij} x_j,$$
where $\lambda$ must be chosen appropriately. In vector form we have:
$$A \mathbf{x} = \lambda \mathbf{x},$$
which tells us that $\mathbf{x}$ is an eigenvector of the adjacency matrix and $\lambda$ the corresponding eigenvalue. In fact, we use the eigenvector associated with the largest eigenvalue of the adjacency matrix, and the eigenvector centrality of node $i$ is the $i$-th entry in this eigenvector.
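One common way to obtain the eigenvector of the largest eigenvalue is power iteration; the paper does not say which method was used, so the following is only a sketch. It assumes a connected network and iterates with the adjacency matrix plus the identity, which has the same eigenvectors but avoids the oscillation that plain power iteration shows on bipartite graphs. The function name, the tolerance and the toy path graph are assumptions.

```python
from math import sqrt

def eigenvector_centrality(A, iterations=1000, tol=1e-12):
    """Leading eigenvector of A by shifted power iteration (unit Euclidean norm)."""
    n = len(A)
    x = [1.0 / sqrt(n)] * n
    for _ in range(iterations):
        # y = (A + I) x: the shift by the identity keeps the eigenvectors of A
        # and makes the largest eigenvalue strictly dominant in magnitude.
        y = [x[i] + sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(v * v for v in y))
        y = [v / norm for v in y]
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x

# Toy path graph 0 - 1 - 2 (hypothetical); it is bipartite, which is the
# case where the shift matters.
path = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
x = eigenvector_centrality(path)
```

For this path the middle node is more central than the end nodes by a factor of the square root of 2, which is the largest eigenvalue of the matrix.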
It is also possible to take the word “centrality” more literally and search for nodes that are central in the sense of being on average closer to the other nodes. If $d_{ij}$ is the shortest path distance between nodes $i$ and $j$, we can compute the closeness centrality of node $i$ as
$$c^{C}_i = \frac{N-1}{\sum_{j \neq i} d_{ij}}.$$
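For unweighted networks the shortest path distances can be obtained with a breadth-first search from each node. The sketch below assumes a connected network with at least two nodes and uses the inverse of the average distance as the closeness value; this normalization, the function names and the toy path graph are assumptions for illustration.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest path distance (number of links) from `source` to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness_centrality(adj):
    """Inverse of the average distance from each node to all other nodes."""
    n = len(adj)
    return {u: (n - 1) / sum(bfs_distances(adj, u).values()) for u in adj}

# Toy path graph 0 - 1 - 2 (hypothetical), stored as an adjacency dict.
path = {0: [1], 1: [0, 2], 2: [1]}
closeness = closeness_centrality(path)
```

The middle node is at distance 1 from both other nodes, so its closeness is 1; each end node has total distance 3 and closeness 2/3.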
Assuming pairs of nodes in the network must interact, if they are not directly connected the interaction must go through intermediary nodes. A node is important in the betweenness centrality sense if it must be used as an intermediary for many pairs of nodes (under the assumption that the interactions always follow a shortest path, that is, a path with minimum number of intermediaries).
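The betweenness centrality of node $i$, represented as $B_i$, is computed by the expression:
$$B_i = \sum_{s,t} \frac{\sigma_{st}(i)}{\sigma_{st}},$$
where $s \neq t \neq i$, $\sigma_{st}$ is the number of shortest paths from $s$ to $t$ and $\sigma_{st}(i)$ is the number of shortest paths from $s$ to $t$ that pass through $i$.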
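Shortest-path betweenness is usually computed with Brandes' accumulation algorithm rather than by enumerating paths; the paper does not state which implementation was used, so the following is a sketch only. It assumes an undirected, unweighted network given as an adjacency dict and counts each unordered pair of endpoints once (an assumption about the convention). The function name and the toy graphs are hypothetical.

```python
from collections import deque

def betweenness_centrality(adj):
    """Shortest-path betweenness via Brandes' algorithm (unnormalized)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        # and recording predecessors along shortest paths.
        order = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies from the farthest nodes to s.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was visited from both endpoints, so halve the totals.
    return {v: b / 2.0 for v, b in bc.items()}

# Toy graphs (hypothetical): a three-node path and a four-node cycle.
path = {0: [1], 1: [0, 2], 2: [1]}
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
bc_path = betweenness_centrality(path)
bc_cycle = betweenness_centrality(cycle)
```

In the path, the middle node lies on the only shortest path between the two ends. In the cycle, opposite nodes are joined by two shortest paths, so each intermediate node receives half a unit.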
Current flow betweenness centrality
Betweenness centrality takes into account only the shortest paths from a node $s$ to another node $t$. It is possible for the nodes to interact through other paths. This is taken into account by the centrality measure based on computing the current flow through the network elements, supposing that each link is a resistor (with a value of 1 for unweighted networks) and considering all possible sources and drains for the current. This is equivalent to counting the number of times a random walk from a node $s$ to a node $t$ passes through a given node $i$, for all pairs $(s, t)$ (but canceling back-and-forth movements of the walker that do not contribute to a net movement toward the target) Newman05 (); Brandes05 (), and is therefore also called random walk betweenness centrality.
Current flow closeness
Also known as information centrality Brandes05 (), this measure, first proposed in Ref. Stephenson89 (), is a generalization of the closeness centrality along the same lines as the current flow (or random walk) betweenness centrality is a generalization of the shortest-path betweenness centrality: by considering alternate paths from a node to other nodes instead of just the shortest path.
Subgraph centrality takes into account the participation of a node in subgraphs, giving larger weight to smaller subgraphs Estrada05 (). Closed walks starting and ending in a node are counted and weighted with the inverse factorial of their size. With the chosen weighting, the values can be efficiently computed using the spectral decomposition of the adjacency matrix. If $\lambda_j$ are the eigenvalues and $\mathbf{v}_j$ are the respective eigenvectors, the subgraph centrality of node $i$ can be computed using the expression
$$SC_i = \sum_{j=1}^{N} \left(v_j^i\right)^2 e^{\lambda_j},$$
where $v_j^i$ is the $i$-th element of eigenvector $\mathbf{v}_j$.
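To make the walk-counting definition concrete, the sketch below evaluates it directly with a truncated series instead of the spectral expression: the number of closed walks of each length at a node is the corresponding diagonal entry of a power of the adjacency matrix, weighted by the inverse factorial of the length. The truncation at 30 terms, the function name and the toy graph are assumptions, and the cubic-time matrix product is suitable only for small examples.

```python
from math import cosh

def subgraph_centrality(A, terms=30):
    """Sum over k of (A^k)_ii / k!, truncated after `terms` powers of A."""
    n = len(A)
    # `term` holds A^k / k!, starting from the identity matrix (k = 0).
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    sc = [1.0] * n
    for k in range(1, terms + 1):
        term = [[sum(term[i][m] * A[m][j] for m in range(n)) / k
                 for j in range(n)] for i in range(n)]
        for i in range(n):
            sc[i] += term[i][i]
    return sc

# Toy graph (hypothetical): two nodes joined by a single link. Only closed
# walks of even length exist, so the series sums to cosh(1).
single_edge = [
    [0, 1],
    [1, 0],
]
sc = subgraph_centrality(single_edge)
expected = cosh(1.0)
```

For larger networks the spectral form is the practical choice, since one eigendecomposition replaces the repeated matrix products.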
II.2 Networks and network models
We work here with the following available network datasets: Zachary’s karate club (represents friendship between 34 members of a karate club) Zachary77 (); dolphin social network (frequent association between 62 dolphins) lusseau03 (); high-energy theory collaboration (coauthorship in preprints on the hep-th section in arXiv.org) Newman01:pnas (); Newman01:pre:1 (); Newman01:pre:2 (); network science collaborations (coauthorship in network science papers) Newman06 (); books about US politics published around 2004 and sold on Amazon.com (edges show frequent co-purchase) Krebs (); power grid (topology of the power grid of the Western States of the USA) Watts98 (). Table 1 shows some measurements for the networks.
Our emphasis is showing results for real networks. We therefore include only two simple models for comparison, the Erdős-Rényi (ER) random graphs Erdos59 () and the Barabási-Albert (BA) scale-free networks Barabasi97 (). The method used here could be applied for other models (as done in Section III.4), if appropriate.
III Results and discussion
Given a network, we compute the centralities of each of its nodes and search for correlations between pairs of centralities, with each node in the network corresponding to a data point.
Figure 1 shows scatterplots for some of the pairs of centralities for the network models, while the plots for the real networks are presented in Figure 2 (best correlations) and Figure 3 (worst correlations). We do not show closeness or current flow closeness centrality results for the best cases, as these measurements span a limited range, which limits the significance of a good correlation in a log-log plot. Due to space limitations, we show only two of the pairs with largest and smallest correlation values, respectively, for each network. These plots suggest that the centralities are correlated, with visible correlations even in the weakest cases for some networks, and that the correlations are close to a power law, especially for high values of centralities.
To quantify how closely two measurements are related by a power law, we use log-log plots and compute the Pearson correlation coefficient between the logarithms of the values of the centralities. Similar results were also found when using the Pearson and Spearman correlations of the measurements (without the logarithms). The use of the logarithms here is to emphasize possible power laws, as observed in Figures 1 and 3.
Table 2 shows the values of the Pearson coefficients for all pairs of logarithms of the centralities in the networks studied. For the ER and BA models, 50 networks (for each model) of 1000 vertices and average degree 6 were used. We can see that the centralities have, in general, large values of the Pearson coefficient, which in our case implies proximity to a power law relation. With the exception of a small negative coefficient between closeness and eigenvector centralities for the power grid network, all coefficients are positive. This means that nodes that are important with respect to one definition are, in general, also important according to other definitions. It can be seen that the network models studied present larger coefficients than the real networks (with the exception of the karate club network, which also has large coefficients; the karate club is a small network, with a few dominating nodes and many peripheral nodes, which could explain the large values). Especially noticeable is the coefficient of closeness with eigenvector centrality, which is almost perfect for the network models (0.99 for ER and 0.98 for BA), but non-existent for the power grid network. It is, therefore, important to be careful when generalizing conclusions from results using such simplified models to real networks. The coefficients of degree with current flow (random walk) closeness and subgraph centrality are large for all networks, with the exception of the power grid network. To a lesser extent, the same is true for other pairs involving degree, betweenness and current flow betweenness. Other pairs have large coefficients in some networks, but small coefficients in others. For instance, betweenness and subgraph centralities have large coefficients for the network models, the dolphins and karate networks, but small values for the other networks.
Most interesting is the case of betweenness and eigenvector centralities: they have small coefficients for all real networks (with the exception of the karate club network, where it is large, but smaller than for other pairs), but large coefficients for the network models. This suggests that they complement each other when analyzing real world networks, and reinforces our previous observation about the inadequacy of generalizing conclusions based on simple models.
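The coefficient described above can be sketched as follows. If two centralities are related exactly by a power law, their logarithms are related linearly and the coefficient is 1. How nodes with a zero centrality value (for which the logarithm is undefined) were treated is not stated in the text; dropping them, as done here, is an assumption, as are the function names and the toy values.

```python
from math import log, sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def log_pearson(c1, c2):
    """Pearson coefficient between the logarithms of two centralities.

    Assumption: nodes where either value is not positive are skipped,
    because their logarithm is undefined.
    """
    pairs = [(log(a), log(b)) for a, b in zip(c1, c2) if a > 0 and b > 0]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)

# Toy values (hypothetical): c2 = 3 * c1**2, an exact power law.
c1 = [1.0, 2.0, 4.0, 8.0]
c2 = [3.0 * v ** 2 for v in c1]
r = log_pearson(c1, c2)
```

Here `r` equals 1 up to rounding, whereas the plain Pearson coefficient of the same values is below 1 because the relation is not linear.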
| CF betweenness / CF closeness | 0.91 | 0.76 | 0.57 | 0.39 | 0.77 | 0.70 | 0.58 | 0.93 |
III.2 Correlation profile of networks
These results suggest that each network or network model has a specific profile of correlations between centrality measurements. We call this the centrality correlation profile of the network. To show that this profile can be used to characterize the networks, Figure 4 plots a two-dimensional projection of the real world networks from the space defined by the centrality correlation profile using principal component analysis (PCA) Abdi10 (). Each network is a point in a 21-dimensional space defined by the values of the Pearson correlation between the logarithms of the seven considered measurements. The points are projected to the two principal components for visualization. Note how the networks generated by the same model are clustered in small regions, while the different real networks or models are spread through the graph. The only exception is the small karate club network, which is close to the ER model cluster.
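A possible way to assemble the profile and its projection is sketched below with NumPy. With seven centralities there are 7 x 6 / 2 = 21 pairs, which gives the 21 coordinates of each network. The function names are assumptions, all centrality values are assumed to be positive, and the input is random placeholder data used only to show the shapes involved; it does not reproduce any result of the paper.

```python
from itertools import combinations

import numpy as np

def correlation_profile(centralities):
    """Vector of Pearson coefficients between log-centralities, one per pair.

    `centralities` maps a centrality name to its positive per-node values.
    """
    names = sorted(centralities)
    logs = {k: np.log(np.asarray(centralities[k], dtype=float)) for k in names}
    return np.array([np.corrcoef(logs[a], logs[b])[0, 1]
                     for a, b in combinations(names, 2)])

def pca_project(profiles, n_components=2):
    """Project the profiles onto their first principal components via SVD."""
    X = np.asarray(profiles, dtype=float)
    Xc = X - X.mean(axis=0)            # center each coordinate
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T    # coordinates along the top components

# Placeholder data (hypothetical): 5 "networks", each with 7 random
# centralities over 50 nodes.
rng = np.random.default_rng(0)
profiles = []
for _ in range(5):
    fake = {f"c{k}": rng.uniform(0.1, 1.0, size=50) for k in range(7)}
    profiles.append(correlation_profile(fake))
coords = pca_project(profiles)
```

Each row of `coords` is one network as a point in the plane; with real profiles, networks generated by the same model would be expected to fall close together.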
III.3 Comparison with random rewiring
In our next experiment we generate, for each real network, 100 random networks with the same degree sequence through link rewiring Milo02 (). In this method, a pair of edges is randomly chosen, the original edges are removed and substituted by two new edges among the same vertices; the process is repeated a certain number of times. Each rewired network is generated by a number of random rewirings equal to 100 times the number of edges in the original network. We also include, for comparison, 100 networks each for the ER and BA models with the same number of nodes and similar average degrees. We compute the centrality correlation profiles of all networks and generate a two-dimensional PCA projection. The results are shown in Figure 5. With the exception of the karate club network, where the real network is inside the region of the rewired networks, we can distinguish the networks from their random rewirings and the two other models. This stresses the fact that, although there is generally a strong correlation between the various centralities, there is also important information in the specific wiring pattern of real networks, resulting in distinct correlation profiles. It is also interesting to note that the randomly rewired networks are sometimes closer to the ER, sometimes to the BA networks, but always closer to the models than the corresponding real network, supporting the assertion that the centrality correlation profile is characteristic of the specific network.
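The rewiring step can be sketched as a sequence of double edge swaps. The details below are assumptions rather than the authors' procedure: swaps that would create a self connection or a repeated edge are rejected, rejected attempts count toward the total, and connectivity of the result is not enforced. The input is assumed to be a simple graph, and the function names and the toy edge list are hypothetical.

```python
import random
from collections import Counter

def degrees(edges):
    """Degree of every node appearing in the edge list."""
    return Counter(node for edge in edges for node in edge)

def rewire(edges, swaps_per_edge=100, seed=None):
    """Degree-preserving randomization by repeated double edge swaps."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(swaps_per_edge * len(edges)):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:
            c, d = d, c                # pick one of the two possible pairings
        # Proposed replacement: a-b and c-d become a-d and c-b.
        if a == d or c == b:
            continue                   # would create a self connection
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue                   # would create a repeated edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

# Toy network (hypothetical): a ring of 10 nodes plus two chords.
original = [(k, (k + 1) % 10) for k in range(10)] + [(0, 5), (2, 7)]
rewired = rewire(original, seed=1)
```

Every accepted swap removes one link from each of the four nodes involved and gives one back, so `degrees(rewired)` equals `degrees(original)` regardless of the seed.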
III.4 Evaluating models with the correlation profile
Considering the previously presented results, we suggest that the centrality correlation profile can be used as a tool to test the adequacy of a network model developed to study a given real network. If the real network can be considered typical, with respect to the correlation profile, in comparison to networks generated using the proposed model, the model is appropriate. In an ideal case, we would know the distribution of points representing the generated networks in the correlation profile space and use standard statistical methods to evaluate the probability of the real network being generated by the model. In practice, when the correlation profile of the model is not known, we can use PCA projections of the real network and a large number of generated networks to achieve an informal confirmation of the model. To demonstrate this procedure, we use the yeast protein interaction network from Ref. Bu03 () and compare it with Barabási-Albert networks and the model developed by Pastor-Satorras et al. Pastor03 () specifically for protein interaction networks. Figure 6 shows a PCA projection of the centrality correlation profile of the network and 30 random networks generated by each model. The real network is much closer to the networks generated by the Pastor-Satorras et al. model than to the ones generated by the Barabási-Albert model. But the yeast network cannot be considered a typical network from the Pastor-Satorras et al. model, as it lies outside of the region of correlation profile space spanned by the random networks generated according to the model, demonstrating that there are still important structural details in the real network not accounted for by the model.
Various centrality measurements are commonly used to discriminate important nodes in complex networks. The different measurements correspond to different definitions of the importance of the nodes, but our results have shown that they are in general strongly correlated for real networks, and even more for the two network models studied. We considered the following measurements: degree, closeness, betweenness, eigenvector, subgraph, current flow closeness, and current flow betweenness centralities. For most pairs of centralities, the Pearson correlation coefficients are above 0.5 for most networks, with some pairs showing coefficients above 0.95 for some networks, especially the network models. The log-log scatter plots show that the correlations are especially strong for high centrality nodes, where they follow a power law. But the correlation values vary strongly from one network to another. For example, while the Pearson correlation coefficient between closeness and eigenvector centrality is 0.99 for the ER model and 0.92 for the karate club network, it is almost zero for the power grid network. We therefore proposed the use of the centrality correlation profile, consisting of the values of the correlation coefficient for all pairs of centralities studied, to characterize a network. Our results show that the networks can be distinguished using this profile. We have also shown, using the example of the yeast protein interaction network, how the centrality correlation profile can be used to verify to what extent a model (in our example the Pastor-Satorras et al. model) is adequate to explain a given network.
Interesting open questions suggested by this work include: Why are the correlation coefficients for the network models so strong for almost all pairs of centralities? Are the power laws seen for high centrality values due to specific topological features of the considered networks or do they result from the definitions of the measurements? Why are correlations in real networks consistently smaller than in the models? Do the results hold for other models and real networks? What kind of topological features make some correlations smaller and others larger for a given network? An answer to the last question would help us design more adequate models for some networks and therefore understand them better.
-  R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Rev. Mod. Phys., 74:47–97, 2002.
-  S. N. Dorogovtsev and J. F. F. Mendes. Evolution of networks. Advances in Physics, 51(4):1079–1187, 2002.
-  M. Newman. The structure and function of complex networks. SIAM Review, 45(2):167–256, 2003.
-  M. Newman. Networks: An Introduction. Oxford University Press, Inc., New York, NY, USA, 2010.
-  L. da F. Costa, O. N. Oliveira, G. Travieso, F. A. Rodrigues, P. R. Villas Boas, L. Antiqueira, M. P. Viana, and L. E. Correa Rocha. Analyzing and modeling real-world phenomena with complex networks: A survey of applications. Advances in Physics, 60(3):329–412, 2011.
-  L. C. Freeman. Centrality in social networks: Conceptual clarification. Social Networks, 1(3):215–239, January 1978.
-  John M. Bolland. Sorting out centrality: An analysis of the performance of four centrality models in real and simulated networks. Social Networks, 10(3):233–253, September 1988.
-  N. E. Friedkin. Theoretical Foundations for Centrality Measures. The American Journal of Sociology, 96(6):1478–1504, 1991.
-  K. Faust. Centrality in affiliation networks. Social Networks, 19(2):157–191, April 1997.
-  H. Jeong, S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. Lethality and centrality in protein networks. Nature, 411(6833):41–42, May 2001.
-  S. P. Borgatti and M. G. Everett. A Graph-theoretic perspective on centrality. Social Networks, 28(4):466–484, October 2006.
-  P. Manimaran, Shubhada R. Hegde, and Shekhar C. Mande. Prediction of conditional gene essentiality through graph theoretical analysis of genome-wide functional linkages. Mol. BioSyst., 5(12):1936–1942, 2009.
-  J. Wang, H. Mo, F. Wang, and F. Jin. Exploring the network structure and nodal centrality of China’s air transport network: A complex network approach. Journal of Transport Geography, 19(4):712–721, July 2011.
-  P. Bonacich. Power and Centrality: A Family of Measures. American Journal of Sociology, 92(5):1170–1182, 1987.
-  K. Stephenson and M. Zelen. Rethinking centrality: Methods and examples. Social Networks, 11(1):1–37, March 1989.
-  R. B. Rothenberg, J. J. Potterat, D. E. Woodhouse, W. W. Darrow, S. Q. Muth, and A. S. Klovdahl. Choosing a centrality measure: Epidemiologic correlates in the Colorado Springs study of social networks. Social Networks, 17(3-4):273–297, July 1995.
-  T. W. Valente and R. K. Foreman. Integration and radiality: Measuring the extent of an individual’s connectedness and reachability in a network. Social Networks, 20(1):89–105, January 1998.
-  M. E. J. Newman. A measure of betweenness centrality based on random walks. Social Networks, 27(1):39–54, January 2005.
-  E. Estrada and J. A. Rodríguez-Velázquez. Subgraph centrality in complex networks. Phys. Rev. E, 71:056103, May 2005.
-  S. P. Borgatti. Centrality and network flow. Social Networks, 27(1):55–71, January 2005.
-  L. da F. Costa, F. A. Rodrigues, G. Travieso, and P. R. Villas Boas. Characterization of complex networks: A survey of measurements. Advances in Physics, 56(1):167–242, 2007.
-  E. Estrada and N. Hatano. Communicability in complex networks. Phys. Rev. E, 77:036111, Mar 2008.
-  R. Guimerà, S. Mossa, A. Turtschi, and L. A. N. Amaral. The worldwide air transportation network: Anomalous centrality, community structure, and cities’ global roles. Proceedings of the National Academy of Sciences, 102(22):7794–7799, May 2005.
-  D. Koschützki, K. A. Lehmann, L. Peeters, S. Richter, D. Tenfelde-Podehl, and O. Zlotowski. Centrality Indices. In Ulrik Brandes and Thomas Erlebach, editors, Network Analysis, volume 3418 of Lecture Notes in Computer Science, pages 16–61. Springer Berlin Heidelberg, 2005.
-  D. Koschützki and F. Schreiber. Centrality analysis methods for biological networks and their application to gene regulatory networks. Gene regulation and systems biology, 2:193–201, 2008.
-  M. Kitsak, L. K. Gallos, S. Havlin, F. Liljeros, L. Muchnik, H. E. Stanley, and H. A. Makse. Identification of influential spreaders in complex networks. Nature Physics, 6(11):888–893, August 2010.
-  S. D. Pauls and D. Remondini. Measures of centrality based on the spectrum of the Laplacian. Physical Review E, 85:066127+, June 2012.
-  K. Avrachenkov, N. Litvak, V. Medyanikov, and M. Sokol. Alpha current flow betweenness centrality, August 2013.
-  X. Molinero, F. Riquelme, and M. Serna. Power indices of influence games and new centrality measures for social networks, June 2013.
-  M. G. Campiteli, A. J. Holanda, L. D. H. Soares, P. R. C. Soles, and O. Kinouchi. Lobby index as a network centrality measure, June 2013.
-  M. Benzi and C. Klymko. Total communicability as a centrality measure, February 2013.
-  T. W. Valente, K. Coronges, C. Lakon, and E. Costenbader. How Correlated Are Network Centrality Measures? Connections (Toronto, Ont.), 28(1):16–26, January 2008.
-  J. K. Ochab. Maximal-entropy random walk unifies centrality measures. Physical Review E, 86(6), August 2012.
-  S. Iyer, T. Killingback, B. Sundaram, and Z. Wang. Attack Robustness and Centrality of Complex Networks. PLoS ONE, 8(4):e59613+, April 2013.
-  U. Brandes and D. Fleischer. Centrality measures based on current flow. In Volker Diekert and Bruno Durand, editors, STACS 2005, volume 3404 of Lecture Notes in Computer Science, pages 533–544. Springer Berlin Heidelberg, 2005.
-  W. W. Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33(4):452–473, 1977.
-  D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behavioral Ecology and Sociobiology, 54(4):396–405, 2003.
-  M. E. J. Newman. The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences, 98(2):404–409, 2001.
-  M. E. J. Newman. Scientific collaboration networks. i. network construction and fundamental results. Phys. Rev. E, 64:016131, Jun 2001.
-  M. E. J. Newman. Scientific collaboration networks. ii. shortest paths, weighted networks, and centrality. Phys. Rev. E, 64:016132, Jun 2001.
-  M. E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E, 74:036104, Sep 2006.
-  V. Krebs. http://www.orgnet.com. Unpublished.
-  D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, 1998.
-  P. Erdős and A. Rényi. On random graphs. Publicationes Mathematicae, 6:290–297, 1959.
-  A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999.
-  H. Abdi and L. J. Williams. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433–459, 2010.
-  R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: simple building blocks of complex networks. Science, 298(5594):824–827, 2002.
-  D. Bu, Y. Zhao, L. Cai, H. Xue, X. Zhu, H. Lu, J. Zhang, S. Sun, L. Ling, N. Zhang, et al. Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic Acids Research, 31(9):2443–2450, 2003.
-  R. Pastor-Satorras, E. Smith, and R. V. Solé. Evolving protein interaction networks through gene duplication. Journal of Theoretical Biology, 222(2):199–210, 2003.