Accounting for the Role of Long Walks on Networks via a New Matrix Function

# Accounting for the Role of Long Walks on Networks via a New Matrix Function

Ernesto Estrada and Grant Silver Department of Mathematics & Statistics, University of Strathclyde, 26 Richmond Street, Glasgow G1 1XQ, UK
###### Abstract

We introduce a new matrix function for studying graphs and real-world networks based on a double-factorial penalization of walks between nodes in a graph. This new matrix function is based on the matrix error function. We find a very good approximation of this function using a matrix hyperbolic tangent function. We derive a communicability function, a subgraph centrality and a double-factorial Estrada index based on this new matrix function. We obtain upper and lower bounds for the double-factorial Estrada index of graphs, showing that they are similar to those of the single-factorial Estrada index. We then compare these indices with the single-factorial one for simple graphs and real-world networks. We conclude that for networks containing chordless cycles—holes—the two penalization schemes produce significantly different results. In particular, we study two series of real-world networks representing urban street networks, and protein residue networks. We observe that the subgraph centrality based on both indices produce significantly different ranking of the nodes. The use of the double factorial penalization of walks opens new possibilities for studying important structural properties of real-world networks where long-walks play a fundamental role, such as the cases of networks containing chordless cycles.

###### keywords:
matrix error functions; matrix tanh function; communicability functions; double-factorial; chordless cycles; complex networks

## 1 Introduction

 G=∞∑k=0ckAk, (1.1)

where the coefficients are responsible of giving more weight to shorter walks. The most popular of these communicability functions is the one derived from the scaling of , which gives rise to the exponential of the adjacency matrix (see further for definitions). This function, and the graph-theoretic invariants derived from it, have been widely applied in practical problems covering a wide range of areas. Just to mention a few, the communicability function is used for studying real-world brain networks and the effects of diseases on the normal functioning of the human brain Brain networks (); brain networks_2 (). On the other hand, the so-called subgraph centrality Subgraph Centrality ()—a sort of self-communicability of a node in a graph—has been used to detect essential proteins in protein-protein interaction networks PIN_1 (); PIN_2 (). The network bipartivity—a measure derived from the use of the self-communicability—has found applications ranging from detection of cracks in granular material Tordesillas (), to the stability of fullerenes fullerene (), and transportation efficiency of airline networks airlines ().

A typical question when studying the structural indices derived from (1.1) when using is whether or not we are penalizing the longer routes in the graph too heavily (see Preliminaries for formal definitions) Zooming in and out (). To understand this problem let us consider the communicability function between the nodes and in the graph:

 Gpq=∞∑k=0ck(Ak)pq, (1.2)

where gives the number of routes of length between these two nodes. Then, when we use a route of length 2 is penalized by 1/2 and a walk of length 3 is penalized by 1/6. However, a walk of length 5 is already penalized by , which could be seen as a very heavy penalization for a relatively short walk between these two nodes. This means that the longer walks connecting two nodes make a little contribution to the communicability function. If we consider the function accounting for the self-returning walks starting (and ending) at a given node , a heavy penalization of longer walks means that this index is mainly dependent on the degree of the corresponding node, i.e., the number of edges incident to it. That is,

 Gpp=1+c2kp+∞∑k=3ck(Ak)pp, (1.3)

where is the degree of the node . Then, the main question here is to study whether using coefficients that do not penalize the longer walks as heavily will reveal some structural information of networks which is important in practical applications of these indices.

Here we consider the use of a double-factorial penalization 1/ DF () of walks as a way to increase the contribution of longer walks in communicability-based functions for graphs and real-world networks. The goal of this paper is two-fold. First, we want to investigate whether this new penalization of walks produces structural indices that are significantly different from the ones derived from the factorial penalization. The other goal is to investigate whether the information contained in longer walks is of significant relevance for describing the structure of graphs and real-world networks. While in the first case we can obtain analytical results that account for the similarities and differences among the two penalization schemes, in the second case we need to use some kind of indirect inference. That is, we aim to explore some practical applications of the indices derived from these two schemes and show whether or not there are significant advantages when using one or the other for solving such practical problems. In the current work we have strong evidences that the contributions of long walks in networks is very important for such graphs containing chordless cycles—also known as holes. In particular we have considered a centrality index based on single- as well as on double-factorial penalization of the walks, and observed that there are significant differences in the ranking of the nodes when the graphs contain such kind of topological features. Chordless cycles are ubiquitous in certain scenarios, such as urban street networks and protein residue networks, which are both studied here. In addition, these chordless cycles are undesired features in certain networks like sensor networks, mobile phone networks and other communication systems, where they represent zones of no coverage of the signals.

## 2 Preliminaries

We consider in this work simple, undirected and connected graphs with nodes (vertices) and edges. A walk of length in is a set of nodes such that for all , . A closed walk is a walk for which . Let be the adjacency operator on , namely . For simple finite graphs is the symmetric adjacency matrix of the graph. In the particular case of an undirected network as the ones studied here, the associated adjacency matrix is symmetric, and thus its eigenvalues are real. We label the eigenvalues of in non-increasing order: . Since is a real-valued, symmetric matrix, we can decompose it as

 A=UΛUT, (2.1)

where is a diagonal matrix containing the eigenvalues of and is the matrix containing the orthonormalized eigenvectors associated with as its columns. The graphs considered here are connected, therefore is irreducible and from the Perron-Frobenius theorem we can deduce that and that the leading eigenvector , which will sometimes be referred to as the Perron vector, can be chosen such that its components are positive for all .

The degree of a node is the number of edges incident to that node. The graph density is defined as

 d=2mn(n−1), (2.2)

where is the number of edges in the graph.

The so-called ’exponential’ communicability function Communicability (); Estrada Higham (); Estrada Hatano Benzi () is defined for a pair of nodes and on as

 Gpq=∞∑k=0(Ak)pqk!=(exp(A))pq=n∑j=1eλj→ψj,p→ψj,q. (2.3)

The terms of the communicability function characterize the degree of participation of a node in all subgraphs of the network, giving more weight to the smaller ones. Thus, it is known as the subgraph centrality of the corresponding node Subgraph Centrality (). The global structural index defined by

 EE(G)=tr(exp(A))=n∑j=1eλj, (2.4)

is known as the Estrada index of the graph. The indices have been generalized by the use of a parameter in the matrix function following the work of Temperature ().

 Gpq(β)=∞∑k=0(βkAk)pqk!=(exp(βA))pq. (2.5)

## 3 Double-Factorial Penalization of Network Walks

Let us start this section by recalling what the double-factorial is. Let be a positive integer, then the double-factorial is defined by

 k!!=⎧⎪⎨⎪⎩k(k−2)(k−4)...3.1k odd k(k−2)(k−4)...4.2k even1k=−1,0. (3.1)

As a variation of the factorial the double-factorial appears very suitable for use as the penalization factor of the number of walks of length in the definition of communicability functions. Other functions have been used in the past for changing the heavy penalization imposed by the single factorial. For instance, the use of where , has been used since the introduction of the Katz index in 1953 Katz centrality (). It is well-known that in this case the Eq. (2.5) converges to the resolvent of the adjacency matrix, i.e., . Another choice of the coefficient in the Eq. (1.2) is to consider for some Zooming in and out (). In this case the function (1.2) converges to Zooming in and out ():

 [At(I+AeA−eA)]pq. (3.2)

The main problem with the two functions previously mentioned is that they are parametric. In the first case we should select the parameter that is more appropriate for each individual problem. It should be noticed that for very big networks, where the range of this parameter is very narrow leaving very little choice for its variation. In the second case we also need to select the parameter for each particular problem. Then, our consideration here is the selection of a penalization which is not as heavy as the single factorial but that does not contain any parameter. To see the main differences and similarities with the other penalization discussed before let us consider the terms , where every walk of length is penalized by and let us compare it with the penalization of and . To give a simple example we consider a graph having nodes and edges and show in Figure 3.1 the values of for . As can be seen in Figure 3.1 there are significant difference among the three kinds of penalization of walks. The single factorial, and all display very similar behaviour, with very quick decay for relatively small values of . The use of shows a smoother decay with the increase of (notice that the plot is semi-log scale). However, for values the double-factorial penalizes the walks less heavily than this modified factorial. As can be seen from this Figure, the double-factorial does not penalize the long walks as heavily as the other penalization coefficients, which may retain some important structural information of graphs and networks. Hereafter, we will concentrate our analysis and comparison between the double and the single-factorial penalization of walks in graphs/networks. Figure 3.1: Comparison of the decay of ck⋅tr(Ak) for different coefficients ck (see text for discussion). The y-axis is expressed in logarithmic form.

As we are interested in defining matrix functions that allow us to calculate several graph invariants, we start by proving the following result.

###### Lemma 1.

Let be the adjacency matrix of a simple graph . Then

 ∞∑k=0Akk!!=12[√2πerf(A√2)+2I]exp(A22), (3.3)

where is the identity matrix and is the matrix error function of erf ().

###### Proof.

Let us consider the spectral decomposition (2.1). Then we can write

 (∞∑k=0Akk!!)pq = ∞∑k=0n∑j=1ψj,pψj,qλkjk!! = n∑j=1ψj,pψj,q∞∑k=1λkjk!! = 12n∑j=1ψj,pψj,qexp(λ2j/2)[√2πerf(λj√2)+2],

which can be written in the matrix form (3.3), proving the result.

From the computational point of view the main problem for obtaining (3.3) is provided by the calculation of the matrix error function. In order to circumvent this difficulty we make use here of the remarkable similarity between and (see Figure 3.2(a)). As can be seen in Figure 3.2(a) there is a gap between the functions in the interval . We can definitively improve the similarity between the two function in the following way. The function is odd and so its integral from to is zero. Then, we will consider the integral

 ∫∞0[erf(x)−tanh(kx)], (3.4)

which we will make equal to zero as a way to minimize the difference between the two functions. In other words, we will find the value of for which (3.4) is zero. Mathematically,

 lima→∞∫a0[erf(x)−tanh(kx)]=0,

which after integration becomes

 lima→∞[aerf(a)−ln(cosh(ka))k]=1√π.

Using the relation between the hyperbolic cosine and the exponential we have

 lima→∞[aerf(a)−ln(eka+e−ka)k]+ln(2)k=1√π

As grows to infinity, will vanish and . Then

 lima→∞[a−ln(eka)]+ln(2)k=1√π,

 ln(2)k=1√π.

Which gives us the result of . That is, minimizes the gap between the two functions as can be seen in Figure 3.2(b). Consequently, we define the matrix function

 D′(A) = ∞∑k=0Akk!!=12[√2πerf(A√2)+2I]exp(A22), (3.5) ≃ 12[√2πtanh(kA√2)+2I]exp(A22), (3.6)

where

 tanh(kA)=ekA−e−kAekA+e−kA.

Hereafter, we define the function

 D(A) = 12[√2πtanh(kA√2)+2I]exp(A22), (3.7)

where we use instead of . It represents an approximate double-factorial or quasi-double-factorial function, but for the sake of simplicity we will simply refer to it generalically as the double-factorial approach. We then define the following indices that will be studied in this work. Figure 3.2: (a) Illustration of the similarities between erf(x) (solid blue line) and tanh(x) (broken red line). (b) Similar comparison between erf(x) (solid blue line) and tanh(kx) (broken red line) for k=√πlog(2).
###### Definition 2.

Let and be any two nodes of the graph The double-factorial communicability between these two nodes is defined by Similarly, the term will be called the double-factorial subgraph centrality of the node and the double-factorial Estrada index of

The generalization of the new matrix function and the indices derived from it lead naturally to considering the following parameter . That is, in general we can consider

 D(A,β) = 12[√2πtanh(kβA√2)+2I]exp(β2A22), (3.8)

and the corresponding indices , , and Hereafter every time that we write , , and it should be understood that .

## 4 Properties of Γ(G)

In this section we study some of the mathematical properties of the indices derived from the new matrix function In particular, we consider bounds for the double-factorial Estrada index of graphs. In this section we consider that , but the results are trivially extended for any

###### Proposition 3.

Let be a simple connected graph on nodes. Then, the double-factorial Estrada index of is bounded as follows

 Γ(G) ≤ 12(√2πtanh(k(n−1)√2)+2)exp((n−1)22) +(n−1)2(√2πtanh(−k√2)+2)exp(12),

with equality if and only if the graph is complete.

###### Proof.

Let be an edge of and assume that is not trivial, i.e., it contains at least one edge. Let be the graph resulting from removing the edge from . Let be the number of closed walks of length in . Then, , where is the number of closed walks of length in which contain the edge . Consequently,

 n∑p=1(∞∑k=0μk(G−l)k!!)pp≤n∑p=1(∞∑k=0μk(G)k!!)pp,

which means that with equality if the graph is the complete graph won vertices. We now obtain the formula for The spectrum of is with multiplicity one and with multiplicity from which the result immediately appears.

###### Corollary 4.

Let be a graph and let be a spanning tree of . Then

 Γ(G)≥Γ(T). (4.2)

In the next part of this section we will find a lower bound for the double-factorial Estrada index of graphs. First, we find an expression for this index for the path graph , which will be needed for proving the lower bound.

###### Lemma 5.

Let be a path with nodes. Then, when

 Γ(Pn)=eI0(1)(n+12)−e22+o(n), (4.3)

where are modified Bessel functions of the first kind Bessel ().

###### Proof.

Let be number of nodes in

 Γ(Pn) = n∑p=1Γpp(Pn). (4.4)

By substituting the eigenvalues and eigenvectors of the path graph into the expression for we obtain

 Γpp(Pn) = 2n+1n∑j=1sin2(jπpn+1)exp(2cos2(jπn+1)) (4.5) = (4.6) = (4.7)

Now, when the summation in (4.7) can be evaluated by making use of the following integral

 Γpp(Pn)=eπ∫π0exp(cosθ)dθ−eπ∫π0cos(pθ)exp(cosθ)dθ+o(n), (4.8)

where . Thus, when we have

 Γpp(Pn)=e(I0(1)−Ip(1))+o(n). (4.9)

We then have

 Γ(Pn)=eI0(1)n−en∑p=1Ip(1)+o(n). (4.10)

Replacing the sum we finally obtain the result when .

Now, we can find the lower bound for the double-factorial Estrada index of graphs.

###### Lemma 6.

Let be a graph with nodes. Then,

 Γ(G)≥eI0(1)(n+12)−e22, (4.11)

with equality if and only if is the path graph , where are modified Bessel functions of the first kind Bessel ().

###### Proof.

Let be a tree with n nodes

 Γ(T) = n∑j=1exp(λ2j/2) (4.12) = n+∑jλ2j/2+∑j(λ2j/2)22!+∑j(λ2j/2)33!+⋯ (4.13) ≥ n+m+m22+m36+m424=n+4∑k=1(n−1)kk=F(n) (4.14)

where is the number of edges in the path graph. It is easy to show that for

 Γ(T)≥F(n)≥eI0(1)(n+12)−e22=Γ(Pn). (4.15)

Using Corollary 4 we easily see that , which proves the result.

In closing, the double-factorial Estrada index of graphs is bounded , which is similar to . In the next section we will see that the two indices display significant differences when used to analyze graphs containing significantly large chordless cycles or holes.

## 5 Graphs with holes

We now consider a graph and and two nodes and in . Suppose that all the number of subgraphs of sizes smaller than a certain value to which the node belongs to is larger than that for the node . Also consider that the node is involved in a larger number of subgraphs of size larger than than the node . This situation can be found in any graph containing holes. A hole is a chordless cycle, that is a closed sequence of nodes in such that each two adjacent nodes in the sequence are connected by an edge and each two non-adjacent nodes in the sequence are not connected by any edge in . Then, the situation previously described can appear when one of the nodes is in a chordless cycle and the other not. An example of this situation is represented in Figure 5.1. Here node takes place, for instance in 3 triangles, while node takes place in 6. However, the number of walks of length larger than 17 is bigger for node , which indeed is part of a chordless cycle of length 18, than for the node . Figure 5.1: Illustration of a triangular lattice with 27 nodes. The nodes marked in red belong to a chordless cycle of length 18, which is highlighted with bolded edges. The nodes A and B are discussed in the main text.

Now, let us express mathematically the situation that we have described in the precedent paragraph. That is, for all , and for all , where . Let us now consider the difference between the double factorial subgraph centrality and the single-factorial version of it for the node ,

 Δp = Γpp(G)−EEpp(G) (5.1) = 16(A3)pp+112(A4)pp+7120(A5)pp+7360(A6)pp+⋯ (5.2) = ∞∑k=3(k−1)!!−1k!!(k−1)!!(Ak)pp. (5.3)

Then, in the situation previously described it is plausible that the functions and follow similar trends to the spectral moments of the adjacency matrix. That is, for all , and for all .

For the example illustrated in Figure 5.1 we give in Figure 5.2 the plots of and for the nodes labelled as and in Figure 5.1, as well as the plot of and As can be seen the node , which is in the chordless cycle, has smaller contribution from small subgraphs than node , which is outside the hole. However, node takes place in longer subgraphs, such as its own chordless cycle, than node and consequently for . Figure 5.2: Values of ck(Ak)pp for ck=1/k! and ck=1/k!! for different values of k in the triangular lattice illustrated in Figure 5.1 for two nodes. Blue (continuous) line is for the node B and red (broken) line is for the node A. The panel on the right illustrates the values of the differences between both types of penalizations for the two studied nodes (see text for details).

The consequences of the previous kind of situation is that there is a different ranking of the nodes according to the subgraph centrality (or communicability) based on the single and double-factorial penalization. For instance, according to the single factorial penalization and . That is, the node is more central than node .. However, according to the double-factorial penalization we obtain and , which indicates that indeed the node is more central than node . In many cases this difference in ranking is observed for many pairs of nodes in a network, which produces a lack of correlation between the corresponding parameters. In Figure 5.3 we illustrate the correlations between the subgraph centralities (left panel) and communicability (right panel) for the nodes and pairs of nodes, respectively, in the network illustrated in Figure 5.1. Figure 5.3: Scatterplot of the subgraph centrality (left panel) and communicability (right panel) based on the single and double-factorial penalization for all the nodes and pairs of nodes, respectively, in the graph illustrated in Figure 5.3.

The graph previously considered is a triangular lattice, which is a planar graph. The situation previously considered where a chordless cycle appears can be seen frequently in these type of graphs and it is consequently of importance for studying certain kinds of real-world networks as we will see in the next section. However, such kinds of examples are not exclusive to planar graphs. As a simple illustration we destroy the planarity of the graph in Figure 5.3 by adding a few edges but keeping the same chordless cycle as in the original triangular lattice. As can be seen in Figure 5.4 there is a total lack of correlation between the communicability obtained with the single and double factorial for all pairs of nodes in this graph. We will show more realistic examples from real-world networks in the next section of the paper. Figure 5.4: (Top) Nonplanar graph obtained from the triangular lattice with 27 nodes illustrated in 5.3. The nodes marked in red belong to a chordless cycle of length 18, which is highlighted with bolded edges. (bottom) Scatterplot of the communicability based on the single and double factorial for all pairs of nodes for the graph in the top panel.

## 6 Analysis of Real-World Networks

### 6.1 General analysis

An important problem to be considered in practical applications is that the entries of grow very fast with in large graphs with relatively high density. Although most real-world networks are sparse, the calculation of indices based on can be affected by the presence of these very large numbers. For instance, for a network representing the synaptic connections among the neurons of the worm C. elegans, which has nodes and edge density the entries of are bigger than , which far exceeds the largest finite floating-point number in IEEE single precision (), but is still below the largest finite floating-point number in IEEE double precision (). However, for the network representing the USA system of airports having nodes and the entries of exceed this maximum floating number and a program like Matlab® returns infinity for all its entries. In those cases the adjacency matrix can be multiplied by in order to reduce the magnitude of the entries of as we will illustrate in some of the examples in this section. We then study the influence of this parameter on the function .

In this subsection we study networks that do not contain significantly large chordless cycles. We now conduct a computational study of the index of all connected graphs with nodes and compare it with the index for values . In Figure 6.1(a) we illustrate the correlation between the two indices for that show the existence of a power-law relation between them. However, by zooming into the smallest valued region of the indices—this region corresponds to graphs with relatively low density of edges—it is revealed that such a correlation between the two indices is far from being simple (see (6.1(b)). This reveals the fact that decreasing the penalization of the walks in graphs from the factorial to the double-factorial make non-trivial changes in the ordering of the graphs, particularly for graphs with relatively low density of edges. This is very important as most real-world networks are sparse and we should expect significant differences between the two different indices for them. More interestingly, we plot the two indices for in (6.1(c)) where it can be observed that the correlation between the two indices have now been dramatically decreased. Figure 6.1: (a) Scatterplot of the indices EE(G,β) and Γ(G,β) for all the 11,117 connected graphs on 8 nodes using β=1. (b) Magnified plot of the region 100≤Γ(G)≤1000 for the same plot as in (a). (c) The same as in (a) but using β=0.1.

In order to understand this decay in the correlation between the two indices we express them in terms of the eigenvalues of the adjacency matrix:

 EE(G,β)=n∑j=1exp(βλj), (6.1)
 Γ(G,β)=n∑j=1exp(β2λ2j2)+√π2n∑j=1tanh(kβλj√2)exp(β2λ2j2). (6.2)

It is easy to see that when both indices are dominated by the principal eigenvalue (spectral radius) of the adjacency matrix, i.e.,

 EE(G,β→∞)=exp(βλ1), (6.3)
 Γ(G,β→∞) = exp(β2λ212)+√π2tanh(kβλ1√2)% exp(β2λ212) (6.4) ≃ (1+√π2)exp(β2λ212), (6.5)

due to the fact that for . Then, it is evident that both indices are highly correlated. On the contrary, when , the second term of (6.2) becomes more relevant. First of all, in this case the term is smaller than one for many of the eigenvalues of the network, which means that a larger number of eigenvalues and not only those close to zero make a contribution to this part of the function. Although the first term of (6.2) may still correlate with , the second term does not, which results in a lack of global correlation between and when . This lack of correlation for small values of will be useful for the study of some of the indices derived from the matrix function .

We now study a group of 61 real-world networks representing social, biological, ecological, infrastructural and technological systems. We first illustrate the correlation between and when . To avoid size effects we normalized both indices by dividing their logarithms by the number of nodes. As can be seen in Figure 6.2 in general there is a good correlation between the two indices except for a few networks–about one third of the total number of networks studied—which display large deviations from the linear trend observed. That is, there are 19 networks for which is significantly larger than expected from the linear correlation between this index and . Excluding these 19 outliers the Pearson correlation coefficient between the two indices is . We have calculated the average Watts-Strogatz clustering coefficient and the global transitivity of all the network studied. They have average values for the 61 networks studied of and , respectively. However, if we consider the networks that deviate significantly from the linear correlation between the two indices, the clustering coefficients have average values of and , respectively, which are much higher than the average observed for the total networks studies. Indeed, if we only consider those networks for which there is a perfect fit between the two indices studied we obtain average clustering coefficients of and , which confirms that the ‘anomalous’ behavior is observed for networks with the highest clustering coefficients among all the networks studied.

If we consider the difference between the two indices studied we have

 Γ(G,β)−EE(G,β)=β36tr(A3)+β412tr(A4)+7β5120tr(A5)+⋯, (6.6)

which clearly indicates that the first term is the one having the largest contribution. We recall that where is the number of triangles. Then, when the number of triangles has the largest influence in the difference between the two indices. Consequently, those networks having the largest clustering—which account for the relative abundance of triangles—display the largest difference between the two indices among all the networks studied. Figure 6.2: Correlation between the logarithms of Γ(G,β=0.2) and EE(G,β=0.2) normalized by the number of nodes for 61 real-world networks.

### 6.2 Centrality

One of the most important uses of matrix functions in the study of networks is the definition of centrality indices. The double-factorial subgraph centrality is defined as the diagonal entries of the matrix function , which can be expressed in terms of the eigenvalues and eigenvectors of the adjacency matrix as

 Γpp(G,β) =n∑j=1ψ2j,p[√π2tanh(kβλj√2)+1]exp(β2λ2j2). (6.7)

The way the double-factorial Estrada index is correlated to for large values of is similar to how the double-factorial subgraph centrality is also correlated to the subgraph centrality for large values of . In Figure 6.3 we illustrate the correlations between the subgraph centralities and for the protein-protein interaction network of yeast (top plots) and the network of directors in the corporate elite in US (bottom plots). As can be seen in the plots on the left hand side of the Figure, for there is a very good linear relation between both centralities as expected from the fact that they can both be approximated by

 Gpp(G,β→∞)=ψ21,pexp(βλ1), (6.8)
 Γpp(G,β→∞) = ψ21,pexp(β2λ212) (6.9) + √π2tanh(kβλ1√2)exp(β2λ212) (6.10) ≃ (1+√π2)ψ21,pexp(β2λ212). (6.11)

However, when , as in the right hand side plots of Figure 6.3, this correlation disappears and both indices differ significantly for several nodes in these networks. The reason for this difference is analogous to the one explained in the previous section for the corresponding Estrada indices. Figure 6.3: Correlation between the subgraph centrality based on the exponential matrix function Gpp(G,β) and on the matrix function D(A), Γpp(G,β) for the protein-protein interaction network of yeast (top plots) and the network of directors in the corporate elite in US (bottom plots). The plots on the left are for β=1 and on the right for β=0.1.

In order to study how significant the differences between and are for relevant network properties we consider the following problem. We consider the identification of essential proteins in the protein-protein interaction network of yeast PIN_1 (). Essential proteins are those for which if the corresponding gene is knocked out the entire cell dies. Thus, they are considered to be essential for the survival of the corresponding organisms. In this case we study how many essential proteins exists in the top 10% of proteins ranked according to a given centrality index. The hypothesis behind this experiment is that the most central proteins have higher probability of being essential. Consequently, we rank all the proteins in the yeast PIN according to and for with step . We then select the top 10% of these proteins and count how many of them are essential. The results for the two centrality indices considered here are illustrated in Figure 6.4(a) where it can be seen that both indices reach the same maximum number of 115 essential proteins identified. However, while reaches this maximum for , the maximum is reached by for . In the Figure 6.4(b) we illustrate the receiving operating characteristic (ROC) for the classification of the essential proteins using both indices. Figure 6.4: (a) Number of essential proteins identified by Gpp(G,β=0.5) (red broken curve) and by Γpp(G,β=0.18) (blue continuous line). (b) Illustration of the ROC curves for the classification of essential proteins in yeast protein interaction network (PPI) using Gpp(G,β=0.5) (red broken line) and Γpp(G,β=0.18) (blue continuous line).

Apart from the visual similarities which are evident from a simple inspection of the curves, the quantitative analysis also indicates that there are no significant differences in the quality of the classification using these indices. For instance, the area below the curves for the classification of essential proteins in yeast protein interaction network (PPI) using and are both 0.69. We can see in (6.5), the indices highly correlate for higher values of and also for small , provided they are close together. In closing, there are no significant differences in the quality of the classification models using and when the appropriate values of are considered. For the sake of comparison we give the values of essential proteins identified by other centrality indices: eigenvector (97); degree (86); closeness (77); betweenness (71) and a random ranking of the proteins identifies 52 essential proteins. Figure 6.5: Correlation between the diagonal entries of the protein-protein interaction network of yeast based on the exponential matrix function Gpp(G,β1) and on the matrix function D(A), Γpp(G,β2) for β1,β2 between 0 and 0.75 (step size of 0.01).

The main conclusion of this subsection is that in the case of networks that do not contain significantly large chordless cycles, the double factorial penalization of walks can produce similar results as the indices using single-factorial one when the change of the parameter is allowed. In the next section we will explore the differences observed between these two schemes for networks containing significantly large holes in their structures.

### 6.3 Networks with holes

As we have analyzed in Section 5 the graphs containing holes, i.e., chordless cycles, in their structures display a different ranking of the nodes according to the indices developed from the single- and the double-factorial penalization of walks. This situation is frequently observed in real-world networks. A couple of very typical examples are the urban street networks spatial networks () and the protein residue networks (see for instance EstradaBook ()). In the first case, the streets of a city are represented as the edges of the network and their intersections are represented by the nodes. In the second case, the nodes represent the -carbons of the amino acids in the protein and two nodes are connected if the corresponding amino acids are at a distance that allows their physical interaction. In urban street networks—see Figure 6.6—the holes are regions without streets, such as parks, big stores or natural environments like ponds. In the protein residue networks the holes are binding sites—regions in which amino acids are spatially separated to allocate other molecules—for small organic molecules or other proteins. A notable difference between the two systems is that while the first are represented mainly by planar graphs, the second is represented mainly by nonplanar ones. The determination of holes in networks is not a trivial problem and many efforts are directed to this goal due to the importance of these topological features in real-world systems holes in graphs (); chordless (); homotopy (). Information about whether the network contains holes or not can be obtained by means of the so-called “spectral scaling method” spectral scaling (); structural classes (). Figure 6.6: Illustration of the urban street network of the city centres of Bologna (left) and a protein residue network (right). Both networks contain chordless cycles (holes) although the first is planar and the second is nonplanar.

In this Section we compare the ranking of nodes using the subgraph centrality based on single and double-factorial penalization of the walks for a series of urban street networks as well as protein residue networks. For the urban street networks we selected 14 cities from around the World as a representative set. For the protein residue network we selected 14 proteins whose structures have been obtained from x-ray crystallography and deposited in the Protein Data Bank (PDB) PDB (). Each protein is identified with a unique code of one number and three lowercase letters. A fourth letter sometimes appears to designate the name of the chain to which the protein belongs to. We selected proteins of different domains and sizes ranging from 100 to 1,000 amino acids. In order to compare the ranking based on both approaches, i.e., single- and double-factorial subgraph centralities, we use the Spearman rank correlation coefficient. This index indicates how correlated are the ranking of the nodes are according to both indices.

In Table 1 we give the values of the Spearman rank correlation coefficient for all the networks analyzed in this work. For the urban street networks the average rank correlation coefficient is and for the protein residue networks it is . In both cases the rank correlation indicates that the two indices rank the nodes in significantly different ways. For the sake of comparison the Spearman correlation coefficient between the two indices for the network of Internet as an Autonomous System is . This finding—that both indices are not highly rank-correlated—is very important because it indirectly indicates that the double-factorial penalization of walks adds some structural characteristics not described by the single-factorial indices. It is important to remark that in the urban street networks there are rank correlation coefficients as low as and as high as . However, in the protein residue networks the rank correlations are less deviated from the mean. These differences reflect the important fact that cities are in general more heterogeneous than proteins. That is, there are cities with clearly defined holes in their structures while others do not necessarily display such topological features. However, we have previously shown that 95% of representative proteins contain holes indicating that the presence of chordless cycles is a universal property in these systems universality ().

In order to illustrate the differences in the ranking of nodes in a network using both types of centrality indices we consider here the urban street network of Cambridge in the UK and the protein residue network of the protein 1qba. In Figure 6.7 we plot the subgraph centralities of each node in the Cambridge urban street network (top images) and of the protein residue network (bottom images) with radius proportional to the logarithm of the subgraph centrality based on single- (left image) and double-factorial (right image). The logarithm is used here to avoid be fooled by the very large numbers of the subgraph centrality in these networks. As can be seen there are significant differences in the centrality of the nodes based on both approaches. The main difference is that the centrality based on the single-factorial identifies fewer hubs than the double-factorial, which is able to delineate complete regions in the networks. Figure 6.7: Illustration of the subgraph centrality of nodes in the urban street network of Cambridge, UK. The radius of the nodes is proportional to the logarithm of the subgraph centrality based on single- (left image) and double-factorial (right image) penalization of the walks.

Here we have confirmed what we have first analyzed in Section 5, that in networks containing holes, the subgraph centrality and other indices based on walks are significantly different when the single- or double-factorial penalization are used. In general, we observe significantly different ranking of the nodes, with the subgraph centrality based on double-factorial identifying a larger number of central nodes than the one based on single-factorial penalization. These findings are important for the analysis of specific network problems in particular areas of research, as holes mean different things in different contexts.

## 7 Conclusion

We have introduced here a new matrix function for studying graphs and real-world networks. This new matrix function of the adjacency matrix of a graph is based on the double-factorial penalization of walks between nodes in a graph. We have observed here that there are two groups of networks for which the behavior of the indices based on the double-factorial penalization changes with respect to that of the single-factorial one. In the first case we have considered networks where there are no structural holes, such as the case of the protein-protein interaction networks of yeast. In this situation we have observed that by introducing a weighting scheme of the form or the double factorial indices produce similar results as the single-factorial one for the identification of essential proteins in the yeast PIN.

The second group of networks is formed by those containing significant chordless cycles or holes in their structures, such as urban street networks and protein residue networks. In those cases the contribution of long walks is very important, in particular for navigating around such long holes in the network. In those cases we have shown how a centrality index based on single- as well as on the double-factorial penalization of the walks produce significant differences in the ranking of the nodes. We should stress that significantly large chordless cycles are present in a variety of networks and that their study is of major importance in communication systems, where they should be avoided as regions of zero-coverage of the communication signals. Consequently, the new scheme of penalizing walks by using the double-factorial opens new possibilities for the study of many problems in real-world networks.

## 8 Acknowledgment

EE thanks the Royal Society of London for a Wolfson Research Merit Award. GS thanks EPRSC for a PhD scholarship. The authors thank both referees for excellent revision and constructive comments on the manuscript.

## References

• (1) M. Abramovich, I. Stegun, Handbook of Mathematical Functions with Formulas, graphs and mathematical Tables, National Bureau of Standarts, Appl. Math. series 55 (1964).
• (2) F. Arrigo, M. Benzi, Updating and downdating techniques for optimizing network communicability, SIAM J. Sci. Comp. 38 (2016) B25–B49.
• (3) M. Barthélemy, Spatial networks, Phys. Rep. 499 (2011) 1–101.
• (4) M. Benzi, C. Klymko, On the limiting behavior of parameter-dependent network centrality measures, SIAM J. Matrix Anal. Appl. 36 (2015) 686–706.
• (5) M. Benzi, E. Estrada, C. Klymko, Ranking hubs and authorities using matrix functions, Linear Algebra Appl. 438 (2013) 2447–2474.
• (6) H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, P. E. Bourne, The protein data bank, Nucleic Acids Res. 28 (2000) 235–242.
• (7) N. Chandrasekharan, V. S. Lakshmanan, M. Medidi, Efficient parallel algorithms for finding chordless cycles in graphs, Parallel Proc. Lett. 3 (1993) 165–170.
• (8) L. F. Costa, O. N. Oliveira Jr, G. Travieso, F. A. Rodrigues, and P. R. Villas Boas, L. Antiqueira, M. P. Viana, and L. E. Correa Rocha. Analyzing and modeling real-world phenomena with complex networks: a survey of applications, Adv. Phys. 60 (2011) 329–412.
• (9) J.J. Crofts, D.J. Higham, A weighted communicability measure applied to complex brain networks, J. Roy. Soc. Interface 6 (2009) 411–414.
• (10) J.J. Crofts, D.J. Higham, R. Bosnell, S. Jbabdi, P.M. Matthews, T.E.J. Behrens, H. Johansen-Berg, Network analysis detects changes in the contralesional hemisphere following stroke, Neuroimage 54 (2011) 161–169.
• (11) H. Deng, S. Radenković, I. Gutman, The Estrada index, in: D. Cvetković, I. Gutman (Eds.), Applications of Graph Spectra, Math. Inst, Belgrade, 2009, pp. 123–140.
• (12) E. S. Dias, D. Castonguay, H. Longo, W. A. R. Jradi, Efficient enumeration of all chordless cycles in graphs, Comput. Res. Repos. abs/1309.1051 (2013).
• (13) T. Došlić, Bipartivity of fullerene graphs and fullerene stability, Chem. Phys. Lett. 412 (2008) 336–340.
• (14) V. Ejov, J.A. Filar, S.K. Lucas, P. Zograf, Clustering of spectra and fractals of regular graphs, J. Math. Anal. Appl. 333 (2007) 236–246.
• (15) V. Ejov, S. Friedlan, G.T. Nguyen, A note on the graph’s resolvent and the multifilar structure, Linear Algebra Appl. 431 (2009) 1367–1379.
• (16) E. Estrada, D. J. Higham, Network properties revealed through matrix functions, SIAM Rev. 52 (2010) 96–714.
• (17) E. Estrada, J.A. Rodríguez-Velázquez, Subgraph centrality in complex networks, Phys. Rev. E 71 (2005) 056103.
• (18) E. Estrada, N. Hatano, Communicability in complex networks, Phys. Rev. E 77 (2008) 036111.
• (19) E. Estrada, N. Hatano, Statistical-mechanical approach to subgraph centrality in complex networks, Chem. Phys. Lett. 439 (2007) 247–251.
• (20) E. Estrada, Characterization of the folding degree of proteins, Bioinformatics 18 (2002) 697–704.
• (21) E. Estrada, Generalized walks-based centrality measures for complex biological networks, J. Theor. Biol. 263 (2010) 556–565.
• (22) E. Estrada, J. Gómez-Gardeñes, Network bipartivity and the transportation efficiency of European passenger airlines, Physica D 323-324 (2016) 57–63.
• (23) E. Estrada, N. Hatano, M. Benzi, The physics of communicability in complex networks, Phys. Rep. 514 (2012) 89–119.
• (24) E. Estrada, Protein bipartivity and essentiality in the yeast protein–protein interaction network, J. Proteome Res. 5 (2006) 2177–2184.
• (25) E. Estrada, The Structure of Complex Networks. Theory and Applications, Oxford University Press, 2011.
• (26) E. Estrada, Virtual identification of essential proteins within the protein interaction network of yeast, Proteomics 6 (2006) 35–40.
• (27) E. Estrada, Spectral scaling and good expansion properties in complex networks, Europhys. Lett. 73 (2006) 649.
• (28) E. Estrada, Topological structural classes of complex networks, Phys. Rev. E 75 (2007) 016103.
• (29) E. Estrada, Universality in protein residue networks, Biophys. J. 98 (2010) 890–900.
• (30) Q. Fang, J. Gao, L. J. Guibas, Locating and bypassing holes in sensor networks, Mobile Net. Appl. 11 (2006) 187–200.
• (31) H.W. Gould, J. Quaintance, Double fun with double factorials, Mathematics Magazine 85 (2012) 177–192, .
• (32) I. Gutman, H. Deng, and S. Radenković, The Estrada index: an updated survey, in: D. Cvetković, I. Gutman (Eds.), Selected Topics on Applications of Graph Spectra, Math. Inst., Beograd, 2011, pp. 155–174.
• (33) N. J. Higham, Functions of Matrices: Theory and Computation, Society for Industrial and Applied Mathematics, Philadelphia, PA, 2008.
• (34) L. Katz, A new index derived from sociometric data analysis, Psychometrika 18 (1953) 39–43.
• (35) M. E. J. Newman, The structure and function of complex networks, SIAM Rev. 45 (2003) 167–256.
• (36) J.A. de la Peña, I. Gutman, J. Rada, Estimating the Estrada index, Linear Algebra Appl. 427 (2007) 70–76.
• (37) R. Taylor, Solution of the linearized equations of multicomponent mass transfer, Ind. Eng. Chem. Fundam. 21 (1982) 407–413.
• (38) D.M. Walker, A. Tordesillas, Topological evolution in dense granular materials: a complex networks perspective, Int. J. Solids Struct. 47 (2010) 624–639.
• (39) J. Wu, M. Barahona, Y-J. Tan, H-Z. Deng, Robustness of regular ring lattices based on natural connectivity, Int. J. Syst. Sci. 42 (2011) 1085–1092.
• (40) J. Wu, H.-Z. Deng, Y.-J. Tan, D.-Z. Zhu, Vulnerability of complex networks under intentional attack with incomplete information, J. Phys. A: Math. Theor. 40 (2007) 2665.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters   