Centralities in Simplicial Complexes. Applications to Protein Interaction Networks
Complex networks can be used to represent complex systems which originate in the real world. Here we study a transformation of these complex networks into simplicial complexes, where cliques represent the simplices of the complex. We extend the concept of node centrality to that of simplicial centrality and study several mathematical properties of degree, closeness, betweenness, eigenvector, Katz, and subgraph centrality for simplicial complexes. We study the degree distributions of these centralities at the different levels. We also compare and describe the differences between the centralities at the different levels. Using these centralities we study a method for detecting essential proteins in PPI networks of cells and explain the varying abilities of the centrality measures at the different levels in identifying these essential proteins.
Department of Mathematics and Statistics, University of Strathclyde, 26 Richmond Street, Glasgow G1 1HX, UK
There is little doubt that the use of graphs and networks to represent the skeleton of complex systems has been a successful paradigm. This simple representation, in which the nodes of the graph account for the entities of a complex system and the edges describe the interactions between these entities, captures many of the complex structural and dynamical properties of the represented systems. However, such a representation is far from complete. One of its main drawbacks is its concentration on binary relations only. That is, in a network the interaction between entities occurs in a pairwise way. This excludes other higher-order interactions involving groups of entities. Let us provide some examples. Networks have been widely used to represent protein-protein interactions (PPIs), where the nodes represent proteins and pairs of interacting proteins are connected by edges of the network. These PPI networks contain many triangles in which triples of proteins are considered to be interacting with each other. Now, let us consider that there are three proteins $A$, $B$ and $C$ that form a heterotrimer, that is, a complex in which the three proteins interact with each other at the same time. The network-theoretic representation is not able to differentiate this situation from the case where there are three proteins $A$, $B$ and $C$ that interact in a pairwise manner, e.g. $A$-$B$, $B$-$C$, $A$-$C$. The existence of heterotrimers is well documented; an example is the heterotrimeric G protein formed by the three proteins G$\alpha$, G$\beta$ and G$\gamma$. An attempt to amend this problem has been made by using hypergraphs, also known as hypernetworks. In this case, the triple of proteins forms a hyperedge which accounts for the simultaneous interaction of the proteins in the complex. However, hypergraphs have a main drawback when trying to capture all the subtleties of these complexes. For instance, in the heterotrimeric G protein, the proteins G$\beta$ and G$\gamma$ form a subcomplex known as G$\beta\gamma$, which is part of the trimeric form.
This situation is not necessarily captured by the hypergraph representation, where hyperedges are not necessarily closed under the subset operation. Examples of real-world systems where this closure under the subset operation is required abound. A very nice example is given by Maletić and Rajković [Maletić and Rajković, 2012], who attribute it to Spivak: four people have a chat in which everybody can listen to each other. Obviously, the conversation is not pairwise, as represented by the graph, nor is it only in the form of the hyperedge represented by the hypergraph; it is a combination of the quadruple, the triangles and the edges. The best way to represent such situations is by means of the so-called simplicial complexes.
Informally, a simplicial complex is a mathematical object which originated in algebraic topology and is a generalization of a network. Starting with a set of nodes, instead of being limited to sets of size two, the simplices can contain any number of nodes. A characteristic feature of a simplex $S$ of a certain size is that all subsets of $S$ must also be simplices. In this way simplicial complexes differ from hypergraphs. For instance, if there is a simplex $\{a,b,c,d\}$ in a simplicial complex then $\{a,b,c\}$, $\{a,b,d\}$, $\{a,c,d\}$ and $\{b,c,d\}$ are also simplices in the simplicial complex. All subsets of those four simplices must also be simplices in the complex. There is a recent interest in these mathematical objects for representing complex systems, and we should mention here their applications to the study of brain networks [Giusti et al., 2016; Courtney and Bianconi, 2016; Lee et al., 2012; Petri et al., 2014; Pirino et al., 2015], social systems [Maletić and Rajković, 2009; Maletić and Rajković, 2014; Kee et al., 2016], biological networks [Xia and Wei, 2014; Xia and Wei, 2015; Cang et al., 2015], and infrastructural systems [Muhammad and Egerstedt, 2006; Tahbaz-Salehi and Jadbabaie, 2010; De Silva and Ghrist, 2007; De Silva et al., 2005; Ghrist and Muhammad, 2005].
Centrality indices have been among the most successful tools used for discovering structural and dynamical properties of networks. A centrality index is a numeric quantification of the ’importance’ of a node in terms of its position, structural and/or dynamical, in the network. Here, we extend this concept to simplicial complexes to capture the relevance of a simplex of a given order in a simplicial complex. In particular, we apply this extended concept to the study of properties of protein-protein interaction (PPI) networks.
Simplicial complexes have been much studied in the literature [Horak et al., 2009; Sizemore et al., 2016] and definitions similar to those which appear in the preliminaries section can be found elsewhere [Muhammad and Egerstedt, 2006; Tahbaz-Salehi and Jadbabaie, 2010; Muhammad and Jadbabaie, 2007; Maletić and Rajković, 2012; Goldberg, 2002]. However, we repeat them here to make this paper self-contained.
Let $V$ be a set of nodes or vertices. Then a $k$-simplex is a set $\{v_0, v_1, \ldots, v_k\}$ such that $v_i \in V$ and $v_i \neq v_j$ for all $i \neq j$. A face of a $k$-simplex is a $(k-1)$-simplex of the form $\{v_0, \ldots, v_{i-1}, v_{i+1}, \ldots, v_k\}$ for $0 \leq i \leq k$. A simplicial complex $C$ is a collection of simplices such that if a simplex $S$ is a member of $C$ then all faces of $S$ are also members of $C$. Less formally, a simplicial complex is a collection of simplices such that if $S$ is a simplex then all of its faces are also simplices, and all of the faces of its faces are also simplices, and so on down to the $0$-simplices, which are formed just by the nodes. As mentioned previously, networks can be realized as simplicial complexes. The nodes are the $0$-simplices, which are specified by the set $V$, while the edges are the $1$-simplices, and there are no higher order simplices. It is also possible to create simplicial complexes from networks. In this work we will be interested only in the kind of simplicial complexes defined below, which are known as clique complexes. A clique complex is a simplicial complex formed from a network as follows. The nodes of the network become the nodes of the simplicial complex. Let $X$ be a clique of $k+1$ nodes in the network. Then $X$ is a $k$-simplex in the clique complex. As an example, in Figure 1 we illustrate a simplicial complex which has one $3$-simplex and seven $2$-simplices. It also has fourteen $1$-simplices, represented by the edges, and nine $0$-simplices, which are usually known as the nodes.
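The clique-complex construction just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in this work: the function name `clique_complex`, its arguments and the toy network are hypothetical, and simplices are stored as frozensets of node labels.

```python
def clique_complex(nodes, edges, max_dim=2):
    """Return {k: set of k-simplices} for the clique complex of a simple graph.

    Each simplex is a frozenset of nodes; a k-simplex is a clique of k + 1 nodes.
    The graph is assumed to have no self-loops.
    """
    neighbours = {v: set() for v in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    simplices = {0: {frozenset([v]) for v in nodes}}
    for k in range(1, max_dim + 1):
        simplices[k] = set()
        for clique in simplices[k - 1]:
            # A node adjacent to every member of a clique extends it by one node.
            common = set.intersection(*(neighbours[v] for v in clique))
            for v in common:
                simplices[k].add(clique | {v})
    return simplices


# Hypothetical toy network: two triangles sharing the edge {2, 3}, plus a pendant edge.
toy_nodes = [1, 2, 3, 4, 5]
toy_edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)]
toy_complex = clique_complex(toy_nodes, toy_edges)
```

Because every subset of a clique is itself a clique, the closure under taking faces holds automatically for the sets returned here.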
In network theory it is fairly clear when two nodes are adjacent. However, adjacency is less easy to define in simplicial complexes. There are two ways in which two $k$-simplices $\sigma_i$ and $\sigma_j$ can be considered to be adjacent. We call them lower and upper adjacency. Let $\sigma_i$ and $\sigma_j$ be two $k$-simplices. Then, the two $k$-simplices are lower adjacent if they share a common face. That is, two distinct $k$-simplices $\sigma_i$ and $\sigma_j$ are lower adjacent if and only if there is a $(k-1)$-simplex $\tau$ such that $\tau \subset \sigma_i$ and $\tau \subset \sigma_j$. We denote lower adjacency by $\sigma_i \smile \sigma_j$. For instance, in the simplicial complex in Figure 1, two $2$-simplices that share a common $1$-simplex are lower adjacent. However, two $2$-simplices that have only a $0$-simplex in common are not lower adjacent, because they would need to share a common $1$-simplex to be considered lower adjacent. Note that two $0$-simplices can never be lower adjacent, as we do not allow $\emptyset$ to be a $(-1)$-simplex. Let $\sigma_i$ and $\sigma_j$ be two $k$-simplices. Then, the two $k$-simplices are upper adjacent if they are both faces of the same common $(k+1)$-simplex. That is, $\sigma_i$ and $\sigma_j$ are upper adjacent if and only if there is a $(k+1)$-simplex $\tau$ such that $\sigma_i \subset \tau$ and $\sigma_j \subset \tau$. We denote upper adjacency by $\sigma_i \frown \sigma_j$. In the simplicial complex in Figure 1, any two $2$-simplices that are faces of the $3$-simplex are upper adjacent. However, a $k$-simplex that is not a face of any $(k+1)$-simplex is not upper adjacent to any other simplex. Also note that two $0$-simplices are upper adjacent if they are both faces of a $1$-simplex, which is identical to saying that two nodes are adjacent if they are connected by an edge in the network-theoretic sense. Hence upper adjacency of $0$-simplices is the same as network-theoretic adjacency.
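The two adjacency relations can be written as simple set tests. The following is a minimal sketch, assuming simplices are stored as frozensets of node labels; the function names and the example simplices are hypothetical and do not refer to Figure 1.

```python
def lower_adjacent(s, t):
    """True if the distinct k-simplices s and t share a common (k-1)-face."""
    if s == t or len(s) != len(t) or len(s) < 2:
        return False  # in particular, 0-simplices are never lower adjacent
    return len(s & t) == len(s) - 1


def upper_adjacent(s, t, higher_simplices):
    """True if the distinct k-simplices s and t are faces of a common (k+1)-simplex.

    `higher_simplices` is the set of (k+1)-simplices of the complex. If a common
    (k+1)-simplex exists, it must be the union of s and t.
    """
    if s == t or len(s) != len(t):
        return False
    return (s | t) in higher_simplices


# Hypothetical example: two triangles sharing an edge, and a third disjoint one.
t1, t2, t3 = frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({5, 6, 7})
shares_edge = lower_adjacent(t1, t2)
disjoint = lower_adjacent(t1, t3)
inside_tetrahedron = upper_adjacent(t1, t2, {frozenset({1, 2, 3, 4})})
```

The upper-adjacency test relies on the observation that two distinct $k$-simplices contained in one $(k+1)$-simplex must together cover all $k+2$ of its nodes.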
We shall now introduce some families of simplicial complexes which shall be important later in the paper. Firstly, we introduce the family denoted . The simplicial complex consists of a central -simplex which is a face of every one of the -simplices. In addition, there are no other simplices except those necessary by the closure axiom. For instance, would consist of an edge and triangles of the form in addition to all subsimplices necessary by the closure axiom. While, consists of a central node with pendant nodes connected to it, which corresponds to the star graph in graph theory. The simplicial complex is shown in Figure 2(left).
Next we introduce a family of simplicial complexes labeled which consists of a central -simplex with -simplices lower adjacent through one face, -simplices lower adjacent through another, and so on. A -simplex which is lower adjacent to the central -simplex can only be lower adjacent to other -simplices which are lower adjacent to the central -simplex through the same face as itself. There are no other simplices except those necessary by the closure axiom. One member of this family of simplices, is shown in Figure 2(center).
The final family of simplicial complexes which we shall introduce are denoted , consisting of a -simplex at one end which is only adjacent to one other -simplex. This one is only lower adjacent to the first -simplex and another -simplex, and so on until arriving at another end -simplex. In addition, there are -simplices in the simplicial complex and no other simplices except those necessary by the closure axiom. Note that a simplicial complex is the same as a path graph in the traditional network theory. The simplicial complex is illustrated in Figure 2(right).
3 Adjacency Matrices in Simplicial Complexes
The goal of this section is to define a general adjacency matrix for a simplicial complex that allows us to define general centrality indices for these mathematical objects. Based on the previous definitions of lower and upper adjacency relations we define the corresponding adjacency matrices here.
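Let $\sigma_i$ and $\sigma_j$ be two $k$-simplices in a simplicial complex. Then, the lower adjacency matrix $A_k^L$ at the $k$-level in the simplicial complex has entries defined by
$$(A_k^L)_{ij} = \begin{cases} 1 & \text{if } \sigma_i \text{ and } \sigma_j \text{ are lower adjacent}, \\ 0 & \text{if } i = j \text{ or otherwise}, \end{cases}$$
where the index $L$ indicates lower adjacency.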
In a similar way we have the following.
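Let $\sigma_i$ and $\sigma_j$ be two $k$-simplices in a simplicial complex. Then, the upper adjacency matrix $A_k^U$ at the $k$-level in the simplicial complex has entries defined by
$$(A_k^U)_{ij} = \begin{cases} 1 & \text{if } \sigma_i \text{ and } \sigma_j \text{ are upper adjacent}, \\ 0 & \text{if } i = j \text{ or otherwise}, \end{cases}$$
where the index $U$ indicates upper adjacency.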
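If two distinct $k$-simplices $\sigma_i$ and $\sigma_j$, with $k \geq 1$, are upper adjacent then there exists some $(k+1)$-simplex $\tau$ such that $\sigma_i \subset \tau$ and $\sigma_j \subset \tau$. Without loss of generality we have $\sigma_i = \{v_0, v_1, \ldots, v_k\}$ and $\sigma_j = \{v_1, v_2, \ldots, v_{k+1}\}$, and then $\{v_1, \ldots, v_k\} \subset \sigma_i, \sigma_j$. This means that $\sigma_i$ and $\sigma_j$ are also lower adjacent. An alternative proof of this can be found in [Goldberg, 2002].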
The above two definitions for two $k$-simplices to be adjacent lead us to the problem that there are now four possible notions we can use to define a general adjacency matrix for simplicial complexes. The four possibilities are the lower adjacency matrix $A_k^L$, the upper adjacency matrix $A_k^U$, their sum $A_k^L + A_k^U$ and their difference $A_k^L - A_k^U$. Each of these possible definitions of adjacency has pros and cons, as we explain in the next paragraph.
Simply using the lower adjacency matrix $A_k^L$ does not isolate the effects of $k$-simplices from higher order simplices. In particular, for $1$-simplices the lower adjacency matrix simply describes the line graph of the network. The line graph is a transformation of the graph in which the nodes of the line graph are the edges of the graph, and two nodes of the line graph are connected if the corresponding edges in the graph are incident to a common node. On the other hand, using the upper adjacency matrix $A_k^U$ would ignore the effects of any $k$-simplices which are not faces of higher simplices, meaning that there is potential for a lot of information to be missed. For instance, there could be many $2$-simplices (triangles) in a network but not necessarily so many $3$-simplices, in which case the upper adjacency matrix does not identify any of them as adjacent to each other. It is worth noting that the traditional adjacency matrix of a network corresponds to $A_0^U$, although two $0$-simplices cannot be lower adjacent. Using the sum of the two adjacency matrices, $A_k^L + A_k^U$, would emphasize the effects of the higher simplices over the lower ones. However, it would lead to an adjacency matrix which features $2$'s where two simplices are upper adjacent. What we want is an adjacency matrix which indicates whether two simplices are adjacent or not. Thus this would not be appropriate. This leaves us with the difference of the two adjacency matrices, $A_k^L - A_k^U$, as our notion of general adjacency.
For $k \geq 1$, two $k$-simplices are considered adjacent if they are lower adjacent and not upper adjacent. For $k = 0$, two simplices shall be adjacent if they are upper adjacent. We shall denote two $k$-simplices $\sigma_i$, $\sigma_j$ that are adjacent in the way defined here by $\sigma_i \sim \sigma_j$.
This definition allows us to remove most of the effects of higher simplices being adjacent in the adjacency matrix at the lower simplex levels. A consequence of this is that it allows us to analyze the relationships between the centralities of simplices and their faces, which we are particularly interested in at the node level. Secondly, this notion of adjacency lines up nicely with the extensively studied higher order Laplacians of simplicial complexes [Muhammad and Egerstedt, 2006]. An off-diagonal entry of the higher order Laplacian matrix $\mathcal{L}_k$ is non-zero if and only if the corresponding off-diagonal entry of the adjacency matrix $A_k$ is non-zero. This is the definition that shall be used in the rest of this work. Further information on the Hodge Laplacian matrices can be found in [Muhammad and Egerstedt, 2006; Tahbaz-Salehi and Jadbabaie, 2010; Muhammad and Jadbabaie, 2007; Maletić and Rajković, 2012; Goldberg, 2002]. Then we have the following important definition of the adjacency matrix of the simplicial complex.
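Let $\sigma_i$ and $\sigma_j$ be two $k$-simplices in a simplicial complex. Then, for $k \geq 1$ the adjacency matrix $A_k$ at the $k$-level in the simplicial complex has entries defined by
$$(A_k)_{ij} = \begin{cases} 1 & \text{if } \sigma_i \text{ and } \sigma_j \text{ are lower adjacent and not upper adjacent}, \\ 0 & \text{if } i = j \text{ or otherwise}; \end{cases}$$
for $k = 0$ the adjacency matrix $A_0$ shall be given by the upper adjacency matrix.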
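The definition above translates directly into a small routine. The sketch below is an illustration only: the name `simplicial_adjacency`, the ordering of simplices and the toy complex (one filled triangle with a pendant edge) are assumptions, and the matrix is returned as a plain list of lists.

```python
def simplicial_adjacency(k_simplices, higher_simplices):
    """Adjacency matrix between k-simplices, following the definition above.

    For k >= 1 two simplices are adjacent if they are lower adjacent and not
    upper adjacent; for k = 0 they are adjacent if they are upper adjacent.
    Returns the ordered list of simplices and the 0/1 matrix.
    """
    order = sorted(k_simplices, key=sorted)
    n = len(order)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s, t = order[i], order[j]
            upper = (s | t) in higher_simplices
            if len(s) == 1:
                adjacent = upper
            else:
                lower = len(s & t) == len(s) - 1
                adjacent = lower and not upper
            A[i][j] = A[j][i] = int(adjacent)
    return order, A


# Hypothetical toy complex: filled triangle {1, 2, 3} with a pendant edge {3, 4}.
toy_edges = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (3, 4)]}
toy_triangles = {frozenset({1, 2, 3})}
edge_order, A1 = simplicial_adjacency(toy_edges, toy_triangles)
edge_degrees = [sum(row) for row in A1]
```

In this toy complex the three edges of the filled triangle are mutually upper adjacent, so at the edge level they are adjacent only to the pendant edge, when they share a node with it.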
4 Simplicial Shortest Path Distance
In this section we will extend the concept of shortest path distance to the different levels of a simplicial complex. We start by extending the concept of walks to simplicial complexes.
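Let $k \geq 1$. Then, a $k$-walk is a sequence of alternating $k$-simplices and $(k-1)$-simplices $\sigma_1, \tau_1, \sigma_2, \tau_2, \ldots, \tau_l, \sigma_{l+1}$ such that for each $i$, $\tau_i$ is a face of both $\sigma_i$ and $\sigma_{i+1}$, and $\sigma_i$ and $\sigma_{i+1}$ are not both faces of the same $(k+1)$-simplex. For $k = 0$ a walk on the $0$-simplices is just a walk in the normal graph-theoretic sense.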
On the simplicial complex from Figure 1, we have that is an -walk. Meanwhile, is an -walk.
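A $k$-shortest path between two $k$-simplices $\sigma_i, \sigma_j$ is a $k$-walk, $\sigma_i, \tau_1, \sigma_2, \ldots, \tau_l, \sigma_j$, such that $l$ is minimized. The value $l$ is the $k$-shortest path length between the two $k$-simplices $\sigma_i, \sigma_j$. We denote this $d_k(\sigma_i, \sigma_j)$.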
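It can be easily seen that the simplicial shortest path length between two $k$-simplices is a proper distance. By definition $d_k(\sigma_i, \sigma_j) \geq 0$ for all $\sigma_i, \sigma_j \in S_k$, where $S_k$ is the set of $k$-simplices. Clearly $d_k(\sigma_i, \sigma_i) = 0$. To prove $d_k(\sigma_i, \sigma_j) = d_k(\sigma_j, \sigma_i)$, assume $d_k(\sigma_i, \sigma_j) = l$; then the $k$-shortest path from $\sigma_i$ to $\sigma_j$ is of the form $\sigma_i, \tau_1, \sigma_2, \ldots, \tau_l, \sigma_j$. This means that there is a $k$-walk from $\sigma_j$ to $\sigma_i$ of the form $\sigma_j, \tau_l, \ldots, \sigma_2, \tau_1, \sigma_i$. We can then relabel the simplices in order to give a $k$-walk of length $l$ from $\sigma_j$ to $\sigma_i$, thus $d_k(\sigma_j, \sigma_i) \leq l$. If there was a $k$-walk from $\sigma_j$ to $\sigma_i$ shorter than this then there would also be a $k$-walk from $\sigma_i$ to $\sigma_j$ which was shorter than the original walk by symmetric arguments, thus $d_k(\sigma_j, \sigma_i) \geq l$ and $d_k(\sigma_j, \sigma_i) = l$. To prove $d_k(\sigma_i, \sigma_h) \leq d_k(\sigma_i, \sigma_j) + d_k(\sigma_j, \sigma_h)$, let $d_k(\sigma_i, \sigma_j) = l_1$ and $d_k(\sigma_j, \sigma_h) = l_2$. Then there is a $k$-walk of length $l_1$ from $\sigma_i$ to $\sigma_j$ and a $k$-walk of length $l_2$ from $\sigma_j$ to $\sigma_h$. We can combine these, relabelling the simplices in the second walk, to form a $k$-walk of length $l_1 + l_2$ from $\sigma_i$ to $\sigma_h$. This implies that $d_k(\sigma_i, \sigma_h) \leq l_1 + l_2$.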
A simplicial complex is $k$-connected if and only if there does not exist a pair of $k$-simplices $\sigma_i, \sigma_j \in S_k$, where $S_k$ is the set of $k$-simplices, such that $d_k(\sigma_i, \sigma_j) = \infty$.
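Since a $k$-walk moves between adjacent $k$-simplices, the $k$-shortest path distance can be obtained by breadth-first search on the adjacency matrix at the $k$-level. The sketch below assumes that matrix is given as a list of lists of 0/1 entries; the function names and the example matrix are hypothetical.

```python
import math
from collections import deque


def k_shortest_path_lengths(A, source):
    """Breadth-first search distances from simplex `source` to all k-simplices.

    A is the adjacency matrix between k-simplices; unreachable simplices keep
    distance infinity.
    """
    n = len(A)
    dist = [math.inf] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        i = queue.popleft()
        for j in range(n):
            if A[i][j] and dist[j] == math.inf:
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist


def is_k_connected(A):
    """True if every pair of k-simplices is joined by a k-walk."""
    return len(A) == 0 or all(d < math.inf for d in k_shortest_path_lengths(A, 0))


# Hypothetical adjacency matrix at some level k: simplex 0 is isolated.
A_example = [[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 1, 1, 0]]
distances_from_1 = k_shortest_path_lengths(A_example, 1)
```

Checking reachability from a single simplex is sufficient for connectivity because the adjacency relation is symmetric.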
Note that a simplicial complex being -connected does not mean that it is -connected or -connected. The simplicial complex in Figure 1 is -connected but not -connected because and are not adjacent to any of the other -simplices. Many of the real world networks we will introduce in a later section are -connected but not -connected. In addition, a simplicial complex from the family is -connected but it is not -connected. The central -simplex is upper adjacent to every other -simplex and hence is not adjacent to any of them.
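A $k$-connected component of a simplicial complex is a subset $S' \subseteq S_k$ of the $k$-simplices such that for any two $k$-simplices $\sigma_i, \sigma_j \in S'$ we have $d_k(\sigma_i, \sigma_j) < \infty$, and for any $\sigma_i \in S'$ and $\sigma_j \in S_k \setminus S'$ we have that $d_k(\sigma_i, \sigma_j) = \infty$.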
The $k$-eccentricity of a $k$-simplex $\sigma$ is the largest $k$-shortest path distance between $\sigma$ and any other $k$-simplex. The $k$-diameter of a simplicial complex is the maximum $k$-eccentricity of any simplex $\sigma \in S_k$ in the network, where $S_k$ is the set of $k$-simplices. As an example, in the simplicial complex depicted in Figure 2 (center), the central $k$-simplex has $k$-eccentricity $1$ because it is adjacent to all the other $k$-simplices in the complex. However, all the peripheral $k$-simplices have a $k$-eccentricity of $2$, because the shortest path from a peripheral $k$-simplex on one arm to a peripheral $k$-simplex on another is through the central $k$-simplex, for a shortest path of length $2$. This means that this complex has $k$-diameter $2$.
Given a notion of shortest path distance we are now equipped to define the average simplicial shortest path distance. The $k$-average simplicial shortest path length is the average $k$-shortest path distance over all possible pairs of $k$-simplices in the network,
$$\bar{l}_k = \frac{1}{|S_k|\,(|S_k| - 1)} \sum_{\sigma_i \neq \sigma_j} d_k(\sigma_i, \sigma_j),$$
where $S_k$ is the set of $k$-simplices in the network and $d_k(\sigma_i, \sigma_j)$ is the $k$-shortest path distance between $\sigma_i$ and $\sigma_j$. Note that for this measure to make any sense the simplicial complex needs to be $k$-connected. If the simplicial complex is not $k$-connected then we can analyze each $k$-connected component separately. We will now prove bounds on the $k$-average path length. We assume that there are at least two $k$-simplices in the simplicial complex. For $\bar{l}_k$ to be less than $1$ there would need to be two distinct $k$-simplices $\sigma_i, \sigma_j$ such that $d_k(\sigma_i, \sigma_j) < 1$; this would imply $d_k(\sigma_i, \sigma_j) = 0$ and hence $\sigma_i = \sigma_j$ by the properties of a metric, a contradiction. The lower bound is achieved by a simplicial complex from the first family introduced in the preliminaries, in which a central $(k-1)$-simplex is a face of every $k$-simplex. This is easy to check. A simplicial complex of this form consists of a $(k-1)$-simplex $\{v_1, \ldots, v_k\}$ and some $k$-simplices of the form $\{v_1, \ldots, v_k, u_i\}$, where $u_i \notin \{v_1, \ldots, v_k\}$, in addition to all subsimplices necessary by the closure axiom. Hence, all $k$-simplices are lower adjacent to each other through the $(k-1)$-simplex, and they are not upper adjacent to each other because there are no $(k+1)$-simplices. Thus, every $k$-simplex is adjacent to every other $k$-simplex and the $k$-shortest path distance between any two $k$-simplices is $1$. Hence, the $k$-average path length is $1$, which implies that the lower bound of $\bar{l}_k$ is $1$.
A general upper bound of the $k$-average path length $\bar{l}_k$ is hard to establish due to the dependence on the number of simplices, $n$. However, if we fix both $k$ and $n$ then we can prove the following result.
Let $n$ be the number of $k$-simplices. Then, the upper bound of $\bar{l}_k$ is
$$\bar{l}_k \leq \frac{n+1}{3}.$$
Assume that the simplicial complex is -connected and that . If then , the simplicial complex is -connected and there are only 2 -simplices hence they must be adjacent. Thus . In addition . Hence the lemma holds for . Assume that the Lemma holds for . Let then to maximize we need to maximize . Pick a -simplex . First, we will maximize . For for some , first it must be the case that for some such that . This means that the largest possible value of for some is . This gives where represents the th triangle number. Now this implies that there is only one -simplex, such that . This means that is adjacent to precisely one other -simplex, namely . Because is adjacent to only one other simplex, can be removed without affecting the -shortest path distances between any other -simplices. We now have a simplicial complex such that . We know that the upper bound of the -average path distance for this smaller simplicial complex is by assumption where is the contribution given by . We also know that the largest number we can add to the sum by the addition of a -simplex is given by . Thus . This means that the upper bound of is . Clearly as , and so there is no upper bound for . It should be fairly clear that a simplicial complex of the form will achieve this bound. ∎
5 Simplicial Centralities
5.1 Centralities based on simplicial shortest-path
We are now in a position to generalize some centrality notions for simplices which are based on the simplicial shortest path distance. The simplest of all centrality measures is the degree. In the case of the simplicial complexes we have three levels of degrees, which we will designate as $\delta_k(\sigma)$, where $k = 0, 1, 2$ is the level of the simplex, i.e., nodes, edges and triangles, respectively, and $\sigma$ is the corresponding simplex. The degree of a $k$-simplex $\sigma_i$ is the number of other $k$-simplices to which $\sigma_i$ is adjacent. Let $p(\delta_k)$ be the probability of finding a $k$-simplex of degree $\delta_k$ in a simplicial complex and $P(\delta_k)$ the probability of finding a $k$-simplex of degree larger than or equal to $\delta_k$ in the simplicial complex. The degree distribution of the $k$-simplices is then the probability distribution of the degrees of the $k$-simplices across the whole of the simplicial complex.
Closeness centrality is a concept first introduced by Bavelas [Bavelas, 1950] to capture the idea of how close, in terms of shortest path distance, two nodes are in a network. Here we will generalize this concept to simplicial complexes. The simplicial farness of a $k$-simplex $\sigma_i$ is the sum of its $k$-shortest path distances to all other $k$-simplices, $\sum_{j \neq i} d_k(\sigma_i, \sigma_j)$. The simplicial closeness is the reciprocal of simplicial farness. That is,
$$C_k(\sigma_i) = \frac{1}{\sum_{j \neq i} d_k(\sigma_i, \sigma_j)}.$$
Note that if the simplicial complex is not $k$-connected then the simplicial closeness $C_k(\sigma_i)$ could be considered undefined or $0$ for all $k$-simplices in the simplicial complex. In this case we can calculate the simplicial harmonic closeness instead. This is a generalization of a definition that can be found in [Rochat, 2009]. The simplicial harmonic closeness of a $k$-simplex $\sigma_i$ is defined as follows
$$H_k(\sigma_i) = \sum_{j \neq i} \frac{1}{d_k(\sigma_i, \sigma_j)},$$
where we treat $1/\infty = 0$.
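Given the list of $k$-shortest path distances from one simplex to all $k$-simplices, both closeness variants are one-line sums. The sketch below is illustrative only: the function names are hypothetical, the harmonic version is left unnormalized (an assumption), and the closeness is reported as $0$ rather than undefined when some simplex is unreachable.

```python
import math


def simplicial_closeness(distances, i):
    """Reciprocal of the farness of simplex i; 0.0 if the farness is infinite or zero."""
    farness = sum(d for j, d in enumerate(distances) if j != i)
    if farness == 0 or math.isinf(farness):
        return 0.0
    return 1.0 / farness


def simplicial_harmonic_closeness(distances, i):
    """Sum of reciprocal distances from simplex i, treating 1/infinity as 0."""
    return sum(1.0 / d for j, d in enumerate(distances) if j != i)


# Hypothetical distances from simplex 0: two reachable simplices, one unreachable.
example_distances = [0, 1, 2, math.inf]
closeness_value = simplicial_closeness(example_distances, 0)
harmonic_value = simplicial_harmonic_closeness(example_distances, 0)
```

Python evaluates `1.0 / math.inf` as `0.0`, so unreachable simplices contribute nothing to the harmonic sum without any special casing.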
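We would now like to establish some bounds on the simplicial closeness centrality. However, there is an issue that needs to be considered before bounds can be established. The issue is that the sum $\sum_{j \neq i} d_k(\sigma_i, \sigma_j)$ depends on the number of simplices in the complex. If $d_k(\sigma_i, \sigma_j) = 1$ for all $\sigma_j \in S_k$ with $\sigma_j \neq \sigma_i$, where $S_k$ is the set of $k$-simplices and $n = |S_k|$, then the simplicial closeness of $\sigma_i$ is $1/(n-1)$. Clearly this is the largest the simplicial closeness can be for $n$ $k$-simplices. We can normalize this by multiplying by $n-1$ to give an upper bound of $1$.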
The upper bound of the normalized simplicial closeness centrality can be attained by all simplices in a simplicial complex of the form .
In a simplicial complex of the form we have -simplices which are all adjacent to each other. Thus if we select a particular -simplex we have that for all such that . This gives . Hence . ∎
We now prove a lower bound for the normalized simplicial closeness centrality
Let us consider a -connected simplicial complex with -simplices. Then, the lower bound for the normalized simplicial closeness centrality of a -simplex is , and this is attained asymptotically when .
Assume the simplicial complex is -connected. We are trying to minimize and hence trying to maximize . Take a -simplex in a simplicial complex which has -simplices. Firstly, because is -connected there exists a -walk between and every other -simplex in the simplicial complex. The farthest distance possible between and another simplex is . This means that the -shortest path between these two simplices looks like . The shortest path between and any other -simplex must be the path , i.e. the shortest path from to but cut off at simplex . If there was a shorter path from to then you could replace this part of the path from to with said shorter path from to and have a shorter path from to . Hence, the -shortest path distance from to any other -simplex, is . Note that for to be at distance from a -simplex , must first be at distance from a simplex adjacent to . Hence, the maximum value for . This gives a lower bound on of which after normalization by multiplication by gives a lower bound on of . This clearly tends to as . ∎
The bound of for a given number of -simplices is achieved by the end simplex in a -path.
If a simplicial complex is not -connected then or it is considered undefined for all -simplices . Thus, the peripheral -simplex which is only adjacent to the central -simplex in the complex has simplicial closeness given by . We have where is the given simplex and is a run through of the other simplices. The is contributed by the shortest path from to the central simplex while the s are given by the shortest path distances from to the other peripheral simplices on the other branches. While for the normalization.
To give an example from the simplicial complex in Figure 1 we need to use the definition of simplicial closeness given in Definition 14. So to calculate the Simplicial closeness of we have . This is because it is adjacent to and has shortest path distance to both and . There is no -path from to any of the other simplices.
The second centrality notion based on shortest paths that we can generalize is the betweenness centrality. Betweenness centrality was introduced by Freeman in 1977 in order to capture the notion of how central a node in a network is in passing information through other nodes. The following is a direct generalization of this definition [Freeman, 1977]. The simplicial betweenness of a $k$-simplex $\sigma_i$ is defined as follows
$$B_k(\sigma_i) = \sum_{j < l;\ j, l \neq i} \frac{\rho_{jl}(\sigma_i)}{\rho_{jl}},$$
where $\rho_{jl}$ is the total number of $k$-shortest paths from $\sigma_j$ to $\sigma_l$ and $\rho_{jl}(\sigma_i)$ is the number of such paths that pass through $\sigma_i$, where $j \neq i \neq l$.
The betweenness centrality of a -simplex increases as the number of pairs of other simplices increases. It is therefore sensible to divide by , the number of pairs of -simplices which are not the simplex . This gives a value for simplicial betweenness in the range . The lower bound of is attained by every -simplex in a simplicial complex of the form . It is also attained by any simplex which is adjacent to only one other simplex. The upper bound of can be attained by the central -simplex of a simplicial complex where .
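The betweenness computation, including the division by the number of pairs discussed above, can be sketched on the network whose nodes are the $k$-simplices and whose edges join adjacent $k$-simplices. This is a simple illustration rather than an optimized algorithm; the adjacency-list input format and the function names are assumptions.

```python
from collections import deque


def shortest_path_counts(adj, source):
    """BFS from `source`: distances and numbers of shortest paths to reachable nodes."""
    dist, count = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                count[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                count[v] += count[u]
    return dist, count


def simplicial_betweenness(adj, normalized=True):
    """Betweenness of every k-simplex; `adj` maps each simplex to its adjacent simplices."""
    nodes = list(adj)
    bfs = {s: shortest_path_counts(adj, s) for s in nodes}
    score = {v: 0.0 for v in nodes}
    for a, s in enumerate(nodes):
        dist_s, count_s = bfs[s]
        for t in nodes[a + 1:]:
            if t not in dist_s:
                continue  # no k-walk joins s and t
            for v in nodes:
                if v == s or v == t or v not in dist_s:
                    continue
                dist_v, count_v = bfs[v]
                # v lies on a shortest s-t path exactly when the distances add up.
                if dist_s[v] + dist_v[t] == dist_s[t]:
                    score[v] += count_s[v] * count_v[t] / count_s[t]
    n = len(nodes)
    if normalized and n > 2:
        pairs = (n - 1) * (n - 2) / 2  # pairs of simplices not containing v
        score = {v: b / pairs for v, b in score.items()}
    return score


# Hypothetical example: one central simplex adjacent to three peripheral ones.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
star_scores = simplicial_betweenness(star)
```

In this example the central simplex lies on the unique shortest path of every pair of peripheral simplices, so it reaches the normalized maximum.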
5.2 Spectral simplicial centralities
We now move to the concepts of centrality based on spectral properties of the simplicial complexes. Historically, for networks the first of these centralities was developed by Katz [Katz, 1953]. The Katz centrality index tries to capture the notion that a node in a network is influenced not only by its nearest neighbors but also, to a lesser extent, by any other node separated at a given distance from it, in such a way that this influence decays with the separation between the nodes. In this section we generalize these ideas to simplicial complexes, largely following the example of [Estrada et al., 2015].
To make this task easier we define an underlying network of simplices at every level of a simplicial complex. For all $k$, the adjacency matrix $A_k$ of the $k$-simplices of a simplicial complex also gives rise to a network where each node corresponds to a $k$-simplex and there is an edge between two nodes if and only if their corresponding $k$-simplices are adjacent in the simplicial complex. We call this network the underlying network of simplices. This immediately gives us some results from network theory.
Let $A_k$ be the adjacency matrix between $k$-simplices in a simplicial complex. Then, $(A_k^m)_{ij}$ gives the number of $k$-walks of length $m$ between $k$-simplex $\sigma_i$ and $k$-simplex $\sigma_j$.
Every walk on the underlying network of simplices at the $k$-simplex level has a corresponding $k$-walk over the $k$-simplices. We have that $A_k$ is also the adjacency matrix for the nodes in the underlying network of simplices. Thus, powers of the adjacency matrix can be used to give the numbers of walks of a given length on the underlying network of simplices. In particular, $(A_k^m)_{ij} = w$ means that there are $w$ walks of length $m$ between node $i$ and node $j$ in the underlying network of simplices at the $k$-simplex level. This precisely corresponds to $w$ $k$-walks of length $m$ between simplex $\sigma_i$ and simplex $\sigma_j$, where $\sigma_i$ and $\sigma_j$ are the simplices represented by node $i$ and node $j$, respectively, in the underlying network of simplices. ∎
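Let $A_k$ be the adjacency matrix representing the adjacency between $k$-simplices in a simplicial complex. The simplicial Katz centrality index is given by
$$K_k(\sigma_i) = \left[ \sum_{m=0}^{\infty} \alpha^m A_k^m \mathbf{1} \right]_i,$$
where $\alpha > 0$ is an attenuation factor and $\mathbf{1}$ is the all-ones vector. The simplicial Katz centrality is essentially the network-theoretic Katz centrality applied to the underlying network of simplices. This means that, as proved in [Estrada et al., 2015], the series $\sum_{m=0}^{\infty} \alpha^m A_k^m$ converges when $\alpha < 1/\rho(A_k)$, where $\rho(A_k)$ is the spectral radius of $A_k$. This means that $K_k(\sigma_i) = \left[ (I - \alpha A_k)^{-1} \mathbf{1} \right]_i$. We also have from [Estrada et al., 2015] the representation of the Katz centrality in terms of the eigenvalues and eigenvectors of $A_k$. This representation gives
$$K_k(\sigma_i) = \sum_{j} \sum_{l} \frac{\psi_l(i)\, \psi_l(j)}{1 - \alpha \lambda_l},$$
where $\psi_l(i)$ and $\psi_l(j)$ are the $i$th and $j$th entries of the $l$th eigenvector of $A_k$, respectively, and $\lambda_l$ is the $l$th eigenvalue of $A_k$.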
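The Katz index can be evaluated without an explicit matrix inverse by summing the walk series term by term. The sketch below assumes the adjacency matrix between $k$-simplices is given as a list of lists and that the attenuation factor `alpha` is below the reciprocal of the spectral radius, which the caller is responsible for ensuring; the function name and the example are hypothetical.

```python
def simplicial_katz(A, alpha, iterations=500):
    """Approximate the vector (I - alpha * A)^(-1) 1 by fixed-point iteration.

    After t iterations x equals the partial sum of alpha^m A^m 1 for m = 0..t,
    which converges when alpha is smaller than 1 / spectral_radius(A).
    """
    n = len(A)
    x = [1.0] * n
    for _ in range(iterations):
        x = [1.0 + alpha * sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x


# Hypothetical underlying network: three simplices arranged in a path.
path3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
katz_values = simplicial_katz(path3, alpha=0.5)
```

For this path the spectral radius is $\sqrt{2}$, so `alpha=0.5` lies inside the convergence region, and solving $(I - \alpha A)x = \mathbf{1}$ by hand gives $x = (3, 4, 3)$.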
We can now use the simplicial Katz centrality to define the simplicial eigenvector centrality. The following adjustment of the Katz centrality appears in [Estrada et al., 2015] and can also be applied to the simplicial Katz centrality.
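Again following the example of [Estrada et al., 2015], we allow $\alpha$ to approach the inverse of the largest eigenvalue $\lambda_1$ of $A_k$ from below, $\alpha \to 1/\lambda_1^{-}$. Writing $K_k(\sigma_i)$ for the simplicial Katz centrality and $\psi_1$ for the principal eigenvector of $A_k$, this gives
$$\lim_{\alpha \to 1/\lambda_1^{-}} (1 - \alpha \lambda_1)\, K_k(\sigma_i) = \psi_1(i) \sum_{j} \psi_1(j).$$
Therefore the eigenvector associated with the largest eigenvalue of $A_k$ could also be said to be a centrality measure. This leads us to the following definition. The simplicial eigenvector centrality of the $i$th $k$-simplex in a simplicial complex is given by the $i$th component of the principal eigenvector of $A_k$,
$$E_k(\sigma_i) = \psi_1(i).$$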
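The principal eigenvector can be approximated by power iteration. The sketch below is one possible approach under stated assumptions: it iterates with $A_k + I$ instead of $A_k$ (a swapped-in shift that has the same eigenvectors but avoids oscillation on bipartite networks), it assumes the underlying network of simplices is connected, and the function name is hypothetical.

```python
import math


def simplicial_eigenvector_centrality(A, iterations=1000):
    """Power iteration for the principal eigenvector of A, scaled to unit length."""
    n = len(A)
    x = [1.0] * n
    for _ in range(iterations):
        # Multiply by (A + I): same eigenvectors as A, eigenvalues shifted by one.
        y = [x[i] + sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x


# Hypothetical underlying network: three simplices arranged in a path.
path3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
eigen_values = simplicial_eigenvector_centrality(path3)
```

For the three-simplex path the principal eigenvector is proportional to $(1, \sqrt{2}, 1)$, so the middle simplex receives the largest score.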
In a similar way as in the previous section, we make a generalization of the exponential of the adjacency matrix of $k$-simplices, which relies on results from [Estrada and Rodriguez-Velazquez, 2005]. The following power series of the adjacency matrix $A_k$ of $k$-simplices in a simplicial complex converges to the corresponding matrix exponential
$$\sum_{m=0}^{\infty} \frac{A_k^m}{m!} = e^{A_k} = \cosh(A_k) + \sinh(A_k),$$
where the first term accounts for the weighted sum of even-length walks and the second one accounts for odd-length walks in the simplicial complex.
We can now define a centrality measure analogous to subgraph centrality for simplicial complexes. Subgraph centrality was introduced for networks by Estrada and Rodríguez-Velázquez [Estrada and Rodriguez-Velazquez, 2005] to capture the participation of a node in all subgraphs in the network, giving more weight to the smaller ones than to the larger ones. This is a direct generalization made possible by the adjacency matrices at the different levels of the simplicial complex. Then, the simplicial subgraph centrality of a $k$-simplex $\sigma_i$ is given by $\left(e^{A_k}\right)_{ii}$, where $A_k$ is the adjacency matrix between $k$-simplices. For the simplicial complex in Figure 1, the simplicial subgraph centrality of one of its simplices is 2.714, while the simplicial communicability between a pair of its simplices is 2.0363. Note that any bounds on subgraph centrality or simplicial communicability for networks still hold due to the underlying network of simplices.
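The diagonal of the matrix exponential can be approximated by truncating the power series of the adjacency matrix. The following sketch works on plain lists of lists and is meant for small examples only; the function names, the number of terms and the two-simplex example are assumptions.

```python
import math


def matrix_exponential(A, terms=40):
    """Truncated power series sum of A^m / m! for m = 0 .. terms - 1."""
    n = len(A)
    total = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in total]
    for m in range(1, terms):
        # term holds A^(m-1) / (m-1)!; multiply by A and divide by m.
        term = [[sum(term[i][l] * A[l][j] for l in range(n)) / m
                 for j in range(n)] for i in range(n)]
        for i in range(n):
            for j in range(n):
                total[i][j] += term[i][j]
    return total


def simplicial_subgraph_centrality(A):
    """Diagonal entries of exp(A), one per k-simplex."""
    E = matrix_exponential(A)
    return [E[i][i] for i in range(len(A))]


# Hypothetical underlying network: two adjacent simplices.
pair = [[0, 1], [1, 0]]
pair_centrality = simplicial_subgraph_centrality(pair)
```

For two adjacent simplices the closed walks have even length only, so each diagonal entry equals $\cosh(1)$, while the off-diagonal entries of the exponential equal $\sinh(1)$.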
Let and let us consider an -connected simplicial complexes which contain a fixed number of -simplices. The upper bound of the simplicial subgraph centrality is attained by every simplex in a simplicial complex and the lower bound is attained by the two end simplices in a simplicial complex of the form .
Fix a number of -simplices to . It is known that for nodes the upper bound of subgraph centrality in networks is attained by every node in the complete graph [Estrada and Rodriguez-Velazquez, 2005]. The subgraph centrality for the underlying network of simplices at the -simplex level is the same as the simplicial subgraph centrality for -simplices. Thus, to find the upper bound of the simplicial subgraph centrality we need to find a simplicial complex whose underlying network of simplices is a complete graph. The simplicial complex satisfies this criterion. Similarly, the lower bound of subgraph centrality in networks is attained by the two end simplices in a path graph of length . Thus to find the lower bound of the simplicial subgraph centrality you need to find a simplicial complex whose underlying network of simplices is a path graph. The simplicial complex satisfies this criterion. ∎
6 Analysis of Protein Interaction Networks
Here we study 10 protein-protein interaction (PPI) networks. In these networks nodes represent proteins and undirected links represent experimentally determined interactions between pairs of proteins. The networks studied correspond to the following organisms: D. melanogaster (fruit fly) Giot et al. , Kaposi sarcoma herpes virus (KSHV) Uetz et al. , P. falciparum (malaria parasite) LaCount et al. , varicella zoster virus (VZV) Uetz et al. , human Rual et al. , S. cerevisiae (yeast) Bu et al. , A. fulgidus Motz et al. , H. pylori (Lin et al. ; Rain et al. ), E. coli Butland et al.  and B. subtilis Noirot and Noirot-Gros . We study only the largest (main) connected component of each of these networks, whose sizes range from 50 to 3039 proteins. We then transform these networks into their clique simplicial complexes and consider them at the levels of edges and of triangles. The numbers of simplices and interactions at the node, edge and triangle levels are given in Table 1. Notice that the number of simplices at the edge level is the same as the number of interactions at the node level.
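As an illustration of this transformation, the sketch below lists the nodes, edges and triangles of the clique complex of a small graph. It assumes that the network is supplied as a list of undirected edges with sortable node labels. The function name and the toy edge list are hypothetical, and the restriction to the main connected component is not included.

```python
def clique_complex_up_to_triangles(edges):
    """Return the 0-, 1- and 2-simplices of the clique complex of a graph."""
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # self-loops do not define a simplex
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    nodes = sorted(neighbors)
    # Store each edge once, with its endpoints in sorted order.
    edge_set = {tuple(sorted((u, v))) for u in neighbors for v in neighbors[u]}
    # A triangle is an edge plus any common neighbor of its two endpoints.
    triangles = set()
    for u, v in edge_set:
        for w in neighbors[u] & neighbors[v]:
            triangles.add(tuple(sorted((u, v, w))))
    return nodes, sorted(edge_set), sorted(triangles)


# Toy network: a 4-clique on nodes 1..4 with a pendant node 5.
toy_edges = [(1, 2), (1, 3), (2, 3), (1, 4), (2, 4), (3, 4), (4, 5)]
nodes, edges_1, triangles_2 = clique_complex_up_to_triangles(toy_edges)
print(len(nodes), len(edges_1), len(triangles_2))
```

The number of 1-simplices returned equals the number of distinct interactions in the input, which is the coincidence noted above for Table 1.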
6.1 Degree distributions
The study of the node degree distribution has become one of the standard tests in the structural analysis of networks. A network with a broad degree distribution (also known as a fat-tailed distribution) is characterized by the presence of a few hubs (high-degree nodes) which keep the network together. These hubs are important from both the structural and the functional point of view. In the case of PPI networks, hubs are expected to play a fundamental role in the cell, and their knockout is expected to produce large cellular damage. This is the main hypothesis of the centrality-lethality paradigm. A particular kind of fat-tailed distribution, the power-law one, has received a great deal of attention in the literature. A power-law degree distribution is also known as a scale-free distribution and is indicative of some self-similarity properties in the network. At the beginning of the 21st century a deluge of papers was published reporting scale-free distributions in almost every network. Many of the existing PPI networks were characterized as scale-free on the basis of these findings. Later, more careful analyses found that almost none of the PPI networks previously claimed to have scale-free structures actually did [Stumpf and Ingram, 2005]. The main message of these experiences is that most PPI networks do display some kind of heavy-tailed degree distribution, such as power-law, lognormal, Burr, logGamma, Pareto, etc. However, as we will see here, this is not necessarily true when a large number of statistical distributions and goodness-of-fit parameters are tested for the 10 PPI networks considered in this work.
Here we consider the probability degree functions (PDFs), i.e., the probability of finding a simplex with a given degree as a function of that degree, for the 10 PPI networks at the three levels studied in this work: nodes, edges, and triangles. For each of the PDFs we fit the data to each of the following distributions: Beta, Binomial, Birnbaum-Saunders, Burr, Exponential, Extreme Value, Gamma, Generalized Extreme Value (GEV), Generalized Pareto (gen-Pareto), Half-normal, Inverse Gaussian, Kernel, Logistic, Loglogistic, Lognormal, Nakagami, Negative Binomial, Normal, Poisson, Rayleigh, Rician, Stable, Location-Scale, and Weibull. The best fit was determined on the basis of the following statistical parameters: the Akaike information criterion (AIC) [Konishi and Kitagawa, 2008; Symonds and Moussalli, 2011] and the Bayesian information criterion (BIC) [Konishi and Kitagawa, 2008]. These indices are defined as follows:
$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L},$$
where $n$ is the number of data points, $k$ is the number of parameters to be estimated and $\hat{L} = p(x \mid \hat{\theta}, M)$ is the maximized value of the likelihood function of the model $M$, where $\hat{\theta}$ are the parameter values that maximize the likelihood function and $x$ are the data points. For a series of models trying to describe the same dataset, the smallest values of these parameters give the best fit for the data. However, it is important to consider the differences between the values of these parameters for the corresponding models, as we describe below.
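To make the two indices concrete, the following sketch evaluates them for two candidate distributions whose maximum-likelihood estimates have closed forms. It uses the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. The degree values are made-up illustrative numbers rather than data from the PPI networks, the degrees are treated as continuous observations for simplicity, and the helper names are hypothetical.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion for k fitted parameters."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion for k parameters and n data points."""
    return k * math.log(n) - 2 * log_lik


def exponential_loglik(data):
    """Maximized log-likelihood of an exponential model (MLE rate = 1 / mean)."""
    n = len(data)
    rate = n / sum(data)
    return n * math.log(rate) - rate * sum(data)


def lognormal_loglik(data):
    """Maximized log-likelihood of a lognormal model (MLE on the log values)."""
    n = len(data)
    logs = [math.log(x) for x in data]
    mu = sum(logs) / n
    var = sum((y - mu) ** 2 for y in logs) / n
    return -sum(logs) - 0.5 * n * math.log(2 * math.pi * var) - 0.5 * n


# Illustrative (made-up) simplex degrees, all strictly positive.
degrees = [1, 1, 1, 2, 2, 3, 3, 4, 5, 8, 13, 21]
n = len(degrees)
candidates = {
    "Exponential": (exponential_loglik(degrees), 1),  # (log-likelihood, k)
    "Lognormal": (lognormal_loglik(degrees), 2),
}
scores = {name: (aic(ll, k), bic(ll, k, n)) for name, (ll, k) in candidates.items()}
for name in sorted(scores, key=lambda name: scores[name][0]):
    print(name, scores[name])
```

Only differences between models fitted to the same dataset are meaningful here; the absolute values of AIC and BIC carry no interpretation on their own.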
We first fit the dataset corresponding to the degrees of the simplices at a given level of a PPI network to all the studied distributions. Then, we rank all the distributions in increasing order of their AIC. We then compare the values of the first few distributions in the ranking using the difference $\Delta_i = \mathrm{AIC}_i - \mathrm{AIC}_{\min}$, where $\mathrm{AIC}_{\min}$ is the AIC for the top distribution in the ranking. If this difference exceeds the adopted threshold, we consider the first distribution in the ranking to be significantly different from the second (and consequently from the rest) and accept it as the most significant one. In those cases where the difference in the AIC is not significant we also consider the difference in the BIC values. In this case we apply the Kass-Raftery criterion as follows:
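On the Kass-Raftery scale, a difference in BIC of 0 to 2 is not worth more than a bare mention, a difference of 2 to 6 is positive evidence, a difference of 6 to 10 is strong evidence, and a difference larger than 10 is very strong evidence. This means, for instance, that if the difference in the values of BIC is not bigger than 2, this criterion is not able to distinguish between the two distributions. If, however, it is between 6 and 10, there is strong evidence for considering the distribution with the smallest BIC as the most significant one [Kass and Raftery, 1995]. In Table 2 we show the best distribution fitted for each of the datasets studied.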
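The selection procedure can be sketched as follows, under several stated assumptions. The comparison is restricted to the two top-ranked distributions, the AIC threshold is passed as a parameter because no particular value is fixed here, and the Kass-Raftery bands are the commonly quoted ones (0 to 2, 2 to 6, 6 to 10, above 10). The function names and the AIC and BIC values in the example are purely illustrative and are not results of this work.

```python
def kass_raftery(delta_bic):
    """Label the evidence carried by a non-negative BIC difference."""
    if delta_bic <= 2:
        return "not worth more than a bare mention"
    if delta_bic <= 6:
        return "positive"
    if delta_bic <= 10:
        return "strong"
    return "very strong"


def select_model(scores, aic_threshold):
    """Pick a distribution from scores = {name: (aic, bic)}.

    Returns (name, reason); name is None when neither criterion separates
    the two top-ranked candidates.
    """
    ranked = sorted(scores, key=lambda name: scores[name][0])
    best, runner_up = ranked[0], ranked[1]
    delta_aic = scores[runner_up][0] - scores[best][0]
    if delta_aic > aic_threshold:
        return best, "decided by AIC"
    # AIC is inconclusive: compare the BIC values of the same two candidates.
    low, high = sorted((best, runner_up), key=lambda name: scores[name][1])
    evidence = kass_raftery(scores[high][1] - scores[low][1])
    if evidence == "not worth more than a bare mention":
        return None, "undetermined"
    return low, "decided by BIC (" + evidence + ")"


# Made-up (AIC, BIC) pairs for three candidate distributions.
example_scores = {
    "gen-Pareto": (410.2, 415.1),
    "Lognormal": (411.0, 423.5),
    "Gamma": (430.7, 436.0),
}
print(select_model(example_scores, aic_threshold=2.0))
```

Returning `None` when the evidence is no more than a bare mention mirrors the cases in which no single best distribution could be determined.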
The most interesting observation from the results shown in Table 2 is that all distributions obtained for the three levels of the simplicial complexes of the 10 PPI networks studied are heavy-tailed distributions. At the node level, the 7 distributions that were statistically significant (for the other three, the statistical criteria used were not able to distinguish between the first few distributions) correspond to a generalized Pareto distribution, where the probability of finding nodes of a given degree decays as a power-law of the corresponding degree (see Appendix). At the edge level, the PPI networks display GEV, generalized Pareto, and gamma distributions, all of which are heavy-tailed (see Appendix). Finally, at the triangle level 5 PPI networks display generalized Pareto distributions, and for the others it was not possible to determine the best distribution. These results indicate that at the three levels studied here (nodes, edges and triangles) there are simplicial hubs which concentrate most of the connectivity of the simplicial complexes at the corresponding level. Damage to these hubs is expected to produce catastrophic consequences for the functionality of the cell. On the other side of the coin, the existence of heavy-tailed distributions guarantees that the corresponding simplicial complexes are more robust at these levels to the random failure of simplices. These results also point to the need for other types of characterization of the degree heterogeneity of simplicial complexes based on non-statistical indices, which can be applied even to small datasets and/or datasets with small variability in their degrees [Estrada, 2010]. This is an ongoing project in our group which will be considered in a separate work.
6.2 Comparison of centralities at different levels
Simplicial centrality measures are all designed to identify the “most important” simplices in a simplicial complex at different levels and according to certain topological features of the complex, such as nearest-neighbor connectivity (degrees), proximity to other simplices (closeness) and participation of a simplex in small sub-complexes with other simplices (subgraph centralities). It is therefore expected that there is some correlation between the centralities inside each level of analysis. That is, it is expected that node degree is somehow correlated with node closeness or node subgraph centrality for a given PPI network. For instance, in the yeast PPI network the node centralities (degree, closeness and subgraph centrality) have an average rank correlation coefficient , with the highest rank correlation coefficient being between the closeness and the subgraph centralities (). At the edge level this average rank correlation is and at the triangle level it rises to .
The situation is different for the relations between two different levels of the simplicial complex. Here we do not have any theoretical insight indicating whether the information provided by the centralities at the node level is or is not correlated with that provided at the edge or triangle levels. In this case, however, it is desirable that the rank correlations are not too high, since this increases the amount of different structural information encoded by the simplicial centralities. This is indeed what is observed for the PPI simplicial complex of yeast. The average rank correlation coefficient between the node and edge centralities is just , and that between the node and triangle centralities is . Finally, the average rank correlation between the edge and triangle levels is barely . This lack of correlation between the inter-level centralities (see Table 3) clearly indicates that the top nodes in the ranking at one simplicial level do not necessarily coincide with those produced by the centralities at a different level.
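A rank correlation of this kind can be computed as in the sketch below. It assumes the Spearman coefficient with average ranks for ties, since the specific rank correlation coefficient is not named here, and it assumes that the two centralities have already been expressed as scores over the same set of proteins. The function names and the score values are illustrative only.

```python
import math


def average_ranks(values):
    """1-based ranks in increasing order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Made-up scores for six proteins under two different centralities.
node_level = [12, 7, 7, 3, 1, 9]
triangle_level = [4, 0, 2, 1, 0, 5]
print(spearman(node_level, triangle_level))
```

The coefficient is undefined when one of the score vectors is constant, because the corresponding rank variance is zero; such a case would need separate handling.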
It is also important to note that none of the correlations are negative. This implies that none of the centralities fundamentally disagree with each other. It is not the case that a centrality at one level is telling us that one set of nodes is not important and another set of nodes is, while a centrality at a different level is telling us the exact opposite. It is more likely that a centrality at one level is telling us that one set of nodes is important while a centrality at a different level is telling us the same thing, but with the order of importance shuffled between the two centralities. This hypothesis is supported when we consider the triangle and node degrees. Of the 100 most central nodes according to these two centralities, 24 coincide. When we consider the top 300 this rises to 111 proteins (37%), and in the top 500 the two centralities share 268 proteins (53.6%). This may explain the difference between the