Abstract

We present a network model in which words over a specific alphabet, called structures, are associated to each node and undirected edges are added depending on some distance between different structures.

It is shown that this model can generate, without the use of preferential attachment or any other heuristic, networks with topological features similar to biological networks: power law degree distribution, clustering coefficient independent from the network size, etc.

Specific biological networks (C. Elegans neural network and E. Coli protein-protein interaction network) are replicated using this model.

A network model with structured nodes

Pierluigi Frisco

School of Math. and Comp. Sciences, Heriot-Watt University,

EH14 4AS Edinburgh, UK,

P.Frisco@hw.ac.uk

1 Introduction

In recent years, mathematical and computer science (CS) concepts and methodologies have been successfully used in Biology. This fruitful combination has clear advantages for all the disciplines involved. When biological phenomena are regarded as information processes, they can be studied using mathematical and CS tools and concepts. This gives Biology new ways to approach problems and to find solutions to them, and it deepens the understanding of biological processes. At the same time, CS enriches itself with new ways to define and study information processes, while Mathematics enriches itself with new concepts and theories.

In the last decade several studies [2, 13, 9, 12] showed the importance of the topology of biological networks. These studies showed that biological networks are composed of motifs, that biological networks with specific functions have an abundance of certain motifs rather than others, that the number of edges per node follows specific laws, etc.

Besides studying the features of empirical networks, it is also important to have algorithms able to generate networks with the same features as empirical ones. Algorithms of this kind, called models, are an invaluable help in the generation of artificial networks, and they provide insights into how certain features of complex empirical networks arise from the construction rules present in the model.

Examples of such models are the Erdős-Rényi (E-R) model [5], the Watts-Strogatz (W-S) model [17] and the Barabási-Albert (B-A) model [3] with its variants [1, 14]. The E-R model generates random networks that reproduce the small-world property (a short path from any node to any other node in the network), but these networks fail to account for the local clustering characterising many empirical networks. Both properties are captured by the W-S model, but it does not capture the inhomogeneous degree distribution found in many empirical networks. The B-A model overcomes this limitation and gives rise to a power-law degree distribution. This degree distribution is obtained using preferential attachment: the probability for a node to receive an edge depends on the number of edges the node already has. The original B-A model does not capture the independence of the clustering coefficient from the size (number of nodes) of a network. This feature is captured by a variant [14] of this model in which heuristics (replication of networks) are used.

The present study originates from the wish to create a network model able to reproduce biological networks without the use of heuristics. Despite the many successful applications of the B-A model, it was not clear to us how preferential attachment could have been present in the evolution of, say, gene networks. Why is a gene with many interactions more likely to get even more interactions than a gene with few interactions? How can a newly added gene “know” which genes have more interactions? In this respect, we believe that preferential attachment captures the overall effect of something more basic present in the evolution of biological networks.

The network model introduced and studied in the present paper tries to capture some basic features present in the evolution of biological networks: network growth, node structure and distance between node structures.

The node structure represents, for instance, the DNA sequence in genes, the secondary structure of proteins, the personality features of humans, etc. The distance between nodes represents, for instance, the fact that proteins will interact if their tertiary structure (which depends on their secondary structure) allows it, or that two humans will be friends if the traits of their personalities are somehow close.

In the following we present the model with structured nodes (Section 2), we analyse it (Section 3) and we use it to generate specific biological networks (Section 4). The paper ends with a discussion (Section 5). Supplementary material (further technical details, generated networks, the program implementing the proposed network model, etc.) is available at [11].

2 Description of the model

In the network model with structured nodes (SN model) each node in the network has a structure: a word over a specified alphabet. The given initial nodes have different structures. Nodes are added to the network one by one. Each new node has a structure obtained by modifying a randomly chosen structure already present in the network. If the structure of the new node is already present in the network, then the new node is not added (that is, all nodes in the network have different structures). If the structure of the new node is not present in the network and the new node has no edge with the existing nodes, then the new node is not added (that is, isolated nodes are not allowed). Undirected edges are added to the network depending on a given distance between node structures. This process is repeated until the network reaches a given number of nodes. A simple example follows.

Let us assume that the alphabet is {A, B, C} and that the network contains only one initial node with structure ABCABC. Edges between nodes are added only if the Hamming distance [18] between the structures of the nodes is at most 1.

A node can be added to the network by mutating one symbol in the structure of an existing node. For instance, the node ABBABC can be obtained by mutating the third symbol of ABCABC. An edge is added between the two nodes (they differ in only one symbol).

A third node can be added to the network by adding one symbol to the structure of a randomly selected existing node. For instance, the node ABBABBC can be obtained by adding a B between the last B and the C in node ABBABC. An edge is added between the new node and ABBABC (when computing the distance between two structures, the surplus symbols in the longer structure are disregarded). No edge is added between the new node and ABCABC because there are 2 differences in their first 6 symbols.

The structure ABCBC can be obtained from ABCABC by deleting the second A. The node with this new structure does not become part of the network as no edge has been added (the distance between ABCBC and each of the other structures present in the network is bigger than 1).

The structure ABBBBABBC can be obtained from ABBABBC by duplicating the two Bs in its second and third positions. The node with this new structure does not become part of the network as no edge has been added.
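The distances used in this example can be checked with a few lines of Python. This is only a minimal sketch: the function name `prefix_hamming` is ours, and it assumes, as stated above, that surplus symbols in the longer structure are simply ignored.

```python
def prefix_hamming(s: str, t: str) -> int:
    """Hamming distance over the common length of s and t.

    zip() stops at the shorter word, so the surplus symbols
    of the longer word are disregarded.
    """
    return sum(a != b for a, b in zip(s, t))


# The three structures accepted in the worked example.
network = ["ABCABC", "ABBABC", "ABBABBC"]

# The two candidate structures that are rejected.
for candidate in ["ABCBC", "ABBBBABBC"]:
    distances = {node: prefix_hamming(candidate, node) for node in network}
    accepted = any(d <= 1 for d in distances.values())
    print(candidate, distances, "accepted" if accepted else "rejected")
```

Both candidates are at distance 2 or more from every structure in the network, so neither receives an edge and neither is added.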

Input parameters define the probabilities to mutate, add, delete and duplicate symbols in node structures; their values have to sum up to 1.
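The growth process described in this section can be sketched as follows. This is an illustration under stated assumptions, not the program available at [11]: the names `modify`, `grow` and `max_tries` are hypothetical, and the way positions and duplicated segments are chosen (uniformly at random) is our assumption, since the text does not specify it.

```python
import random


def prefix_hamming(s, t):
    """Hamming distance ignoring surplus symbols of the longer word."""
    return sum(a != b for a, b in zip(s, t))


def modify(word, alphabet, probs, rng):
    """Apply one operation; probs = (mutate, add, delete, duplicate)."""
    op = rng.choices(["mutate", "add", "delete", "duplicate"], weights=probs)[0]
    if op == "add":
        i = rng.randrange(len(word) + 1)  # insertion point, end included
        return word[:i] + rng.choice(alphabet) + word[i:]
    i = rng.randrange(len(word))
    if op == "mutate":
        return word[:i] + rng.choice(alphabet) + word[i + 1:]
    if op == "delete":
        # Assumption: a one-symbol word is left unchanged.
        return word[:i] + word[i + 1:] if len(word) > 1 else word
    j = rng.randrange(i, len(word)) + 1  # duplicate the segment word[i:j]
    return word[:j] + word[i:j] + word[j:]


def grow(initial, alphabet, probs, max_distance, target_nodes, max_tries, seed=0):
    """Grow a network; returns (list of structures, set of undirected edges)."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    rng = random.Random(seed)
    nodes = list(initial)
    seen = set(nodes)
    edges = set()
    for _ in range(max_tries):
        if len(nodes) >= target_nodes:
            break
        new = modify(rng.choice(nodes), alphabet, probs, rng)
        if new in seen:
            continue  # all nodes must have different structures
        neighbours = [n for n in nodes if prefix_hamming(new, n) <= max_distance]
        if not neighbours:
            continue  # isolated nodes are not allowed
        nodes.append(new)
        seen.add(new)
        edges.update((n, new) for n in neighbours)
    return nodes, edges


nodes, edges = grow(["ABCABC"], "ABC", (0.7, 0.1, 0.1, 0.1), 1, 30, 10_000)
print(len(nodes), "nodes,", len(edges), "edges")
```

The `max_tries` bound is only there to guarantee termination, because rejected candidates do not increase the number of nodes.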

We also used a Hamming distance in which the comparison considers groups of consecutive symbols. The order of the symbols within each group is irrelevant to the distance. For instance, let us consider the two structures ABBABC and BABCAB. If the unit distance is 1 (i.e., symbols are compared one by one), then the distance between the two structures is 5, as the only matching symbol is the B in the third position. If the unit distance is 2 (i.e., pairs of symbols are compared), then the distance between the two structures is 2. This is because the first pairs are considered equal (AB and BA differ only in the order of the symbols), while the other two pairs differ in the symbols they contain. If the unit distance is 3 (i.e., triples of symbols are compared), then the distance between the two structures is 0. This is because the first triples are considered equal (ABB and BAB differ only in the order of the symbols), as are the second triples (ABC and CAB differ only in the order of the symbols).

An edge between two nodes is present only if their distance is at most the value of the input parameter maximum distance.
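The grouped distance and the edge rule can be sketched in Python as below. The names `grouped_distance` and `has_edge` are ours. We assume that both structures are first cut to their common length and that a trailing incomplete group, if any, is compared like the others; the text does not cover that case.

```python
def grouped_distance(s: str, t: str, unit: int = 1) -> int:
    """Number of differing groups of `unit` consecutive symbols.

    Two groups are equal when they contain the same symbols,
    regardless of order (compared here as sorted lists).
    """
    n = min(len(s), len(t))
    s, t = s[:n], t[:n]
    return sum(
        sorted(s[i:i + unit]) != sorted(t[i:i + unit])
        for i in range(0, n, unit)
    )


def has_edge(s: str, t: str, unit: int, max_distance: int) -> bool:
    """Edge rule: distance at most the maximum distance."""
    return grouped_distance(s, t, unit) <= max_distance


for unit in (1, 2, 3):
    print(unit, grouped_distance("ABBABC", "BABCAB", unit))
# prints: 1 5, then 2 2, then 3 0
```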

When the unit distance is bigger than 1, it is possible to provide a file, called matches, indicating how the different groups of symbols can be matched to each other. In other words, the file matches behaves like the genetic code: it denotes which tuples of symbols have to be regarded as equal (in the same way as different codons translate into the same amino acid). For instance, let the unit distance be 2, the alphabet be {A, B}, and the file matches be:
AB =
BA =
AA = BB
BB = AA
With this file matches, the strings ABBB and ABAA have distance 0. This is because the first pair (AB) is the same in both strings, while the second pairs (BB and AA) are defined by the file matches to be equal. Without the file matches, the two strings have distance 1 (due to the second pair).
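A possible way to read such a file and use it in the distance is sketched below. The parsing format (one `group = equivalent groups` entry per line) is inferred from the example only, and the names `parse_matches` and `matched_distance` are hypothetical. We also assume that, when a file matches is given, two groups are equal only if they are identical or declared equal by the file; whether the order-insensitive comparison still applies in that case is not stated in the text.

```python
MATCHES = """\
AB =
BA =
AA = BB
BB = AA
"""


def parse_matches(text):
    """Return a function mapping each group to a class representative."""
    parent = {}

    def find(group):
        parent.setdefault(group, group)  # unknown groups match only themselves
        while parent[group] != group:
            parent[group] = parent[parent[group]]
            group = parent[group]
        return group

    for line in text.splitlines():
        if "=" not in line:
            continue
        left, right = (part.strip() for part in line.split("=", 1))
        root = find(left)
        for other in right.split():
            parent[find(other)] = root  # merge the two classes
    return find


def matched_distance(s, t, unit, canon):
    """Count groups whose class representatives differ."""
    n = min(len(s), len(t))
    s, t = s[:n], t[:n]
    return sum(
        canon(s[i:i + unit]) != canon(t[i:i + unit])
        for i in range(0, n, unit)
    )


canon = parse_matches(MATCHES)
print(matched_distance("ABBB", "ABAA", 2, canon))            # 0
print(matched_distance("ABBB", "ABAA", 2, lambda g: g))      # 1
```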

We call a set of input parameters an instance. The complete list of input parameters together with their description can be found in the user manual of the program implementing the SN model [11].

3 Analysis of the model

We assessed our network model over the following network topological features [10]. Given an undirected network with N nodes and its edges, we denote by ⟨k⟩ the average degree, by L the average path length, by C the average clustering coefficient and by P(k) the degree distribution; the clustering coefficient distribution is considered as well.
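For concreteness, these quantities can be computed from an adjacency structure as in the following sketch. The representation (a dictionary mapping each node to the set of its neighbours) and all function names are our choices. We assume the usual conventions: the local clustering coefficient of a node with fewer than 2 neighbours is 0, and the average path length is taken over pairs of mutually reachable nodes.

```python
from collections import Counter, deque
from itertools import combinations


def average_degree(adj):
    return sum(len(nb) for nb in adj.values()) / len(adj)


def local_clustering(adj, v):
    k = len(adj[v])
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


def average_clustering(adj):
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


def degree_distribution(adj):
    counts = Counter(len(nb) for nb in adj.values())
    return {k: c / len(adj) for k, c in sorted(counts.items())}


def average_path_length(adj):
    total = pairs = 0
    for source in adj:  # breadth-first search from every node
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# A triangle (0, 1, 2) with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(average_degree(adj))       # 2.0
print(degree_distribution(adj))  # {1: 0.25, 2: 0.5, 3: 0.25}
print(average_clustering(adj), average_path_length(adj))
```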

We also considered the:

3-node motifs distribution, that is the number (normalised to 1) of triples of nodes having no edge, only 1 edge, only 2 edges and 3 edges between themselves;

path length distribution, relating the number (normalised to one) of paths to their length;

heterogeneity index, a new formulation of the Randić index introduced in [7, 6]. In [7, 6] it is also shown that the Barabási-Albert model is not able to generate networks with a heterogeneity index as high as the one found in biological networks.
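As an illustration of the first item in this list, the 3-node motif distribution can be obtained by enumerating all triples of nodes and counting the edges among them. The sketch below uses a brute-force enumeration, which is cubic in the number of nodes and only meant for small networks; the function name and the adjacency representation (a dictionary of neighbour sets) are our assumptions.

```python
from itertools import combinations


def three_node_motifs(adj):
    """Fractions of node triples having 0, 1, 2 and 3 edges among them."""
    counts = [0, 0, 0, 0]
    for a, b, c in combinations(adj, 3):
        n_edges = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        counts[n_edges] += 1
    total = sum(counts)
    return [x / total for x in counts]


# A triangle (0, 1, 2) with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(three_node_motifs(adj))  # [0.0, 0.25, 0.5, 0.25]
```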

We compared the network generated by an instance of the SN model with the network generated by the Barabási-Albert model (our implementation of this model is based on the Fortran implementation present at [16]). For this purpose we ran the Barabási-Albert model starting with a clique of 6 nodes and adding 6 edges for each newly added node. We also ran the following instance of our network model: initial node ABCDEFGHILMN, alphabet {A, ..., T}, probability to mutate 1 (which implies that the length of the node structures is equal to that of the initial node), Hamming distance having unit distance 2 and maximum distance 2. We ran these simulations for 3000 iterations, storing the resulting intermediate networks every 500 iterations. These tests were run 100 times with different random seeds.
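The Barabási-Albert setting used for this comparison (a clique of 6 nodes, 6 edges per new node) can be sketched in Python as follows. This is not the Fortran implementation at [16] nor our port of it; it is a minimal illustration of preferential attachment, with a hypothetical function name, in which targets are drawn from a list where each node appears once per edge end.

```python
import random


def barabasi_albert(n, m, seed=0):
    """Grow a network from a clique of m nodes, adding m edges per new node."""
    rng = random.Random(seed)
    # Initial clique: every node is linked to the other m - 1 nodes.
    adj = {v: set(range(m)) - {v} for v in range(m)}
    # Each node appears in `stubs` once per edge end, so a uniform draw
    # from this list picks a node with probability proportional to its degree.
    stubs = [v for v in range(m) for _ in range(m - 1)]
    for new in range(m, n):
        targets = set()
        while len(targets) < m:  # m distinct targets
            targets.add(rng.choice(stubs))
        adj[new] = set(targets)
        for t in targets:
            adj[t].add(new)
            stubs.extend([t, new])
    return adj


ba = barabasi_albert(3000, 6)
print(sum(len(nb) for nb in ba.values()) / len(ba))  # average degree, close to 12
```

With 6 edges per new node the average degree approaches 12 as the network grows, since each edge contributes to the degree of two nodes.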

Figure 1 shows how the average degree, average path length and average clustering coefficient change in the Barabási-Albert model and in the SN model.

[Figure 1: three panels plotted against the network size N (500 to 3000), each with one curve for BA and one for SN: (A) ⟨k⟩, (B) L, (C) C.]
Figure 1: (A) average degree, (B) average path length and (C) average clustering coefficient of a growing Barabási-Albert model network (BA) and a growing SN model network (SN).

The average path length follows the same curve in both models, and the average degree slowly grows in the SN model while it remains constant in the Barabási-Albert model. The major difference is in the clustering coefficient: it remains constant in the SN model while it decreases fast in the Barabási-Albert model. It is known that empirical networks have a clustering coefficient independent of their size, and in [15] a variant of the Barabási-Albert model generating networks with a power law degree distribution and a clustering coefficient independent of the size of the network was presented. The motif distribution was similar in both models (data not shown).

It is well known that the Barabási-Albert model generates networks with a degree distribution following a power law. The same holds true for the considered instance of the SN model (this is not true for all instances of the SN model).

In both models the exponent of the power law does not change during growth. However, in the considered instance, the degree distribution of the networks generated by the SN model does not follow a power law in its initial phases. This is shown in Fig. 2A, where it can be seen that the degree distribution follows a power law only beyond a certain point. This difference with the Barabási-Albert model is mainly due to the fact that in the Barabási-Albert model each newly added node has a fixed number of edges (6 in the case considered by us), while this requirement for a minimum number of edges is not present in the SN model.

We ran another instance of the SN model for 55000 iterations and then removed from the generated networks all nodes having fewer than 5 edges, together with their edges. The resulting networks, having around 3000 nodes, have a power law degree distribution (Fig. 2B).
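The pruning step can be sketched as below. We assume a single pass: nodes with fewer than 5 edges in the generated network are removed once, and the degrees of the remaining nodes are not re-checked afterwards (the text does not say whether the removal is repeated). The function name and the adjacency representation (a dictionary of neighbour sets) are ours.

```python
def prune_low_degree(adj, min_degree=5):
    """Remove, in one pass, every node with fewer than min_degree edges."""
    keep = {v for v, nb in adj.items() if len(nb) >= min_degree}
    # Edges towards removed nodes disappear together with those nodes.
    return {v: adj[v] & keep for v in keep}


# A triangle (0, 1, 2) with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(prune_low_degree(adj, min_degree=2))
# {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
```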

[Figure 2: panels (A) and (B), each plotting the degree distribution P(k) against the degree k.]