The HIM glocal metric and kernel for network comparison and classification

The HIM glocal metric and kernel for
network comparison and classification

G. Jurman , R. Visintainer, M. Filosi, S. Riccadonna, C. Furlanello
Fondazione Bruno Kessler, Trento, Italy CIBIO, University of Trento, Italy Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Italy
{jurman,visintainer,filosi,furlan}@fbk.eu samantha.riccadonna@fmach.it
Corresponding author
Abstract

Due to the ever rising importance of the network paradigm across several areas of science, comparing and classifying graphs represent essential steps in the networks analysis of complex systems. Both tasks have been recently tackled via quite different strategies, even tailored ad-hoc for the investigated problem. Here we deal with both operations by introducing the Hamming-Ipsen-Mikhailov (HIM) distance, a novel metric to quantitatively measure the difference between two graphs sharing the same vertices. The new measure combines the local Hamming distance and the global spectral Ipsen-Mikhailov distance so to overcome the drawbacks affecting the two components separately. Building then the HIM kernel function derived from the HIM distance it is possible to move from network comparison to network classification via the Support Vector Machine (SVM) algorithm. Applications of HIM distance and HIM kernel in computational biology and social networks science demonstrate the effectiveness of the proposed functions as a general purpose solution.

 

The HIM glocal metric and kernel for
network comparison and classification


  G. Jurmanthanks: Corresponding author , R. Visintainer, M. Filosi, S. Riccadonna, C. Furlanello Fondazione Bruno Kessler, Trento, Italy CIBIO, University of Trento, Italy Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Italy {jurman,visintainer,filosi,furlan}@fbk.eu samantha.riccadonna@fmach.it

1 Introduction

The arising prevalence of the network paradigm [1] as the elective model for complex systems analysis in different workfields has strongly contributed in stimulating graph theoretical techniques in the recent scientific literature. Methods based on graph properties have spread through the static and dynamic analysis of different economical, chemical and biological system, computer networking, social networks and neuroscience. As a relevant example, it is worthwhile mentioning the rapid diffusion, in computational biology, of the differential network analysis [2, 3, 4, 5, 6, 7, 8, 9, 10]. In particular, two key tasks constitute the backbone of most of the aforementioned analysis techinques, namely network comparison and network classification, and they both rely on the basic idea of measuring the similarity between two graphs.

Network comparison consists in the quantification of the difference between two homogeneous objects in some network space, while the aim of network classification is to predictively discriminate graphs belonging to different classes, for instance by means of machine learning algorithms. Network comparison has its roots in the quantitative description of main properties of a graph (e.g., degree distribution), which can be encoded into a feature vector [11], thus providing a convenient representation for classification tasks (see for instance [12] for a very recent approach). As a major alternative strategy, one can adopt a direct comparison method stemming from the graph isomorphism problem, by defining a suitable similarity measure on the topology of the underlying (possibly directed and/or weighted) graphs. This line of study dates back to the 70’s with the theory of graph distances, regarding both inter- and intra-graphs metrics [13]. Since then, a wide range of similarity measures has been defined, based on very different graph indicators. To mention some of the most important metrics, we list the family of edit distances, evaluating the minimum cost of transformation of one graph into another by means of the usual edit operations (insertion and deletion of links), the family of common network subgraphs, looking for shared structures between the graphs and the family of spectral measures, relying on functions of the eigenvalues of one of the graph connectivity matrices. Similarly, graph classification can be tackled by a number of different techniques, for instance nearest neighbours on Euclidean distance of the features’ vectors of the graphs [14, 15, 16], or Support Vector Machine with the graph Laplacian as a regularization term [17], or via different subgraph-based lerning algorithms [18]. However, in general the most efficient techniques use a kernel machine, where the kernel itself corresponds to a scalar product (and hence a distance) in a suitable Hilbert space [19, 20, 21, 22, 23, 24, 25, 26, 27]. For more recent advances, we cite the Weisfeiler-Lehman graph kernel [28], and its use in neuroimaging classification for discriminating mild cognitive impairment from Alzheimer’s disease [29]. This last citation stands as an example of the increasing interest for these techniques recently appearing in neurosciences [30, 31].

In the present work we propose a novel solution to both the comparison and the classification tasks by introducing the novel HIM metric for comparing graphs (even directed and weighted) and a graph kernel induced by the HIM measure. The HIM distance is defined as the one-parameter family of product metrics linearly combining – by a non-negative real factor – the normalized Hamming distance H [32, 33, 34, 35] and the normalized Ipsen-Mikhailov distance IM [36]; the product metric is normalized by the factor to set its upper bound to 1. In absence of a gold standard driving the search for the optimal weight ratio, we decided for an equal contribution from the two components as the most natural choice. The Hamming distance is the simplest member of the family of edit distances, evaluating the occurrence of matching links in the compared networks: by definition, it is a local measure of dissimilarity between graphs, because it only focusses on the links as independent entities, disregarding the overall structure. On the other hand, the spectral distances are global measures, evaluating the differences between the whole network structures: however, they cannot discriminate between isospectral non-identical graphs: for a recent spectral approach, see [37]. In the comparative review [38], the properties of the existing graph spectral distances were studied, and the Ipsen-Mikhailov metric emerged as the more reliable and stable. The combination of the two components within a single metric allows overcoming their drawbacks and obtaining a measure which is simultaneously global and local. Moreover, the imposed normalization limits the values of the HIM distance between zero (reached only by comparing identical networks) and one (attained when comparing a clique and the empty graph), regardless of the number of vertices. Finally, the HIM distance can also be applied to multilayer networks [39, 40], since a rigorous definition of their Laplacian has just been proposed [41, 42]. By a Gaussian-like map [43], the HIM distance generates the HIM kernel. Plugging the HIM kernel [44] into a Support Vector Machine gives us a classification algorithm based on the HIM distance, to be used as is or together with other graph kernels in a Multi-Kernel Learning framework to increase the classification performance and to enhance the interpretability of the results [45]. Note that, although positive definiteness does not hold globally for the HIM kernel, this property can be guaranteed on the given training data, thus leading to positive definite matrices suitable for the convergence of the SVM optimizer.

To conclude with, we present some applications of the HIM distance and the HIM kernel to some real datasets belonging to different areas of science. These examples support the positive impact of the HIM suite as general analysis tool whenever it is required to extract information from the quantitative evaluation of the difference among diverse instances of a complex system.

We also provide for analysis the R [46] package nettools including functions to compute the HIM distance. The package is provided as a working beta version and it is accessible on GitHub at https://github.com/filosi/nettools.git. To reduce computing time, the software can be used on multicore workstations and on high performance computing (HPC) clusters.

2 The HIM family of distances

2.1 Notations

Let and be two simple networks on nodes, described by the corresponding adjacency matrices and , with , where for unweighted graphs and for weighted networks. Let then be the identity matrix , let be the unitary matrix with all entries equal to one and let be the null matrix with all entries equal to zero. Denote then by the empty network with nodes and no links (with adjacency matrix ) and by the clique (undirected full network) with nodes and all possible links, whose adjacency matrix is . For an undirected network, its adjacency matrix is symmetric. For a directed network , following the convention in [47], a link is represented by setting in the corresponding adjacency matrix , which thus is, in general, not symmetric.

For instance, the matrix represents the full directed network , with all possible directed links .

2.2 The Hamming distance

The Hamming distance is one of the most common dissimilarity measures in coding and string theory, recently used also for (biological) network comparison [32, 33, 35, 34]. Since the Hamming measure basically evaluates the presence/absence of matching links on the two networks being compared, it has a simple expression in terms of the neworks’ adjacency matrices. This is not the case for many other members of the edit distance family, whose computation is known to be a NP-hard task. The definition of the normalized Hamming distance H is in fact the following:

(1)

where the normalization factor bound the range of the function H in the interval . The lower bound is attained only for identical networks , while the upper bound is reached whenever the two networks are complementary . When and are unweighted networks, is just the fraction of different matching links over the total number of possible links between the two graphs.

2.3 The Ipsen-Mikhailov distance

Originally introduced in [36] as a tool for network reconstruction from its Laplacian spectrum, the definition of the Ipsen-Mikhailov IM metric follows the dynamical interpretation of an nodes network as an molecules system connected by identical elastic strings as in Fig. 1(a-b), where the pattern of connections is defined by the adjacency matrix of the corresponding network.

(a) (b) (c)
Figure 1: A five nodes network as a oscillatory system (a) and the corresponding adjacency matrix (b), with two different edge weights 1 and , represented by different springs. In panel (c), the product metric as a function of .

The dynamical system is described by the set of differential equations

(2)

We recall that the Laplacian matrix of an undirected network is defined as the difference between the degree and the adjacency matrices , where is the diagonal matrix with vertex degrees as entries. is positive semidefinite and singular [48, 49, 50, 51], so its eigenvalues are . The vibrational frequencies for the network model in Eq. 2 are given by the square root of the eigenvalues of the Laplacian matrix of the network: , with . In [48], the Laplacian spectrum is called the vibrational spectrum. Estimates (actual and asymptotic) of the eigenvalues distribution are available for complex networks [52], while the relations between the spectral properties and the structure and the dynamics of a network are discussed in [53, 54, 55]. The spectral density for a graph as the sum of Lorentz distributions is defined as

where is the common width and is the normalization constant defined by the condition , and thus

The scale parameter specifies the half-width at half-maximum, which is equal to half the interquartile range. An example of Lorentz distribution for two networks is shown In Fig. 5. Then the spectral distance between two graphs and on nodes with densities and can then be defined as

The highest value of is reached, for each , when evaluating the distance between and . Denoting by the unique solution of

(3)

the normalized Ipsen-Mikahilov distance between two undirected (possibly weighted) networks can be defined as

(4)

so that IM is bounded between 0 and 1, with upper bound attained only for . A detailed description of the uniqueness of the solution of Eq. 3 is described in Appendix A. Isospectral networks (and thus also isomorphic networks) cannot be distinguished by this class of measures, so this is a distance between classes of isospectral graphs. Although the number of isospectral networks is negligible for large number of nodes [56], their fraction is relevant for smaller networks. The case of directed networks is discussed in a later paragraph.

Figure 2: Representation of the HIM distance in the Ipsen-Mikhailov (IM axis) and Hamming (H axis) distance space between networks A versus B, E and F, where E is the empty network and F is the clique.

2.4 The HIM distance

Consider now two copies of the space of all simple undirected networks on nodes, and endow the first copy with the Hamming metric H and the second copy with the Ipsen-Mikhailov distance IM. Then the two obtained pairs and are metric spaces. Define now on their Cartesian product the one-parameter HIM function as the (Euclidean) product metric [57] combining H and IM, normalized by the factor , for . Via the natural correspondence of the same network in the two spaces, the HIM function becomes a distance on :

(5)

where in what follows we will omit the subscript when it is equal to one. Obviously, and (see Fig. 1(c)); apart from values of close to the bounds where the prevalence of one of the factors becomes dominant, the qualitative impact of is minimal in practice when using as a distance. In what follows, when no a priori hypothesis supports unbalancing the metric towards one of the two components, will be assumed. However, the impact of is definitely more relevant when is used to generate a kernel function to be used for classification purposes, as we will show in a later section. The metric is bounded in the interval , with lower bound attained for every couple of identical networks, and upper bound attained only on the pair . Moreover, all distances will be nonzero for non-identical isomorphic/isospectral graphs.

Consider now the Hamming/Ipsen-Mikhailov (H/IM) space, where a point has coordinates , and the distance of from the origin is . If we (roughly) split the Hamming/Ipsen-Mikhailov space into four main zones I,II,III,IV as in Fig. 2, two networks whose distances correspond to a point in zone I are quite close both in terms of matching links and of structure, while those falling in the zone III are very different with respect to both measures. Networks corresponding to a point in zone II have many common links, but their structure is rather different (for instance, they have a different number of connected components), while a point in zone IV indicates two networks with few common links, but with similar structure (e.g., isospectral non-identical graphs). In Fig. 2 we show some examples of points in the Hamming/Ipsen-Mikhailov space.

2.5 The directed network case

In this situation, the connectivity matrices are not symmetric, thus the Laplacian spectrum lies in . Hence, computing the Ipsen-Mikhailov distance would require extending the Lorentzian distribution to the complex plane. A simpler solution can be obtained by transforming the directed network into an undirected (bipartite) one , as in [47]. For each node in , the graph has two nodes and (where I and O stand for In and Out respectively) and for each directed link in there is a link in . If the adjacency matrix for is , the corresponding matrix for is , with respect to the node ordering . An example of the above transformation is shown in Fig. 3.

Figure 3: A directed network on three nodes and the equivalent undirected network on six nodes, together with their adjacency matrices.

Thus it is possible to define as after substituing the normalizing factors and with the corresponding and derived by imposing the conditions and , so that by using Eq. (5). It is immediate to compute , while can be numerically computed as for : details are given in Appendix B.

Figure 4: Adjacency matrix and graphical representation of and .

3 The HIM kernel

Following [43], a kernel can be naturally derived from a distance by means of a Gaussian (Radial Basis Function) map (see also [58]). Thus, given two graphs and on the same nodes and a positive real number , the HIM kernel can be defined as

Whenever a novel kernel is introduced, one has to check whether it is positively defined.

A function is a kernel of condionally negative type if

  1. ;

  2. ;

  3. .

A variant of Schoenberg’s theorem [59] (proved in [60, 61]) states that

Theorem 3.1

For a function , the following are equivalent:

  1. is of conditionally negative type;

  2. is a positive semidefinite kernel for all .

The above theorem describes the correspondence between negative-type distances and positive definite kernel, which is also equivalent to embeddability [62]. Hence, is a positive semidefinite kernel if and only if is a symmetric function of conditionally negative type. Although the square of many distances are condionally negative type functions, cannot be proven to be of conditionally negative type (actually, it is probably not of negative type, as it is the case for many edit distances [43, 63, 64, 65, 66], the HIM kernel is not positively defined in general for all . Nevertheless, this problem can be overcomed by using Prop. 1.3.4 in [67] (see also [58, 65]):

Theorem 3.2

Suppose the data and the kernel ) are such that the matrix

is positive. Then it is possible to construct a map into a feature space such that

Conversely, for a map Φ into some feature space , the matrix is positive.

Note that Th. 3.2 does not even require to belong to a vector space. This theorem implies that, even though the kernel is not positive definite, it is still possible to use it in Support Vector Machines or other algorithms requiring to correspond to a dot product in some space if the kernel matrix is positive for the given training data. This condition can be obtained by choosing a suitable value of : in the experiments shown hereafter, the HIM kernel is always positively defined on the given training data, leading to positive definite matrices, and thus posing no difficulties for the SVM optimizer, as in [68].

4 Applications

4.1 A minimal example

Consider the two networks with corresponding adjacency matrices shown in Fig. 4. The Hamming distance between and is

From the spectral point of view, the corresponding Laplacian matrices and eigenvalues are

From the above spectra, we can compute the corresponding Lorentz distributions , where : their plots are shown in Fig. 5.

Figure 5: Lorentzian distribution of the Laplacian spectra for (left) and (center) with vertical lines indicating eigenvalues, and in the Hamming/Ipsen-Mikhailov space (right).

The resulting Ipsen-Mikhailov distance is

so that the HIM distance results

The situation can be graphically represented as in Fig. 5: the two networks are quite different in terms of matching links, but their structures are not so diverse.

4.2 Small networks

Fixed the number of nodes , there are exactly different simple undirected unweighted networks, which can be grouped into isomorphism classes. As anticipated before, isomorphic graphs cannot be distiguished by spectral metrics, while their mutual Hamming distances are non zero, since their links are in different positions. As an example, for there are 8 networks grouped in 4 isomorphism classes, for there are 11 isomorphism classes including a total of 64 graphs and for 34 classes with 1024 networks (for , the number of classes is respectively 156 e 1044).

To give an overview of a broader situation, we compute a number of mutual distances between networks with a given number of nodes (all possible couples for and a subset of them for ) and we display the results in Fig. 6. To select a good range of variability for the networks with 15 nodes, we select the empty graph, the full graph (with 105 nodes) and 10 different graphs with edges each, for .

(a) (b) (c) (d)
Figure 6: Mutual distances between (a) all 28 couples of networks with 3 nodes, (b) all 2016 couples of networks with 4 nodes, (c) all 523776 couples of networks with 5 nodes and (d) the 542361 mutual distances between a set of 1042 networks with 15 nodes.

As shown by the plots, all possible situations can occur, apart from points in the northwest corner of zone II which are the rarest. For instance, the point in Fig. 6(b) corresponds to 6 different pairs of networks with nodes with maximal Hamming distance and minimal spectral distance: as an example, we show one of these pairs in Fig. 7.

4.3 Comparison with Matthews Correlation Coefficient

When assessing performances in a link prediction task (for instance, in the series of DREAM challenges [69, 70, 71]), the standard strategy following the machine learning approach, is to rely on functions of the confusion matrix, i.e., the table collecting the number of correct and wrong predictions with respect to the ground truth. Classical measures of this kind are the pairs Sensitivity/Specificity and Precision/Recall, and the derived Area Under the Curve.

A reliable alternative is the Matthews Correlation Coefficient (MCC for short) [72], summarizing into a single value the confusion matrix of a binary classification task. This is a measure of common use in the machine learning community [73], recently accepted as an effective metric also for network comparison [74, 75]. Also known as the -coefficient, for a contingency table MCC corresponds to the square root of the average statistic

where is the total number of observations. In the binary case of two classes positive P and negative N, for the confusion matrix , where T and F stand for true and false respectively, the Matthews Correlation Coefficient has the following shape:

MCC lives in the range , where is perfect classification, is reached in the complete misclassification case while corresponds to coin tossing classification, and it is invariant for scalar multiplication of the whole confusion matrix.

Here we want to provide a quick comparison of MCC and HIM distances in a few cases. First of all, some considerations on the extreme cases:

  • only for , which has .

  • only for ; in this case, for , and in all other cases.

  • only for , and thus HIM=0.

  • The two values or can correspond to a landscape of quite different pairs of networks, for which the HIM distance can assume very diverse values.

To investigate the last case in the above list, we randomly generated 250,000 pairs of networks of different size, and we compared the MCC with the H, IM and HIM distances: the corresponding scatterplots are shown in Fig. 8. Since MCC is a similarity measure, for a direct comparison we displayed it as the -normalized dissimilarity measure .

As predictable, since the confusion matrix is unaware of the network structure but it takes into account only matching and mismatching links, the MCC is well correlated with the Hamming distance (Pearson Coefficient PC=0.92) and poorly correlated with the Ipsen-Mikhailov distance (PC=0.01), resulting in an good global correlation with the HIM distance (PC=0.79). Nonetheless, the plots in Fig. 8 show that the relevant variability of one measure for a given value of the other one supports the claim of a strong independency between MCC and HIM. Finally, as an example giving a quantitative basis to the last claim of the above list, for all the pairs with we obtain values of HIM ranging in , with median 0.37 and mean 0.39, while when the range of the HIM values is , with mean and median equal to 0.74.

Figure 7: The six pairs of networks on four nodes with Hamming distance one and Ipsen-Mikhailov distance zero.

4.4 Dynamical networks

In what follows we show the evolution of the Hamming, the Ipsen-Mikhailov and the HIM distance during the evolution of the following dynamical processes moving through consecutive steps:

  • Random Addition is obtained from by randomly adding a not already present link.

  • Random Removal is obtained from by randomly removing an existing link.

  • Sequential Addition is obtained from by adding a new link in the same row and in the next available column of the last added link, if possible, or in the following row, starting from the first available column. The whole process starts from the first available row with the smallest index. As an example, if , then the process evolves inserting ones in the adjacency matrix following the sequence in .

  • Sequential Removal : as in , but removing one link at each step.

  • Highest Degree Addition is obtained from by adding a previously not existing link connecting the node with the highest degree.

  • Highest Degree Removal is obtained from by removing an existing link connecting the node with the highest degree.

As a first example, consider the processes and with the empty graph as starting network . They both end at the N-nodes clique after steps: . The corresponding inverse processes and evolve in the opposite direction: and . In Fig. 9 we show the curves of for =H, IM and HIM in the cases 10, 25 and 100 nodes.

For the representation of the curves, we use two different spaces: the already introduced Hamming/Ipsen-Mikhailov space, with the metric H on the axis and the metric IM on the axis, and the Fraction-of-nodes/HIM space, with the ratio between the number of newly added or removed links over the total number of links on the axis and the HIM distance on the axis; in this representation, the -th step of a process has coordinates . In all cases, since one edge is removed or added at each step, in both spaces the evolution of the processes proceeds from left to right in both graphs, and the trend of the curves representing the distances of the same process in the Hamming/Ipsen-Mikhailov space and in the Fraction-of-nodes/HIM space are similar, varying only for a scaling factor.

(a) (b) (c)
Figure 8: Scatterplot of versus Hamming (a), Ipsen-Mikhailov (b) and HIM (c) distances when comparing 250,000 random pairs of networks of different size 3-100.

For the random processes and we show the means of the distances computed on 100 runs; no standard deviation or confidence intervals are plotted, because they are negligible at the scale of the plot. For instance, in the case 25, the order of magnitude of the standard deviation for HIM at each step is , and the span of the 95% boostrap confidence intervals is in the range of . As a first observation, all curves are monotonically increasing and the bigger the graph, the larger the distances, but in the empty-to-clique case, where, in the second half of the process, induces distances which are smaller than and which are smaller for larger graphs. The most interesting observation is the different shape of the curves between the empty-to-clique process and the clique-to-empty: for the same Hamming distance (or fraction of links), the corresponding Ipsen-Mikhailov (or HIM, respectively) distance is larger when the nodes are added rather than removed, because adding links quickly generates degree correlation. Furthermore, in the empty-to-clique case, not much difference occurs between the random and the sequential process, while this difference is much wider (with the random one inducing larger distances) for the clique-to-empty case.

 
(a) (b) (c) (d)
Figure 9: Distances between and for the processes evolving from the empty network to the clique (a and b) or vice versa (c and d), in the Hamming/Ipsen-Mikhailov space (a and c) or plot of the HIM distance as a function of the ratio of added/removed links (b and d), for 10 (black), 25 (blue), 100 (red) nodes. Solid lines denote the average of distances for 100 runs of random evolution, while dashed lines denote the sequential processes and . In all cases, the process evolves from the left-bottom corner to the right-top corner.

An analogous experiment was carried out within the family of Poissonian graphs, with Erdös-Rényi model [76, 77] . In particular, for 10, 25 and 100, let be a sparse network , with 2, 11 and 230 edges respectively and let be a dense network , with 39, 275 and 4462 edges respectively. Consider the following four processes, of which we represent the initial steps in Fig. 10:

  • , for , with , for =10, 25, 100.

  • , for , with , for =10, 25, 100.

  • , for , with , for =10, 25, 100.

  • , for , with , for =10, 25, 100.

In this case, too, results on the random processes are averaged over 100 runs, with negligible confidence intervals. To better highlight the differences of the resulting distances in the various processes, in Fig. 10 (c) and (d) we show the ratio of some pairs of HIM distances as a function of the removed/added links. In particular, in subfigure (c), for each step , we show the quotient of HIM distances for over , in the three cases 10, 25 and 100. The three curves show that HIM distances for are larger than the HIM distances for for =25 and 100, and their difference is higher in the first steps of the process , while they tend to get closer as far as the processes evolve. In the other cases and (not shown here), the differences are smaller and they converge faster to one, but in this case the process accounts for the smaller values of HIM distances. In the plot (c) of Fig. 10, we show the curves for as a function of . All the three curves are monotonically decreasing and converging to one after the first stages of the processes, yielding that, for all values of , adding links produces higher values of HIM distance. In the case of the evolution targeting higher degree nodes first (not shown here), the trend is the same, only scaled down to smaller ratios.

(a) (b) (c) (d)
Figure 10: Plot of H, IM and HIM distances between and for (a) and (b), for 10 (black), 25 (blue), 100 (red) nodes. Solid lines denote the average of distances for 100 runs of and , while dashed lines identify and . In all cases, the process evolves from the left-bottom corner to the right-top corner. In panel (c), plot of as a function of and, in panel (d), plot of as a function of .

The final examples consider processes having scale free networks as starting graphs. For =10, 25 and 100, let be a scale free sparse network generated following the Albert-Barabasi model [78], with power law exponent 2.3 and with 9, 24 and 99 edges rispectively, and a dense network with the same exponent 2.3 but with 35, 300 and 4150 edges respectively. The same four processes of the previous case were tested for the initial steps:

  • , for , with , for =10, 25, 100.

  • , for , with , for =10, 25, 100.

  • , for , with , for =10, 25, 100.

  • , for , with , for =10, 25, 100.

The corresponding curves are plotted in Fig. 11 (a) and (b). We recall here that scalefree networks are not invariant for percolation, i.e., they do not remain scalefree when links are randomly removed or added. However, the evolution of the processes is not very different from the Erdös-Rényi case, especially for the processes removing links as shown in Fig. 11, panel (b). Some differences emerge for the processes adding links, and a few peculiarities that are also present in the Poissonian case here become more evident. In particular, for all and for both and the derivative of the curves are larger than those in panel (b), and it is not true anymore that the larger the number of nodes, the larger the distances. For instance, in the case , both the processes quickly modify the network structure, resulting in a fast increment of the Ipsen-Mikhailov distance for , while later the curves grow at a much smaller rate. To better study this behaviour, a larger starting network was generated following the scale free model in [79], with 200 nodes and 1000 edges, power law exponent 2.001 and degree distribution as in the histogram of Fig. 11, panel (c). The following processes were started from , and they were carried on until they reach either the empty network or the clique:

  • , with and .

  • , with and .

  • , with and .

  • , with and .

The curves corresponding to HIM distances from in the four aforementioned processes are plotted in Fig. 11, panel (d), versus the percentage of progress of the process, i.e., , with and , . The HIM distance for the processes and are monotonically and similarly increasing when evolving from to , slower at the beginning and much faster in the last steps of the process. The two other processes instead show the same effect previously noted: and change rapidly in the initial 10% of the processes, yielding a fast increase in the Ipsen-Mikhailov distance, due to the quick modification in the network structure. After this initial period, the growth of both curves proceed with a smaller derivatives until they reach their maximum at the end of the process.

(a) (b) (c) (d)
Figure 11: Plot of Hamming versus Ipsen-Mikhailov distance for and for (a) and (b), for 10 (black), 25 (blue), 100 (red) nodes. Solid lines denote the average of distances for 100 runs of and , while dashed lines identify and . In all cases, the process evolves from the left-bottom corner to the right-top corner. (c) Histogram of the node degrees of the network . (d) Hamming distances versus percentage of processes steps for , , and .

4.5 Graph families

In this section we investigate the distribution of the distances from the empty network of a set of graphs randomly extracted from five families. In particular, for each = 10, 20, 50, 100 and 1000 we extracted 1000 networks on nodes from each of the following class of graphs:

  • BA Barabasi-Albert model [78], with power of preferential attachment extracted from the uniform distribution between 0.1 and 10.

  • ER Erdös-Rényi model [76, 77], with link probability extracted from the uniform distribution between 0.1 and 0.9.

  • WS Watts-Strogatz model [80], with neighborhood within which the vertices of the lattice will be connected uniformly sampled in and rewiring probability extracted from the uniform distribution between 0.1 and 0.9.

  • PL Scale-free random graphs from vertex fitness scores [79], with number of edges uniformly sampled between 1 and and power law exponent of the degree distribution extracted from the uniform distribution between 2.005 and 3.

  • KR Random regular graphs, with all possible values of node degree.

In Tab. 2 we list mean and standard deviation of for all combinations of node size and network type: note that we do not report the corresponding median, because its distance from the mean is always smaller than 0.02 nor the bootstrap confidence intervals, whose range is always smaller than 0.02 from either side of the mean. In Fig. 12 we also show the corresponding boxplots, while in Fig. 13(a) we display the scatterplot in the Hamming/Ipsen-Mikhailov space of all the aforementioned distances. In the Hamming/Ipsen-Mikhailov space all the BA nets are confined in the narrow rectangle , while all other classes of graphs span a much wider area. In particular, the points corresponding to distances of the PL nets occupy densely all the upper left triangle of the H/IM plane, and the same happens, with , also for the ER networks, while WS and KR points lie in the upper rectangle . Thus, different PL networks show very different stucture, while the BA nets are very homogeneous. Notably, no point occurs in the lower right corner of the H/IM space. Moreover, in average, the standard deviation decreases inversely with the network size, showing larger homogeneity in bigger networks.

BA PL
ER KR
WS All
Figure 12: Boxplots of the for all combinations of node size and network type.

Finally, to better highlight the difference among the diverse families, we randomly extracted 100 networks with 100 nodes for the four families BA, ER, WS and PL and we computed the mutual distances between all possible pairs of these 400 graphs. A few statistics of these HIM distances are reported in Tab. 1, while the planar multidimensional scaling plot [81] is displayed in Fig. 13(b). Apart from the PL networks, the three families BA, ER and WS can be mutually well separated as shown in the multidimensional plot; moreover, the graphs in the BA and in the WS families are mutually quite similar, as supported by the small interclass mean HIM distance. On the other side, the PL networks have essentially the same distance from all other groups, so they cannot be easily distiguished.

BA ER WS PL
BA 0.05 (0.06) 0.69 (0.10) 0.47 (0.13) 0.65 (0.17)
ER 0.50 (0.13) 0.56 (0.15) 0.51 (0.14)
WS 0.29 (0.12) 0.56 (0.18)
PL 0.50 (0.17)
Table 1: Mean and standard deviation of the HIM distances between the 100-nodes graphs in the four families BA, ER, WS and PL; each family includes 100 networks.
N 10 20 50 100 1000 All
T
BA 0.53 0.01 0.51 0.01 0.50 0.02 0.49 0.02 0.50 0.02 0.51 0.02
ER 0.69 0.18 0.73 0.12 0.77 0.09 0.77 0.08 0.76 0.08 0.74 0.12
WS 0.91 0.14 0.76 0.15 0.64 0.08 0.62 0.06 0.62 0.05 0.71 0.16
PL 0.72 0.19 0.72 0.19 0.75 0.15 0.74 0.14 0.72 0.11 0.73 0.16
KR 0.60 0.11 0.54 0.10 0.50 0.05 0.49 0.00 0.48 0.00 0.52 0.08
All 0.69 0.19 0.65 0.17 0.63 0.15 0.62 0.14 0.62 0.13 0.64 0.16
Table 2: Mean and standard deviation of HIM distances from the empty network for all combinations of network type T and network size N, and cumulatively across node sizes and graph classes.
(a) (b)
Figure 13: (a) Scatterplot in the Hamming/Ipsen-Mikhailov space of the for all combinations of node size and network type; (b) Multidimensional Scaling of the mutual HIM distances of 400 networks with 100 nodes in the BA, ER, WS and PL families.

4.6 The D. melanogaster development dataset

In [82], the authors used the Keller algorithm to infer the gene regulatory networks of Drosophila melanogaster from a time series of gene expression data measured during its full life cycle, originally published in [83]. They followed the dynamics of 588 development genes along 66 time points spanning through four different stages (Embryonic – time points 1-30, Larval – t.p. 31-40, Pupal – t.p. 41-58, Adult – t.p. 59-66), constructing a time series of inferred networks , publicly available at http://cogito-b.ml.cmu.edu/keller/downloads.html. Hereafter we evaluate the structural differences between and the initial network , as measured by the HIM distance: the resulting plot is displayed in Fig. 14. The largest variations, both between consecutive terms and with respect to the initial network , occur in the embrional stage (E): in particular, the HIM distance grows until time points 23, then the following networks start getting closer again to , showing that the interactions of the selected 588 genes in the adult stage are more similar to the corresponding net of interaction in the embrional stage, rather than in the other two stages. Moreover, while Hamming distance ranges between and , the Ipsen-Mikhailov distance has as its maximum, indicating an higher variability of the networks in terms of structure rather than matching links. Finally, using a Support Vector Machine with HIM kernel built in the kernlab package in R, a 5-fold Cross Validation with and reached accuracy 0.97 in discriminating Embryonic and Adult networks from Larval and Pupal, while, in the same setup, we reach perfect separation between Embryonic and Adult stages for all values of larger than 1000.

(a) (b) (c)
Figure 14: (a) Evolution of distances of the D. melanogaster development gene network time series in the Hamming/Ipsen-Mikhailov space with zoom (b) on the final timepoints and (c) evolution of same HIM distances along 66 time points in the 4 stages Embryonic (E), Larval (L), Pupal (P) and Adult (A).

4.7 The HCC dataset

Publicly available at the Gene Expression Omnibus (GEO) http://www.ncbi.nlm.nih.gov/geo, at the Accession Number GSE6857, the HepatoCellular Carcinoma (HCC) dataset [84, 85] collects 482 tissue samples from 241 patients affected by HCC, a well-studied pathology [86, 87] where the impact of microRNA (miRNA) is notably relevant [88, 89]. For each patient, a sample from cancerous hepatic tissue and a sample from surrounding non-cancerous hepatic tissue are available, hybridized on the Ohio State University CCC MicroRNA Microarray Version 2.0 platform collecting the signals of 11,520 probes of 250 non-redundant human and 200 mouse miRNA. After a preprocessing phase including imputation of missing values [90] and discarding probes corresponding to non-human (mouse and controls) miRNA, we consider the dataset HCC of 240+240 paired samples described by 210 human miRNA, with the cohort consisting of 210 male and 30 female patients. We thus parted the whole dataset HCC into four subsets combining the sex and disease status phenotypes, collecting respectively the cancer tissue for the male patients (MT), the cancer tissue for the female patients (FT) and the corresponding two datasets including the non cancer tissues (MnT, FnT). Then we first generated the four co-expression networks on the 210 miRNA as vertices, inferred via absolute Pearson’s correlation and corresponding to the combinations of the two binary phenotypes, and we computed all mutual HIM distances. In particular, to show the possible effects due to the different sample size, we computed 30 instances of the MT and MnT networks, inferred using only 30 matching samples and then averaging all the mutual HIM distances. One instance of MT and MnT is displayed as an hairball in Fruchterman-Reingold layout [91] together with the nets FT and FnT. The corresponding two-dimensional scaling plot [81] in the right panel of the Figure 15. The four networks are widely separated, with orthogonal separations for the two phenotypes, but the values of the HIM distances between the network support the known different development of HCC in male and female: for instance, the FT network is closer to the MnT net (HIM=0.08), rather than to the MT and FnT (HIM=0.13 and 0.16, respectively). Note that the largest distance (HIM=0.23) is detected between the two non-tumoral networks MnT, FnT.

MT a MnT g
FT a FnT g
Figure 15: (left) Fruchterman-Reingold layout of the networks MT, MnT, FT, FnT where the male networks are inferred by one random extraction of 30 samples out of the whole cohort of 210 patients. Node size is proportional to node degree. (right) Multidimensional scaling plot, with mutual HIM distances, of the networks FT, FnT, MT, MnT.

An expanded version of the example is shown in [92, 93], where more networks are generated from the same dataset using different inference algorithms and a stability analysis is performed.

4.8 The Gulf Dataset

Part of the Kansas Event Data System, available at http://vlado.fmf.uni-lj.si/pub/networks/data/KEDS/, the Gulf Dataset collects, on a monthly bases, political events between pairs of countries focusing on the Gulf region and the Arabian peninsula for the period 15 April 1979 to 31 March 1999, for a total of 240 months. Political events belong to 66 classes (including for instance ”pessimist comment”, ”meet”, ”formal protest”, ”military engagement”, etc.) and involve 202 countries. This dataset formally translates into a time series of 240 unweighted and undirected graphs with 202 nodes, for which we computed all the mutual HIM distances. These distances are then used to project the 240 networks on a plane through a multidimensional scaling [81]: the resulting plot is displayed in Fig. 16. The months corresponding to the First Gulf War months (July 1990 - April 1991) are close together and confined in the lower left corner of the plane, showing both a mutual high degree of homogeneity and, at the same time, a relevant difference to the graphs of all other months. This shows that, at the onset of the conflict, the diplomatic relations worldwide changed consistently and their structure remained very similar throughout the whole event. Note that the blue point (closer to the war-like period) corresponds to February 1998, the time of Iraq disarmament crisis: Iraqi President Saddam Hussein negotiates a deal with U.N. Secretary General Kofi Annan, allowing weapons inspectors to return to Baghdad, preventing military action by the United States and Britain.

Figure 16: Planar HIM distance based multidimensional scaling plot of the monthly Gulf Dataset. Red dots corresponds to the First Gulf War months (July 1990 - April 1991), while grey points correspond to months outside that temporal window and the blue point corresponds to February 1998, the month of the Iraq disarmament crisis.

4.9 The International Trade Network data

As an application of the HIM distance on directed and weighted networks, we show four examples based on the International Trade Network (ITN) data, version 4.1, by Gledisch [94] available at http://privatewww.essex.ac.uk/∼ksg/exptradegdp.html, collecting estimates of trade flows between independent states (1948-2000) and GDP per capita of independent states (1950-2000). As noted by [95], due to differences in reporting procedures between countries, incongruences occur between exports from to and imports from to : to avoid this issues, in our analysis we only use the figures reported as export in the dataset.

In what follows, we extract four sets of countries, and we study the evolution of their trade subnetworks during the aforementioned period. In each example, chosen the set of countries , we construct, for every year, the weighted directed network having as nodes. A link between country and country represents the export from to , and its weight corresponds to the volume of the export flow. Then we compute all mutual HIM distances among these networks, first rescaling link weights in the unit interval. Finally, using these HIM distances we construct a planar classical Multidimensional Scaling plot, transforming the networks in a set of points such that the distances between the points are approximately equal to the mutual HIM dissimilarities, using the methods in [96, 97, 98, 81] as implemented in R. The aim here is to connect the structural changes in yearly trade networks with time periods and events having a role in explaing such changes. Note that in [95], the authors show that bilateral trade fulfills fluctuation-response theorem [99], stating that the average relative change in import (export) between two countries is a sum of relative changes in their GDPs. This result yields that directed connections, i.e., bilateral trade volumes, are only characterized by the product of the trading countries’ GDPs.

As a first example we present the BRICS countries case. Introduced in 2001, the acronym BRICS collects the five nations Brazil, Russia, India, China and South Africa (Fig. 17(a)) which, although developing or newly industrialized countries, are distinguished by their large and fast-growing economies and by their significant influence on regional and global affairs. To such aim, in Fig. 17 we show the bidimensional scaling of their trade networks for the years 1950–2000, with the HIM matrix as the distance constraint. As shown by the plot, three groups of years can be clearly divided, thus yielding that the corresponding networks are similar within each group, but diverse across different groups: the early years recovering after WWII (until about 1963), the seventies and eighties, where the economies of the involved countries started to develop, and the nineties, where their growth begun to accelerate.

BRA RUS IND CHN SAF
(a) (b)
Figure 17: (a) Maps and flags of the BRICS countries. (b) Multidimensional Scaling of the HIM distances among the intertrade networks of the BRICS countries in the periods 1950–1963 (black), 1964–1990 (blue), 1991–2000 (red).

A very similar situation occurs in the regional trade network among the South American countries (Fig. 18), where the global behaviour is essentially controlled by the two local giants Brazil and Argentina, and for which the larger differences between the nets can be appreciated between the economic growth of the 90s and the suffering economies in the late 70s / early 80s due to the struggling political situations.

PAN COL VEN GUY SUR ECU PER
BRA BOL PAR CHI ARG URU
Figure 18: Maps (left) and flags (bottom) of the Latin America countries. (right) Multidimensional Scaling of the HIM distances among the intertrade networks of the countries in the Latin America in the periods 1948–1959 (black), 1960–1969 (green), 1970–1990 (blue), 1991–2000 (red).

Not much different is the case of the larger trade subnetworks of the top 20 world economies ranked by Gross Domestic Product 2012 (PPP) (Top20 for short) as listed by the World Bank http://data.worldbank.org and shown in Fig. 19, with the notable difference that the networks for the 60s are more homogeneous to those of the 70s and 80s, supporting a faster recovery of these economies after WWII than the BRICS or t