Characterization of graphs for protein structure modeling and recognition of solubility
This paper deals with the relations among structural, topological, and chemical properties of the E.Coli proteome from the vantage point of the solubility/aggregation propensity of proteins.
Each E.Coli protein is initially represented according to its known folded 3D shape.
This step consists in representing the available E.Coli proteins in terms of graphs.
We first analyze those graphs by considering pure topological characterizations, i.e., by analyzing the mass fractal dimension and the distribution underlying both shortest paths and vertex degrees. Results confirm the general architectural principles of proteins.
Subsequently, we focus on the statistical properties of a vector representation of these graphs, composed of several numerical features extracted from their structure.
We find that protein size is the main discriminator of solubility, although other factors also help explain the solubility degree.
We finally analyze the data with a novel one-class classifier, with the aim of discriminating between highly soluble and poorly soluble proteins. Results are encouraging and consolidate the potential of pattern recognition techniques when employed to describe complex biological systems.
Index terms— Protein analysis; Graph representation; Descriptors and complexity measures for graphs; One-class classification.
The delicate balance between solubility and aggregation of protein molecules is a key factor in both protein physiology and disease causation. Solubility allows proteins to work as single molecular systems in the cell without leaving equilibrium in the form of large aggregates, as happens with artificial polymers; it is therefore at the very basis of their function as natural and remarkably efficient “nano-machines”. Highly soluble proteins easily reach their native 3D state (the thermodynamically stable state [31, 25]). On the other hand, individual protein molecules must interact with other proteins in order to generate highly coordinated supra-molecular systems that carry out complex tasks, such as DNA duplication and biosynthetic pathways involving ordered sequences of reactions. From a basic chemico-physical perspective, we can consider solubility as driven by the formation of intra-molecular links (i.e., the molecule folds in order to acquire an energetically favorable configuration in solution), while aggregation is mainly driven by inter-molecular links. Both intra- and inter-molecular links have the same chemico-physical nature (hydrogen bonds, van der Waals forces, etc.). Besides the physiological aggregation needed for generating supra-molecular systems, proteins can undergo a “pathological” aggregation, which is linked to the onset of various diseases. The above points illustrate both the difficulty of discriminating between solubility and aggregation propensities and the theoretical and practical importance of the problem.
Niwa et al. [52, 53] produced a dataset that we consider by far the most unbiased factual basis for addressing solubility/aggregation from a global perspective. The dataset comprises the entire proteome of E.Coli, whose individual elements (protein molecules) were assayed for their relative solubility under standardized cell-free conditions, thus offering an unbiased measure of solubility for each protein. The more soluble a protein is, the smaller the fraction that precipitates into insoluble aggregates: this implies that we can consider the Niwa et al. solubility data as a measure of the solubility/aggregation properties of single molecules. This was aptly stated by Agostini et al., who approached the same dataset by means of a sequence-based methodology. Here we try to go beyond the purely discriminative task: the quest for a “global signature” of the protein solubility/aggregation balance is taken as a vantage point for testing different network-based representations of protein 3D structures, ranging from pure topology to labeled graphs enriched with a chemico-physical description of the amino acid residues. The emerging correlation structure among different graph invariants confirms some established architectural principles of proteins, such as their fractal and modular nature [24, 7, 27, 41, 65, 64, 23, 17]. On the other hand, the strict dependence of protein structural, topological, and solubility properties on size confirms a general principle of nanomaterials [37, 60], whose position in between the molecular and bulk-material worlds makes their physico-chemical behavior strictly dependent on size. Proteins and supra-molecular aggregates live at the nanoscale, and the recognition of a strict size dependence of their properties is evidence for the possible cross-fertilization of protein and nanomaterials science.
Providing descriptive characterizations of complex systems has a long history in the field of (complex) dynamical systems and chaos theory [68, 50, 32]. Describing (complex) systems by means of graphs is ubiquitous in modern science and engineering disciplines [51, 19, 18, 12, 56, 26, 69, 15, 22, 33]. In fact, graphs offer a sound mathematical framework to describe the relations/causality among the interacting elements of the system under analysis. Characterizing a graph by numerical values (e.g., numerical features/descriptors) is usually based on suitable graph-theoretic results, adaptation of information-theoretic concepts, or by exploiting interpretations in terms of dynamical processes (such as diffusion, percolation, and state transition). The profound multi-disciplinary character of the topic produced a number of interpretations of similar concepts, although by exploiting several different techniques. It is possible to cite approaches involving fractal analysis, percolation, and (anomalous) diffusion [10, 36, 61, 20, 62]. It is worth citing also the more recently-proposed graph characterization of Escolano et al.  (see also  for related material), called flow complexity (based on thermodynamic depth ). Random walk based approaches are also highly popular. For instance, random walks are used to model the interaction of “information waves” in a graph , which is useful to describe at the same time the spread and the interaction of information over time. Analogously, the concepts of hitting (average probability that two random walkers are in the same state at the same time) and commute time (average return time to the initial state) in random walks have been used by Qiu and Hancock  to characterize graphs for pattern recognition purpose. Other approaches include computation of Ihara coefficients and Laplacian spectrum [58, 43]. 
Finally, various interpretations of the concept of network entropy are studied , such as entropy of continuous time quantum walks [49, 4], network ensemble entropy [11, 4], network transfer entropy , von Neumann (quantum) entropy [35, 72], and fuzzy entropy .
In this paper, we exploit several graph-based techniques to model protein structures and to recognize the solubility of the E.Coli proteins from Niwa et al. The analysis is based on a dataset that we elaborated previously. Initially, we provide a purely topological characterization of the graph structures by exploiting fractal analysis techniques and by considering the underlying distributions of both shortest paths and degrees. Subsequently, we elaborate a new, vector-based representation of the graphs. For this purpose, we extract 15 different features (graph characteristics). This new protein representation is then analyzed with the aim of understanding the statistical relations between the structure and the solubility degree of the E.Coli proteins. First, we provide a factor analysis of the dataset of 15 features. Results confirm the well-known fact that protein size is the most important factor. We then complement the linear correlation structure by analyzing the non-linear relations via the estimation of the mutual information. Finally, we address the problem of discriminating between highly soluble and poorly soluble proteins. Notably, we deal with the problem in the one-class classification setting. From a bioinformatics viewpoint, the paper demonstrates how graphs are powerful modeling and computational frameworks, allowing for a hybrid style of analysis that links purely statistical and content-related features of complex systems.
The remainder of the paper is structured as follows. Section 2 describes the protein representation in terms of labeled graphs. In Section 3 we introduce the features (characteristics) that we have extracted to represent each graph as a real-valued vector. In Section 4 we discuss the experimental results. Section 5 concludes the paper. Finally, Appendix A provides the essential details of the considered graph characteristics.
2 Graph Representation of E.Coli Proteins
In our previous work, we elaborated the E.Coli data of Niwa et al. [52, 53] by constructing a graph representation. We were able to retrieve from the Protein Data Bank (PDB) the 3D structure of 454 of the 3173 proteins originally provided by Niwa et al. We refer to this dataset of graphs as DS-G-454. The contact graph of a protein is constructed by mapping each amino acid residue to a vertex. An edge is added between two amino acids if the Euclidean distance between their centers of mass in 3D space is between 4 and 8 Å, so as to filter out trivial contacts among residues that are neighbors along the sequence. Vertices are equipped with attributes (labels) defined as the three principal components (PCs) derived from the analysis of several chemico-physical properties of amino acids. Edges are labeled with the Euclidean distance between the corresponding residues.
Originally, these proteins are associated with a continuous solubility degree, which, after a straightforward normalization, can be considered as a number in $[0, 1]$, with 0 indicating the lowest solubility and 1 the highest in the original data. Fig. 1 shows the solubility degree of the proteins arranged in increasing order. Although solubility is a chemical property defined within a continuous domain, in DS-G-454 there is an evident demarcation between two classes of proteins: those that are very soluble and those that are highly insoluble. In particular, we have 77 highly soluble proteins and 377 with low solubility propensity. This fact allows us to make the reasonable assumption that the proteins in DS-G-454 are members of two distinct classes: soluble and insoluble. This distinction will be studied in the following with the aim of providing a statistical and structural characterization of the E.Coli proteins with respect to the solubility property.
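The contact-graph construction described above can be sketched as follows. This is a minimal illustration, not the pipeline used to build DS-G-454: the function name `contact_graph`, the inclusive treatment of the 4 and 8 Å bounds, and the toy coordinates are assumptions made here for the example.

```python
import numpy as np

def contact_graph(coords, d_min=4.0, d_max=8.0):
    """Build a contact graph from residue centers of mass.

    coords: (n, 3) array of residue centers of mass, in angstroms.
    Returns the 0/1 adjacency matrix and a matrix of edge weights
    (Euclidean distances on edges, 0 elsewhere).
    Assumption: both distance bounds are treated as inclusive.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    adjacency = (dist >= d_min) & (dist <= d_max)
    np.fill_diagonal(adjacency, False)  # no self-loops
    weights = np.where(adjacency, dist, 0.0)
    return adjacency.astype(int), weights

# Toy example: four collinear "residues" spaced 3 angstroms apart.
toy = np.array([[0.0, 0, 0], [3.0, 0, 0], [6.0, 0, 0], [9.0, 0, 0]])
A, W = contact_graph(toy)
```

In the toy chain, consecutive residues (3 Å apart) are not linked, which mirrors the filtering of trivial contacts between sequence neighbors; only the pairs at 6 Å are connected.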
3 The Considered Graph Features
This section briefly introduces the considered features (graph characteristics) that we extracted from the graphs in DS-G-454. Technical details are reported in Appendix A.
Each graph in DS-G-454 is described as a real-valued vector of 15 components, hence forming a dataset that we refer to as DS-C-454. The features are: number of vertices (V), number of edges (E), number of chains in the molecule (C), radius of gyration (RG), porosity (P), modularity of the graph (M), average closeness centrality (ACC), average degree centrality (ADC), average clustering coefficient (CL), energy (EN) and Laplacian energy (LEN) of the graph, heat trace (HT) and heat content invariant (HCI), the ambiguity (A), and finally the entropy of the stationary distribution (H).
Taken together, these features cover different aspects of proteins represented as graphs. The number of vertices and edges (i.e., the number of residues and of related contacts) accounts for the most basic characterization of proteins: their size. Nonetheless, as we will see in the following, protein size is one of the most important factors when considering the solubility/aggregation propensity. The number of chains is an important protein characteristic, since the different chains of a protein fold separately and bind together to function only at the end of the process. RG and P are two features describing the “shape” of a protein: RG describes the distribution of residues around the center of mass of the whole molecule, while P describes the compactness of the structure (it is a measure of the empty space within the molecule boundaries). M, ACC, ADC, and CL are both local and global structural descriptors that characterize the topology of the graphs (in terms of paths and local structure). In particular, M has been computed by using the k-means algorithm (best k strategy is used, with ). EN, LEN, HT, and HCI are more sophisticated features extracted from the spectrum of the matrix representations of the graphs (adjacency and Laplacian matrices). HT is computed for (as a compromise among local and global description) while HCI for (we consider the first coefficient only). These features have a long history in chemistry and the physical sciences [66, 34, 40], since they have been used to model real processes such as diffusion and percolation. Finally, A provides a measure of the irregularity of the graph (how much its structure differs from that of a regular graph), and H synthetically describes the unpredictability of a Markovian random walk on the graph; a complete graph induces a transition matrix following the uniform distribution, which is the most uncertain distribution. The unpredictability is quantified as the Rényi entropy of order 2 of the stationary distribution.
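As an illustration of the feature H, the sketch below computes the order-2 Rényi entropy of the stationary distribution of a simple random walk. It assumes an unweighted, connected, undirected graph, for which the stationary probability of a vertex is proportional to its degree; whether the original feature uses edge weights is not stated here, so the function `renyi2_stationary_entropy` should be read as a hypothetical simplification.

```python
import numpy as np

def renyi2_stationary_entropy(adjacency):
    """Order-2 Renyi entropy of the random-walk stationary distribution.

    Assumption: simple random walk on a connected, undirected,
    unweighted graph, so that pi_i = deg(i) / sum of all degrees.
    """
    A = np.asarray(adjacency, dtype=float)
    degrees = A.sum(axis=1)
    pi = degrees / degrees.sum()
    return float(-np.log(np.sum(pi ** 2)))

# A complete graph yields the uniform (most uncertain) distribution.
complete4 = np.ones((4, 4)) - np.eye(4)
# A star graph concentrates probability on the hub.
star4 = np.zeros((4, 4))
star4[0, 1:] = 1
star4[1:, 0] = 1

h_complete = renyi2_stationary_entropy(complete4)
h_star = renyi2_stationary_entropy(star4)
```

For the complete graph on four vertices the value equals log 4, the maximum attainable with four states, while the star graph gives a lower value because the walk spends half of its time on the hub.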
4 Protein Structure Analysis and Recognition of Solubility
In this section, we study different aspects of the 454 E.Coli proteins represented according to both DS-G-454 and DS-C-454. First, we study two important topological properties of the E.Coli graphs in DS-G-454: (i) the fractal dimension of the embedded graphs and (ii) the distributions of both shortest paths and vertex degrees (Sec. 4.1). In Sec. 4.2, we study the statistical relations among the elaborated 15 graph features (data in DS-C-454). First we look for a more interpretable linear correlation structure by means of a factor analysis (4.2.1). Then, we move to a non-linear setting by estimating the mutual information among the variables (4.2.2). Finally, in Sec. 4.3 we analyze DS-C-454 in terms of recognition/discrimination capability of soluble/insoluble proteins.
4.1 Topological Structure of E.Coli Proteins
4.1.1 Fractal Dimension of Proteins from Radius of Gyration
Here we discuss the mass fractal dimension (MFD) determined from the scaling between the mass of the residues and the RG of the molecule [24, 27]. MFD can be computed for a single protein or for a collection of proteins. In the second case, it is possible to derive the MFD, , by studying the scaling among RG, , and the mass/length, , of the polymer: . Fig. 2 shows the log-log plots that we used to determine the MFDs. We performed the experiments both by separating soluble and insoluble proteins and by considering all proteins in DS-G-454 at the same time. MFD of soluble proteins is ; for the insoluble is ; considering all proteins we have . At first, one might conclude that the MFD offers an interesting mechanism for discriminating soluble and insoluble proteins. However, the low coefficients of determination ( for soluble; for insoluble) suggest that such a feature cannot be considered a reliable class discriminator, since intra-class agreement on this measure is not strong. In fact, although RG conveys important information (see Sec. 4.2.1 and 4.2.2), it does not offer a striking discrimination rule for solubility.
We also computed the MFD for each single protein (data not shown) by analyzing the scaling , hence considering the mass, , of the atoms falling within concentric spheres of characteristic length . Spheres are centered at the center of mass of the molecule. On average, we obtained an MFD of , which is somewhat smaller than the one obtained with the RG. This difference could be explained by the modest goodness of fit observed in the scaling between the number of vertices and RG. Nonetheless, considering the average size of the proteins at hand, this value is in agreement with the calculations reported in the literature [24, 27].
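A log-log fit of this kind can be sketched as below. The example assumes the common convention in which RG grows as the number of residues raised to the power 1/D, so that the MFD is the reciprocal of the fitted slope; the exact form used for Fig. 2 is not recoverable from the text, and the function name and the synthetic data are purely illustrative.

```python
import numpy as np

def mass_fractal_dimension(n_residues, rg):
    """Estimate a mass fractal dimension D from a log-log linear fit.

    Assumption: RG ~ N**(1/D), i.e. log RG = (1/D) log N + const.
    Returns (D, R^2), where R^2 is the coefficient of determination.
    """
    x = np.log(np.asarray(n_residues, dtype=float))
    y = np.log(np.asarray(rg, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    r2 = 1.0 - residuals.var() / y.var()
    return 1.0 / slope, r2

# Synthetic, noise-free data generated with D = 2.5 (not real proteins).
n = np.array([50, 100, 200, 400, 800])
rg = 2.0 * n ** (1 / 2.5)
d, r2 = mass_fractal_dimension(n, rg)
```

On real data the coefficient of determination returned alongside D plays the role discussed above: a low value signals that the fitted dimension is not a reliable descriptor of the class.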
4.1.2 Distance and Degree Distributions
As already observed by Tasdighian et al., proteins exhibit a typical topology that is at the same time highly modular, fractal, and usually hierarchical. This results in networks that are also optimized in terms of shortest paths, although the characteristic path length of proteins does not scale in close agreement with a small-world topology [9, 23, 27]. As a demonstration of this fact for our dataset, DS-G-454, Fig. 3(a) shows the scaling between the number of vertices and the related ACC. The scaling is not very consistent with an exponential decay; an analogous result can be obtained by considering the average shortest path, which shows an increasing, logarithmic-like trend. This is in agreement with the fact that the topology is not entirely small-world.
Protein topology should not be considered scale-free either, since the presence of hubs is limited [5, 13, 71]. This is because the physical arrangement of neighboring residues is constrained by steric effects. As a demonstration, Fig. 3(b) shows the sample degree distribution, which is clearly not consistent with an inverse power law.
4.2 Statistical Analysis of Features
4.2.1 Factor Analysis
The loadings of the original descriptors on the first five factors (components extracted by PCA on the correlation matrix) are reported in Tab. 1. The variables most relevant for interpreting each component (highest loadings in absolute value) are in bold. The extracted components are consistent with the correlation structure made evident in Ref. . By far the most important order parameter shaping the protein universe is relative size, Factor 1, which is heavily correlated with explicit size features such as V, E, and RG. Moreover, as shown in Sec. 4.1.2, ACC scales negatively with size, i.e., larger proteins have a lower average closeness centrality. It is worth pointing out the strong (linear) relation of some size-related variables with descriptors such as EN, LEN, HT, and HCI. Although these features obviously also depend on the number of vertices/edges, they convey higher-level information that also characterizes the global arrangement of the network. This result confirms that all proteins share common mesoscopic architectural principles. The consequences of these architectural principles in terms of protein properties are in turn modulated by the effect of size, as expected for nanoscale objects such as protein molecules.
It is interesting to note that the number of chains of a protein (C) is independent of the size component (Factor 1) and requires a dedicated component (Factor 4). On the same component, it is worth mentioning the weaker contribution of CL, which indicates that the local clustering coefficient is lower in multi-chain molecules.
Factor 2 is the most relevant order parameter after the predominant size component (a result in accordance with Di Paola et al.). In particular, it points to peculiar topological properties of proteins (ADC, CL). It is worth noting that A is almost equally loaded on the size component and on the second factor. Notably, A decreases with the network size and increases as the network becomes more irregular (in terms of ADC and CL).
Again in accordance with the general results presented in , the third factor emerges as a “global shape” factor describing the arrangement of the amino acids in the molecule. In fact, P is almost entirely loaded on this component. It is worth noting that the local topological properties (Factor 2) are independent of the global architecture (Factor 3). We have already commented on Factor 4 (number of chains). Factor 5 allows a less clear interpretation, although it is worth noting that H loads almost identically on this factor and on the size component (Factor 1).
[Table 1: loadings of the original descriptors on Factor 1, Factor 2, Factor 3, Factor 4, and Factor 5]
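For readers who wish to reproduce this kind of analysis, the following sketch computes component loadings from a PCA on the correlation matrix. It is a generic illustration on synthetic data, assuming that a loading is the correlation between an original variable and a component (eigenvector scaled by the square root of its eigenvalue); the function `pca_loadings` and the toy variables are hypothetical and do not reproduce Tab. 1.

```python
import numpy as np

def pca_loadings(X, n_components=5):
    """PCA on the correlation matrix of X (rows = samples).

    Returns (loadings, explained), where loadings[i, j] is the correlation
    between variable i and component j, and explained[j] is the fraction
    of total variance carried by component j.
    """
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1][:n_components]  # largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    explained = eigvals / corr.shape[0]
    return loadings, explained

# Synthetic data: two noisy copies of a latent "size" plus an unrelated variable.
rng = np.random.default_rng(0)
size = rng.normal(size=300)
X = np.column_stack([
    size + 0.1 * rng.normal(size=300),
    size + 0.1 * rng.normal(size=300),
    rng.normal(size=300),
])
loadings, explained = pca_loadings(X, n_components=2)
```

In this toy setting the two size-like variables load heavily on the first component, while the unrelated variable is captured by the second one, which is the same reading applied above to V, E, and RG on Factor 1. Note that the sign of each component is arbitrary, so loadings should be compared in absolute value.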
4.2.2 Non-linear Correlations via Mutual Information
The linear correlation structure discussed in Sec. 4.2.1 confirms the distinguishing role of size in proteins. In fact, the first factor is strongly characterized both by common size descriptors (such as V, E, and RG) and by other related characteristics, such as EN, LEN, HT, and HCI (although these features also account for more sophisticated characteristics, such as the global network structure and the flow of information).
Here we elaborate on those results by analyzing the pairwise non-linear correlations. We compute the non-linear correlation between two features/characteristics via mutual information estimation. Results are provided in Tab. 2. Overall, the correlation structure is very similar to the one discussed in Sec. 4.2.1. In fact, V, E, RG, EN, LEN, HT, and HCI are highly correlated with each other. In addition, P exhibits weak relations with all other features, which justifies the fact that in the factor analysis this feature was loaded alone on a dedicated component (Factor 3). Individual correlations with the solubility degree (SOL) show that there is no predominant strong non-linear relation. However, the results in bold seem to share more information with this target property.
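To make the notion of non-linear correlation concrete, the sketch below estimates the mutual information between two variables with a simple two-dimensional histogram (plug-in) estimator. The estimator actually used for Tab. 2 is not specified in the text available here, so the binning approach, the number of bins, and the function name are assumptions chosen for illustration.

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Plug-in estimate of the mutual information (in nats) between x and y.

    Assumption: equal-width 2D histogram with `bins` bins per axis.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells (0 log 0 = 0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=5000)
# y = x^2 is strongly dependent on x although linearly uncorrelated with it.
mi_nonlinear = mutual_information(x, x ** 2)
mi_independent = mutual_information(x, rng.uniform(-1, 1, size=5000))
```

The quadratic example shows why mutual information complements the factor analysis: the dependence between the two variables is strong, yet a linear correlation coefficient would be close to zero.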
4.3 Discrimination of Solubility Classes
In this section, we focus on DS-C-454 with the aim of providing a robust discrimination (i.e., classification) rule for soluble and insoluble proteins. We addressed this objective multiple times in previous studies [59, 45], but always with a multi-class approach, i.e., by explicitly modeling both the soluble and insoluble classes. In this paper, we depart from this approach by casting the problem in terms of one-class classification. To this end, we make use of a one-class classifier that we recently proposed, which we refer to as EOCC in the following. We consider the very soluble proteins as the “targets”, while the insoluble ones are considered as non-targets. This choice is justified by the fact that all proteins for which the 3D structure is available must have been solubilized. Therefore, a low level of solubility, measured according to the conditions of Niwa et al., can be understood as an anomaly; in practice, proteins with lower solubility levels are stabilized in the E.Coli organism by the so-called chaperones. The aim of the test is to provide consistent acceptance/rejection decisions, considering that proteins are originally described by a continuous variable: the solubility degree. EOCC provides both hard and soft decisions, which suits this scenario well. We will see that the soft decisions convey reasonable information with respect to the original solubility degree.
4.3.1 Data Preprocessing and Organization
Fig. 4 shows the PCA of the data in DS-C-454. Fig. 4(a) and Fig. 4(b) show, respectively, the plots of the first–second and first–third PCs. Data in DS-C-454 have been normalized using the component-wise mean and standard deviation. As already demonstrated in Sec. 4.2.1 by the factor analysis, the PC1–PC2 plot clearly shows some possibility of separating the patterns along the first component (i.e., the “size” of proteins). In accordance with the data in Tab. 1, in the experiments we consider the original features transformed into the first five components obtained from the PCA of DS-C-454.
The split into disjoint training, validation, and test sets is performed as follows. For the training set, we randomly select 50 soluble proteins. The remaining 27 are used for validating and testing the model, using 7 and 20 patterns, respectively. The 377 insoluble proteins are divided between the validation and test sets, using 50 and 327 proteins, respectively.
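The split sizes stated above can be reproduced with a short sketch such as the following. The function `split_indices`, the dictionary keys, and the seed are illustrative assumptions; the indices refer to positions within the soluble and insoluble subsets, respectively, not to actual protein identifiers.

```python
import numpy as np

def split_indices(n_soluble=77, n_insoluble=377, seed=0):
    """Random one-class split: targets train/val/test, non-targets val/test."""
    rng = np.random.default_rng(seed)
    sol = rng.permutation(n_soluble)     # shuffled indices of soluble proteins
    ins = rng.permutation(n_insoluble)   # shuffled indices of insoluble proteins
    return {
        "train_target": sol[:50],
        "val_target": sol[50:57],
        "test_target": sol[57:],
        "val_nontarget": ins[:50],
        "test_nontarget": ins[50:],
    }

splits = split_indices()
sizes = {name: len(idx) for name, idx in splits.items()}
```

Only target (soluble) patterns enter the training set, as expected in a one-class setting; non-target patterns are used exclusively for validation and testing.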
In the case of soft decisions, EOCC outputs the membership degree (a number in $[0, 1]$) of a test pattern to the target class. In fact, the model of EOCC consists of different decision regions (DRs) modeled as fuzzy sets. The performance measure that we consider in this case is the Area Under the ROC Curve (AUC), which can be interpreted as the average probability of ranking a target pattern higher than a non-target one. For the hard decision case, we consider the confusion matrix and related common statistics: accuracy, precision, recall, F-measure, and false positive rate. Finally, we provide an indicator of model complexity, which in our case is reported as the average number of DRs synthesized by EOCC. Results are intended as the average of 30 different runs with different initialization seeds (standard deviations are also reported).
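The ranking interpretation of the AUC mentioned above can be made explicit with a direct pairwise computation. This is a minimal sketch with made-up scores, not EOCC output; ties are counted as one half, which is the usual convention and is assumed here.

```python
import numpy as np

def auc_pairwise(target_scores, nontarget_scores):
    """AUC as the fraction of (target, non-target) pairs ranked correctly.

    Ties contribute 0.5 (assumed convention).
    """
    t = np.asarray(target_scores, dtype=float)[:, None]
    n = np.asarray(nontarget_scores, dtype=float)[None, :]
    wins = (t > n).sum() + 0.5 * (t == n).sum()
    return wins / (t.size * n.size)

# Hypothetical membership degrees for 3 targets and 4 non-targets.
auc = auc_pairwise([0.9, 0.6, 0.4], [0.5, 0.3, 0.4, 0.1])
```

Of the 12 pairs in this example, 10 are ranked correctly and one is tied, giving 10.5/12 = 0.875. A value of 0.5 corresponds to a random ranking, which is the reference used later for the randomized scenario.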
4.3.2 Classification Results
Test set results of EOCC are reported in Tab. 4, considering both hard and soft decisions of membership to the target class (i.e., the class of very soluble proteins). As the reader would notice, there are two rows in Tab. 4: the first one referring to a “Normal” and the other to a “Randomized” scenario. The one indicated as “Normal” provides the results considering exactly the split configuration as discussed above. The row labeled “Randomized” is added to demonstrate the robustness of the results obtained in the “Normal” scenario.
Let us start by discussing the results in the normal case. The average AUC is 0.74, denoting a fairly robust statistic for the soft decisions. Results from the confusion matrix (hard decisions) are less appealing: although the accuracy is fairly high (explained by the very low false positive rate), precision and recall are definitely low. On average, EOCC synthesized 28.05 DRs, which, relative to the training set size (i.e., 50), gives a compression factor of 1 − 28.05/50 ≈ 0.44 (i.e., slightly more than one DR for every two training patterns). Such a fairly high number of DRs can be explained by the high difficulty of the underlying classification problem. Considering the hard decisions, results are not convincing as far as the recognition of target patterns is concerned (they recall some of those obtained in our previous studies [45, 59]).
Let us now focus on the randomized scenario. In this case, we randomly defined 77 out of all 454 proteins as the target patterns, i.e., without considering their true association with the soluble/insoluble class. The remaining 377 patterns are accordingly labeled as non-target. We preserved the same split percentages for training, validation, and test as those explained before. EOCC is then executed on such a randomized target/non-target assignment with the aim of evaluating the robustness of the results obtained in the normal case. The second row of Tab. 4 shows the obtained results. Not surprisingly, the results for the hard classification are almost the same as those obtained in the normal setting. This is explained by the fact that the problem is highly unbalanced (there are more non-target instances, which affect the statistics derived from the confusion matrix). However, it is possible to note two very important facts: (i) the AUC in this case is consistent with a random classifier (i.e., 0.5) and (ii) the average model complexity is now much higher. In fact, on average 40.10 DRs are used, denoting a compression factor of roughly 0.19, which is considerably lower than the one obtained in the normal case (2.3 times lower). This is a first indicator that the solution obtained in the normal case can be considered robust, i.e., a model that effectively provides a reasonable explanation of the underlying process.
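The reported compression factors can be checked with a short calculation, under the assumption (our reading of the figures, not stated explicitly in the text) that the compression factor is one minus the ratio between the average number of DRs and the training set size:

```latex
\[
  1 - \frac{28.05}{50} \approx 0.44, \qquad
  1 - \frac{40.10}{50} \approx 0.20, \qquad
  \frac{0.44}{0.19} \approx 2.3 .
\]
```

Under this assumption the normal-case value matches the reported 0.44, the randomized-case value agrees with the reported "roughly 0.19" up to rounding, and the ratio between the two reported factors reproduces the stated factor of 2.3.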
As a second demonstration of this conclusion, let us analyze Fig. 5. In these two figures, we compare the continuous outputs calculated by EOCC in the two considered scenarios. In Fig. 5(a) we report the membership degrees assigned to the test patterns, while Fig. 5(b) reports the average of those values. In the figures we differentiate between the normal and randomized settings, and also between the patterns that are considered as target and non-target. From these results it is possible to deduce that the membership degrees of the target patterns are highly affected by the randomization of the target class, while those of the non-target patterns are left practically invariant. Moreover, the membership values in the normal case denote a good average class discrimination, i.e., the tested target patterns have an average membership degree considerably higher than that of the non-target instances (although it is possible to note several errors for the non-target patterns). This is in clear accordance with the fact that, originally, proteins are characterized by a continuous solubility degree. To conclude, it is worth stressing that EOCC is a classifier, and so the soft decisions must be intended as a way to assign a score (ranking) to the decisions, and not as an approximation of the original solubility signal.
[Table 4: test set results of EOCC for the “Normal” and “Randomized” scenarios. Columns: AUC (soft decision); Accuracy, Precision, Recall, F-Measure, False Positive Rate (hard decision); # DRs]
5 Conclusions and Future Directions
The relation between the structure and the solubility of proteins is highly non-linear and hence very hard to predict. However, the solubility/aggregation propensity of proteins is a very important topic, which justifies such a research effort. In fact, the aggregation of proteins is at the basis of many misfolding diseases, such as Parkinson's and Alzheimer's diseases.
The study presented here analyzed three important aspects of a dataset of proteins recently elaborated from the E.Coli proteome. First, we checked purely topological characteristics by analyzing properties related to shape (fractal dimension of the embedded graphs) and connections (distributions of shortest paths and vertex degrees). We verified that this dataset of E.Coli proteins also adheres to the general principles describing protein architecture. Then we moved to the statistical analysis of a collection of 15 features that we extracted from the dataset of graphs under analysis. We studied both linear and non-linear relations among these features. Overall, our findings confirmed that the solubility of proteins is mainly distinguished by their size. In fact, size affects the density of the connections, the global shape, and thus necessarily also the propensity to be soluble. This last fact is due to evident energetic constraints, which favor the (completely autonomous) folding of small-sized proteins. Lastly, we analyzed the proteins described in terms of those 15 features for the purpose of automatically recognizing very soluble and poorly soluble proteins. We addressed the problem in the one-class classification setting by means of a novel method (EOCC). In the case of hard discrimination, results are not convincing. However, the soft decisions yielded interesting performance in terms of discrimination of the solubility class.
Although we reached encouraging recognition performance (in the soft decision case), an ultimate solution is still missing. The extreme difficulty of this problem may be due to a variety of reasons: (i) a highly non-linear relation between structure and solubility, (ii) experimental errors in the measurements of the original solubility degree, (iii) a sample that is too small to synthesize an effective recognition model, and/or (iv) an inadequate modeling and computational approach. All in all, we were aware of the intrinsic difficulty of the problem, given that solubility and aggregation can be considered as “two faces of the same coin” (the coin being the ability to form non-covalent bonds) rather than a neat categorization into two non-overlapping behaviors. We want to stress that the quest for solving a difficult problem allowed us to find along the way a novel style of graph-based data analysis that, in the case of proteins, highlighted some very important and general properties of protein structure and function. Protein molecules are an almost unique case in which graphs have an immediate physical counterpart, but the same approach can be equally applied to more abstract, less material network structures, such as gene expression or metabolic networks.
Appendix A Graph Characteristics
A.1 Radius of Gyration and Porosity
The radius of gyration of a molecule (graph) $G$ of $n$ amino acids (vertices) is computed as
$$R_g = \sqrt{\frac{\sum_{i=1}^{n} m_i r_i^2}{\sum_{i=1}^{n} m_i}}, \tag{1}$$
where $m_i$ is the mass of the $i$th amino acid and $r_i$ is the Euclidean distance between the amino acid center of mass and the center of mass of the whole structure.
The porosity (or void fraction) of a molecule is defined as
$$\mathrm{Por} = 1 - \frac{V_R}{V_T}, \tag{2}$$
where $V_R$ is the sum of the volumes of the residues constituting the molecule and $V_T$ is the total volume, which is computed as the average of the three spherical volumes $V_x$, $V_y$, and $V_z$, each one calculated by considering a diameter equal to the maximum distance in the respective dimension.
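The two descriptors above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the radius of gyration is taken as the mass-weighted root mean squared distance from the center of mass, the porosity as one minus the ratio between residue volume and total volume, and "maximum distance in the respective dimension" is read as the extent of the coordinates along each axis. The function and argument names are hypothetical.

```python
import math

def radius_of_gyration(coords, masses):
    """Mass-weighted radius of gyration of a set of 3D points (assumed definition)."""
    total_mass = sum(masses)
    # Center of mass of the whole structure.
    com = [sum(m * c[k] for m, c in zip(masses, coords)) / total_mass
           for k in range(3)]
    # Mass-weighted mean squared distance from the center of mass.
    msd = sum(m * sum((c[k] - com[k]) ** 2 for k in range(3))
              for m, c in zip(masses, coords)) / total_mass
    return math.sqrt(msd)

def porosity(coords, residue_volumes):
    """Void fraction 1 - V_R / V_T (assumed definition).

    V_T is the mean of three spherical volumes, one per axis, whose diameter
    is the coordinate extent along that axis (an interpretation, not a given).
    """
    spheres = []
    for k in range(3):
        diameter = max(c[k] for c in coords) - min(c[k] for c in coords)
        spheres.append(math.pi * diameter ** 3 / 6.0)  # sphere volume from diameter
    v_total = sum(spheres) / 3.0
    return 1.0 - sum(residue_volumes) / v_total
```

In practice the coordinates would be the residue centers of mass read from a structure file, and the residue volumes would come from a table of amino acid properties.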
A.2 Modularity of a Graph
A partition $P = \{C_1, C_2, \dots, C_k\}$ of order $k$ of a graph $G = (V, E)$ is commonly intended as a partition of the vertex set $V$ into $k$ disjoint subsets (clusters, modules), $C_l$, $l = 1, \dots, k$. A well-established measure to determine the quality of $P$ is the so-called modularity measure, which basically quantifies how well $P$ groups the vertices of $G$ into compact and separated clusters. Intuitively, in a graph a cluster of vertices is compact if the number (the weight) of the intra-cluster edges is considerably greater than that of the inter-cluster edges. The modularity measure is formally defined as follows,
$$Q(G, P) = \frac{1}{2|E|} \sum_{l=1}^{k} \sum_{v_i, v_j \in C_l} \left( A_{ij} - \frac{\deg(v_i)\deg(v_j)}{2|E|} \right), \tag{3}$$
where $A$ is the adjacency matrix of $G$ and $\deg(v_i)$ is the number of edges incident to $v_i$. Eq. 3 can be rewritten as:
$$Q(G, P) = \sum_{l=1}^{k} \left[ \frac{|E_l|}{|E|} - \left( \frac{\deg(C_l)}{2|E|} \right)^2 \right], \tag{4}$$
where $|E_l|$ is the number of intra-cluster edges and $\deg(C_l)$ is the sum of degrees of the vertices in the $l$th cluster (considering all edges, i.e., also those with one end-point outside $C_l$). The modularity of a graph $G$ is equal to the modularity of the partition of $G$ that maximizes Eq. 4. Finding such an optimal partition is NP-complete, and therefore many heuristics have been proposed.
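As an illustration of the cluster-wise form of the modularity, the following sketch evaluates a given partition of an undirected, unweighted graph. It assumes the usual formulation in which each cluster contributes its fraction of intra-cluster edges minus the squared fraction of the total degree it holds; it does not search for the optimal partition, and the function name and input format are hypothetical.

```python
def modularity(edges, partition):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    partition: list of disjoint vertex sets covering all vertices.
    """
    edges = list(edges)
    m = len(edges)
    cluster_of = {v: l for l, cluster in enumerate(partition) for v in cluster}
    intra = [0] * len(partition)   # number of intra-cluster edges per cluster
    degree = [0] * len(partition)  # sum of vertex degrees per cluster
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(intra[l] / m - (degree[l] / (2 * m)) ** 2
               for l in range(len(partition)))
```

For two triangles joined by a single edge, splitting the graph into the two triangles gives a positive score, while placing every vertex in one cluster gives zero.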
A.3 Closeness and Degree Centrality
In a graph $G = (V, E)$, the closeness centrality factor of a vertex $v_i$ is defined as:
$$C(v_i) = \frac{1}{\sum_{v_j \in V} d(v_i, v_j)}, \tag{5}$$
where $d(v_i, v_j)$ is the length of the shortest path between $v_i$ and $v_j$ in $G$. The closeness centrality provides a measure of how close a vertex is to all other vertices of the graph (in terms of shortest paths). Such a measure can be extended to the whole graph by taking the average among all vertices.
The degree centrality of a vertex is defined as its degree, i.e., the number of incident edges. Analogously, it can be normalized by using the maximum degree of the graph.
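A minimal sketch of both centralities for an unweighted, connected graph stored as adjacency lists follows. It assumes the closeness of a vertex is the reciprocal of the sum of its shortest-path lengths to all other vertices (other normalizations only rescale the values); the helper names are hypothetical.

```python
from collections import deque

def shortest_path_lengths(adj, source):
    """Breadth-first search distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def closeness_centrality(adj, v):
    """Reciprocal of the summed shortest-path lengths from v.

    Assumes a connected graph with at least two vertices.
    """
    return 1.0 / sum(shortest_path_lengths(adj, v).values())

def mean_closeness(adj):
    """Graph-level closeness: average over all vertices."""
    return sum(closeness_centrality(adj, v) for v in adj) / len(adj)

def normalized_degree_centrality(adj):
    """Vertex degrees divided by the maximum degree of the graph."""
    max_deg = max(len(nbrs) for nbrs in adj.values())
    return {v: len(nbrs) / max_deg for v, nbrs in adj.items()}
```

On a three-vertex path the middle vertex gets the highest closeness and a normalized degree of one, as expected.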
A.4 Clustering Coefficient
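Let $A$ be the adjacency matrix of a graph $G$, with entries $a_{ij}$. The clustering coefficient is defined as,
$$C = \frac{3 N_{\triangle}}{N_3}, \tag{6}$$
where $N_{\triangle}$ is the number of triangles in the graph, while $N_3$ is the number of connected triplets:
$$N_{\triangle} = \sum_{k > j > i} a_{ij} a_{ik} a_{jk}, \tag{7}$$
$$N_3 = \sum_{k > j > i} \left( a_{ij} a_{ik} + a_{ji} a_{jk} + a_{ki} a_{kj} \right). \tag{8}$$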
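The ratio between triangles and connected triplets can be computed directly from a 0/1 adjacency matrix, as in the sketch below. It assumes the global (transitivity) form of the coefficient, three times the number of triangles over the number of connected triplets, for an undirected simple graph; the function name is hypothetical.

```python
def clustering_coefficient(A):
    """Global clustering coefficient: 3 * triangles / connected triplets.

    A is a symmetric 0/1 adjacency matrix (list of lists) with zero diagonal.
    """
    n = len(A)
    triangles = 0
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                triangles += A[i][j] * A[i][k] * A[j][k]
    # A vertex of degree d is the center of d * (d - 1) / 2 connected triplets.
    triplets = sum(d * (d - 1) // 2 for d in (sum(row) for row in A))
    return 3 * triangles / triplets if triplets else 0.0
```

The triple loop is cubic in the number of vertices, which is acceptable for graphs of a few hundred residues; larger graphs would call for a neighbor-based count.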
A.5 Energy and Laplacian Energy of a Graph
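Let $G$ be a graph with $n$ vertices and $m$ edges, and let $A$ be its adjacency matrix. The energy of the graph is defined as
$$E(G) = \sum_{i=1}^{n} |\lambda_i|, \tag{9}$$
where $\lambda_i$ is the $i$th eigenvalue of the adjacency matrix $A$.
Let us define the Laplacian matrix as $L = D - A$, where $D$ is a diagonal matrix containing the vertex degrees. The Laplacian energy, LE, is defined as
$$LE(G) = \sum_{i=1}^{n} \left| \mu_i - \frac{2m}{n} \right|, \tag{10}$$
where $\mu_i$ is the $i$th eigenvalue of the Laplacian.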
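Both energies reduce to an eigenvalue computation on a symmetric matrix. The sketch below uses NumPy and assumes the standard definitions: the energy is the sum of the absolute adjacency eigenvalues, and the Laplacian energy is the sum of the absolute deviations of the Laplacian eigenvalues from the average degree 2m/n. Function names are hypothetical.

```python
import numpy as np

def graph_energy(A):
    """Sum of the absolute values of the adjacency eigenvalues."""
    A = np.asarray(A, dtype=float)
    return float(np.abs(np.linalg.eigvalsh(A)).sum())

def laplacian_energy(A):
    """Sum of |mu_i - 2m/n| over the eigenvalues mu_i of L = D - A."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    degrees = A.sum(axis=1)
    L = np.diag(degrees) - A
    mu = np.linalg.eigvalsh(L)
    # degrees.sum() equals 2m, so degrees.sum() / n is the average degree.
    return float(np.abs(mu - degrees.sum() / n).sum())
```

For regular graphs the two quantities coincide, which gives a quick sanity check on small examples such as the complete graph on three vertices.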
A.6 Heat Kernel
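Let $G = (V, E)$ be a graph with $n$ vertices and $m$ edges, respectively. Let $A$ and $L = D - A$ be the corresponding adjacency and Laplacian matrices. Let us define the normalized Laplacian matrix as $\hat{L} = D^{-1/2} L D^{-1/2}$. $\hat{L}$ is symmetric and positive semi-definite, and therefore it has non-negative eigenvalues only. The spectral decomposition of the Laplacian is given by $\hat{L} = \Phi \Lambda \Phi^{T}$, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$ is the diagonal matrix containing the eigenvalues arranged as $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$; $\Phi = (\phi_1 | \phi_2 | \dots | \phi_n)$ contains the corresponding (unit-norm) eigenvectors as columns. The heat equation associated with $\hat{L}$ is
$$\frac{\partial H_t}{\partial t} = -\hat{L} H_t, \tag{11}$$
where $H_t$ is the heat kernel (a doubly-stochastic matrix) and $t$ is the time variable. It is well-known that the solution for (11) is,
$$H_t = e^{-t\hat{L}}, \tag{12}$$
which can be solved by exponentiating the spectrum of $\hat{L}$:
$$H_t = \Phi e^{-t\Lambda} \Phi^{T}. \tag{13}$$
Eq. 11 describes the diffusion (i.e., the flow) of heat/information across the graph over time. In fact,
$$H_t(u, v) = \sum_{i=1}^{n} e^{-\lambda_i t} \phi_i(u) \phi_i(v), \tag{14}$$
where $\phi_i(u)$ denotes the value related to vertex $u$ in the $i$th eigenvector. It is important to note that $H_t \simeq I - \hat{L} t$ when $t \to 0$; conversely, when $t$ is large we have $H_t \simeq e^{-\lambda_2 t} \phi_2 \phi_2^{T}$ (note that $\phi_2$, i.e., the eigenvector associated with the smallest non-zero eigenvalue, is called the Fiedler vector). This means that the large-time behavior of the diffusion depends on the global structure of the graph, while its short-time characteristics are determined by the local connections.
The heat trace (HT) of $H_t$ is the sum of the diagonal entries,
$$HT(t) = \mathrm{Tr}(H_t) = \sum_{i=1}^{n} e^{-\lambda_i t}, \tag{15}$$
which takes into account only the eigenvalues of $\hat{L}$. The heat content of $H_t$ is defined by considering also the eigenvectors:
$$HC(t) = \sum_{u \in V} \sum_{v \in V} H_t(u, v) = \sum_{u \in V} \sum_{v \in V} \sum_{i=1}^{n} e^{-\lambda_i t} \phi_i(u) \phi_i(v). \tag{16}$$
Eq. 16 can be described in terms of power series,
$$HC(t) = \sum_{m=0}^{\infty} q_m t^m. \tag{17}$$
The McLaurin series expansion for the negative exponential reads as:
$$e^{-\lambda_i t} = \sum_{m=0}^{\infty} \frac{(-\lambda_i)^m t^m}{m!}, \tag{18}$$
which substituted in Eq. 16 gives:
$$HC(t) = \sum_{u \in V} \sum_{v \in V} \sum_{i=1}^{n} \phi_i(u) \phi_i(v) \sum_{m=0}^{\infty} \frac{(-\lambda_i)^m t^m}{m!}. \tag{19}$$
The coefficients $q_m$ in (17) are graph invariants (called heat content invariants, HCI) that can be obtained in closed form via the following expression:
$$q_m = \sum_{i=1}^{n} \left( \sum_{u \in V} \phi_i(u) \right)^2 \frac{(-\lambda_i)^m}{m!}. \tag{20}$$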
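The quantities of this section can be obtained from a single eigendecomposition, as in the following NumPy sketch. It assumes the normalized Laplacian is I - D^{-1/2} A D^{-1/2}, that the graph has no isolated vertices, and that the heat content invariants are the power-series coefficients of the heat content in t; function names are hypothetical.

```python
import math
import numpy as np

def normalized_laplacian_spectrum(A):
    """Eigenvalues (ascending) and eigenvectors (columns) of I - D^-1/2 A D^-1/2."""
    A = np.asarray(A, dtype=float)
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))  # assumes no isolated vertices
    L_hat = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return np.linalg.eigh(L_hat)

def heat_kernel(A, t):
    """H_t = Phi exp(-t Lambda) Phi^T."""
    lam, Phi = normalized_laplacian_spectrum(A)
    return (Phi * np.exp(-lam * t)) @ Phi.T  # scales column i by exp(-lam_i t)

def heat_trace(A, t):
    """Sum of exp(-lam_i t); depends on the eigenvalues only."""
    lam, _ = normalized_laplacian_spectrum(A)
    return float(np.exp(-lam * t).sum())

def heat_content(A, t):
    """Sum of all entries of the heat kernel."""
    return float(heat_kernel(A, t).sum())

def heat_content_invariants(A, m_max):
    """Coefficients q_0, ..., q_{m_max} of the power series of the heat content."""
    lam, Phi = normalized_laplacian_spectrum(A)
    col_sums_sq = Phi.sum(axis=0) ** 2  # (sum_u phi_i(u))^2 for each eigenvector
    return [float((col_sums_sq * (-lam) ** m).sum() / math.factorial(m))
            for m in range(m_max + 1)]
```

A useful check is that the truncated series built from the invariants reproduces the heat content at small t, and that the kernel at t = 0 is the identity matrix.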
A.7 Ambiguity of a Graph
The ambiguity of a graph , gives a measure of uncertainty elaborated according to a fuzzy set based interpretation . The ambiguity of the graph is calculated by embedding the graph into a fuzzy hypercube , which, in short, encodes the membership values of the vertices. A graph is mapped to a type-1 fuzzy set , defined as
by generating the membership function of the graph vertices, . Such a membership function is constructed by considering a partition, , of the graph:
The partition is then “fuzzified” by computing the t-conorm among all fuzzy sets associated to each (one-to-one mapping), yielding the resulting fuzzy set, , describing the graph as a whole:
In the following, let us write . The membership function describing the fuzzy set is generated according to the following expression:
accounts for the degree concentration in , while gives the importance of the vertex in in terms of centrality. Given the fuzzy set representing the uncertainty of the whole graph , the measure of ambiguity of , denoted with , is obtained by computing (any monotonic and non-decreasing transformation of) the fuzzy entropy of . Since there are exponentially-many fuzzy set representations for a single graph , the actual ambiguity value is calculated as the solution of the following combinatorial optimization problem:
The ambiguity assumes values within the $[0, 1]$ range, approaching one as the graph is maximally ambiguous (i.e., maximally irregular). It has been proved that the ambiguity is zero when the graph is regular (e.g., complete). Accordingly, it can be used as a global complexity descriptor characterizing the regularity of $G$.
A.8 Entropy of a Markovian Random Walk
A Markovian random walk in a graph $G = (V, E)$ is a first-order Markov chain that generates sequences of vertices (the vertices of the graph should be interpreted as the states of the chain). Transitions among vertices are regulated by the transition matrix, which is given by $T = D^{-1} A$. Let $p_t$ be a probability vector describing the probability of the states at time $t$; when $t = 0$, the vector describes the initial distribution. The stationary distribution is a probability vector, $\pi$, such that $\pi = \pi T$. The stationary distribution of a random walk can be intended as the limiting distribution, i.e., the distribution of the vertices/states when $t \to \infty$. If the graph is undirected, connected, and non-bipartite, then the random walk always admits a stationary distribution, which can be easily computed from the degree distribution:
$$\pi_i = \frac{\deg(v_i)}{\sum_{v_j \in V} \deg(v_j)} = \frac{\deg(v_i)}{2|E|}.$$
A stationary random walk is hence completely described by $\pi$. The entropy of $\pi$ can be used to characterize $G$ in terms of predictability of the corresponding stationary random walk. In fact, if $G$ is regular, then $\pi$ is uniform, which could be interpreted as the maximum degree of unpredictability.
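The stationary distribution and its entropy can be sketched as follows for an undirected graph stored as adjacency lists. The sketch assumes that each vertex has stationary probability proportional to its degree and that the entropy in question is the Shannon entropy (natural logarithm); the text does not state which entropy is used, and the function names are hypothetical.

```python
import math

def stationary_distribution(adj):
    """Degree-proportional stationary distribution of a simple random walk."""
    total = sum(len(nbrs) for nbrs in adj.values())  # equals twice the edge count
    return {v: len(nbrs) / total for v, nbrs in adj.items()}

def entropy(pi):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in pi.values() if p > 0)
```

On a regular graph such as a triangle the distribution is uniform and the entropy reaches its maximum, log n, while a path of three vertices yields a lower value because the walk visits the middle vertex more often.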
References
- Amino acid Physical-chemical property Database. URL http://www.rfdn.org/bioinfo/APDbase/index.html.
- Protein Data Bank. URL http://www.rcsb.org/pdb/home/home.do.
- Agostini et al.  F. Agostini, M. Vendruscolo, and G. G. Tartaglia. Sequence-Based Prediction of Protein Solubility. Journal of Molecular Biology, 421(2-3):237–241, 2012. ISSN 0022-2836. doi: 10.1016/j.jmb.2011.12.005.
- Anand and Bianconi  K. Anand and G. Bianconi. Entropy measures for networks: Toward an information theory of complex topologies. Physical Review E, 80:045102, Oct 2009. doi: 10.1103/PhysRevE.80.045102.
- Bagler and Sinha  G. Bagler and S. Sinha. Network properties of protein structures. Physica A: Statistical Mechanics and its Applications, 346(1):27–33, 2005.
- Bai and Hancock  L. Bai and E. R. Hancock. Depth-based complexity traces of graphs. Pattern Recognition, 47(3):1172–1186, 2014. ISSN 0031-3203. doi: 10.1016/j.patcog.2013.09.010.
- Banerji and Ghosh  A. Banerji and I. Ghosh. Fractal symmetry of protein interior: what have we learned? Cellular and Molecular Life Sciences, 68(16):2711–2737, 2011. doi: 10.1007/s00018-011-0722-6.
- Banerji et al.  C. R. S. Banerji, S. Severini, and A. E. Teschendorff. Network transfer entropy and metric space for causality inference. Physical Review E, 87(5):052814, May 2013. doi: 10.1103/PhysRevE.87.052814.
- Bartoli et al.  L. Bartoli, P. Fariselli, and R. Casadio. The effect of backbone on the small-world properties of protein contact maps. Physical Biology, 4(4):L1, 2007. doi: 10.1088/1478-3975/4/4/L01.
- Ben-Avraham and Havlin  D. Ben-Avraham and S. Havlin. Diffusion and Reactions in Fractals and Disordered Systems. Cambridge University Press, Cambridge, UK, 2000.
- Bianconi  G. Bianconi. Entropy of network ensembles. Physical Review E, 79(3):036114, 2009.
- Boccaletti et al.  S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D. Hwang. Complex networks: Structure and dynamics. Physics Reports, 424(4-5):175–308, Feb. 2006. ISSN 03701573. doi: 10.1016/j.physrep.2005.10.009.
- Böde et al.  C. Böde, I. A. Kovács, M. S. Szalay, R. Palotai, T. Korcsmáros, and P. Csermely. Network analysis of protein dynamics. FEBS Letters, 581(15):2776–2782, 2007. doi: 10.1016/j.febslet.2007.05.021.
- Brandes et al.  U. Brandes, D. Delling, M. Gaertler, R. Gorke, M. Hoefer, Z. Nikoloski, and D. Wagner. On modularity clustering. IEEE Transactions on Knowledge and Data Engineering, 20:172–188, Feb. 2008. ISSN 1041-4347. doi: 10.1109/TKDE.2007.190689.
- Bullmore and Sporns  E. T. Bullmore and O. Sporns. Complex brain networks: graph theoretical analysis of structural and functional systems. Nature Reviews Neuroscience, 10(3):186–198, 2009. doi: 10.1038/nrn2575.
- Chiti et al.  F. Chiti, N. Taddei, F. Baroni, C. Capanni, M. Stefani, G. Ramponi, and C. M. Dobson. Kinetic partitioning of protein folding and aggregation. Nature Structural & Molecular Biology, 9(2):137–143, 2002.
- Clune et al.  J. Clune, J.-B. Mouret, and H. Lipson. The evolutionary origins of modularity. Proceedings of the Royal Society B: Biological Sciences, 280(1755), 2013.
- Costa et al.  L. d. F. Costa, F. A. Rodrigues, G. Travieso, and P. R. Villas Boas. Characterization of complex networks: A survey of measurements. Advances in Physics, 56(1):167–242, 2007. doi: 10.1080/00018730601170527.
- Csermely et al.  P. Csermely, T. Korcsmáros, H. J. M. Kiss, G. London, and R. Nussinov. Structure and dynamics of molecular networks: A novel paradigm of drug discovery: A comprehensive review. Pharmacology & Therapeutics, 138(3):333–408, 2013. doi: 10.1016/j.pharmthera.2013.01.016.
- Daqing et al.  L. Daqing, K. Kosmidis, A. Bunde, and S. Havlin. Dimension of spatially embedded networks. Nature Physics, 7(6):481–484, 2011. doi: 10.1038/nphys1932.
- Dehmer and Mowshowitz  M. Dehmer and A. Mowshowitz. A history of graph entropy measures. Information Sciences, 181(1):57–78, 2011. ISSN 0020-0255. doi: 10.1016/j.ins.2010.08.041.
- Dehmer et al.  M. Dehmer, L. A. J. Mueller, and F. Emmert-Streib. Quantitative network measures as biomarkers for classifying prostate cancer disease states: a systems approach to diagnostic biomarkers. PLoS ONE, 8(11):e77602, 2013.
- Di Paola et al. [2012a] L. Di Paola, M. De Ruvo, P. Paci, D. Santoni, and A. Giuliani. Protein contact networks: an emerging paradigm in chemistry. Chemical Reviews, 113(3):1598–1613, 2012a. doi: 10.1021/cr3002356.
- Di Paola et al. [2012b] L. Di Paola, P. Paci, D. Santoni, M. De Ruvo, and A. Giuliani. Proteins as sponges: a statistical journey along protein structure organization principles. Journal of Chemical Information and Modeling, 52(2):474–482, 2012b. doi: 10.1021/ci2005127.
- Dill et al.  K. A. Dill, K. Ghosh, and J. D. Schmit. Physical limits of cells and proteomes. Proceedings of the National Academy of Sciences, 108(44):17876–17882, 2011.
- Donner et al.  R. V. Donner, M. Small, J. F. Donges, N. Marwan, Y. Zou, R. Xiang, and J. Kurths. Recurrence-based time series analysis by means of complex network methods. International Journal of Bifurcation and Chaos, 21(04):1019–1046, 2011. doi: 10.1142/S0218127411029021.
- Enright and Leitner  M. B. Enright and D. M. Leitner. Mass fractal dimension and the compactness of proteins. Physical Review E, 71:011912, Jan 2005. doi: 10.1103/PhysRevE.71.011912.
- Escolano et al.  F. Escolano, E. R. Hancock, and M. A. Lozano. Heat diffusion: Thermodynamic depth complexity of networks. Physical Review E, 85(3):036206, 2012. doi: 10.1103/PhysRevE.85.036206.
- Fawcett  T. Fawcett. An Introduction to ROC Analysis. Pattern Recognition Letters, 27(8):861–874, June 2006. ISSN 0167-8655. doi: 10.1016/j.patrec.2005.10.010.
- Fortunato  S. Fortunato. Community detection in graphs. Physics Reports, 486(3–5):75–174, 2010. ISSN 0370-1573. doi: 10.1016/j.physrep.2009.11.002.
- Ghosh and Dill  K. Ghosh and K. A. Dill. Cellular proteomes have broad distributions of protein stability. Biophysical Journal, 99(12):3996–4002, 2010.
- Giuliani et al.  A. Giuliani, M. Colafranceschi, C. L. Webber Jr., and J. P. Zbilut. A complexity score derived from principal components analysis of nonlinear order measures. Physica A: Statistical Mechanics and its Applications, 301(1):567–588, 2001. doi: 10.1016/S0378-4371(01)00427-7.
- Giuliani et al.  A. Giuliani, S. Filippi, and M. Bertolaso. Why network approach can promote a new way of thinking in biology. Frontiers in Genetics, 5(83), 2014. doi: 10.3389/fgene.2014.00083.
- Gutman and Zhou  I. Gutman and B. Zhou. Laplacian energy of a graph. Linear Algebra and its Applications, 414(1):29–37, 2006. doi: 10.1016/j.laa.2005.09.008.
- Han et al.  L. Han, F. Escolano, E. R. Hancock, and R. C. Wilson. Graph characterizations from von Neumann entropy. Pattern Recognition Letters, 33(15):1958–1967, 2012. ISSN 0167-8655. doi: 10.1016/j.patrec.2012.03.016.
- Havlin  S. Havlin. Complex Networks. Cambridge University Press, Cambridge, UK, 2010. ISBN 9780521841566.
- Klabunde and Richards  K. J. Klabunde and R. Richards. Nanoscale Materials in Chemistry, volume 1035. John Wiley & Sons, Hoboken, New Jersey, USA, 2001.
- Kondor and Lafferty  R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete structures. In Proceedings of the International Conference on Machine Learning, pages 315–322, San Francisco, CA, USA, 2002.
- Leitner et al.  D. M. Leitner, M. Gruebele, and M. Havenith. Solvation dynamics of biomolecules: Modeling and terahertz experiments. HFSP Journal, 2(6):314–323, 2008.
- Lervik et al.  A. Lervik, F. Bresme, S. Kjelstrup, D. Bedeaux, and J. M. Rubi. Heat transfer in protein–water interfaces. Physical Chemistry Chemical Physics, 12(7):1610–1617, 2010.
- Liang and Dill  J. Liang and K. A. Dill. Are proteins well-packed? Biophysical Journal, 81(2):751–766, 2001.
- Livi and Rizzi  L. Livi and A. Rizzi. Graph ambiguity. Fuzzy Sets and Systems, 221:24–47, 2013. ISSN 0165-0114. doi: 10.1016/j.fss.2013.01.001.
- Livi et al.  L. Livi, G. Del Vescovo, and A. Rizzi. Combining graph seriation and substructures mining for graph recognition. In P. Latorre Carmona, J. S. Sánchez, and A. L. N. Fred, editors, Pattern Recognition - Applications and Methods, volume 204, pages 79–91. Springer, Berlin, Germany, 2013. doi: 10.1007/978-3-642-36530-0_7.
- Livi et al.  L. Livi, A. Sadeghian, and W. Pedrycz. Entropic one-class classifiers. IEEE Transactions on Neural Networks and Learning Systems, Apr. 2015. ISSN 2162-237X. doi: 10.1109/TNNLS.2015.2418332.
- Livi et al.  L. Livi, A. Giuliani, and A. Rizzi. Toward a multilevel representation of protein molecules: Comparative approaches to the aggregation/folding propensity problem. Information Sciences, 326:134–145, 2016. ISSN 0020-0255. doi: 10.1016/j.ins.2015.07.043.
- Lloyd and Pagels  S. Lloyd and H. Pagels. Complexity as thermodynamic depth. Annals of Physics, 188(1):186–213, 1988.
- Mirshahvalad et al.  A. Mirshahvalad, A. V. Esquivel, L. Lizana, and M. Rosvall. Dynamics of interacting information waves in networks. Physical Review E, 89(1):012809, 2014. doi: 10.1103/PhysRevE.89.012809.
- Moon et al.  Y.-I. Moon, B. Rajagopalan, and U. Lall. Estimation of mutual information using kernel density estimators. Physical Review E, 52:2318–2321, Sep. 1995. doi: 10.1103/PhysRevE.52.2318.
- Mülken and Blumen  O. Mülken and A. Blumen. Continuous-time quantum walks: Models for coherent transport on complex networks. Physics Reports, 502(2):37–87, 2011.
- Nakayama et al.  T. Nakayama, K. Yakubo, and R. L. Orbach. Dynamical properties of fractal networks: Scaling, numerical simulations, and physical realizations. Reviews of Modern Physics, 66(2):381, 1994. doi: 10.1103/RevModPhys.66.381.
- Newman  M. E. J. Newman. Networks: An Introduction. Oxford University Press, Oxford, UK, 2010.
- Niwa et al.  T. Niwa, B.-W. Ying, K. Saito, W. Jin, S. Takada, T. Ueda, and H. Taguchi. Bimodal protein solubility distribution revealed by an aggregation analysis of the entire ensemble of Escherichia coli proteins. Proceedings of the National Academy of Sciences, 106(11):4201–4206, 2009. doi: 10.1073/pnas.0811922106.
- Niwa et al.  T. Niwa, T. Kanamori, T. Ueda, and H. Taguchi. Global analysis of chaperone effects using a reconstituted cell-free translation system. Proceedings of the National Academy of Sciences, 109(23):8937–8942, 2012. doi: 10.1073/pnas.1201380109.
- Pawar et al.  A. P. Pawar, K. F. Dubay, J. Zurdo, F. Chiti, M. Vendruscolo, and C. M. Dobson. Prediction of “aggregation-prone” and “aggregation-susceptible” regions in proteins associated with neurodegenerative diseases. Journal of Molecular Biology, 350(2):379–392, 2005.
- Pinna et al.  A. Pinna, N. Soranzo, I. Hoeschele, and A. de la Fuente. Simulating systems genetics data with SysGenSIM. Bioinformatics, 27(17):2459–2462, 2011. doi: 10.1093/bioinformatics/btr407.
- Porfiri et al.  M. Porfiri, D. J. Stilwell, and E. M. Bollt. Synchronization in random weighted directed networks. IEEE Transactions on Circuits and Systems I: Regular Papers, 55(10):3170–3177, May 2008. doi: 10.1109/TCSI.2008.925357.
- Qiu and Hancock  H. Qiu and E. R. Hancock. Graph simplification and matching using commute times. Pattern Recognition, 40(10):2874–2889, 2007. ISSN 0031-3203. doi: 10.1016/j.patcog.2006.11.013.
- Ren et al.  P. Ren, R. C. Wilson, and E. R. Hancock. Graph Characterization via Ihara Coefficients. IEEE Transactions on Neural Networks, 22(2):233–245, Feb. 2011. ISSN 1045-9227. doi: 10.1109/TNN.2010.2091969.
- Rizzi et al.  A. Rizzi, F. Possemato, L. Livi, A. Sebastiani, A. Giuliani, and F. M. Frattale Mascioli. A dissimilarity-based classifier for generalized sequences by a Granular Computing approach. In Proceedings of the 2013 International Joint Conference on Neural Networks, pages 2397–2404, Dallas, USA, Aug 2013. ISBN 978-1-4673-6129-3. doi: 10.1109/IJCNN.2013.6707041.
- Sanchez et al.  A. Sanchez, S. Abbet, U. Heiz, W.-D. Schneider, H. Häkkinen, R. N. Barnett, and U. Landman. When gold is not noble: nanoscale gold catalysts. The Journal of Physical Chemistry A, 103(48):9573–9578, 1999.
- Song et al.  C. Song, S. Havlin, and H. A. Makse. Self-similarity of complex networks. Nature, 433(7024):392–395, 2005. doi: 10.1038/nature03248.
- Song et al.  C. Song, L. K. Gallos, S. Havlin, and H. A. Makse. How to calculate the fractal dimension of a complex network: the box covering algorithm. Journal of Statistical Mechanics: Theory and Experiment, 2007(03):P03006, 2007.
- Suau et al.  P. Suau, E. R. Hancock, and F. Escolano. Analysis of the Schrödinger operator in the context of graph characterization. In Similarity-Based Pattern Recognition, pages 190–203. Springer, 2013.
- Tasdighian et al.  S. Tasdighian, L. Di Paola, M. De Ruvo, P. Paci, D. Santoni, P. Palumbo, G. Mei, A. Di Venere, and A. Giuliani. Modules identification in protein structures: the topological and geometrical solutions. Journal of Chemical Information and Modeling, 54(1):159–168, 2013. doi: 10.1021/ci400218v.
- Tejera et al.  E. Tejera, A. Machado, I. Rebelo, and J. Nieto-Villar. Fractal protein structure revisited: Topological, kinetic and thermodynamic relationships. Physica A: Statistical Mechanics and its Applications, 388(21):4600–4608, 2009. doi: 10.1016/j.physa.2009.07.015.
- Trinajstić  N. Trinajstić. Chemical Graph Theory. CRC Press, Boca Raton, FL, 1983.
- Tun et al.  K. Tun, P. Dhar, M. Palumbo, and A. Giuliani. Metabolic pathways variability and sequence/networks comparisons. BMC Bioinformatics, 7(1):24, 2006. ISSN 1471-2105. doi: 10.1186/1471-2105-7-24.
- Wackerbauer et al.  R. Wackerbauer, A. Witt, H. Atmanspacher, J. Kurths, and H. Scheingraber. A comparative classification of complexity measures. Chaos, Solitons & Fractals, 4(1):133–173, 1994.
- Weskamp et al.  N. Weskamp, E. Hullermeier, D. Kuhn, and G. Klebe. Multiple graph alignment for the structural analysis of protein active sites. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4(2):310–320, Apr 2007. ISSN 1545-5963. doi: 10.1109/TCBB.2007.1024.
- Xiao et al.  B. Xiao, E. R. Hancock, and R. C. Wilson. Graph characteristics from the heat kernel trace. Pattern Recognition, 42(11):2589–2606, Nov. 2009. ISSN 0031-3203. doi: 10.1016/j.patcog.2008.12.029.
- Yan et al.  W. Yan, J. Zhou, M. Sun, J. Chen, G. Hu, and B. Shen. The construction of an amino acid network for understanding protein structure and function. Amino Acids, 46(6):1419–1439, 2014. doi: 10.1007/s00726-014-1710-6.
- Ye et al.  C. Ye, R. C. Wilson, C. H. Comin, L. Costa, and E. R. Hancock. Approximate von Neumann entropy for directed graphs. Physical Review E, 89(5):052804, 2014.