Network-based protein structural classification

Khalique Newaz, Arash Rahnama, Mahboobeh Ghalehnovi (Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA); Panos J. Antsaklis (Department of Electrical Engineering, University of Notre Dame, Notre Dame, IN 46556, USA); Tijana Milenković
Abstract

Experimental determination of protein function is resource-consuming. As an alternative, computational prediction of protein function has received attention. In this context, protein structural classification (PSC) can help, by determining structural classes of currently unclassified proteins based on their features and then relying on the fact that proteins with similar structures have similar functions. Existing PSC approaches rely on sequence-based or direct 3-dimensional (3D) structure-based protein features. In contrast, we first model 3D structures of proteins as protein structure networks (PSNs). Then, we use network-based features for PSC. We propose the use of graphlets, state-of-the-art features in many research areas of network science, in the task of PSC. Moreover, because graphlets can deal only with unweighted PSNs, and because accounting for edge weights when constructing PSNs could improve PSC accuracy, we also propose a deep learning framework that automatically learns network features from weighted PSNs. When evaluated on a large set of CATH and SCOP protein domains (spanning 36 PSN sets), our proposed approaches are superior to existing PSC approaches in terms of accuracy, with comparable running time.

1 Introduction

1.1 Motivation and related work.

Proteins are major molecules of life, and thus understanding their cellular function is important. However, doing so experimentally is costly and time-consuming [1]. Instead, computational approaches are often used for this purpose, which are much more efficient because they leverage the fact that (sequence or 3-dimensional (3D)) structural similarity of proteins often indicates their functional similarity [2]; note that here we refer to a broad notion of protein structural similarity, as captured by any existing method. One type of such computational approaches is protein structural classification (PSC) [3]. PSC uses structural features of proteins with known labels (typically CATH [4] or SCOP [5] structural classes) to learn a classification model in a supervised manner (i.e., by including the labels into the process of training the model). Then, the structural feature of a protein with an unknown label can be used as input to the classification model to determine the structural class of the protein. This information can in turn be used to predict the function of a protein based on functions of other proteins that belong to the same structural class as the protein of interest. In this paper, we focus on the PSC problem.

Note that there exists a related computational problem which can help with protein function prediction – that of protein structural comparison [6]. However, unlike PSC: 1) protein structural comparison uses structural features of proteins with known or unknown labels in an unsupervised rather than supervised manner (i.e., it ignores any potential label information), and 2) it uses the features to compute pairwise similarities between the proteins in the hope that highly similar proteins will have the same label (where the labels are used only after the fact), rather than predicting the label of a single protein. In other words, both the goals and working mechanisms of PSC and protein structural comparison are different. Hence, the two approach categories are not comparable to and cannot be fairly evaluated against each other.

Since proteins with high sequence similarity typically also have high 3D structural and functional similarity, traditional PSC approaches have relied only on sequence-based protein features [7, 8]. A popular baseline sequence feature is the amino acid composition (AAComposition), which measures the relative composition of the different amino acid types in a protein sequence [6]. Some other, more comprehensive sequence features include the position-specific scoring matrix [9], three-state secondary structure profile [10], and HMM profile [11], all of which were recently used by a PSC approach called SVMfold [7]. SVMfold integrates the above three sequence features to represent a protein sequence and then uses support vector machine as the classification algorithm to perform PSC.

Although sequence features have been extensively used for the purpose of PSC, it has been argued that proteins with low sequence similarity can still show high 3D structural and functional similarity [12]. On the other hand, proteins with high sequence similarity can have low 3D structural and functional similarity [13]. Hence, PSC based on 3D-structural as opposed to (or in addition to) sequence features could more correctly identify the structural class of a protein [14].

Typically, 3D-structural approaches extract features directly from the 3D structures of proteins and then use these direct 3D structural features (i.e., coordinate positions of the atoms in the 3D space) to compare proteins [14, 15]. Interestingly, recent 3D-structural PSC approaches have focused on classification based on protein pairs [16, 17, 3]. For example, they consider a pair of protein structures – one with known label (class) and the other one with unknown label, and if the proteins are similar enough in terms of their 3D-structural (and possibly also sequence) features, they assign the known label of the currently classified protein to the currently unclassified protein. As such, these approaches fall somewhere in-between PSC (because both are supervised, but PSC analyzes a single protein at a time) and protein structural comparison (because both focus on protein pairs, but protein structural comparison is unsupervised). Therefore, they are not comparable to and cannot be directly evaluated against approaches that solve the PSC problem as defined in our study.

In addition, several fully unsupervised approaches have also been proposed that use 3D-structural features to compare protein structures [18, 19]. For example, recently, a 3D-structural feature called Tuned Gauss Integrals (GIT) was used to cluster proteins into structurally similar groups [19].

In contrast to the direct 3D-structural features, protein 3D structures can first be modeled as protein structure networks (PSNs), in which nodes are amino acids and edges link amino acids that are spatially close enough to each other. Then, network-based features can be extracted from the PSNs and used in the task of PSC. A popular concept in this regard is the notion of protein contact maps [20], which are simply an alternative representation of PSNs. A contact map represents the 3D structure of an n-amino-acid-long protein as an n × n 2-dimensional matrix C. In a contact map, the entry C_ij has a value of 1 if amino acids i and j are within a pre-defined distance cutoff, i.e., they are in contact, and 0 otherwise. A recent approach that used contact map-based features for PSC is the cutoff scanning matrix (CSM) [21].
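As an illustration of this definition, a contact map can be computed from per-residue coordinates. The sketch below is a minimal, hypothetical Python version that assumes one representative coordinate per amino acid (the contacts described in the text instead consider any heavy atom):

```python
import math

def contact_map(coords, cutoff=4.0):
    """Binary contact map: C[i][j] = 1 if residues i and j are within
    `cutoff` angstroms of each other, and 0 otherwise. `coords` holds one
    representative (x, y, z) point per residue (a simplification)."""
    n = len(coords)
    return [[1 if math.dist(coords[i], coords[j]) <= cutoff else 0
             for j in range(n)] for i in range(n)]

# Toy 4-residue chain spaced 3.8 angstroms apart along the x-axis.
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
C = contact_map(toy, cutoff=4.0)
```

Note that the matrix is symmetric by construction, since the pairwise distance is symmetric.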

Unlike contact maps, which are “simple” network representations, there exists a different category of PSN features that are based on graph-theoretic concepts, i.e., that measure different network properties. One such baseline PSN feature is Existing-all, which integrates seven network properties to represent a PSN [6]. Another popular PSN feature that counts different types of network patterns is the concept of graphlets; graphlets are small subgraphs, i.e., Lego-like building blocks of complex networks [22].

We believe that the graph-theoretic PSN-based PSC is promising. This is because we recently proposed an unsupervised protein structural comparison approach called GRAFENE that relies on graphlets as PSN features of a protein [6]. Given a set of PSNs as input, GRAFENE first extracts different versions of graphlet features from each PSN. Then, it quantifies structural similarity between each pair of the PSNs by comparing their features in an unsupervised manner. GRAFENE outperformed other state-of-the-art 3D-structural protein comparison approaches, including DaliLite [23] and TM-align [24].

In this work, we use the graphlet-based PSN features for the first time in the task of supervised PSC, with a hypothesis that they will improve upon state-of-the-art non-graphlet PSN features and non-PSN features that have traditionally been used in this task. Note that there exists a supervised approach that used graphlets to study proteins [25]. However, it did so in the task of functional classification of amino acids, i.e., nodes in a PSN, rather than in our task of structural classification of proteins, i.e., PSNs. Also, this approach only used the concept of regular graphlets, while we also test a newer concept of ordered graphlets [26] (see Methods), which outperformed regular graphlets in the GRAFENE study [6].

In general, a PSC approach comprises two key aspects: 1) a method to extract features from a protein structure and 2) selection of a classification algorithm to be trained based on the features (and protein labels). Hence, existing PSC approaches can be divided into two broad categories. The first category includes approaches that focus on improving a classification algorithm by relying on existing features [27, 28, 29]. The second category includes approaches that extract novel features to predict the structural class of a protein by relying on existing classification algorithms [7, 30, 31]. Our study belongs to the second category, since our goal is to evaluate graphlet features against other state-of-the-art PSC features in a fair evaluation framework, i.e., under the same (representative) classifier, without necessarily aiming to find the best classifier.

1.2 Our contributions

We propose a PSC framework called NETPCLASS (network-based protein structural classification). As one part of our framework, we propose the use of graphlet- and thus PSN-based protein features in the PSC task under an existing classification algorithm. As another part of our framework, we aim to achieve the following. Graphlets can deal only with edge-unweighted networks. Yet, we hypothesize that the existing PSN definition, which links with unweighted edges those pairs of amino acids whose 3D spatial distance is below some predefined cutoff, can benefit from including as edge weights the actual spatial distances, and by doing so for all pairs of amino acids in the 3D structure rather than only for those pairs that are below the given distance cutoff. So, we model a PSN as a weighted adjacency matrix. Because extracting features from such a matrix is a non-trivial task, we propose a deep learning-based PSC approach that achieves this automatically.
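The weighted representation described above can be sketched as follows; this is a hypothetical, minimal Python version assuming one representative coordinate per residue, where every residue pair receives its actual spatial distance as a weight and no cutoff is applied:

```python
import math

def weighted_psn(coords):
    """Weighted PSN as a full distance matrix: entry (i, j) holds the
    spatial distance between residues i and j, for ALL residue pairs,
    with no distance cutoff applied."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)]
            for i in range(n)]

# Toy example: two residues at distance 5 (a 3-4-5 right triangle).
toy = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
W = weighted_psn(toy)
```

A matrix of this kind is what the deep learning part of the framework would take as input.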

More details about our study are as follows:

  1. We evaluate, in the task of PSC, eight versions of graphlet features that were already used for unsupervised protein structural comparison [6]. In addition, in the same task, we evaluate a non-graphlet network (Existing-all) feature [6], a recent contact map-based (CSM) feature [21], a recent 3D-structural (GIT) feature [19], a state-of-the-art sequence (SVMfold) feature, and a baseline sequence (AAComposition) feature [6]. We use the same classification algorithm for each of the above 13 features to learn (train) their classification models, in order to fairly compare their performance. Here, as a proof-of-concept, we use a simple yet powerful logistic regression (LR) classifier, whose output indicates, for the given input protein and each class, the likelihood that the protein belongs to the given class. Note that in initial stages of our evaluation, in addition to LR, we considered a support vector machine (SVM) classifier. We found that using SVM showed no improvement compared to using LR (under the same features and on the same PSN data) in terms of accuracy, while at the same time it was slower. So, we had no reason to continue using SVM.

  2. Since the different categories (i.e., sequence, 3D structural, contact-map, or PSN-based) of protein structural features can provide complementary information, we combine the individual features to form new integrated features and evaluate these against each of the individual features.

  3. Because graphlets, which are state-of-the-art network features, are currently designed only for unweighted PSNs, and because the current literature lacks knowledge on how to efficiently extract meaningful features from a weighted network, we aim to extract such features automatically via deep learning (DL).

  4. We evaluate the considered approaches on a large set of CATH and SCOP protein domains. We transform protein domains to PSNs with labels corresponding to CATH and SCOP structural classes, where we study each of the four levels of CATH and SCOP hierarchies [6]. Our evaluation is based on measuring how correctly the trained classification models can predict the classes of labeled proteins in the test data using k-fold cross-validation.
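The cross-validation protocol in item 4 can be sketched as follows; this is a toy, pure-Python split (function and parameter names are ours), where in the actual evaluation a classifier such as LR would be trained on each training fold and scored on the held-out fold:

```python
import random

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k folds for cross-validation: each fold
    is held out once as the test set while the classifier is trained on
    the remaining k-1 folds. Returns a list of (train, test) pairs."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # deterministic shuffle for the demo
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]

# 10 proteins split into 5 folds of 2 test proteins each.
splits = kfold_indices(10, 5)
```

Every protein appears in exactly one test fold, so each trained model is always evaluated on proteins it has not seen.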

Our key findings are as follows.

In terms of PSC accuracy, when we compare the individual features, we observe that the best of our graphlet features outperform all of the other individual features except GIT and SVMfold. However, while GIT shows only marginally superior performance to the best graphlet features, it is applicable only to proteins, whereas graphlets are general-purpose network features that are applicable to many other complex systems that can be modeled as networks. Further, we show that integrating GIT with the best graphlet features improves accuracy of each of GIT and the graphlet features, which means that each of the two individual features contributes to the superior performance of their integrated version. Additionally, we observe that integrating all of the individual features (not just GIT and the best graphlet features) further improves accuracy over each individual feature, yielding the best proposed approach. Regarding SVMfold, while this approach performs well (comparable to the best of our proposed approaches), SVMfold is orders of magnitude slower than our proposed approaches. In fact, SVMfold is so slow that we were able to run it only on 5.5% of our data. In terms of running time, SVMfold is the slowest, followed by the feature that integrates all of the individual features, followed by CSM, and followed by the rest of the features (which are mostly comparable to each other).

Accounting for edge weights in PSNs via DL achieves accuracy that is relatively comparable to performance of the individual unweighted network-based methods. Note that here we are comparing as simple as possible weighted network information (the weighted adjacency matrix) against highly sophisticated unweighted network information (graphlet features, which are the state-of-the-art in network science). So, a comparable accuracy of the former and the latter implies a promise of future weighted network-based analyses of protein 3D structures (such as developing and using weighted graphlet-based features).

2 Methods

2.1 Data and protein structure network (PSN) construction

First, we use a set of 17,036 proteins that was previously used in the large-scale unsupervised protein structural comparison GRAFENE study [6]. Per this study, this dataset contains all proteins from the PDB [32] whose pairwise sequence similarity is below a given threshold. To identify protein domains, we use two protein domain categorization databases: CATH and SCOP. Note that we can only use those CATH and SCOP protein domains that are present in our set of 17,036 proteins.

To construct a PSN from a protein domain, we use the Protein Data Bank (PDB) files [32], which contain information about the 3D coordinates of the heavy atoms (i.e., carbon, nitrogen, oxygen, and sulphur) of the amino acids in the domain. In a PSN, nodes are amino acids of a protein domain and there is an edge between any two nodes if they are sufficiently close in the 3D space. Clearly, given a protein domain, its corresponding PSN construction depends on 1) the choice of atom(s) of an amino acid to represent it as a node in the PSN and 2) a distance cutoff between a pair of nodes to capture their spatial proximity. It was recently shown, in the task of unsupervised protein structural comparison, by considering four different combinations of atom choice and distance cutoff definitions (any heavy atom with 4Å, 5Å, and 6Å distance cutoffs, and α-carbon with 7.5Å distance cutoff), that the choice of atom and distance cutoff does not significantly affect the overall protein structural comparison performance [6]. Hence, for most of our analyses, i.e., unless explicitly indicated otherwise, we consider only one of these PSN construction strategies in our study: we define an edge between two amino acids if the spatial distance between any of their heavy atoms is within 4Å. In order to evaluate the effect of the choice of PSN construction strategy on our results, for a subset of our analyses (specifically, in Section 3.7), we consider two additional PSN construction strategies: (i) we define an edge between two amino acids if the spatial distance between any of their heavy atoms is within 5Å and (ii) we define an edge between two amino acids if the spatial distance between any of their heavy atoms is within 6Å.
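The PSN construction just described can be sketched as follows; this is a toy Python version (function name and the coordinate data are illustrative stand-ins for real PDB input), implementing the any-heavy-atom, 4Å strategy:

```python
import math

def build_psn(residue_atoms, cutoff=4.0):
    """Construct an unweighted PSN: nodes are residue indices, and an
    edge links residues i and j if ANY pair of their heavy atoms is
    within `cutoff` angstroms. `residue_atoms` maps residue index ->
    list of (x, y, z) heavy-atom coordinates."""
    nodes = sorted(residue_atoms)
    edges = set()
    for a in nodes:
        for b in nodes:
            if a < b and any(
                    math.dist(p, q) <= cutoff
                    for p in residue_atoms[a]
                    for q in residue_atoms[b]):
                edges.add((a, b))
    return nodes, edges

# Toy example: three residues, each with a single "heavy atom".
toy = {0: [(0, 0, 0)], 1: [(3, 0, 0)], 2: [(10, 0, 0)]}
nodes, edges = build_psn(toy)
```

Changing `cutoff` to 5 or 6 reproduces the two alternative construction strategies mentioned above.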

Additionally, for most of our analyses, i.e., unless explicitly indicated otherwise, in order to only keep “meaningful” PSNs for further analysis, we filter the PSNs using an established guideline that is based on network properties of a PSN [6]. Namely, we only keep a PSN if it has 1) a single connected component, 2) a diameter of at least six, and 3) at least 100 nodes (amino acids) (Supplementary Section S1). Following these criteria, we obtain 9,509 and 11,451 PSNs corresponding to CATH and SCOP, respectively. Note that in order to fairly compare the considered protein features to each other, we want to focus on only those protein domains (i.e., PSNs) that can be processed by each of the features. Removal from the above data of those protein domains that cannot be processed by at least one considered feature results in 9,440 CATH and 11,352 SCOP PSNs for further analysis. These are the final data that are used throughout the study, unless explicitly stated otherwise (i.e., except in Sections 3.7 and 3.8).
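The three filtering criteria can be checked with plain breadth-first search; the sketch below is a pure-Python illustration (function names are ours; the thresholds follow the text, and the 120-node path graph is a toy input):

```python
from collections import deque

def is_meaningful_psn(nodes, adj, min_nodes=100, min_diam=6):
    """Apply the three filtering criteria: 1) a single connected
    component, 2) diameter >= min_diam, 3) at least min_nodes nodes.
    `adj` maps each node to the set of its neighbors."""
    if len(nodes) < min_nodes:
        return False
    def bfs_dists(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return dist
    if len(bfs_dists(next(iter(nodes)))) != len(nodes):
        return False  # more than one connected component
    # Diameter = maximum shortest-path distance over all node pairs.
    diam = max(max(bfs_dists(u).values()) for u in nodes)
    return diam >= min_diam

# A 120-node path graph: connected, diameter 119, 120 nodes -> kept.
nodes = list(range(120))
adj = {i: {j for j in (i - 1, i + 1) if 0 <= j < 120} for i in nodes}
ok = is_meaningful_psn(nodes, adj)
```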

In order to evaluate the effect of the above three PSN filtering criteria on our results, for a subset of our analyses, we consider even those connected PSNs that have diameter less than six or fewer than 100 nodes. We use the resulting data only in Section 3.8.

Given the CATH PSN data, we do the following. First, we test the power of the considered PSC approaches to predict the top hierarchical level classes of CATH: alpha (α), beta (β), alpha/beta (α/β), and few secondary structures. None of the CATH PSNs belongs to the few secondary structures class, so we do not consider this class further. Hence, we take all 9,440 CATH PSNs and identify them as a single PSN set, where the PSNs have labels corresponding to three top-level CATH classes: α, β, and α/β. Second, we compare the approaches on their ability to predict the second-level classes of CATH, i.e., within each of the top-level classes, we classify PSNs into their sub-classes. To ensure enough training data, we focus only on those top-level classes that have at least two sub-classes with at least 30 PSNs each. Three classes satisfy this criterion. For each such class, we take all of the PSNs belonging to that class and form a PSN set, which results in three PSN sets. Third, we compare the approaches on their ability to predict the third-level classes of CATH, i.e., within each of the second-level classes, we classify PSNs into their sub-classes. Again, we focus only on those second-level classes that have at least two sub-classes with at least 30 PSNs each. Nine classes satisfy this criterion. For each such class, we take all of the PSNs belonging to that class and form a PSN set, which results in nine PSN sets. Fourth, we compare the approaches on their ability to predict the fourth-level classes of CATH, i.e., within each of the third-level classes, we classify PSNs into their sub-classes. We again focus only on those third-level classes that have at least two sub-classes with at least 30 PSNs each. Six classes satisfy this criterion. For each such class, we take all of the PSNs belonging to that class and form a PSN set, which results in six PSN sets.
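The selection criterion applied at every level (keep a class only if it has at least two sub-classes with at least 30 PSNs each) can be sketched as follows; the helper name and the toy labels are hypothetical:

```python
from collections import Counter

def eligible_psn_sets(domain_labels, min_classes=2, min_psns=30):
    """Keep a class as a PSN set only if at least `min_classes` of its
    sub-classes contain at least `min_psns` PSNs each.
    `domain_labels` maps domain id -> (class, sub_class)."""
    per_class = {}
    for cls, sub in domain_labels.values():
        per_class.setdefault(cls, Counter())[sub] += 1
    return sorted(
        c for c, subs in per_class.items()
        if sum(1 for n in subs.values() if n >= min_psns) >= min_classes)

# Toy labels: class "a" has two large sub-classes; class "b" has only one.
labels = {}
labels.update({f"a.x{i}": ("a", "a.x") for i in range(30)})
labels.update({f"a.y{i}": ("a", "a.y") for i in range(30)})
labels.update({f"b.x{i}": ("b", "b.x") for i in range(30)})
labels.update({f"b.y{i}": ("b", "b.y") for i in range(5)})
sets = eligible_psn_sets(labels)
```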

Thus, in total, we analyze 1+3+9+6=19 CATH PSN sets. For further details on the number of PSNs and the number of different protein structural classes in each of the PSN sets, see Supplementary Tables S1-S3. We follow the same procedure for the SCOP PSN data and obtain 1+5+6+4=16 SCOP PSN sets. For more details, see Supplementary Section S2 and Supplementary Tables S1-S3.

Given the CATH and SCOP PSN sets, we group them into four PSN set groups, corresponding to the four hierarchy levels of CATH and SCOP: group 1 (all first-level PSN sets), group 2 (all second-level PSN sets), group 3 (all third-level PSN sets), and group 4 (all fourth-level PSN sets) (Figure 1).

Figure 1: Hierarchical representation of the 35 CATH and SCOP PSN sets that we use. Each oval shape represents a PSN set. The top line in the given oval indicates the name of the PSN set. The bottom line in the given oval contains two numbers represented as “x; y”, where x is the number of classes (i.e., labels) that are present in the PSN set and y is the number of PSNs averaged over all classes in the PSN set. For example, for the SCOP database, PSN set SCOP-primary has seven classes, which on average have 1,635 PSNs. All of the PSN sets at a given level form a PSN set group. For example, PSN sets CATH-primary and SCOP-primary form PSN set group 1. A given class of a PSN set in group i may be present as a PSN set in group i+1. For example, five of the seven classes of PSN set SCOP-primary (in group 1), i.e., α, β, α/β, α+β, and multidomain, are present as PSN sets in group 2, namely SCOP-α, SCOP-β, SCOP-α/β, SCOP-α+β, and SCOP-multidomain. Note that since we select a PSN set if and only if it has at least two classes each with at least 30 PSNs, not all of the classes of a PSN set in group i are necessarily present as PSN sets in group i+1. For example, PSN set SCOP-primary has seven classes in group 1, but only five of its classes exist as PSN sets in group 2. Also note that because of our PSN set selection criteria, a PSN set in group i+1 does not have to be present as a class within a PSN set in group i. For example, PSN set SCOP-c.2.1, which is present in group 4, is not present as a class within any PSN set in group 3. This is because SCOP-c.2 contains only one class that has at least 30 PSNs (i.e., c.2.1), and hence we do not consider SCOP-c.2 as a PSN set in our analysis. This figure has been adapted from the GRAFENE paper [6].
Note that the numbers of PSNs in this figure are different than the numbers of PSNs in the corresponding figure of the GRAFENE paper, because in this study we focus only on those PSNs that can be analyzed by each of the considered protein features (Section 2.1), where not all of our considered features are necessarily the same as all of the methods considered in the GRAFENE paper.

Second, in addition to the 35 CATH and SCOP PSN sets from the GRAFENE study as described above, we use a different dataset, for the following reason. Typically, high sequence similarity of proteins indicates their high structural similarity. Hence, given a set of proteins in which proteins in the same structural class have high sequence similarity, a “simple” protein sequence comparison might be sufficient to perform PSC [33]. So, we aim to evaluate how well our considered protein features (which are based on different aspects of a protein structure) can identify proteins in the same structural category when all of the proteins (within and across structural categories) show low sequence similarity.

In order to do this, we download the dataset called Astral from the SCOPe 2.04 database [34]. This dataset has 14,666 protein domains, where every pair of domains shares low sequence similarity. Each protein domain in this dataset is annotated by a label (i.e., protein structural class) assigned by the protein domain categorization database SCOP, where the label indicates the protein family to which the domain belongs. We create a PSN corresponding to each of the protein domains as described above. Then, we follow the same criteria as in the GRAFENE study [6] (also described above) to only keep “meaningful” PSNs. This results in 1,677 PSNs belonging to 32 different protein structural classes. We name this set of 1,677 PSNs Astral.

Taken together, in our study, we use 36 PSN sets (35 CATH and SCOP PSN sets and one Astral PSN set) that contain 9,440 protein domains annotated by the CATH database and 12,820 protein domains (union of the above SCOP-related 11,352 protein domains and the Astral-related 1,677 protein domains) annotated by the SCOP database.

2.2 Our evaluation framework

2.2.1 Protein features

For each of the protein domains, we extract 16 different protein features that are based on either sequence, 3D structure, contact map, non-graphlet network, graphlet network, or weighted network (Table 1). To understand what aspects of the protein structure each considered feature captures, we categorize the given feature based on whether it (1) captures local or global structure of a protein, (2) is based on the backbone or the side-chain of a protein, (3) is dependent on the sequentiality of amino acids of a protein or not, and (4) is dependent on the 3D structural conformation of a protein or not. We say that a feature is local if it explicitly uses local structural characteristics of a protein to summarize the whole protein structure, while we say that a feature is global if it summarizes the structure of a protein as a whole without explicitly using smaller structural characteristics of a protein. We say that a feature is based on the backbone (side-chain) of a protein if it uses only the heavy atoms of the protein backbone (side-chain). We say that a feature is sequence-dependent if permuting sequence positions of the amino acids of a given protein alters the feature as well. We say that a feature is 3D structural conformation-dependent if altering 3D structural positions of amino acids of a given protein alters the feature as well. By altering the 3D structural position of an amino acid, we mean changing the 3D spatial position of the amino acid to any other position in the 3D space, which can be different than positions of all other amino acids of the protein.

Note that none of the considered sequence-based features is dependent on the 3D structural conformation of a protein, while all of the other considered features are. Because we just covered this, we do not comment any further on whether a feature is 3D conformation-dependent or not.

Also, note that since we construct a PSN using any heavy atoms irrespective of whether the heavy atoms belong to the backbone or the side-chain of a protein (see above), any considered feature that is entirely or partially PSN-based (which is the case for each considered feature except sequence-based AAComposition and SVMfold plus 3D structure-based GIT) is automatically based on both the backbone and the side-chain of a protein. Henceforth, we only comment on whether a feature is backbone-based or side-chain-based (or neither) if and only if the feature is not PSN-based, i.e., we do it only for AAComposition, SVMfold, and GIT.

Below, we define each of the features and outline whether the given feature is local or global, as well as whether it is dependent on the sequentiality of amino acids of a protein (and for AAComposition, SVMfold, and GIT, whether the given feature is either backbone-based or side-chain-based (or neither)).

| Category | Feature name | Measures local/global structure | Based on protein backbone/side-chain | Dependent on sequentiality of amino acids | Dependent on 3D conformation |
| --- | --- | --- | --- | --- | --- |
| Sequence | AAComposition | Global | None | No | No |
| Sequence | SVMfold | Local | None | Yes | No |
| 3D structure | GIT | Local | Backbone | Yes | Yes |
| Contact map | CSM | Global | Both | No | Yes |
| Non-graphlet network | Existing-all | Global | Both | No | Yes |
| Graphlet network | Graphlet-3-4 | Local | Both | No | Yes |
| Graphlet network | Graphlet-3-5 | Local | Both | No | Yes |
| Graphlet network | NormGraphlet-3-4 | Local | Both | No | Yes |
| Graphlet network | NormGraphlet-3-5 | Local | Both | No | Yes |
| Graphlet network | OrderedGraphlet-3 | Local | Both | Yes | Yes |
| Graphlet network | OrderedGraphlet-3-4 | Local | Both | Yes | Yes |
| Graphlet network | NormOrderedGraphlet-3 | Local | Both | Yes | Yes |
| Graphlet network | NormOrderedGraphlet-3-4 | Local | Both | Yes | Yes |
| Integrated | GIT+OrderedGraphlet-3-4 | Local | Both | Yes | Yes |
| Integrated | Concatenate-all | Both | Both | Yes | Yes |
| Weighted network | Distance matrix | Global | Both | Yes | Yes |
Table 1: Summary of all protein features that we use in this study. We use the same classifier (logistic regression) for all features except “distance matrix” (colored in gray); for the latter, we use deep learning.

Sequence-based features. We use a baseline sequence feature, AAComposition. Given a protein sequence, AAComposition measures the relative frequency of the 20 types of amino acids in the entire sequence: for each amino acid type t, it measures the frequency of occurrence of t in the sequence divided by the total number of amino acids in the sequence. Because AAComposition considers the entire protein sequence, it is a global feature. Because AAComposition does not use any heavy atom of an amino acid, it is neither a backbone- nor a side-chain-based feature. Because AAComposition only measures the relative frequency of the different amino acid types without looking at the sequence position of a given amino acid, it is not sequence-dependent.
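A minimal sketch of AAComposition (assuming the 20 standard one-letter amino acid codes; the function name is ours):

```python
from collections import Counter

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid types

def aa_composition(seq):
    """Relative frequency of each of the 20 amino acid types in `seq`.
    Sequence order is ignored, so permuting the sequence positions of
    the amino acids leaves the feature unchanged (not sequence-dependent)."""
    counts = Counter(seq)
    n = len(seq)
    return [counts.get(a, 0) / n for a in AA]

# Toy sequence: half alanine (A), half glycine (G).
feat = aa_composition("AAGG")
```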

Besides AAComposition, we use a recent state-of-the-art sequence method called SVMfold. Given a protein sequence, SVMfold computes the position-specific scoring matrix (PSSM) [9], three-state secondary structure (SS) profile [10], and HMM profile [11] of the protein sequence, and integrates these three features to obtain a single feature representation of the protein [7]. Because SVMfold relies on HMM profile, which extracts features from subsequences of a protein, SVMfold is a local feature. Because SVMfold does not explicitly use any heavy atom of an amino acid, it is neither a backbone- nor a side-chain-based feature. Because SVMfold relies on PSSM profile, which extracts features from a sequence alignment of proteins, SVMfold is sequence-dependent.

Note that there exists another method called SVM-fold [35] (the difference between the names of this method and the above SVMfold method being the presence and absence of “-”, respectively). However, we do not use this method because the focus of its publication was to propose not a novel protein feature but an improved machine learning algorithm that, given any sequence-based protein feature, can perform PSC. Specifically, the method feeds an existing sequence-based protein feature into string kernels [36], which it then uses to classify proteins. However, as pointed out earlier in Section 1, the focus of our study is to evaluate different protein features in the task of PSC and not to improve the underlying PSC algorithm. Additionally, this method (i.e., SVM-fold) is more than a decade old, while the sequence-based method that we evaluate (i.e., SVMfold) in our paper is a recent approach.

3D structural feature. We use a recent 3D-structural feature, GIT. Given a protein structure, GIT measures how often, according to the amino acid sequence, the α-carbon trace of the protein forms different kinds of local patterns in the 3D space. To measure the number of different patterns, GIT computes 31 different Gauss integrals and uses them as the feature representation of a protein [19]. Because GIT measures local patterns formed by the α-carbon trace of a protein, it is a local feature. Because GIT is based on only the α-carbons of a protein and because α-carbons are part of the backbone of a protein, GIT is a backbone-based feature. Because GIT is based on the α-carbon trace of a protein and because a change in sequence positions of amino acids might affect the α-carbon trace of a protein, GIT is sequence-dependent.

Contact map-based feature. We use a recent contact map-based feature, CSM. Given a protein structure, CSM first computes 151 contact maps, which are based on 151 distance cutoffs taken at evenly spaced increments. For a given distance cutoff, two amino acids are considered to be in contact if any of their heavy atoms are within the distance cutoff; then, CSM counts the number of amino acid pairs that are in contact at that cutoff. Finally, CSM uses all 151 counts (for the 151 cutoffs) as the 151-dimensional protein feature vector [21]. Because CSM counts the total number of contacts present in the whole protein, it is a global feature. Because CSM counts the number of contacts and because a change in sequence positions of amino acids without altering 3D positions of the amino acids will not alter the number of contacts, CSM is not sequence-dependent.
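As an illustration, the per-cutoff contact counting behind CSM can be sketched as follows. This is a minimal sketch on a toy 4-residue distance matrix with three hypothetical cutoffs, not the 151 cutoffs of the actual CSM; the input is assumed to be a symmetric matrix of minimum heavy-atom distances.

```python
import numpy as np

def csm_feature(dist, cutoffs):
    """CSM-style feature: for each distance cutoff, count the amino acid
    pairs whose minimum heavy-atom distance is within that cutoff."""
    dist = np.asarray(dist, dtype=float)
    iu = np.triu_indices_from(dist, k=1)   # each unordered pair counted once
    pair_dists = dist[iu]
    return np.array([(pair_dists <= c).sum() for c in cutoffs])

# Toy 4-residue distance matrix (symmetric, zeros on the diagonal).
d = np.array([[0.0, 3.0, 5.0, 9.0],
              [3.0, 0.0, 4.0, 8.0],
              [5.0, 4.0, 0.0, 2.0],
              [9.0, 8.0, 2.0, 0.0]])
print(csm_feature(d, [4.0, 6.0, 10.0]))  # [3 4 6]: counts grow with the cutoff
```

The resulting vector (one count per cutoff) is the CSM-style feature representation of the protein.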

Non-graphlet network-based feature. Here, we use a feature that was shown to outperform many other non-graphlet network-based features in an unsupervised protein comparison task [6]. We denote this feature as Existing-all. Given a PSN, Existing-all calculates and integrates seven global network features: average degree, average distance, maximum distance, average closeness centrality, average clustering coefficient, intra-hub connectivity, and assortativity. Because Existing-all uses global network measures to quantify the structure of a protein (i.e., PSN), it is a global feature. Because Existing-all extracts features from a PSN and because a change in sequence positions of amino acids without altering 3D positions of the amino acids will not alter the PSN structure, Existing-all is not sequence-dependent.
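The spirit of Existing-all can be sketched with networkx as below. This is an illustrative subset of the seven measures, not the published implementation: intra-hub connectivity is omitted because its exact definition follows [6].

```python
import networkx as nx
import numpy as np

def existing_all(G):
    """Global feature vector in the spirit of Existing-all: average degree,
    average shortest-path distance, maximum distance (diameter), average
    closeness centrality, average clustering coefficient, and degree
    assortativity. Assumes a connected PSN."""
    n = G.number_of_nodes()
    avg_degree = 2.0 * G.number_of_edges() / n
    avg_dist = nx.average_shortest_path_length(G)
    max_dist = nx.diameter(G)
    avg_close = np.mean(list(nx.closeness_centrality(G).values()))
    avg_clust = nx.average_clustering(G)
    assort = nx.degree_assortativity_coefficient(G)
    return np.array([avg_degree, avg_dist, max_dist, avg_close, avg_clust, assort])

G = nx.path_graph(5)  # toy connected "PSN"
print(existing_all(G))
```

Each PSN is thus summarized by a single short vector of global network statistics, in contrast to the local graphlet counts described next.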

Graphlet network-based features. We use eight such features.

Graphlet counts. We use two graphlet-based protein features, i.e., Graphlet-3-4 and Graphlet-3-5. Given a PSN, Graphlet-3-4 and Graphlet-3-5 count the occurrences of all 3-4-node and 3-5-node graphlets, respectively. In particular, in the Graphlet-3-4 or Graphlet-3-5 feature vector, position i represents the logarithm of the count of graphlets of type i [6]. Because graphlets are small network patterns of a PSN and because Graphlet-3-4 and Graphlet-3-5 are based on graphlets, both Graphlet-3-4 and Graphlet-3-5 are local features. Because both Graphlet-3-4 and Graphlet-3-5 extract features from a PSN and because a change in sequence positions of amino acids without altering 3D positions of the amino acids will not alter the PSN structure, neither Graphlet-3-4 nor Graphlet-3-5 is sequence-dependent.
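For intuition, the two 3-node graphlet types (the open path and the triangle) can be counted by brute force as below; real graphlet counters for 3-5-node graphlets use far more efficient combinatorial algorithms, and the actual feature then stores the logarithm of each count.

```python
from itertools import combinations
import networkx as nx

def graphlet3_counts(G):
    """Count the two 3-node graphlets: open paths (2 edges among a node
    trio) and triangles (3 edges). A brute-force O(n^3) sketch."""
    path3 = triangle = 0
    for trio in combinations(G.nodes(), 3):
        e = sum(G.has_edge(u, v) for u, v in combinations(trio, 2))
        if e == 2:
            path3 += 1
        elif e == 3:
            triangle += 1
    return path3, triangle

G = nx.complete_graph(4)    # every node trio forms a triangle
print(graphlet3_counts(G))  # (0, 4)
```

On the complete graph on four nodes, all C(4, 3) = 4 trios are triangles and none is an open path, which matches the printed counts.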

Normalized graphlet counts. Since PSNs can be of very different sizes, we use two recent protein features that are based on normalized graphlet counts and that thus account for network size differences [6]. These features are NormGraphlet-3-4 and NormGraphlet-3-5; they are normalized equivalents of Graphlet-3-4 and Graphlet-3-5, respectively. In particular, given a PSN, in both NormGraphlet-3-4 and NormGraphlet-3-5 feature vectors, position i represents the total count of graphlets of type i divided by the sum of the counts of all graphlet types. Similar to Graphlet-3-4 and Graphlet-3-5, NormGraphlet-3-4 and NormGraphlet-3-5 are local features and are not sequence-dependent.

Ordered graphlet counts. Graphlets capture 3D structural but not sequence information. To integrate the two, ordered graphlets were proposed [26]. These are graphlets whose nodes acquire a relative ordering based on positions of the amino acids in the sequence. Two ordered graphlet features exist: OrderedGraphlet-3 and OrderedGraphlet-3-4 [26, 6]. For a PSN, in the OrderedGraphlet-3 and OrderedGraphlet-3-4 feature vectors, position i is the total count of ordered graphlets of type i.

In addition, we use two features that are based on normalized counts of ordered graphlets [6]: NormOrderedGraphlet-3 and NormOrderedGraphlet-3-4; these are normalized equivalents of OrderedGraphlet-3 and OrderedGraphlet-3-4, respectively. For a PSN, in the NormOrderedGraphlet-3 and NormOrderedGraphlet-3-4 feature vectors, position i is the total count of ordered graphlets of type i divided by the total count of all ordered graphlet types. Because ordered graphlets are essentially graphlets, all of the ordered graphlet-based features are local. Because ordered graphlets use node order as per sequence positions of amino acids in a protein and because a change in sequence positions of amino acids might change ordered graphlet counts, all of the ordered graphlet-based features are sequence-dependent.

Integrated features. We propose two new features that integrate two different subsets of the above protein features. First, we integrate the best of the non-graphlet features (GIT) and the best of the graphlet features (OrderedGraphlet-3-4) (Section 3) into a new combined feature called GIT+OrderedGraphlet-3-4. Second, we integrate all of the features (except SVMfold) into a new combined feature called Concatenate-all; we do not use SVMfold because of its high running time complexity (Section 3). Formally, given k features of lengths n_1, n_2, …, n_k, we concatenate them to form a feature of length n_1 + n_2 + … + n_k. Clearly, any integrated feature will belong to all of those categories to which the corresponding individual features belong (Table 1).

Principal component analysis (PCA)-transformed features. For each of the 15 features described above (eight graphlet features, one non-graphlet network feature, one contact map feature, one 3D-structural feature, two sequence features, and two integrated features), we generate the corresponding 15 new features using PCA. Recently, PCA transformation of protein features, in order to better capture their (dis)similarity, was proposed [6]. Here, we perform the same PCA transformation. For a given PSN set, for each of the above 15 protein features, we apply PCA to obtain new PCA-transformed features. Specifically, we pick the first d principal components, choosing d to be at least two and otherwise as low as possible such that the PCA transformation retains at least 90% of the variation in the dataset.
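The component-selection rule above can be sketched with scikit-learn as follows; this is a minimal illustration on synthetic data, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_transform(X, var_kept=0.90, min_components=2):
    """Keep the fewest principal components d that retain at least
    `var_kept` of the variance, but never fewer than `min_components`."""
    pca = PCA().fit(X)
    cum = np.cumsum(pca.explained_variance_ratio_)
    d = max(min_components, int(np.searchsorted(cum, var_kept) + 1))
    return PCA(n_components=d).fit_transform(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))  # 50 proteins, 10-dimensional toy feature
Xt = pca_transform(X)
print(Xt.shape)  # (50, d) with 2 <= d <= 10
```

The same transformation is applied per PSN set and per feature, so each feature's post-PCA dimensionality d can differ across PSN sets.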

Weighted network-based feature. We use a weighted adjacency matrix, or distance matrix [37], of a 3D protein structure as a weighted PSN-based feature representation. In particular, given a protein of length n, we define a weighted adjacency matrix of size n × n, in which each position (i, j) contains the minimum 3D spatial distance between amino acids i and j, where the minimum is taken over all pairwise distances between any heavy atoms of i and j. Because the weighted adjacency matrix measures pairwise distances among amino acids in the entire protein, it is a global feature. Because a change in sequence positions of amino acids will alter the matrix values, the weighted adjacency matrix is sequence-dependent.
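The minimum heavy-atom distance matrix can be computed as below; the toy three-residue "protein" and its coordinates are illustrative placeholders for real PDB-derived heavy-atom coordinates.

```python
import numpy as np

def weighted_adjacency(heavy_atoms):
    """heavy_atoms: list of (k_i, 3) arrays, one per amino acid, holding the
    3D coordinates of that amino acid's heavy atoms. Entry (i, j) of the
    returned matrix is the minimum distance over all heavy-atom pairs."""
    n = len(heavy_atoms)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            diff = heavy_atoms[i][:, None, :] - heavy_atoms[j][None, :, :]
            D[i, j] = D[j, i] = np.sqrt((diff ** 2).sum(-1)).min()
    return D

# Toy protein with three amino acids, one or two heavy atoms each.
aa = [np.array([[0.0, 0.0, 0.0]]),
      np.array([[3.0, 0.0, 0.0], [4.0, 0.0, 0.0]]),
      np.array([[0.0, 5.0, 0.0]])]
D = weighted_adjacency(aa)
print(D)
```

The same matrix, thresholded at a distance cutoff, also yields the unweighted PSNs used elsewhere in this paper.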

Taken together, in our study, we use 31 different protein features (13 individual pre-PCA features, 13 individual post-PCA features, two integrated pre-PCA features, two integrated post-PCA features, and one weighted network-based feature).

2.2.2 The logistic regression (LR) framework

Given a PSN set, we train an LR classifier corresponding to each of the 30 out of the 31 (all except the weighted network-based feature) pre- or post-PCA protein features (see above). Hence, for each of the 36 PSN sets, we get 30 different trained LR classifiers. In each of the trained classifiers, the input is a feature representation of a protein and the output is the structural class to which the protein belongs. Since PSC is a multi-class problem, we use the one-vs-rest scheme to train an LR classifier. Due to space constraints, we provide further details about our LR framework in Supplementary Section S3.

We use 10-fold cross-validation to evaluate the performance of our LR classifiers. Given a PSN set, we first divide it into 10 equal-sized subsets, such that each subset contains the same proportion of different protein structural classes (i.e., labels) as present in the initial PSN set. Then, using one of the subsets as the test set and the union of the remaining nine subsets as the training set, we measure the percentage of proteins that are classified into their correct protein structural classes in the test set. We do this for each of the 10 subsets. Then, we take the average of the 10 percentages (i.e., accuracy values) that correspond to the 10 runs.
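The evaluation protocol above can be sketched with scikit-learn; the synthetic data stands in for a real (feature matrix, structural class labels) pair, and stratified folds preserve class proportions as described.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.multiclass import OneVsRestClassifier

# Toy stand-in for a PSN set: feature matrix X, structural class labels y.
X, y = make_classification(n_samples=200, n_features=20, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# One-vs-rest multi-class logistic regression, evaluated with stratified
# 10-fold cross-validation; the reported score is the mean fold accuracy.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(round(scores.mean(), 3))
```

In the actual study this procedure is repeated once per feature and per PSN set.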

2.2.3 The deep learning (DL) framework

In the second part of our study, we design a DL framework, in order to learn features of 3D protein structures using weighted protein structure networks. For each of the 36 PSN datasets, we train a deep neural network, where we use distance matrix-representations of proteins as input. Our DL framework consists of one input layer, seven hidden layers, and an output layer. Due to space constraints, we provide further details about our DL framework in Supplementary Section S4.

We evaluate our DL architecture using 10-fold cross-validation as described above. Given a PSN set, we follow the same procedure as for the LR framework to obtain performance accuracy for the given PSN set.

Note that recently a related deep learning method was proposed that uses 3D-structural information for protein function prediction [38]. Given a protein, the method extracts two types of structural information: the torsional angles for each of the amino acids and the pairwise spatial distances between the α-carbons of the amino acids of a protein.

Given this structural information, the method uses a deep convolutional neural network framework along with a support vector machine to perform protein function prediction. The method was applied to classify enzymes (i.e., proteins) into functional categories. Since we only became aware of this very recent method towards the end of our work, we could not include it in our evaluation. However, note that this method is not based on PSNs, and the focus of its study was the task of protein function prediction rather than our task of PSC.

3 Results and discussion

Throughout this section, unless stated otherwise, we analyze the 35 considered PSN sets that span all four levels (groups) of the CATH and SCOP hierarchies, plus the Astral PSN set, for a total of 36 PSN sets. For each considered feature, we report its accuracy as well as running time.

In Section 3.1, we compare the different graphlet features (Section 2.2) to identify the best one(s) for further analyses. In Section 3.2, we identify the best of the pre- or the post-PCA versions for each of the considered features. In Section 3.3, we evaluate how well the best graphlet feature(s) perform in comparison to the existing baseline or state-of-the-art protein features that we study (considering for each feature the best of its pre- and post-PCA versions). Here, we leave out from consideration the existing SVMfold sequence approach [7], because we were unable to apply this approach to all 36 considered PSN sets due to its extremely high time complexity. Instead, we consider SVMfold later on, in a smaller-scope analysis of two of the 36 PSN sets on which we were able to run SVMfold (see below). In Section 3.4, we evaluate whether integration of the different features improves upon each of the individual features. In Section 3.5, we compare the performance of the best graphlet-based PSC approaches that deal with unweighted PSNs to the performance of simple weighted PSN-based feature classification via deep learning. In Section 3.6, we analyze two representative PSN sets on which SVMfold could be run, in order to compare our proposed approaches to this state-of-the-art existing sequence-based PSC approach. In the remaining sections, we zoom into the effect of some methodological choices that we have made. Specifically, in Section 3.7, we analyze the effect of the choice of PSN construction strategy on the performance of our best performing graphlet feature, and in Section 3.8, we analyze the effect of not considering our PSN filtering criteria (from Section 2.1) on the performance of all of the considered features for one of the considered PSN sets.

3.1 Comparison of graphlet features

When we compare all graphlet features under the LR classifier, OrderedGraphlet-3-4 is the most accurate of all pre-PCA graphlet features, while NormOrderedGraphlet-3-4 is the most accurate of all post-PCA graphlet features (Figure 2). So, for further analyses we keep these two best-performing graphlet features.

Adding sequence-based node (amino acid) order, i.e., using OrderedGraphlet-3-4, improves upon its regular (non-ordered) counterpart. This result is in alignment with our past work on unsupervised protein comparison [6]. Unlike in our past unsupervised study, in our current study, graphlet feature normalization does not always improve upon non-normalized features, and sometimes it actually worsens accuracy.

Figure 2: Accuracy of the 16 pre- and post-PCA graphlet features under the LR classifier, for each of the four hierarchy levels (groups) of CATH dataset, averaged over all PSN sets belonging to the given group (vertical lines are standard deviations), plus the Astral PSN set. Results are qualitatively similar for the four groups of the SCOP dataset as well (Supplementary Figure S1).

3.2 Selection of the best of the pre- or the post-PCA features

Here, we consider the following features under the LR classifier: all non-graphlet features except SVMfold, the two top performing graphlet features from Section 3.1, and all integrated features (i.e., Concatenate-all, and GIT+OrderedGraphlet-3-4) (Table 1). Note that since we could not apply SVMfold to all of the 36 PSN sets that we use, we exclude SVMfold from this analysis. When we compare pre- and post-PCA versions of each of the considered features, we find that pre-PCA performs the best for GIT, CSM, OrderedGraphlet-3-4, GIT+OrderedGraphlet-3-4, and Concatenate-all, while post-PCA performs the best for AAComposition, Existing-all, and NormOrderedGraphlet-3-4 (Figure 3).

So, PCA helps in three out of the eight cases. Thus, henceforth, for each feature, we use the best of its pre- and post-PCA versions. Also, for a given feature, if we use its post-PCA version, “*” is shown next to the feature’s name.

Figure 3: Accuracy of pre- and post-PCA versions of all non-graphlet features except SVMfold, the two top performing graphlet features from Section 3.1, and all integrated features (i.e., Concatenate-all and GIT+OrderedGraphlet-3-4) under the LR classification framework (Table 1), for PSN group 4 of CATH. Results for the other groups of CATH, all groups of SCOP, and the Astral PSN set are qualitatively similar (Supplementary Figs. S2, S3, and S4, respectively). Results are averaged over all PSN sets in the given group (horizontal and vertical lines are standard deviations).

3.3 Graphlet feature(s) outperform other protein features

First, we compare the best of our graphlet features to a baseline sequence (AAComposition) feature to see how our graphlet features, which are PSN-based, compare against a naive sequence-based feature. We find that both OrderedGraphlet-3-4 and NormOrderedGraphlet-3-4* significantly (in terms of p-values; Figure 6) outperform AAComposition in terms of accuracy and are comparable in terms of running time (Figs. 4, 5, and 6).

Second, we compare the best of our graphlet features with a recent 3D-structural (GIT) feature to see how our graphlet features, which intuitively capture the 3D structure of a protein and are PSN-based, compare against a 3D-structural protein feature that is not PSN-based. We find that on average GIT shows marginally superior performance in terms of accuracy over both OrderedGraphlet-3-4 and NormOrderedGraphlet-3-4*. However, we note the following.

  • Although GIT shows only marginally superior performance to the graphlet features, GIT is applicable only to proteins, while graphlets are general-purpose network features that are applicable to many other complex systems that can be modeled as networks.

  • Even in the field of modeling protein structures, GIT has the following limitation. GIT depends on α-carbons to correctly extract the 3D structural feature of a protein and cannot process a protein with fewer than 10 α-carbons, while graphlet-based features have no such limitation. Note that this is not a problem in our analysis because we only include those protein structures that could be processed by each of the approaches that we consider. Nonetheless, since GIT is dependent on α-carbons, it would be interesting to see how noise-tolerant GIT is to the removal of α-carbons, compared to the best of our graphlet approaches (i.e., OrderedGraphlet-3-4). We evaluate this using the Astral PSN set as a proof-of-concept. For each protein in the Astral PSN set, we randomly remove x% of the α-carbons, where we vary x from 0% to 100% in increments of 10%. Hence, we obtain 11 different Astral PSN sets corresponding to the 11 different values of x (i.e., different “noise” levels). For each of these 11 PSN sets, we measure the performance of both GIT and OrderedGraphlet-3-4 (just like we have done in the rest of the study). Since we randomly remove α-carbons from proteins, we perform the above experiment three times and report the average performance of both GIT and OrderedGraphlet-3-4 over the three runs. We find that as x increases, the performance of GIT continues to decrease, while OrderedGraphlet-3-4 has almost constant and typically higher-than-GIT performance (Supplementary Figure S5). It is likely that OrderedGraphlet-3-4 works better (i.e., is both more robust to noise and overall more accurate) because it does not rely solely on α-carbons but also considers other heavy atoms to extract protein features (Section 2.1).

  • Furthermore, notice that given a PSN set, we measure the performance of a given approach as the percentage of correctly classified protein structures over all of the protein structural classes, without looking into how the given approach performs with respect to each of the structural classes individually (Section 2.2). So, although on average GIT shows comparable performance to our individual graphlet features, it would be interesting to see whether our individual graphlet features are more suitable for certain protein structural classes than GIT. We find that of all 256 protein classes over all 36 PSN sets, the best of our graphlet approaches, OrderedGraphlet-3-4 and NormOrderedGraphlet-3-4*, show better performance than GIT in 28 and 19 protein classes, respectively, i.e., in a total of 47 classes (Supplementary Tables S4 and S5). This means that for some of the protein structural classes, our graphlet-based protein features identify protein structures more correctly than GIT. We compute the enrichment of the 47 classes for which at least one graphlet approach is better than GIT, as well as of the remaining classes for which GIT is better than any graphlet approach, in each of the level 1 classes of CATH and SCOP (which we summarize into “α only”, “β only”, “both α and β”, and “other” classes). We find that the 47 classes where graphlet approaches perform the best are significantly enriched in one of the “α only”/“β only” classes (p-value of 0.039) and in the “other” classes (p-value of 0.004), while the 209 classes where GIT performs the best are significantly enriched in the “both α and β” classes (p-value of 0.001); we observe no other significant enrichments (Supplementary Section S5). Hence, the two approaches (graphlets vs. GIT) seem to favor different class types.
Given the complementary performance of the two approaches, our integrated features, which include GIT as well as at least one of the graphlet features, could improve upon each of the individual graphlet features and GIT. This is exactly what we observe (see below). In terms of running time, GIT and our individual graphlet features are comparable (Figure 4).

Figure 4: Accuracy versus running time of the approaches from Figure 3 plus deep learning (DL), for group 4 of CATH. Results are qualitatively similar for all other groups of CATH and SCOP (Supplementary Figs. S6 and S7). For each method except DL, the best of its pre- and post-PCA versions is chosen (DL does not have this option). If the latter is selected, “*” is shown next to the given feature’s name.
Figure 5: Accuracy versus running time of the approaches from Figure 3 plus deep learning (DL), for the Astral PSN set. For each method except DL, the best of its pre- and post-PCA versions is chosen (DL does not have this option). If the latter is selected, “*” is shown next to the given feature’s name.

Third, we compare the best of our graphlet features with a recent contact map-based (CSM) feature to see how our graphlet features, which are grounded in graph-theoretic concepts, compare against CSM, which relies on the “simple” concept of contact maps. We find that both OrderedGraphlet-3-4 and NormOrderedGraphlet-3-4* significantly (in terms of p-values; Figure 6) outperform CSM both in terms of accuracy and running time (Figs. 4, 5, and 6).

Fourth, we compare the best of our graphlet features with a baseline network (Existing-all) feature to see how our graphlet features, which are more comprehensive network measures, compare against Existing-all, which relies on naive network measures. We find that both OrderedGraphlet-3-4 and NormOrderedGraphlet-3-4* significantly (in terms of p-values; Figure 6) outperform Existing-all in terms of accuracy and are comparable in terms of running time (Figs. 4, 5, and 6).

These results indicate that, as expected, graphlet-based features perform better than any of the considered baseline features, while showing better or comparable accuracy (possibly providing complementary information) compared to the considered state-of-the-art contact map-based and 3D-structural features.

3.4 Feature integration improves PSC accuracy

We expect the different categories (i.e., sequence, 3D-structural, contact-map, or PSN-based) of features to capture different aspects of a protein structure. Thus, integrating these features may help capture complementary structural information. Hence, we integrate GIT, the best of the non-graphlet features, and OrderedGraphlet-3-4, the best of the graphlet features, to form a new feature called GIT+OrderedGraphlet-3-4. GIT captures the backbone fold of a protein and OrderedGraphlet-3-4 captures both the PSN structure (that captures the 3D structure) and the protein sequence structure using ordered graphlets. Hence, we expect that GIT+OrderedGraphlet-3-4 will improve upon most, if not all, of the individual features. Indeed, overall, GIT+OrderedGraphlet-3-4 improves upon the individual features in terms of PSC accuracy and is comparable to the individual features in terms of running time (Figs. 4 and 5, and Table 2).

Note that we integrate two of the best performing features and do not use other possible combinations of features for the following reason. Given all of the features that we use (Table 1), there are exponentially many ways in which we can combine those features to form a new feature. Hence, evaluating all of the combinations is not feasible. However, one heuristic is to integrate all of the features into a single combined feature, and hence capture as much of the different protein structure information as possible. We do this in the following manner.

We integrate all individual features that we consider under the LR classifier (all except SVMfold) to form a single feature called Concatenate-all, in the hope that Concatenate-all will improve upon each of the individual features (Table 1). Because of the high running time complexity of SVMfold, we could not apply it to all 36 PSN sets and hence we do not use SVMfold as part of our integrated feature. Our results show that Concatenate-all yields significant (in terms of p-values; Figure 6) improvement in accuracy compared to each of the individual protein features, although at the expense of higher running times, as expected (Figs. 4, 5, and 6, and Table 2).

Figure 6: Statistical significance of the accuracy difference of the approaches from Figure 4. For each of the 36 PSN sets, we measure raw accuracy values for each of the 10 approaches. Hence, for each approach, there are 36 raw accuracy values (corresponding to the 36 PSN sets). For each pair of approaches, we compare the two given approaches’ 36 raw accuracy values using the paired t-test. In the figure, every cell (i, j) indicates the statistical significance (in terms of p-value) of approach i being superior to approach j.
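The pairwise comparison behind Figure 6 can be sketched as below; the synthetic accuracy values are illustrative stand-ins for the real 36 per-PSN-set accuracies of two approaches.

```python
import numpy as np
from scipy.stats import ttest_rel

# Toy accuracy values for two approaches over the same 36 PSN sets.
rng = np.random.default_rng(0)
acc_a = rng.uniform(0.80, 0.95, size=36)
acc_b = acc_a - rng.uniform(0.00, 0.05, size=36)  # approach B slightly worse

# Paired t-test on the matched accuracy values; halving the two-sided
# p-value gives the one-sided test of "A is superior to B".
stat, p_two_sided = ttest_rel(acc_a, acc_b)
p_one_sided = p_two_sided / 2 if stat > 0 else 1 - p_two_sided / 2
print(p_one_sided < 0.05)
```

A paired (rather than unpaired) test is appropriate here because the two approaches are evaluated on the same 36 PSN sets.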

3.5 Weighted network-based DL classification performs comparably to unweighted graphlet classification

Our proposed DL classifier performs quite well in terms of accuracy (Figs. 4 and 5, and Table 2). Specifically, it is significantly (in terms of p-values) superior to AAComposition*, CSM, and Existing-all* (Figure 6). However, the performance of the DL classifier is significantly lower than that of OrderedGraphlet-3-4, GIT, GIT+OrderedGraphlet-3-4, and Concatenate-all (Figure 6). Yet, the performance of the DL classifier is comparable to one of the two top performing graphlet features, NormOrderedGraphlet-3-4* (Figs. 4 and 5).

                             CATH                              SCOP
  Approach                   Grp 1   Grp 2   Grp 3   Grp 4     Grp 1   Grp 2   Grp 3   Grp 4     Astral
  OrderedGraphlet-3-4        91.64   78.42   89.54   93.85     81.40   79.15   84.25   91.67     70.45
  NormOrderedGraphlet-3-4*   89.03   71.95   87.17   92.87     77.59   71.67   82.20   89.16     65.35
  Existing-all*              82.40   62.33   59.44   72.24     63.82   45.31   71.31   74.58     30.32
  AAComposition*             63.05   61.33   72.23   83.38     52.85   55.38   82.26   78.96     39.16
  CSM                        85.49   69.41   74.43   84.69     72.61   56.74   82.19   88.39     44.39
  GIT                        91.47   81.93   95.58   96.00     81.92   89.21   92.39   96.72     81.94
  Concatenate-all            94.33   86.30   96.18   97.76     85.32   91.65   92.87   96.52     78.84
  GIT+OrderedGraphlet-3-4    94.61   87.43   95.81   97.49     84.99   91.03   91.17   95.59     79.03
  Deep Learning              85.72   82.94   81.63   90.85     67.67   62.27   84.36   89.96     48.84
Table 2: Accuracy of the approaches from Figure 4 for each CATH and SCOP group plus Astral dataset.

Importantly, unlike the other individual (LR) classifiers, which make use of highly sophisticated unweighted network information such as graphlet features, the DL framework utilizes only the simplest possible weighted network information (i.e., the weighted adjacency matrix of a network) as its input. This points to a promise of future algorithmic developments for dealing with weighted networks, perhaps even designing weighted graphlet features.

3.6 Our features versus SVMfold

SVMfold has high running time because it needs to extract three sets of very comprehensive features from protein sequence information. This complex information retrieval process needs to be performed for each protein in the considered PSN set, which is not feasible when analyzing large PSN sets containing many proteins (such as those at the higher levels of CATH/SCOP hierarchies) or many PSN sets. Hence, we can compare our approaches to the state-of-the-art SVMfold approach only for two representative PSN sets out of all 36 PSN sets.

Specifically, we choose CATH-3.20.20 and CATH-3.40.50 from group 4 of the CATH data as the representative PSN sets, for the following reasons. These two PSN sets correspond to the fourth level of the CATH hierarchy, i.e., to as specific structural classes as possible, which are the most relevant for applied biochemistry scientists. Also, of all fourth-level PSN sets, CATH-3.20.20 is one of the PSN sets on which at least one of our top performing graphlet approaches gives relatively low accuracy, which gives SVMfold the best-case advantage over our approaches, and CATH-3.40.50 is one of the PSN sets on which both of our top performing graphlet approaches give high accuracy, which gives our approaches the best-case advantage over SVMfold.

Overall, our best performing graphlet features OrderedGraphlet-3-4 and NormOrderedGraphlet-3-4*, and our integrated features GIT+OrderedGraphlet-3-4 and Concatenate-all, are comparable to SVMfold in terms of accuracy (the individual graphlet approaches and GIT+OrderedGraphlet-3-4 on CATH-3.40.50, and Concatenate-all on both CATH-3.20.20 and CATH-3.40.50) at a fraction of SVMfold’s running time (Table 3).

                             Accuracy                      Running time (minutes)
  Approach                   CATH-3.20.20   CATH-3.40.50   CATH-3.20.20   CATH-3.40.50
  OrderedGraphlet-3-4        82.07          97.86          18.77          4.95
  NormOrderedGraphlet-3-4*   76.90          96.07          26.61          7.02
  Existing-all*              51.90          60.36          6.42           1.65
  AAComposition*             79.31          81.07          0.06           0.04
  CSM                        76.03          80.71          53.74          15.32
  GIT                        85.86          99.64          2.10           0.36
  Concatenate-all            94.66          98.57          90.70          25.05
  GIT+OrderedGraphlet-3-4    92.07          97.86          20.83          5.31
  Deep Learning              83.79          92.87          5.00           3.07
  SVMfold*                   99.31          100.00         79,365.37      29,859.46
Table 3: Accuracy and running times (in minutes) of the approaches from Figure 4 plus SVMfold, for the CATH-3.20.20 and CATH-3.40.50 PSN sets. Due to SVMfold’s large running time, we could not evaluate it on additional PSN sets.

3.7 The effect of the choice of PSN construction strategy on the results

Up to this section, all of our results have been based on the PSN construction strategy that joins two amino acids if any of their heavy atoms are within a distance cutoff of 4Å from each other (Section 2.1). Different choices of distance cutoffs can result in different PSNs for the same 3D structure, which could affect the performance of a PSN-based PSC approach. So, in this section, we examine the effect of the choice of distance cutoff on our results. Specifically, as a proof-of-concept, we compare the performance of the best of the graphlet approaches (i.e., OrderedGraphlet-3-4) at different cutoffs; we do this for each of the 36 considered PSN sets. We choose the cutoffs of interest as follows.

It has been shown that, when considering any heavy atoms of amino acids, distance cutoffs below 4Å result in highly disconnected PSNs, while distance cutoffs beyond 6.5Å result in random-like PSN structures [39]. That is, meaningful distance cutoffs lie in the range of 4Å to 6.5Å. So, in addition to the 4Å cutoff used elsewhere in the paper, here, we analyze two additional cutoffs in this range, namely 5Å and 6Å. Then, we compare the performance of OrderedGraphlet-3-4 at the 4Å cutoff against its performance at the other two cutoffs.
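The cutoff-dependent PSN construction can be sketched as below; the toy three-residue distance matrix is an illustrative placeholder for a real minimum heavy-atom distance matrix.

```python
import numpy as np
import networkx as nx

def build_psn(D, cutoff):
    """Unweighted PSN from a minimum heavy-atom distance matrix D:
    connect amino acids i and j (i != j) if D[i, j] <= cutoff."""
    A = (D <= cutoff)
    np.fill_diagonal(A, False)  # no self-loops
    return nx.from_numpy_array(A.astype(int))

D = np.array([[0.0, 3.9, 5.5],
              [3.9, 0.0, 6.2],
              [5.5, 6.2, 0.0]])
for c in (4.0, 5.0, 6.0):  # the three cutoffs compared in this section
    print(c, build_psn(D, c).number_of_edges())
```

As the printed edge counts show, raising the cutoff can only add edges, which is why different cutoffs can yield quite different PSNs for the same 3D structure.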

We find that on average over all PSN sets in the given group (Section 2.1), the three versions of OrderedGraphlet-3-4 corresponding to the 4Å, 5Å, and 6Å cutoffs are typically within each other’s standard deviations (Supplementary Fig. S8). Yet, when we quantify the statistical significance of the difference in their accuracy, we find that the 6Å cutoff significantly (in terms of adjusted p-values) outperforms the other two cutoffs. Note that we compute the statistical significance of the difference as follows. For each of the three versions of OrderedGraphlet-3-4, we consider its 36 accuracy values (for the 36 PSN sets). Then, we compare the 36 values of one OrderedGraphlet-3-4 version to the 36 values of each of the other two versions, using the paired t-test.

The above result means that the choice of distance cutoff for PSN construction could affect the performance of any PSN-based PSC approach. We have shown that it does affect the performance of our best PSN-based approach, OrderedGraphlet-3-4. However, this finding only further strengthens the conclusions of this paper: that ordered graphlet features are a new state-of-the-art for PSC. This is because even OrderedGraphlet-3-4 at the 4Å cutoff is superior or comparable to the existing non-graphlet features, including the state-of-the-art 3D structural and sequence features. Since the latter are not PSN-based, their performance does not depend on the PSN construction strategy, including the choice of distance cutoff. So, switching to the 6Å cutoff can improve only the PSN-based approaches, and indeed, we have verified that this holds for OrderedGraphlet-3-4. Hence, repeating the entire analysis at 6Å results in even higher superiority of our ordered graphlet feature compared to the existing 3D structural and sequence approaches.

3.8 The effect of not considering our PSN filtering criteria on the results

Recall that we apply several PSN filtering criteria: we analyze a PSN if and only if it (1) is connected, (2) has at least 100 nodes, and (3) has a diameter of at least six (Section 2.1 and Supplementary Section S1). Here, we examine the effect of these criteria on the relative performance of the considered protein features. Specifically, as a proof of concept, we analyze a modified Astral PSN set in which we now include a PSN even if it has fewer than 100 nodes or a diameter of less than six (we still focus on connected PSNs only, to avoid any bias in our PSC evaluation due to missing data; Supplementary Section S1). The modified Astral set has 2,588 PSNs, compared to the 1,677 PSNs of the original Astral set used elsewhere in this paper (Section 2.1). We evaluate the accuracy of each feature from Figure 5 on the modified Astral PSN set. While the performance of each feature decreases compared to its performance on the original Astral PSN set, the relative performance of the different features compared to each other remains the same as on the original Astral PSN set (Supplementary Figure S9). That is, independent of whether we use our PSN filtering criteria, the results in and conclusions of this paper remain unchanged.
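The three filtering criteria above can be sketched as a simple predicate over a PSN's adjacency structure. This is an illustrative implementation, assuming an adjacency-dict representation; the function names are our own, not the paper's code. Connectivity and diameter are computed with plain breadth-first search.

```python
from collections import deque

def bfs_dists(adj, src):
    """Shortest-path distances (in hops) from `src` to all reachable nodes."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def keep_psn(adj, min_nodes=100, min_diam=6):
    """Apply the paper's filtering criteria: keep the PSN iff it is
    connected, has at least `min_nodes` nodes, and has a diameter
    of at least `min_diam`."""
    nodes = list(adj)
    if len(nodes) < min_nodes:
        return False
    dist = bfs_dists(adj, nodes[0])
    if len(dist) != len(nodes):  # some node unreachable -> disconnected
        return False
    # Diameter = maximum eccentricity over all nodes (exact, O(n * m)).
    diam = max(max(bfs_dists(adj, s).values()) for s in nodes)
    return diam >= min_diam

# A 120-node path graph passes all three criteria (diameter 119 >= 6):
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 120] for i in range(120)}
```

A 100-node clique, by contrast, would fail criterion (3): it is connected and large enough, but its diameter is 1.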

4 Conclusion

This study proposes the first ever network-based PSC framework. Specifically, this study is the first to use state-of-the-art network-based features called graphlets in the task of PSC. We comprehensively evaluate our graphlet features against state-of-the-art protein features that are based on various other aspects of protein structure, including sequence, 3D structure, and contact map information (although we again note some similarity between the notions of a PSN and a contact map). We find that the network-based graphlet features are superior to most of the considered existing features. Additionally, we show that integrating different protein features improves the PSC accuracy compared to individual features, possibly by capturing complementary protein structural information. Further, our proposed DL framework, which automatically learns appropriate features from simple weighted PSN adjacency matrices, yields comparable accuracy to many of the sophisticated features that we use, which work on unweighted PSNs. This points to a promising future for algorithms that will rely on weighted network-based features of protein 3D structures, such as weighted graphlets, which currently do not exist.

We show that among all of the considered graphlet features, ordered graphlets perform the best. The superiority of ordered graphlets over the other graphlet and most of the non-graphlet (including 3D structural and sequence) features might come from the following. First, ordered graphlets comprehensively combine both sequence and 3D structural information, by incorporating, on top of the network patterns that they capture, the relative ordering of nodes in a PSN, where this ordering relies on the sequence positions of the amino acids of the corresponding protein. On the other hand, all other features are either sequence- or 3D structure-based, but not both. Of course, incorporating not just the relative order of amino acids in the protein sequence but even more of the sequence information into the graphlet approach, such as which particular amino acids appear in which graphlet positions, could yield even further improvements. Developing such an approach is non-trivial and is thus beyond the scope of this study. Second, we believe that studying protein 3D structures as networks is more meaningful than studying the 3D structures directly. This is not just because of the power of networks to reveal interesting data patterns that non-network approaches might miss [40, 41, 42, 43], but also because the former allows for a protein structure to be studied with any (current or future) state-of-the-art method for network analysis developed in any field of network science, including the field of protein structural analysis, while the latter allows for using only methods specialized in the field of protein structural analysis. Because the former spans a much larger research community, there likely exist many more network methods that can potentially be used to study protein structures than there exist 3D structural approaches.
Moreover, because network approaches face a large number of competing network approaches, they must undergo thorough evaluation and continuous improvement, which should keep yielding ever more powerful network methods. In this study, of all current network approaches, we have focused on graphlets, because they have proven to be state-of-the-art network methods: being mathematically sensitive, they are able to capture detailed topological information from many different types of complex real-world networks [44].

We evaluate our framework on CATH and SCOP protein domain classes, and in particular on all currently available classes that contain a large enough number of protein domains to have sufficient statistical power in the classification task. As more protein domain data become available, and consequently as more protein domain classes become sufficiently large, our proposed PSC approaches can easily be re-trained on the new data, to allow for classification of protein domains into the new classes as well. Importantly, our framework is a general-purpose framework for network classification. That is, although we evaluate our framework in the task of PSC, i.e., when classifying PSNs, our framework can be used to classify networks from any other field, including, but not limited to, other types of biological networks (e.g., protein-protein interaction networks [44]) or social networks (which also have implications for human health [45, 46]).

Authors’ contributions

Conceived the study: KN AR TM. Collected and processed the data: KN. Designed the methodology: AR MG TM. Designed the experiments: KN AR MG TM. Performed the experiments: KN AR MG. Analyzed the results: KN AR MG TM. Wrote the paper: KN AR MG TM. Read and approved the paper: KN AR MG PJA TM. Supervision: PJA TM.

Competing interests

The authors have no competing interests.

Funding

This work is funded by the National Institutes of Health (NIH) 1R01GM120733 grant. The analyses in this study were partly carried out on the computing infrastructure funded by the National Science Foundation [CNS-1629914].

References

  • [1] Kasabov, N. K. Springer Handbook of Bio-/Neuro-Informatics (Springer, 2013), 1st edn.
  • [2] Whisstock, J. C. & Lesk, A. M. Prediction of protein function from protein sequence and structure. Quarterly Reviews of Biophysics 36, 307–340 (2003).
  • [3] Jain, P., Garibaldi, J. M. & Hirst, J. D. Supervised machine learning algorithms for protein structure classification. Computational Biology and Chemistry 33, 216–223 (2009).
  • [4] Greene, L. H. et al. The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Research 35, D291–D297 (2006).
  • [5] Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 247, 536–540 (1995).
  • [6] Faisal, F. E. et al. GRAFENE: Graphlet-based alignment-free network approach integrates 3D structural and sequence (residue order) data to improve protein structural comparison. Scientific Reports 7, 14890 (2017).
  • [7] Xia, J., Peng, Z., Qi, D., Mu, H. & Yang, J. An ensemble approach to protein fold classification by integration of template-based assignment and support vector machine classifier. Bioinformatics 33, 863–870 (2016).
  • [8] Saidi, R., Maddouri, M. & Mephu Nguifo, E. Protein sequences classification by means of feature extraction with substitution matrices. BMC Bioinformatics 11, 175 (2010).
  • [9] Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25, 3389–3402 (1997).
  • [10] Jones, D. T. Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology 292, 195–202 (1999).
  • [11] Remmert, M., Biegert, A., Hauser, A. & Söding, J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nature Methods 9, 173 (2012).
  • [12] Krissinel, E. On the relationship between sequence and structure similarities in proteomics. Bioinformatics 23, 717–723 (2007).
  • [13] Kosloff, M. & Kolodny, R. Sequence-similar, structure-dissimilar protein pairs in the PDB. Proteins: Structure, Function, and Bioinformatics 71, 891–902 (2008).
  • [14] Cui, C. & Liu, Z. Classification of 3D protein based on structure information feature. In BMEI International Conference on BioMedical Engineering and Informatics, vol. 1, 98–101 (IEEE, 2008).
  • [15] Kalajdziski, S., Mirceva, G., Trivodaliev, K. & Davcev, D. Protein Classification by Matching 3D Structures. In IEEE Frontiers in the Convergence of Bioscience and Information Technologies, 147–152 (2007).
  • [16] Jo, T., Hou, J., Eickholt, J. & Cheng, J. Improving protein fold recognition by deep learning networks. Scientific Reports 5, 17573 (2015).
  • [17] Wang, J., Li, Y., Zhang, Y., Tang, N. & Wang, C. Class conditional distance metric for 3D protein structure classification. In International Conference on Bioinformatics and Biomedical Engineering (iCBBE), 1–4 (2011).
  • [18] Zhi, D., Shatsky, M. & Brenner, S. E. Alignment-free local structural search by writhe decomposition. Bioinformatics 26, 1176–1184 (2010).
  • [19] Harder, T., Borg, M., Boomsma, W., Rogen, P. & Hamelryck, T. Fast large-scale clustering of protein structures using Gauss integrals. Bioinformatics 28, 510–515 (2012).
  • [20] Godzik, A., Kolinski, A. & Skolnick, J. Topology fingerprint approach to the inverse protein folding problem. Journal of Molecular Biology 227, 227–238 (1992).
  • [21] Pires, D. E. et al. Cutoff Scanning Matrix (CSM): structural classification and function prediction by protein inter-residue distance patterns. BMC Genomics 12, S12 (2011).
  • [22] Pržulj, N., Corneil, D. G. & Jurisica, I. Modeling interactome: scale-free or geometric? Bioinformatics 20, 3508–3515 (2004).
  • [23] Holm, L. & Rosenström, P. Dali server: conservation mapping in 3D. Nucleic Acids Research 38, W545–W549 (2010).
  • [24] Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Research 33, 2302–2309 (2005).
  • [25] Vacic, V., Iakoucheva, L. M., Lonardi, S. & Radivojac, P. Graphlet kernels for prediction of functional residues in protein structures. Journal of Computational Biology 17, 55–72 (2010).
  • [26] Malod-Dognin, N. & Pržulj, N. GR-Align: fast and flexible alignment of protein 3D structures using graphlet degree similarity. Bioinformatics 30, 1259–1265 (2014).
  • [27] Lin, C. et al. Hierarchical classification of protein folds using a novel ensemble classifier. PLOS ONE 8, e56499 (2013).
  • [28] Vipsita, S., Shee, B. K. & Rath, S. K. An efficient technique for protein classification using feature extraction by artificial neural networks. In IEEE India Conference (INDICON), 1–5 (2010).
  • [29] Melvin, I., Weston, J., Leslie, C. S. & Noble, W. S. Combining classifiers for improved classification of proteins from sequence or structure. BMC Bioinformatics 9, 389 (2008).
  • [30] Wei, L., Liao, M., Gao, X. & Zou, Q. Enhanced protein fold prediction method through a novel feature extraction technique. IEEE Transactions on Nanobioscience 14, 649–659 (2015).
  • [31] Dai, H.-L. Imbalanced protein data classification using ensemble FTM-SVM. IEEE Transactions on Nanobioscience 14, 350–359 (2015).
  • [32] Berman, H. M. et al. The protein data bank. Nucleic Acids Research 28, 235–242 (2000).
  • [33] Rost, B. Twilight zone of protein sequence alignments. Protein Engineering, Design and Selection 12, 85–94 (1999).
  • [34] Fox, N. K., Brenner, S. E. & Chandonia, J.-M. SCOPe: Structural Classification of Proteins-extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Research 42, D304–D309 (2014).
  • [35] Melvin, I. et al. SVM-fold: a tool for discriminative multi-class protein fold and superfamily recognition. In BMC Bioinformatics, vol. 8, S2 (2007).
  • [36] Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N. & Watkins, C. Text classification using string kernels. Journal of Machine Learning Research 2, 419–444 (2002).
  • [37] Holm, L. & Sander, C. Protein structure comparison by alignment of distance matrices. Journal of Molecular Biology 233, 123–138 (1993).
  • [38] Zacharaki, E. I. Prediction of protein function using a deep convolutional neural network ensemble. PeerJ Computer Science 3, e124 (2017).
  • [39] Milenković, T., Filippis, I., Lappe, M. & Pržulj, N. Optimized null model for protein structure networks. PLOS ONE 4, e5967 (2009).
  • [40] Rund, S. S. et al. Genome-wide profiling of 24 hr diel rhythmicity in the water flea, Daphnia pulex: Network analysis reveals rhythmic gene expression and enhances functional gene annotation. BMC Genomics 17, 653 (2016).
  • [41] Liu, S. et al. Network analysis of the NetHealth data: exploring co-evolution of individuals’ social network positions and physical activities. Applied Network Science 3, 45 (2018).
  • [42] Greene, L. H. Protein structure networks. Briefings in Functional Genomics 11, 469–478 (2012).
  • [43] Caldera, M., Buphamalai, P., Müller, F. & Menche, J. Interactome-based approaches to human disease. Current Opinion in Systems Biology 3, 88–94 (2017).
  • [44] Newaz, K. & Milenković, T. Graphlets in network science and computational biology. Analyzing Network Data in Biology and Medicine: An Interdisciplinary Textbook for Biological, Medical and Computational Scientists 193 (2019).
  • [45] Liu, S. et al. The power of dynamic social networks to predict individuals’ mental health. arXiv preprint arXiv:1908.02614 (2019).
  • [46] Liu, S. et al. Heterogeneous network approach to predict individuals’ mental health. arXiv preprint arXiv:1906.04346 (2019).