Network-based protein structural classification

Arash Rahnama, Khalique Newaz, Panos J. Antsaklis, Tijana Milenković
Department of Electrical Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
Abstract

Experimental determination of protein function is resource-intensive. As an alternative, computational prediction of protein function has received attention. In this context, protein structural classification (PSC) can help, by allowing one to determine the structural classes of currently unclassified proteins based on their features, and then relying on the fact that proteins with similar structures have similar functions. Existing PSC approaches rely on sequence-based or direct ("raw") 3-dimensional (3D) structure-based protein features. Instead, we first model 3D structures as protein structure networks (PSNs) and then use ("processed") network-based features for PSC. We are the first to do so. We propose the use of graphlets, state-of-the-art features in many domains of network science, in the task of PSC. Moreover, because graphlets can deal only with unweighted PSNs, and because accounting for edge weights when constructing PSNs could improve PSC accuracy, we also propose a deep learning framework that automatically learns network features from weighted PSNs. When evaluated on a large set of 9,509 CATH and 11,451 SCOP protein domains, our proposed approaches are superior to existing PSC approaches in terms of both accuracy and running time.


1 Introduction

Motivation and related work. Proteins are major molecules of life, and thus understanding their cellular function is important. However, doing so experimentally is costly and time-consuming [1]. Instead, computational approaches are often used for this purpose; they are much more efficient because they leverage the fact that (sequence or 3-dimensional) structural similarity of proteins often indicates their functional similarity. One type of such computational approaches is protein structural classification (PSC) [2]. A PSC framework uses structural features of proteins with known labels (typically CATH [3] or SCOP [4] structural classes) to learn a classification model in a supervised manner (i.e., by including the labels in the process of training the model). Then, the structural features of a protein with unknown label can be used as input to the classification model to determine the structural class of the protein. This information can in turn be used to predict the function of a protein based on the functions of other proteins that belong to the same structural class as the protein of interest. In this paper, we focus on the PSC problem.

Note that there exists a related computational problem that can also help with protein function prediction, namely protein structural comparison [5]. However, unlike PSC: 1) protein structural comparison uses structural features of proteins with known or unknown labels in an unsupervised rather than supervised manner (i.e., it ignores any potential label information), and 2) it uses the features to compute pairwise similarities between the proteins in the hope that highly similar proteins will have the same label (where the labels are used only after the fact), rather than predicting the label of a single protein. In other words, both the goals and the working mechanisms of PSC and protein structural comparison are different. Hence, the two approach categories are not comparable to and cannot be fairly evaluated against each other.

Since proteins with high sequence similarity typically also have high 3-dimensional (3D) structural and functional similarity, traditional PSC approaches have relied only on sequence-based protein features [6]. However, proteins with low sequence similarity can still show high 3D structural and functional similarity [7]. On the other hand, proteins with high sequence similarity can have low 3D structural and functional similarity [8]. Hence, PSC based on 3D-structural as opposed to (or in addition to) sequence features could more correctly identify the structural class of a protein [9].

Interestingly, recent 3D-structural approaches, although supervised, have focused on classification based on protein pairs [10, 11, 2]. For example, they consider a pair of protein structures, one with a known label (class) and the other with an unknown label, and if the proteins are similar enough in terms of their 3D-structural (and possibly also sequence) features, they assign the known label of the currently classified protein to the currently unclassified protein. As such, these approaches fall somewhere in between PSC (because both are supervised, but PSC analyzes a single protein at a time) and protein structural comparison (because both focus on protein pairs, but protein structural comparison is unsupervised). Therefore, they are not comparable to and cannot be directly evaluated against approaches that solve the PSC problem as defined in our study. Our extensive literature search did not reveal any recent 3D-structural approaches that can solve our considered PSC problem, which is why we do not consider such approaches in our evaluation.

Typically, 3D-structural approaches extract features directly from the 3D structures of proteins and then use these “raw” features [9, 12]. In contrast, protein 3D structures can first be modeled using protein structure networks (PSNs), in which nodes are amino acids and edges link amino acids that are spatially close enough to each other. Then, network-based features can be extracted from the PSNs and used in the task of PSC. To our knowledge, no one has done this yet, and we aim to close this gap.

We believe that PSN-based PSC is promising. This is because we recently used PSN-based protein representations in the task of unsupervised protein structural comparison. Specifically, we proposed an approach called GRAFENE that relies on graphlets as PSN features of a protein [5]; graphlets are subgraphs or small lego-like building blocks of complex networks [13]. Given a set of PSNs as input, GRAFENE first extracts different versions of graphlet features from each PSN. Then, it quantifies structural similarity between each pair of the PSNs by comparing their features. GRAFENE outperformed other state-of-the-art 3D-structural protein comparison approaches, including DaliLite [14] and TM-align [15]. Also, GRAFENE outperformed an existing non-graphlet PSN-based protein comparison approach, Existing-all, and a baseline sequence-based approach, AAComposition (see Methods).

Given that graphlet-based PSN features have been successful in the task of unsupervised protein structural comparison, here, we use them for the first time in the task of supervised PSC, with a hypothesis that they will improve upon state-of-the-art non-graphlet and non-PSN features that have traditionally been used in this task, when all features are run under the same classifier. Note that there exists a supervised approach that used graphlets to study proteins [16]. However, it did so in the task of functional classification of amino acids, i.e., nodes in a PSN, rather than in our task of structural classification of proteins, i.e., PSNs. Also, this approach only used the concept of regular graphlets, while we also test a newer concept of ordered graphlets [17] (see Methods), which outperformed regular graphlets in the GRAFENE study [5].

In general, a PSC approach comprises two key aspects: 1) a method to extract features from a protein structure and 2) a classification algorithm to be trained on those features (and protein labels). Hence, existing PSC approaches can be divided into two broad categories. The first category includes approaches that extract novel features to predict the structural class of a protein while relying on existing classification algorithms [6, 18, 19]. The second category includes approaches that focus on improving a classification algorithm while relying on existing features [20, 21, 22]. Our study belongs to the first category, since our goal is to evaluate graphlet features against other state-of-the-art PSC features in a fair evaluation framework, i.e., under the same (representative) classifier, without necessarily aiming to find the best classifier.

Our contributions. We propose a new PSC framework called NETPCLASS (network-based protein structural classification). As one part of our framework, we propose the use of graphlet- and thus PSN-based protein features in the PSC task under an existing classification algorithm. As another part of our framework, we aim to achieve the following. Graphlets can deal only with edge-unweighted networks. Yet, we hypothesize that the existing PSN definition, which links with unweighted edges those pairs of amino acids whose 3D spatial distance is below some predefined threshold, can benefit from including as edge weights the actual spatial distances, and by doing so for all pairs of amino acids in the 3D structure rather than only for those pairs that are below the given threshold. So, we model a PSN as a weighted adjacency matrix. Because extracting features from such a matrix is a non-trivial task, we propose a deep learning-based PSC approach that achieves this automatically. More details about our study are as follows:

  1. We evaluate nine versions of graphlet features that were already used in the task of unsupervised protein structural comparison [5], to see how they compare to each other in the task of supervised PSC. Also, we use principal component analysis (PCA) to reduce the dimensionality of the graphlet features, as well as of the baseline sequence (AAComposition) and non-graphlet network (Existing-all) features [5], by keeping only the most important information from each feature. We use the same classification algorithm to learn (train) classification models for each of the above 22 features, in order to fairly compare their performance. Here, as a proof-of-concept, we use a simple yet powerful logistic regression (LR) classifier, whose output indicates, for the given input protein and each class, the likelihood that the protein belongs to the given class. We use LR rather than, e.g., simple regression or an even potentially more powerful support vector machine (SVM) because we want as output the likelihood for each class (which is what LR provides) rather than only the final class assignment (which is what the latter two classifiers typically provide), for the following reason.

  2. We hypothesize that different types of features (e.g., different versions of graphlet features, or graphlet versus sequence features) may contain complementary protein structural information. So, we propose an ensemble learning (EL) classifier that integrates the outputs of the individual feature-based LR classification models (and thus implicitly combines the different features) into a new feature, and that then utilizes another (final) classifier to predict the structural class of an unlabeled protein based on its integrated feature. Here, we select SVM as the final classifier in our EL framework, because at this stage we can accept the final class assignment as the output (rather than only the likelihood for each class), and because in some initial tests using SVM as the final classifier always performed at least as well as using LR as the final classifier.

  3. Because graphlets, which are state-of-the-art network features [5], are currently designed only for unweighted PSNs, and because the current literature lacks knowledge on how to efficiently extract meaningful features from a weighted network, we aim to extract such features automatically via deep learning (DL).

  4. We compare our top performing graphlet-based approach(es) and our DL approach to two sequence-based approaches (AAComposition and SVMfold) and a non-graphlet network-based approach (Existing-all). We do so fairly, under the same classifiers as discussed above. We compare to AAComposition as a simple baseline sequence approach [5]. We compare to SVMfold as a state-of-the-art sequence PSC approach, which integrates three sets of sequence features and uses SVM as its classification algorithm [6]. We compare against Existing-all as both a baseline and state-of-the-art non-graphlet network approach [5]. Note that Existing-all was already used in the task of unsupervised protein structural comparison but not in our task of supervised PSC. Also, we found two other recent PSC approaches, EnFTM-SVM [19] and PFPA [18], both sequence-based. While we wanted to include these methods in our study, we could not access their software, as our emails to the authors remained unanswered. Yet, because PFPA uses two sets of sequence features that are both included in SVMfold, PFPA's performance is expected to be at most as high as that of SVMfold. So, superior performance of our approaches over SVMfold would likely indicate their superior performance over PFPA as well. Further, note that, per our above discussion, we could not find recent 3D-structural approaches that solve the PSC problem as defined in our study.

  5. We evaluate the considered approaches on a large set of 9,509 CATH and 11,451 SCOP protein domains. We transform protein domains into PSNs with labels corresponding to CATH and SCOP structural classes, where we study each of the four levels of the CATH and SCOP hierarchies [5]. Our evaluation is based on measuring how correctly the trained classification models can predict the classes of unlabeled proteins in the test data using k-fold cross-validation.

Our key findings are as follows:

  1. Our graphlet features are superior to the baseline AAComposition sequence or Existing-all network features in terms of accuracy while being relatively comparable in terms of running time. While the state-of-the-art SVMfold sequence approach performs well (though still inferior to the best of our proposed approaches), SVMfold is orders of magnitude slower than our approaches. In fact, SVMfold is so slow that we were able to run it only on 5.7% of our data.

  2. Using PCA on the features often improves performance.

  3. Feature integration using the EL framework considerably improves accuracy compared to individual features, though at higher running time.

  4. Accounting for edge weights in PSNs via DL achieves accuracy that is relatively comparable to performance of the individual unweighted network-based graphlet methods, though at higher running time (due to using DL). Note that here we are comparing as simple as possible weighted network information (the weighted adjacency matrix) against highly sophisticated unweighted network information (graphlet features, which are the state-of-the-art in network science). So, a comparable accuracy of the former and the latter is promising.

2 Methods

2.1 Data and protein structure network (PSN) construction

We use a set of 17,036 proteins that was previously used in a large-scale unsupervised protein structural comparison study [5]. To identify protein domains, we use two protein domain categorization databases: CATH and SCOP.

To construct a PSN from a protein domain, we use Protein Data Bank (PDB) files, which contain information about the 3D coordinates of the heavy atoms (i.e., carbon, nitrogen, oxygen, and sulphur) of the amino acids in the domain. In a PSN, nodes are amino acids of a protein domain and there is an edge between two nodes if they are sufficiently close in 3D space. Clearly, given a protein domain, the construction of its PSN depends on 1) the choice of atom(s) of an amino acid to represent it as a node in the PSN and 2) a distance threshold between a pair of nodes to capture their spatial proximity. It was recently shown, by considering four different combinations of atom choice and distance threshold (any heavy atom with 4 Å, 5 Å, or 6 Å distance thresholds, and α-carbon with a 7.5 Å distance threshold), that the choice of atom and distance threshold does not significantly affect the overall protein structural comparison performance [5]. Hence, we consider only one of these PSN construction strategies in our study. Namely, we define an edge between two amino acids if the spatial distance between any of their heavy atoms is within 4 Å. Following this and other established guidelines [5], we obtain 9,509 and 11,451 PSNs corresponding to CATH and SCOP, respectively. We use these two data sets in our study.
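To make the edge rule concrete, below is a minimal Python sketch of building an unweighted PSN from per-residue heavy-atom coordinates. The residue_coords input, the build_psn helper, and the use of SciPy/NetworkX are illustrative assumptions of this sketch, not the NETPCLASS implementation (PDB parsing is omitted).

```python
# Minimal PSN-construction sketch: nodes are residues; two residues are linked
# if any pair of their heavy atoms lies within the distance cutoff (4 Angstroms).
# Assumes residue_coords is a list of (n_atoms_i x 3) NumPy arrays of heavy-atom
# coordinates, one array per residue, already parsed from a PDB file.
import networkx as nx
from scipy.spatial.distance import cdist

def build_psn(residue_coords, cutoff=4.0):
    psn = nx.Graph()
    psn.add_nodes_from(range(len(residue_coords)))
    for i in range(len(residue_coords)):
        for j in range(i + 1, len(residue_coords)):
            # minimum heavy-atom distance between residues i and j
            if cdist(residue_coords[i], residue_coords[j]).min() <= cutoff:
                psn.add_edge(i, j)
    return psn
```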

Given the CATH PSN data, we first test the power of the considered PSC approaches to predict the top hierarchical level classes of CATH: alpha (α), beta (β), alpha/beta (α/β), and few secondary structures. None of the CATH PSNs belongs to the few secondary structures class, so we do not consider this class further. Hence, we take all 9,509 CATH PSNs and treat them as a single PSN set, where the PSNs have labels corresponding to the three top-level CATH classes: α, β, and α/β.

Second, we compare the approaches on their ability to predict the second-level classes of CATH, i.e., within each of the top-level classes, we classify PSNs into their sub-classes. To ensure enough training data, we focus only on those top-level classes that have at least two sub-classes with at least 30 PSNs each. Three classes satisfy this criterion. For each such class, we take all the PSNs belonging to that class and form a PSN set, which results in three PSN sets.

Third, we compare the approaches on their ability to predict the third-level classes of CATH, i.e., within each of the second-level classes, we classify PSNs into their sub-classes. Again, we focus only on those second-level classes that have at least two sub-classes with at least 30 PSNs each. Nine classes satisfy this criterion. For each such class, we take all the PSNs belonging to that class and form a PSN set, which results in nine PSN sets.

Fourth, we compare the approaches on their ability to predict the fourth-level classes of CATH, i.e., within each of the third-level classes, we classify PSNs into their sub-classes. We again focus only on those third-level classes that have at least two sub-classes with at least 30 PSNs each. Six classes satisfy this criterion. For each such class, we take all the PSNs belonging to that class and form a PSN set, which results in six PSN sets.

Thus, in total, we analyze 1 + 3 + 9 + 6 = 19 CATH PSN sets. For further details on the number of PSNs and the number of different protein structural classes in each of the PSN sets, see Supplementary Tables S1-S3.

We follow the same procedure for the SCOP PSN data and obtain 16 SCOP PSN sets. For more details, see Supplementary Section S1 and Supplementary Tables S1-S3.

Given the CATH and SCOP PSN sets, we group them into four PSN set groups, corresponding to the four hierarchy levels of CATH and SCOP: group 1 (all first-level PSN sets), group 2 (all second-level PSN sets), group 3 (all third-level PSN sets), and group 4 (all fourth-level PSN sets).

2.2 Our evaluation framework

2.2.1 Protein features

For each of the protein domains, we extract the following types of protein features that are based on either sequence, network (non-graphlet or graphlet), integration of sequence and network (graphlet), or weighted network.
Sequence-based feature. We use a popular baseline sequence feature, AAComposition. Given a protein sequence, AAComposition measures the relative frequency of each of the 20 amino acid types: for each amino acid type t, it computes the number of occurrences of t in the sequence divided by the total number of amino acids in the sequence.
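As an illustration, the AAComposition feature can be computed in a few lines of Python; the aa_composition helper and the one-letter-code input are assumptions of this sketch rather than the exact implementation used in [5].

```python
# Sketch of the AAComposition feature: the relative frequency of each of the
# 20 standard amino acid types in a protein sequence (one-letter codes).
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid types

def aa_composition(sequence):
    counts = Counter(sequence)
    total = len(sequence)
    # 20-dimensional feature vector of relative frequencies
    return [counts.get(aa, 0) / total for aa in AMINO_ACIDS]
```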
Non-graphlet network-based feature. Here, we use a feature that was shown to outperform many other non-graphlet network-based features in an unsupervised protein comparison task [5]. We denote this feature by Existing-all. Given a PSN, Existing-all calculates and integrates seven different network features: average degree, average distance, maximum distance, average closeness centrality, average clustering coefficient, intra-hub connectivity, and assortativity.
Graphlet network-based features. We use nine such features.

Graphlet counts. We use two graphlet-based protein features, Graphlet-3-4 and Graphlet-3-5. Given a PSN, Graphlet-3-4 and Graphlet-3-5 count the occurrences of all 3-4-node and 3-5-node graphlets, respectively. In particular, in the Graphlet-3-4 or Graphlet-3-5 feature vector, position i represents the count of graphlets of type i. There are eight 3-4-node and 28 3-5-node graphlet types [5].

Normalized graphlet counts. Since PSNs can be of very different sizes, we use two recent protein features that are based on normalized graphlet counts and that thus account for network size differences [5]. These features are NormGraphlet-3-4 and NormGraphlet-3-5; they are the normalized equivalents of Graphlet-3-4 and Graphlet-3-5, respectively. In particular, given a PSN, in both the NormGraphlet-3-4 and NormGraphlet-3-5 feature vectors, position i represents the total count of graphlets of type i divided by the sum of the counts over all graphlet types.
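Graphlet counting itself requires an external enumeration tool, as in the GRAFENE study [5]. Assuming a raw Graphlet-3-4 or Graphlet-3-5 count vector is already available, the normalization step amounts to the following sketch (the normalize_graphlet_counts helper is hypothetical).

```python
# Sketch of graphlet-count normalization: divide each graphlet type's count by
# the total count over all graphlet types, so that PSNs of different sizes
# become comparable.
import numpy as np

def normalize_graphlet_counts(raw_counts):
    # raw_counts: e.g., the length-8 Graphlet-3-4 or length-28 Graphlet-3-5
    # vector produced by an external graphlet counter (not shown here).
    raw_counts = np.asarray(raw_counts, dtype=float)
    total = raw_counts.sum()
    return raw_counts / total if total > 0 else raw_counts
```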

Ordered graphlet counts. Graphlets capture 3D structural but not sequence information. To integrate the two, ordered graphlets were proposed [17]. These are graphlets whose nodes acquire a relative ordering based on the positions of the amino acids in the sequence. Two ordered graphlet features exist: OrderedGraphlet-3 and OrderedGraphlet-3-4 [17, 5]. There are four 3-node and 42 3-4-node ordered graphlet types [5]. For a PSN, in the OrderedGraphlet-3 and OrderedGraphlet-3-4 feature vectors, position i is the total count of ordered graphlets of type i.

In addition, we use two features that are based on normalized counts of ordered graphlets [5]: NormOrderedGraphlet-3 and NormOrderedGraphlet-3-4; these are the normalized equivalents of OrderedGraphlet-3 and OrderedGraphlet-3-4, respectively. For a PSN, in the NormOrderedGraphlet-3 and NormOrderedGraphlet-3-4 feature vectors, position i is the total count of ordered graphlets of type i divided by the total count of all ordered graphlet types.

Although ordered graphlets capture the relative positions of amino acids, they fail to capture how far apart the amino acids are in the protein sequence. Amino acids that are spatially close despite being far apart in the sequence can be more informative than amino acids that are spatially close simply because they are close in the sequence. So, a feature called NormOrderedGraphlet-3-4(K) was proposed [5]. Unlike NormOrderedGraphlet-3-4, NormOrderedGraphlet-3-4(K) counts an ordered graphlet only if every pair of amino acids that is linked by an edge is at least K positions apart in the sequence [5].

We integrate all 11 features into a new Combined feature.
Principal component analysis (PCA)-transformed features. Recently, PCA transformation of protein features, in order to better capture their (dis)similarity, was proposed [5]. Here, we perform the same PCA transformation. For a given PSN set, for each of the above-mentioned 11 protein features, we apply PCA to obtain a new PCA-transformed feature. We pick the first k principal components, such that k is at least two and otherwise as low as possible while still accounting for at least 90% of the variation in the data set. Also, we combine the 11 post-PCA protein features into a new post-PCA Combined feature.
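One possible realization of this PCA step is sketched below with scikit-learn; the pca_transform helper and its default arguments are illustrative, not the authors' code.

```python
# Sketch of the PCA transformation: keep the smallest number of principal
# components (but at least two) that explains >= 90% of the variance.
import numpy as np
from sklearn.decomposition import PCA

def pca_transform(X, variance=0.90, min_components=2):
    # X: (n_proteins x n_dimensions) feature matrix for one PSN set
    full = PCA().fit(X)
    cumulative = np.cumsum(full.explained_variance_ratio_)
    k = int(np.searchsorted(cumulative, variance)) + 1   # smallest k reaching the threshold
    k = min(max(k, min_components), len(cumulative))     # but never fewer than min_components
    return PCA(n_components=k).fit_transform(X)
```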
Weighted network-based feature. We use a weighted adjacency matrix, or distance matrix [23], of a 3D protein structure as a weighted PSN-based feature representation. In particular, given a protein of length n, we define a weighted adjacency matrix of size n × n, in which each position (i, j) contains the minimum 3D spatial distance between amino acids i and j, where the minimum is taken over all pairwise distances between any heavy atoms of i and j.
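For concreteness, a sketch of computing this weighted adjacency (distance) matrix from per-residue heavy-atom coordinates follows; as in the PSN-construction sketch above, residue_coords and the distance_matrix helper are assumptions of this illustration.

```python
# Sketch of the weighted PSN representation: an n x n matrix whose (i, j) entry
# is the minimum 3D distance between any heavy atoms of residues i and j.
import numpy as np
from scipy.spatial.distance import cdist

def distance_matrix(residue_coords):
    n = len(residue_coords)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = cdist(residue_coords[i], residue_coords[j]).min()
    return D
```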

2.2.2 The logistic regression (LR) framework

For each of the 35 PSN sets, we train an LR classifier corresponding to each of the 22 different protein features. Hence, for each of the PSN sets, we get 22 different trained LR classifiers. In each of the classifiers, the input is a feature representation of a protein and the output is the structural class to which the protein belongs. Given the $N$ data points of a data set with input feature vectors of size $m$, i.e., $x_i \in \mathbb{R}^m$, we have a weight vector $w$ and a feature vector $x$, both of size $m$, and the LR classifier takes the form:

$$h_w(x) = \frac{1}{1 + e^{-w^{T}x}} \qquad (1)$$

where $w^{T}x$ is the inner product of the vectors $w$ and $x$. The training procedure determines the entries of the vector $w$ by minimizing a cost function that depends on $y_i$, the correct class label for the feature vector $x_i$ of each data point. Equation (1), with the optimal parameters $w$ and for a given class, produces a value in $[0,1]$ that represents the probability of the data point belonging to that class. The $L_2$-Ridge regularized loss function to be minimized with respect to $w$ is:

$$\min_{w}\; \frac{1}{2}w^{T}w + C\sum_{i=1}^{N}\log\!\left(1 + e^{-y_i w^{T}x_i}\right) \qquad (2)$$

An unconstrained optimization algorithm minimizes the loss function in (2), where $C > 0$ is a regularization parameter, and finds the optimal parameters. The performance of the LR classifier is reported based on k-fold cross-validation of the trained classifier.
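Since our implementation is built on top of LIBLINEAR [27], a roughly equivalent sketch using scikit-learn's LIBLINEAR-backed LogisticRegression is shown below; the evaluate_lr helper and the values of k and C are illustrative assumptions, not the exact settings used in our experiments.

```python
# Sketch of one per-feature LR classifier evaluated with k-fold cross-validation.
# With penalty="l2" and the "liblinear" solver, scikit-learn minimizes an
# L2-regularized logistic loss of the same form as Eq. (2).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def evaluate_lr(X, y, k=5, C=1.0):
    lr = LogisticRegression(penalty="l2", C=C, solver="liblinear")
    return cross_val_score(lr, X, y, cv=k, scoring="accuracy").mean()

# Per-class likelihoods, needed later by the EL framework, are available via
# LogisticRegression(...).fit(X_train, y_train).predict_proba(X_test).
```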

2.2.3 The ensemble learning (EL) framework

Different protein features provide complementary information about the 3D structure of proteins. Therefore, one expects the classifiers trained on the individual feature vectors to be diverse, and the combination of their decisions to yield high performance for supervised classification of 3D protein structures. Consequently, we propose a hierarchical learning architecture that combines the outputs of the different LR classifiers in an EL framework. In particular, we train two such EL classifiers: one that combines all 11 LR classifiers based on the pre-PCA protein features (Combined), and one that combines all 11 LR classifiers based on the post-PCA protein features (post-PCA Combined). Hence, for the Combined framework the input is all 11 pre-PCA protein features, and for the post-PCA Combined framework the input is all 11 post-PCA protein features. For each framework, given a protein, the output is the structural class the protein belongs to. Our approach is supervised, meaning that we utilize the data points' labels in the training phase to predict the class labels of the test data points in the testing phase. Our framework provides a fast and efficient decision-making process that ensembles the information obtained from the LR classifiers, each trained on an individual feature vector. We show that our EL framework outperforms all other considered classifiers and provides a powerful supervised classification platform. Our EL architecture consists of two levels:

  1. At the first level, a set of LR classifiers is trained, one LR classifier per feature vector. As described before, each feature vector, depending on its definition, represents a specific property of the protein structure, and the feature vectors are the inputs to the level-one classifiers. We perform k-fold cross-validation in order to report the performance of our EL classifier: we divide the input data points into a primary testing fold and a primary training fold. Within the primary training fold, the data points are further partitioned into secondary training and secondary testing folds, and the LR classifiers are trained on the secondary training data and tested on the secondary testing data. More specifically, the secondary training data are partitioned into k folds, of which k−1 are used for training and the remaining (k-th) fold is used to test the classifier, in a leave-one-fold-out manner; this way, the decision hypothesis of each classifier is validated on samples that were not used to train it. This process yields, for each feature, a trained classifier that outputs estimates of the posterior probabilities of the held-out data points belonging to each class. By the end of this procedure, each data point in the secondary training data has a class-membership vector of posterior probabilities of size c assigned to it, where c is the number of classes. Thereby, each LR classifier maps its variable-dimensional feature vector (described in the previous section) to a c-dimensional decision space, which trains the LR classifiers to become experts at capturing diverse qualities of the protein structures. Given the N data points of the secondary training data, the resulting vectors of size c provide the input for a specific feature to the second-level classifier, i.e., for each feature vector we obtain an N × c matrix. The final input to the second-level classifier is the ensemble of the posterior probabilities over all considered feature vectors, i.e., an N × (F·c) matrix, where F is the number of features considered.

  2. As mentioned, the class posterior probabilities obtained at the outputs of the LR classifiers are concatenated in a new decision space to form the inputs to the second-level classifier, which is trained for the final decision. At the second level, the membership vector of size F·c for each input data point is produced by concatenating the membership vectors obtained from the base classifiers. The second-level classifier is a support vector machine (SVM) classifier with a linear kernel. We train this classifier using the concatenated membership vectors. To test the final classifier, we first compute the class-membership vectors of the primary test data points using the decision boundaries of the trained first-level classifiers and then feed the results into the second-level SVM classifier. In our experiments, F is 11 for both the pre-PCA-only and the post-PCA-only analyses. The fully trained EL classifier is obtained by training on all of the initial data points, without dividing the data into primary training and testing folds.

SVM is a supervised learning classifier that seeks to construct a hyperplane, or a set of hyperplanes, to separate different classes in a higher-dimensional space by maximizing the margin, i.e., the distance to the nearest training data points of any class; in general, the larger the margin, the lower the generalization error of the classifier. At the second level, we have N data points of size F·c each. The hyperplane separating the classes can be written as the set of points $x$ satisfying $w^{T}x + b = 0$, where $w$ is the normal vector to the hyperplane and $b$ is a bias term; the separating hyperplane has dimension F·c − 1. As an example, for a two-class problem and a data point $x$, the relation given in (3) determines the class membership and the final decision of our EL framework:

$$\hat{y}(x) = \operatorname{sign}\!\left(w^{T}x + b\right) \qquad (3)$$

The $L_2$-Ridge regularized loss function to be minimized with respect to $w$, which yields the soft margins that maximize the distance between the class boundaries, is:

$$\min_{w}\; \frac{1}{2}w^{T}w + C\sum_{i=1}^{N}\max\!\left(0,\, 1 - y_i\left(w^{T}x_i + b\right)\right) \qquad (4)$$

As before, an unconstrained optimization algorithm minimizes the loss function in (4) to find the optimal parameters. Both (2) and (4) are solved using a trust region Newton method [24], and for multi-class data sets we use the multi-class strategy of Crammer and Singer [25]. Further details on the LR and SVM classification methods may be found in [26]. As before, the accuracy of the EL architecture is reported based on k-fold cross-validation of the trained classifiers. We use the LIBLINEAR package in our software implementation of the LR and SVM classifiers [27]. Our results indicate that the different graphlet and other feature vectors provide supplementary and complementary information to the second-level classifier, leading to an efficient EL framework with high supervised classification accuracy.
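To summarize the two-level architecture, a minimal stacking sketch in Python is given below. It follows the scheme described above (out-of-fold LR posterior probabilities concatenated and fed to a linear SVM), but it uses scikit-learn utilities and hypothetical helper names rather than our LIBLINEAR-based implementation.

```python
# Sketch of the two-level EL framework: level 1 produces out-of-fold class
# posterior probabilities from one LR classifier per feature; level 2 trains a
# linear SVM on the concatenated N x (F*c) matrix of posteriors.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

def train_ensemble(feature_matrices, y, k=5):
    # feature_matrices: list of F arrays, each of shape (N, m_f), one per protein feature
    meta_blocks, level1 = [], []
    for X in feature_matrices:
        lr = LogisticRegression(solver="liblinear")
        # out-of-fold posteriors (shape N x c) avoid leaking the training labels
        meta_blocks.append(cross_val_predict(lr, X, y, cv=k, method="predict_proba"))
        level1.append(lr.fit(X, y))          # refit on all data for later use at test time
    meta_X = np.hstack(meta_blocks)          # shape N x (F * c)
    level2 = LinearSVC().fit(meta_X, y)      # final linear-kernel SVM
    return level1, level2

# At test time, the meta-features of a new protein are obtained by concatenating
# level1[f].predict_proba(...) over all F features and passing them to level2.predict(...).
```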

2.2.4 The deep learning (DL) framework

In the second part of our study, we design a DL framework that can detect the dominant features and patterns in the 3D structure of proteins directly from their weighted protein networks (distance matrix representations) and consequently accurately classify unlabeled data points. For each of the 35 PSN data sets, we train a deep neural network classifier. In this framework, the input is the distance matrix representation of a protein and the output is the class of the protein. This supervised framework can predict the class of unlabeled proteins with high precision in any of the data sets defined in Section 2.1. The deep artificial neural network (ANN) is implemented in Python using Google's TensorFlow package [28].

Our DL framework accepts as its input the distance matrix representations in a vector format (the input matrices representing the 3D protein structures are flattened). The distance matrix representations, however, have different sizes, while our DL architecture accepts only input vectors of the same size. To overcome this, we use the zero-padding approach for dealing with vectors of different sizes that is commonly used in the computer vision literature [26]: we equalize the lengths of all flattened distance matrix representations by padding them with zeros. The DL framework consists of an input layer whose size equals that of the padded, flattened distance matrices, seven fully connected hidden layers, and an output layer whose size equals the number of classes in the specific data set under investigation. The general model of the DL framework is defined as follows.

The network computes a function $\hat{y} = f(x;\theta)$ with the DL parameter set $\theta = \{W, B\}$, where $W$ is the collection of weights, $B$ is the collection of biases at each neuron, and $\sigma$ represents the activation function. The output of each neuron may be represented as $y = \sigma\!\left(\sum_{i=1}^{n} w_i x_i + b\right)$, where $n$ and the $x_i$'s indicate the total number of neurons and the neurons' outputs in the previous layer, respectively. Depending on the number of classes $c$ (i.e., the size of the output layer), the last layer produces a collection of values in $[0,1]$, each estimating the probability of the input data point belonging to a particular class. This is achieved by applying the Softmax transformation in the last layer:

$$p_j = \frac{e^{z_j}}{\sum_{k=1}^{c} e^{z_k}}, \qquad j = 1, \ldots, c,$$

where the $z_j$'s represent the signals sent to the neurons of the output layer. Hence, $\sum_{j=1}^{c} p_j = 1$. The final decision is made by assigning the input data point to the class with the highest probability. Consequently, the DL framework is trained to minimize an objective function with $L_2$-Ridge regularization (with regularization parameter $\lambda$), which adds stability and robustness to the learning process:

$$\min_{\theta}\; L(\theta) + \frac{\lambda}{2}\sum_{w \in W} w^{2} \qquad (5)$$

The training process is initialized using the "Xavier" weight initialization rule given in [29]. The loss function $L(\theta)$ in (5) takes the form of the cross-entropy:

$$L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{c} y_{ij}\,\log \hat{y}_{ij} \qquad (6)$$

where $N$ is the number of training data points, and $y_{ij}$ and $\hat{y}_{ij}$ are the true and the predicted class labels for a particular input: $y_{ij}$ equals 1 only if data point $i$ belongs to class $j$ (and 0 otherwise), and $\hat{y}_{ij}$ is the output probability that data point $i$ belongs to class $j$. The objective function is minimized using the Adam algorithm, which improves upon classical stochastic gradient descent by being both more computationally efficient and more robust to noise [30]. For the performance analysis of our DL architecture, for a given data set, the data points are randomly divided into a training partition and a test partition. The performance of the DL architecture is reported as the accuracy of the trained framework in correctly classifying the unlabeled test data points (equivalently, one minus the error rate on the test data).
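A compact Keras/TensorFlow sketch of the described pipeline (zero-padding of the flattened distance matrices, seven fully connected hidden layers, a softmax output, cross-entropy loss with L2 regularization, and the Adam optimizer) is given below. The hidden-layer sizes, regularization strength, and training settings are placeholders rather than the exact values used in our experiments, and the build_inputs/build_model helpers are hypothetical.

```python
# Sketch of the DL classifier on flattened, zero-padded distance matrices.
# matrices: list of n x n NumPy distance matrices; labels: integer class ids.
import numpy as np
import tensorflow as tf

def build_inputs(matrices):
    flat = [m.flatten() for m in matrices]
    max_len = max(len(v) for v in flat)
    # zero-pad all flattened matrices to a common length
    return np.stack([np.pad(v, (0, max_len - len(v))) for v in flat])

def build_model(input_dim, n_classes, hidden_sizes=(1024, 512, 256, 128, 64, 32, 16)):
    reg = tf.keras.regularizers.l2(1e-4)          # L2 ("Ridge") weight penalty, as in Eq. (5)
    inputs = tf.keras.Input(shape=(input_dim,))
    x = inputs
    for h in hidden_sizes:                        # seven hidden layers; sizes are placeholders
        # Dense layers use Glorot ("Xavier") initialization by default [29]
        x = tf.keras.layers.Dense(h, activation="relu", kernel_regularizer=reg)(x)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",                           # Adam optimizer [30]
                  loss="sparse_categorical_crossentropy",     # cross-entropy, as in Eq. (6)
                  metrics=["accuracy"])
    return model

# Example usage: X = build_inputs(matrices); y = np.asarray(labels)
# model = build_model(X.shape[1], n_classes=len(set(labels)))
# model.fit(X_train, y_train, epochs=50, validation_data=(X_test, y_test))
```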

3 Results and discussion

Throughout this section, unless stated otherwise, we analyze all 35 considered PSN sets that span all four levels (groups) of CATH and SCOP hierarchies (Section 2.1). For each considered method, we report its accuracy as well as running time. In Section 3.1, we compare the different graphlet features under the LR classifier (Section 2.2.2) to identify the best one(s) for further analyses. In Section 3.2, we examine whether the PCA transformation of the best graphlet feature(s), as well as of the existing baseline AAComposition sequence and Existing-all non-graphlet network features (Section 2.2.1), can improve classification accuracy compared to the corresponding pre-PCA features, all under the LR classifier. Here, we leave out from consideration the existing SVMfold sequence approach [6], because we were unable to apply this approach to all 35 PSN sets due to its extremely high time complexity. Instead, we consider the SVMfold later on, in a smaller-scope analysis of two of the 35 PSN sets (see below). In Section 3.3, we evaluate whether integration of the considered graphlet features, AAComposition, and Existing-all via EL framework (Section 2.2.3) improves compared to using the individual features under the LR classifier. In Section 3.4, we compare the performance of the sophisticated graphlet-based PSC approaches that deal with unweighted PSNs (Section 2.2.1) to the simple weighted PSN-based feature classification via deep learning (Section 2.2.4). In Section 3.5, we analyze two representative PSN sets on which SVMfold could be run, to compare our proposed approaches to this state-of-the-art existing PSC approach.

3.1 Comparison of graphlet features under LR classifier

When we compare all graphlet features under the LR classifier, OrderedGraphlet-3-4 is the most accurate of all pre-PCA graphlet features, while NormOrderedGraphlet-3-4 and NormOrderedGraphlet-3-4(K) are the most accurate of all post-PCA graphlet features (Fig. 1). So, for further analyses we keep these three best-performing features.

NormOrderedGraphlet-3-4(K), i.e., adding the long-range constraint, improves the accuracy of NormOrderedGraphlet-3-4. OrderedGraphlet-3-4, i.e., adding node order, improves upon its regular (non-ordered) counterpart. These results are in line with our past work on unsupervised protein comparison [5], even though in our current study the improvement of NormOrderedGraphlet-3-4(K) over NormOrderedGraphlet-3-4 is only marginal. Unlike in our past unsupervised study, in our current study graphlet feature normalization does not improve upon the non-normalized features, and sometimes it actually worsens accuracy.

Figure 1: Accuracy of the 18 pre- and post-PCA graphlet features under the LR classifier, for each of the four hierarchy levels (groups) of CATH, averaged over all PSN sets belonging to the given group (vertical lines are standard deviations). Results are qualitatively similar for SCOP groups as well (Supplementary Fig. S1).

3.2 PCA feature transformation improves PSC accuracy

Here, we consider the following features under the LR classifier: the three top performing graphlet features from Section 3.1, Existing-all, AAComposition, and the integrated Combined feature (Section 2.2.1). When we compare pre- and post-PCA versions of each feature, we find that post-PCA versions are generally more accurate than pre-PCA versions (Fig. 2). Only for Combined, the accuracy is almost tied, with only marginal superiority of its post-PCA version, and for OrderedGraphlet-3-4, its pre-PCA version is superior. The overall benefit of using PCA may be attributed to the post-PCA features having more compact representation, which alleviates the negative effects of the “curse of dimensionality” on classification algorithms.

PCA helps most but not all of the time. So, henceforth, for each feature, we use the best of its pre- and post-PCA versions.

Figure 2: Accuracy of pre- and post-PCA versions of the three top performing graphlet features, Existing-all non-graphlet PSN feature, and AAComposition sequence feature under the LR classifier, plus the integrated Combined feature under the EL classification framework, for PSN groups 3 and 4 of CATH. Results for the other groups of CATH and all groups of SCOP are qualitatively similar (Supplementary Fig. S2 and Fig. S3). Results are averaged over all PSN sets in the given group (horizontal and vertical lines are standard deviations).

3.3 Feature integration via EL improves PSC accuracy

Feature integration under the EL framework (Combined) statistically significantly enhances the classification accuracy compared to training any feature individually under the LR classifier (Figs. 3 and 4 and Tables 1 and 2). Hence, it is likely that Combined efficiently utilizes the complementary information from the different individual features.

However, Combined has a larger running time than the individual feature approaches (Fig. 3). Combined's much larger running time compared to Existing-all, AAComposition, and NormOrderedGraphlet-3-4 might be justified by its much higher accuracy compared to these features. Moreover, Combined's only slightly larger running time compared to NormOrderedGraphlet-3-4(K) might be justified by its much higher accuracy compared to the latter. However, Combined's much larger running time compared to OrderedGraphlet-3-4 might not be justified, because it has only somewhat higher accuracy than the latter, especially for the lower-level CATH/SCOP classes (Fig. 3 and Tables 1 and 2).

3.4 Weighted network-based DL classification performs well compared to unweighted graphlet classification

Our proposed DL classifier performs quite well in terms of accuracy (Fig. 3 and Tables 1 and 2). Specifically, it is significantly superior to AAComposition and Existing-all, and it is comparable to two of the three top performing graphlet features; only OrderedGraphlet-3-4 and Combined are significantly better than DL (Fig. 4). Yet, compared to Combined, DL is sometimes much faster (Fig. 3(a)).

These results mean that the DL framework can automatically detect and learn meaningful weighted network features. Importantly, unlike the other individual (LR) or integrative (EL) classifiers that make use of highly sophisticated unweighted network information such as graphlet features, the DL framework utilizes only as simple as possible weighted network information (i.e., weighted adjacency matrix of a network) as its input. This points to a promise of future algorithmic developments for dealing with weighted networks, perhaps even designing weighted graphlet features.

Figure 3: Accuracy versus running time of the approaches from Fig. 2 plus deep learning (DL), for groups 4 of CATH and SCOP. Results are qualitatively similar for all other groups of CATH and SCOP (Supplementary Fig. S4). For each method except DL, the best of its pre- and post-PCA versions is chosen (DL does not have this option). If the latter is selected, “*” is shown next to the given method’s name.
Figure 4: Statistical significance of the accuracy differences of the approaches from Fig. 3, calculated by comparing the approaches' accuracy scores over all 35 PSN sets using the paired t-test.
Approach                        Group 1   Group 2   Group 3   Group 4
OrderedGraphlet-3-4               83.71     75.97     92.19     94.21
NormOrderedGraphlet-3-4*          76.69     65.81     83.80     89.27
NormOrderedGraphlet-3-4(K)*       79.09     66.98     84.92     89.88
Existing-all*                     70.41     44.13     54.27     65.39
AAComposition*                    62.88     50.29     70.70     76.56
Combined*                         98.67     91.37     97.81     99.23
Deep Learning                     89.00     76.00     78.22     86.17
Table 1: Accuracy of the approaches from Fig. 3 for each CATH group.
Approach                        Group 1   Group 2   Group 3   Group 4
OrderedGraphlet-3-4               63.80     82.94     89.25     95.46
NormOrderedGraphlet-3-4*          55.35     60.41     82.54     89.42
NormOrderedGraphlet-3-4(K)*       54.88     68.31     86.87     91.54
Existing-all*                     45.63     33.70     67.90     70.96
AAComposition*                    44.78     50.68     79.67     75.71
Combined                          93.33     88.00     92.48     95.75
Deep Learning                     71.00     64.80     83.16     85.25
Table 2: Accuracy of the approaches from Fig. 3 for each SCOP group.

3.5 Our approaches improve upon state-of-the-art SVMfold

We can compare to the state-of-the-art SVMfold approach only for two representative PSN sets out of all 35 PSN sets, because of the extremely high running time of SVMfold (Table 3). Specifically, we choose CATH-3.20.20 and CATH-3.40.50 from group 4 of the CATH data as the representative PSN sets, for the following reason. These two PSN sets correspond to the fourth level of the CATH hierarchy, i.e., to structural classes that are as specific as possible, which are the most relevant for applied biochemistry scientists. Also, of all fourth-level PSN sets, these two are the ones in which our proposed approaches perform the best (CATH-3.40.50), which gives our approaches the best-case advantage over SVMfold, and the worst (CATH-3.20.20), which gives SVMfold the best-case advantage over our approaches.

Approach                        CATH-3.20.20   CATH-3.40.50
OrderedGraphlet-3-4                    31.36          15.80
NormOrderedGraphlet-3-4*               30.84          15.60
NormOrderedGraphlet-3-4(K)*           398.65         200.57
Existing-all*                           6.42           3.26
AAComposition*                          0.06           0.07
Combined*                             437.77         220.56
Deep Learning                         143.40         121
SVMfold                             79365.36       29859.46
Table 3: Running times (in minutes) of the approaches from Fig. 3 plus SVMfold, for CATH-3.20.20 and CATH-3.40.50 PSN sets. Due to SVMfold’s large time, we could not evaluate it on additional PSN sets.
Approach                        CATH-3.20.20   CATH-3.40.50
OrderedGraphlet-3-4                    85.40          97.73
NormOrderedGraphlet-3-4*               77.20          96.36
NormOrderedGraphlet-3-4(K)*            76             98.18
Existing-all*                          37             63.18
AAComposition*                         71.20          80
Combined*                              98.20         100
Deep Learning                          89             72
SVMfold                                90.67          99.54
Table 4: Accuracy of the approaches from Fig. 3 plus SVMfold, for CATH-3.20.20 and CATH-3.40.50 PSN sets. Due to SVMfold’s large time (Table 3), we could not evaluate it on additional PSN sets.

SVMfold has a high running time because it needs to extract three sets of very comprehensive features from protein sequence information. This complex information retrieval process needs to be performed for each protein in the considered PSN set, which becomes infeasible when analyzing large PSN sets containing many proteins (such as those at the higher levels of the CATH/SCOP hierarchies) or many PSN sets.

Furthermore, we note that one of SVMfold's three feature sets actually includes structural class information. That is, SVMfold does not just use class information as labels for training purposes; it also uses this information in the features. We argue that because of this, SVMfold rests on an invalid circular argument, which artificially inflates its accuracy (the label that is to be predicted is already included in the feature based on which the prediction is to be made). This also implies that SVMfold cannot be applied to the task of classifying proteins with unknown labels, which is the ultimate goal of PSC.

Despite this bias and unfair advantage of SVMfold, our best approach, Combined, outperforms SVMfold. Also, our individual graphlet features under the LR classifier and our DL approach are comparable to SVMfold in terms of accuracy (DL on CATH-3.20.20, or all graphlet approaches on CATH-3.40.50; Table 4) at a fraction of SVMfold's running time (Table 3).

4 Conclusion

In a comprehensive evaluation, we demonstrate the power of unweighted graphlet-based PSC, weighted network-based deep learning PSC, and data-integrative PSC. Specifically, the LR classifier trained on the OrderedGraphlet-3-4 feature provides a strong (accurate yet fast) platform for protein classification. Our integrative EL classifier outperforms all of the other considered classifiers in terms of accuracy, including the state-of-the-art SVMfold, though at a higher computational cost than most of the other approaches (except SVMfold, which is by far the slowest of all approaches). However, the much lower running time of the individual LR classifier trained on OrderedGraphlet-3-4 compared to the EL classifier suggests that when running time efficiency is required, one should utilize this graphlet-based approach, which still achieves high accuracy. Further, our proposed DL framework, by automatically learning appropriate features from the initial weighted adjacency matrices, yields comparable accuracy. This points to a promising future for algorithms that rely on weighted network-based attributes of protein 3D structures.

References

  • [1] Kasabov, N. K. Springer Handbook of Bio-/Neuro-Informatics (Springer, 2013), 1 edn.
  • [2] Jain, P., Garibaldi, J. M. & Hirst, J. D. Supervised machine learning algorithms for protein structure classification. Computational Biology and Chemistry 33, 216–223 (2009).
  • [3] Greene, L. H. et al. The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Research 35, D291–D297 (2006).
  • [4] Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 247, 536–540 (1995).
  • [5] Faisal, F. E. et al. GRAFENE: Graphlet-based alignment-free network approach integrates 3D structural and sequence (residue order) data to improve protein structural comparison. Scientific Reports 7, 14890 (2017).
  • [6] Xia, J., Peng, Z., Qi, D., Mu, H. & Yang, J. An ensemble approach to protein fold classification by integration of template-based assignment and support vector machine classifier. Bioinformatics 33, 863–870 (2016).
  • [7] Krissinel, E. On the relationship between sequence and structure similarities in proteomics. Bioinformatics 23, 717–723 (2007).
  • [8] Kosloff, M. & Kolodny, R. Sequence-similar, structure-dissimilar protein pairs in the PDB. Proteins: Structure, Function, and Bioinformatics 71, 891–902 (2008).
  • [9] Cui, C. & Liu, Z. Classification of 3d protein based on structure information feature. In BioMedical Engineering and Informatics, 2008. BMEI 2008. International Conference on, vol. 1, 98–101 (IEEE, 2008).
  • [10] Jo, T., Hou, J., Eickholt, J. & Cheng, J. Improving protein fold recognition by deep learning networks. Scientific reports 5, 17573 (2015).
  • [11] Wang, J., Li, Y., Zhang, Y., Tang, N. & Wang, C. Class conditional distance metric for 3d protein structure classification. In Bioinformatics and Biomedical Engineering,(iCBBE) 2011 5th International Conference on, 1–4 (IEEE, 2011).
  • [12] Kalajdziski, S., Mirceva, G., Trivodaliev, K. & Davcev, D. Protein classification by matching 3d structures. In Frontiers in the Convergence of Bioscience and Information Technologies, 2007. FBIT 2007, 147–152 (IEEE, 2007).
  • [13] Pržulj, N., Corneil, D. G. & Jurisica, I. Modeling interactome: scale-free or geometric? Bioinformatics 20, 3508–3515 (2004).
  • [14] Holm, L. & Rosenström, P. Dali server: conservation mapping in 3D. Nucleic Acids Research 38, W545–W549 (2010).
  • [15] Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Research 33, 2302–2309 (2005).
  • [16] Vacic, V., Iakoucheva, L. M., Lonardi, S. & Radivojac, P. Graphlet kernels for prediction of functional residues in protein structures. Journal of Computational Biology 17, 55–72 (2010).
  • [17] Malod-Dognin, N. & Pržulj, N. GR-Align: fast and flexible alignment of protein 3D structures using graphlet degree similarity. Bioinformatics 30, 1259–1265 (2014).
  • [18] Wei, L., Liao, M., Gao, X. & Zou, Q. Enhanced protein fold prediction method through a novel feature extraction technique. IEEE Transactions on NanoBioscience 14, 649–659 (2015).
  • [19] Dai, H.-L. Imbalanced protein data classification using ensemble FTM-SVM. IEEE Transactions on NanoBioscience 14, 350–359 (2015).
  • [20] Lin, C. et al. Hierarchical classification of protein folds using a novel ensemble classifier. PloS one 8, e56499 (2013).
  • [21] Vipsita, S., Shee, B. K. & Rath, S. K. An efficient technique for protein classification using feature extraction by artificial neural networks. In India Conference (INDICON), 2010 Annual IEEE, 1–5 (IEEE, 2010).
  • [22] Melvin, I., Weston, J., Leslie, C. S. & Noble, W. S. Combining classifiers for improved classification of proteins from sequence or structure. BMC Bioinformatics 9, 389 (2008).
  • [23] Holm, L. & Sander, C. Protein structure comparison by alignment of distance matrices. Journal of molecular biology 233, 123–138 (1993).
  • [24] Lin, C.-J., Weng, R. C. & Keerthi, S. S. Trust region Newton methods for large-scale logistic regression. In Proceedings of the 24th International Conference on Machine Learning, 561–568 (ACM, 2007).
  • [25] Keerthi, S. S., Sundararajan, S., Chang, K.-W., Hsieh, C.-J. & Lin, C.-J. A sequential dual method for large scale multi-class linear SVMs. In Proceedings of the 14th ACM International Conference on Knowledge Discovery and Data Mining, 408–416 (ACM, 2008).
  • [26] Anzai, Y. Pattern Recognition and Machine Learning (Elsevier, 2012).
  • [27] Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R. & Lin, C.-J. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9, 1871–1874 (2008).
  • [28] Abadi, M. et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016).
  • [29] Glorot, X. & Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256 (2010).
  • [30] Kingma, D. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

Acknowledgements

This work was supported by National Institutes of Health (NIH) grant 1R01GM120733.
