Generalizable Protein Interface Prediction with End-to-End Learning
Predicting how proteins interact with one another—that is, which surfaces of one protein bind to which surfaces of another protein—is a central problem in biology. Here we present Siamese Atomic Surfacelet Network (SASNet), the first end-to-end learning method for protein interface prediction. Despite using only spatial coordinates and identities of atoms as inputs, SASNet outperforms state-of-the-art methods that rely on complex, hand-selected features. These results are particularly striking because we train the method entirely on a significantly biased data set that does not account for the fact that proteins deform when binding to one another. Nonetheless, our network maintains high performance, without retraining, when tested on real cases in which proteins do deform. This suggests that it has learned fundamental properties of protein structure and dynamics, which has important implications for a variety of key problems related to biomolecular structure.
citehyperref\DeclareFieldAliasbibhyperrefnoformat\bibhyperref#1 \DeclareCiteCommand\parencite \usebibmacroprenote \usebibmacrociteindex[\printtext[citehyperref]\usebibmacrocite] \multicitedelim \usebibmacropostnote
Proteins are large molecules that carry out almost every function in the cell. Their function depends critically on their ability to bind to one another in specific ways, forming larger machines known as protein complexes. Experimentally solving the structure of protein complexes remains expensive, slow, and difficult. As a result, the space of possible complexes remains vastly under-explored: as of 2017, major databases such as Interactome3D \parenciteMosca2012c contain a total of approximately 12,000 complexes whose structures have been experimentally determined, while there are estimated to be 650,000 such interactions in humans alone \parenciteStumpf2008.
Here, we focus on the problem of computational protein interface prediction: given the structures of two proteins (determined experimentally or modeled computationally), we wish to predict which surfaces of the two proteins will come into contact upon binding (Figure 1). Such methods are of great interest as they can help address many related biomedically-motivated problems. Despite notable successes in using proteins, and especially antibodies, in treating diseases such as multiple sclerosis and cancer, the development of protein therapies for certain diseases remains challenging, often due to the need to find proteins that bind to specific interfaces. Thus, prediction of protein interfaces between candidate therapeutics and target molecules could accelerate the discovery of new medicine. Other applications of interface prediction include identification of interactions that small molecule drugs could disrupt, and affinity prediction, where the objective is to assess the strength of a given interaction. Predictors with sufficiently high performance could even be used to design entirely new proteins that interact in desired ways with other proteins, a particularly hard task in the field of structural biology.
In principle, one could solve the protein interface prediction problem using physics-based methods, but in practice this has proven extremely difficult. In a recent benchmark \parenciteVreven2015a, state-of-the-art protein-protein docking software based on physical models yielded a top prediction that was remotely close to the correct answer with very low frequency (only about 15% of the time). Computational power is also several orders of magnitude away from being able to adequately model protein binding using simulations based on classical mechanics. Thus, there is a high degree of interest in predicting protein interfaces through data-driven methods.
In this work, we demonstrate SASNet, the first end-to-end learning method applied to interface prediction. The method is end-to-end because instead of relying on hand-engineered, high-level features, we work at the atomic level and record only atoms’ positions and their element types (i.e., carbon, nitrogen, oxygen, or sulfur). To predict whether an amino acid in one protein interacts with an amino acid in another protein, we voxelize the local atomic environments, or "surfacelets," surrounding each of them and then apply a siamese-like three-dimensional convolutional neural network (CNN) to the resulting grids.
One of the primary reasons why predicting interfaces is so difficult is that proteins change shape upon binding at multiple spatial scales (Figure 1), such that the interacting surfaces will generally look very different when the proteins are bound versus when they are not \parenciteKuroda2016. Thus, when examining two protein surfaces, we wish to determine not only whether they could bind in their current shape (conformation), but whether they have the potential to bind by adopting other conformations that are accessible to them.
To complicate matters further, the number of cases for which we have experimental structures of both the final complex as well as for each protein on its own is small, with a relatively comprehensive dataset \parenciteVreven2015a containing approximately 300 such protein interactions with 21,000 neighboring amino acids. Methods that use high level featurizations have relied on such datasets for training and testing, but their limited size presents a challenge for an end-to-end learning approach that begins with atom positions. We have found, however, that we can mine the Protein Data Bank (PDB) \parenciteBerman2000 to construct a cleaned dataset containing over 44,828 binary protein interactions with only the final complex solved experimentally. This leads to over five million neighboring amino acids, a potential increase of two orders of magnitude in the size of the training set if one is willing to accept the noise associated with the unknown deformation of the protein upon binding. We refer to the former dataset as the complex-unbound set, and the latter as the complex-only set.
We train our end-to-end model on a small subset of our complex-only set, without accounting for the fact the proteins deform upon binding. Notably, when tested on real proteins that do deform upon binding, our method significantly outperforms state-of-the-art methods that exploit hand-engineered features and are trained directly on complex-unbound data. We also leave open the door to substantially more performance improvements not available to competing models, as we only end up training on approximately 3% of our complex-only set (due to computational limitations), whereas standard models are already using most of the complex-unbound data available to them. Finally, we demonstrate that the drop in performance between testing on already formed complexes versus the unbound proteins is minimal.
This is especially exciting as these exact configurations in the complex-only set represent a small and rigid subset of the possible configurations that can be adopted when the individual proteins are not bound to one another. Essentially, protein interfaces must take on a specific configuration upon binding in order to fit together in an energetically favorable manner, whereas they have much more flexibility when not bound (i.e. the atoms are not as restricted to particular positions). This lack of a performance hit indicates that the model has not simply memorized the provided examples, but instead has learned a generalizable representation of the inherent protein structure and dynamics. We argue that the convolutional neural network formulation, while still remaining very flexible, has the appropriate form to be able to exploit this fundamental structure, likely due to its high degree regularity and spatial hierarchy.
2 Related Work
Though comparisons between methods is difficult due to the lack of standardized datasets or metrics, \parenciteFout2017b stands out as having particularly high performance. In their work they make use of graph convolutions built on top of hand-designed structural features and trained only on the complex-unbound set. Other high performing methods are \parenciteJordan2012b and \parenciteHwang2015a who also use high-level structural features to predict interfacial residues, but in a non-partner-specific manner – given a single protein, the task is to predict which of its amino acids are likely to form an interface with any protein. \parenciteXue2015 has determined that partner-specific interface predictors yield much higher performance.
Sequence conservation across species is another source of information for addressing the interface prediction problem. The basic idea is that the portions of the protein that are interfacial are typically highly constrained in how they can evolve, as too much variability can interrupt interactions that might be vital to the protein’s function. For example, \parenciteAhmad2011 uses neural networks trained on such features. Given that all these interfaces are derived from the physics of actual three-dimensional interactions, the relegation of structure to a hidden and unmodeled variable leads to limitations of these approaches. The general consensus in the field is that the performance of purely sequence-based methods is hitting their limit \parenciteEsmaielbeiki2015a. Though we primarily focus on structure-based prediction, we also run certain experiments where we integrate sequence conservation information.
Interface prediction is also of importance to protein-protein docking, a computational technique used to predict the three-dimensional structure of a complex from its individual proteins. There are a wide variety of methods that have been proposed \parenciteDominguez2003, \parenciteTorchala2013a, \parenciteChen2003, \parenciteRezacova2008 which regularly compete in standardized docking assessments \parenciteJanin2003. As noted previously, docking software currently achieves low accuracy \parenciteVreven2015a. The lack of robust interface predictors for ranking candidate complexes has been identified as one of the primary issues preventing better performance \parenciteBonvin2006.
Taking a step back from protein interactions, there has also been significant interest in applying machine learning methods to other tasks involving proteins, small drug-like molecules, DNA, and RNA. Yet the generalizability of these methods to new and unseen tasks has typically not been investigated and is unsuccessful in cases where it is considered at all \parenciteRamsundar2015. Graph-based approaches have been used for deriving properties of small molecules \parenciteKearnes2016a \parenciteDuvenaud2015c. \parenciteGilmer2017a used such networks for quantum mechanical calculations. Another common representation for quantum mechanical calculations is based on \parenciteBehler2007a’s symmetry functions which use manually determined gaussian basis functions, as used in \parenciteSchutt2017c \parenciteSmith2017b. \parenciteGomes2017a uses the symmetry functions for protein-ligand binding affinity prediction. Instead of building in invariances \parenciteZhang2017c canonicalizes the coordinate frame for each atom as well as the ordering of neighbors and trains a fully connected neural network on the result to predict force field potentials and forces. Finally, 3D convolutional networks have been used for protein-ligand binding affinity \parenciteRagoza2016c \parenciteWallach2015c \parenciteJimenez2017a, as well as for protein fold prediction \parenciteDerevyanko2018b, and for filling in missing amino acids \parenciteTorng2017a.
|Dataset||# Binary Complexes||# Amino Acid Interactions|
We construct two separate datasets for training and testing our method. The first is our gold standard dataset which we use for testing performance. It comprises the 230 protein complexes in the Docking Benchmark 5 (DB5) dataset \parenciteVreven2015a. Interfacial amino acids are defined based on the final complex, but the 3D structures used are derived from the individual unbound proteins. The data distribution therefore closely matches that which we would see when predicting interfaces on new examples, which will be provided in their unbound states. Additionally, the range of difficulty and of interaction types in this dataset (e.g. enzyme-inhibitor, antibody-antigen) provides us with good coverage of typical test cases we might see in the wild. We sometimes refer to previous versions of DB5, such as DB3 and DB4, with DB3 DB4 DB5. DB4U and DB5U refer to portions added in the 4th and 5th updates, respectively (so for example DB3 DB4U = DB4).
To construct our complex-only protein interface dataset for training, we start by mining the PDB for pairs of interacting protein subunits (Figure 2A). The PDB contains data of varying quality, so we remove all complexes matching the following criteria: less than 500Å2 buried surface area, worse than 3.5Å resolution, solved via nuclear magnetic resonance, containing any protein of length less than 50 amino acids, and any model beyond the first one in a structure. Furthermore, DB5 is initially derived from the PDB, so we must take care that there is no complex-level cross-contamination between our train and test sets. We therefore employ rigorous sequence-based pruning to ensure there is little overlap between our tasks. Specifically, we exclude any complex that has any individual protein with over 30% sequence identity when aligned to any protein in our DB5 gold dataset. The initial processing as well as the DB5 exclusion yields a dataset of 44,828 binary complexes.
For both of these datasets, once these binary protein complexes are generated, we identify all interacting pairs of amino acids. A pair of amino acids — one from each protein — is determined to be interacting if any of their non-hydrogen atoms (hydrogen atoms are typically not observed in experimental structures) are within 6Å of one another (Figure 2B). We consider each of these pairs as a positive example of interacting surface, leading to a total of over 5 million pairs of positives for the PDB dataset (Figure 2C, see Table 1 for exact counts). For the negatives, we select random pairs of non-interacting amino acids spanning the same protein complexes, ensuring that each complex provides an equal number of positives and negatives (Figure 2D).
As noted previously, the structures in the PDB datasets are already in their bound state. Typical methods to solve the prediction problem are trained on much smaller datasets containing unbound proteins, such as DB5. As mentioned previously, a key point of this work is that we leverage the much larger PDB dataset to solve the problem of protein prediction on the DB5 dataset. The transferability between these two problems is by no means obvious. For example, the PDB dataset has a much higher degree of shape complementarity than the DB5 dataset, as the former comprises pairs of proteins that are all in the correct configuration to interact with each other.
Since sequence conservation features are another important and complementary source of information for protein interface prediction, we also compute position-specific similarity matrices (PSSMs) for the DB5 dataset. We use the blastp program \parenciteAltschul1990 on the nonredundant database of protein sequences \parenciteMcGinnis2004. We have not computed these features for the PDB dataset due to the computational expense. As our structure+conservation models require training on PSSM data, we train directly on DB3, validate on DB4U, and test on DB5U.
Due to the hierarchical and regular structure of proteins, as well as the wealth of data available for the protein interface prediction, we selected a three-dimensional convolutional neural network as SASNet’s underlying model (Figure 2F). We first focus on how to represent our pairs of amino acids in order to provide them to our network. For each amino acid, we encode its surrounding atomic neighborhood – a region of 3D space centered around its alpha-carbon which we call a "surfacelet." The raw form of a surfacelet is a list of the nearby atom positions, , of size , and a corresponding list of element types , of size . This encodes all local data that is typically provided in a given PDB structure, and thus ensures minimal loss of information.
To create a dense, three-dimensional, and fixed-size representation of the input, we choose to voxelize the space (Figure 2D). For each surfacelet, we lay down a grid centered on the generating alpha carbon of the amino acid, and record in each voxel the presence or absence of a given atom. To ensure that at most one atom can occur in each voxel, while also keeping the input representation from getting too large, we chose a voxel resolution of 1Å. A fourth dimension is used to encode the element type of the atom, using 4 channels for carbon, oxygen, nitrogen, and sulfur, the most commonly found atoms in protein structure. In order to build in a notion of rotational invariance, each training example is randomly rotated, every time it is seen, across the 3 axes of rotation. At test time, we perform 20 random rotations for each example and average the predictions.
We feed the voxelized surfacelets to multiple layers of 3D convolution (Conv3D) followed by batch normalization (BN) and rectified linear units (ReLU). Interspersed through the convolutions are layers of 3D max pooling (MaxPool). We then apply several fully-connected (FC) layers followed by more BNs and ReLUs. As we are working with pairs of surfacelets, we use a siamese-like networks where we employ two such towers with tied weights to build a latent representation of the two surfacelets, and then concatenate the results. An important difference compared to siamese approaches arises from the nature of the task we are predicting. Unlike a classical siamese network, we are not attempting to compute a similarity between two objects. This can be shown by considering the nature of protein interactions: a positively charged protein surface is likely to interact with a negatively charged counter-part, even though the two could be considered very dissimilar. Instead of minimizing euclidean distance between the two latent representation as would be done in a classical siamese network, we append a series of fully connected layers on the concatenation of the two latent representations and optimize the binary cross entropy loss with respect to the original training labels.
To determine the optimal model, we ran a large set of manual hyperparameter searches on a limited subset of the full PDB dataset, created based on selection criteria from \parenciteKirys2015. We vary the number of filters, number of convolutional layers, number of dense layers, grid size, use of maxpooling, batchnorm, dropout, and selected our models based on average performance across three or more replicates on DB4. Surprisingly, most of these parameters had little effect on the overall validation performance, with the notable exceptions being the positive effect of batchnorm, pooling, and especially increasing the overall grid size. Our structural-only model with the best validation performance involved featurizing a grid of edge length 35Å (thus starting at a cube size of 35x35x35), and then applying 4 layers of convolution (with filter sizes 32, 64, 128, and 256) and 2 layers of max pooling, as shown in Figure 2F. A fully-connected layer with 512 parameters lays at the top of each tower, and the outputs of both towers are concatenated and passed through two more fully connected layers with 512 parameters each, leading to the final prediction. The number of filters used in each convolutional layer is doubled every time to allow for an increase of the specificity of the filters as the spatial resolution decreases. We use the RMSProp optimizer with a learning rate of 0.0001. The overall network is designed such that the grid feeding into the first dense layer is of not too great a size to cause memory issues yet is not too small to lose all spatial information. All models are trained across 4 Titan X GPUs using data-level parallelism.
Guided by previous interface predictors’ reliance on sequence conservation features, we also incorporate such features into our model. We featurize sequence conservation by considering fixed length amino acid windows centered on the residue of interest, as in \parenciteFout2017b. To add this sequence featurization into our existing top structural models, we concatenate the final hidden layer of a given structural model with our sequence features and train a simple linear model on top of DB3. Using DB4U as a validation set, we perform a 5-replicate hyperparameter search over both sequence conservation window size and overall grid size, finding that a window size of 5 and a grid size of 27Å yielded best mean validation performance across our replicates.
The combination of a dense featurization, large datasets, and a model that exploits the inherent structure of proteins allows us to outperform state-of-the-art results on standardized and well-curated benchmarks, while making almost no assumptions with respect to the problem of protein interface prediction. Furthermore, we demonstrate the model’s generalizability by comparing performance on the same complexes in their final bound form as well as on their individual proteins in their unbound state. Finally, we observe the model’s scalability, noting that there is significant potential for further performance improvements simply via more engineering effort. All reported models were run across 3 or more replicates.
5.1 Comparison to Existing Interface Prediction Methods
|NGF \parenciteDuvenaud2015c||0.843 (0.851 +/- 0.010)||0.869 (0.875 +/- 0.018)|
|DTNN \parenciteSchutt2017d||0.861 (0.861 +/- 0.004)||0.869 (0.870 +/- 0.003)|
|Node+Edge Average \parenciteFout2017b||0.844 (0.850 +/- 0.004)||0.895 (0.899 +/- 0.005)|
|Order Dependent \parenciteFout2017b||0.857 (0.864 +/- 0.006)||0.896 (0.894 +/- 0.004)|
|Node Average \parenciteFout2017b||0.876 (0.877 +/- 0.005)||0.887 (0.888 +/- 0.004)|
|*SASNet||0.899 (0.890 +/- 0.011)||0.921 (0.914 +/- 0.009)|
We start by evaluating the effectiveness of our structural features by comparing to top existing methods applied to the interface prediction problem, as shown in Table 2’s Structure-Only column. These methods were pulled from the comparison in \parenciteFout2017b and include Deep Tensor Neural Networks (DTNN) from \parenciteSchutt2017d and Neural Graph Fingerprints (NGF) from \parenciteDuvenaud2015c. As our model is built to operate primarily on structure, we decide to isolate the effect of the respective structural representations. We remove sequence conservation features from the compared models and re-run their training procedures. For each model, we select from available hyperparameters by choosing those with the best DB4 performance across replicates. At test time we evaluate on DB5U, splitting the predictions by complex and computing the Area Under the Receiver Operating Characteristic (AUROC) for each one. We then calculate the median of those AUROCs. We refer to this as the median per-Complex AUROC (CAUROC). As our final performance metric we report the mean and standard deviation of CAUROC across all replicates, as well as the CAUROC of the replicate with the best DB4 performance. We also ensemble the two best models as evaluated by DB4 performance, using one layer sigmoid output trained on DB4. All our models demonstrate superior performance to all other methods without the use of any hand-engineered features, and without even directly training on the problem of unbound interface prediction.
Next, we compare performance with sequence conservation features added. We select hyperparameters using the same procedure as for structure-only and report mean, standard deviation, and best DB5U CAUROC. Results are shown in Table 2, Structure+Conservation column, and are similar to those originally reported in \parenciteFout2017b. We see that our structural models, combined via a simple linear predictor with PSSM data, continues to outperform all other methods.
A concern for SASNet is that we are training on protein structures as they appear when bound in complexes, whereas the test cases use structures of unbound proteins. We may overvalue the exact configuration presented in the bound complex at the expense of other potential unbound configurations. For example, final complexes are in their bound state, and therefore the model might incorrectly determine that any clashes between surfacelets would be unacceptable.
We measure the effect of this bound-unbound overfitting by taking our best structural-only hyperparameters and evaluating their replicates’ performance on the bound DB5U set, effectively the same interactions as in regular DB5U but in their final complex form. We compare this result to these replicates’ standard unbound DB5U performance. As shown in Table 3, though we do note a small decrease in mean CAUROC from 0.919 to 0.890, the marginal drop in performance indicates that the overfitting to exact bound configurations is not overly detrimental to our protein prediction performance. Furthermore, we show that the replicates from our best structure+conservation model results in even less overfitting, with the bound-unbound gap diminishing to 0.923 versus 0.914.
|Structure-Only||0.919 +/- 0.008||0.890 +/- 0.011|
|Structure+Conservation||0.923 +/- 0.006||0.914 +/- 0.009|
5.3 Dataset and Grid Size Scaling
Given the expense of running 3D convolutions, our best models are limited to fairly small dataset sizes, even though we have a very large dataset available to us. We additionally have to carefully consider the size of the grids due to the cubic relationship between edge size and the total number of voxels. As these are problems that are surmountable through enough engineering effort, we are interested in assessing the potential benefits of scaling up. We do so by running a range of smaller tests, both in terms of the size of the grids used, as well as the number of surfacelet pairs that we train on. We run 5 replicates for each condition and plot the average and standard deviation of the CAUROC across replicates.
Figure 3A shows the results of the grid size scaling tests. We notice consistent performance improvements up to a grid edge size of 27Å, with performance increases becoming noisier and mostly tapering off afterwards. We do note that extra signal is still being gained at very large sizes such as 41Å, implying a long-range contribution of protein interaction signals. In Figure 3B, we see that the dataset size tests yield consistently increasing performance, reflecting the high degree of scalability of our model, and implying further performance gains could be obtained with larger dataset sizes.
In this work we introduced the first end-to-end learning framework to predict protein interfaces and demonstrated the flexibility and power of SASNet. We surpass current state-of-the-art results on the interface prediction problem while only training on complexes already in their bound configuration, without using any sort of expert feature identification. The transferability and generalizability of the learned features are particularly interesting, as we can envision solving many data-poor problems in structural biology through training on much larger tangentially-related datasets. Much like an enhanced version of pre-training on ImageNet for computer vision tasks, we could employ pretrained models in structural biology that have already learned to encode many of the patterns present in biomolecular structures, allowing us to solve new tasks with minimal to no amount of re-adaptation necessary. Additionally, the small number of assumptions made suggests the model is generalizable to a range of structural biology problems, with potential applications to relatively data-poor problems such as protein design and mutagenesis.
One hypothesis as to why SASNet’s CNNs are able to generalize so well on this problem is that proteins are highly spatially hierarchical and regular in nature, as well as all being governed by the same underlying laws of physics, making them an ideal fit for the stacked convolutional framework. Though these properties are well understood at the lowest levels (only 22 amino acids are genetically encoded, and each of these amino acids has a fixed atomic composition), the definitions become less precise as we move up the hierarchy. Amino acids often form common secondary structure elements such as alpha-helices and beta-sheets. At a higher level, parts of the protein sequence can form into independent and stable pieces of 3D structure known as protein domains. Finally, whole proteins can be built out of these domains. Many motifs appear to be shared between different proteins at all levels of this hierarchy. Current schemes for classifying protein structure often rely on manually curated hierarchies such as \parenciteAndreeva2014 that are not able to cleanly capture all possible variations. Thus, CNNs may be able to not only capture the known relationships between structural elements at different scales, but also derive new relations that have not been fully characterized. Further investigation of the exact filters learned could yield insight into the nature of these higher-level structural patterns, allowing for a better understanding of protein structure.
A final point to make is that although the convolutional neural network paradigm seems to be a good fit for problems involving proteins, certain issues remain. Though the model densely featurizes 3D space, we still need to employ heavy doses of data augmentation to explore the full three dimensions of rotational space. Though in theory derivable from the raw atom positions, we also do not directly encode the bonding pattern of atoms in our current formulation. Exciting directions for future work could involve methods to directly express connectivity as well as building in a notion of rotational equivariance.
The authors thank Guy Amdur, Robin Betz, Stephan Eismann, Naomi Latorraca, Reid Pryzant, João Rodrigues, and AJ Venkatakrishnan for their discussions and advice regarding this work. This material is based upon work supported by Intel, Amazon, the National Science Foundation Graduate Research Fellowship Program under Grant No. 1147470, and the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) program. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the Department of Energy.