Accurate Protein Structure Prediction by Embeddings and Deep Learning Representations
Abstract
Proteins are the major building blocks of life, and actuators of almost all chemical and biophysical events in living organisms. Their native structures in turn enable their biological functions, which have a fundamental role in drug design. This motivates predicting the structure of a protein from its sequence of amino acids, a fundamental problem in computational biology. In this work, we demonstrate state-of-the-art protein structure prediction (PSP) results using embeddings and deep learning models for prediction of backbone atom distance matrices and torsion angles. We recover 3D coordinates of backbone atoms and reconstruct the full atom protein by optimization. We create a new gold standard dataset of proteins which is comprehensive and easy to use. Our dataset consists of amino acid sequences, Q8 secondary structures, position specific scoring matrices, multiple sequence alignment coevolutionary features, backbone atom distance matrices, torsion angles, and 3D coordinates. We evaluate the quality of our structure prediction by RMSD on the latest Critical Assessment of Techniques for Protein Structure Prediction (CASP) test data, demonstrate results competitive with the winning teams and AlphaFold in CASP13, and supersede the results of the winning teams in CASP12. We make our data, models, and code publicly available.
Keywords: Protein Structure Prediction, Deep Learning, Embeddings.
1 Introduction
Proteins are necessary for a variety of functions within cells, serving as transporters, antibodies, and enzymes that catalyze reactions. They are polymer chains of amino acid residues, whose sequences (also known as primary structure) dictate stable spatial conformations, with particular torsion angles between successive monomers. To perform their functions, the amino acid residues must fold into proper configurations. The sequence space of proteins is vast, with 20 possible residues per position, and evolution has been sampling it over billions of years. Thus, current proteins are highly diverse in sequence, structure, and function. The high-throughput acquisition of DNA sequences, and therefore the ubiquity of known protein sequences, stands in contrast to the limited availability of 3D structures, which are more functionally relevant.
From a physics standpoint, the process of protein folding is a search for the minimum energy conformation, which happens in nature paradoxically fast [13]. Unfortunately, explicitly computing the energy of a protein conformation, along with its surrounding water molecules, is extremely complex. Forward simulation approaches have been prohibitively slow on even modest-length proteins, with successes reported only for smaller peptides across timescales that are orders of magnitude too short, and even these required specialized hardware [15] or massive crowdsourcing [12].
Inferring local secondary structure [9] consists of linear annotation of structural elements along the sequence [11]. Inferring tertiary structure consists of resolving the 3D atom coordinates of proteins. When highly similar sequences with known structures are available, this homology can be used for modeling. Recently, PSP has been tackled by first predicting contact points between amino acids, and then leveraging the contact map to infer structure. A primary indicator of contact between a pair of amino acids is their tendency to have correlated and compensatory mutations during evolution. The availability of large-scale data on DNA, and therefore protein sequences, allows detection of such coevolutionary constraints from sets of sequences that are homologous to a protein of interest. Registering such contacting pairs in a matrix provides a framework for their probabilistic prediction. This contact map matrix can be generalized to register distances between amino acids [17].
Machine learning approaches have garnered recent success in PSP [4, 2] and its subproblems [16]. These leverage available repositories of tertiary structure [5] and their curated compilations [14, 3] as training data for models that predict structure from sequence. Specifically, the recent biennial critical assessment of PSP methods [6] featured multiple such methods. Most prominently, a closed-source, ResNet-based architecture [7] has achieved impressive results in the CASP evaluation settings, based on representing protein structures both by their distance matrices and by their torsion angles.
In this work, we design a novel representation of biologically relevant input data and construct a processing flow for PSP as shown in Figure 1. Our approach leverages advances in deep sequence models and learns transformations of amino acids and their auxiliary information. The method operates in three stages: (i) predicting backbone atom distance matrices and torsion angles, (ii) recovering backbone atom 3D coordinates, and (iii) reconstructing the full atom protein by optimization.
We demonstrate state-of-the-art protein structure prediction results using deep learning models to predict backbone atom distance matrices and torsion angles, recover backbone atom 3D coordinates, and reconstruct the full atom protein by optimization. Our three key contributions are:

A gold standard dataset of around 75k proteins, described in Table 1, which we call the CUProtein dataset and which is easy to use in developing deep learning models for PSP.

Competitive results with the winning teams on CASP13, including a comparison with AlphaFold (A7D) [7], and results superseding the winning teams on several CASP12 targets.

Publicly available source code for both protein structure prediction using deep learning models and protein reconstruction.
This work explores encoded representations for sequences of amino acids alongside their auxiliary information. We offer full access to data, models, and code, removing significant barriers to entry for investigators and making methods for this important application domain publicly available.
2 Methods
We address two problems: (i) predicting backbone distance matrices and torsion angles of backbone atoms from amino acid sequences, Q8 secondary structure, PSSMs, and coevolutionary multiple sequence alignment; and (ii) reconstruction of all atom coordinates from the predicted distance matrices and torsion angles. Figure 2 shows examples of ground truth backbone distance matrices and their predictions for CASP13 and CASP12 proteins and the corresponding full atom 3D native structures and reconstructed proteins.
Feature                   Notation  Source        Dims  IO      Type

AA Sequence                         PDB                 Input   chars
PSSM                                AA/HHBlits          Input   Real [0,1]
MSA covariance                      AA/jackHMMER        Input   Real [0,1]
Secondary Structure (SS)            DSSP                Input   8 chars
Cα, Cβ Distance Matrices            PDB                 Output  Ångstrom
Torsion Angles (φ, ψ)               DSSP                Output  Radians
2.1 Predicting Backbone Distance Matrices and Torsion Angles
Our inputs are AA sequences, Q8 sequences, PSSMs, and MSA covariance matrices. Our outputs are backbone distance matrices and torsion angles.
Embeddings.
We begin with a one-hot representation of each amino acid and secondary structure sequence, and real-valued PSSMs and MSA covariance matrices. These are passed through embedding layers and then on to an encoder-decoder architecture. To leverage sequence homology, we compute the covariance matrix of the MSA features by embedding the homology information along a fixed-dimensional vector to form a 3-tensor, and we contract the tensor along the embedding dimension. The result is then passed as input to the encoder:
(1) 
Encoder-decoder architectures.
We use encoder-decoder models with a bottleneck to train prediction models. In the first model, a single encoder receives as input the aggregation (by concatenation) of the embeddings of each input, and two separate decoders output distance matrices and torsion angles, as shown in Figure 3 (a). In addition, we use a second model which consists of separate encoders for each embedded input, whose outputs are aggregated by concatenation after encoding, and separate decoders for torsion angles and backbone distance matrices, as shown in Figure 3 (b).
(2) 
Using the separate-encoder model involves a larger number of trainable parameters. Our models differ in their use of input embeddings, their architectures, and their loss functions.
Deep learning models.
Building on techniques commonly used in natural language processing (NLP), our models use embeddings and a sequence of bidirectional GRUs and LSTMs with skip connections, and include batch normalization, dropout, and dense layers. We experimented with various loss functions including MAE, MSE, Frobenius norm, and distance logarithm to handle the dynamic range of distances. To learn the loss function we have also implemented distance matrix prediction using conditional GANs and VAEs for protein subsequences.
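The loss functions named above can be sketched in NumPy as follows. This is a minimal illustration and not the released training code: the function names are hypothetical, and the exact form of the distance-logarithm loss (here, a mean squared error between log-distances with a small `eps` for numerical stability) is an assumption.

```python
import numpy as np

def mae_loss(pred, true):
    """Mean absolute error over all matrix entries."""
    return float(np.mean(np.abs(pred - true)))

def mse_loss(pred, true):
    """Mean squared error over all matrix entries."""
    return float(np.mean((pred - true) ** 2))

def frobenius_loss(pred, true):
    """Frobenius norm of the difference between the two matrices."""
    return float(np.linalg.norm(pred - true, ord="fro"))

def log_distance_loss(pred, true, eps=1e-6):
    """Assumed form: MSE between log-distances, which compresses the
    dynamic range so that large distances do not dominate the loss."""
    return float(np.mean((np.log(pred + eps) - np.log(true + eps)) ** 2))
```

Comparing distances on a logarithmic scale weights relative rather than absolute errors, which is one plausible way to address the dynamic range mentioned in the text.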
2.2 Reconstructing Proteins
Once backbone distance matrices and torsion angles are predicted, we address two reconstruction subproblems: (i) reconstructing the protein backbone coordinates from their distance matrices, and (ii) reconstructing the full atom protein coordinates from the Cα or Cβ coordinates and torsion angles.
Backbone coordinate reconstruction.
We employ three different techniques for reconstructing the 3D coordinates of the backbone atoms of a protein from the predicted matrix of their pairwise distances [8]. Given a predicted distance matrix, our goal is to recover the 3D coordinates of the corresponding points. We notice that the distance matrix depends only on the Gram matrix of the coordinates:
(3) 
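For reference, the standard relation between squared Euclidean distances and the Gram matrix, as described in the Euclidean distance matrix literature [8], can be written as below. The notation is chosen here for illustration and may differ from the authors' own: the columns of X ∈ R^{3×n} are the point coordinates x_i, G is the Gram matrix, D holds squared distances, and 1 is the all-ones vector.

```latex
D_{ij} = \lVert x_i - x_j \rVert_2^2
       = x_i^\top x_i - 2\, x_i^\top x_j + x_j^\top x_j ,
\qquad
D = \operatorname{diag}(G)\,\mathbf{1}^\top - 2\,G + \mathbf{1}\,\operatorname{diag}(G)^\top ,
\qquad
G = X^\top X .
```

Because D is a function of G alone, any rotation or reflection of the coordinates leaves D unchanged.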
Multidimensional scaling (MDS):
(4) 
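Classical multidimensional scaling can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions, not the released implementation: the input holds plain (not squared) distances, and the function names `classical_mds` and `pairwise_distances` are hypothetical.

```python
import numpy as np

def pairwise_distances(coords):
    """Euclidean distance matrix of an (n, 3) coordinate array."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(dist, dim=3):
    """Recover (n, dim) coordinates from an (n, n) distance matrix.

    The result is determined only up to rotation, translation,
    and reflection.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double centering turns squared distances into a Gram matrix
    # of coordinates centered at the origin.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    gram = 0.5 * (gram + gram.T)  # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(gram)  # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:dim]
    # Clip small negative eigenvalues that arise from noisy distances.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale
```

For exact distances the recovered points reproduce the input matrix. For noisy predicted distances, keeping the top three eigenvalues gives a low-rank approximation of the Gram matrix.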
Semidefinite programming (SDP) and relaxation (SDR):
(5) 
where the operator acts on the Gram matrix.
Alternating direction method of multipliers (ADMM) [4]:
We have found multidimensional scaling to be the fastest and most robust of the three methods. It does not depend on algorithm hyperparameters, which makes it the most suitable for our purposes. To handle the chirality of these coordinates we calculate all alpha-torsion angles (the torsion angles defined by four consecutive Cα atoms) and compare their distribution to the characteristic, highly non-symmetric distribution of protein alpha-torsions. If the predicted alpha-torsions are unlikely to result from the distribution of the protein dataset, we invert the chirality of the coordinates.
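The chirality check can be sketched as follows. This is an assumption-laden illustration rather than the authors' procedure: the decision rule used here compares the log-likelihood of the observed alpha-torsions with that of their mirror image under a reference density, and `log_density` is a hypothetical caller-supplied function standing in for the empirical distribution of the protein dataset.

```python
import numpy as np

def alpha_torsions(coords):
    """Torsion angles (radians) of every four consecutive points in an
    (n, 3) array, using the atan2 form of the dihedral angle."""
    b = coords[1:] - coords[:-1]
    b1, b2, b3 = b[:-2], b[1:-1], b[2:]
    n1 = np.cross(b1, b2)
    n2 = np.cross(b2, b3)
    y = np.linalg.norm(b2, axis=1) * np.sum(b1 * n2, axis=1)
    x = np.sum(n1 * n2, axis=1)
    return np.arctan2(y, x)

def fix_chirality(coords, log_density):
    """Return coords, mirrored if the mirror image is more likely.

    Mirroring negates every torsion angle, so the two hypotheses can
    be compared without rebuilding the coordinates.
    """
    torsions = alpha_torsions(coords)
    ll_as_is = np.sum(log_density(torsions))
    ll_mirror = np.sum(log_density(-torsions))
    if ll_mirror > ll_as_is:
        flipped = coords.copy()
        flipped[:, 0] *= -1.0  # reflect through the yz-plane
        return flipped
    return coords
```

A reflection is needed because a distance matrix is identical for a structure and its mirror image, so distance-based reconstruction cannot distinguish the two.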
Full atom coordinate reconstruction.
Given backbone coordinates, we assign plausible coordinates to the rest of the protein’s atoms. We begin with an initial guess or prediction for the φ and ψ torsion angles. We maintain a lookup table of mean φ, ψ values for each combination of two consecutive Cα torsions and three Cα angles (the angles defined by three consecutive Cα atoms). Using these values and the Cα positions we generate an initial model. This model is then relaxed by a series of energy minimization simulations under an energy function that includes: standard bonded terms (bond, angle, plane, and out-of-plane), knowledge-based Ramachandran and pairwise terms, torsion constraints on the φ and ψ angles, and tether constraints on the Cα positions. The latter term reduces the perturbations caused by the initial high forces. Finally, we add side chains using a rotamer library, and remove clashes by a series of energy minimization simulations. We develop a very similar method for reconstructing the full atom protein from Cβ atom distance matrices.
3 Results
PDB  CASP  Target ID  Best RMSD  A7D  Ours 

5Z82  13  T0951  1.01 (Seok)  NA  1.79 
6D2V  13  T0965  1.72 (A7D)  1.72  1.60 
6QFJ  13  T0967  1.13 (BAKER)  NA  1.18 
6CCI  13  T0969  1.96 (Zhang)  2.27  2.53 
6HRH  13  T1003  0.88 (MULTICOM)  2.12  2.95 
6QEK  13  T1006  0.58 (YASARA)  0.78  1.02 
6N91  13  T1018  1.24 (Wallner)  1.77  3.89 
6M9T  13  T1011D1  1.58 (A7D)  1.58  1.64 
5J5V  12  T0861  0.49 (MULTICOM)  NA  1.00 
2N64  12  T0865  1.87 (HHPred)  NA  1.58 
5JMU  12  T0879  1.35 (MULTICOM)  NA  1.31 
5JO9  12  T0889  1.31 (Seok)  NA  1.79 
4YMP  12  T0891  1.10 (GOAL)  NA  1.36 
5XI8  12  T0942D2  1.73 (EdaRose)  NA  1.60 
We compare our predictions on CASP12 and CASP13 [1] test targets. Deep learning methods were widely used only starting from CASP13, which is also when AlphaFold was introduced. The use of deep learning methods in CASP13, enabled by the availability of DL programming frameworks, significantly improved performance over CASP12. Table 2 provides a representative comparison between the winning CASP12 and CASP13 competition models, the AlphaFold models for the CASP13 targets for which A7D submitted predictions to CASP, and our models. These proteins were selected such that they do not contain multiple chains and are sufficiently complex. The average length of the selected CASP13 target proteins is 325 residues, and the average for CASP12 is 247 residues. An RMSD of around 2 Å on test targets is considered accurate in CASP.
Our results supersede those of the winning teams on three of the six selected CASP12 targets (T0865, T0879, and T0942D2), comparing against the best team for each protein, which highlights the improvement from using deep learning methods. Our approach is on par with the winning teams in CASP13, compared with the winning team for each protein, which indicates that our method is state of the art. We measure the sequence-independent RMSD, consistent with the CASP evaluation reports, between the deposited structures and our predictions. CASP competitors such as AlphaFold provide predictions only for selected proteins. Overall, our performance on CASP is highly competitive.
Training our models on a cloud instance takes two days using GPUs. Prediction of backbone distance matrices and torsion angles takes a few seconds per protein, and reconstruction of full atom proteins from distance matrices and torsion angles takes a few minutes per protein, depending on protein length.
Limitations of this work are: (i) we only handle single domain proteins and not complexes with multiple chains, (ii) PSSM and MSA data for several of the CASP targets are limited to a subsequence of the full-length protein, and (iii) we do not use available methods for detecting and reconstructing beta-sheets.
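As an illustration of the RMSD metric, the sketch below computes the RMSD between two coordinate sets after optimal rigid superposition with the Kabsch algorithm. It is a simplification and not the CASP evaluation pipeline: it assumes the residue correspondence between prediction and native structure is already known, whereas a sequence-independent evaluation must first establish that correspondence by structural alignment. The function name `kabsch_rmsd` is hypothetical.

```python
import numpy as np

def kabsch_rmsd(pred, native):
    """RMSD between two (n, 3) coordinate arrays after optimal
    superposition by translation and proper rotation."""
    pred = np.asarray(pred, dtype=float)
    native = np.asarray(native, dtype=float)
    # Remove translation by centering both point sets.
    p = pred - pred.mean(axis=0)
    q = native - native.mean(axis=0)
    # The SVD of the covariance matrix yields the optimal rotation.
    u, _, vt = np.linalg.svd(p.T @ q)
    # Force a proper rotation (determinant +1), excluding reflections.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    p_rot = p @ rot.T
    return float(np.sqrt(np.mean(np.sum((p_rot - q) ** 2, axis=1))))
```

Excluding reflections matters here: a mirror-image prediction has the same distance matrix as the native structure but a nonzero RMSD.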
4 Conclusions
We present novel deep learning representations and models for protein structure prediction. We provide a new and open dataset which can easily be used, and make our models and code publicly available [10]. We predict accurate structures, within around 2 Å RMSD of the native structures. Our results are competitive with the latest CASP13 and AlphaFold results and supersede CASP12 results.
References
[1] (2019) A further leap of improvement in tertiary structure prediction in CASP13 prompts new routes for future assessments. Proteins: Structure, Function, and Bioinformatics.
[2] (2019) End-to-end differentiable learning of protein structure. Cell Systems 8 (4), pp. 292–301.
[3] (2019) ProteinNet: a standardized data set for machine learning of protein structure. BMC Bioinformatics 20 (1), pp. 311.
[4] (2018) Generative modeling for protein structures. In Advances in Neural Information Processing Systems, pp. 7494–7505.
[5] (2000) The protein data bank. Nucleic Acids Research 28.
[6] (2018) 13th community wide experiment on the critical assessment of techniques for protein structure prediction.
[7] (2018) AlphaFold: de novo structure prediction with deep-learning based scoring.
[8] (2015) Euclidean distance matrices: essential theory, algorithms, and applications. IEEE Signal Processing Magazine 32 (6), pp. 12–30.
[9] (2018) High quality prediction of protein Q8 secondary structure by diverse neural network architectures. NeurIPS Workshop on Machine Learning for Molecules and Materials.
[10] (2019) GitHub repository for Accurate protein structure prediction by embeddings and deep learning representations. https://github.com/idrori/cutsp
[11] (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers: Original Research on Biomolecules 22 (12), pp. 2577–2637.
[12] (2009) Folding@home and Genome@home: using distributed computing to tackle previously intractable problems in computational biology. arXiv preprint arXiv:0901.0866.
[13] (1969) How to fold graciously. In Proceedings of a meeting held at Allerton House. P. Debrunner, JCM Tsibris, and E. Munck, editors. University of Illinois Press, Urbana, IL.
[14] (1997) CATH–a hierarchic classification of protein domain structures. Structure 5 (8), pp. 1093–1109.
[15] (2009) Millisecond-scale Molecular Dynamics Simulations on Anton. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, pp. 39:1–39:11.
[16] (2018) Computational protein design with deep learning neural networks. Scientific Reports 8 (1), pp. 6349.
[17] (2019) Distance-based protein folding powered by deep learning. Proceedings of the National Academy of Sciences 116 (34), pp. 16856–16865.