Accurate Protein Structure Prediction by Embeddings and Deep Learning Representations

Accurate Protein Structure Prediction by Embeddings and Deep Learning Representations

Iddo Drori 1Columbia University, Department of Computer Science, New York, NY, 10027 12Cornell University, School of Operations Research and Information Engineering, Ithaca, NY 14853 2    Darshan Thaker 1Columbia University, Department of Computer Science, New York, NY, 10027 1    Arjun Srivatsa 1Columbia University, Department of Computer Science, New York, NY, 10027 1    Daniel Jeong 1Columbia University, Department of Computer Science, New York, NY, 10027 1    Yueqi Wang 1Columbia University, Department of Computer Science, New York, NY, 10027 1    Linyong Nan 1Columbia University, Department of Computer Science, New York, NY, 10027 1    Fan Wu 1Columbia University, Department of Computer Science, New York, NY, 10027 1    Dimitri Leggas 1Columbia University, Department of Computer Science, New York, NY, 10027 1    Jinhao Lei 1Columbia University, Department of Computer Science, New York, NY, 10027 1    Weiyi Lu 1Columbia University, Department of Computer Science, New York, NY, 10027 1    Weilong Fu 1Columbia University, Department of Computer Science, New York, NY, 10027 1    Yuan Gao 1Columbia University, Department of Computer Science, New York, NY, 10027 1    Sashank Karri 1Columbia University, Department of Computer Science, New York, NY, 10027 1    Anand Kannan 1Columbia University, Department of Computer Science, New York, NY, 10027 1    Antonio Moretti 1Columbia University, Department of Computer Science, New York, NY, 10027 1   
Mohammed AlQuraishi
3Harvard University, Department of Systems Biology, Harvard Medical School, Boston, MA 02115 3
   Chen Keasar 4Ben-Gurion University, Department of Computer Science, Israel, 8410501 4    Itsik Pe’er 1Columbia University, Department of Computer Science, New York, NY, 10027 1

Proteins are the major building blocks of life, and actuators of almost all chemical and biophysical events in living organisms. Their native structures in turn enable their biological functions which have a fundamental role in drug design. This motivates predicting the structure of a protein from its sequence of amino acids, a fundamental problem in computational biology. In this work, we demonstrate state-of-the-art protein structure prediction (PSP) results using embeddings and deep learning models for prediction of backbone atom distance matrices and torsion angles. We recover 3D coordinates of backbone atoms and reconstruct full atom protein by optimization. We create a new gold standard dataset of proteins which is comprehensive and easy to use. Our dataset consists of amino acid sequences, Q8 secondary structures, position specific scoring matrices, multiple sequence alignment co-evolutionary features, backbone atom distance matrices, torsion angles, and 3D coordinates. We evaluate the quality of our structure prediction by RMSD on the latest Critical Assessment of Techniques for Protein Structure Prediction (CASP) test data and demonstrate competitive results with the winning teams and AlphaFold in CASP13 and supersede the results of the winning teams in CASP12. We make our data, models, and code publicly available.

Protein Structure Prediction Deep Learning Embeddings.

1 Introduction

Proteins are necessary for a variety of functions within cells including transport, antibody, enzyme and catalysis. They are polymer chains of amino acid residues, whose sequences (aka primary structure) dictate stable spatial conformations, with particular torsion angles between successive monomers. To perform their functions, the amino acid residues must fold into proper configurations. The sequence space of proteins is vast, 20 possible residues per position, and evolution has been sampling it over billions of years. Thus, current proteins are highly diverse in sequences, structures and functions. The high throughput acquisition of DNA sequences, and therefore ubiquity of known protein sequences stands in contrast to the limited availability of 3D structures, which are more functionally relevant.

From a physics standpoint, the process of protein folding is a search for the minimum energy conformation which happens in nature paradoxically fast [13]. Unfortunately, explicitly computing the energy of a protein conformation, along with its surrounding water molecules, is extremely complex. Forward simulation approaches have been prohibitively slow on even modest-length proteins, with successes reported only for smaller peptides across timescales that are orders of magnitude too short, and even these required specialized hardware [15] or massive crowd sourcing [12]. Inferring local secondary structure [9] consists of linear annotation of structural elements along the sequence [11]. Inferring tertiary structure consists of resolving the 3D atom coordinates of proteins. When highly similar sequences are available with known structures, this homology can be used for modeling. Recently, PSP had been tackled by first predicting contact points between amino acids, and then leveraging the contact map to infer structure. A primary indicator of contact between a pair of amino acids is their tendency to have correlated and compensatory mutations during evolution. The availability of large scale data on DNA, and therefore protein sequences allows detection of such co-evolutionary constraints from sets of sequences that are homologous to a protein of interest. Registering such contacting pairs in a matrix facilitates a framework for their probabilistic prediction. This contact map matrix can be generalized to register distances between amino acids [17].

Machine learning approaches garnered recent success in PSP [4, 2] and its sub-problems [16]. These leverage available repositories of tertiary structure [5] and its curated compilations [14, 3] as training data for models that predict structure from sequence. Specifically, the recent biannual critical assessment of PSP methods [6] featured multiple such methods. Most prominently, a closed-source, ResNet-based architecture [7] has achieved impressive results in the CASP evaluation settings, based on representing protein structures both by their distance matrices as well as their torsion angles.

In this work, we design a novel representation of biologically relevant input data and construct a processing flow for PSP as shown in Figure 1. Our method leverages advances in deep sequence models and proposes a method to learn transformations of amino acids and their auxiliary information. The method operates in three stages by i) predicting backbone atom distance matrices and torsion angles ii) recovering backbone atom 3D coordinates, and iii) reconstructing the full atom protein by optimization.

Figure 1: High level flow. The method operates in three stages by i) predicting backbone atom distance matrices and torsion angles ii) recovering backbone atom 3D coordinates, and iii) reconstructing the full atom protein by optimization

We demonstrate state-of-art protein structure prediction results using deep learning models to predict backbone atom distance matrices and torsion angles, recover backbone atom 3D coordinates and reconstruct the full atom protein by optimization. Our three key contributions are:

  • A gold standard dataset of around 75k proteins described in Table 1 which we call the CUProtein dataset which is easy to use in developing deep learning models for PSP.

  • Competitive results with the winning teams on CASP13 and a comparison with AlphaFold (A7D) [7] and results mostly superseding the winning teams on CASP12.

  • Publicly available source code for both protein structure prediction using deep learning models and protein reconstruction.

This work explores encoded representation for sequences of amino acids alongside their auxiliary information. We offer full access to data, models, and code, which remove significant barriers to entry for investigators and make publicly available methods for this important application domain.

Figure 2: Visualization of CASP12 and CASP13 results: top row denotes protein 5Z82, center row protein 6D2V, and bottom row protein 5JO9. Columns denote (a) ground truth and (b) predicted distance matrices; (c) native (green) and predicted (cyan) structures using our model.

2 Methods

We address two problems: (i) predicting backbone distance matrices and torsion angles of backbone atoms from amino acid sequences, Q8 secondary structure, PSSMs, and co-evolutionary multiple sequence alignment; and (ii) reconstruction of all atom coordinates from the predicted distance matrices and torsion angles. Figure 2 shows examples of ground truth backbone distance matrices and their predictions for CASP13 and CASP12 proteins and the corresponding full atom 3D native structures and reconstructed proteins.

Feature Notation Source Dims IO Type
AA Sequence PDB Input chars
PSSM AA/HHBlits Input Real [0,1]
MSA covariance AA/jackHMMER Input Real [0,1]
Secondary Structure (SS) DSSP Input 8 chars
, Distance Matrices PDB Output Ångstrom
Torsion Angles (, ) DSSP Output Radians
Table 1: Our CUProtein dataset consists of amino acid sequences, secondary structure, PSSM’s, MSA covariance matrices, backbone distance matrices, and torsion angles.

2.1 Predicting Backbone Distance Matrices and Torsion Angles

Our inputs are AA sequences, Q8 sequences, PSSMs, and MSA covariance matrices. Our outputs are backbone distance matrices and torsion angles.


We begin with a one-hot representation of each amino acid and secondary structure sequence, and real valued PSSM’s and MSA covariance matrices. These are passed through embedding layers and then onto an encoder-decoder architecture. To leverage sequence homology, we compute the covariance matrix of the MSA features by embedding the homology information along a dimensional vector to form a 3-tensor and contract the tensor along the dimensional embedding which is then passed as input to the encoder:


Encoder-Decoder architectures

We use encoder-decoder models with a bottleneck to train prediction models. The encoder receives as input the aggregation (by concatenation) of the embeddings of each input , and two separate decoders and that output distance matrices and torsion angles for as shown in Figure 3 (a). In addition, we also use a model which consists of separate encoders for each embedded input , that are aggregated by concatenation after encoding, and separate decoders and for torsion angles and backbone distance matrices as shown in Figure 3 (b).


Using a separate encoder model involves a larger number of trainable parameters. Our models differ in the use of embeddings for the input, their models, and loss functions.

Figure 3: Model architecture: inputs are embedded followed by (a) aggregation, encoding using sequence models, and decoding; (b) encoding using sequence models, aggregation, and decoding.

Deep learning models

Building on techniques commonly used in natural language processing (NLP) our models use embeddings and a sequence of bi-directional GRU’s and LSTM’s with skip connections, and include batch normalization, dropout, and dense layers. We experimented with various loss functions including MAE, MSE, Frobenius norm, and distance logarithm to handle the dynamic range of distances. To learn the loss function we have also implemented distance matrix prediction using conditional GAN’s and VAE’s for protein sub-sequences.

2.2 Reconstructing Proteins

Once backbone distance matrices and torsion angle are predicted we address two reconstruction sub-problems: (i) reconstructing the protein backbone coordinates from their distance matrices, and (ii) reconstructing the full atom protein coordinates from the or coordinates and torsion angles.

Backbone coordinate reconstruction.

We employ three different techniques for reconstructing the 3D coordinates between backbone atoms of a protein from the predicted matrix of their pairwise distances [8]. Given a predicted distance matrix our goal is to recover 3D coordinates of points. We notice that depends only on the Gram matrix :


Multi-dimensional scaling (MDS):


Semi-definite programming (SDP) and relaxation (SDR):


where operates on the Gram matrix .

Alternating direction method of multipliers (ADMM) [4]:

We have found multi-dimensional scaling to be the fastest and most robust method of the three, without depending on algorithm hyper-parameters, which is most suitable for our purposes. To handle the chirality of these coordinates we calculate all alpha-torsion angles (the torsion angles defined by four consecutive atoms) and compare their distribution to the characteristic, highly non-symmetric distribution of protein alpha-torsions. If the predicted alpha-torsions are unlikely to result from the distribution of the protein dataset, we invert the coordinates chirality.

Full atom coordinate reconstruction.

Given backbone coordinates we assign plausible coordinates to the rest of the protein’s atoms. We begin with an initial guess or prediction for and torsion angles. We maintain a look-up table of mean , values for each combination of two consecutive torsions and three -angles (the angles defined by three consecutive atoms). Using these values and the positions we generate an initial model. This model is then relaxed by a series of energy minimization simulations under an energy function that includes: standard bonded terms (bond, angle, plane and out-of-plane), knowledge based Ramachandran and pairwise terms, torsion constraints on the and angles, and tether constraints on the position. The latter term reduces the perturbations of the initial high forces. Finally, we add side-chains using a rotamer library, and remove clashes by a series of energy minimization simulations. We develop a very similar method for reconstructing the full atom protein from atom distance matrices.

3 Results

PDB CASP Tar-id Best RMSD A7D Ours
5Z82 13 T0951 1.01 (Seok) NA 1.79
6D2V 13 T0965 1.72 (A7D) 1.72 1.60
6QFJ 13 T0967 1.13 (BAKER) NA 1.18
6CCI 13 T0969 1.96 (Zhang) 2.27 2.53
6HRH 13 T1003 0.88 (MULTICOM) 2.12 2.95
6QEK 13 T1006 0.58 (YASARA) 0.78 1.02
6N91 13 T1018 1.24 (Wallner) 1.77 3.89
6M9T 13 T1011-D1 1.58 (A7D) 1.58 1.64
5J5V 12 T0861 0.49 (MULTICOM) NA 1.00
2N64 12 T0865 1.87 (HHPred) NA 1.58
5JMU 12 T0879 1.35 (MULTICOM) NA 1.31
5JO9 12 T0889 1.31 (Seok) NA 1.79
4YMP 12 T0891 1.10 (GOAL) NA 1.36
5XI8 12 T0942-D2 1.73 (EdaRose) NA 1.60
Table 2: Comparison between the winning CASP12/13 models for each protein, best AlphaFold (A7D) model for CASP13, and our model for CASP12/13.

We have compared our predictions on CASP12 and CASP13 [1] test targets. Deep learning methods, including were widely used only starting from CASP13. AlphaFold was introduced starting from CASP13. The use of deep learning methods in CASP13, due to availability of DL programming frameworks, significantly improved performance over CASP12. Table 2 provides a representative comparison between the winning CASP12 and CASP13 competition models, AlphaFold models for CASP13 for which A7D submitted predictions to CASP, and our models. These proteins were selected such that they do not contain multiple chains and are sufficiently complex. The average length of the CASP13 target proteins selected is 325 residues, and the average for CASP12 is 247 residues. Results of RMSD around 2 Angstrom on test targets are considered accurate in CASP.

Our results supersede the winning teams of CASP12 compared with each best team for each protein which highlights the improvement using deep learning methods. Our approach is on par with the winning teams in CASP13, compared with the winning team for each protein, which highlights that our methos is state-of-the-art. We measure the sequence independent RMSD which is consistent with the CASP evaluation reports and matches the deposited structures and our predictions. CASP competitors such as AlphaFold provide predictions for selected proteins. Overall, our performance on CASP is highly competitive. Training our models on a cloud instance takes two days using GPUs. Prediction of backbone distance matrices and torsion angles takes a few seconds per protein, and reconstruction of full atom proteins from distance matrices and torsion angles takes a few minutes per protein, depending on protein length. Limitations of this work are: (i) we only handle single domain proteins and not complexes with multiple chains, (ii) PSSM and MSA data for several of the CASP targets are limited to a subsequence of the full length protein, and (iii) we do not use available methods for detecting and reconstructing beta-sheets.

4 Conclusions

We present novel deep learning representations and models for protein structure prediction. We provide a new and open dataset which can easily be used, and make our models and code publicly available [10]. We predict accurate structures around 2 Angstrom within the native structures. Our results are competitive with the latest CASP13 and AlphaFold results and supersede CASP12 results.


  • [1] L. A. Abriata, G. E. Tamò, and M. Dal Peraro (2019) A further leap of improvement in tertiary structure prediction in CASP13 prompts new routes for future assessments. Proteins: Structure, Function, and Bioinformatics. Cited by: §3.
  • [2] M. AlQuraishi (2019) End-to-end differentiable learning of protein structure. Cell systems 8 (4), pp. 292–301. Cited by: §1.
  • [3] M. AlQuraishi (2019) ProteinNet: a standardized data set for machine learning of protein structure. BMC bioinformatics 20 (1), pp. 311. Cited by: §1.
  • [4] N. Anand and P. Huang (2018) Generative modeling for protein structures. In Advances in Neural Information Processing Systems, pp. 7494–7505. Cited by: §1, §2.2.
  • [5] H.M. Berman (2000) The protein data bank. Nucleic Acids Research 28. External Links: Link Cited by: §1.
  • [6] P. S. P. Center (2018) 13th community wide experiment on the critical assessment of techniques for protein structure prediction. External Links: Link Cited by: §1.
  • [7] G. DeepMind (2018) AlphaFold: de novo structure prediction with deep-learning based scoring. External Links: Link Cited by: item , §1.
  • [8] I. Dokmanic, R. Parhizkar, J. Ranieri, and M. Vetterli (2015) Euclidean distance matrices: essential theory, algorithms, and applications. IEEE Signal Processing Magazine 32 (6), pp. 12–30. Cited by: §2.2.
  • [9] I. Drori, I. Dwivedi, P. Shrestha, J. Wan, Y. Wang, Y. He, A. Mazza, H. Krogh-Freeman, D. Leggas, K. Sandridge, et al. (2018) High quality prediction of protein q8 secondary structure by diverse neural network architectures. NeurIPS Workshop on Machine Learning for Molecules and Materials. Cited by: §1.
  • [10] I. Drori, D. Thaker, A. Srivatsa, D. Jeong, Y. Wang, L. Nan, F. Wu, D. Leggas, J. Lei, W. Lu, W. Fu, Y. Gao, S. Karri, A. Kannan, A. Moretti, M. AlQuraishi, C. Keasar, and I. Pe’er (2019) GitHub repository for Accurate protein structure prediction by embeddings and deep learning representations. Note: Cited by: §4.
  • [11] W. Kabsch and C. Sander (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers: Original Research on Biomolecules 22 (12), pp. 2577–2637. Cited by: §1.
  • [12] S. M. Larson, C. D. Snow, M. Shirts, and V. S. Pande (2009) Folding@ home and genome@ home: using distributed computing to tackle previously intractable problems in computational biology. arXiv preprint arXiv:0901.0866. Cited by: §1.
  • [13] C. Levinthal (1969) How to fold graciously. In Proceedings of a meeting held at Allerton House. P. Debrunner, JCM Tsibris, and E. Munck, editors. University of Illinois Press, Urbana, IL, Cited by: §1.
  • [14] C. A. Orengo, A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells, and J. M. Thornton (1997) CATH–a hierarchic classification of protein domain structures. Structure 5 (8), pp. 1093–1109. Cited by: §1.
  • [15] D. E. Shaw, R. O. Dror, J. K. Salmon, J. P. Grossman, K. M. Mackenzie, J. A. Bank, C. Young, M. M. Deneroff, B. Batson, K. J. Bowers, E. Chow, M. P. Eastwood, D. J. Ierardi, J. L. Klepeis, J. S. Kuskin, R. H. Larson, K. Lindorff-Larsen, P. Maragakis, M. A. Moraes, S. Piana, Y. Shan, and B. Towles (2009) Millisecond-scale Molecular Dynamics Simulations on Anton. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, pp. 39:1–39:11. Cited by: §1.
  • [16] J. Wang, H. Cao, J. Z. Zhang, and Y. Qi (2018) Computational protein design with deep learning neural networks. Scientific reports 8 (1), pp. 6349. Cited by: §1.
  • [17] J. Xu (2019) Distance-based protein folding powered by deep learning. Proceedings of the National Academy of Sciences 116 (34), pp. 16856–16865. Cited by: §1.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description