Motivation: To bridge the exponential gap between the numbers of unlabeled and labeled protein sequences, a couple of works have adopted semi-supervised learning for protein sequence modeling. They pre-train a model with a substantial amount of unlabeled data and transfer the learned representations to various downstream tasks. Nonetheless, the current pre-training methods mostly rely on a language modeling pre-training task and often show limited performance. Therefore, a pertinent protein-specific pre-training task is necessary to better capture the information contained within protein sequences.
Results: In this paper, we introduce a novel pre-training scheme called PLUS, which stands for Protein sequence representations Learned Using Structural information. PLUS consists of masked language modeling and a protein-specific pre-training task, namely same family prediction. PLUS can be used to pre-train various model architectures. In this work, we mainly use PLUS to pre-train a recurrent neural network (RNN) and refer to the resulting model as PLUS-RNN. It advances the state-of-the-art pre-training methods on six out of seven tasks, i.e., (1) three protein(-pair)-level classification, (2) two protein-level regression, and (3) two amino-acid-level classification tasks. Furthermore, we present results from our ablation studies and qualitative interpretation analyses to better understand the strengths of PLUS-RNN.
Availability: The code and pre-trained models are available at https://github.com/mswzeus/PLUS/
Supplementary information: Supplementary data are available at Bioinformatics online.
Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information
Seonwoo Min, Seunghyun Park, Siwon Kim, Hyun-Soo Choi, and Sungroh Yoon (2020)
Proteins, which consist of linear chains of amino acids, are among the most versatile molecules in living organisms. They serve vital functions in prevalent biological mechanisms, e.g., transmitting nerve impulses, storing and transporting other molecules, and providing immune protection (Berg et al., 2006). The versatility of proteins is generally attributed to their diverse structures. Proteins naturally fold into three-dimensional structures depending on their amino acid sequences, and these structures in turn have a direct impact on their functions.
With the advent of next-generation sequencing technologies, obtaining protein sequences has become relatively more accessible. Nonetheless, annotating a sequence with meaningful attributes still requires time-consuming and resource-intensive processes. To bridge the exponential gap between the numbers of unlabeled and labeled protein sequences, a variety of in silico approaches have been widely adopted for predicting their structures and numerous characteristics (Holm and Sander, 1996).
Sequence alignment is one of the key techniques in computational protein biology. Alignment-based methods compare protein sequences using carefully designed scoring matrices (Eddy, 2004) or Hidden Markov Models (HMMs) (Söding et al., 2005). A correct alignment can group similar sequences together, provide information on conserved local regions, and help us investigate uncharacterized proteins. However, its computational complexity increases exponentially with the number of proteins, and it has difficulty identifying distantly related proteins. Homologous proteins sharing a common evolutionary ancestor can have high sequence-level variations, resulting in dissimilar sequences having similar structures (Creighton, 1993). Therefore, simply comparing sequence similarities through alignments often fails to capture the global structural and functional similarities of proteins.
Building upon the success of deep learning, a number of works have also proposed deep learning algorithms for computational protein biology (Min et al., 2017). Some of them use raw protein sequences and rely solely on deep learning to learn high-dimensional representations. Others also take in features extracted from alignments or domain expertise. While they have advanced the state-of-the-art (SOTA) for various tasks, they share some common limitations. First, they are based on supervised training of randomly initialized models from scratch. Thus, they require a large curated labeled dataset, which is usually not easily obtainable. Second, an ad hoc application of deep learning cannot guarantee good results. It demands careful selection of model architectures and hyperparameters tailored to each task.
Semi-supervised learning, which leverages both unlabeled and labeled data, has been one of the long-standing goals of the broader machine learning community (Chapelle et al., 2009). It generally pre-trains a model with a substantial amount of unlabeled data. Then, it transfers the learned representations and fine-tunes the model with a small amount of labeled data for each supervised task. The crux of semi-supervised learning is how to define a proper pre-training task. For example, bidirectional encoder representations from Transformers (BERT) has recently become highly influential in natural language processing (NLP) (Devlin et al., 2018). BERT enabled more effective use of unlabeled text by proposing novel pre-training tasks for NLP, i.e., masked language modeling (MLM) and next sentence prediction (NSP). These tasks guide a model to learn contextualized representations of words and relationships between sentences.
The natural question is then whether protein biology can also take advantage of semi-supervised learning. According to the linguistic hypothesis (AlQuraishi, 2019), naturally occurring proteins are not random. Evolutionary pressure constrains them to a learnable manifold where indispensable structures and functions are maintained. Thus, by observing many proteins, even without any annotations, we can obtain an implicit understanding of the language of proteins. For instance, a couple of works have recently proposed pre-training methods for protein representations (Bepler and Berger, 2019; Alley et al., 2019). They adopted language modeling (LM) from NLP and showed that pre-training helps on various downstream protein tasks. However, as the tasks assessing protein embeddings (TAPE) benchmark results have shown (Rao et al., 2019), the current pre-training methods are still often outperformed by task-specific algorithms with non-neural extracted features. A possible reason is that LM alone is not enough, and a pertinent protein-specific pre-training task is necessary to better capture the information contained within proteins.
In this paper, we introduce a novel pre-training scheme for protein sequence modeling called PLUS, which stands for Protein sequence representations Learned Using Structural information. Taking note of the fact that structural information is essential for understanding the nature of proteins, PLUS consists of MLM and an additional protein-specific pre-training task, namely same family prediction (SFP). SFP leverages computationally clustered protein families (Finn et al., 2014) and trains the model to predict whether a pair of proteins belongs to the same family. PLUS can be used to pre-train various model architectures including a bidirectional recurrent neural network (BiRNN) and the Transformer (TFM), and the resulting models are referred to as PLUS-RNN and PLUS-TFM, respectively. In this work, considering their sequential modeling capability and computational complexity, we mainly use PLUS-RNN. Afterwards, the pre-trained model can be fine-tuned on a variety of downstream tasks without training randomly initialized task-specific models from scratch. It advances the SOTA pre-training methods on six out of seven protein biology tasks, i.e., (1) three protein(-pair)-level classification, (2) two protein-level regression, and (3) two amino-acid-level classification tasks. Finally, we present results from our ablation studies and qualitative interpretation analyses to better understand the strengths of PLUS-RNN.
2 Related Works
2.1 Pre-training natural language representations
Pre-training natural language representations has been the basis of NLP research for a long time. A number of approaches have been proposed, and their shared main component is LM. The key idea is that ideal representations must convey syntactic and semantic information, and thus the representation of a token must be able to predict the surrounding tokens. Note that in such a formulation, all that is needed is a sequence of tokens without any additional labels. For example, traditional word2vec uses a skip-gram model which is directly trained to predict surrounding words given a representation of a center word (Mikolov et al., 2013).
While early approaches learned context-independent representations, embeddings from language models (ELMo) generalized them to learn contextualized representations by adopting forward and reverse RNNs (Peters et al., 2018). Given a sequence of tokens, the forward RNN sequentially processes the sequence left-to-right, and it is trained to predict the next token given its history. The reverse RNN is similar but processes the sequence in reverse, right-to-left. After the pre-training, hidden states of both RNNs are collapsed into a single vector representation for each token. Thus, unlike the previous word2vec, the same token can be transformed into different representations based on its contexts.
The major limitation of ELMo is that each RNN is trained using unidirectional LM and simply combined afterwards. In contrast, BERT first proposed to pre-train bidirectional natural language representations using a multi-layer bidirectional TFM (Devlin et al., 2018). The key element of the TFM is a self-attention layer composed of multiple individual attention heads (Vaswani et al., 2017). Given an input sequence $x = (x_1, \dots, x_n)$, an attention head computes the output sequence $z = (z_1, \dots, z_n)$. Each output token $z_i$ is a weighted sum of values, computed by a weight matrix $W^V$:
$$z_i = \sum_{j=1}^{n} \alpha_{ij}\,(x_j W^V). \qquad (1)$$
Each attention coefficient $\alpha_{ij}$ is the output of a softmax function applied to the dot products of the query with all keys, computed by $W^Q$ and $W^K$:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}, \qquad e_{ij} = \frac{(x_i W^Q)(x_j W^K)^{\top}}{\sqrt{d}}, \qquad (2)$$
where $d$ is the output token dimension. Note that the self-attention layer directly performs computations for all the pairwise tokens, whereas a recurrent layer requires sequential computations for the farthest pair. It allows easier traversal for forward and backward signals, and thus, enables better capturing of long-range dependencies.
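To make the attention computation described above concrete, the following is a minimal pure-Python sketch of a single attention head. The helper names (`matvec`, `attention_head`), the toy inputs, and the identity projection matrices are illustrative assumptions and are not taken from any released implementation; the scaling uses the value dimension as the output token dimension.

```python
import math

def matvec(x, W):
    """Multiply a row vector x (length d_in) by a matrix W (d_in x d_out)."""
    return [sum(x[k] * W[k][c] for k in range(len(x))) for c in range(len(W[0]))]

def attention_head(xs, W_q, W_k, W_v):
    """Return (outputs z, coefficients alpha) for one self-attention head."""
    queries = [matvec(x, W_q) for x in xs]
    keys = [matvec(x, W_k) for x in xs]
    values = [matvec(x, W_v) for x in xs]
    d = len(values[0])  # output token dimension used for scaling
    outputs, alphas = [], []
    for q in queries:
        # Scaled dot products of this query with every key.
        e = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(e)  # subtract the maximum for a numerically stable softmax
        w = [math.exp(v - m) for v in e]
        total = sum(w)
        alpha = [v / total for v in w]
        alphas.append(alpha)
        # Each output token is the alpha-weighted sum of the value vectors.
        outputs.append([sum(alpha[j] * values[j][c] for j in range(len(xs)))
                        for c in range(d)])
    return outputs, alphas

# Toy example: three tokens with 2-dimensional embeddings and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z, alpha = attention_head(tokens, identity, identity, identity)
```

Every token attends to every other token in the inner loop, which is the source of the quadratic cost in sequence length discussed later for the TFM.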
The main contribution of BERT is that it introduced novel pre-training tasks for a multi-layer bidirectional model. Since its bidirectional conditioning allows each token to indirectly see itself, it cannot be pre-trained with the conventional LM. Instead, BERT resolved the problem by proposing an MLM task. It simply masks some input tokens at random and trains the model to predict them from the contexts. In addition, BERT adopts an NSP task which enables learning sentence relationships by training a model to predict whether a given pair of sentences is consecutive.
2.2 Pre-training protein sequence representations
Taking advantage of similarities to NLP, there is a long history of NLP-based methods adapted to learn protein sequence representations. Early approaches have focused on learning context-independent representations. For example, ProtVec (Asgari and Mofrad, 2015) and doc2vec (Yang et al., 2018) generate non-overlapping 3-mers from protein sequences and pre-train their representations based on a skip-gram model from word2vec.
The previous works most closely related to our paper are the recently published P-ELMo (Bepler and Berger, 2019) and UniRep (Alley et al., 2019). P-ELMo proposed a two-phase pre-training scheme. First, it trains tied forward and reverse RNNs using the conventional LM with an unlabeled dataset. Then, on top of them, it adopts another BiRNN trained by supervised learning with a small labeled dataset. The supervised pre-training is significant for incorporating structural information. However, it relies on a small, highly refined dataset, which deviates from the goal of utilizing large datasets that require little human effort. Similarly, UniRep used a unidirectional RNN model with multiplicative long short-term memory (mLSTM) hidden units (Krause et al., 2016) and trained the model using the conventional LM.
The current pre-training methods for protein sequences have two major limitations. First, as in the previous methods in NLP, they still learn unidirectional representations. Such representations are sub-optimal for numerous protein biology tasks, where it is crucial to assimilate global information from both directions. Second, they depend solely on LM for pre-training with an unlabeled dataset. While LM is a simple and effective task, an additional pre-training task tailored to each data modality is often the key to further improving the quality of representations. For instance, in NLP, BERT adopted the NSP task; a lite BERT (ALBERT) devised a complementary sentence order prediction (SOP) task to model inter-sentence coherence and showed consistent performance improvements for downstream tasks (Lan et al., 2019). In fact, as shown by the recent TAPE benchmark results (Rao et al., 2019), the current protein pre-training methods are still often outperformed by task-specific algorithms with non-neural extracted features. There could be many contributing factors, such as their unidirectional RNN models, the size of the unlabeled datasets, and the complexity of the models. However, it might also indicate that LM alone is not enough, and that a pertinent protein-specific pre-training task is necessary to better capture the information contained within proteins.
We introduce PLUS, a novel pre-training scheme for protein sequence modeling (Figure 1). In the following, we will explain the details of the pre-training dataset, the model architectures, and the pre-training and fine-tuning procedures.
3.1 Pre-training dataset
As in P-ELMo and TAPE, we use Pfam release 27.0 (Finn et al., 2014) as the pre-training dataset. It contains a total of 21,827,419 protein sequences clustered into 16,479 families. Each protein family is computationally constructed by comparing the sequence similarity of proteins using alignments or HMMs. Due to the loose connection between sequence and structure similarities, the family labels only provide weak structural information. Nonetheless, we empirically show that the size of the dataset compensates for this weakness and can help the model to learn structurally contextualized representations.
We use training and test sets divided by a random 80/20% split and filter out the sequences shorter than 20 amino acids. Additionally, for the training set, we also remove the families containing fewer than 1,000 proteins. This results in 14,670,860 sequences from 3,150 families used for the following PLUS pre-training. Note that we have not done any ablation studies for the filtering conditions, and other conditions may improve the results. Both filtered and unfiltered datasets are available in our repository.
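The split and filtering steps above can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_and_filter`, the `(family_id, sequence)` record layout, and the order of operations (split first, then length filter, then family-size filter counted on the training set) are hypothetical choices, since the text does not specify them.

```python
import random
from collections import Counter

def split_and_filter(records, min_len=20, min_family_size=1000,
                     train_frac=0.8, seed=0):
    """records: list of (family_id, sequence) pairs. Returns (train, test)."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(train_frac * len(shuffled))
    train, test = shuffled[:cut], shuffled[cut:]
    # Drop sequences shorter than min_len amino acids from both sets.
    train = [(f, s) for f, s in train if len(s) >= min_len]
    test = [(f, s) for f, s in test if len(s) >= min_len]
    # For the training set only, drop families with too few proteins.
    # (Assumption: family sizes are counted after the split and length filter.)
    sizes = Counter(f for f, _ in train)
    train = [(f, s) for f, s in train if sizes[f] >= min_family_size]
    return train, test
```

With the thresholds from the text, a call would look like `split_and_filter(records, min_len=20, min_family_size=1000)`; for toy data, a smaller `min_family_size` is more practical.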
3.2 Model architecture
PLUS can be used to pre-train various model architectures including BiRNN and TFM, and the resulting models are referred to as PLUS-RNN and PLUS-TFM, respectively. In this work, we mainly use PLUS-RNN based on its two advantages over PLUS-TFM. First, it is more effective for learning the sequential nature of proteins. The self-attention layer of the TFM performs dot products between all pairwise tokens regardless of their positions within the sequence (Equation 2). In other words, it gives equal opportunity to local and long-range contexts to determine the representations. While this facilitates learning long-range dependencies, its downside is that it completely ignores locality bias within a sequence. This is particularly problematic for protein biology, where local amino acid motifs often have more significant structural and functional implications (Bailey et al., 2006). In contrast, an RNN processes a sequence sequentially, so local contexts are naturally emphasized more.
Second, PLUS-RNN provides lower computational complexity. Although it depends on the model hyperparameters, TFMs generally demand a huge scale and have a larger number of parameters than RNNs. Furthermore, the computations between all pairwise tokens in the self-attention layer place a huge computational burden that scales quadratically with the input sequence length. Considering that pre-training typical TFMs handling 512 tokens already requires tremendous resources (Devlin et al., 2018), it is computationally difficult to use TFMs for longer protein sequences, which can reach a couple of thousand amino acids.
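Given a protein sequence $x = (x_1, \dots, x_n)$, where each token $x_i$ denotes an amino acid, an input embedding layer first maps each $x_i$ to a $d_e$-dimensional vector $e_i$.
Then, a BiRNN of $L$ layers obtains bidirectional representations as a function of the entire sequence. We use long short-term memory (LSTM) as the basic unit of the BiRNN (Hochreiter and Schmidhuber, 1997). In each layer $l$, it computes $d_h$-dimensional forward and backward hidden states ($\overrightarrow{h}_i^{l}$ and $\overleftarrow{h}_i^{l}$) and combines them into a hidden state $h_i^{l}$ with a non-linear transformation:
$$\overrightarrow{h}_i^{l} = \sigma\big(\overrightarrow{W}_x^{l} h_i^{l-1} + \overrightarrow{W}_h^{l} \overrightarrow{h}_{i-1}^{l} + \overrightarrow{b}^{l}\big), \qquad \overleftarrow{h}_i^{l} = \sigma\big(\overleftarrow{W}_x^{l} h_i^{l-1} + \overleftarrow{W}_h^{l} \overleftarrow{h}_{i+1}^{l} + \overleftarrow{b}^{l}\big),$$
$$h_i^{l} = \big[\overrightarrow{h}_i^{l};\, \overleftarrow{h}_i^{l}\big] \quad \text{for } l = 1, \dots, L,$$
where $h_i^{0} = e_i$; $W$ and $b$ are weight matrices and bias vectors, respectively, and $\sigma$ denotes the non-linear recurrent update, written here in simplified form. We use the final hidden states $h_i^{L}$ as high-dimensional representations $r$ of each amino acid:
$$r = (r_1, \dots, r_n), \qquad r_i = h_i^{L}.$$
We adopt an additional projection layer to obtain smaller $d_z$-dimensional representations $z$ of each amino acid with a linear transformation:
$$z = (z_1, \dots, z_n), \qquad z_i = \mathrm{Proj}(r_i).$$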
During pre-training, in order to reduce computational complexity, we use r and z for the MLM and SFP tasks, respectively. During fine-tuning, we can use either r or z, chosen according to performance on the development set or to computational constraints.
In this work, we primarily use two model sizes while fixing the input embedding dimension $d_e$ and the projection dimension $d_z$ as 21 and 100, respectively:
PLUS-RNN\textsubscript{BASE}: number of layers $L$ = 3, hidden state dimension $d_h$ = 512, number of parameters = 15M
PLUS-RNN\textsubscript{LARGE}: number of layers $L$ = 3, hidden state dimension $d_h$ = 1024, number of parameters = 59M
The former is chosen to match the BiRNN in P-ELMo. However, since P-ELMo also uses the forward and reverse RNNs, PLUS-RNN\textsubscript{BASE} has fewer than half as many parameters as P-ELMo (32M).
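As a rough sanity check on the model sizes listed above, the parameter counts can be estimated with a few lines of arithmetic. The sketch below assumes a standard LSTM cell with four gates and two bias vectors per gate (the PyTorch convention), a 21-dimensional input fed directly into the first layer, and a single linear projection; task-specific output layers are ignored. The function names are hypothetical.

```python
def lstm_direction_params(input_dim, hidden_dim):
    """One LSTM direction: 4 gates, input and recurrent weights, two bias vectors."""
    return 4 * (hidden_dim * (input_dim + hidden_dim) + 2 * hidden_dim)

def plus_rnn_params(num_layers, hidden_dim, embed_dim=21, proj_dim=100):
    """Approximate parameter count of a stacked bidirectional LSTM plus projection."""
    total = 0
    layer_input = embed_dim
    for _ in range(num_layers):
        total += 2 * lstm_direction_params(layer_input, hidden_dim)  # forward + backward
        layer_input = 2 * hidden_dim  # the next layer sees the concatenated state
    total += 2 * hidden_dim * proj_dim + proj_dim  # linear projection to z
    return total

base = plus_rnn_params(num_layers=3, hidden_dim=512)
large = plus_rnn_params(num_layers=3, hidden_dim=1024)
print(f"BASE:  {base / 1e6:.1f}M parameters")
print(f"LARGE: {large / 1e6:.1f}M parameters")
```

Under these assumptions the estimates come out at roughly 14.9M and 59.1M, which is in line with the reported 15M and 59M.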
3.3 Pre-training procedure
Now, we explain the pre-training procedure of PLUS (Figure 1). In contrast to the previous approaches, it learns bidirectional representations based on two pre-training tasks, i.e., MLM and SFP, designed to assimilate global structural information. For the complete pre-training loss, we use a pre-training loss weight $\lambda_{\text{PT}}$ (the pre-training loss lambda) to control their relative importance. To the best of our knowledge, this is also the first work to pre-train a BiRNN with MLM.
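For illustration, one simple way to combine two task losses with a single weight is a convex combination. The form below is an assumption made for exposition (this section does not state the exact formula), and the symbols $\mathcal{L}_{\text{MLM}}$ and $\mathcal{L}_{\text{SFP}}$ are introduced here only to denote the two task losses:

```latex
\mathcal{L}_{\text{PT}}
  = \lambda_{\text{PT}} \, \mathcal{L}_{\text{MLM}}
  + (1 - \lambda_{\text{PT}}) \, \mathcal{L}_{\text{SFP}},
\qquad 0 \le \lambda_{\text{PT}} \le 1 .
```

Under this assumed form, a weight of 1 corresponds to pre-training with the MLM task only and a weight of 0 to pre-training with the SFP task only, which matches the kind of comparison made in the ablation studies.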
Task #1: Masked Language Modeling (MLM)
Given a protein sequence x, we randomly select 15% of the input amino acids. Then, for each selected amino acid $x_i$, we perform one of the following random masking actions. For 80% of the time, we replace $x_i$ with the token denoting the unspecified amino acid. For 10% of the time, we randomly replace $x_i$ with one of the 20 proteinogenic amino acids. Finally, for the remaining 10%, we keep $x_i$ intact. The purpose of the last action is to bias the representations towards the true amino acids.
Given a masked protein sequence, PLUS-RNN produces bidirectional representations and the MLM decoder computes log probabilities over the 20 amino acid types. The MLM task trains the model to maximize the probabilities corresponding to the masked amino acids. As PLUS-RNN is asked to predict randomly masked amino acids given their contexts, the MLM task enables the model to learn bidirectional contextual representations throughout the entire protein sequence.
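The masking procedure can be sketched as follows. The function name `mask_for_mlm`, the use of `X` as the symbol for the unspecified amino acid, and the per-position Bernoulli selection (which selects 15% of positions only in expectation) are assumptions made for this illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 proteinogenic amino acids
UNSPECIFIED = "X"  # assumed symbol for the unspecified amino acid token

def mask_for_mlm(sequence, rng, select_prob=0.15):
    """Return (masked_sequence, targets); targets maps position -> true amino acid."""
    masked = list(sequence)
    targets = {}
    for i, residue in enumerate(sequence):
        if rng.random() >= select_prob:
            continue  # position not selected for prediction
        targets[i] = residue
        action = rng.random()
        if action < 0.8:
            masked[i] = UNSPECIFIED  # 80%: replace with the unspecified token
        elif action < 0.9:
            masked[i] = rng.choice(AMINO_ACIDS)  # 10%: random amino acid
        # remaining 10%: keep the original amino acid
    return "".join(masked), targets

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
masked, targets = mask_for_mlm(sequence, random.Random(0))
```

The MLM loss would then be computed only at the positions stored in `targets`, including those whose input token was left unchanged.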
Task #2: Same Family Prediction (SFP)
Although the MLM task is simple and effective, it is not tailored for protein biology. Considering that complementary pre-training tasks are often the key to further improving the quality of representations, we devise a pertinent protein-specific pre-training task. The SFP task leverages computationally clustered weak family labels from the Pfam dataset. By training a model to predict whether a given protein pair belongs to the same protein family, we empirically show that the model can better capture global structural information of proteins.
In order to pre-train PLUS-RNN with the SFP pre-training task, we sample two protein sequences $x^{(1)}$ and $x^{(2)}$ from the Pfam dataset. For 50% of the time, the two sequences are sampled from the same protein family. For the other 50%, they are randomly sampled from different protein families. Note that, in contrast to BERT pre-training, we do not need to consider the lengths of the input sequences during the sampling process, since we use the BiRNN instead of the TFM.
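A minimal sketch of this pair sampling is given below. The function name `sample_sfp_pair` and the dictionary layout are hypothetical, and sampling families uniformly (rather than, say, in proportion to family size) is an assumption, since the text does not specify the sampling distribution.

```python
import random

def sample_sfp_pair(families, rng):
    """families: dict family_id -> list of sequences. Returns (seq_a, seq_b, label)."""
    if rng.random() < 0.5:
        # Positive pair: two distinct members of the same family.
        eligible = [f for f, seqs in families.items() if len(seqs) >= 2]
        fam = rng.choice(eligible)
        seq_a, seq_b = rng.sample(families[fam], 2)
        return seq_a, seq_b, 1
    # Negative pair: one member from each of two different families.
    fam_a, fam_b = rng.sample(sorted(families), 2)
    return rng.choice(families[fam_a]), rng.choice(families[fam_b]), 0
```

The returned label (1 for the same family, 0 otherwise) serves as the target for the cross-entropy loss of the SFP task.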
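PLUS-RNN transforms a protein pair into sequences of representations $z^{(1)} = (z^{(1)}_1, \dots, z^{(1)}_{n_1})$ and $z^{(2)} = (z^{(2)}_1, \dots, z^{(2)}_{n_2})$. Then, we use soft-align comparison (Bepler and Berger, 2019) to compute their similarity score $\hat{c}$ as a negative weighted sum of $\ell_1$-distances between every $z^{(1)}_i$ and $z^{(2)}_j$ pair:
$$\hat{c} = -\frac{1}{C} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \omega_{ij} \, \lVert z^{(1)}_i - z^{(2)}_j \rVert_1, \qquad C = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \omega_{ij},$$
where the weight $\omega_{ij}$ of each $\ell_1$-distance is computed by
$$\omega_{ij} = 1 - (1 - \alpha_{ij})(1 - \beta_{ij}), \qquad \alpha_{ij} = \frac{\exp(-\lVert z^{(1)}_i - z^{(2)}_j \rVert_1)}{\sum_{k=1}^{n_2} \exp(-\lVert z^{(1)}_i - z^{(2)}_k \rVert_1)}, \qquad \beta_{ij} = \frac{\exp(-\lVert z^{(1)}_i - z^{(2)}_j \rVert_1)}{\sum_{k=1}^{n_1} \exp(-\lVert z^{(1)}_k - z^{(2)}_j \rVert_1)}.$$
Intuitively, we can understand the soft-align comparison as computing an expected alignment score, where the expectation is over all possible alignments. We suppose that the smaller the distance between two representations is, the more likely the pair of amino acids is to be aligned. Then, we can consider $\alpha_{ij}$ as a probability that $z^{(1)}_i$ is aligned to $z^{(2)}_j$ considering all the amino acids from $z^{(2)}$ (vice versa for $\beta_{ij}$). As a result, $\hat{c}$ is the expected alignment score over all possible alignments with probabilities $\omega_{ij}$. Note that the negative signs are for converting distances into scores, and thus, a higher value of $\hat{c}$ indicates that the pair of protein sequences is structurally more similar.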
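The soft-align comparison can be sketched in plain Python as below. This follows our reading of the formulation of Bepler and Berger (2019), in which each pair weight combines a softmax over the positions of the second sequence with a softmax over the positions of the first; the function names and the toy inputs are illustrative assumptions.

```python
import math

def l1(u, v):
    """L1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [v / total for v in w]

def soft_align_score(z1, z2):
    """Negative weighted mean of L1 distances between all pairs of positions."""
    dist = [[l1(u, v) for v in z2] for u in z1]
    n1, n2 = len(z1), len(z2)
    # alpha[i][j]: probability that position i of z1 aligns to position j of z2.
    alpha = [softmax([-d for d in row]) for row in dist]
    # beta_t[j][i]: probability that position j of z2 aligns to position i of z1.
    beta_t = [softmax([-dist[i][j] for i in range(n1)]) for j in range(n2)]
    num = den = 0.0
    for i in range(n1):
        for j in range(n2):
            omega = 1.0 - (1.0 - alpha[i][j]) * (1.0 - beta_t[j][i])
            num += omega * dist[i][j]
            den += omega
    return -num / den

same = soft_align_score([[0.0], [10.0]], [[0.0], [10.0]])
diff = soft_align_score([[0.0], [10.0]], [[5.0], [5.0]])
```

In this toy example the identical pair receives a score close to zero, while the dissimilar pair receives a clearly lower (more negative) score, as intended for a similarity measure.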
Given the similarity score, the output layer finally computes a probability that the pair belongs to the same protein family. The SFP task trains the model to minimize the cross-entropy loss between the true label and the predicted probability. As PLUS-RNN is asked to produce higher similarity scores for proteins from the same families, the SFP task enables the model to better assimilate global structural information.
3.4 Fine-tuning procedure
The fine-tuning procedure of PLUS-RNN is straightforward and follows the conventional usage of BiRNN-based prediction models. For each downstream task, we only add one hidden layer and one output layer on top of the pre-trained model. Then, all the parameters are fine-tuned with task-specific datasets and loss functions. For the complete fine-tuning loss, we use a fine-tuning loss weight $\lambda_{\text{FT}}$ (the fine-tuning loss lambda) to control the relative importance of the classification and regularization losses.
For tasks involving a protein pair, we use the same computations used in the SFP pre-training task. Specifically, we only replace the SFP output layer with a task-specific output layer. For single protein-level tasks, we adopt an additional attention layer to aggregate variable-length representations into a single vector (Bahdanau et al., 2014). Then, the aggregated vector is fed into the hidden and output layers. For amino-acid-level tasks, representations of each amino acid are fed into the hidden and output layers.
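For single protein-level tasks, the aggregation of variable-length representations into one vector can be illustrated with a simple attention pooling step. The sketch below is an assumption-laden simplification: it scores each amino acid representation with a dot product against a single hypothetical learnable vector `w`, whereas the exact scoring function of the attention layer is not specified in this section.

```python
import math

def attention_pool(reps, w):
    """Aggregate per-residue vectors `reps` into one vector using scores w . r_i."""
    scores = [sum(a * b for a, b in zip(w, r)) for r in reps]
    m = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [v / total for v in weights]
    dim = len(reps[0])
    # Weighted sum over positions yields a fixed-size vector for any length.
    return [sum(weights[i] * reps[i][c] for i in range(len(reps)))
            for c in range(dim)]

pooled_short = attention_pool([[1.0, 0.0], [3.0, 0.0]], [0.0, 0.0])
pooled_long = attention_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0.1, -0.2])
```

Regardless of the sequence length, the pooled vector has the same dimension as a single amino acid representation, so it can be fed into the hidden and output layers directly.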
All PLUS models are implemented in PyTorch (Paszke et al., 2017) and trained on either NVIDIA V100 or P40 GPUs. Additional pre-training and fine-tuning details are provided in Appendix A.1. In the following, we will explain the compared baselines, pre-training results, and fine-tuning evaluation results on seven protein biology benchmark tasks.
For comparative evaluations, we use several baselines. First, in all of the seven downstream supervised tasks, we benchmark PLUS-RNN models against two alternative pre-training methods, i.e., P-ELMo and PLUS-TFM. They are implemented and pre-trained in the same experimental setup except for the filtering conditions of the pre-training dataset. P-ELMo uses 20% (2.8M) more protein sequences for pre-training. PLUS-TFM is analogous to the BERT\textsubscript{BASE} model consisting of 110M parameters. Due to its huge computational burden, which scales quadratically with the input sequence length, we pre-train PLUS-TFM using only the protein sequences shorter than 512 amino acids. Then, for the fine-tuning, since it failed to generalize to long protein sequences, we truncate the longer sequences to 512 amino acids. More details are provided in the ablation studies.
Second, for the TAPE benchmark tasks (Stability, Fluorescence, and SecStr), we also compare the results from their baseline models: the TFM, an RNN, a dilated residual network (ResNet) (Yu et al., 2017), P-ELMo, and UniRep. We note that these comparisons are in their favor, since they were pre-trained with more than twice the number of protein sequences (32,207,059 sequences from Pfam release 32.0). The training, development, and test data splits are identical to those used for the PLUS-RNN evaluations.
Finally, we benchmark PLUS-RNN models against task-specific SOTA algorithms with or without non-neural extracted features. As explained, the previous protein pre-training methods are still often outperformed by them. Therefore, we show in which type of tasks the proposed method can help most and outperform the current SOTA algorithms without pre-training.
Table 1 footnotes: (i) excerpted from TAPE; (ii) results from our implementation.
4.2 Pre-training results
Table 1 shows the test accuracies on the MLM and SFP pre-training tasks. Only PLUS models, which are pre-trained with the SFP task, are evaluated on the SFP task. We should be careful when comparing the results from TAPE and our experiments, since they used different test datasets (27.0 for PLUS and 32.0 for TAPE). Nonetheless, we can still indirectly compare them considering the following: (1) The test datasets are both randomly sampled protein sequences from different versions of the Pfam dataset. (2) P-ELMo shows similar LM accuracies in TAPE (0.28) and our experiments (0.29).
We can see that some models have lower LM accuracies than the others. However, lower LM capability does not exactly correspond to performance in the fine-tuning tasks. This discrepancy has been previously observed in TAPE, and it can also be observed in the following sections. In terms of SFP, all PLUS models show high accuracies. This is probably because it is a relatively easy task. Since the Pfam families are constructed based only on sequence similarities, a pair of similar sequences would probably be from the same family. Despite its simplicity, we empirically show that the SFP task complements the MLM task by forcing the models to compare representations of protein sequences during pre-training.
| | | Protein(-pair)-level Classification | | | Protein-level Regression | | Amino-acid-level Classification | |
| | Method | Homology | Solubility | Localization | Stability | Fluorescence | SecStr | Transmembrane |
| w/o PT | SOTA (w/o features) | 0.85 | 0.73 | 0.44 | 0.63 | 0.22 | 0.57 | 0.62 |
| w/o PT | SOTA (w/ features) | 0.62 | 0.77 | 0.73 | N/A | N/A | 0.63 | 0.80 |
| w/ PT | Pre-training SOTA | 0.91 | 0.64 | 0.54 | 0.73 | 0.68 | 0.61 | 0.78 |
For each task, the highest score in the pre-training category is in bold. It is bold and underlined if it is the highest including those w/o pre-training.
4.3 Fine-tuning results
We evaluate PLUS on seven protein biology tasks. Detailed information on each task is provided in Appendix A.2.
Homology. Homology is a protein-pair-level classification task (Fox et al., 2013). The goal is to classify the structural similarity of proteins into family, superfamily, fold, class, and none. We report accuracy and the Pearson correlation $r$ between predicted similarity scores and true similarity levels.
Solubility. Solubility is a protein-level binary classification task (Khurana et al., 2018). The goal is to predict whether a protein is soluble or insoluble. We report accuracy for this task.
Localization. Localization is a protein-level classification task (Armenteros et al., 2017). The goal is to classify a protein into ten subcellular locations. We report accuracy for this task.
Stability. Stability is a protein-level regression task (Rocklin et al., 2017). The goal is to predict a real-valued proxy for the intrinsic stability. This task is from the TAPE benchmark and we report the Spearman correlation $\rho$.
Fluorescence. Fluorescence is a protein-level regression task (Sarkisyan et al., 2016). The goal is to predict the log-fluorescence intensity. This task is from the TAPE benchmark and we report the Spearman correlation $\rho$.
Secondary structure (SecStr). SecStr is an amino-acid-level classification task (Klausen et al., 2019). The goal is to classify each amino acid into eight classes describing its local structure. This task is from the TAPE benchmark and we report accuracy for this task.
Transmembrane. Transmembrane is an amino-acid-level classification task (Tsirigos et al., 2015). The goal is to detect segments of an amino acid sequence which cross the cell membrane. We report accuracy for this task.
Table 2 presents summarized results for the seven downstream fine-tuning tasks. To be concise, besides the PLUS models, we show the best result for each of three categories: previous pre-training models (i.e., P-ELMo, UniRep, and the baseline models from TAPE), task-specific algorithms without pre-training which only use the raw protein sequences, and those with non-neural extracted features. Note that the best performing model can differ from task to task. Since we report accuracy, Pearson correlation $r$, or Spearman correlation $\rho$ depending on the task, higher values are always better. Detailed results for Homology and SecStr are provided in the following subsections; those for the other tasks are in Appendix A.2. We denote pre-training as PT in the tables.
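For reference, the three kinds of evaluation metrics mentioned above can be computed as in the following sketch. These are the standard textbook definitions written in plain Python (with average ranks for ties in the Spearman correlation); the function names are illustrative and not taken from the released code.

```python
import math

def accuracy(pred, true):
    """Fraction of predictions equal to the true labels."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

All three metrics increase with prediction quality, which is why higher values are better in every column of the summary table.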
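Table 3 (Homology). Test datasets: | SCOPe 2.06 | SCOPe 2.07 |. Footnotes: (i) excerpted from P-ELMo; (ii) results from our implementation.
Table 4 (SecStr). | ConvLSTM (w/ features)† | 0.63 | 0.61 | 0.68 |. Footnotes: (i) excerpted from TAPE; (ii) results from our implementation.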
We can see that PLUS-RNN\textsubscript{BASE} performs comparably to the previous pre-training methods. PLUS-RNN\textsubscript{LARGE} further improves the performance and advances the previous SOTA pre-training methods on six out of seven protein biology tasks of different types. Considering that some alternative methods showed higher LM capabilities, the performance improvements can be attributed to the pertinent protein-specific SFP task. In the ablation studies, we further explain the relative importance of each aspect of PLUS. Although PLUS-TFM has almost twice as many parameters as PLUS-RNN\textsubscript{LARGE} (110M vs. 59M), it only shows performance comparable to the latter. This coincides with the expectation that PLUS-TFM is less effective for protein sequences due to its disregard of locality bias.
Compared to the task-specific algorithms, PLUS-RNN\textsubscript{LARGE} achieves the highest scores on four tasks but lags behind on the others. This suggests that the non-neural extracted features provide salient information which still could not be learned from the pre-training. We conjecture that simultaneous observation of multiple protein sequences could be one of the key strengths of the alignment-based features. In contrast, the MLM pre-training task exploits each protein sequence individually, and the SFP pre-training task still only exploits pairwise information.
Homology and SecStr results
For further analyses, we present detailed evaluation results for the Homology and SecStr tasks. We chose the two tasks because they are representative protein biology tasks relevant to global and local structures, respectively. Improved results on the former can lead to the discovery of new enzymes and antibiotic resistance genes (Tavares et al., 2013). The latter is important for understanding the function of proteins for which evolutionary structural information is not available (Klausen et al., 2019).
The detailed Homology prediction results are presented in Table 3. SCOPe 2.06 and 2.07 denote test datasets. We used the same experiment settings as in P-ELMo and compared the results with four alignment-based SOTA algorithms (Eddy, 2004; Söding et al., 2005; Zhang and Skolnick, 2005; Finn et al., 2011). Surprisingly, all the pre-training methods outperform the alignment-based SOTA by large margins. This indicates that incorporating structural information during pre-training enables inferring global structural similarities of proteins even better than relying on their sequence similarities. Furthermore, the correlation differences between PLUS-RNN\textsubscript{LARGE} and P-ELMo are small but statistically significant with p-values less than (Steiger, 1980). The result supports that even though the family labels from Pfam only provide weak structural information, they help us learn improved structurally contextualized representations.
The detailed SecStr prediction results are presented in Table 4. CB513, CASP12, and TS115 denote test datasets. We used the same experiment settings as in TAPE. The results show that the LM helps and the SFP pre-training task further improves structurally contextualized representations. However, compared to the SOTA algorithm with alignment-based features (Remmert et al., 2012), PLUS models are still weaker at learning local structural information. This is probably because the effect of local structures is negligible for the SFP task, which only requires understanding the global structure. Therefore, we believe that devising a more difficult pre-training task relevant to local structural information could improve the performance on the SecStr task.
4.4 Ablation studies
In the following, we show results from various ablation studies on the Homology task to better understand the strengths of each aspect of the PLUS framework. We use PLUS-RNN\textsubscript{BASE} as the baseline model unless explicitly stated otherwise. Note that we use the development set for the ablation studies.
|Loss Lambda||Overall Performance|
Note: We use the development set and training details are unchanged.
First, we explore the effect of different values of the pre-training loss lambda, which controls the relative importance of the MLM and SFP tasks. Because the differences are subtle, we report three decimal places in Table 5. The results show that pre-training is always helpful and that the value of the pre-training loss lambda has only a small influence. As expected, between the two pre-training tasks, the MLM task plays the primary role and the SFP task complements it. Removing the MLM task hurts the prediction performance significantly more than removing the SFP task.
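As a minimal sketch of how a single loss lambda can trade off the two pre-training tasks, the snippet below assumes a convex combination of the two task losses. The function name, the argument names, and the exact form of the weighting are illustrative assumptions, not the definition used for PLUS.

```python
def combined_pretraining_loss(mlm_loss: float, sfp_loss: float, lam: float) -> float:
    """Weighted sum of two task losses (assumed form, for illustration only).

    lam = 1.0 keeps only the MLM loss (SFP removed);
    lam = 0.0 keeps only the SFP loss (MLM removed).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * mlm_loss + (1.0 - lam) * sfp_loss


# Hypothetical per-batch loss values, chosen only to show the weighting.
print(combined_pretraining_loss(mlm_loss=2.0, sfp_loss=4.0, lam=0.5))
```

Under this assumed form, the two ablations discussed above (removing MLM or removing SFP) correspond to the end points of the lambda range.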
Next, we explore the effect of different values of the fine-tuning loss lambda. We again report three decimal places in Table 6. SL denotes the supervised classification loss from the Homology task. In addition to the SL, we can simultaneously fine-tune the model on additional regularization tasks. Specifically, we tried additional MLM and contact map prediction (CMP) tasks. The goal of the CMP task is to predict whether an amino acid pair makes contact in the three-dimensional structure. Note that contact map labels are not easily obtainable, and thus the CMP task can only be used during fine-tuning. The results show that using the SL and MLM tasks with a proper fine-tuning loss lambda performs the best. The CMP task, which was proposed in P-ELMo to further incorporate structural information, provides only small improvements compared to the MLM task. This indicates that PLUS-RNN has already learned sufficient structural information from the pre-training, and that the MLM task serves as a better regularizer than the CMP task.
Finally, we compare the performance of PLUS-TFM and PLUS-RNN\textsubscript{LARGE} for protein pairs of different lengths (Figure 2). We denote protein pairs longer than 512 amino acids as Long and all others as Short. We evaluate PLUS-TFM for the Long protein pairs in two ways: (1) We simply use the protein pairs as they are. (2) We truncate them to 512 amino acids. The former is denoted as PLUS-TFM-EXT (as in extended) and the latter is denoted as PLUS-TFM.
The results show that PLUS-RNN\textsubscript{LARGE} consistently performs well regardless of the protein lengths. On the other hand, PLUS-TFM-EXT deteriorates for the Long protein pairs, whereas PLUS-TFM shows less performance degradation. The results clearly show the limitation of TFM models with a limited context size of 512 amino acids. Although Long protein pairs make up a relatively small share (13.4%) of the current Homology development dataset, handling long protein sequences is indispensable for analyzing complex proteins that are found in nature. Since this limitation stems from the computational burden of TFM, which scales quadratically with the input sequence length, we expect that the recently proposed adaptive attention span (Sukhbaatar et al., 2019) may help improve PLUS-TFM in the future.
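To make the CMP target concrete, the following sketch derives binary contact labels from three-dimensional residue coordinates. The 8 angstrom cutoff, the use of one coordinate per residue, and the function name are assumptions made for illustration; the text does not specify how the contact labels were defined.

```python
import math
from typing import List, Sequence


def contact_map(coords: Sequence[Sequence[float]], threshold: float = 8.0) -> List[List[int]]:
    """Return an n x n matrix with 1 where two residues lie closer than `threshold`.

    `coords` holds one (x, y, z) position per residue; the default threshold
    is an assumed, commonly used cutoff, not a value taken from the paper.
    """
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) < threshold else 0 for j in range(n)]
        for i in range(n)
    ]


# Toy coordinates: residues 0 and 1 are 3 units apart, residue 2 is far away.
toy = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
for row in contact_map(toy):
    print(row)
```

Because such labels require a solved structure for every training protein, they are far scarcer than sequences alone, which is consistent with restricting the CMP task to fine-tuning.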
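The Long/Short split and the truncation used for PLUS-TFM can be sketched as follows. The sketch assumes that a pair counts as Long when either protein exceeds 512 amino acids and that truncation keeps the first 512 residues of each protein; both choices, and the function names, are illustrative assumptions rather than details stated in the text.

```python
from typing import Tuple

MAX_LEN = 512  # context size of the TFM models, as stated in the text


def is_long_pair(seq_a: str, seq_b: str, max_len: int = MAX_LEN) -> bool:
    """Assumed rule: a pair is Long if either sequence exceeds max_len."""
    return len(seq_a) > max_len or len(seq_b) > max_len


def truncate_pair(seq_a: str, seq_b: str, max_len: int = MAX_LEN) -> Tuple[str, str]:
    """Keep only the first max_len residues of each sequence (assumed strategy)."""
    return seq_a[:max_len], seq_b[:max_len]


# Hypothetical pair: one long protein and one short protein.
a, b = "M" * 600, "K" * 100
print(is_long_pair(a, b))                      # the pair is Long under the assumed rule
print([len(s) for s in truncate_pair(a, b)])   # lengths after truncation
```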
4.5 Qualitative analyses
To better understand the strengths of PLUS-RNN, we provide qualitative analyses. We use the Homology task and interpret how the learned protein representations help infer the global structural similarities of proteins.
In order to compare two proteins, PLUS-RNN transforms them into sequences of representations and uses soft-align to compute their similarity score (Equation 7). Although the output layer applies one further computation to produce the Homology prediction, we can use the similarity scores to interpret PLUS-RNN. Note that using the penultimate layer for model interpretation is a widely adopted approach in the machine learning community (Zintgraf et al., 2017).
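As an illustration of a soft alignment between two sequences of representations, the sketch below assumes the soft symmetric alignment form used in P-ELMo-style models: L1 distances between all residue pairs, row-wise and column-wise softmax weights, and a weighted average of the negative distances. Equation 7 is not reproduced in this section, so the exact form here is an assumption and may differ from it; all names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _l1(u: Vector, v: Vector) -> float:
    return sum(abs(a - b) for a, b in zip(u, v))


def _softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the maximum for numerical stability
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def soft_align_score(za: Sequence[Vector], zb: Sequence[Vector]) -> float:
    """Soft symmetric alignment score (assumed form); higher means more similar."""
    n, m = len(za), len(zb)
    dist = [[_l1(za[i], zb[j]) for j in range(m)] for i in range(n)]
    # rows[i][j]: how strongly residue i of A attends to residue j of B.
    rows = [_softmax([-dist[i][j] for j in range(m)]) for i in range(n)]
    # cols[j][i]: how strongly residue j of B attends to residue i of A.
    cols = [_softmax([-dist[i][j] for i in range(n)]) for j in range(m)]
    weight_sum, weighted_dist = 0.0, 0.0
    for i in range(n):
        for j in range(m):
            w = rows[i][j] + cols[j][i] - rows[i][j] * cols[j][i]
            weight_sum += w
            weighted_dist += w * dist[i][j]
    return -weighted_dist / weight_sum


# Toy two-dimensional "representations" for three short proteins.
protein_a = [[0.0, 0.0], [1.0, 1.0]]
protein_b = [[0.0, 0.0], [1.0, 1.0]]   # identical to A
protein_c = [[5.0, 5.0], [6.0, 6.0]]   # far from A in representation space
print(soft_align_score(protein_a, protein_b) > soft_align_score(protein_a, protein_c))
```

The weights play the role of a soft alignment matrix, which is what makes a heatmap view of the comparison possible.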
Figure 3 shows the scatter plot of the similarity scores and the true similarity levels of protein pairs from the SCOPe 2.06 Homology test dataset. For comparison, we also show the similarity scores produced by NW-align. The scores from both methods are scaled to lie between 0 and 4, denoting the level of structure similarity. Note that unlike the weak family labels from the Pfam pre-training dataset, family labels from the Homology dataset represent the true three-dimensional structure of proteins. The plot shows that NW-align often produces low similarity scores for protein pairs from the same family. This is because of high sequence-level variations, which result in dissimilar sequences having similar structures. In contrast, PLUS-RNN assigns high similarity scores to most protein pairs from the same family.
Furthermore, we look into three types of protein pairs: (1) a sequence similar - structure similar pair, (2) a sequence dissimilar - structure similar pair, and (3) a sequence dissimilar - structure dissimilar pair (Figure 4(A) and (B)). Note that a sequence similar - structure dissimilar pair does not exist in the Homology datasets. The sequence and structure similarities are defined by NW-align scores and Homology dataset labels, respectively. The pairs having similar structures are chosen from the same family, and those having dissimilar structures are chosen from the same fold. Figure 4(C) shows the heatmaps of NW-align of raw amino acids and soft-alignment of PLUS-RNN representations ( in equation 7) for the three pairs. Due to space limitations, we only show the top left quadrant of the heatmaps. Each cell in the heatmap corresponds to an amino acid pair from proteins A and B. Blue denotes high sequence similarity in NW-align and high structure similarity in PLUS-RNN.
First, we compare the pairs having similar structures (the first and second columns in Figure 4(C)). The heatmaps show that NW-align successfully aligns the similar sequence pair with a score of 2.65. However, it fails for the dissimilar sequence pair, with a score of 0.92. This supports the view that comparing raw sequence similarities cannot identify the correct structure similarities. On the other hand, the soft-alignment of PLUS-RNN representations is successful for both the similar and the dissimilar sequences, with scores of 3.95 and 3.76. Next, we compare the second and the third pairs. Although only the second pair has similar structures, NW-align fails for both and even gives a higher score of 1.03 to the third pair. In contrast, regardless of the sequence similarities, the soft-alignment score of PLUS-RNN representations correctly drops only for the third pair with dissimilar structures, to 2.12. Therefore, the interpretation results verify that the learned representations from PLUS-RNN are structurally contextualized and perform better for inferring the global structure similarities.
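For readers unfamiliar with sequence-level alignment, the sketch below computes a Needleman-Wunsch global alignment score by dynamic programming. It uses a simple match/mismatch/gap scheme chosen for illustration; it is not the NW-align program, it does not use a substitution matrix such as BLOSUM62, and it does not reproduce the scaling of scores to the 0 to 4 range.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch global alignment score with linear gap penalties.

    The scoring parameters are illustrative assumptions.
    """
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


# Toy sequences: an identical pair and a pair with several substitutions.
print(nw_score("GATTACA", "GATTACA"))
print(nw_score("GATTACA", "GCATGCG"))
```

A score of this kind depends only on residue identities, which is why two proteins with diverged sequences but the same fold can still receive a low value.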
5 Concluding Remarks
In this work, we presented PLUS, a novel pre-training scheme for bidirectional protein sequence representations. Consisting of the MLM and the protein-specific SFP pre-training tasks, it can better capture structural information contained within the proteins. PLUS can be used to pre-train various model architectures. In this work, considering the sequential modeling capability and computational complexity, we mainly used PLUS-RNN. It advances the previous SOTA pre-training methods on six out of seven protein biology tasks. Furthermore, to better understand its strengths, we also provided the results from our ablation studies and qualitative interpretation analyses.
We are excited about the future of PLUS. We expect that the gap between the number of unlabeled and labeled proteins will continue to grow exponentially, and that pre-training methods will play an even larger role. Based on the strengths and weaknesses of PLUS, we plan to extend the work in several directions. First, considering that it is especially powerful for inferring global structural information, we are interested in more precise prediction of protein structures (Kryshtafovych et al., 2019). Second, although the pre-training helps, it still lags behind non-neural extracted features for some tasks. We suppose this is because of its weaknesses in learning local structural information. We believe there is still considerable room for improvement, and that exploiting multiple proteins during pre-training, as alignment-based methods do, could be the key (Poplin et al., 2018).
This work was supported by the Brain Korea 21 Plus Project in 2020.
- 20 proteinogenic and 1 unspecified amino acids
- Unified rational protein engineering with sequence-based deep representation learning. Nature methods 16 (12), pp. 1315–1322. Cited by: §1, §2.2.
- AlphaFold at casp13. Bioinformatics 35 (22), pp. 4862–4865. Cited by: §1.
- DeepLoc: prediction of protein subcellular localization using deep learning. Bioinformatics 33 (21), pp. 3387–3395. Cited by: §4.3.1.
- Continuous distributed representation of biological sequences for deep proteomics and genomics. PloS one 10 (11), pp. e0141287. Cited by: §2.2.
- Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §3.4.
- MEME: discovering and analyzing dna and protein sequence motifs. Nucleic acids research 34 (suppl_2), pp. W369–W373. Cited by: §3.2.
- Learning protein sequence embeddings using information from structure. In International Conference on Learning Representations, Cited by: §1, §2.2, §3.3.2.
- Biochemistry. 5th. New York: WH Freeman 38 (894), pp. 76. Cited by: §1.
- Semi-supervised learning (chapelle, o. et al., eds.; 2006)[book reviews]. IEEE Transactions on Neural Networks 20 (3), pp. 542–542. Cited by: §1.
- Proteins: structures and molecular properties. Macmillan. Cited by: §1.
- Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §2.1, §3.2.
- Where did the blosum62 alignment score matrix come from?. Nature biotechnology 22 (8), pp. 1035–1036. Cited by: §1, §4.3.3.
- Pfam: the protein families database. Nucleic acids research 42 (D1), pp. D222–D230. Cited by: §1, §3.1.
- HMMER web server: interactive sequence similarity searching. Nucleic acids research 39 (suppl_2), pp. W29–W37. Cited by: §4.3.3.
- SCOPe: structural classification of proteins–extended, integrating scop and astral data and classification of new structures. Nucleic acids research 42 (D1), pp. D304–D309. Cited by: §4.3.1.
- Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.2.
- Mapping the protein universe. Science 273 (5275), pp. 595–602. Cited by: §1.
- DeepSol: a deep learning framework for sequence-based protein solubility prediction. Bioinformatics 34 (15), pp. 2605–2613. Cited by: §4.3.1.
- NetSurfP-2.0: improved prediction of protein structural features by integrated deep learning. Proteins: Structure, Function, and Bioinformatics 87 (6), pp. 520–527. Cited by: §4.3.1, §4.3.3.
- Multiplicative lstm for sequence modelling. arXiv preprint arXiv:1609.07959. Cited by: §2.2.
- Critical assessment of methods of protein structure prediction (casp)–round xiii. Proteins: Structure, Function, and Bioinformatics 87 (12), pp. 1011–1020. Cited by: §5.
- Albert: a lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §2.2.
- Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §2.1.
- Deep learning in bioinformatics. Briefings in bioinformatics 18 (5), pp. 851–869. Cited by: §1.
- Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §4.
- Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §2.1.
- A universal snp and small-indel variant caller using deep neural networks. Nature biotechnology 36 (10), pp. 983–987. Cited by: §5.
- Evaluating protein transfer learning with tape. In Advances in neural information processing systems, Cited by: §1, §2.2.
- HHblits: lightning-fast iterative protein sequence searching by hmm-hmm alignment. Nature methods 9 (2), pp. 173. Cited by: §4.3.3.
- Global analysis of protein folding using massively parallel design, synthesis, and testing. Science 357 (6347), pp. 168–175. Cited by: §4.3.1.
- Local fitness landscape of the green fluorescent protein. Nature 533 (7603), pp. 397–401. Cited by: §4.3.1.
- The hhpred interactive server for protein homology detection and structure prediction. Nucleic acids research 33 (suppl_2), pp. W244–W248. Cited by: §1, §4.3.3.
- Tests for comparing elements of a correlation matrix.. Psychological bulletin 87 (2), pp. 245. Cited by: §4.3.3.
- Adaptive attention span in transformers. In ACL, Cited by: §4.4.
- Strategies and molecular tools to fight antimicrobial resistance: resistome, transcriptome, and antimicrobial peptides. Frontiers in microbiology 4, pp. 412. Cited by: §4.3.3.
- The topcons web server for consensus prediction of membrane protein topology and signal peptides. Nucleic acids research 43 (W1), pp. W401–W407. Cited by: §4.3.1.
- Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2.1.
- Learned protein embeddings for machine learning. Bioinformatics 34 (15), pp. 2642–2648. Cited by: §2.2.
- Dilated residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 472–480. Cited by: §4.1.
- TM-align: a protein structure alignment algorithm based on the tm-score. Nucleic acids research 33 (7), pp. 2302–2309. Cited by: §4.3.3.
- Visualizing deep neural network decisions: prediction difference analysis. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, Cited by: §4.5.