RAFP-Pred: Robust Prediction of Antifreeze Proteins using Localized Analysis of n-Peptide Compositions
In extreme cold weather, living organisms produce Antifreeze Proteins (AFPs) to counter the otherwise lethal intracellular formation of ice. Structures and sequences of various AFPs exhibit a high degree of heterogeneity; consequently, the prediction of AFPs is considered to be a challenging task. In this research, we propose to handle this arduous manifold learning task using the notion of localized processing. In particular, an AFP sequence is segmented into two sub-segments, each of which is analyzed for amino acid and di-peptide compositions. We propose to use only the most significant features, selected using the concept of information gain (IG), followed by a random forest classification approach. The proposed RAFP-Pred achieved excellent performance on a number of standard datasets. We report a high Youden’s index (sensitivity + specificity - 1) value of 0.75 on the standard independent test data set, outperforming AFP-PseAAC, AFP_PSSM, AFP-Pred and iAFP by margins of 0.05, 0.06, 0.14 and 0.68 respectively. The verification rate on the UniProtKB dataset is found to be 83.19%, which is substantially superior to the 57.18% reported for the iAFP method.
Ice has an unusual property called recrystallization. When water starts to freeze, it forms many small crystals. Some of the small crystals soon dominate and continue to grow by stealing water molecules from the surrounding small crystals. This phenomenon can prove to be particularly lethal for living organisms in extreme cold weather due to the intracellular formation of ice. Antifreeze proteins (AFPs) neutralize this recrystallization effect by binding to the surface of the small ice crystals and retarding their growth into larger, dangerous crystals. Therefore they are also called ’ice structuring proteins’ (ISPs). The AFPs lower the freezing point of water without altering the melting point; this interesting property of the AFPs is called ’thermal hysteresis’.
The AFPs are critical for the survival of living organisms in extremely cold environments. They are found in various insects, fish, bacteria, fungi and overwintering plants such as gymnosperms, ferns, monocotyledons, and angiosperms. Several studies on various AFPs have shown that there is little structural and sequential similarity for an ice-binding domain. This inconsistency relates to the lack of common features in different AFPs, and therefore a reliable prediction of AFPs is considered to be an arduous task.
The recent success of machine learning algorithms in the area of protein classification has encouraged several researchers to develop automated approaches for the identification of AFPs. AFP-Pred is considered to be the earliest work in this direction. The work is essentially based on a random forest approach making use of sequence information such as functional groups, physicochemical properties, short peptides and secondary structural elements. In AFP_PSSM, evolutionary information is used with support vector machine (SVM) classification. In iAFP, n-peptide composition is used with limited experimental results. In particular, amino acid, di-peptide and tri-peptide compositions were used. We argue that the tri-peptide composition is computationally expensive (it requires the calculation of $20^3 = 8000$ combinations) and results in redundant information. Consequently, the selection of the most significant features using genetic algorithms (GA) has shown limited results. It is also worth noting that the n-peptide compositions were derived for the whole sequence. The latest work in this regard is AFP-PseAAC, where the pseudo amino acid composition is used with an SVM classifier to achieve a ’good’ prediction accuracy.
In machine learning, difficult manifold learning problems can be addressed more effectively using a localized processing approach than with its holistic counterparts. Considering the diversified structures of AFPs, it is intriguing to explore the localized processing of the protein sequences. We therefore propose to adopt a segmentation approach where each protein sequence is segmented into two sub-sequences. The amino acid and di-peptide compositions are derived for each sub-sequence, from which we extract the relevant features. The most significant features are further selected using the concept of information gain, and the random forest approach is used for classification. To the best of our knowledge, this is the first time that localized processing is proposed to deal with the challenging problem of learning the diversified structures of the AFPs. The proposed method has been shown to comprehensively outperform all the existing approaches on standard datasets (Section III).
II Proposed Approach
The reliable prediction of proteins can only be achieved by robustly encoding the protein sequences into mathematical expressions. This ensures that the underlying structures of the protein sequences have been truly learned. In the absence of robust learning methods for the protein sequences, the predictor is unlikely to perform well for unseen test samples. From the machine learning perspective, difficult manifold learning problems are effectively tackled using the localized processing approach [17, 18]. While holistic methods deal with the training samples in a global sense, localized learning focuses on the various segments of the samples. Typically, features extracted from confined segments are efficiently fused. For challenging manifold learning problems, localized learning has been shown to outperform its counterparts in various applications of machine learning [19, 20, 21]. We therefore propose a local analysis approach of AFPs for feature extraction.
Since the structures of various AFPs are uncorrelated and lack similarity, the automated prediction of AFPs is considered to be a challenging task. Motivated by the robustness of localized learning approaches, we propose an approach that processes localized segments of the AFP sequences. In particular, each protein sequence is segmented into two sub-sequences, and each sub-sequence is individually analyzed for amino acid and di-peptide compositions.
Consider a protein chain $P$ of $N$ amino acid residues:

$$P = R_1 R_2 R_3 \cdots R_N$$

where $R_i$ represents the $i$-th residue of protein $P$. According to the amino acid composition, protein $P$ can be expressed as an array of occurrence frequencies of the twenty native amino acids:

$$P = [f_1, f_2, \ldots, f_{20}]^T$$

where $f_u$ ($u = 1, 2, \ldots, 20$) is the normalized occurrence frequency of the $u$-th native amino acid in $P$, and $T$ is the vector transpose operator. Accordingly, the amino acid composition of a protein can readily be derived once the protein sequencing information is known. This simple, but effective, amino acid composition (AAC) model has been widely used in a number of statistical methods to predict protein structures.
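Dipeptide compositions are computed using 400 ($20 \times 20$) dipeptides, i.e. AA, AC, AD, ..., YV, YW, YY. Each component is calculated using the following equation:

$$d_j = \frac{n_j}{N - 1}, \quad j = 1, 2, \ldots, 400$$

where $n_j$ is the number of occurrences of the $j$-th dipeptide in the sequence and $N - 1$ is the total number of overlapping dipeptides in a sequence of $N$ residues.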
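TABLE I: List of derived features.

| Segment | Features | Number of attributes |
|---|---|---|
| Segment 1 | Amino acid composition features | 20 |
| Segment 1 | Dipeptide composition features | 400 |
| Segment 2 | Amino acid composition features | 20 |
| Segment 2 | Dipeptide composition features | 400 |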
The 20 AACs and 400 dipeptide compositions are combined to form 420 attributes for each segment of the AFP sequence. Finally the 420 attributes of individual sub-sequences are fused to form a single representative feature vector consisting of 840 attributes. Table I shows a list of derived features.
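The feature construction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the split point is assumed to be the midpoint of the sequence (the text only states that two sub-sequences are used), and dipeptide frequencies are assumed to be normalized by the number of overlapping dipeptides in the segment.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty native amino acids
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def amino_acid_composition(seq):
    """Normalized occurrence frequency of each of the 20 amino acids."""
    n = len(seq)
    return [seq.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Frequency of each of the 400 overlapping dipeptides."""
    total = len(seq) - 1  # number of overlapping dipeptides
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # skip non-standard residues such as 'X'
            counts[pair] += 1
    return [counts[d] / total if total > 0 else 0.0 for d in DIPEPTIDES]


def segment_features(seq):
    """840 attributes: 420 (20 AAC + 400 dipeptide) for each of two segments."""
    mid = len(seq) // 2  # assumption: the sequence is split at its midpoint
    features = []
    for segment in (seq[:mid], seq[mid:]):
        features += amino_acid_composition(segment)
        features += dipeptide_composition(segment)
    return features
```

With this layout, indices 0 to 419 describe segment 1 and indices 420 to 839 describe segment 2, so a selected feature can always be traced back to its segment.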
It is well established that redundant information tends to degrade the classification results. It is therefore customary to select the most relevant features for the purpose of classification. Information gain (IG) or Info-Gain is considered to be an important criterion for the selection of the most significant features. Given a training set $S$ and an attribute $A$, the information gain with respect to the attribute $A$ can be defined as the reduction in entropy of the training set once the attribute is observed; mathematically:

$$IG(S, A) = H(S) - H(S \mid A)$$

where $H(S)$ is the entropy of $S$ and $H(S \mid A)$ is the entropy of $S$ conditioned to the observation of attribute $A$. For the classical case of a dichotomizer:

$$IG(S, A) = H(S) - \sum_{v \in V(A)} \frac{|S_v|}{|S|} H(S_v)$$

where $V(A)$ is the set of all possible values of the attribute $A$, $S_v$ is the partition of the training set characterizing the value $v$ of attribute $A$, $H(S_v)$ is the entropy of $S_v$ and $|\cdot|$ is the cardinality operator.
We propose to use the concept of the Info-Gain for the selection of the most significant features from a pool of 840 features (discussed in Section II-A). The features are ranked using the above formulation of IG in a descending order such that the attribute with the highest IG is given the top priority.
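A minimal sketch of IG-based ranking is given below. It computes the gain of an attribute as the entropy of the class labels minus the size-weighted entropy of the partitions induced by the attribute values. The helper names are hypothetical, and the mean-threshold binarization of the continuous composition values is an assumption made only for illustration; the text does not state how the attributes were discretized.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy of each partition."""
    total = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    conditional = sum(len(part) / total * entropy(part)
                      for part in partitions.values())
    return entropy(labels) - conditional


def binarize(column):
    """Assumed discretization: 1 if the value exceeds the column mean."""
    mean = sum(column) / len(column)
    return [int(x > mean) for x in column]


def rank_features(rows, labels, top_k):
    """Indices of the top_k attributes, sorted by descending IG."""
    columns = list(zip(*rows))
    gains = [information_gain(binarize(col), labels) for col in columns]
    order = sorted(range(len(gains)), key=lambda j: gains[j], reverse=True)
    return order[:top_k]
```

Calling `rank_features(rows, labels, 100)` on an 840-column training matrix would return the indices of the 100 highest-ranked attributes under these assumptions.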
The random forest approach has been shown to produce excellent results for various prediction problems in proteomics [28, 19, 13, 30, 31, 32, 33, 34]. Random forest is an ensemble classification protocol which combines several weak classifiers (decision trees) to constitute a single strong classifier. The decision trees generated by the random forest approach are combined using a weighted average scheme. The approach harnesses the power of many decision trees, rational randomization, and ensemble learning to develop accurate classification models.
Random forest is a supervised learning approach consisting of two steps: (1) bagging, and (2) random partitioning. In bagging, several decision trees are grown by drawing multiple samples (with replacement) from the original training data set. Although an indefinite number of such trees can be grown, typically 200-500 trees are considered to be enough. The random forest approach introduces randomness in tree-growing by first randomly selecting a subset of prospective predictors and then producing the split by selecting the best available splitter. The approach is robust to overfitting and quite efficient on large datasets. A random forest classifier was implemented using the WEKA tool, with the following controlling parameters: (1) maximum depth of tree = 10, (2) number of features = 100, (3) number of trees = 50, and (4) seed = 1. The work-flow of the proposed RAFP-Pred is shown in Figure 1.
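The two steps, bagging and random predictor selection, can be illustrated with a deliberately simplified sketch. It is not the WEKA implementation: each tree is reduced to a single-split decision stump (rather than depth 10), splits are scored with Gini impurity, and the trees are combined by plain majority vote. All function names are hypothetical; the defaults of 50 trees and seed 1 merely mirror the parameters quoted above.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent class label."""
    return Counter(labels).most_common(1)[0][0]


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def fit_stump(rows, labels, feature_ids):
    """Best single-threshold split among the candidate features."""
    best = None
    for j in feature_ids:
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, j, t, majority(left), majority(right))
    if best is None:  # no valid split found: always predict the majority class
        m = majority(labels)
        return (feature_ids[0], float("inf"), m, m)
    return best[1:]


def fit_forest(rows, labels, n_trees=50, n_features=None, seed=1):
    """Bagging plus a random predictor subset for each bootstrap sample."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    k = n_features or max(1, int(d ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feats = rng.sample(range(d), k)             # random predictor subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = [left if row[j] <= t else right for j, t, left, right in forest]
    return majority(votes)
```

Because every stump has only one split, drawing the predictor subset once per tree is equivalent here to drawing it once per split; a full tree would redraw the subset at every node.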
III Experimental Results
III-A Evaluation Parameters
For any prediction framework, the Receiver Operating Characteristic (ROC) is considered to be the most comprehensive performance criterion. The proposed algorithm was therefore extensively evaluated for true positive rate (sensitivity), true negative rate (specificity), prediction accuracy and the area under the curve (AUC). The proposed algorithm was also evaluated for Matthews Correlation Coefficient (MCC). MCC ranges from -1 to 1, with MCC = 1 and MCC = -1 indicating the best and the worst predictions respectively, while MCC = 0 corresponds to a random guess. Youden’s index (or Youden’s J statistic), defined as sensitivity + specificity - 1, is an interesting way of summarizing the results of a diagnostic experiment. For a predictor that performs no worse than chance it ranges from 0 to 1: 0 indicates performance no better than a random guess, while 1 shows perfect results with no false positives and no false negatives. Youden’s index is particularly useful for the evaluation of highly imbalanced test data.
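The threshold-based measures named above can all be computed from the four confusion-matrix counts. The following sketch uses the standard definitions; the function name and the dictionary keys are hypothetical, and AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def evaluate(tp, fn, tn, fp):
    """Performance measures computed from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    youden = sensitivity + specificity - 1
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
        "youden": youden,
    }
```

For example, `evaluate(tp=90, fn=10, tn=80, fp=20)` (made-up counts) gives a sensitivity of 0.9, a specificity of 0.8 and a Youden's index of 0.7.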
III-B Experimental Results
III-B1 Dataset 1
Dataset 1 consists of 481 AFPs and 9493 non-AFPs. The dataset is further partitioned into training and testing sets. The training set comprises 300 AFPs and 300 non-AFPs selected randomly from the pools of 481 AFPs and 9493 non-AFPs respectively. The remaining 181 AFPs and 9193 non-AFPs constitute the testing set. Training accuracy was obtained by evaluating the proposed algorithm on the 600 training samples. The dataset is taken from the literature, where the protein sequences were collected from the Pfam database. For the redundancy check, a PSI-BLAST search was performed for each sequence against a non-redundant sequence database with a stringent threshold (E-value 0.001), followed by manual inspection to retain only antifreeze proteins. The final dataset contains only protein sequences with at most 40% sequence similarity; all other similar proteins were removed from the dataset using CD-HIT.
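The partitioning described above can be sketched as a class-wise random draw. The function name, the seed and the placeholder identifiers in the usage note are hypothetical; only the sample counts come from the text.

```python
import random


def split_dataset(positives, negatives, n_train=300, seed=0):
    """Randomly draw n_train samples per class; the rest form the test set."""
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Label 1 marks AFPs and label 0 marks non-AFPs.
    train = [(s, 1) for s in pos[:n_train]] + [(s, 0) for s in neg[:n_train]]
    test = [(s, 1) for s in pos[n_train:]] + [(s, 0) for s in neg[n_train:]]
    return train, test
```

With 481 AFPs and 9493 non-AFPs this yields a balanced training set of 600 samples and an imbalanced test set of 181 AFPs and 9193 non-AFPs.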
The proposed approach attained 100% accuracy on a randomly selected training set, which outperforms the AFP-Pred method by a margin of 18.67% and the AFP_PSSM method by a margin of 17.33%. The average accuracy over three randomly selected training sets, for the proposed method, was found to be 99.91% with a standard deviation of 0.16%. This prediction performance is 10.22% better than that of the AFP-PseAAC approach (standard deviation of 0.706%).
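TABLE II: Results on the test set of dataset 1 using different feature subsets.

| Feature subset | Sensitivity (%) | Specificity (%) | MCC | Accuracy (%) | Youden’s index |
|---|---|---|---|---|---|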
The results for the test data set, using different feature subsets, are shown in Table II. The proposed RAFP-Pred achieves the best accuracy of 90.93% utilizing the 100 most significant features. For a comprehensive evaluation, the proposed approach was also compared to the state-of-the-art methods reported in the literature (refer to Table III). Note that for a fair comparison we implemented and evaluated all approaches using the same training and testing examples. We were however unable to generate results for AFP_PSSM as the data was unavailable during our experiments. Instead we compared our results directly with those reported by its authors.
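TABLE III: Comparison with state-of-the-art predictors on the test set of dataset 1.

| Predictor | Sensitivity (%) | Specificity (%) | Accuracy (%) | Youden’s index | AUC |
|---|---|---|---|---|---|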
The test data set is highly imbalanced, with 181 positive (AFP) and 9193 negative (non-AFP) examples. For such highly imbalanced test data, there is a natural tendency for the predictor to be biased in favor of the class which has more samples. In such scenarios, evaluation parameters such as the AUC and Youden’s index are more representative of the predictor’s performance than the conventional sensitivity, specificity and accuracy measures.
For instance in Table III iAFP achieves a very high specificity of 97.23% but a poor sensitivity of 9.94%. Therefore, although the overall accuracy of 95.55% appears to be the best reported accuracy, the predictor has a low Youden’s index of 0.07 and therefore cannot be regarded as competitive. The proposed approach achieved a Youden’s index of 0.75 which is better than all reported results in the literature. The receiver operating characteristics (ROC) are shown in Figure 2 where the highest AUC of 0.95 verifies the excellent performance of the proposed RAFP-Pred approach.
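As a quick check of the figures quoted above (a worked illustration using only numbers given in this section; the all-negative predictor is a hypothetical baseline, not a reported method), Youden’s index for iAFP follows directly from its sensitivity and specificity:

$$J_{\text{iAFP}} = 0.0994 + 0.9723 - 1 = 0.0717 \approx 0.07$$

For comparison, a predictor that labels every test sample as non-AFP would score

$$\text{Accuracy} = \frac{9193}{181 + 9193} = \frac{9193}{9374} \approx 0.9807, \qquad J = 0 + 1 - 1 = 0$$

so a very high accuracy is attainable on this test set without detecting a single AFP, whereas Youden’s index exposes the failure.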
The 100 most significant features obtained using the training samples of dataset 1 are available online at https://goo.gl/3i7gQD. These 100 features were used for all the datasets.
It is interesting to compare the proposed approach with the latest and the most successful method reported in the literature, i.e., AFP-PseAAC. The proposed RAFP-Pred approach has been shown to comprehensively outperform the AFP-PseAAC method. The AFP-PseAAC achieved a sensitivity and specificity of 82.87% and 87.61% respectively, which lags the proposed approach by margins of 1.11 and 3.46 percentage points. The Youden’s index of the AFP-PseAAC was found to be 0.70, which is inferior to that of the proposed approach.
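The margins quoted above imply the sensitivity and specificity of the proposed approach. The following arithmetic is a consistency check derived from those margins, not an additional reported result:

$$\text{Sensitivity}_{\text{RAFP-Pred}} = 82.87\% + 1.11\% = 83.98\%, \qquad \text{Specificity}_{\text{RAFP-Pred}} = 87.61\% + 3.46\% = 91.07\%$$

$$J_{\text{RAFP-Pred}} = 0.8398 + 0.9107 - 1 = 0.7505 \approx 0.75, \qquad J_{\text{AFP-PseAAC}} = 0.8287 + 0.8761 - 1 = 0.7048 \approx 0.70$$

Weighting the two rates by the class sizes of the test set gives

$$\text{Accuracy}_{\text{RAFP-Pred}} \approx \frac{0.8398 \times 181 + 0.9107 \times 9193}{9374} \approx 0.9093$$

which agrees with the reported accuracy of 90.93% and the Youden’s index of 0.75.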
III-B2 Dataset 2
Dataset 2 consists of 44 AFPs and 3762 non-AFPs collected from the Protein Data Bank (PDB) and the PISCES server respectively. The non-AFPs in the dataset had at most 25% pairwise sequence identity (SI), R-factors of at most 0.25 and a crystallographic resolution of 2 Å or better. In this dataset only those AFPs that had known 3D structures are included. Dataset 2 is also a highly imbalanced dataset with 44 positive and 3762 negative examples. In the literature, the only results reported on this dataset are for the iAFP method. In particular, the iAFP method attained an accuracy of 99.32% on 7-fold cross validation. The proposed RAFP-Pred approach attained a comparable accuracy of 99.71% using the 100 most significant features obtained in Section III-B1. The MCC value of 0.87 found for the proposed RAFP-Pred also compares favorably to the 0.79 reported for the iAFP method. Note that the proposed RAFP-Pred approach was trained using the samples of dataset 2 only and no other training samples were used. For each iteration of 7-fold cross validation, the redundancy between the training and testing samples was explicitly checked and all samples were found to be unique.
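A 7-fold cross validation on such imbalanced data can be sketched as below. The stratified, round-robin assignment of samples to folds is an assumption made for illustration (the text does not say how the folds were formed), and the function names are hypothetical.

```python
import random


def stratified_folds(labels, k=7, seed=0):
    """Assign sample indices to k folds, keeping class ratios balanced."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % k].append(i)  # deal indices round-robin
    return folds


def train_test_splits(labels, k=7, seed=0):
    """Yield (train_indices, test_indices) for each of the k iterations."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]
```

With 44 AFPs and 3762 non-AFPs, each fold under this scheme holds 6 or 7 AFPs, so every test fold contains positive examples.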
The state-of-art AFP-PseAAC approach achieved an accuracy of 99.74% which is quite comparable to the 99.71% of the proposed approach. The MCC value of 0.88 achieved by the AFP-PseAAC is also comparable to the 0.87 attained by the proposed RAFP-Pred approach.
III-B3 Dataset 3
Dataset 3 is an independent dataset representing an evolutionarily divergent group of organisms, consisting of 369 AFPs obtained from the UniProtKB database by searching for the phrase “antifreeze”. Any redundancies, i.e., duplicate or partial sequences, were removed during the search. To further filter the dataset, all sequences labeled as “predicted” or “putative” in the protein name field were also removed, followed by a manual check against the literature. To avoid any confusion, any proteins which belong to “antifreeze-like proteins” were also excluded. In the literature, results on this dataset are reported only for the iAFP method. The proposed RAFP-Pred was trained using the same training data (i.e. dataset 2), where the 100 most significant features were used. The sequences of the training and testing sets were scanned for similar sequences and no identical sequences were found. The proposed RAFP-Pred approach attained the highest verification rate of 83.19%, which is substantially better than the 57.18% reported for the iAFP.
The AFP-PseAAC approach was also evaluated using the same training and testing samples, achieving a verification rate of 40.17%, which is 43.02 percentage points lower than that of the proposed RAFP-Pred approach.
IV Biological Justification of the Most Significant Features Selected by the Proposed Approach
It is well known that biological proteins usually have hydrophobic amino acids in the core (away from water molecules in the solvent). Interestingly, some AFPs have many hydrophobic amino acids on their surfaces. On the other hand, α-helices are most commonly found at the surface of the protein cores (for the case of some fish AFPs for instance), where they provide an interface with the aqueous environment.
Regions which tend to form an α-helix are: (1) richer in alanine (A), glutamic acid (E), leucine (L), and methionine (M), and (2) poorer in proline (P), glycine (G), tyrosine (Y), and serine (S). Careful analysis of the localized segments shows that:
Segment 1 contains high proline, high serine, high tyrosine and low alanine, which indicates a lower likelihood of an α-helix in segment 1.
Segment 2 contains low tyrosine, high glutamic acid, high alanine and moderate methionine, which indicates a high probability of an α-helix in segment 2.
The above discussion shows that segment 2 has a high probability of containing an α-helix region. Biologically, we can expect AFPs to have more hydrophobic amino acids in segment 2 compared to the non-AFPs. This can serve as a biologically justified point of discrimination. About 68% of the features selected by the proposed RAFP-Pred come from segment 2, and 58% of these segment 2 features are hydrophobic amino acid related. It therefore follows that the proposed approach selected the most relevant and biologically justified features for the AFP prediction.
The structural and sequential diversity in AFPs demands a feature-set encompassing a broader range of features catering for most types of AFPs. For instance, the cysteine composition may vary for different organisms: conserved cysteines form disulfide bonds in beta-helix insect AFPs, but the same is not true for type 1 fish AFPs. A broader range of features is therefore required to predict AFPs across organisms. A thorough investigation shows that the optimal feature-set obtained by the proposed RAFP-Pred approach indeed contains a broad spectrum of these significant features.
For instance, type 1 AFPs are rich in alanine, type 2 and type 5 AFPs are rich in cysteine, and type 4 AFPs are rich in glutamine. Interestingly, the optimal feature set obtained by the proposed RAFP-Pred approach contains all these features. This explains the better performance of the proposed approach compared to the contemporary predictors.
In our experiments, the training data of dataset 1 (300 AFPs and 300 Non-AFPs) was used to identify the top 100 significant features. The details are provided in the supplementary material. Here we discuss the top three features selected by the proposed approach.
The most relevant feature selected by the proposed approach is the frequency of the tryptophan amino acid in segment 2. A careful exploration of the training data shows that segment 2 of the AFPs contains 40.97% more tryptophan compared to the non-AFPs. It is therefore safe to assume that the frequency of the tryptophan amino acid in segment 2 is a discriminating feature. The second most relevant feature selected by the proposed approach is the frequency of the leucine amino acid in segment 2. The non-AFPs of the training data set contain 20.82% more leucine compared to their AFP counterparts; leucine can therefore be regarded as another discriminating feature. The frequency of occurrence of the amino acid cysteine is the third most relevant feature selected by the proposed approach. Analysis of the training data shows that it is found in abundance in both segments of the AFPs compared to the non-AFPs. In particular, the training data contained 36.30% more cysteine in the AFPs than in the non-AFPs, and therefore cysteine is regarded as an important discriminating feature. This finding is supported by other researchers who highlight the significance of cysteine in the prediction of the AFPs. In fact, 19 out of the 100 features selected by the proposed approach are cysteine related.
For further details on all the selected features, the reader is referred to the supplementary material.
The structural and sequential dissimilarity of AFP sequences makes the prediction of the AFPs a difficult task. Previous sequence-based AFP predictors make use of the whole protein sequence. In this work we propose a novel concept of the localized analysis of AFP sequences. Extensive experiments on a number of standard datasets have been conducted. The proposed RAFP-Pred approach has been shown to perform better than previous predictors such as AFP-PseAAC, AFP_PSSM, AFP-Pred and iAFP. The Weka model of the proposed approach has been made publicly available for benchmarking purposes (https://goo.gl/3i7gQD). Our favorable results suggest further explorations in this direction. For instance, a more extensive segmentation could be a possible area of future research.
The authors would like to thank University of Western Australia (UWA), Pakistan Air Force - Karachi Institute of Economics and Technology (PAF-KIET), and Iqra University (IU), for providing the necessary support towards conducting this research and the anonymous reviewers for their important comments.
-  S. Khan, I. Naseem, R. Togneri, and M. Bennamoun, “Rafp-pred: Robust prediction of antifreeze proteins using localized analysis of n-peptide compositions,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 15, no. 1, pp. 244–250, Jan 2018.
-  X.-M. Yu and M. Griffith, “Winter rye antifreeze activity increases in response to cold and drought, but not abscisic acid,” Physiologia Plantarum, vol. 112, no. 1, pp. 78–86, 2001.
-  M. Griffith, M. Antikainen, W.-C. Hon, K. Pihakaski-Maunsbach, X.-M. Yu, J. U. Chun, and D. S. Yang, “Antifreeze proteins in winter rye,” Physiologia Plantarum, vol. 100, no. 2, pp. 327–332, 1997.
-  P. L. Davies, J. Baardsnes, M. J. Kuiper, and V. K. Walker, “Structure and function of antifreeze proteins,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 357, no. 1423, pp. 927–935, 2002.
-  G. L. Fletcher, C. L. Hew, and P. L. Davies, “Antifreeze proteins of teleost fishes,” Annual review of physiology, vol. 63, no. 1, pp. 359–390, 2001.
-  M. E. Urrutia, J. G. Duman, and C. A. Knight, “Plant thermal hysteresis proteins,” Biochimica et Biophysica Acta (BBA)-Protein Structure and Molecular Enzymology, vol. 1121, no. 1, pp. 199–206, 1992.
-  P. Scholander, L. Van Dam, J. Kanwisher, H. Hammel, and M. Gordon, “Supercooling and osmoregulation in arctic fish,” Journal of Cellular and Comparative Physiology, vol. 49, no. 1, pp. 5–24, 1957.
-  M. Moriyama, J. Abe, M. Yoshida, Y. Tsurumi, and S. Nakayama, “Seasonal changes in freezing tolerance, moisture content and dry weight of three temperate grasses [dactylis glomerata, lolium perenne, phleum pratense],” Journal of Japanese Society of Grassland Science (Japan), 1995.
-  J. M. Logsdon and W. F. Doolittle, “Origin of antifreeze protein genes: a cool tale in molecular evolution,” Proceedings of the National Academy of Sciences, vol. 94, no. 8, pp. 3485–3487, 1997.
-  K. Ewart, Q. Lin, and C. Hew, “Structure, function and evolution of antifreeze proteins,” Cellular and Molecular Life Sciences CMLS, vol. 55, no. 2, pp. 271–283, 1999.
-  C.-H. C. Cheng, “Evolution of the diverse antifreeze proteins,” Current opinion in genetics & development, vol. 8, no. 6, pp. 715–720, 1998.
-  P. L. Davies and B. D. Sykes, “Antifreeze proteins,” Current opinion in structural biology, vol. 7, no. 6, pp. 828–834, 1997.
-  K. Kandaswamy, K.-C. Chou, T. Martinetz, S. Möller, P. N. Suganthan, S. Sridharan, and G. Pugalenthi, “AFP-Pred: A random forest approach for predicting antifreeze proteins from sequence-derived,” Journal of Theoretical Biology, vol. 270, pp. 56–62, 2011.
-  Z. Xiaowei, M. Zhiqiang, and Y. Minghao, “Using support vector machine and evolutionary profiles to predict antifreeze protein sequences,” International Journal of Molecular Science, vol. 13, pp. 2196–2207, 2012.
-  C.-S. Yu and C.-H. Lu, “Identification of antifreeze proteins and their functional residues by support vector machine and genetic algorithms based on n-peptide compositions,” PLoS, 2011.
-  S. Mondal and P. Priyadarshini, “Chou’s pseudo amino acid composition improves sequence-based antifreeze protein prediction,” Journal of Theoretical Biology, vol. 356, pp. 30–35, 2014.
-  A. Kovnatsky, K. Glashoff, and M. M. Bronstein, “Madmm: a generic algorithm for non-smooth optimization on manifolds,” arXiv preprint arXiv:1505.07676, 2015.
-  H. Yang, Y.-M. Cheng, S.-W. Zhang, and Q. Pan, “Prediction of protein subcellular localization using a novel feature extraction method: sequence-segmented pseudo amino acid composition,” Acta Biophysica Sinica, vol. 24, no. 3, pp. 232–238, 2008.
-  K. K. Kandaswamy, G. Pugalenthi, K.-U. Kalies, E. Hartmann, and T. Martinetz, “Ecmpred: Prediction of extracellular matrix proteins based on random forest with maximum relevance minimum redundancy feature selection,” Journal of theoretical biology, vol. 317, pp. 377–383, 2013.
-  A. Dehzangi, R. Heffernan, A. Sharma, J. Lyons, K. Paliwal, and A. Sattar, “Gram-positive and gram-negative protein subcellular localization by incorporating evolutionary-based descriptors into Chou’s general PseAAC,” Journal of theoretical biology, vol. 364, pp. 284–294, 2015.
-  S.-W. Zhang, W. Chen, F. Yang, and Q. Pan, “Using Chou’s pseudo amino acid composition to predict protein quaternary structure: a sequence-segmented PseAAC approach,” Amino Acids, vol. 35, no. 3, pp. 591–598, 2008.
-  K.-C. Chou, “Some remarks on protein attribute prediction and pseudo amino acid composition,” Journal of theoretical biology, vol. 273, no. 1, pp. 236–247, 2011.
-  P. Horton, K.-J. Park, T. Obayashi, N. Fujita, H. Harada, C. Adams-Collier, and K. Nakai, “Wolf psort: protein localization predictor,” Nucleic acids research, vol. 35, no. suppl 2, pp. W585–W587, 2007.
-  P. Du, X. Wang, C. Xu, and Y. Gao, “PseAAC-Builder: A cross-platform stand-alone program for generating various special Chou’s pseudo-amino acid compositions,” Analytical biochemistry, vol. 425, no. 2, pp. 117–119, 2012.
-  C. Ding and H. Peng, “Minimum redundancy feature selection from microarray gene expression data,” Journal of bioinformatics and computational biology, vol. 3, no. 02, pp. 185–205, 2005.
-  D. Koller and M. Sahami, “Toward optimal feature selection,” in Proceedings of the Thirteenth International Conference. Morgan Kaufmann Publishers Inc., 1996, pp. 284–292.
-  P. Langley et al., Selection of relevant features in machine learning. Defense Technical Information Center, 1994.
-  K. K. Kandaswamy, G. Pugalenthi, E. Hartmann, K.-U. Kalies, S. Möller, P. Suganthan, and T. Martinetz, “Spred: A machine learning approach for the identification of classical and non-classical secretory proteins in mammalian genomes,” Biochemical and biophysical research communications, vol. 391, no. 3, pp. 1306–1311, 2010.
-  T. M. Mitchell, Machine Learning, 1st ed. New York, NY, USA: McGraw-Hill, Inc., 1997.
-  B. Wu, T. Abbott, D. Fishman, W. McMurray, G. Mor, K. Stone, D. Ward, K. Williams, and H. Zhao, “Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data,” Bioinformatics, vol. 19, no. 13, pp. 1636–1643, 2003.
-  J. W. Lee, J. B. Lee, M. Park, and S. H. Song, “An extensive comparison of recent classification tools applied to microarray data,” Computational Statistics & Data Analysis, vol. 48, no. 4, pp. 869–885, 2005.
-  R. Díaz-Uriarte and S. A. De Andres, “Gene selection and classification of microarray data using random forest,” BMC bioinformatics, vol. 7, no. 1, p. 3, 2006.
-  K. K. Kumar, G. Pugalenthi, and P. Suganthan, “Dna-prot: identification of dna binding proteins from protein sequence information using random forest,” Journal of Biomolecular Structure and Dynamics, vol. 26, no. 6, pp. 679–686, 2009.
-  M. Masso and I. I. Vaisman, “Knowledge-based computational mutagenesis for predicting the disease potential of human non-synonymous single nucleotide polymorphisms,” Journal of Theoretical Biology, vol. 266, no. 4, pp. 560–568, 2010.
-  L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
-  E. Frank, M. Hall, L. Trigg, G. Holmes, and I. H. Witten, “Data mining in bioinformatics using weka,” Bioinformatics, vol. 20, no. 15, pp. 2479–2481, 2004.
-  W. J. Youden, “Index for rating diagnostic tests,” Cancer, vol. 3, no. 1, pp. 32–35, 1950.
-  E. L. Sonnhammer, S. R. Eddy, R. Durbin et al., “Pfam: a comprehensive database of protein domain families based on seed alignments,” Proteins-Structure Function and Genetics, vol. 28, no. 3, pp. 405–420, 1997.
-  W. Li, L. Jaroszewski, and A. Godzik, “Clustering of highly homologous sequences to reduce the size of large protein databases,” Bioinformatics, vol. 17, no. 3, pp. 282–283, 2001.
-  H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, “The protein data bank,” Nucleic acids research, vol. 28, no. 1, pp. 235–242, 2000.
-  G. Wang and R. L. Dunbrack, “Pisces: a protein sequence culling server,” Bioinformatics, vol. 19, no. 12, pp. 1589–1591, 2003.
-  A. Bairoch and R. Apweiler, “The swiss-prot protein sequence database and its supplement trembl in 2000,” Nucleic acids research, vol. 28, no. 1, pp. 45–48, 2000.
-  U. Consortium et al., “The universal protein resource (uniprot) in 2010,” Nucleic acids research, vol. 38, no. suppl 1, pp. D142–D148, 2010.
-  S. P. Graether, “The structure of type III and spruce budworm antifreeze proteins: Globular versus β-helix folds,” Ph.D. dissertation, Queens University, 1999.
-  E. I. Howard, M. P. Blakeley, M. Haertlein, I. P. Haertlein, A. Mitschler, S. J. Fisher, A. C. Siah, A. G. Salvay, A. Popov, C. M. Dieckmann, T. Petrova, and A. Podjarny, “Neutron structure of type-III antifreeze protein allows the reconstruction of AFP-ice interface,” Journal of Molecular Recognition, vol. 24, no. 4, pp. 724–732, 2011.
-  (2016) Neutron science explains mystery of how arctic fish’s antifreeze proteins work. [Online]. https://www.ill.eu/press-and-news/press-room/press-releases/how-arctic-fishs-antifreeze-proteins-work/.
-  S. N. Patel and S. P. Graether, “Structures and ice-binding faces of the alanine-rich type I antifreeze proteins,” Biochemistry and Cell Biology, vol. 88, no. 2, pp. 223–229, 2010, PMID: 20453925.
-  N. Ng and C. Hew, “Structure of an antifreeze polypeptide from the sea raven. disulfide bonds and similarity to lectin-binding proteins.” Journal of Biological Chemistry, vol. 267, no. 23, pp. 16 069–16 075, 1992.
-  J. G. Duman, “Antifreeze and ice nucleator proteins in terrestrial arthropods,” Annual Review of Physiology, vol. 63, no. 1, pp. 327–357, 2001.
-  M. Karel and D. B. Lund, Physical principles of food preservation: revised and expanded. CRC Press, 2003, vol. 129.
-  G. Deng, D. W. Andrews, and R. A. Laursen, “Amino acid sequence of a new type of antifreeze protein, from the longhorn sculpin myoxocephalus octodecimspinosis,” FEBS letters, vol. 402, no. 1, pp. 17–20, 1997.