Feature versus Raw Sequence: Deep Learning Comparative Study on Predicting Pre-miRNA
Background: Should we input known genome sequence features, or the raw sequence itself, into a deep learning framework? As deep learning becomes more popular in various applications, researchers often face the question of whether to generate features or to use raw sequences as input. To answer this question, we study the prediction accuracy of a feature-based deep belief network and a sequence-based convolution neural network on precursor miRNA prediction. Results: Tested on a variant of a six-layer convolution neural network and a three-layer deep belief network, we find that the convolution neural network model with raw sequence input performs similarly to or slightly better than the feature-based deep belief networks, with best accuracy values of 0.995 and 0.990, respectively. Both models outperform existing benchmark models. The results show that, given large enough data, a well-devised raw-sequence-based deep learning model can replace a feature-based deep learning model. However, constructing a well-behaved deep learning model can be very challenging. In cases where features can be easily extracted, feature-based deep learning models may be a better alternative.
Jaya Thomas, Sonia Thomas, Lee Sael
Keywords: precursor miRNA, convolution neural network, deep belief network, deep learning comparison
Deep learning methods have been popularized in bio-sequence analysis. More specifically, convolution neural networks (CNN) have been widely applied to characterize and classify raw sequence data. Traditionally, to classify sequence data, sequence features were generated, fed into classification algorithms, and predictions were made. CNN simplified this process by removing the need for feature generation, which can be challenging in some cases. However, the question arises of whether to use raw sequences when a good set of features already exists. We answer this question in the context of precursor micro RNA prediction, where both raw sequence data and a set of working sequence features that are shown to give high accuracy are available.
MicroRNAs (miRNAs) are single-stranded small non-coding RNAs that are typically 22 nucleotides long. A miRNA regulates gene expression at the post-transcription level by base pairing with a complementary messenger RNA (mRNA), thereby hindering the translation of the mRNA to proteins. The regulatory roles of miRNAs are important in development, cell proliferation, and cell death, and their malfunction has been connected with neurodegenerative disease, cancer, and metabolic disorders. Furthermore, informatics analysis predicts that 30% of human genes are regulated by miRNA.
MiRNAs can be experimentally determined by directional cloning of endogenous small RNAs. However, this is a time-consuming process that requires expensive laboratory reagents. These drawbacks motivate the application of computational approaches for predicting miRNAs. The goal of miRNA prediction is to correctly distinguish pre-miRNAs from other pseudo hairpins. Via miRNA biogenesis, a pre-miRNA becomes a mature miRNA, whereas other hairpins do not. The miRNA biogenesis involves a number of steps. First, primary transcripts of miRNA (pri-miRNA), several kilobases long, are transcribed from introns of protein-coding genes. The pri-miRNAs are then cleaved by the RNase-III enzyme Drosha into roughly 70 base-pair (bp) long hairpin-looped precursor miRNAs (pre-miRNAs). Then exportin-5 proteins transport the pre-miRNA hairpins into the cytoplasm through the nuclear pore. In the cytoplasm, pre-miRNAs are further cleaved by the RNase-III enzyme Dicer to produce a roughly 20 bp double-stranded intermediate called miRNA:miRNA*. Then the strand of the duplex with the lower thermodynamic energy becomes a mature miRNA.
Precursor miRNA prediction methods
Several machine learning based methods have been proposed to predict miRNAs with high accuracy, that is, to distinguish true pre-miRNAs from other pseudo hairpins: RNA sequences that have stem-loop features similar to pre-miRNAs but do not contain mature miRNAs. Most methods rely on features generated from the sequence, folding measures, stem-loop features, and statistical measures, together with careful selection of features.
Many tools have been developed based on different classification techniques such as the naive Bayes classifier (NBC), artificial neural networks (ANN), support vector machines (SVM), and random forests (RF). Among these approaches, the support vector machine has been most extensively applied. Some of the notable SVM-based methods include triplet-SVM, MiRFinder, miPred, microPred, yasMiR, MiRenSVM, MiRPara, YamiPred, and GDE. Among them, triplet-SVM is a classifier that considers local structure-sequence features that reflect characteristics of miRNAs. The approach reports an accuracy of 90% on pre-miRNAs from 11 other species, including plants and viruses, without considering any other comparative genomics information. The miPred SVM approach uses a Gaussian radial basis function (RBF) kernel as a similarity measure over global and intrinsic hairpin folding attributes and achieves an accuracy of around 93%. MicroPred introduces additional features for evaluating miRNAs with an SVM-based machine learning classifier; its authors report a high specificity of 97.28% and sensitivity of 90.02%. The miR-SF classifier predicts the identified human pre-miRNAs in the miRBase source on a selected optimized feature subset of 13 features, generated by SVM and a genetic algorithm. Finally, YamiPred is a genetic algorithm and SVM based embedded classifier that considers feature selection and parameter optimization to improve performance. Other notable methods are based on random forests (RF) and artificial neural networks (ANN) [15, 16]. The MiRANN predictor uses a neural network for pre-miRNA prediction, expanding the network with more neurons and hidden layers, and reports its accuracy on a human dataset. The network is designed to be impartial to any feature by integrating an exceptional weight-initializing equation where the closest neurons slightly differ in weights.
MiRANN utilizes carefully selected features on a neural network structure. However, to the best of our knowledge, raw sequence data have not been used to distinguish pre-miRNAs from other hairpin sequences.
Deep learning approaches
Two types of neural network models, i.e., the deep belief network and the convolution neural network, are used to compare the prediction accuracy of feature-based learning and raw-sequence-based learning. The convolution neural network (CNN) has been used in several instances to directly process raw data as input. CNN has gained momentum due to its success in improving previously recorded state-of-the-art performance measures in a wide range of domains, including genome sequence analysis. The deep belief network (DBN), on the other hand, has been popular where there are a large number of features. Whether the input is raw data or a high-dimensional feature vector, the deep network uses a multi-layer architecture to infer from data. The deep architecture automatically extracts the high-level features necessary for classification or clustering. That is, the multiple layers in deep learning help exploit the inherent complexities of the data.
Deep belief network
Deep belief network (DBN) is an architecture obtained by stacking multiple restricted Boltzmann machines (RBMs), such that the hidden layer of one RBM serves as the input (visible) layer of the next. Let $x$ be the observed vector and $h^k$ the $k$-th hidden layer; with $N$ hidden layers, the joint distribution is as follows:

$$P(x, h^1, \ldots, h^N) = \left( \prod_{k=0}^{N-2} P(h^k \mid h^{k+1}) \right) P(h^{N-1}, h^N),$$

where $x = h^0$, $P(h^k \mid h^{k+1})$ is the distribution of the visible (input) units of level $k$ conditioned on the hidden layer above it, and $P(h^{N-1}, h^N)$ is the joint distribution of the top-level visible-hidden layer pair. The first step in training the DBN is to train the first layer of the model so that it models the raw input $x$. In the second step, the distribution of the transformed input, $P(h^1 \mid x)$, is obtained using the training results of the first layer and is used as the input to the second layer. In the third step, the second RBM is trained on samples from the conditional probability learned in the previous layer. Steps two and three are repeated to generate multiple layers. In the final step, the hyper-parameters are fine-tuned by gradient-descent-based back propagation. The first hidden layer learns the structure of the data through the input layer, and the process continues by adding the second layer: the first hidden layer acts as the input, which is multiplied by the weights at the nodes of the second hidden layer, giving the activation probabilities of the second hidden layer. This process results in sequential sets of activations, grouping features of features into a feature hierarchy by which the network learns more complex and abstract representations of the data, and it can be repeated several times to create a multi-layer network. A standard feed-forward neural network is added after the last hidden layer to predict the label, the input to the network being the activation probabilities. The resulting DBN is trained end-to-end to adjust the weights with stochastic gradient descent back propagation.
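The greedy layer-wise procedure above can be sketched with a toy stacked-RBM pretrainer. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: CD-1 updates, binary units, synthetic data, and layer sizes borrowed from the X-100-70-35 model described later.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """A minimal binary restricted Boltzmann machine trained with CD-1."""
    def __init__(self, n_vis, n_hid, lr=0.1):
        self.W = rng.normal(0, 0.01, size=(n_vis, n_hid))
        self.b = np.zeros(n_vis)   # visible bias
        self.c = np.zeros(n_hid)   # hidden bias
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.c)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b)

    def cd1_step(self, v0):
        # Positive phase: hidden activations given the data.
        ph0 = self.hidden_probs(v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        # Negative phase: one Gibbs step (reconstruction).
        pv1 = self.visible_probs(h0)
        ph1 = self.hidden_probs(pv1)
        # Contrastive-divergence updates.
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / n
        self.b += self.lr * (v0 - pv1).mean(axis=0)
        self.c += self.lr * (ph0 - ph1).mean(axis=0)

# Greedy layer-wise pretraining of a 58-100-70-35 stack on toy data.
X = (rng.random((256, 58)) > 0.5).astype(float)
sizes = [58, 100, 70, 35]
layers, inp = [], X
for n_vis, n_hid in zip(sizes[:-1], sizes[1:]):
    rbm = RBM(n_vis, n_hid)
    for _ in range(5):            # a few CD-1 epochs per layer
        rbm.cd1_step(inp)
    layers.append(rbm)
    inp = rbm.hidden_probs(inp)   # activations feed the next RBM

print(inp.shape)  # → (256, 35)
```

After pretraining, a supervised output layer would be attached and the whole stack fine-tuned with back propagation, as described above.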
Convolution neural network
Typical CNN models consist of multiple alternations of convolution and pooling layers, finalized by a fully connected layer. The convolution layer performs the convolution operation between the input values and learned filters, i.e., matrices of weights. Let $k$ be the filter size and $W$ the small matrix of weights; the convolution layer then performs a modified convolution of $W$ with the input $X$ by computing the dot product $W \cdot X_{i:i+k} + b$, where $X_{i:i+k}$ is an instance (window) of the input starting at position $i$ and $b$ is the bias. Typically, the filters are shared by using the same filter across different positions of the input. The step size by which the filter slides across the input is called the stride, and the filter area is called the receptive field. The weight-sharing concept is an important characteristic of a convolution network: it reduces the complexity of the problem by reducing the number of weights learned. It also allows location-invariant learning, i.e., if an important pattern exists, an appropriate CNN model will learn it no matter where it appears in the sequence. The convolution layer is often followed by a pooling layer that summarizes the values learned in the convolution layer. Pooling also introduces invariance into the learning and reduces model complexity. Popular pooling methods are average pooling and max pooling. The final layer is the fully connected layer, which is connected to the output, or classification, layer.
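The convolution-with-weight-sharing and max-pooling operations described above can be sketched in plain NumPy. The window and stride values here are illustrative only, not the tuned values reported later.

```python
import numpy as np

def conv1d(X, W, b, stride=1):
    """Valid 1-D convolution of a (length, channels) input with
    (filter_size, channels, n_filters) weights, shared across positions."""
    k, _, n_f = W.shape
    L = (X.shape[0] - k) // stride + 1
    out = np.empty((L, n_f))
    for i in range(L):
        patch = X[i * stride : i * stride + k]           # receptive field
        out[i] = np.tensordot(patch, W, axes=([0, 1], [0, 1])) + b
    return out

def max_pool1d(X, size, stride):
    """Max pooling over non-overlapping-or-strided windows."""
    L = (X.shape[0] - size) // stride + 1
    return np.stack([X[i * stride : i * stride + size].max(axis=0)
                     for i in range(L)])

rng = np.random.default_rng(0)
X = rng.random((160, 4))               # one-hot-like sequence input
W = rng.normal(0, 0.1, (12, 4, 12))    # 12 filters of width 12
b = np.zeros(12)
feat = conv1d(X, W, b)                 # shape (149, 12)
pooled = max_pool1d(feat, size=6, stride=4)
print(feat.shape, pooled.shape)        # → (149, 12) (36, 12)
```

Because the same `W` is applied at every position, the number of learned weights is independent of the sequence length, which is the weight-sharing property discussed above.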
The main contributions of the paper are summarized as follows:
- Compare the performance of feature-based and raw-sequence-based deep learning: a deep belief network model is proposed for integrating a large number of heterogeneous features, and a convolution neural network model is proposed for the raw input sequence data.
- Provide a solution for the class imbalance problem, allowing for unbiased performance measures.
- Compare the performance of the proposed models against existing machine learning classifiers on eleven different species, which extends the previous work on the human dataset only.
miRNA data selection
We use experimentally validated pre-miRNAs as positive examples and pseudo hairpins as negative examples to train and test the proposed method. The human pre-miRNA sequences were retrieved from the miRBase 18.0 release. Similar to the miPred approach, sequences with multiple loops were discarded to obtain a dataset of 1600 pre-miRNAs (positives). The positive sample sequences have an average length of 84 nt, with maximum and minimum lengths of 154 nt and 43 nt, respectively. The negative dataset consists of 8494 pseudo hairpins extracted from human protein-coding regions, as suggested by microPred. The negative sample sequences have an average length of 85 nt, with minimum and maximum lengths of 63 nt and 120 nt, respectively. Different filtering criteria, including a non-overlapping sliding window, no multiple loops, a lowest base-pair number of 18, and a minimum free energy of less than -15 kcal/mol, were applied to these sequences so that they resemble real pre-miRNA properties.
Class imbalance solution
Another problem that we address here is the class imbalance problem in miRNA prediction. Class imbalance is a machine learning problem where the number of data samples belonging to one class (positive) is far smaller than the number belonging to the other class (negative). Class imbalance is often addressed with either under- or over-sampling methods. In under-sampling, data samples are removed from the majority class, whereas in over-sampling, balance is created either by adding new samples or by duplicating existing minority-class samples. The class imbalance problem often arises in miRNA data classification due to the abundance of pseudo hairpin structures compared to true pre-miRNA folds. Existing classifiers such as triplet-SVM and miPred handled the imbalance problem manually.
We address the class imbalance problem during the training phase by adopting a modified under-sampling approach. In the modified approach, we divide the negative (majority) class into subsets using the k-means algorithm with k=5, obtaining clusters with slightly higher similarity within each group, and select the cluster having the highest similarity index. Then, using 8-fold cross validation, the negative samples are divided into training and testing datasets such that the training dataset has 1400 negative samples and the testing dataset has 200 negative samples. Similarly, the positive samples are divided into training and testing datasets using 8-fold cross validation such that the training dataset has 1400 positive samples and the testing dataset has 200 positive samples. Hence, the training dataset has 2800 samples and the testing dataset has 400 samples.
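One plausible realization of cluster-based under-sampling is sketched below. This is an illustration with a toy k-means and synthetic data, not the paper's code; the paper's exact per-cluster selection rule (highest similarity index) is under-specified, so this sketch instead draws negatives from the clusters proportionally until the classes balance.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k=5, iters=20):
    """Tiny k-means; returns a cluster label for each row of X."""
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

# Toy stand-ins for the 8494 negative and 1600 positive feature vectors.
neg = rng.random((8494, 58))
pos = rng.random((1600, 58))

labels = kmeans(neg, k=5)
picked = []
for j in range(5):
    idx = np.flatnonzero(labels == j)
    n_take = int(len(idx) / len(neg) * len(pos))   # proportional share (floored)
    if n_take:
        picked.extend(rng.choice(idx, n_take, replace=False).tolist())
# Top up from the remaining negatives so class sizes match exactly.
remaining = np.setdiff1d(np.arange(len(neg)), picked)
extra = rng.choice(remaining, len(pos) - len(picked), replace=False)
neg_balanced = neg[np.concatenate([np.asarray(picked, dtype=int), extra])]
print(neg_balanced.shape)  # → (1600, 58)
```

The balanced negative set can then be split into the 1400/200 train/test folds described above.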
Modeling deep belief network
miRNA feature encoding
Feature-based learning requires features as inputs. This work adopts 58 characteristic features, which have been shown to be useful for predicting miRNAs in existing studies, and 20 selected features that differentiate pre-miRNAs from pseudo hairpins, as candidate inputs of the DBN model. The features include sequence composition properties, folding measures, stem-loop features, and energy and statistical features. These features are extracted based on knowledge-based analysis of the existing methods for miRNA analysis. The common characteristics of pre-miRNAs used for evaluation consist of sequence composition properties, secondary structures, folding measures, and energy. The sequence characteristics include features related to the frequency of two and three adjacent nucleotides and the aggregate dinucleotide frequency in the sequence. The secondary structure features, from the perspective of miRNA biogenesis, relate to the different thermodynamic stability profiles of pre-miRNAs: these structures have lower free energy and often contain stem and loop regions. They include diversity, frequency, entropy-related properties, and enthalpy-related properties of the structure. The other features are the hairpin length, loop length, consecutive base pairs, and the ratio of loop length to hairpin length of the pre-miRNA secondary structure. The energy characteristics associated with the energy of the secondary structure include the minimal free energy of the secondary structure, the overall normalized ensemble free energy (NEFE), combined energy features, and the energy required for dissolving the secondary structure. All the extracted features are normalized to standardize the inputs, in order to improve training and to avoid getting stuck in local optima. The features used are summarized in the feature category table and detailed in Table 7.
Deep belief network architecture
The proposed DBN-based miRNA prediction method, which we call miRNA-FDL, has three hidden layers; the model is denoted X-100-70-35-1, where X is the size of the input layer, 1 denotes the number of neurons in the output layer, and the remaining values denote the number of neurons in each hidden layer. Figure 1 illustrates the model architecture and the layer-by-layer learning procedure. Different model architectures were trained using the same learning procedure but varying the numbers of hidden layers and nodes. Among the candidate network models, the best one was selected based on classifier accuracy.
The weights of miRNA-FDL were trained with the stochastic gradient descent based back propagation algorithm, where the update rule is the following:

$$w_{t+1} = w_t - \eta \frac{\partial C}{\partial w_t},$$

where $w_t$ is the weight computed at time $t$, $\eta$ denotes the learning rate, and $C$ is the cost function. For the given model, softmax is used as the activation function of the output layer and the cost is computed using cross entropy. The softmax function is defined as

$$p_j = \frac{\exp(x_j)}{\sum_k \exp(x_k)},$$

where $p_j$ is the output of unit $j$, and $x_j$ and $x_k$ denote the total inputs to units $j$ and $k$, respectively, at the same level. The cross entropy is given by

$$C = -\sum_j t_j \log p_j,$$

where $t_j$ is the target probability for output unit $j$ and $p_j$ is the probability output after applying the activation function.
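The softmax and cross-entropy definitions above can be checked numerically with a few lines of NumPy; the input values here are arbitrary examples.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_entropy(p, t):
    # t: one-hot target probabilities; p: softmax outputs.
    return -np.sum(t * np.log(p))

x = np.array([2.0, 1.0, 0.1])   # total inputs x_j to the output units
p = softmax(x)
t = np.array([1.0, 0.0, 0.0])   # target: first class
print(round(float(p.sum()), 6), round(float(cross_entropy(p, t)), 3))
# → 1.0 0.417
```

Note that the cross-entropy gradient with respect to the softmax inputs simplifies to p - t, which is what the back propagation step uses.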
Modeling convolution neural network
CNN input processing
Each pre-miRNA is an RNA sequence composed of the letters (A, C, G, U). Each nucleotide is encoded using the one-hot-code method: A is encoded as (1,0,0,0), C as (0,1,0,0), G as (0,0,1,0), and U as (0,0,0,1).
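A minimal encoder following this scheme might look as follows; the helper name is hypothetical, and zero-padding to a fixed length of 160 is assumed, matching the 4 x 160 input layer used for the CNN in this study.

```python
import numpy as np

CODE = {'A': 0, 'C': 1, 'G': 2, 'U': 3}
MAX_LEN = 160  # longest sequence in the dataset is 154 nt

def one_hot(seq, max_len=MAX_LEN):
    """Encode an RNA string as a 4 x max_len one-hot matrix,
    zero-padded on the right to the fixed model input length."""
    m = np.zeros((4, max_len))
    for i, nt in enumerate(seq[:max_len]):
        m[CODE[nt], i] = 1.0
    return m

x = one_hot("ACGUAC")
print(x.shape, x[:, 0].tolist())  # → (4, 160) [1.0, 0.0, 0.0, 0.0]
```

Positions beyond the sequence end remain all-zero columns, so shorter hairpins and the maximum-length ones share one input shape.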
Convolution neural network architecture
Various CNN architectures can be generated depending on the choice of the number of layers and on how convolution layers are combined with pooling layers. Table 3 shows the variants of the CNN architecture considered in this study. In CNN model type 1, the architecture has a single convolution layer followed by a pooling layer, which in turn is connected to the fully connected output layer. The output layer is connected to the classification layer, which produces the predicted labels. Model type 2 is a variation of model type 1 in which the pooling layer is replaced by a fully connected layer; hence model type 2 has two fully connected layers. A further variation of model type 1 leads to model type 3, which has two convolution layers; all other layers are the same as in base model 1. Model type 4 has three convolution layers. In all the models, global pooling is preferred over local pooling, as we observed that features are better learned with global pooling.
The architecture of the CNN model depends heavily on various hyper-parameters. We set the number of nodes in the input layer to 4 x 160: 4 for the one-hot-code encoding and 160 for the length of the sequence, as the maximum sequence length in the database is 154 bp. We set the number of output nodes of the fully connected layer to three and add a classification layer that identifies whether the input sequence contains a pre-miRNA based on the three nodes' results. The other hyper-parameters tested are listed in Table 4. Various combinations of the hyper-parameters mentioned in Table 4 with the models mentioned in Table 3 were considered.
To determine the efficacy of raw-sequence-based learning versus feature-based learning, we compared the accuracies of two DBN models, one working on the unselected pre-miRNA feature set of size 58 and one on the selected feature set of size 20, and the accuracy of a CNN model that works on raw pre-miRNA sequences. We also compared the proposed methods against other machine learning methods. The proposed miRNA prediction models are implemented on the MATLAB 2016(b) platform with a 2.30 GHz Intel Xeon E5-2630 CPU and 32 GB RAM. The most crucial aspect of the deep learning was the selection of appropriate hyper-parameters. We describe the final selected models in the following. The performance of the proposed and compared methods is summarized in Table 6.
DBN based precursor miRNA prediction model
The various candidate models for the DBN-based precursor miRNA prediction model were obtained by varying the number of hidden nodes in each hidden layer as well as the number of hidden layers. The best prediction accuracy was obtained for a DBN network architecture [Fig. 3] of three hidden layers, with the first, second, and third layers having 100, 70, and 35 hidden neurons, respectively. Considering the stochastic nature of the algorithm, the output values are averaged over twenty executions. We observe that, with the 58 features as input, the DBN model (Input-100-70-35-output) gives an accuracy of 0.968 with an F1-score of 0.957. Furthermore, from the literature survey, we learn that the most relevant features associated with miRNA are melting temperature, entropy, enthalpy, and minimum free energy. The 20 relevant features are: aggregate nucleotide frequency A+U; dinucleotide frequencies AG, AU, CU, GA, and UU; Minimum Free Energy Index 4 (MFEI4); positional entropy; normalized ensemble free energy; frequency of the MFE structure (Freq); enthalpy normalized by the length of the sequence (dH/L); melting temperature (Tm); melting temperature normalized by length (Tm/L); normalized base-pair count by length, |G-C|/L; average number of base pairs normalized by the number of stem loops, (A-U)/stems and (G-U)/stems; the length of the sequence (Len); centroid energy normalized by length (CE/L); and the statistical Z-scores zG and zSP. The DBN model with these 20 features gives an accuracy of 0.992 with an F1-score of 0.989, slightly higher than using all 58 features.
CNN based precursor miRNA prediction model
The various candidate models obtained from the combinations of Table 3 and Table 4 were tested, and the two models with the highest accuracies on the validation dataset were selected. Deeper models were also tested; however, additional convolution layers did not increase the performance of miRNA prediction, mainly due to the limited amount of available data. The two models are described below and summarized in Table 5.
The architecture of model type 2 can be explained as follows: the input layer (raw sequence data) is convoluted with a filter (window) of size 18, and the window is shifted by 4 (using a stride of 4). The total number of filters used is 20. The output of the convolution is fed into a fully connected layer with 90 neurons, whose output is in turn fed into another fully connected layer with 2 neurons; this layer is also called the output layer. The output layer is connected to a classification layer that assigns the label. Thus, in model type 2, two fully connected layers are used between the convolution layer and the final classification layer. The fully connected layers help in better learning of the features extracted by the convolution layer.
In model type 3, as depicted in Figure 2, the architecture is as follows: the input layer (raw sequence data) is convoluted with a filter (window) of size 12, and the window is shifted by 1 (stride = 1). The output of this convolution layer is convoluted again with another filter (window) of size 6 and stride 1. Twelve filters are used in both convolutions. After the second convolution, a max pooling layer with window size 6 and stride 4 is applied. The output of the pooling layer is connected to the fully connected layer with 2 neurons (i.e., the output layer), which is connected to a classification layer that assigns the label. For both models, accuracy is best at a dropout ratio of 0.3 at the output layer.
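The layer sizes implied by the Type 3 description can be verified with quick shape arithmetic, assuming "valid" convolutions (no padding) and the fixed input length of 160:

```python
def conv_len(L, k, s=1):
    """Output length of a valid convolution/pooling with window k, stride s."""
    return (L - k) // s + 1

L = 160                  # padded input length
L = conv_len(L, 12, 1)   # conv 1: window 12, stride 1 -> 149
L = conv_len(L, 6, 1)    # conv 2: window 6, stride 1  -> 144
L = conv_len(L, 6, 4)    # max pool: window 6, stride 4 -> 35
print(L)  # → 35
```

So the fully connected output layer sees 35 positions x 12 filters of pooled features per sequence under these assumptions.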
Comparison with the existing computational methods
The proposed prediction models using CNN and DBN are compared to the existing benchmark models. It is clearly observed that the prediction models based on deep learning approaches outperform the compared methods. The two CNN models and the DBN model with the 20 selected features have the highest accuracy values, above 0.99. The DBN model working on the 58 features also has a high accuracy of 0.968, showing that DBN performs well on a large set of unselected features. Both of the proposed models in this study validate that, provided enough data, the deeper the network model, the higher the precision of the classifier (Table 6).
Discussions and Conclusions
In this study, prediction models for precursor miRNAs that contain miRNA sequences are proposed using deep learning techniques: convolution neural networks on raw sequence input and deep belief networks on feature sets. The convolution neural network, when well modeled, was able to automatically learn the relevant features from raw RNA sequences for predicting correct pre-miRNAs, yielding a highly accurate classifier. The deep learning frameworks outperform all the compared popular learning algorithms, including naive Bayes, random forest, k-nearest neighbors, and SVM.
-  Witkos, T.M., Koscianska, E., Krzyzosiak, W.J.: Practical aspects of microrna target prediction. Curr Mol Med 11(2), 99–109 (2011). doi:10.2174/156652411794859250
-  Ross, J.S., Carlson, J.A., Brock, G.: mirna: the new gene silencer. Am J Clin Pathol. 128(5), 830–836 (2007). doi:10.1309/2JK279BU2G743MWJ
-  Chen, P.Y., Manninga, H., Slanchev, K., Chien, M., Russo, J.J., Ju, J., Sheridan, R., John, B., Marks, D.S., Gaidatzis, D., Sander, C., Zavolan, M., Tuschl, T.: The developmental mirna profiles of zebrafish as determined by small rna cloning. Genes and Development 19(11), 1288–1293 (2005). doi:10.1101/gad.1310605
-  Xue, C., Li, F., He, T., Liu, G.P., Li, Y., Zhang, X.: Classification of real and pseudo microrna precursors using local structure sequence features and support vector machine. BMC Bioinformatics 6, 310 (2005). doi:10.1186/1471-2105-6-310
-  Huang, T.H., Fan, B., Rothschild, M.F., Hu, Z.L., Li, K., Zhao, S.H.: Mirfinder: an improved approach and software implementation for genome-wide fast microrna precursor scans. BMC Bioinformatics 8, 341 (2007). doi:10.1186/1471-2105-8-341
-  Ng, K.L.S., Mishra, S.K.: De novo svm classification of precursor micrornas from genomic pseudo hairpins using global and intrinsic folding measures. Bioinformatics 23(11), 1321–1330 (2007). doi:10.1093/bioinformatics/btm026
-  Batuwita, R., Palade, V.: micropred: effective classification of pre-mirnas for human mirna gene prediction. Bioinformatics 25(8), 989–995 (2009). doi:10.1093/bioinformatics/btp107
-  Pasaila, D., Sucial, A., Mohorianu, I., Pantiru, S.T., Ciortuz, L.: Mirna recognition with the yasmir system: The quest for further improvements. Adv Exp Med Biol. 696, 17–25 (2011). doi:10.1007/978-1-4419-7046-6 2
-  Ding, J., Zhou, S., Guan, J.: Mirensvm: towards better prediction of microrna precursors using an ensemble svm classifier with multi loop features. BMC Bioinformatics 11(Suppl 11), S11 (2010). doi:10.1186/1471-2105-11-S11-S11
-  Wu, Y., Wei, B., Liu, H., Li, T., Rayner, S.: Mirpara: a svm-based software tool for prediction of most probable microrna coding regions in genome scale sequences. BMC Bioinformatics 12, 107 (2011). doi:10.1186/1471-2105-12-107
-  Kleftogiannis, D., Theofilatos, K., Likothanassis, S., Mavroudi, S.: Yamipred: A novel evolutionary method for predicting pre-mirnas and selecting relevant features. IEEE ACM Transactions on Computational Biology and Bioinformatics 12(5), 1183–1192 (2015). doi:10.1109/TCBB.2014.2388227
-  Hsieh, C.H., Chang, D.T.H., Hsueh, C.H., Wu, C.Y., Oyang, Y.J.: Predicting microrna precursors with a generalized gaussian components based density estimation algorithm. BMC Bioinformatics 11, 1–52 (2010). doi:10.1186/1471-2105-11-S1-S52
-  Wang, Y., Chen, X., Jiang, W., Li, L., Li, W., Yang, L., Liao, M., Lian, B., Lv, Y., Wang, S., Wang, S., Li, X.: Predicting human microrna precursors based on an optimized feature subset generated by ga-svm. Genomics 98(2), 73–78 (2011)
-  Xiao, J., Tang, X., Li, Y., Fang, Z., Ma, D., He, Y., Li, M.: Identification of microrna precursors based on random forest with network-level representation method of stem-loop structure. BMC Bioinformatics 12:165 (2011). doi:10.1186/1471-2105-12-165
-  Rahman, M.E., Islam, R., Islam, S., Mondal, S.I., Amin, M.R.l.: Mirann: A reliable approach for improved classification of precursor microrna using artificial neural network model. Genomics 99, 189–194 (2012)
-  Thomas, J., Thomas, S., Sael, L.: DP-miRNA: an improved prediction of precursor microRNA using deep learning model. In: IEEE International Conference on Big Data and Smart Computing (IEEE BigComp 2017), pp. 96–99 (2017). http://conf2017.bigcomputing.org/
-  Hinton, G.E.: Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation 14(8), 1771–1800 (2002)
-  Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A.R., Jaitly, N., Vanhoucke, V., Nguyen, P., Sainath, T., Kingsbury, B.: Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 82–97 (2012). doi:10.1.1.248.3619
-  Zhong, Y., Xuan, P., Han, K., Zhang, W., Li, J.: Improved pre-mirna classification by reducing the effect of class imbalance. BioMed Research International 2015, 1–12 (2015). doi:10.1155/2015/960108
-  Yen, S.J., Lee, Y.S.: Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36(3), 5718–5727 (2009). doi:10.1016/j.eswa.2008.06.108
| Feature category | Description |
|---|---|
| Sequence composition properties | Features related to the frequency of two and three adjacent nucleotides and the aggregate dinucleotide frequency in the sequence, such as dinucleotide pair frequency, trinucleotide frequency, aggregate dinucleotide frequency |
| Secondary structures | Thermodynamic stability profiles of pre-miRNAs |
| Stem and loop | Diversity, frequency, entropy-related properties, enthalpy-related properties of the structure, hairpin length, loop length, consecutive base pairs, ratio of loop length to hairpin length of the pre-miRNA secondary structure |
| Energy characteristics | Minimal free energy of the secondary structure, overall normalized ensemble free energy (NEFE), combined energy features, energy required for dissolving the secondary structure |
| Statistical measures | Z-scores of the folding measures: zG, zQ, zSP, zP, zD |
| Model Name | Number of Layers | Description of architecture |
|---|---|---|
| Type 1 | 5 | Layer 1: input layer; Layer 2: convolution with no stride; Layer 3: pooling layer with stride; Layer 4: fully connected to output layer; Layer 5: classification layer |
| Type 2 | 5 | Layer 1: input layer; Layer 2: convolution with stride; Layer 3: fully connected layer; Layer 4: fully connected to output layer; Layer 5: classification layer |
| Type 3 | 6 | Layer 1: input layer; Layers 2-3: convolution with no stride; Layer 4: pooling layer with stride; Layer 5: fully connected to output layer; Layer 6: classification layer |
| Type 4 | 7 | Layer 1: input layer; Layers 2-4: convolution with no stride; Layer 5: pooling layer with stride; Layer 6: fully connected to output layer; Layer 7: classification layer |
| Hyper-parameter | Range tested |
|---|---|
| Filter size | 5 to 24 |
| Number of filters | 5 to 20 |
| Stride | 0 to 24 |
| Pooling (max pooling window) | 0 to 9 |
| Dropout | 0 to 0.4 |
| Number of convolution layers | 1 to 3 |
| Model Type | Description of model | Performance Measure |
|---|---|---|
| Type 2 | Layer 1: input sequence; Layer 2: convolution layer (window size = 18, stride = 4, number of filters = 20); Layer 3: fully connected layer (90 neurons); Layer 4: fully connected layer (2 neurons); Layer 5: classification layer | SE = 1, Precision = 0.985, Acc = 0.993 |
| Type 3 | Layer 1: input sequence; Layer 2: convolution layer (window size = 12, stride = 1, number of filters = 12); Layer 3: convolution layer (window size = 6, stride = 1, number of filters = 12); Layer 4: pooling layer (max pooling, stride = 4); Layer 5: fully connected layer (2 neurons); Layer 6: classification layer | SE = 1, SP = 0.990, Precision = 0.990, Acc = 0.995 |
| Method | SE | SP | ACC |
|---|---|---|---|
| K nearest neighbors | 0.970 | 0.657 | 0.908 |
| Deep RBM model [58 features] | 0.973 | 0.942 | 0.968 |
| Deep RBM model [20 features] | 0.995 | 0.982 | 0.990 |
| CNN model 1 (Type 2) | 1.00 | 0.985 | 0.993 |
| CNN model 2 (Type 3) | 1.00 | 0.990 | 0.995 |
Full list of pre-miRNA features
The full list of pre-miRNA features used as inputs to the deep belief network is listed below.
| Feature | Count | Description |
|---|---|---|
| XY, where X,Y in {A,C,G,U} | 16 | Dinucleotide pair frequency |
| XYZ, where X,Y,Z in {A,C,G,U} | 64 | Trinucleotide pair frequency |
| A+U | 1 | Aggregate dinucleotide frequency (bases that are either A or U) |
| G+C | 1 | Aggregate dinucleotide frequency (bases that are either G or C) |
| Freq | 1 | Structural frequency property |
| dP | 1 | Adjusted base pairing propensity, given as total bases / L |
| dG | 1 | Adjusted minimum free energy of folding, given as dG = MFE / L |
| dD | 1 | Adjusted base pair distance |
| dQ | 1 | Adjusted Shannon entropy |
| dF | 1 | Compactness of the tree-graph representation of the sequence |
| MFEI1 | 1 | MFEI1 = dG / (C+G) |
| MFEI2 | 1 | MFEI2 = dG / (number of stems) |
| MFEI3 | 1 | MFEI3 = dG / (number of loops) |
| MFEI4 | 1 | MFEI4 = dG / (total bases) |
| MFEI5 | 1 | MFEI5 = dG / (A+U) ratio |
| dS/L | 1 | Normalized structure entropy |
| dH/L | 1 | Normalized structure enthalpy |
| Tm/L | 1 | Normalized melting temperature |
| BP_XY, where XY in {GC, GU, AU} | 3 | Ratio of total bases to the respective base pairs |
| GC | 1 | Number of G, C bases |
| AvgBPStem | 1 | Average number of base pairs in the stem region |
| (A-U)/L, (G-C)/L, (G-U)/L | 3 | (X-Y)/L, where X-Y is the number of X-Y base pairs in the secondary structure |
| (A-U)/n_stems, (G-C)/n_stems, (G-U)/n_stems | 3 | Average number of base pairs in the stem region, normalized by the number of stems |
| zP, zG, zD, zQ, zSP | 5 | Statistical Z-scores of the folding measures |
| dPs | 1 | Positional entropy, which estimates the structural volatility of the secondary structure |
| EAFE | 1 | Normalized ensemble free energy |
| CE/L | 1 | Centroid energy normalized by length |
| Diff | 1 | Diff = \|MFE-EFE\| / L, where EFE is the ensemble free energy |
| IH | 1 | Hairpin length (including dangling ends) |
| IC | 1 | Maximum consecutive base pairs |
| L | 1 | Ratio of loop length to hairpin length |