Exact Learning of RNA Energy Parameters From Structure

Exact Learning of RNA Energy Parameters From Structure

Hamidreza Chitsaz111Corresponding author; e-mail: chitsaz@wayne.edu
Department of Computer Science
Wayne State University
Detroit, MI
   Mohammad Aminisharifabad
Department of Computer Science
Wayne State University
Detroit, MI
Abstract

We consider the problem of exact learning of parameters of a linear RNA energy model from secondary structure data. A necessary and sufficient condition for learnability of parameters is derived, which is based on computing the convex hull of union of translated Newton polytopes of input sequences [1]. The set of learned energy parameters is characterized as the convex cone generated by the normal vectors to those facets of the resulting polytope that are incident to the origin. In practice, the sufficient condition may not be satisfied by the entire training data set; hence, computing a maximal subset of training data for which the sufficient condition is satisfied is often desired. We show that problem is NP-hard in general for an arbitrary dimensional feature space. Using a randomized greedy algorithm, we select a subset of RNA STRAND v2.0 database that satisfies the sufficient condition for separate A-U, C-G, G-U base pair counting model. The set of learned energy parameters includes experimentally measured energies of A-U, C-G, and G-U pairs; hence, our parameter set is in agreement with the Turner parameters.

1 Introduction

The discovery of key regulatory roles of RNA in the cell has recently invigorated interest in RNA structure and RNA-RNA interaction determination or prediction [2, 3, 4, 5, 6, 7, 8]. Due to high chemical reactivity of nucleic acids, experimental determination of RNA structure is time-consuming and challenging. In spite of the fact that computational prediction of RNA structure may not be accurate in a significant number of cases, it is the only viable low-cost high-throughput option to date. Furthermore, with the advent of whole genome synthetic biology [9], accurate high-throughput RNA engineering algorithms are required for both in vivo and in vitro applications [10, 11, 12, 13, 14].

Since the dawn of RNA secondary structure prediction four decades ago [15], increasingly complex models and algorithms have been proposed. Early approaches considered mere base pair counting [16, 17], followed by the Turner thermodynamics model which was a significant leap forward [18, 19]. Recently, massively feature-rich models empowered by parameter estimation algorithms have been proposed. Despite significant progress in the last three decades, made possible by the work of Turner and others [20] on measuring RNA thermodynamic energy parameters and the work of several groups on novel algorithms [21, 22, 23, 24, 25, 26, 27, 28] and machine learning approaches [29, 30, 31], the RNA structure prediction accuracy has not reached a satisfactory level yet [32].

Until now, computational convenience, namely the ability to develop dynamic programming algorithms of polynomial running time, and biophysical intuition have played the main role in development of RNA energy models. The first step towards high accuracy RNA structure prediction is to make sure that the energy model is inherently capable of predicting every observed structure, but it is not over-capable, as accurate estimation of the parameters of a highly complex model often requires a myriad of experimental data, and lack of sufficient experimental data causes overfitting. Recently, we gave a systematic method to assess that inherent capability of a given energy model [1]. Our algorithm decides whether the parameters of an energy model are learnable. The parameters of an energy model are defined to be learnable iff there exists at least one set of such parameters that renders every known RNA structure to date, determined through X-ray or NMR, the minimum free energy structure. Equivalently, we say that the parameters of an energy model are learnable iff 100% structure prediction accuracy can be achieved when the training and test sets are identical. Previously, we gave a necessary condition for the learnability and an algorithm to verify it. Note that a successful RNA folding algorithm needs to have generalization power to predict unseen structures. We leave assessment of the generalization power for future work.

In this paper, we give a necessary and sufficient condition for the learnability and characterize the set of learned energy parameters. Also, we show that selecting the maximum number of RNA sequences for which the sufficient condition is satisfied is NP-hard in general in arbitrary dimensions. Using a randomized greedy algorithm, we select a subset of RNA STRAND v2.0 database that satisfies the sufficient condition for separate A-U, C-G, G-U base pair counting model and yields a set of learned energy parameters that includes the experimentally measured energies of A-U, C-G, and G-U pairs.

2 Methods

2.1 Preliminaries

Let the training set be a given collection of RNA sequences and their corresponding experimentally observed structures . Throughout this paper, we assume the free energy

(1)

associated with a sequence-structure pair is a linear function of the energy parameters , in which is the number of features, is a structure in the ensemble of possible structures of , and is the feature vector.

2.2 Learnability of energy parameters

The question that we asked before [1] was: does there exist nonzero parameters such that for every , ? We ask a slightly relaxed version of that question in this paper: does there exist nonzero parameters such that for every , ? The answer to this question reveals inherent limitations of the energy model, which can be used to design improved models. We provided a necessary condition for the existence of and a dynamic programming algorithm to verify it through computing the Newton polytope for every in [1]. Following our previous notation, let the feature ensemble of sequence be

(2)

and call the convex hull of ,

(3)

the Newton polytope of . We remind the reader that the convex hull of a set, denoted by ‘conv’ here, is the minimal convex set that fully contains the set. Let and . We previously showed in [1] that if minimizes as a function of , then , i.e. the feature vector of is on the boundary of the Newton polytope of .

In this paper, we provide a necessary and sufficient condition for the existence of . First, we rewrite that necessary condition by introducing a translated copy of the Newton polytope,

(4)

in which is the Minkowski difference. The necessary condition for learnability then becomes .

2.3 Necessary and sufficient condition for learnability

The following theorem specifies a necessary and sufficient condition for the learnability.

Theorem 1 (Necessary and Sufficient Condition)

There exists such that for all , minimizes as a function of iff , in which

(5)
  • () Suppose exists such that for all , minimizes as a function of . To the contrary, assume is in the interior of . Therefore, there is an open ball of radius centered at completely contained in , i.e.

    missing(6)

    Let

    It is clear that since . Therefore, can be written as a convex linear combination of the feature vectors in

    missing(7)
    {

    F(x)⊖c(x, y) } = {v_1, …, v_N }, i.e.

    (8)
    (9)

    Note that

    (10)

    Therefore, there is , such that for otherwise,

    (11)

    which would be a contradiction with (10). Since in (missing7), there is such that . It is now sufficient to note that and which is a contradiction with our assumption that minimizes as a function of .

    () Suppose is on the boundary of . Note that for all , . We construct a nonzero such that for all , . Since is convex, it has a supporting hyperplane , which passes through . Let the positive normal to be , i.e. . Therefore, which implies that for all , or equivalently, .       

The proof above is constructive; hence using a similar argument, we characterize all learned energy parameters in the following theorem.

Theorem 2

Let

(12)

In that case, is the set of vectors orthogonal to the supporting hyperplanes of such that . Moreover, that is the convex cone generated by the inward normal vectors to those facets of that are incident to .

2.4 Compatible training set

We say that a training set is compatible if the sufficient condition for learnability is satisfied when is considered. However, the sufficient condition for the entire training set is often not satisfied in practice, for example when the feature vector is low dimensional. We would like to find a compatible subset of to estimate the energy parameters. A natural quest is to find the maximal compatible subset of . In the following section, we show that even when the Newton polytopes with polynomial complexity for all of the sequences in are given, selection of a maximal compatible subset of is NP-hard in arbitrary dimensions.

2.5 NP-hardness of maximal compatible subset

Problem 1 (Maximal Compatible Subset (MCS))

We are given a collection of convex polytopes in such that is on the boundary of every , i.e. for . The desired output is a maximal subcollection with the following property

(13)
Theorem 3

MCS is NP-hard.

  • We use a direct reduction to missingMax 3-Sat. Let be a formula in the 3-conjunctive normal form with clauses

    (14)

    where are binary variables, , and . For every clause , we build 8 convex polytopes in , where the superscripts are in binary. To achieve that, we first build a true and a false convex polytope for every variable . Let be the standard orthonormal basis for and define

    (15)
    (16)

    for an arbitrary and . Note that is a -dimensional narrow arrow with polynomial complexity, and its vertices are on the integer lattice and can be computed in polynomial time; see Figure 1. Moreover, is incompatible, i.e. . Let

    (18)

    Note that eventhough convex hull is NP-hard in general [33], can be computed in polynomial time. Essentially, the vertex set of is the union of vertices of . Assume is that assignment to variables which makes false, and let

    (19)

    More precisely,

    (20)

    Note that for every , assignment of to variables makes true. Finally, define the input to the MCS as

    (21)

    and assume is a maximal subset of with the property

    (22)
    Figure 1: Intersection of and polytopes with the plane in the proof of Theorem 3.

    It is sufficient to show that Max 3-Sat(). First, we prove that Max 3-Sat(). Suppose Max 3-Sat() = , and the assignment renders true. Define

    (23)

    and verify that it is compatible, i.e. , since

    (24)

    spans only (approximately) one orthant of . Since is a maximal subset of with the compatibility property, . Second, we prove that Max 3-Sat(). We show that for every , at most one . To the contrary, suppose for some , . Without loss of generality, assume . In that case,

    (25)

    which is a contradiction. Above int denotes the interior. Similarly, induces a consistent assignment to the variables, that makes clauses true.       

2.6 Randomized greedy algorithm

Since the maximal compatible subset problem is NP-hard in general, we used a randomized greedy algorithm. In the iteration, our algorithm starts with a seed which is a random subset of , the input set of polytopes. In our case, is a single element subset. The algorithm iteratively keeps adding other members of to as long as remains as a vertex of the convex hull of union of all of the polytopes in . Note that Theorem 1 requires to be on the boundary of the convex hull, not necessarily on a vertex. However in practice, applicable cases have as a vertex. Finally, the with maximum number of elements is returned in the output; see Algorithm 1.

Input:
Output:
1
while  MaxIterations do MaxIterations is a static/dynamic constant
      remove inapplicable polytopes
      empty subcollection
      empty polytope
     while  do
          pick a random polytope from
         
         
         
         if  is a vertex of  then
               greedily expand the subcollection
               update the convex hull of the union
         end if
     end while
     if  then
         
     end if
     
end while
Algorithm 1 Randomized Greedy Maximal Compatible Subset
Figure 2: where consists of three sequence-structure pairs: PDB_00434, PDB_00200, and PDB_00876. The dot shows the origin . Secondary structures in are shown in Figure 3.
PDB_00434 PDB_00200 PDB_00876
Figure 3: Structures of PDB_00434, PDB_00200, and PDB_00876 obtained from RNA STRAND v2.0 website [34].

3 Results

We used 2277 unpseudoknotted RNA sequence-structure pairs from RNA STRAND v2.0 database as our data set . RNA STRAND v2.0 is a convenient source of RNA sequences and structures selected from various Rfam families [35] and contains known RNA secondary structures of any type and organism, particularly with and without pseudoknots [34]. There are 2334 pseudoknot-free RNAs in the RNA STRAND database. We sorted them based on their length and selected the first 2277 ones for computational convenience. We excluded pseudoknotted structures because our current implementation is incapable of considering pseudoknots. Some sequences in the data set allow only A-U base pairs (not a single C-G or G-U pair), in which case the Newton polytope degenerates into a line.

We demonstrate the results for the separate A-U, C-G, and G-U base pair counting energy model similar to our previous model [1]. In that case, the feature vector

is three dimensional: is the number of A-U, the number of C-G, and the number of G-U base pairs in . First, we computed and used our Newton polytope program to compute for each [1]. For completeness, we briefly include our dynamic programming algorithm which starts by computing the Newton polytope for all subsequences of unit length, followed by all subsequences of length two and more up to the Newton polytope for the entire sequence . We denote the Newton polytope of the subsequence by , i.e.

(26)

The following dynamic programming yielded the result

(27)

with the base case . Above is the Minkowski sum. We then computed by translating the Newton polytope so that moves to the origin , to obtain as the input to Algorithm 1.

We then removed those polytopes in that do not have as one of their boundary vertices, to obtain in Algorithm 1. It turns out that only 126 sequences out of the initial 2277 remain in . Note that our condition here is more stringent than the necessary condition in [1], and that is why fewer sequences satisfy this condition. After 100 iterations (MaxIterations = 100) which took less than a minute, the algorithm returned 3 polytopes that are compatible, i.e. the origin is a vertex of the boundary of convex hull of union of three polytopes. They correspond to sequences PDB_00434 (length: 15nt; bacteriophage HK022 nun-protein-nutboxb-RNA complex [36]), PDB_00200 (length: 21nt; an RNA hairpin derived from the mouse 5’ ETS that binds to the two N-terminal RNA binding domains of nucleolin [37]), and PDB_00876 (length: 45nt; solution structure of the HIV-1 frameshift inducing element [38]) which are experimentally verified by NMR or X-ray.

The origin is incident to 5 facets of the resulting convex hull, shown in Figure 2, the inward normal vector to which are in the rows of

(28)

as explained in Theorem 2. The set of those energy parameters that correctly predict the three structures in Figure 3 is the convex cone generated by these 5 vectors. The Turner model measures the average energies of A-U, C-G, and G-U to be approximately kcal/mol [20]. To test whether those energy parameters fall into the convex cone generated by , we solved a convex linear equation. Let

We would like to find a positive solution to the linear equation . We formulated that as a linear program which was solved using GNU Octave, and here is the answer:

The convex cone generated by contains the vector . Therefore, our finding is in agreement with the Turner base pairing energies. Note that the three structures are base-pair rich (Figure 3).

4 Discussion

We further developed the notion of learnability of parameters of an energy model. A necessary and sufficient condition for it was given, and a characterization of the set of energy parameters that realize exact structure prediction followed as a by-product. If an energy model satisfies the sufficient condition, then we say that the training set is compatible. In our case, the RNA STRAND v2.0 training set is not compatible (for the A-U, C-G, G-U base pair counting model). We showed that computing a maximal compatible subset of a set of convex polytopes is NP-hard in general and gave a randomized greedy algorithm for it. The computed set of energy parameters for A-U, C-G, G-U from a maximal compatible subset agreed with the thermodynamic energies. Complexity of the MCS problem is an open and interesting question, particularly if we treat the dimension of the feature space as a constant. Also, assessing the generalization power of an energy model remains for future work.

References

  • [1] Elmirasadat Forouzmand and Hamidreza Chitsaz. The RNA Newton polytope and learnability of energy parameters. Bioinformatics, 29(13):i300–i307, 2013. Also ISMB/ECCB proceedings.
  • [2] Gisela Storz. An expanding universe of noncoding RNAs. Science, 296(5571):1260–3, 2002.
  • [3] David P. Bartel. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116(2):281–97, 2004.
  • [4] Gregory J. Hannon. RNA interference. Nature, 418(6894):244–51, 2002.
  • [5] Phillip D. Zamore and Benjamin Haley. Ribo-gnome: the big world of small RNAs. Science, 309(5740):1519–24, 2005.
  • [6] E.G. Wagner and K. Flardh. Antisense RNAs everywhere? Trends Genet., 18:223–226, May 2002.
  • [7] S. Brantl. Antisense-RNA regulation and RNA interference. Bioch. Biophys. Acta, 1575(1-3):15–25, 2002.
  • [8] Susan Gottesman. Micros for microbes: non-coding regulatory RNAs in bacteria. Trends in Genetics, 21(7):399–404, 2005.
  • [9] Daniel G. Gibson, John I. Glass, Carole Lartigue, Vladimir N. Noskov, Ray-Yuan Chuang, Mikkel A. Algire, Gwynedd A. Benders, Michael G. Montague, Li Ma, Monzia M. Moodie, Chuck Merryman, Sanjay Vashee, Radha Krishnakumar, Nacyra Assad-Garcia, Cynthia Andrews-Pfannkoch, Evgeniya A. Denisova, Lei Young, Zhi-Qing Qi, Thomas H. Segall-Shapiro, Christopher H. Calvey, Prashanth P. Parmar, Clyde A. Hutchison, Hamilton O. Smith, and J. Craig Venter. Creation of a bacterial cell controlled by a chemically synthesized genome. Science, 329(5987):52–56, 2010.
  • [10] N.C. Seeman. From genes to machines: DNA nanomechanical devices. Trends Biochem. Sci., 30:119–125, Mar 2005.
  • [11] N. C. Seeman and P. S. Lukeman. Nucleic acid nanostructures: bottom-up control of geometry on the nanoscale. Reports on Progress in Physics, 68:237–270, January 2005.
  • [12] F.C. Simmel and W.U. Dittmer. DNA nanodevices. Small, 1:284–299, Mar 2005.
  • [13] S. Venkataraman, R.M. Dirks, P.W. Rothemund, E. Winfree, and N.A. Pierce. An autonomous polymerization motor powered by DNA hybridization. Nat Nanotechnol, 2:490–494, Aug 2007.
  • [14] P. Yin, R.F. Hariadi, S. Sahu, H.M. Choi, S.H. Park, T.H. Labean, and J.H. Reif. Programming DNA tube circumferences. Science, 321:824–826, Aug 2008.
  • [15] I. Tinoco, P. N. Borer, B. Dengler, M. D. Levin, O. C. Uhlenbeck, D. M. Crothers, and J. Bralla. Improved estimation of secondary structure in ribonucleic acids. Nature New Biol., 246(150):40–41, Nov 1973.
  • [16] R. Nussinov, G. Piecznik, J. R. Grigg, and D. J. Kleitman. Algorithms for loop matchings. SIAM Journal on Applied Mathematics, 35:68–82, 1978.
  • [17] M. S. Waterman and T. F. Smith. RNA secondary structure: A complete mathematical analysis. Math. Biosc, 42:257–266, 1978.
  • [18] Michael Zuker and Patrick Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 9(1):133–148, 1981.
  • [19] J.S. McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29:1105–1119, 1990.
  • [20] D.H. Mathews, J. Sabina, M. Zuker, and D.H. Turner. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288:911–940, May 1999.
  • [21] E. Rivas and S.R. Eddy. A dynamic programming algorithm for RNA structure prediction including pseudoknots. J. Mol. Biol., 285:2053–2068, Feb 1999.
  • [22] Robert M. Dirks and Niles A. Pierce. A partition function algorithm for nucleic acid secondary structure including pseudoknots. Journal of Computational Chemistry, 24(13):1664–1677, 2003.
  • [23] Hamidreza Chitsaz, Raheleh Salari, S.Cenk Sahinalp, and Rolf Backofen. A partition function algorithm for interacting nucleic acid strands. Bioinformatics, 25(12):i365–i373, 2009. Also ISMB/ECCB proceedings.
  • [24] Hamidreza Chitsaz, Rolf Backofen, and S.Cenk Sahinalp. biRNA: Fast RNA-RNA binding sites prediction. In S.L. Salzberg and T. Warnow, editors, Workshop on Algorithms in Bioinformatics (WABI), volume 5724 of LNBI, pages 25–36, Berlin, Heidelberg, 2009. Springer-Verlag.
  • [25] S.H. Bernhart, H. Tafer, U. Mückstein, C. Flamm, P.F. Stadler, and I.L. Hofacker. Partition function and base pairing probabilities of RNA heterodimers. Algorithms Mol Biol, 1:3, 2006.
  • [26] C. Honer zu Siederdissen, S. H. Bernhart, P. F. Stadler, and I. L. Hofacker. A folding algorithm for extended RNA secondary structures. Bioinformatics, 27(13):i129–136, Jul 2011.
  • [27] Fenix W. D. Huang, Jing Qin, Christian M. Reidys, and Peter F. Stadler. Target prediction and a statistical sampling algorithm for RNA-RNA interaction. Bioinformatics, 26(2):175–181, 2010.
  • [28] Rolf Backofen, Dekel Tsur, Shay Zakov, and Michal Ziv-Ukelson. Sparse RNA folding: Time and space efficient algorithms. Journal of Discrete Algorithms, 9(1):12–31, 2011. 20th Anniversary Edition of the Annual Symposium on Combinatorial Pattern Matching (CPM 2009).
  • [29] C. B. Do, D. A. Woods, and S. Batzoglou. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22:90–98, Jul 2006.
  • [30] M. Andronescu, A. Condon, H. H. Hoos, D. H. Mathews, and K. P. Murphy. Computational approaches for RNA energy parameter estimation. RNA, 16:2304–2318, Dec 2010.
  • [31] Shay Zakov, Yoav Goldberg, Michael Elhadad, and Michal Ziv-Ukelson. Rich parameterization improves RNA structure prediction. In Vineet Bafna and S. Sahinalp, editors, Proceedings of the 15th Annual international conference on Research in Computational Molecular Biology, volume 6577 of Lecture Notes in Computer Science, pages 546–562. Springer Berlin-Heidelberg, 2011.
  • [32] E. Rivas, R. Lang, and S. R. Eddy. A range of complex probabilistic models for RNA secondary structure prediction that includes the nearest-neighbor model and more. RNA, 18(2):193–212, Feb 2012.
  • [33] M. E. Dyer. The Complexity of Vertex Enumeration Methods. Mathematics of Operations Research, 8(3):381–402, 1983.
  • [34] M. Andronescu, V. Bereg, H. H. Hoos, and A. Condon. RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinformatics, 9:340, 2008.
  • [35] S. W. Burge, J. Daub, R. Eberhardt, J. Tate, L. Barquist, E. P. Nawrocki, S. R. Eddy, P. P. Gardner, and A. Bateman. Rfam 11.0: 10 years of RNA families. Nucleic Acids Res., 41(Database issue):D226–232, Jan 2013.
  • [36] C. Faber, M. Scharpf, T. Becker, H. Sticht, and P. Rosch. The structure of the coliphage HK022 Nun protein-lambda-phage boxB RNA complex. Implications for the mechanism of transcription termination. J. Biol. Chem., 276(34):32064–32070, Aug 2001.
  • [37] L. D. Finger, L. Trantirek, C. Johansson, and J. Feigon. Solution structures of stem-loop RNAs that bind to the two N-terminal RNA-binding domains of nucleolin. Nucleic Acids Res., 31(22):6461–6472, Nov 2003.
  • [38] D. W. Staple and S. E. Butcher. Solution structure and thermodynamic investigation of the HIV-1 frameshift inducing element. J. Mol. Biol., 349(5):1011–1023, Jun 2005.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
""
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
   
Add comment
Cancel
Loading ...
65215
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test
Test description