Exact Learning of RNA Energy Parameters From Structure
Abstract
We consider the problem of exact learning of parameters of a linear RNA energy model from secondary structure data. A necessary and sufficient condition for learnability of parameters is derived, which is based on computing the convex hull of union of translated Newton polytopes of input sequences [1]. The set of learned energy parameters is characterized as the convex cone generated by the normal vectors to those facets of the resulting polytope that are incident to the origin. In practice, the sufficient condition may not be satisfied by the entire training data set; hence, computing a maximal subset of training data for which the sufficient condition is satisfied is often desired. We show that problem is NPhard in general for an arbitrary dimensional feature space. Using a randomized greedy algorithm, we select a subset of RNA STRAND v2.0 database that satisfies the sufficient condition for separate AU, CG, GU base pair counting model. The set of learned energy parameters includes experimentally measured energies of AU, CG, and GU pairs; hence, our parameter set is in agreement with the Turner parameters.
1 Introduction
The discovery of key regulatory roles of RNA in the cell has recently invigorated interest in RNA structure and RNARNA interaction determination or prediction [2, 3, 4, 5, 6, 7, 8]. Due to high chemical reactivity of nucleic acids, experimental determination of RNA structure is timeconsuming and challenging. In spite of the fact that computational prediction of RNA structure may not be accurate in a significant number of cases, it is the only viable lowcost highthroughput option to date. Furthermore, with the advent of whole genome synthetic biology [9], accurate highthroughput RNA engineering algorithms are required for both in vivo and in vitro applications [10, 11, 12, 13, 14].
Since the dawn of RNA secondary structure prediction four decades ago [15], increasingly complex models and algorithms have been proposed. Early approaches considered mere base pair counting [16, 17], followed by the Turner thermodynamics model which was a significant leap forward [18, 19]. Recently, massively featurerich models empowered by parameter estimation algorithms have been proposed. Despite significant progress in the last three decades, made possible by the work of Turner and others [20] on measuring RNA thermodynamic energy parameters and the work of several groups on novel algorithms [21, 22, 23, 24, 25, 26, 27, 28] and machine learning approaches [29, 30, 31], the RNA structure prediction accuracy has not reached a satisfactory level yet [32].
Until now, computational convenience, namely the ability to develop dynamic programming algorithms of polynomial running time, and biophysical intuition have played the main role in development of RNA energy models. The first step towards high accuracy RNA structure prediction is to make sure that the energy model is inherently capable of predicting every observed structure, but it is not overcapable, as accurate estimation of the parameters of a highly complex model often requires a myriad of experimental data, and lack of sufficient experimental data causes overfitting. Recently, we gave a systematic method to assess that inherent capability of a given energy model [1]. Our algorithm decides whether the parameters of an energy model are learnable. The parameters of an energy model are defined to be learnable iff there exists at least one set of such parameters that renders every known RNA structure to date, determined through Xray or NMR, the minimum free energy structure. Equivalently, we say that the parameters of an energy model are learnable iff 100% structure prediction accuracy can be achieved when the training and test sets are identical. Previously, we gave a necessary condition for the learnability and an algorithm to verify it. Note that a successful RNA folding algorithm needs to have generalization power to predict unseen structures. We leave assessment of the generalization power for future work.
In this paper, we give a necessary and sufficient condition for the learnability and characterize the set of learned energy parameters. Also, we show that selecting the maximum number of RNA sequences for which the sufficient condition is satisfied is NPhard in general in arbitrary dimensions. Using a randomized greedy algorithm, we select a subset of RNA STRAND v2.0 database that satisfies the sufficient condition for separate AU, CG, GU base pair counting model and yields a set of learned energy parameters that includes the experimentally measured energies of AU, CG, and GU pairs.
2 Methods
2.1 Preliminaries
Let the training set be a given collection of RNA sequences and their corresponding experimentally observed structures . Throughout this paper, we assume the free energy
(1) 
associated with a sequencestructure pair is a linear function of the energy parameters , in which is the number of features, is a structure in the ensemble of possible structures of , and is the feature vector.
2.2 Learnability of energy parameters
The question that we asked before [1] was: does there exist nonzero parameters such that for every , ? We ask a slightly relaxed version of that question in this paper: does there exist nonzero parameters such that for every , ? The answer to this question reveals inherent limitations of the energy model, which can be used to design improved models. We provided a necessary condition for the existence of and a dynamic programming algorithm to verify it through computing the Newton polytope for every in [1]. Following our previous notation, let the feature ensemble of sequence be
(2) 
and call the convex hull of ,
(3) 
the Newton polytope of . We remind the reader that the convex hull of a set, denoted by ‘conv’ here, is the minimal convex set that fully contains the set. Let and . We previously showed in [1] that if minimizes as a function of , then , i.e. the feature vector of is on the boundary of the Newton polytope of .
In this paper, we provide a necessary and sufficient condition for the existence of . First, we rewrite that necessary condition by introducing a translated copy of the Newton polytope,
(4) 
in which is the Minkowski difference. The necessary condition for learnability then becomes .
2.3 Necessary and sufficient condition for learnability
The following theorem specifies a necessary and sufficient condition for the learnability.
Theorem 1 (Necessary and Sufficient Condition)
There exists such that for all , minimizes as a function of iff , in which
(5) 

() Suppose exists such that for all , minimizes as a function of . To the contrary, assume is in the interior of . Therefore, there is an open ball of radius centered at completely contained in , i.e.
missing(6) Let
It is clear that since . Therefore, can be written as a convex linear combination of the feature vectors in
missing(7) F(x)⊖c(x, y) } = {v_1, …, v_N }, i.e.
(8) (9) Note that
(10) Therefore, there is , such that for otherwise,
(11) which would be a contradiction with (10). Since in (missing7), there is such that . It is now sufficient to note that and which is a contradiction with our assumption that minimizes as a function of .
() Suppose is on the boundary of . Note that for all , . We construct a nonzero such that for all , . Since is convex, it has a supporting hyperplane , which passes through . Let the positive normal to be , i.e. . Therefore, which implies that for all , or equivalently, .
The proof above is constructive; hence using a similar argument, we characterize all learned energy parameters in the following theorem.
Theorem 2
Let
(12) 
In that case, is the set of vectors orthogonal to the supporting hyperplanes of such that . Moreover, that is the convex cone generated by the inward normal vectors to those facets of that are incident to .
2.4 Compatible training set
We say that a training set is compatible if the sufficient condition for learnability is satisfied when is considered. However, the sufficient condition for the entire training set is often not satisfied in practice, for example when the feature vector is low dimensional. We would like to find a compatible subset of to estimate the energy parameters. A natural quest is to find the maximal compatible subset of . In the following section, we show that even when the Newton polytopes with polynomial complexity for all of the sequences in are given, selection of a maximal compatible subset of is NPhard in arbitrary dimensions.
2.5 NPhardness of maximal compatible subset
Problem 1 (Maximal Compatible Subset (MCS))
We are given a collection of convex polytopes in such that is on the boundary of every , i.e. for . The desired output is a maximal subcollection with the following property
(13) 
Theorem 3
MCS is NPhard.

We use a direct reduction to missingMax 3Sat. Let be a formula in the 3conjunctive normal form with clauses
(14) where are binary variables, , and . For every clause , we build 8 convex polytopes in , where the superscripts are in binary. To achieve that, we first build a true and a false convex polytope for every variable . Let be the standard orthonormal basis for and define
(15) (16) for an arbitrary and . Note that is a dimensional narrow arrow with polynomial complexity, and its vertices are on the integer lattice and can be computed in polynomial time; see Figure 1. Moreover, is incompatible, i.e. . Let
(18) Note that eventhough convex hull is NPhard in general [33], can be computed in polynomial time. Essentially, the vertex set of is the union of vertices of . Assume is that assignment to variables which makes false, and let
(19) More precisely,
(20) Note that for every , assignment of to variables makes true. Finally, define the input to the MCS as
(21) and assume is a maximal subset of with the property
(22) It is sufficient to show that Max 3Sat(). First, we prove that Max 3Sat(). Suppose Max 3Sat() = , and the assignment renders true. Define
(23) and verify that it is compatible, i.e. , since
(24) spans only (approximately) one orthant of . Since is a maximal subset of with the compatibility property, . Second, we prove that Max 3Sat(). We show that for every , at most one . To the contrary, suppose for some , . Without loss of generality, assume . In that case,
(25) which is a contradiction. Above int denotes the interior. Similarly, induces a consistent assignment to the variables, that makes clauses true.
2.6 Randomized greedy algorithm
Since the maximal compatible subset problem is NPhard in general, we used a randomized greedy algorithm. In the iteration, our algorithm starts with a seed which is a random subset of , the input set of polytopes. In our case, is a single element subset. The algorithm iteratively keeps adding other members of to as long as remains as a vertex of the convex hull of union of all of the polytopes in . Note that Theorem 1 requires to be on the boundary of the convex hull, not necessarily on a vertex. However in practice, applicable cases have as a vertex. Finally, the with maximum number of elements is returned in the output; see Algorithm 1.




PDB_00434  PDB_00200  PDB_00876 
3 Results
We used 2277 unpseudoknotted RNA sequencestructure pairs from RNA STRAND v2.0 database as our data set . RNA STRAND v2.0 is a convenient source of RNA sequences and structures selected from various Rfam families [35] and contains known RNA secondary structures of any type and organism, particularly with and without pseudoknots [34]. There are 2334 pseudoknotfree RNAs in the RNA STRAND database. We sorted them based on their length and selected the first 2277 ones for computational convenience. We excluded pseudoknotted structures because our current implementation is incapable of considering pseudoknots. Some sequences in the data set allow only AU base pairs (not a single CG or GU pair), in which case the Newton polytope degenerates into a line.
We demonstrate the results for the separate AU, CG, and GU base pair counting energy model similar to our previous model [1]. In that case, the feature vector
is three dimensional: is the number of AU, the number of CG, and the number of GU base pairs in . First, we computed and used our Newton polytope program to compute for each [1]. For completeness, we briefly include our dynamic programming algorithm which starts by computing the Newton polytope for all subsequences of unit length, followed by all subsequences of length two and more up to the Newton polytope for the entire sequence . We denote the Newton polytope of the subsequence by , i.e.
(26) 
The following dynamic programming yielded the result
(27) 
with the base case . Above is the Minkowski sum. We then computed by translating the Newton polytope so that moves to the origin , to obtain as the input to Algorithm 1.
We then removed those polytopes in that do not have as one of their boundary vertices, to obtain in Algorithm 1. It turns out that only 126 sequences out of the initial 2277 remain in . Note that our condition here is more stringent than the necessary condition in [1], and that is why fewer sequences satisfy this condition. After 100 iterations (MaxIterations = 100) which took less than a minute, the algorithm returned 3 polytopes that are compatible, i.e. the origin is a vertex of the boundary of convex hull of union of three polytopes. They correspond to sequences PDB_00434 (length: 15nt; bacteriophage HK022 nunproteinnutboxbRNA complex [36]), PDB_00200 (length: 21nt; an RNA hairpin derived from the mouse 5’ ETS that binds to the two Nterminal RNA binding domains of nucleolin [37]), and PDB_00876 (length: 45nt; solution structure of the HIV1 frameshift inducing element [38]) which are experimentally verified by NMR or Xray.
The origin is incident to 5 facets of the resulting convex hull, shown in Figure 2, the inward normal vector to which are in the rows of
(28) 
as explained in Theorem 2. The set of those energy parameters that correctly predict the three structures in Figure 3 is the convex cone generated by these 5 vectors. The Turner model measures the average energies of AU, CG, and GU to be approximately kcal/mol [20]. To test whether those energy parameters fall into the convex cone generated by , we solved a convex linear equation. Let
We would like to find a positive solution to the linear equation . We formulated that as a linear program which was solved using GNU Octave, and here is the answer:
The convex cone generated by contains the vector . Therefore, our finding is in agreement with the Turner base pairing energies. Note that the three structures are basepair rich (Figure 3).
4 Discussion
We further developed the notion of learnability of parameters of an energy model. A necessary and sufficient condition for it was given, and a characterization of the set of energy parameters that realize exact structure prediction followed as a byproduct. If an energy model satisfies the sufficient condition, then we say that the training set is compatible. In our case, the RNA STRAND v2.0 training set is not compatible (for the AU, CG, GU base pair counting model). We showed that computing a maximal compatible subset of a set of convex polytopes is NPhard in general and gave a randomized greedy algorithm for it. The computed set of energy parameters for AU, CG, GU from a maximal compatible subset agreed with the thermodynamic energies. Complexity of the MCS problem is an open and interesting question, particularly if we treat the dimension of the feature space as a constant. Also, assessing the generalization power of an energy model remains for future work.
References
 [1] Elmirasadat Forouzmand and Hamidreza Chitsaz. The RNA Newton polytope and learnability of energy parameters. Bioinformatics, 29(13):i300–i307, 2013. Also ISMB/ECCB proceedings.
 [2] Gisela Storz. An expanding universe of noncoding RNAs. Science, 296(5571):1260–3, 2002.
 [3] David P. Bartel. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116(2):281–97, 2004.
 [4] Gregory J. Hannon. RNA interference. Nature, 418(6894):244–51, 2002.
 [5] Phillip D. Zamore and Benjamin Haley. Ribognome: the big world of small RNAs. Science, 309(5740):1519–24, 2005.
 [6] E.G. Wagner and K. Flardh. Antisense RNAs everywhere? Trends Genet., 18:223–226, May 2002.
 [7] S. Brantl. AntisenseRNA regulation and RNA interference. Bioch. Biophys. Acta, 1575(13):15–25, 2002.
 [8] Susan Gottesman. Micros for microbes: noncoding regulatory RNAs in bacteria. Trends in Genetics, 21(7):399–404, 2005.
 [9] Daniel G. Gibson, John I. Glass, Carole Lartigue, Vladimir N. Noskov, RayYuan Chuang, Mikkel A. Algire, Gwynedd A. Benders, Michael G. Montague, Li Ma, Monzia M. Moodie, Chuck Merryman, Sanjay Vashee, Radha Krishnakumar, Nacyra AssadGarcia, Cynthia AndrewsPfannkoch, Evgeniya A. Denisova, Lei Young, ZhiQing Qi, Thomas H. SegallShapiro, Christopher H. Calvey, Prashanth P. Parmar, Clyde A. Hutchison, Hamilton O. Smith, and J. Craig Venter. Creation of a bacterial cell controlled by a chemically synthesized genome. Science, 329(5987):52–56, 2010.
 [10] N.C. Seeman. From genes to machines: DNA nanomechanical devices. Trends Biochem. Sci., 30:119–125, Mar 2005.
 [11] N. C. Seeman and P. S. Lukeman. Nucleic acid nanostructures: bottomup control of geometry on the nanoscale. Reports on Progress in Physics, 68:237–270, January 2005.
 [12] F.C. Simmel and W.U. Dittmer. DNA nanodevices. Small, 1:284–299, Mar 2005.
 [13] S. Venkataraman, R.M. Dirks, P.W. Rothemund, E. Winfree, and N.A. Pierce. An autonomous polymerization motor powered by DNA hybridization. Nat Nanotechnol, 2:490–494, Aug 2007.
 [14] P. Yin, R.F. Hariadi, S. Sahu, H.M. Choi, S.H. Park, T.H. Labean, and J.H. Reif. Programming DNA tube circumferences. Science, 321:824–826, Aug 2008.
 [15] I. Tinoco, P. N. Borer, B. Dengler, M. D. Levin, O. C. Uhlenbeck, D. M. Crothers, and J. Bralla. Improved estimation of secondary structure in ribonucleic acids. Nature New Biol., 246(150):40–41, Nov 1973.
 [16] R. Nussinov, G. Piecznik, J. R. Grigg, and D. J. Kleitman. Algorithms for loop matchings. SIAM Journal on Applied Mathematics, 35:68–82, 1978.
 [17] M. S. Waterman and T. F. Smith. RNA secondary structure: A complete mathematical analysis. Math. Biosc, 42:257–266, 1978.
 [18] Michael Zuker and Patrick Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 9(1):133–148, 1981.
 [19] J.S. McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29:1105–1119, 1990.
 [20] D.H. Mathews, J. Sabina, M. Zuker, and D.H. Turner. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288:911–940, May 1999.
 [21] E. Rivas and S.R. Eddy. A dynamic programming algorithm for RNA structure prediction including pseudoknots. J. Mol. Biol., 285:2053–2068, Feb 1999.
 [22] Robert M. Dirks and Niles A. Pierce. A partition function algorithm for nucleic acid secondary structure including pseudoknots. Journal of Computational Chemistry, 24(13):1664–1677, 2003.
 [23] Hamidreza Chitsaz, Raheleh Salari, S.Cenk Sahinalp, and Rolf Backofen. A partition function algorithm for interacting nucleic acid strands. Bioinformatics, 25(12):i365–i373, 2009. Also ISMB/ECCB proceedings.
 [24] Hamidreza Chitsaz, Rolf Backofen, and S.Cenk Sahinalp. biRNA: Fast RNARNA binding sites prediction. In S.L. Salzberg and T. Warnow, editors, Workshop on Algorithms in Bioinformatics (WABI), volume 5724 of LNBI, pages 25–36, Berlin, Heidelberg, 2009. SpringerVerlag.
 [25] S.H. Bernhart, H. Tafer, U. Mückstein, C. Flamm, P.F. Stadler, and I.L. Hofacker. Partition function and base pairing probabilities of RNA heterodimers. Algorithms Mol Biol, 1:3, 2006.
 [26] C. Honer zu Siederdissen, S. H. Bernhart, P. F. Stadler, and I. L. Hofacker. A folding algorithm for extended RNA secondary structures. Bioinformatics, 27(13):i129–136, Jul 2011.
 [27] Fenix W. D. Huang, Jing Qin, Christian M. Reidys, and Peter F. Stadler. Target prediction and a statistical sampling algorithm for RNARNA interaction. Bioinformatics, 26(2):175–181, 2010.
 [28] Rolf Backofen, Dekel Tsur, Shay Zakov, and Michal ZivUkelson. Sparse RNA folding: Time and space efficient algorithms. Journal of Discrete Algorithms, 9(1):12–31, 2011. 20th Anniversary Edition of the Annual Symposium on Combinatorial Pattern Matching (CPM 2009).
 [29] C. B. Do, D. A. Woods, and S. Batzoglou. CONTRAfold: RNA secondary structure prediction without physicsbased models. Bioinformatics, 22:90–98, Jul 2006.
 [30] M. Andronescu, A. Condon, H. H. Hoos, D. H. Mathews, and K. P. Murphy. Computational approaches for RNA energy parameter estimation. RNA, 16:2304–2318, Dec 2010.
 [31] Shay Zakov, Yoav Goldberg, Michael Elhadad, and Michal ZivUkelson. Rich parameterization improves RNA structure prediction. In Vineet Bafna and S. Sahinalp, editors, Proceedings of the 15th Annual international conference on Research in Computational Molecular Biology, volume 6577 of Lecture Notes in Computer Science, pages 546–562. Springer BerlinHeidelberg, 2011.
 [32] E. Rivas, R. Lang, and S. R. Eddy. A range of complex probabilistic models for RNA secondary structure prediction that includes the nearestneighbor model and more. RNA, 18(2):193–212, Feb 2012.
 [33] M. E. Dyer. The Complexity of Vertex Enumeration Methods. Mathematics of Operations Research, 8(3):381–402, 1983.
 [34] M. Andronescu, V. Bereg, H. H. Hoos, and A. Condon. RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinformatics, 9:340, 2008.
 [35] S. W. Burge, J. Daub, R. Eberhardt, J. Tate, L. Barquist, E. P. Nawrocki, S. R. Eddy, P. P. Gardner, and A. Bateman. Rfam 11.0: 10 years of RNA families. Nucleic Acids Res., 41(Database issue):D226–232, Jan 2013.
 [36] C. Faber, M. Scharpf, T. Becker, H. Sticht, and P. Rosch. The structure of the coliphage HK022 Nun proteinlambdaphage boxB RNA complex. Implications for the mechanism of transcription termination. J. Biol. Chem., 276(34):32064–32070, Aug 2001.
 [37] L. D. Finger, L. Trantirek, C. Johansson, and J. Feigon. Solution structures of stemloop RNAs that bind to the two Nterminal RNAbinding domains of nucleolin. Nucleic Acids Res., 31(22):6461–6472, Nov 2003.
 [38] D. W. Staple and S. E. Butcher. Solution structure and thermodynamic investigation of the HIV1 frameshift inducing element. J. Mol. Biol., 349(5):1011–1023, Jun 2005.