Fixed-Parameter Tractable Sampling for RNA Design with Multiple Target Structures
Motivation: The design of multi-stable RNA molecules has important applications in biology, medicine, and biotechnology. Synthetic design approaches profit strongly from effective in-silico methods, which can tremendously impact their cost and feasibility.
Results: We revisit a central ingredient of most in-silico design methods: the sampling of sequences for the design of multi-target structures, possibly including pseudoknots. For this task, we present the efficient, tree decomposition-based algorithm RNARedPrint. Our fixed-parameter tractable approach is underpinned by establishing the #P-hardness of the underlying counting problem, i.e. of exact uniform sampling. Modeling the problem as a constraint network, RNARedPrint supports generic Boltzmann-weighted sampling for arbitrary additive RNA energy models; this enables the generation of RNA sequences meeting specific goals like expected free energies or GC-content. Finally, we empirically study general properties of the approach and generate biologically relevant multi-target Boltzmann-weighted designs for a common design benchmark. Generating seed sequences with RNARedPrint, we demonstrate significant improvements over the previously best multi-target sampling strategy (uniform sampling).
Availability: Our software is freely available at: https://github.com/yannponty/RNARedPrint
Synthetic biology pursues the engineering of artificial biological systems, promising broad applications in biology, biotechnology and medicine. Centrally, this requires the design of biological macromolecules with highly specific properties and programmable functions. In particular, RNAs present themselves as well-suited tools for rational design targeting specific functions (Kushwaha2016). RNA function is tightly coupled to the formation of secondary structure, as well as changes in base pairing propensities and the accessibility of regions, e.g. by burying or exposing interaction sites (Rodrigo2014). At the same time, the thermodynamics of RNA secondary structure is well understood and its prediction is computationally tractable (McCaskill1990). Thus, structure can serve as an effective proxy within rational design approaches, ultimately targeting catalytic (Zhang2013) or regulatory (Rodrigo2014) functions.
The function of many RNAs depends on their selective folding into one or several alternative conformations. Classic examples include riboswitches, which notoriously adopt different stable structures upon binding a specific ligand. Riboswitches have been a popular application of rational design (Wachsmuth2013; Domin2017), partly motivated by their capacity to act as biosensors (Findeiss2017). At the co-transcriptional level, certain RNA families feature alternative, evolutionarily conserved, transient structures (Zhu2013), which facilitate the ultimate adoption of functional structures at full elongation. More generally, simultaneous compatibility to multiple structures is a relevant design objective for engineering kinetically controlled RNAs, finally targeting prescribed folding pathways. Thus, modern applications of RNA design often target multiple structures, additionally aiming at other features, such as specific GC-content (Reinharz2013) or the presence/absence of functionally relevant motifs (either anywhere or at specific positions) (Zhou2013); these objectives motivate flexible computational design methods.
Many computational methods for RNA design rely on similar overall strategies: initially generating one or several seed sequences and optimizing them subsequently. In many cases, the seed quality was found to be critical for the empirical performance of RNA design methods (Levin2012). For instance, random seed generation improves the prospect of subsequent optimizations, helping to overcome local optima of the objective function, and increases the diversity across designs (Reinharz2013). For single-target approaches, INFO-RNA (Busch2006) made significant improvements mainly by starting its local search from the minimum energy sequence for the target structure instead of the (uniform) random sequences used by the early RNAinverse algorithm (Hofacker1994). This strategy was later shown to result in unrealistically high GC-contents in designed sequences. To address this issue, IncaRNAtion (Reinharz2013) controls the GC-content through an adaptive sampling strategy.
Specifically, for multi-target design, virtually all available methods (Lyngsoe2012; HoenerzuSiederdissen2013; Taneda2015; Hammer2017) follow the same overall generation/optimization scheme. Facing the complex sequence constraints induced by multiple targets, early methods such as Frnakenstein (Lyngsoe2012) and Modena (Taneda2015) did not attempt to solve sequence generation systematically, but relied on ad-hoc sampling strategies. Recently, the RNAdesign approach (HoenerzuSiederdissen2013), coupled with powerful local search within RNAblueprint (Hammer2017), solved the problem of sampling seeds from the uniform distribution for multiple targets. These methods adopt a graph coloring perspective, initially decomposing the graph hierarchically using various decomposition algorithms, and precomputing the number of valid sequences within each subgraph. The decomposition is then reinterpreted as a decision tree to perform a stochastic backtrack, inspired by Ding and Lawrence (Ding2003). Uniform sampling is achieved by choosing individual nucleotide assignments with probabilities derived from the subsolution counts. The overall complexity of RNAdesign grows exponentially in a decomposition-dependent parameter, which is bounded by the length of the designed RNA; typically, the decomposition strategy achieves a much lower parameter value.
The exponential time and space requirements of the RNAdesign method already raise the question of the complexity of (uniform) sampling for multi-target design. Since the stochastic backtrack can be performed in linear time per sample, the method is dominated by the precomputation step, which requires counting valid designs. Thus, we focus on the question: Is there a polynomial-time algorithm to count valid multi-target designs? In Section 5, we answer in the negative, showing that there exists no such algorithm unless $\mathrm{FP}=\#\mathrm{P}$. Our result relies on a surprising bijection (up to a trivial symmetry) between valid sequences and independent sets of a bipartite graph, a problem that has been the object of recent breakthroughs in approximate counting complexity (Bulatov2013; Cai2016). The hardness of counting (and conjectured hardness of sampling) does not, however, preclude practically applicable algorithms for counting and sampling. In particular, we wish to extend the flexibility of multi-structure design, leading to the following questions: How to sample, generally, from a Boltzmann distribution based on expressive energy models? How to enforce additional constraints, such as the GC-content, complex sequence constraints, or the energy of individual structures?
To answer these questions, we introduce a generic framework (illustrated in Fig. 1) enabling efficient Boltzmann-weighted sampling over RNA sequences with multiple target structures (Section 3). Guided by a tree decomposition of the network, we devise dynamic programming algorithms to compute partition functions and sample sequences from the Boltzmann distribution. We show that these algorithms are fixed-parameter tractable in the treewidth; in practice, we limit this parameter by using state-of-the-art tree decomposition algorithms. By evaluating (partial) sequences in a weighted constraint network, we support arbitrary multi-ary constraints and thus arbitrarily complex energy models, notably subsuming all commonly used RNA energy models. Moreover, we describe an adaptive sampling strategy to control the free energies of the individual target structures and the GC-content. We observe that sampling based on less complex RNA energy models (taking only the most important energy contributions into account) still allows targeting realistic RNA energies in the well-accepted Turner RNA energy model. The resulting combination of efficiency and accuracy finally enables generating biologically relevant multi-target designs, as demonstrated by applying our overall strategy to a large set of multi-target RNA design instances from a representative benchmark (Section 4).
2 Definitions and problem statement
An RNA sequence is a word over the nucleotide alphabet $\Sigma=\{A,C,G,U\}$; let $\Sigma^n$ denote the set of sequences of length $n$. An RNA (secondary) structure $R$ of length $n$ is a set of base pairs $(i,j)$, where $1\le i<j\le n$, and where for all different $(i,j),(i',j')\in R$: $\{i,j\}\cap\{i',j'\}=\emptyset$ (“degree 1”). A valid base pair must pair bases from $\mathcal B=\{AU,UA,CG,GC,GU,UG\}$. Consequently, a sequence $S$ is valid for $R$, iff $S_iS_j\in\mathcal B$ for all $(i,j)\in R$.
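As a concrete illustration, the validity of a sequence for a structure can be checked directly from this definition (a minimal Python sketch; the helper names are ours, not part of RNARedPrint):

```python
# Canonical base pairs: Watson-Crick pairs plus the G-U wobble pair.
VALID_PAIRS = {"AU", "UA", "CG", "GC", "GU", "UG"}

def is_valid(seq, structure):
    """A sequence is valid for a structure iff every base pair (i, j)
    (0-based indices here) carries a canonical pair of nucleotides."""
    return all(seq[i] + seq[j] in VALID_PAIRS for (i, j) in structure)
```

For instance, `is_valid("GCAAAGC", [(0, 6), (1, 5)])` holds, since both pairs are canonical.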
We consider a fixed set $\mathcal R=\{R_1,\dots,R_k\}$ of target RNA structures for sequences of length $n$. $\mathcal R$ induces a base pair dependency graph $G_{\mathcal R}=(V,E)$ with nodes $V=\{1,\dots,n\}$ and edges $E=\bigcup_{R\in\mathcal R}R$, which describes the minimal dependencies present in all relevant settings due to the requirement of canonical base pairing.
One can interpret the valid sequences for $\mathcal R$ as colorings of $G_{\mathcal R}$ in a slightly modified graph coloring variant, where the colors (from $\Sigma$) assigned to adjacent vertices of $G_{\mathcal R}$ must constitute valid base pairs. We define the energy of a sequence $S$ based on the set of structures as $E(S)=\sum_{R\in\mathcal R}E_R(S)+E_{\mathrm{seq}}(S)$. In our setting, the energy is thus additively composed of the energies of the single RNA structures in an RNA energy model, as well as sequence-dependent features like GC-content. Furthermore, note that $E(S)$ is finite iff $S$ is valid for each structure $R\in\mathcal R$.
At the core of this work, we study the computation of partition functions over sequences.
Central problem (Partition function over sequences). Given an energy function $E$ and a set $\mathcal R$ of structures of length $n$, compute the partition function
$$\mathcal Z=\sum_{S\in\Sigma^n}e^{-\beta E(S)},$$
where $\beta$ denotes the inverse pseudo-temperature, and $\Sigma^n$ the set of sequences of length $n$.
As we elaborate in subsequent sections, our approach relies on breaking down the energy function into additive components, each depending on only few sequence positions. Given $E$, we express $E(S)$ as the sum of energy contributions over a set $F$ of functions, s.t. $E(S)=\sum_{f\in F}f(S)$. This captures realistic RNA energy models (including nearest neighbor models, e.g. Turner2009, and even pseudoknot models, e.g. Andronescu2010), while bounding the dependencies on sequence positions introduced by each single $f$. Formally, define the dependencies of $f$ as the minimum set of sequence positions $\mathrm{dep}(f)\subseteq\{1,\dots,n\}$, where $f(S)=f(S')$ for all sequences $S$ and $S'$ that agree at the positions in $\mathrm{dep}(f)$.
Each set of functions $F$ (on sequences of length $n$) induces a dependency graph on sequence positions, namely the hypergraph $D_F=(\{1,\dots,n\},\{\mathrm{dep}(f)\mid f\in F\})$. Our algorithms will critically rely on a tree decomposition of this dependency graph, which we define below.
Definition 1 (Tree decomposition and width)
Let $G=(V,E)$ be a hypergraph. A tree decomposition of $G$ is a pair $(T,\chi)$, where $T=(V_T,E_T)$ is an unrooted tree/forest, and $\chi(u)$ (for each $u\in V_T$) is a set of vertices of $G$ assigned to the tree node $u$, such that
each $v\in V$ occurs in at least one $\chi(u)$;
for all $v\in V$, the node set $\{u\in V_T\mid v\in\chi(u)\}$ induces a connected subtree of $T$;
for all $e\in E$, there is a node $u\in V_T$, such that $e\subseteq\chi(u)$.
The width of a tree decomposition is defined as $\max_{u\in V_T}|\chi(u)|-1$. The treewidth of $G$ is the smallest width of any tree decomposition of $G$.
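The three conditions and the width can be checked mechanically; the sketch below (our own helper names, not part of RNARedPrint) verifies a decomposition of a small graph:

```python
from collections import defaultdict, deque

def width(bags):
    """Width of a tree decomposition: largest bag size minus one."""
    return max(len(b) for b in bags.values()) - 1

def is_tree_decomposition(vertices, hyperedges, bags, tree_edges):
    """Check the three conditions of Definition 1."""
    # 1. every vertex occurs in at least one bag
    if not all(any(v in b for b in bags.values()) for v in vertices):
        return False
    # 3. every (hyper)edge is contained in some bag
    if not all(any(set(e) <= b for b in bags.values()) for e in hyperedges):
        return False
    # 2. for each vertex, the nodes whose bags contain it induce a connected subtree
    adj = defaultdict(set)
    for u, w in tree_edges:
        adj[u].add(w)
        adj[w].add(u)
    for v in vertices:
        nodes = {u for u, b in bags.items() if v in b}
        start = next(iter(nodes))
        seen, todo = {start}, deque([start])
        while todo:
            u = todo.popleft()
            for w in adj[u] & nodes:
                if w not in seen:
                    seen.add(w)
                    todo.append(w)
        if seen != nodes:
            return False
    return True
```

For the path 1-2-3, the bags {1,2} and {2,3} joined by one tree edge form a decomposition of width 1.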
3 An FPT algorithm for the partition function and sampling of Boltzmann-weighted designs
For our algorithmic description, we translate the concepts of Section 2 to the formalism of constraint networks, here specialized as RNA design networks. This allows us to base our algorithm on the cluster tree elimination (CTE) of Dechter2013. In the RNA design network, (partially determined) RNA sequences replace the more general concept of (partial) assignments in constraint networks. Partially determined RNA sequences, for short partial sequences, are words $\hat S$ over the alphabet $\Sigma\cup\{\bot\}$, equivalently representing the set of RNA sequences $S$ where, for all positions $i$, $\hat S_i\neq\bot$ implies $S_i=\hat S_i$. The positions $i$, where $\hat S_i\neq\bot$, are called determined by $\hat S$ and form its domain $\mathrm{dom}(\hat S)$. Since the functions $f$ of Section 2 depend on only the subset $\mathrm{dep}(f)$ of sequence positions, one can evaluate them for partial sequences that determine (at least) the nucleotides at all positions in $\mathrm{dep}(f)$. Thus, for functions $f$ and partial sequences $\hat S$ that determine $\mathrm{dep}(f)$, we write $f(\hat S)$ to evaluate $f$ for $\hat S$; i.e. $f(\hat S)=f(S)$ for any sequence $S$ represented by $\hat S$.
An RNA design network (for sequences of length $n$) is a tuple $\mathcal N=(X,F)$, where
$X=\{1,\dots,n\}$ is the set of sequence positions,
$F$ is a set of functions $f:\Sigma^{\mathrm{dep}(f)}\to\mathbb R\cup\{+\infty\}$.
The energy of a sequence $S$ in a network $\mathcal N$ is defined as the sum of the values of all functions in $F$ evaluated for $S$, i.e. $E_{\mathcal N}(S)=\sum_{f\in F}f(S)$.
The network energy corresponds to the energy $E(S)$ of Section 2 whenever this energy is modeled as a sum of the functions in $F$. Consequently, the partition function $\mathcal Z$ of Section 2 is modeled as the network partition function $\mathcal Z_{\mathcal N}=\sum_{S\in\Sigma^n}e^{-\beta E_{\mathcal N}(S)}$.
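On toy instances, the network partition function can be checked against its definition by brute force (a Python sketch with our own naming; with energies 0 for valid and $+\infty$ for invalid pairs, $\mathcal Z_{\mathcal N}$ reduces to the number of valid designs):

```python
import math
from itertools import product

SIGMA = "ACGU"
VALID_PAIRS = {"AU", "UA", "CG", "GC", "GU", "UG"}

def partition_function(n, functions, beta=1.0):
    """Brute-force Z = sum over all sequences S of exp(-beta * E(S)),
    where E(S) is the sum of all network functions evaluated on S."""
    z = 0.0
    for s in product(SIGMA, repeat=n):
        energy = sum(f(s) for f in functions)
        if energy != math.inf:
            z += math.exp(-beta * energy)
    return z

# one binary function: energy 0 for a canonical pair (0, 1), +infinity otherwise
pair01 = lambda s: 0.0 if s[0] + s[1] in VALID_PAIRS else math.inf
```

With the single constraint `pair01`, the partition function counts the 6 canonical pairs.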
3.1 Partition function and Boltzmann sampling through stochastic backtrack
The minimum energy, counting, and partition function over an RNA design network can be computed by dynamic programming based on a tree decomposition of the network's dependency graph (i.e. by cluster tree elimination). We focus on the efficient computation of the partition function.
We require additional definitions: A cluster tree for the network $\mathcal N=(X,F)$ is a tuple $(T,\chi,\phi)$, where $(T,\chi)$ is a tree decomposition of the dependency graph of $F$, and $\phi$ assigns to each node $u$ of $T$ a set of functions $\phi(u)\subseteq F$, such that each $f\in F$ is uniquely assigned to a node $u$ with $\mathrm{dep}(f)\subseteq\chi(u)$, and $\bigcup_u\phi(u)=F$. For two nodes $u$ and $v$ of the cluster tree, define their separator as $\mathrm{sep}(u,v)=\chi(u)\cap\chi(v)$; moreover, we define the difference positions from $u$ to an adjacent $v$ by $\mathrm{diff}(u,v)=\chi(v)\setminus\chi(u)$.
For a set $P$ of sequence positions, write $\Sigma^P$ to denote the set of all partial sequences that determine exactly the positions in $P$; furthermore, given partial sequences $\hat S$ and $\hat S'$ with disjoint domains, we define the combined partial sequence $\hat S\cdot\hat S'$ such that $(\hat S\cdot\hat S')_i=\hat S_i$ if $i\in\mathrm{dom}(\hat S)$, and $(\hat S\cdot\hat S')_i=\hat S'_i$ otherwise.
Finally, we assume, w.l.o.g., that all position difference sets are singletons: $|\mathrm{diff}(u,v)|\le 1$. For any given cluster tree, an equivalent (in terms of treewidth) cluster tree with this property can always be obtained by inserting linearly many additional clusters.
Let us now consider the computation of the partition function. Given are the RNA design network $\mathcal N$ and its cluster tree $(T,\chi,\phi)$. W.l.o.g., we assume that $T$ is connected and contains a dedicated node $r$, with $\chi(r)=\emptyset$ and $\phi(r)=\emptyset$, added as a virtual root connected to a node in each connected component of $T$. Now, we consider the set of directed edges of $T$ oriented towards $r$; define $T_u$ as the subtree of $T$ induced by $u$ and its descendants. Algorithm 3.1 computes the partition function by passing messages along these directed edges (i.e. always from some child $u$ to its parent $v$). Each message $m_{u\to v}$ is a function that depends on the positions $\mathrm{sep}(u,v)$ and yields a partition function in $\mathbb R_{\ge 0}$. The message from $u$ to $v$ represents the partition functions of the subtree $T_u$ for all possible partial sequences in $\Sigma^{\mathrm{sep}(u,v)}$. Induction over $T$ lets us show the correctness of the algorithm (Supp. Mat. 10). After running Alg. 3.1, multiplying the 0-ary messages sent to the root yields the total partition function: $\mathcal Z_{\mathcal N}=\prod_{(u,r)}m_{u\to r}()$.
Algorithm 3.1: Message computation by cluster tree elimination.
Data: cluster tree $(T,\chi,\phi)$ with virtual root $r$.
Result: messages $m_{u\to v}$ for all directed edges $(u,v)$ towards $r$; i.e. partition functions of the subtrees $T_u$ for all possible partial sequences determining exactly the positions $\mathrm{sep}(u,v)$.
for all directed edges $(u,v)$ in postorder:
    for all $\hat S\in\Sigma^{\mathrm{sep}(u,v)}$:
        $m_{u\to v}(\hat S)\leftarrow\sum_{\hat S'\in\Sigma^{\chi(u)\setminus\mathrm{sep}(u,v)}}\Big(\prod_{f\in\phi(u)}e^{-\beta f(\hat S\cdot\hat S')}\Big)\cdot\prod_{\text{children }w\text{ of }u}m_{w\to u}\big((\hat S\cdot\hat S')_{|\mathrm{sep}(w,u)}\big)$
return the messages $m_{u\to v}$
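This message passing can be sketched compactly; the following Python toy (our own construction, using exhaustive enumeration per bag rather than the paper's optimizations) computes the partition function over a small cluster tree:

```python
import math
from itertools import product

SIGMA = "ACGU"

def cte_partition_function(bags, funcs, children, root, beta=1.0):
    """Toy cluster tree elimination: bags maps node -> tuple of positions,
    funcs maps node -> list of (positions, function) pairs assigned to it,
    children maps node -> list of child nodes. Returns the partition function."""

    def message(u, parent_positions):
        sep = tuple(p for p in bags[u] if p in parent_positions)
        diff = tuple(p for p in bags[u] if p not in parent_positions)
        child_msgs = [message(w, set(bags[u])) for w in children.get(u, [])]
        msg = {}
        for sep_assign in product(SIGMA, repeat=len(sep)):
            total = 0.0
            for diff_assign in product(SIGMA, repeat=len(diff)):
                assign = dict(zip(sep, sep_assign))
                assign.update(zip(diff, diff_assign))
                weight = 1.0
                # local Boltzmann factors of the functions assigned to u
                for positions, f in funcs.get(u, []):
                    weight *= math.exp(-beta * f(tuple(assign[p] for p in positions)))
                # incoming messages from the children, restricted to their separators
                for child_msg, child_sep in child_msgs:
                    weight *= child_msg[tuple(assign[p] for p in child_sep)]
                total += weight
            msg[sep_assign] = total
        return msg, sep

    root_msg, _ = message(root, set())
    return root_msg[()]
```

For two bags (0,1) and (1,2), each carrying a 0/∞ pairing constraint, the result counts the 10 valid length-3 designs.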
The partition functions can then direct a stochastic backtrack to achieve Boltzmann sampling of sequences, such that one samples from the Boltzmann distribution of a given design network $\mathcal N$. The sampling algorithm assumes that the cluster tree was expanded (to singleton difference sets) and that the messages for the directed edges were already generated by Algorithm 3.1 for the expanded cluster tree. Algorithm 3.2 defines the recursive procedure Sample, which returns a partial sequence, randomly drawn from the Boltzmann distribution, that determines all sequence positions in the subtree rooted at $u$. Called on the root $r$ and the empty partial sequence, which does not determine any positions, the procedure samples a random sequence from the Boltzmann distribution.
3.2 Computational complexity of the multiple target sampling algorithm
Note that in the following complexity analysis, we omit the time and space for computing the tree decomposition itself, since we observed that the computation time of the tree decomposition (GreedyFillIn, implemented in LibTW by Dijk2006) for multi-target sampling is negligible compared to Alg. 3.1 (Supp. Mat. 8 and 9).
We define the maximum separator size as $s=\max_{(u,v)}|\mathrm{sep}(u,v)|$ and denote by $d$ the maximum, over all nodes $u$, of $|\phi(u)|$ plus the degree of $u$. In the absence of specific optimizations, running Alg. 3.1 requires $O(|V_T|\cdot d\cdot 4^{w+1})$ time and $O(|V_T|\cdot 4^{s})$ space (Supp. Mat. 11); Alg. 3.2 would require time of the same order per sample on arbitrary tree decompositions (Supp. Mat. 11). W.l.o.g. we assume that $s\le w$; note that tree decompositions can generally be transformed such that $s\le w$ holds. Moreover, the size of $F$ is linearly bounded: for $k$ input structures for sequences of length $n$, the energy function is expressed by $O(kn)$ functions. Finally, the number of cluster tree nodes is in $O(kn)$, such that $|V_T|\in O(kn)$.
Algorithm 3.2: Stochastic backtrack.
Data: node $u$; partial sequence $\hat S\in\Sigma^{\mathrm{sep}(u,\mathrm{parent}(u))}$; cluster tree and the messages $m_{u\to v}$ from Alg. 3.1.
Result: Boltzmann-distributed random partial sequence for the subtree rooted at $u$, specializing the partial sequence $\hat S$.
Function Sample($u$, $\hat S$):
    draw $\hat S'\in\Sigma^{\chi(u)\setminus\mathrm{dom}(\hat S)}$ with probability proportional to $\prod_{f\in\phi(u)}e^{-\beta f(\hat S\cdot\hat S')}\cdot\prod_{\text{children }w\text{ of }u}m_{w\to u}\big((\hat S\cdot\hat S')_{|\mathrm{sep}(w,u)}\big)$
    $\hat S\leftarrow\hat S\cdot\hat S'$
    for all children $w$ of $u$: $\hat S\leftarrow\hat S\cdot\mathrm{Sample}\big(w,\hat S_{|\mathrm{sep}(w,u)}\big)$
    return $\hat S$
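To make the stochastic backtrack concrete, here is a self-contained toy (our construction, not RNARedPrint's code) for the two target pairs (0,1) and (1,2) with all valid sequences weighted uniformly: position 1 is drawn with probability proportional to the product of its incoming "messages" (the numbers of completions on each side), then each subtree is completed recursively:

```python
import random

# partners of each nucleotide under canonical pairing (incl. G-U wobble)
PARTNERS = {"A": "U", "C": "G", "G": "CU", "U": "AG"}

def sample_chain(rng):
    """Draw a uniform valid sequence for base pairs (0, 1) and (1, 2).
    The weight of nucleotide b at position 1 is the product of the two
    subtree 'messages', i.e. len(PARTNERS[b]) per adjacent base pair."""
    weights = [len(PARTNERS[b]) ** 2 for b in "ACGU"]
    b = rng.choices("ACGU", weights=weights)[0]
    # backtrack into both subtrees: each partner is uniform given b
    return rng.choice(PARTNERS[b]) + b + rng.choice(PARTNERS[b])
```

Every sampled word is a valid design by construction; the weighting makes all 10 valid sequences equally likely.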
Theorem 3.1 (Complexities)
For sequence length $n$, $k$ target structures, and a base pair dependency graph of treewidth $w$, the partition function is precomputed in $O(k\,n\,4^{w+1})$ time; subsequently, sequences are generated from the Boltzmann distribution in time linear in $n$ and $k$ per sample.
As shown in Supp. Mat. 12, the complexity of the precomputation can be further improved such that the exponential term depends only on $w_c$, the maximum number of connected components of the dependency graph represented in a single node of the tree decomposition ($w_c\le w+1$).
3.3 Sequence features, constraints, and energy models.
The functions in allow expressing complex features of the sequences alone, e.g. rewarding or penalizing specific sequence motifs, as well as features depending on the target structures. Furthermore, constraints, which enforce or forbid features, are naturally expressed by assigning infinite penalties to invalid (partial) sequences. Finally, but less obviously, the framework captures various RNA energy models with bounded dependencies, which we describe briefly.
In simple base pair-based energy models, the energy is defined as the sum of base pair (pseudo-)energies. If base pair energies $E^R_{ij}(a,b)$ (where $i$ and $j$ are sequence positions, and $a$ and $b$ are bases in $\Sigma$) are given for each target structure $R$, s.t. $E_R(S)=\sum_{(i,j)\in R}E^R_{ij}(S_i,S_j)$, we encode the network energy by the set of functions $f^R_{ij}$, one for each base pair $(i,j)$ of each input structure $R$, that evaluate to $w_R\cdot E^R_{ij}(\hat S_i,\hat S_j)$ under partial sequence $\hat S$; here, $w_R$ is a weight that controls the influence of structure $R$ on the sampling (as elaborated later).
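Assuming per-pair energy tables, such a function set can be assembled as follows (a hypothetical helper using Python closures; one binary function per base pair, scaled by the structure weight):

```python
def make_pair_functions(structures, pair_energy, weights):
    """One binary function per base pair (i, j) of each target structure R,
    evaluating a partial sequence at positions (i, j) to
    weights[R] * pair_energy(a, b)."""
    functions = []
    for r, structure in enumerate(structures):
        for (i, j) in structure:
            # bind the structure index r via the closure's default argument
            functions.append(((i, j),
                              lambda ab, r=r: weights[r] * pair_energy(*ab)))
    return functions
```

Each entry carries its dependency positions alongside the function, matching the network formalism of Section 3.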
More complex loop-based energy models, e.g. the Turner model, which among others includes energy terms for special loops and dangling ends, can also be encoded as straightforward extensions. An interesting stripped-down variant of the nearest neighbor model is the stacking energy model. This model assigns non-zero energy contributions only to stacks, i.e. interior loops with closing base pair $(i,j)$ and inner base pair $(i+1,j-1)$.
The arity of the introduced functions provides an important bound on the treewidth of the network (and therefore on the computational complexity). Thus, it is noteworthy that the base pair energy model requires only binary functions, and the stacking model only quaternary dependencies. This arity is increased in a few cases by the commonly used Turner 2004 model (Turner2009) for encoding tabulated special hairpin and interior loop contributions, which depend on up to nine bases for the interior loops with a total of 5 unpaired bases ("2x3" interior loops); all other energy contributions (like dangling ends) still depend on at most four bases of the sequence.
3.4 Extension to multidimensional Boltzmann sampling
The flexibility of our framework supports an advanced sampling technique, named multidimensional Boltzmann sampling (Bodini2010), to (probabilistically) enforce additional constraints. This technique was previously used to control the GC-content (Waldispuehl2011; Reinharz2013) and dinucleotide content (Zhang2013) of sampled RNA sequences; here, it moreover enables explicit control of the free energies of the single targets.
Multidimensional Boltzmann sampling requires the ability to sample from a weighted distribution over the set of compatible sequences, where the probability of a sequence $S$ with free energies $E_1(S),\dots,E_k(S)$ for its target structures is
$$\mathbb P(S)=\frac{1}{\mathcal Z_{\mathbf w}}\prod_{i=1}^{k}w_i^{-E_i(S)},$$
where $\mathbf w=(w_1,\dots,w_k)$ is a vector of positive real-valued weights, and $\mathcal Z_{\mathbf w}$ is the weighted partition function. Such a distribution can be induced by a simple modification of the functions described in Sec. 3.3, replacing any energy function $f$ contributing to structure $R_i$ by $(\ln w_i/\beta)\cdot f$, so that $e^{-\beta(\ln w_i/\beta)f(S)}=w_i^{-f(S)}$. The probability of a sequence is thus proportional to $\prod_i w_i^{-E_i(S)}$.
One then needs to learn a weight vector $\mathbf w$ such that, on average, the targeted energies $E_i^\star$ are achieved by random sequences in the weighted distribution, i.e. such that $\mathbb E_{\mathbf w}[E_i]=E_i^\star$ for all $1\le i\le k$. The expected value of $E_i$ is always decreasing for increasing weights (see Supp. Mat. 14). More generally, computing a suitable $\mathbf w$ can be restated as a convex optimization problem, and be efficiently solved using a wide array of methods (Denise2010; Bendkowski2017). In practice, we use a simple heuristic which starts from an initial weight vector and, at each iteration, generates a sample of sequences. The expected value of an energy $E_i$ is estimated by the sample mean $\hat E_i$, and the weights are updated at each iteration by $w_i\leftarrow w_i\cdot\gamma^{\hat E_i-E_i^\star}$ for a constant $\gamma>1$, chosen empirically to achieve effective optimization. While heuristic in nature, this basic iteration was elected in our initial version of RNARedPrint because of its reasonable empirical behavior.
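The iteration can be sketched as follows (a toy in which a closed-form expected energy stands in for the sampling estimate; the two-state model and the value of γ are our illustrative choices, not parameters of RNARedPrint):

```python
def tune_weight(target, expected_energy, gamma=1.2, iters=300):
    """Multiplicative update w <- w * gamma**(estimate - target).
    Since the expected energy decreases as w grows, an estimate above
    the target raises w, pushing the expectation back down."""
    w = 1.0
    for _ in range(iters):
        w *= gamma ** (expected_energy(w) - target)
    return w

# toy two-state model with energies 0 and 1: P(E=1) is proportional to w**-1,
# hence the expected energy is 1 / (1 + w)
toy_expectation = lambda w: 1.0 / (1.0 + w)
```

Targeting an expected energy of 0.25 in this toy model drives the weight towards the analytic fixed point w = 3.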
A further rejection step is applied to the generated sequences to retain only those whose energy for each structure $R_i$ belongs to $[E_i^\star-\varepsilon,E_i^\star+\varepsilon]$, for some predefined tolerance $\varepsilon$. The rejection approach is justified by the following considerations: i) enacting an exact control over the energies would be technically hard and costly. Indeed, controlling the energies through dynamic programming would require explicit convolution products, generalizing Cupal1996, inducing additional time and space overheads; ii) induced distributions are typically concentrated. Intuitively, unless sequences are fully constrained, individual energy terms are independent enough so that their sum is concentrated around its mean, the targeted energy (cf. Fig. 2). For base pair-based energy models and special base pair dependency graphs (paths, cycles…), this property rigorously follows from analytic combinatorics, see Bender1983 and Drmota1997. In such cases, the expected number of rejections before reaching the targeted energies remains constant when a fixed tolerance $\varepsilon>0$ is allowed, and grows only polynomially in $n$ when the energies must be matched exactly. The GC-content of designs can also be controlled, jointly and in a similar way, as done in IncaRNAtion (Reinharz2013).
4.1 Targeting Turner energies and GC-content
We implemented the Boltzmann sampling approach (Algorithms 3.1 and 3.2), performing sampling for given target structures and weights $w_R$; on top of this, multidimensional Boltzmann sampling (see Section 3.4) targets specific energies and GC-content. Our tool RNARedPrint evaluates energies according to the base pair energy model or the stacking energy model, using parameters which were trained to approximate Turner energies (Supp. Mat. 13).
To capture the realistic Turner model, we exploit its very good correlation (Supp. Fig. 6) with our simple stacking-based model. Namely, we observe a structure-specific affine dependency between the Turner and stacking energy models, $E^{\mathrm{Turner}}_R(S)\approx a_R\cdot E^{\mathrm{stack}}_R(S)+b_R$ for each structure $R$. We learn the parameters $a_R$ and $b_R$ from a set of sequences generated with homogeneous weights, tuning only the GC-content to a predetermined target frequency. Finally, to target a Turner energy $E_R^\star$, we adjust the targeted energy of our stacking model to $(E_R^\star-b_R)/a_R$.
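The affine calibration amounts to a simple least-squares fit, followed by inverting the fitted relation (a sketch with made-up data points; in RNARedPrint the points would be stacking vs. Turner energies of sampled sequences):

```python
def affine_fit(xs, ys):
    """Least-squares fit ys = a * xs + b, returning (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return a, mean_y - a * mean_x

def adjusted_target(turner_target, a, b):
    """Invert E_turner = a * E_stack + b to get the stacking-model target."""
    return (turner_target - b) / a
```

On perfectly affine data the fit recovers slope and intercept exactly, and the adjusted target maps a Turner energy back onto the stacking scale.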
To illustrate our above strategy, we sampled sequences targeting a fixed GC-content and different Turner energies for the three target structures of an example instance. Fig. 2A illustrates how the Turner energy distributions of the shown target structures can be accurately shifted to prescribed target energies; for comparison, we plot the respective energy distributions of uniformly and Boltzmann sampled sequences. Fig. 2B summarizes the targeting accuracy for the single structure energies over a larger set of instances from the Modena benchmark. We emphasize that only the simple stacking energies are directly targeted (at 10% tolerance, except for a few hard instances). Due to our linear adjustment, we finally achieve well-controlled energies in the realistic Turner energy model.
4.2 Generating high-quality seeds for further optimization
We empirically study the capacity of RNARedPrint to improve the generation of seed sequences for subsequent local optimization, demonstrating an important application in a biologically relevant scenario.
For a multi-target design instance, reasonable target energies for the individual structures are estimated by averaging the energies over samples at relatively high weights, setting the targeted GC-content to 50%. For each instance of the Modena benchmark (Taneda2015), we estimated such target energies and subsequently generated seed sequences at these energies. Our procedure is tailored to produce sequences with similar Turner energies that favor the stability of the target structures while having moderate GC-content; all these properties are desirable objectives for RNA design. All sampled sequences are evaluated under the multi-stable design objective function of Hammer2017:
$$\sum_{i=1}^{k}\big(E(S,R_i)-G(S)\big)+\xi\sum_{1\le i<j\le k}\big|E(S,R_i)-E(S,R_j)\big|,$$
where $E(S,R_i)$ denotes the Turner energy of $S$ in structure $R_i$, $G(S)$ the ensemble free energy of $S$, and $\xi$ a weighting constant.
Using our strategy, we generated at least 1,000 seeds per instance of the subsets RNAtabupath (2str), 3str, and 4str of the Modena benchmark set, comprising instances with 2, 3, and 4 target structures, respectively. The results are compared, in terms of their value for the objective function of Eq. (4.2), to seed sequences uniformly sampled using RNAblueprint (Hammer2017) (Table 1, seeds). Our first analysis reveals that Boltzmann-sampled sequences constitute better seeds, associated with substantially better objective values, compared to uniform seeds for all instances of our benchmark (Supp. Mat. 15).
Moreover, for each generated seed sequence, we optimized the objective function through an adaptive greedy walk consisting of 500 steps. At each step, we resampled (uniformly at random) the positions of a randomly selected component in the base pair dependency graph, and accepted the modification only if it resulted in a gain, as described for RNAblueprint (Hammer2017). Once again, for all instances, we observe an improvement of the quality of Boltzmann designs over uniform ones, suggesting a superior potential for optimization (Table 1, optimized). Notably, our subsequent optimization runs, which are directly inherited from the uniformly sampling RNAblueprint software, partially level out the advantages of Boltzmann sampling. In future work, we hope to improve this aspect by exploiting Boltzmann sampling even during the optimization run.
| | Subset | Boltzmann | Uniform | Improvement |
| Seeds | 2str | 21.67 (4.38) | 37.74 (6.45) | 73% |
| | 3str | 18.09 (3.98) | 30.49 (5.41) | 71% |
| | 4str | 19.94 (3.84) | 32.29 (5.24) | 63% |
| Optimized | 2str | 5.84 (1.31) | 7.95 (1.76) | 28% |
| | 3str | 5.08 (1.10) | 7.04 (1.52) | 31% |
5 Counting valid designs is #P-hard
Finally, we turn to the complexity of #Designs, the problem of computing the number of valid sequences for a given compatibility graph $G=(V,E)$. Note that this problem corresponds to the partition function problem in a simple base pair model whose energies take only the values $0$ (valid base pair) and $+\infty$ (invalid base pair). As previously noted (Flamm2001), a set of target structures admits a valid design iff its compatibility graph is bipartite, which can be tested in linear time. Moreover, without G-U base pairs, any connected component is entirely determined by the assignment of a single one of its nucleotides. The number of valid designs is then simply $4^{c}$, where $c$ is the number of connected components.
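This easy case can be verified directly (our sketch): count the connected components with a union-find structure and compare $4^c$ against brute-force enumeration under the G-U-free pairing rules:

```python
from itertools import product

COMPLEMENTARY = {"AU", "UA", "CG", "GC"}  # no G-U wobble pairs here

def count_designs_no_gu(n, edges):
    """4**c, with c the number of connected components of the
    compatibility graph (assumed bipartite, hence designable)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i, j in edges:
        parent[find(i)] = find(j)
    c = len({find(i) for i in range(n)})
    return 4 ** c

def brute_force_count(n, edges):
    """Enumerate all 4**n sequences and count the valid ones."""
    return sum(all(s[i] + s[j] in COMPLEMENTARY for i, j in edges)
               for s in product("ACGU", repeat=n))
```

For a path on 3 positions (one component) both counts give 4; for two disjoint pairs they give 16.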
The introduction of G-U base pairs radically changes this picture, and we show that valid designs for a set of structures cannot be counted in polynomial time, unless $\mathrm{FP}=\#\mathrm{P}$. The latter equality would, in particular, imply the classic $\mathrm P=\mathrm{NP}$; solving #Designs in polynomial time is therefore highly unlikely.
To establish that claim, we consider instances $G=(V_1\uplus V_2,E)$ that are connected and bipartite, noting that hardness over restricted instances implies hardness of the general problem. Moreover, remark that, as observed in Subsec. 3.2, assigning a nucleotide to a position constrains the parity (purine $\{A,G\}$ or pyrimidine $\{C,U\}$) of all positions in the connected component of $G$. For this reason, we restrict our attention to the counting of valid designs up to this trivial symmetry, by constraining the positions in $V_1$ (resp. $V_2$) to only $A$ and $G$ (resp. $C$ and $U$). Let $\mathcal D^\star(G)$ denote the subset of all designs for $G$ under this constraint, noting that $|\mathcal D(G)|=2\cdot|\mathcal D^\star(G)|$ for connected $G$.
Recall that an independent set of $G$ is a subset $I\subseteq V$ of nodes such that no two nodes of $I$ are connected by an edge in $G$. Let $\mathcal I(G)$ denote the set of all independent sets in $G$.
Lemma 1: $\mathcal D^\star(G)$ is in bijection with $\mathcal I(G)$.
Consider the function $\Phi:\mathcal D^\star(G)\to\mathcal I(G)$ defined by $\Phi(S)=\{v\in V\mid S_v\in\{A,C\}\}$.
Let us establish the injectivity of $\Phi$, i.e. that $\Phi(S)\neq\Phi(S')$ for all $S\neq S'$. If $S\neq S'$, then there exists a position $v$ such that $S_v\neq S'_v$. Assume w.l.o.g. that $v\in V_1$, and recall that the only nucleotides allowed in $V_1$ are $A$ and $G$. Since $S_v\neq S'_v$, exactly one of $S_v$ and $S'_v$ equals $A$, and we conclude that $\Phi(S)$ differs from $\Phi(S')$ at least with respect to its inclusion/exclusion of $v$.
We turn to the surjectivity of $\Phi$, i.e. the existence of a preimage for each element $I\in\mathcal I(G)$. Let us consider the function $\Psi$ defined as
$$\Psi(I)_v=\begin{cases}A&\text{if }v\in V_1\cap I\\G&\text{if }v\in V_1\setminus I\\C&\text{if }v\in V_2\cap I\\U&\text{if }v\in V_2\setminus I.\end{cases}$$
One easily verifies that $\Phi(\Psi(I))=I$.
We are then left to determine whether $\Psi(I)$ is a valid design for $G$, i.e. whether for each edge $\{u,v\}\in E$, the nucleotides $\Psi(I)_u$ and $\Psi(I)_v$ form a valid base pair. Since $G$ is bipartite, any edge in $E$ involves two nodes $u\in V_1$ and $v\in V_2$. Remark that, among all the possible assignments $\Psi(I)_u\in\{A,G\}$ and $\Psi(I)_v\in\{C,U\}$, the only invalid combination of nucleotides is $(A,C)$. However, these nucleotides are assigned only to positions that are in the independent set $I$, and therefore cannot be adjacent. We conclude that $\Phi$ is surjective, and thus bijective.
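The bijection is easy to check exhaustively on small graphs (our sketch; Φ maps a constrained design to the set of positions carrying A or C):

```python
from itertools import combinations, product

VALID = {"AU", "UA", "CG", "GC", "GU", "UG"}

def constrained_designs(n, v1, edges):
    """Valid designs with positions in v1 restricted to {A, G} and the
    remaining positions to {C, U} (the 'up to symmetry' restriction)."""
    domains = ["AG" if i in v1 else "CU" for i in range(n)]
    return {s for s in product(*domains)
            if all(s[i] + s[j] in VALID for i, j in edges)}

def independent_sets(n, edges):
    """All independent sets of the graph, by exhaustive enumeration."""
    return {frozenset(sub) for k in range(n + 1)
            for sub in combinations(range(n), k)
            if not any(i in sub and j in sub for i, j in edges)}

def phi(design):
    """The bijection: positions assigned A or C form an independent set."""
    return frozenset(i for i, a in enumerate(design) if a in "AC")
```

On the path 0-1-2 with parts {0,2} and {1}, there are exactly 5 constrained designs, matching the 5 independent sets.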
Now we can build on the connection between the two problems to obtain complexity results for #Designs. Counting independent sets in bipartite graphs (#BIS) is indeed a well-studied #P-hard problem (Ge2012), from which we immediately conclude:
#Designs is #P-hard.
Note that #BIS is also #P-hard on connected graphs, as the number of independent sets of a disconnected graph $G$ with connected components $G_1,\dots,G_m$ factorizes as $|\mathcal I(G)|=\prod_{j=1}^m|\mathcal I(G_j)|$. Thus any efficient algorithm for #BIS on connected instances provides an efficient algorithm for general graphs.
Let us now hypothesize the existence of a polynomial-time algorithm $\mathcal A$ for #Designs over connected graphs $G$. Consider the (polynomial-time) algorithm that first executes $\mathcal A$ on $G$ to produce $|\mathcal D(G)|$, and returns $|\mathcal D(G)|/2=|\mathcal D^\star(G)|=|\mathcal I(G)|$. Clearly, it solves #BIS on connected instances in polynomial time. This means that #Designs is at least as hard as #BIS, and thus does not admit a polynomial-time exact algorithm unless $\mathrm{FP}=\#\mathrm{P}$.
Motivated by the hardness of the problem established here, we introduced a general framework and efficient algorithms for the design of RNA sequences with multiple target structures and fine-tuned properties. Our method combines an FPT stochastic sampling algorithm with multidimensional Boltzmann sampling over distributions controlled by expressive RNA energy models. Compared to the previously best available sampling method (uniform sampling), this approach generated significantly better seed sequences for instances of a common multi-target design benchmark.
The presented method enables new possibilities for sequence generation in the field of RNA sequence design by enforcing additional constraints, like the GC-content, while controlling the energy of multiple target structures. Thus it presents a major advance over previously applied ad-hoc sampling and even efficient uniform sampling strategies. We have shown the practicality of such controlled sequence generation and studied its use for multi-target RNA design. Moreover, our framework is equipped to include more complex sequence constraints, including mandatory/forbidden motifs at specific positions or anywhere in the designed sequences. Furthermore, the method can support even negative design principles, for instance by penalizing a set of alternative helices/structures. Based on this generality, we envision our framework—in extensions, which still require delicate elaboration—to support various complex RNA design scenarios far beyond the sketched applications.
YP is supported by the Agence Nationale de la Recherche and the FWF (ANR-14-CE34-0011; project RNALands). SH is supported by the German Federal Ministry of Education and Research (BMBF support code 031A538B; de.NBI: German Network for Bioinformatics Infrastructure) and the Future and Emerging Technologies programme (FET-Open grant 323987; project RiboNets). We thank Leonid Chindelevitch for suggesting using the stable sets analogy to optimize our FPT algorithm, and Arie Koster for practical recommendations for computing tree decompositions.
7 Approximate counting and random generation
In fact, not only is #BIS a reference problem in counting complexity, but it is also a landmark problem with respect to the complexity of approximate counting. In this context, it is the representative of a class of #P-hard problems (Bulatov2013) that are believed easier to approximate than #SAT, yet are widely believed not to admit any Fully Polynomial-time Randomized Approximation Scheme (FPRAS). Recent results reveal a surprising dichotomy in the behavior of #BIS: it admits a Fully Polynomial-Time Approximation Scheme (FPTAS) for graphs of max degree at most 5 (Weitz2006), but is as hard to approximate as the general problem on graphs of max degree 6 (Cai2016). In other words, there is a clear threshold, in terms of the max degree, separating (relatively) easy instances from really hard ones.
Additionally, let us note that, by classic edge coloring results (König's theorem, the bipartite analogue of Vizing's Theorem), any bipartite graph having maximum degree $\Delta$ can be decomposed in polynomial time into exactly $\Delta$ matchings. Any such matching can be reinterpreted as a secondary structure, possibly with crossing interactions (pseudoknots). These results have two immediate consequences for the pseudoknotted version of the multiple design counting problem.
Corollary 2 (as follows from (Weitz2006))
The number of designs compatible with $k \le 5$ pseudoknotted RNA structures can be approximated within any fixed ratio by a deterministic polynomial-time algorithm.
Corollary 3 (as follows from (Cai2016))
As soon as the number of pseudoknotted RNA structures strictly exceeds 5, #Designs is as hard to approximate as #BIS.
It is worth noting that the hardness of #Designs does not immediately imply the hardness of generating a valid design uniformly at random, as demonstrated constructively by Jerrum, Valiant and Vazirani (Jerrum1986). However, in the same work, the authors establish a strong connection between the complexity of approximate counting and uniform random generation. Namely, they showed that, for problems associated with self-reducible relations, approximate counting is exactly as hard as (almost) uniform random generation. We conjecture that (almost) uniform sampling of sequences compatible with multiple pseudoknotted structures is in fact #BIS-hard as soon as the number of input structures strictly exceeds 5, as indicated by Goldberg2004, further motivating our parameterized approach.
8 Tree decomposition for RNA design instances in practice
To study the treewidths and tree decomposition run times typically expected for multi-target design instances, we consider five sets of multi-target RNA design instances of different complexity. Our first set consists of the Modena benchmark instances.
In addition, we generated four sets of instances of increasing complexity. The instances of the sets RF3, RF4, RF5, and RF6 respectively specify 3, 4, 5, and 6 target structures for sequence length 100. For each instance (100 instances per set), we generated a set of $k$ ($k \in \{3,4,5,6\}$) compatible structures as follows:
Generate a random sequence of length 100;
Compute its minimum free energy structure (ViennaRNA package);
Add the new structure to the instance if the resulting base pair dependency graph is bipartite;
Repeat until $k$ structures are collected.
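The bipartiteness test of the procedure above can be sketched as follows, assuming pseudoknot-free dot-bracket inputs; the function names are illustrative and not part of RNARedPrint:

```python
from collections import deque

def pairs_of(db):
    """Base pairs (i, j) of a dot-bracket string."""
    stack, pairs = [], []
    for i, c in enumerate(db):
        if c == "(":
            stack.append(i)
        elif c == ")":
            pairs.append((stack.pop(), i))
    return pairs

def dependency_graph_is_bipartite(structures, n):
    """2-color the base pair dependency graph induced by all structures (BFS)."""
    adj = [[] for _ in range(n)]
    for db in structures:
        for i, j in pairs_of(db):
            adj[i].append(j)
            adj[j].append(i)
    color = [None] * n
    for start in range(n):
        if color[start] is not None:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return False  # odd cycle: no consistent bi-partition
    return True
```

For example, two length-4 structures whose base pairs form an even cycle pass the test, while three structures inducing a triangle of dependencies fail it.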
For each instance, we generated the dependency graphs in the base pair model and in the stacking model. Then, we performed tree decomposition (using strategy “GreedyFillIn” of LibTW (Dijk2006)) on each dependency graph. The obtained treewidths are reported in Fig. 3, while Fig. 4 shows the corresponding run-times of the tree decomposition. Fig. 5 shows the tree decompositions for an example instance from set RF3.
9 Expressing higher arity functions as cliques in the dependency graph
Our implementation relies on external tools for the tree decomposition. Therefore it is not trivial that the weight functions (i.e. terms of the energy function) can be captured within the tree decompositions returned by a specific tool. In other words, it is not immediately clear how one can construct a cluster tree from an arbitrary tree decomposition for a given network.
Crucially, one can show that, as long as the arguments $X_f$ of a function $f$ are materialized by a clique within the network, there exists at least one node of the tree decomposition that contains $X_f$, thus allowing the evaluation of $f$.
Lemma 1. Let $G=(V,E)$ be an undirected graph, and $\mathcal{T}$ be a tree decomposition for $G$. For each clique $C \subseteq V$ of $G$, there exists a node $\chi$ of $\mathcal{T}$ such that $C \subseteq \chi$.
For the sake of simplicity, let us consider $\mathcal{T}$ as rooted at an arbitrary node, inducing an orientation, and denote by $V(\chi)$ the set of vertices found in a node $\chi$ or its descendants, namely
$$V(\chi) = \chi \cup \bigcup_{\chi' \text{ child of } \chi} V(\chi').$$
Now consider (one of) the deepest node(s) $\chi^\star$ such that $C \subseteq V(\chi^\star)$. We are going to prove that $C \subseteq \chi^\star$ by contradiction, showing that $C \not\subseteq \chi^\star$ implies that $\mathcal{T}$ is not a tree decomposition.
Indeed, let us assume that $C \not\subseteq \chi^\star$; then there exists a vertex $v \in C$ whose presence in $V(\chi^\star)$ stems from its presence in some descendant of $\chi^\star$. Let us denote by $\chi_v$ the closest descendant of $\chi^\star$ such that $v \in \chi_v$. Now, consider a vertex $u \in C$ such that $u \in V(\chi^\star)$ and $u \notin V(\chi_v)$. Such a vertex always exists, since otherwise $\chi_v$, and not $\chi^\star$, would be the deepest node such that $C \subseteq V(\chi_v)$. From the definition of the tree decomposition (the nodes containing a given vertex form a connected subtree), we know that neither $\chi_v$ nor its descendants may contain $u$. On the other hand, none of the ancestors or siblings of $\chi_v$ (nor their descendants) may contain $v$. It follows that there is no node whose vertex set includes, at the same time, $u$ and $v$. Since both $u$ and $v$ belong to the clique $C$, one has $\{u,v\} \in E$. The absence of a node in $\mathcal{T}$ capturing the edge $\{u,v\}$ contradicts our initial assumption that $\mathcal{T}$ is a tree decomposition for $G$.
We conclude that $\chi^\star$ is such that $C \subseteq \chi^\star$.
This result implies that classic tree decomposition algorithms and, more importantly, their implementations can be directly re-used to produce decompositions that capture energy models of arbitrary complexity, i.e. functions of arbitrary arity. Indeed, it suffices to add a clique involving the arguments of the function, and Lemma 1 guarantees that the tree decomposition will feature at least one node with which the function can be associated.
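As an illustration, the following sketch (hypothetical helper names, not part of RNARedPrint) materializes the arguments of a ternary function as a clique in a toy dependency graph and checks, on a hand-constructed valid tree decomposition, that some bag covers them, as Lemma 1 guarantees:

```python
from itertools import combinations

def add_function_clique(edges, args):
    """Materialize a function's arguments as a clique in the dependency graph."""
    edges = set(map(frozenset, edges))
    edges.update(frozenset(p) for p in combinations(args, 2))
    return edges

def covering_bag(bags, clique):
    """By Lemma 1, a valid tree decomposition has a bag containing the clique."""
    for bag in bags:
        if set(clique) <= set(bag):
            return bag
    return None

# Toy example: a ternary function over positions (1, 2, 3)
edges = add_function_clique([(0, 1), (2, 3)], (1, 2, 3))
# A valid tree decomposition of the resulting graph: a path of two bags
bags = [{0, 1, 2}, {1, 2, 3}]
```

The function can then be attached to the bag returned by `covering_bag` for evaluation during message passing.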
10 Correctness of the FPT partition function algorithm
Theorem 10.1 (Correctness of Alg. 3.1)
As computed by Alg. 3.1 for the cluster tree $T$, the messages $m_{u\to v}$, for all edges $(u,v)$ of $T$, yield the partition functions of the subtree $T_u$ of $T$ for the partial sequences $s$ over the separator $\mathrm{sep}(u,v)$, i.e. the messages satisfy
$$m_{u\to v}(s) = \sum_{s' \text{ over } V(T_u)\setminus\mathrm{sep}(u,v)} \;\prod_{f \in F(T_u)} f(s \cdot s'),$$
where $V(T_u)$ denotes all $\chi$-assigned positions of the nodes in $T_u$; respectively $F(T_u)$, all $\phi$-assigned functions.
Note that, in more concise notation, Alg. 3.1 computes messages such that
$$m_{u\to v} = \sum_{\chi(u)\setminus\mathrm{sep}(u,v)} \Big(\prod_{f\in\phi(u)} f \;\cdot \prod_{\substack{(w,u)\in T\\ w\neq v}} m_{w\to u}\Big).$$
Proof by induction on the size of $T_u$. If $u$ is a leaf of $T$, there are no messages sent to $u$, and $m_{u\to v}(s) = \sum_{s'' \text{ over } \chi(u)\setminus\mathrm{sep}(u,v)} \prod_{f\in\phi(u)} f(s\cdot s'')$, which implies the claimed equation, since $V(T_u) = \chi(u)$ and $F(T_u) = \phi(u)$. Otherwise, since the algorithm traverses the edges in postorder, $u$ has received from its children $w_1,\dots,w_d$ the messages $m_{w_1\to u},\dots,m_{w_d\to u}$, which satisfy the claimed equation (induction hypothesis). Let $s$ be a partial sequence over $\mathrm{sep}(u,v)$; then, $m_{u\to v}(s)$ is computed by the algorithm according to the concise recursion above. We rewrite $m_{u\to v}(s)$ as follows:
$$m_{u\to v}(s) = \sum_{s'' \text{ over } \chi(u)\setminus\mathrm{sep}(u,v)} \Big(\prod_{f\in\phi(u)} f(s\cdot s'') \cdot \prod_{i=1}^{d} m_{w_i\to u}(s\cdot s'')\Big) \overset{(*)}{=} \sum_{s''} \sum_{s'_1} \cdots \sum_{s'_d} \prod_{f\in\phi(u)} f(s\cdot s'') \prod_{i=1}^{d} \prod_{f\in F(T_{w_i})} f(s\cdot s''\cdot s'_i) \overset{(**)}{=} \sum_{s' \text{ over } V(T_u)\setminus\mathrm{sep}(u,v)} \prod_{f\in F(T_u)} f(s\cdot s'),$$
where each $s'_i$ ranges over the partial sequences of $V(T_{w_i})\setminus\mathrm{sep}(w_i,u)$, and each message is evaluated at the restriction of its argument to the respective separator.
To see (*) and (**), we observe:
The sets $\chi(u)$ and $V(T_{w_i})\setminus\mathrm{sep}(w_i,u)$, for $1 \le i \le d$, are pairwise disjoint due to Def. 1, property 2. First, this property implies that any position shared between the subtrees of $w_i$ and $w_j$ (for $i \neq j$) must be in $\chi(u)$, and thus in $\mathrm{sep}(w_i,u)$ and $\mathrm{sep}(w_j,u)$; thus, the positions of the sets $V(T_{w_i})\setminus\mathrm{sep}(w_i,u)$ are disjoint. Second, if a position occurs in $V(T_{w_i})$ and in $\chi(u)$, it must occur in $\chi(w_i)$ and consequently in $\mathrm{sep}(w_i,u)$.
The union of the sets $\chi(u)\setminus\mathrm{sep}(u,v)$ and $V(T_{w_i})\setminus\mathrm{sep}(w_i,u)$, for $1 \le i \le d$, is $V(T_u)\setminus\mathrm{sep}(u,v)$; analogously, $F(T_u)$ is the disjoint union of $\phi(u)$ and the sets $F(T_{w_i})$. Together with the disjointness, this justifies merging the sums and products in step (**).
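The message recursion and the subtree partition functions can be made concrete on a tiny two-cluster instance. The following sketch (illustrative names, made-up Boltzmann-like pair weights rather than a real energy model) checks the message-based partition function against brute-force enumeration:

```python
from itertools import product

BASES = "ACGU"
PAIR_WEIGHT = {("G", "C"): 3.0, ("C", "G"): 3.0,
               ("A", "U"): 2.0, ("U", "A"): 2.0,
               ("G", "U"): 1.0, ("U", "G"): 1.0}

def f(x, y):
    """Boltzmann-like factor of a single base pair function (0 if non-canonical)."""
    return PAIR_WEIGHT.get((x, y), 0.0)

# Cluster tree: node u = {0, 2} holding f(x0, x2), root v = {1, 2} holding
# f(x1, x2); separator sep(u, v) = {2}.

# Message m_{u->v}: partition function of subtree T_u per partial sequence on {2}
m_uv = {x2: sum(f(x0, x2) for x0 in BASES) for x2 in BASES}

# The root combines its own function with the incoming message
Z = sum(f(x1, x2) * m_uv[x2] for x1, x2 in product(BASES, repeat=2))

# Brute force over all sequences of length 3
Z_brute = sum(f(s[0], s[2]) * f(s[1], s[2]) for s in product(BASES, repeat=3))
```

Both computations agree, which is exactly the statement of the correctness theorem on this two-node cluster tree.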
11 General complexity of the partition function computation by cluster tree elimination and generation of samples
Proposition 2. Given a cluster tree $T$ of the RNA design network with treewidth $w$ and maximum separator size $s$, computing the partition function by Algorithm 3.1 takes $O\big((n+m)\cdot 4^{w+1}\big)$ time and $O\big(n\cdot 4^{s}\big)$ space (for storing all messages), where $n$ denotes the number of positions and $m$ the number of functions. The sampling step has time complexity $O\big((n+m)\cdot 4^{w+1}\big)$ per sample.
Proof (Proposition 2)
Let $d_u$ denote the degree of node $u$ in $T$, and $s_u$ the size of the separator between $u$ and its parent in $T$ rooted at $r$. For each node $u$ in the cluster tree, Alg. 3.1 computes one message by combining $d_u - 1 + |\phi(u)|$ functions, each time enumerating $4^{|\chi(u)|}$ combinations; Alg. 3.2 computes, per node, partition functions over $4^{|\chi(u)| - s_u}$ combinations, each time combining $d_u - 1 + |\phi(u)|$ functions. $|\chi(u)|$ is bounded by $w+1$ and $s_u$ by $s$; moreover, $\sum_u d_u \in O(n)$ and $\sum_u |\phi(u)| = m$.
12 Exploiting constraint consistency to reduce the complexity
While Alg. 3.1 computes message values for all possible combinations of nucleotides for the positions in a node, we observe here that many such combinations are not required for computing all relevant partition functions. In particular, the algorithm can be restricted to consider only valid combinations, i.e. those satisfying the (hard) constraints induced by valid base pairs.
RNA design introduces constraints due to the requirement of canonical (i.e., complementary) base pairing, which can be exploited in a particularly simple and effective way to reduce the complexity. As previously noted by Flamm2001, base pair complementarity induces a bi-partition of each connected component of the base pair dependency graph, such that the nucleotides assigned to the two sets of nodes in the partition are restricted to values in $\{A,G\}$ and $\{C,U\}$, respectively. We call a partial sequence cc-valid iff all its determined positions are consistent with such a separation within their respective connected components.
One can now modify Alg. 3.1, on a tree decomposition $T$, such that the message values $m_{u\to v}(s)$ are only computed for cc-valid partial sequences $s$. Moreover, the loop over the partial sequences $s''$ is restricted such that $s \cdot s''$ is cc-valid. Analogous restrictions are then implemented in the sampling algorithm Alg. 3.2.
The correctness of the modified algorithm follows from the same induction argument, where the message computation at a node evaluates the messages from its children only at cc-valid partial sequences. The result of this computation is a message that corresponds to the partition function restricted to cc-valid partial sequences. Since invalid sequences have infinite energy (i.e., zero Boltzmann weight), they do not contribute to the partition function, and the partition function restricted to cc-valid sequences coincides with the initial one.
This restriction drastically improves the time complexity. Indeed, for any given node $u$, the original algorithm computes the message values $m_{u\to v}(s)$ for all partial sequences $s$ over $\mathrm{sep}(u,v)$ and all extensions $s''$ over $\chi(u)\setminus\mathrm{sep}(u,v)$, while the modified algorithm only considers cc-valid assignments. Remark that, in any connected component, assigning some nucleotide to a position reduces the cc-valid assignments to (at most) two alternatives for each of the remaining positions. It follows that, for a node featuring positions from $c_u$ distinct connected components of the base pair dependency graph, the number of cc-valid assignments to the positions in $\chi(u)$ is exactly
$$4^{c_u} \cdot 2^{|\chi(u)| - c_u} \;\le\; 2^{w+1+c}$$
for a single node, where $c$ is the total number of connected components in the base pair dependency graph, and $w$ is the treewidth of the tree decomposition $T$. Since the number of nodes in $T$ is in $O(n)$, and the number of atomic energy contributions associated with $k$ structures is in $O(k\,n)$, the overall complexity grows like $O\big(k\,n\cdot 2^{w+1+c}\big)$.
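The closed-form count of cc-valid assignments can be verified by brute force on a small node. In this sketch (illustrative names; assuming the purine/pyrimidine classes $\{A,G\}$ and $\{C,U\}$ for the two sides of each bi-partition), a node's positions are grouped by connected component, each position carrying its side of the component's bi-partition:

```python
from itertools import product

PURINES, PYRIMIDINES = ("A", "G"), ("C", "U")

def is_cc_valid(assignment, components):
    """components: one list per connected component, containing (position, side)
    entries with side in {0, 1}; an assignment is cc-valid iff each component
    admits an orientation of the class pair consistent with all its positions."""
    for comp in components:
        for classes in ((PURINES, PYRIMIDINES), (PYRIMIDINES, PURINES)):
            if all(assignment[pos] in classes[side] for pos, side in comp):
                break
        else:
            return False
    return True

# A node with 4 positions drawn from c_u = 2 components (of sizes 3 and 1)
components = [[(0, 0), (1, 1), (2, 0)], [(3, 0)]]
count = sum(is_cc_valid(dict(enumerate(s)), components)
            for s in product("ACGU", repeat=4))
```

Here the brute-force count matches the formula $4^{c_u} \cdot 2^{|\chi(u)|-c_u} = 4^2 \cdot 2^{4-2} = 64$.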
Finally, we remark that even stronger time savings could be possible in practice, since cc-valid partial sequences can still violate complementarity constraints: e.g., assigning C and A to positions in different sets of a bi-partition satisfies the bi-partition constraint, but renders the partial sequence invalid if the two positions base pair directly. Moreover, applications of the sampling framework can introduce additional constraints that further reduce the number of valid partial sequences. However, exploiting all such constraints in a complete and general way would likely cause significant implementation overhead, while not significantly improving the asymptotic complexity.
13 Parameters for the base pair and the stacking energy model
We trained parameters for two RNA energy models to approximate the Turner energy model, as implemented in the ViennaRNA package. In the base pair model, the total energy of a sequence for an RNA structure $R$ is given as the sum of base pair energies, where we consider six types of base pairs, distinguished by the bases (G-C, A-U, or G-U, treating symmetric pairs identically) and by stacked versus non-stacked base pairs; here, we consider a base pair $(i,j) \in R$ stacked iff $(i+1,j-1) \in R$, and non-stacked otherwise. In the stacking model, our features are defined by the stacks, i.e. pairs $(i,j)$ and $(i+1,j-1)$ which both occur in $R$; we distinguish 18 types based on the bases at the four involved positions (i.e. all combinations that allow canonical base pairs, identifying symmetric configurations).
Both models express the energy assigned to a pair of sequence and structure as a linear function of the feature counts and their weights. We can thus train the weights as linear predictors of the Turner energy in both models.
For this purpose, we generated 5000 uniform random RNA sequences of random lengths between 100 and 200. For each sample, we predicted the minimum free energy and the corresponding structure using the ViennaRNA package; then, we counted the distinguished features (i.e., base pair or stack types). The parameters were estimated by fitting linear models without intercept (R function lm). For both models, R reports an adjusted R-squared value of 0.99. The resulting parameters are reported in Tables 2 and 3.
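Counting the base pair model features used in such a fit can be sketched as follows (pure Python with illustrative names, not RNARedPrint's implementation; structures in pseudoknot-free dot-bracket notation):

```python
from collections import Counter

def base_pairs(db):
    """Base pairs (i, j) of a dot-bracket string."""
    stack, pairs = [], []
    for i, c in enumerate(db):
        if c == "(":
            stack.append(i)
        elif c == ")":
            pairs.append((stack.pop(), i))
    return pairs

def bp_features(seq, db):
    """Count the 6 base pair model features: {CG, AU, GU} x {stacked, unstacked}.
    Bases are sorted so that symmetric pairs (e.g. G-C and C-G) coincide."""
    pairs = set(base_pairs(db))
    counts = Counter()
    for i, j in pairs:
        bases = "".join(sorted((seq[i], seq[j])))   # e.g. G,C -> "CG"; U,A -> "AU"
        stacked = (i + 1, j - 1) in pairs           # inner neighbor also paired?
        counts[(bases, "stacked" if stacked else "unstacked")] += 1
    return counts

counts = bp_features("GGAUCC", "((()))")
```

Feature vectors collected this way over the training set would then serve as the design matrix of the intercept-free linear fit.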
To validate the trained parameters, we generated a second, independent test set of random RNA sequences in the same way. Fig. 6 shows correlation plots for the trained parameters of the base pair and stacking models predicting the Turner energies on the test set; both models achieve high correlations.
14 Monotonicity of the partial derivatives within weight calibration
In weighted distributions, one witnesses a fairly predictable impact of the variation