k-Foldability of Words

Beth Bjorkman Department of Mathematics, Iowa State University
Work completed while affiliated with Washington University in St. Louis.
Garner Cochran Department of Mathematics, University of South Carolina
Wei Gao Department of Mathematics and Statistics, Auburn University
Lauren Keough Mathematics Department, Grand Valley State University
Rachel Kirsch Department of Mathematics, University of Nebraska - Lincoln
Mitch Phillipson School of Natural Sciences, St. Edward’s University
Danny Rorabaugh Department of Mathematics and Statistics, Queen’s University
Heather Smith School of Mathematics, Georgia Institute of Technology
Jennifer Wise Work completed while affiliated with the University of Illinois at Urbana-Champaign.
Abstract.

We extend results regarding a combinatorial model introduced by Black, Drellich, and Tymoczko (2017+) which generalizes the folding of the RNA molecule in biology. Consider a word on the alphabet $\{A_1, \overline{A_1}, \ldots, A_m, \overline{A_m}\}$ in which $\overline{A_i}$ is called the complement of $A_i$. A word $w$ is foldable if $w$ can be wrapped around a rooted plane tree $T$, starting at the root and working counterclockwise, such that one letter labels each half edge and the two letters labeling the same edge are complements. The tree $T$ is then called $w$-valid.

We define a bijection between edge-colored plane trees and words folded onto trees. This bijection is used to characterize and enumerate words for which there is only one valid tree. We follow up with a characterization of words for which there exist exactly two valid trees.

In addition, we examine the set $R(n,m)$ consisting of all integers $k$ for which there exists a word of length $2n$ with exactly $k$ valid trees. Black, Drellich, and Tymoczko showed that for the $n$th Catalan number $C_n$, $C_n \in R(n,m)$ but $k \notin R(n,m)$ for $C_{n-1} < k < C_n$. We describe a superset of $R(n,1)$ in terms of the Catalan numbers by which we establish more missing intervals. We also prove $R(n,1)$ contains all non-negative integers less than $n+1$.

Key words and phrases:
Catalan numbers, plane trees, non-crossing perfect matchings
2010 Mathematics Subject Classification:
05A15, 05C05, 20M05

1. Introduction

The molecule ribonucleic acid (RNA) consists of a single strand of the four nucleotides adenine, uracil, cytosine, and guanine. In short, RNA is representable by finite sequences (or words) from the alphabet $A$, $U$, $C$, and $G$, lending itself to combinatorial study. In contrast to the double helix of DNA, the single-stranded nature of RNA often results in RNA folding onto itself as the nucleotides form bonds. As in DNA, we have the Watson-Crick pairs so that $A$ and $U$ form bonds and $C$ and $G$ form bonds. However, RNA has one more bond that may form, the wobble pair $G$ and $U$. It is worth noting that when RNA folds onto itself, not all nucleotides on a strand form bonds. Predicting the folded structure of RNA is important because the folded structure gives an indication of its functionality.

In this paper, we direct our attention to a generalized combinatorial model, motivated by the folding of RNA. This model was first introduced by Black, Drellich, and Tymoczko [1] with an initial restriction made to the Watson-Crick bonding pairs, leaving the potential wobble bond for future study. With this restriction, we relabel our words to use the letters $A$, $\overline{A}$, $B$, $\overline{B}$, where $A$ only bonds with $\overline{A}$ and $B$ only bonds with $\overline{B}$.

Further, we do not limit ourselves to an alphabet with only four letters. In particular, fix an integer and expand the alphabet to where and are called complements and may only form a bond with and vice versa. We say that this is an alphabet on letters and their complements. Define the length of a word to be the number of letters in the word, letting be the word of length zero (the empty word).

As in [1], we assume that when a word folds onto itself, every letter is matched with exactly one other letter. Thus, a folding of a word can be represented by a non-crossing perfect matching of the letters so that two matched letters are complements.

Recall that the Catalan numbers enumerate the non-crossing perfect matchings on points (see [5], problem 6.19, part o). In our model, the underlying word restricts the allowable matching edges based on the letter corresponding to each point. However, the word admits every non-crossing perfect matching.

Section 2 contains some preliminaries and background from the work of Black, Drellich, and Tymoczko [1]. In Section 3, we define a bijection between foldings of words and edge-colored plane trees. This is used to enumerate the words which fold in precisely one way, a problem posed in [1]. We also characterize $2$-foldable words by a decomposition in terms of $1$-foldable words. Section 4 is devoted to studying the set $R(n,m)$ of integers $k$ such that there is a word which folds in precisely $k$ ways. We give a superset of $R(n,1)$, making a strong connection with Catalan numbers. In search of the smallest value which is not found in $R(n,1)$, we also determine a large consecutive set of small values in $R(n,1)$.

2. Preliminaries

In addition to non-crossing perfect matchings, the Catalan numbers also enumerate plane trees. A plane tree is a straight line drawing of a rooted tree embedded in the plane with the root above all other vertices. This induces a left-to-right ordering of the children of a vertex. To obtain an ordering on the half edges of a plane tree, start at the root and trace the perimeter counterclockwise, touching each side of an edge exactly once.

Fix a word $w = w_1 w_2 \cdots w_{2n}$ and let $T$ be a plane tree with $n$ edges. Following the order of the half edges, label the $i$th half edge of $T$ with $w_i$. We say that $T$ is $w$-valid if for each edge of $T$, the two letters from $w$ which label that edge are complements.

Definition 2.1.

A word $w$ is said to be foldable if there is a plane tree that is $w$-valid. For an integer $k$, a word $w$ is $k$-foldable if there are exactly $k$ plane trees that are $w$-valid.

For example, is -foldable as seen in Figure 1. The corresponding non-crossing perfect matchings are also given. Further, the word is -foldable as every plane tree with edges is -valid.
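The number of valid trees of a word can be computed directly from the non-crossing matching description. The following is a minimal sketch, not taken from [1]: it assumes letters are encoded as nonzero integers with `-x` standing for the complement of `x`, and the name `count_foldings` is hypothetical. It uses the fact that the first letter of any foldable subword must be matched with a later complement at odd distance, which splits the subword into two independent foldable pieces.

```python
from functools import lru_cache


def count_foldings(word):
    """Count non-crossing perfect matchings of `word` that pair complements.

    Encoding assumption for this sketch: letters are nonzero integers and
    -x is the complement of x.
    """
    w = tuple(word)
    if len(w) % 2:
        return 0  # a word of odd length has no perfect matching

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of foldings of the subword w[lo:hi] (always of even length).
        if lo == hi:
            return 1
        total = 0
        # w[lo] must bond with some w[j]; the letters strictly between them
        # must fold on their own, so j - lo is odd.
        for j in range(lo + 1, hi, 2):
            if w[j] == -w[lo]:
                total += count(lo + 1, j) * count(j + 1, hi)
        return total

    return count(0, len(w))
```

For instance, the alternating word encoded as `[1, -1] * n` admits every non-crossing perfect matching, so the function returns the Catalan number for that input.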

Black, Drellich, and Tymoczko [1] defined the following greedy algorithm to produce a folding of . Given a word of length , for each starting at , create a matching as follows: Match with provided is the largest index such that , is not yet matched, and is a complement of . If no such exists, leave (temporarily) unmatched. If is foldable, this algorithm will produce a non-crossing perfect matching of [1]; the folding produced by the greedy algorithm is called the greedy folding. The folding on the right in Figure 1 is the greedy folding.

We will examine the following four sets in more detail.

Definition 2.2 (Black, Drellich, and Tymoczko [1]).

Fix and . Let be the collection of words of length from an alphabet with letters and their complements. For , define the following quantities:

• is the set of words in that are foldable.

• is the set of words in that are -foldable. (Footnote 1: Note that [1] uses instead of .)

• is the set of plane trees that are -valid.

• is the set of integers for which is non-empty.

The set can also be viewed as the length- elements of the free group on generators. However, we are primarily interested in foldable words, which are precisely those that reduce to the identity element in the free group, so we make no further group theoretic connections.
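Since foldable words are exactly those that reduce to the identity in the free group, foldability can be tested by free reduction with a stack. The sketch below is an illustration under stated assumptions: letters are nonzero integers with `-x` the complement of `x`, and the function name `stack_matching` is hypothetical. It records which letters cancel, which yields one non-crossing perfect matching when the word is foldable. It resembles the greedy rule described earlier, but it is not claimed to reproduce the algorithm of [1] exactly.

```python
def stack_matching(word):
    """Freely reduce `word`; return the cancelling pairs, or None if not foldable.

    Encoding assumption for this sketch: letters are nonzero integers and
    -x is the complement of x. Pairs are 0-based index pairs (i, j), i < j.
    """
    stack = []  # indices of letters that have not yet cancelled
    pairs = []
    for i, letter in enumerate(word):
        if stack and word[stack[-1]] == -letter:
            # The most recent uncancelled letter is a complement: bond them.
            pairs.append((stack.pop(), i))
        else:
            stack.append(i)
    # The word is foldable exactly when everything cancels.
    return pairs if not stack else None
```

Because a pair is only formed with the top of the stack, the recorded bonds can never cross.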

Let . Heitsch, Condon, and Hoos [2] defined a local move to transform one plane tree in into another. For two trees in , there is a move from one to the other if there is a pair of edges that can be re-paired as in Figure 2. This defines a directed graph with a vertex for each plane tree in and an edge from to when there is a Type 1 move from to . The following were proved in [1].

Theorem 2.3 (See Section 3 in [1]).

Let be a foldable word.

1. The greedy folding is a unique source of .

2. If is the greedy folding and , then there exists a path in from to .

3. Characterization and enumeration of foldable words

In this section we give a bijection between foldings of words and edge-colored plane trees. Using this bijection we characterize both $1$-foldable and $2$-foldable words. An enumeration of $1$-foldable words is also given.

3.1. Doubled Alphabet

Fix a foldable word . In any folding of , if bonds with , then and must have different parities because the subword must be foldable and hence has an even (possibly zero) number of letters. This leads to the notion of a doubled alphabet to reflect that and are the only possible bonds between an and an . For an alphabet and a word , define on the doubled alphabet, , as follows:

• If , then .

• If , then .

Definition 3.1.

Fix .

• is the set of words for which each letter in an odd-index position is from , and each letter in even-index position is from .

• .

Proposition 3.2.

For , the map defines two bijections:

\[ S(n,m) \longleftrightarrow \widehat{S}(n,2m) \]

and

\[ P(n,m) \longleftrightarrow \widehat{P}(n,2m). \]

3.2. Walks on Regular Trees

This alternation between letters from and letters from gives us greater ability to enumerate foldable words. To demonstrate, let us view words in as walks on an infinite regular tree. The infinite, unrooted, -regular tree has distinct edges incident to every vertex, so for convenience, we can use the label set . Given a walk on , write down the sequence of edge labels, but on even-index steps, write down the complement of the labeling letter instead of the letter. So from any fixed vertex, walks of length are in bijection with the elements of .

A walk is closed if it ends where it begins. Note that a walk is closed precisely when its corresponding word in is foldable. That is, closed walks on from a fixed vertex are in bijection with the elements of . These were enumerated by Quenell:

Theorem 3.3 (Equation (19) in [4]).

Fix integer and vertex of . The generating function for the number of length- closed walks on starting at is

\[ f_m(x) = \sum_{n=0}^{\infty} a_n x^n = \frac{2(m-1)}{m-2+m\sqrt{1-4(m-1)x}} = \frac{2-m\left(1-\sqrt{1-4(m-1)x}\right)}{2(1-m^2x)}. \]
Corollary 3.4.

For integers and and vertex of , the number of length- closed walks on starting at is

\[ a_n = m^{2n} - \sum_{i=1}^{n} \frac{m^{1+2(n-i)}(m-1)^{i}}{4i-2}\binom{2i}{i}. \]
Proof.

By Newton’s generalized binomial theorem,

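\[ \sqrt{1-4(m-1)x} = \sum_{n=0}^{\infty}\binom{1/2}{n}\bigl(-4(m-1)x\bigr)^{n} = \sum_{n=0}^{\infty}\frac{\prod_{i=0}^{n-1}\left(\frac{1}{2}-i\right)}{n!}\bigl(-4(m-1)x\bigr)^{n} = \sum_{n=0}^{\infty}-\frac{(m-1)^{n}}{2n-1}\binom{2n}{n}x^{n}. \]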

Substituting this back into , we get

\[ f_m(x) = \frac{2-m\left(\sum_{n=1}^{\infty}\frac{(m-1)^{n}}{2n-1}\binom{2n}{n}x^{n}\right)}{2(1-m^2x)}. \]

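Setting $b_n = \frac{m(m-1)^{n}}{4n-2}\binom{2n}{n}$ for $n \geq 1$, we have

\[ f_m(x) = \frac{1-\sum_{n=1}^{\infty}b_n x^{n}}{1-m^2x} = 1+(m^2-b_1)x+\bigl(m^2(m^2-b_1)-b_2\bigr)x^2+\cdots = \sum_{n=0}^{\infty}\left(m^{2n}-\sum_{i=1}^{n}b_i m^{2(n-i)}\right)x^{n}. \]

Reading off the coefficient of $x^n$ gives the stated formula.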
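The closed form in Corollary 3.4 can be checked numerically against a direct count. The sketch below assumes that the coefficient $a_n$ counts closed walks of length $2n$ (this matches $a_1 = m$, the number of out-and-back walks of length 2); the function names are hypothetical. The direct count tracks only the distance from the start vertex: from the start, all $m$ edges lead away, and from any other vertex one edge leads back and $m-1$ lead further away.

```python
from fractions import Fraction
from math import comb


def closed_walks_formula(n, m):
    """Evaluate the closed form of Corollary 3.4 exactly."""
    total = Fraction(m) ** (2 * n)
    for i in range(1, n + 1):
        numerator = m ** (1 + 2 * (n - i)) * (m - 1) ** i * comb(2 * i, i)
        total -= Fraction(numerator, 4 * i - 2)
    assert total.denominator == 1
    return int(total)


def closed_walks_direct(n, m):
    """Count closed walks of length 2n on the m-regular tree by distance."""
    ways = {0: 1}  # ways[d] = number of walks so far ending at distance d
    for _ in range(2 * n):
        nxt = {}
        for d, c in ways.items():
            if d == 0:
                nxt[1] = nxt.get(1, 0) + m * c
            else:
                nxt[d - 1] = nxt.get(d - 1, 0) + c            # step back
                nxt[d + 1] = nxt.get(d + 1, 0) + (m - 1) * c  # step away
        ways = nxt
    return ways.get(0, 0)
```

Exact rational arithmetic is used so that the division by $4i-2$ introduces no rounding.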

We can obtain asymptotics for using the Maple™ package algolib (version 17.0), or the saddle point method, on the generating function.

Corollary 3.5.

For fixed and vertex of , the number of length- closed walks on starting at is asymptotically

\[ a_n = \frac{(4m-4)^{n}}{n^{3/2}}\left(\frac{m(m-1)}{\sqrt{\pi}\,(m-2)^{2}}+O\!\left(\frac{1}{\sqrt{n}}\right)\right). \]

Recall that is in bijection with , which can be enumerated by closed walks on . Thus, using Corollary 3.5 with , we have that for fixed the number of foldable words of length as approaches infinity is asymptotically

\[ |P(n,m)| = \Theta\!\left(n^{-3/2}(8m-4)^{n}\right). \tag{1} \]

3.3. Labeling Plane Trees

There is a natural bijection between foldings of words and (not necessarily proper) edge-colorings of rooted plane trees which is most clearly seen by examining the foldings of rather than . More generally, we consider words in —that is, with alternating unbarred and barred letters—rather than in . Set

where the element is viewed as a folding of around . With $[m]$ denoting the set $\{1, 2, \ldots, m\}$, define

\[ C(n,m) \coloneqq \{(c,T) \mid T \text{ is a plane tree with } n \text{ edges and } c : E(T) \to [m]\}, \]

so that the elements of represent edge-colored plane trees, where the coloring is not necessarily proper.

Theorem 3.6.

For all integers and , .

Proof.

We will define a bijection from to . Fix an arbitrary . Define the edge-coloring so that if the half edges of are labeled or . Figure 3 gives an example of this mapping.

The inverse function is defined as follows. Fix . The color of each edge indicates the two letters that will be assigned to its half edges. It only remains to determine which letter will be assigned to which half edge. In the ordering of the half edges of , each edge will have an even half edge and an odd half edge. This is because the subtree below the edge contains an even number of half edges. Assign the letter with the bar to the even half edge and the other to the odd half edge. This labeling is precisely the folding of on . Since an inverse exists, the function defined is a bijection. ∎
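The inverse map in the proof can be sketched in code. The representation below is an assumption made for illustration: a plane tree is a nested list of `(color, subtree)` pairs listing child edges in the order the perimeter walk meets them, and a letter is a `(color, barred)` pair. The name `fold_word` is hypothetical. Each edge contributes one half edge on the way down and one on the way back up, and the bar goes on whichever of the two has an even index.

```python
def fold_word(children):
    """Read the word folded around an edge-colored plane tree.

    `children` is a list of (color, subtree) pairs for the root's child
    edges, in perimeter order; `subtree` has the same shape. Letters are
    returned as (color, barred) pairs, in half-edge order.
    """
    word = []

    def walk(kids):
        for color, subtree in kids:
            down = len(word) + 1                 # 1-based half-edge index
            word.append((color, down % 2 == 0))  # bar goes on even indices
            walk(subtree)
            up = len(word) + 1
            word.append((color, up % 2 == 0))

    walk(children)
    return word


# A root with two child edges of color 1; the first child has one edge of color 2.
example = [(1, [(2, [])]), (1, [])]
print(fold_word(example))
```

The output alternates unbarred and barred letters, and the two half edges of every edge carry the same color with exactly one bar, as the proof requires.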

3.4. 1-foldable classification and enumeration

The correspondence between foldings of words and edge-colored plane trees leads to an enumeration of words which are -foldable. Denote with the set of all where is -foldable. Then for any and, in particular, .

Theorem 3.7.

The words in are in bijection with the proper -edge-colorings of plane trees with edges.

Proof.

First recall that the graph of valid trees for a given word is connected (Theorem 2.3). The bijection in Theorem 3.6 can be used to detect available local moves from the edge-coloring. In particular, a move exists precisely when two incident edges have the same color. Therefore, elements of with a proper edge-coloring correspond exactly with elements of . Hence, elements of with a proper edge-coloring are in correspondence with elements of . ∎

Using this classification, we now enumerate $1$-foldable words.

Lemma 3.8.

Let be a plane tree with degree multiset where there are vertices with degree . Then the number of proper -edge-colorings of is

\[ k\prod_{i=1}^{\infty}\left(\binom{k-1}{i-1}(i-1)!\right)^{\alpha_i}, \tag{2} \]

where $0^0$ is understood to equal 1.

Proof.

Fix a leaf . Let be the unique neighbor of and let be the degree of . Color the edge with one of colors. Next, color the remaining incident edges, which can be ordered in ways. Visiting the vertices via a breadth-first search, each vertex will contribute a similar factor to the product as all but one of its incident edges will already be colored. ∎
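Expression (2) can be compared with a brute-force count on a small tree. This is a minimal sketch with hypothetical function names; the tree is given as a list of vertex pairs, and the product over degrees $i$ with multiplicity $\alpha_i$ is written as a product over vertices.

```python
from itertools import product
from math import comb, factorial


def proper_colorings_formula(degrees, k):
    """Expression (2), written as a product over the vertex degrees."""
    result = k
    for d in degrees:
        result *= comb(k - 1, d - 1) * factorial(d - 1)
    return result


def proper_colorings_brute(edges, k):
    """Count k-colorings in which edges sharing a vertex get different colors."""
    count = 0
    for colors in product(range(k), repeat=len(edges)):
        proper = all(
            colors[a] != colors[b]
            for a in range(len(edges))
            for b in range(a + 1, len(edges))
            if set(edges[a]) & set(edges[b])
        )
        count += proper
    return count


# A tree on vertices 0..4 with degrees 1, 3, 1, 2, 1.
edges = [(0, 1), (1, 2), (1, 3), (3, 4)]
degrees = [1, 3, 1, 2, 1]
print(proper_colorings_formula(degrees, 3), proper_colorings_brute(edges, 3))
```

With fewer colors than the maximum degree, `comb` returns 0 and the product vanishes, matching the fact that no proper coloring exists.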

Let be the maximum degree in . Note that if , expression (2) collapses to zero as expected since colors are required for a proper coloring of the edges incident to a maximum-degree vertex.

Lemma 3.9 (Mallows and Wachter [3]).

Let be the number of plane trees with degree multiset . Then

\[ RPT(\alpha) = \frac{2}{\alpha_1}\binom{1+\alpha_2+2\alpha_3+3\alpha_4+\cdots}{\alpha_1-1,\ \alpha_2,\ \alpha_3,\ \ldots}. \tag{3} \]

For any sequence of non-negative integers , there is a plane tree with degree multiset for an appropriate choice of , the number of leaves. In particular, if and only if

\[ \alpha_1 = 2+\alpha_3+2\alpha_4+3\alpha_5+4\alpha_6+\cdots = 2+\sum_{i=2}^{\infty}(i-2)\alpha_i. \tag{4} \]
Lemma 3.10.

For the multiset , set as in (4), and let be a plane tree with the degree multiset . Then satisfy the conditions

1. for , and

2. ,

if and only if has edges and can be properly edge-colored with colors.

Proof.

For any tree , the edge chromatic number . Condition 1 is equivalent to saying , so is -edge-colorable. Condition 2 is equivalent to having edges. ∎

Lemma 3.10 gives an explicit characterization of the degree conditions for a plane tree to have a proper edge-coloring. Having previously established a correspondence between foldings of -foldable words and proper edge-colorings of plane trees, the following theorem is now clear.

Theorem 3.11.

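The number of $1$-foldable words of length $2n$ on an alphabet with $m$ letters and their complements is

\[ \sum \frac{2}{\alpha_1}\binom{n}{\alpha_1-1,\ \alpha_2,\ \alpha_3,\ \ldots,\ \alpha_{2m}}\cdot 2m\prod_{i=1}^{2m}\left(\binom{2m-1}{i-1}(i-1)!\right)^{\alpha_i}, \tag{5} \]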

where the sum is over all non-negative sequences such that and .

Proof.

By Theorem 3.7 and Lemmas 3.8, 3.9, and 3.10 with the observation that

\[ n = 1+\alpha_2+2\alpha_3+\cdots+(2m-1)\alpha_{2m}. \]
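Expression (5) can be checked against exhaustive search for small parameters. The sketch below is an illustration with hypothetical function names. It assumes the sum ranges over non-negative $\alpha_2, \ldots, \alpha_{2m}$ satisfying the edge-count identity above, with $\alpha_1$ then determined by (4), and it encodes letters as nonzero integers with `-x` the complement of `x`.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import product
from math import comb, factorial


def count_foldings(word):
    """Count non-crossing perfect matchings of `word` that pair complements."""
    w = tuple(word)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        if lo == hi:
            return 1
        return sum(
            count(lo + 1, j) * count(j + 1, hi)
            for j in range(lo + 1, hi, 2)
            if w[j] == -w[lo]
        )

    return count(0, len(w))


def multinomial(n, parts):
    value = factorial(n)
    for p in parts:
        value //= factorial(p)  # exact at every step since sum(parts) == n
    return value


def one_foldable_formula(n, m):
    """Evaluate expression (5) for n >= 1."""
    k = 2 * m
    total = Fraction(0)
    for tail in product(range(n), repeat=k - 1):  # tail = (alpha_2, ..., alpha_k)
        indexed = list(zip(range(2, k + 1), tail))
        if sum((i - 1) * a for i, a in indexed) != n - 1:
            continue  # the tree must have exactly n edges
        a1 = 2 + sum((i - 2) * a for i, a in indexed)  # relation (4)
        trees = Fraction(2, a1) * multinomial(n, (a1 - 1,) + tail)
        colorings = k
        for i, a in enumerate((a1,) + tail, start=1):
            colorings *= (comb(k - 1, i - 1) * factorial(i - 1)) ** a
        total += trees * colorings
    assert total.denominator == 1
    return int(total)


def one_foldable_brute(n, m):
    """Count words of length 2n on m letters and complements with one folding."""
    letters = [sign * j for j in range(1, m + 1) for sign in (1, -1)]
    return sum(count_foldings(w) == 1 for w in product(letters, repeat=2 * n))
```

The brute-force search grows as $(2m)^{2n}$, so it is only practical for very small $n$ and $m$; it serves as a sanity check rather than an enumeration method.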

Example 3.12.

When , expression (5) is a summation with one term, when , and gives words of length which are -foldable. The words are exactly and .

Example 3.13.

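When $m=2$, each term in (5) has the following form:

\begin{align*}
\frac{2}{\alpha_1}\binom{n}{\alpha_1-1,\ \alpha_2,\ \alpha_3,\ \alpha_4}\cdot 4\prod_{i=1}^{4}\left(\binom{3}{i-1}(i-1)!\right)^{\alpha_i}
&= \frac{8}{\alpha_1}\cdot\binom{n}{\alpha_1-1,\ \alpha_2,\ \alpha_3,\ \alpha_4}\cdot(1\cdot 0!)^{\alpha_1}\cdot(3\cdot 1!)^{\alpha_2}\cdot(3\cdot 2!)^{\alpha_3}\cdot(1\cdot 3!)^{\alpha_4} \\
&= \frac{8}{3(2+\alpha_3+2\alpha_4)}\cdot\binom{n}{1+\alpha_3+2\alpha_4,\ n-2\alpha_3-3\alpha_4-1,\ \alpha_3,\ \alpha_4}\cdot 3^{\,n-\alpha_3-2\alpha_4}\cdot 2^{\,\alpha_3+\alpha_4},
\end{align*}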

with and positive integers such that . The multinomial coefficient pulls the maximum of this term away from the boundaries; that is, as grows, the maximum is not found where one of is . Let us therefore assume that and for positive constants and with . Applying Stirling’s approximation, this term is asymptotically

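\[ \frac{(1+o(1))^{n}}{(x+2y)^{n(x+2y)}\cdot(1-2x-3y)^{n(1-2x-3y)}\cdot x^{xn}\cdot y^{yn}}\cdot 3^{n(1-x-2y)}\cdot 2^{n(x+y)} = \left[\frac{3+o(1)}{1-2x-3y}\cdot\left(\frac{2(1-2x-3y)^{2}}{3x(x+2y)}\right)^{x}\cdot\left(\frac{2(1-2x-3y)^{3}}{9y(x+2y)^{2}}\right)^{y}\right]^{n}. \]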

Numerically, the base of this exponential is maximized when , which gives a maximum term of . Since the number of terms in the sum is polynomial in for fixed , this is also an asymptotic approximation for the whole sum. Compare this to the length- words on an alphabet of letters and their complements, of which are foldable by equation (1).

3.5. 2-foldable classification

Using the bijection with edge-colored trees, we can also classify $2$-foldable words. In particular, a foldable word is $2$-foldable if the edge-colored tree corresponding to the greedy folding has only one pair of incident edges with the same color, and the tree that results after making the corresponding Type 1 move at those edges has only one pair of incident edges with the same color.

An equivalent characterization is those trees that have an -decomposition, defined as follows.

Definition 3.14.

An -decomposition of a word is a list of words , , , , and such that

1. for some (possibly barred) letter ,

2. the words , , , and are foldable, and

3. the words and are -foldable.

Note that we consider here any word in , not necessarily with an alternating bar pattern. Moreover, in condition (1) may be a barred letter. In this case, signifies its unbarred complement.

Theorem 3.15.

A word is -foldable if and only if it has an -decomposition.

Proof.

Suppose has an -decomposition. Then can be folded into the two edge-colored trees using parts (1) and (2) of the definition of -decomposition. Parts (2) and (3) of the definition imply that , , , and are -foldable. By part (3) of the definition, the only incident edges with local moves in either of these two trees are the edges labeled by or in Figure 4. Thus there are no other local moves, and since the state space graph is connected, these are the only two foldings of .

Now suppose is -foldable. The two foldings correspond bijectively to two edge-colored trees, which are adjacent by local moves shown in Figure 2. Thus we have properties (1) and (2) of an -decomposition, and (3) follows from the fact that is -foldable so has no other local moves. ∎

4. The values in R(n,m)

In this section we develop a better understanding of the set of all for which there is a word in which is -foldable. Black, Drellich, and Tymoczko [1] initiated this study with the following proposition.

Proposition 4.1 ([1]).

Let , the Catalan number. For integers and , but if then .

Wagner [6] further investigated and established monotonicity in the following sense.

Proposition 4.2 ([6]).

For positive integers and ,

Note that the previous two propositions establish that for . Wagner also showed monotonicity of in .

Proposition 4.3 ([6]).

For positive integers and , . Further, there exist and such that .

We focus our attention mainly on the case . To give some indication of the values in , computationally we find:

\begin{align*}
R(0,1) &= \{1\}; \\
R(1,1) &= R(0,1)\cup\{0\}; \\
R(2,1) &= R(1,1)\cup\{2\}; \\
R(3,1) &= R(2,1)\cup\{5\}; \\
R(4,1) &= R(3,1)\cup\{3,4,14\}; \\
R(5,1) &= R(4,1)\cup\{7,10,42\}; \\
R(6,1) &= R(5,1)\cup\{6,8,12,16,18,19,25,28,132\}; \\
R(7,1) &= R(6,1)\cup\{9,15,20,30,40,43,52,56,70,84,429\}; \\
R(8,1) &= R(7,1)\cup\{22,23,24,26,32,35,36,38,50,55,73,74,80,85,96,106,114,115,126,157,160,174,196,210,264,1430\}.
\end{align*}
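The smallest of these sets can be reproduced by exhaustive search. The following is a minimal sketch, not the authors' computation: it enumerates every word of length $2n$ on one letter and its complement (encoded as `1` and `-1`, an encoding chosen here), counts foldings with an interval recursion, and collects the distinct counts. The function names are hypothetical.

```python
from functools import lru_cache
from itertools import product


def count_foldings(word):
    """Count non-crossing perfect matchings of `word` that pair complements."""
    w = tuple(word)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        if lo == hi:
            return 1
        return sum(
            count(lo + 1, j) * count(j + 1, hi)
            for j in range(lo + 1, hi, 2)
            if w[j] == -w[lo]
        )

    return count(0, len(w))


def fold_counts(n, m=1):
    """Set of folding counts over all words of length 2n on m letters and complements."""
    letters = [sign * j for j in range(1, m + 1) for sign in (1, -1)]
    return {count_foldings(w) for w in product(letters, repeat=2 * n)}


for n in range(5):
    print(n, sorted(fold_counts(n)))
```

The search visits $(2m)^{2n}$ words, so it is feasible only for the first few rows of the table.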

Working toward a more thorough understanding of the set , we first construct a superset of in Theorem 4.5 providing some structure for the values that can appear in . From there, we determine intervals of integers which do not lie in the set, such as integers in the interval from Proposition 4.1. Then, in search of the smallest value such that , we conclude by proving and hence in .

4.1. Catalan numbers and R(n,1)

The Catalan numbers, which enumerate the plane trees, are an integral part of the set . As already noted, for all . In fact, is the number of foldings of , and this is the maximum number of foldings for a word of length . Theorem 4.5 establishes a superset of which highlights the fundamental nature of the Catalan numbers in the values of . The following discussion and examples motivate the theorem.

As previously mentioned, for calculating the values in , it suffices to consider words in , foldable words which strictly alternate between unbarred and barred letters. For readability in this case, we will use and instead of and . Fix such a foldable word with entries that are and entries that are . Without loss of generality, assume begins with . For example, let . Here and .

Now consider the maximal subwords (consecutive letters) of which consist of only the letters and . We call these subwords maximal -subwords. Let be the number of letters in each of these maximal -subwords. Consequently and . In our present example, has maximal -subwords with lengths , , and .

Fix a non-crossing matching on the letters and in . We will use the term -matching to refer to such a partial matching of . Because of the alternating pattern of barred letters in the doubled alphabet any -matching partitions the maximal -subwords into groups of the form or when concatenated, where is the sum of the corresponding values. We will refer to these as -groupings. Thus, for each -matching , there is at least one non-crossing perfect matching of which extends since was foldable.

See Figure 5 for one possible -matching on our example word . With this -matching, the -subwords have been partitioned into with and with .

For a -grouping of length , there are ways to fold the group. Thus, each -matching of extends to non-crossing perfect matchings of , where is the number of -groupings and each is half of the sum of the corresponding subset of . For the present example, the -matching in Figure 5 extends to non-crossing matchings of . Alternatively, for the -matching , there is nothing separating the maximal -subwords, so extends in ways.

The following example highlights the structure that results from an -matching when there are maximal -subwords of odd length.

Example 4.4.

Let . Here , and . Again, there are multiple non-crossing -matchings, but any such matching ensures that the maximal -subwords of lengths and will be free to match with each other because they are the only two of odd length. One non-crossing -matching is and another is . Both and extend in ways. (See Figure 6.)