Multi de Bruijn Sequences

Glenn Tesler* Department of Mathematics
University of California, San Diego
La Jolla, CA 92093-0112
Abstract.

We generalize the notion of a de Bruijn sequence to a “multi de Bruijn sequence”: a cyclic or linear sequence that contains every $k$-mer over an alphabet of size $q$ exactly $m$ times. For example, over the binary alphabet $\{0,1\}$, the cyclic sequence $(00011011)$ and the linear sequence $000110110$ each contain two instances of each 2-mer $00, 01, 10, 11$. We derive formulas for the number of such sequences. The formulas and derivation generalize classical de Bruijn sequences (the case $m=1$). We also determine the number of multisets of aperiodic cyclic sequences containing every $k$-mer exactly $m$ times; for example, the pair of cyclic sequences $(00011)(011)$ contains two instances of each 2-mer listed above. This uses an extension of the Burrows-Wheeler Transform due to Mantaci et al, and generalizes a result by Higgins for the case $m=1$.

*This work was supported in part by National Science Foundation grant CCF-1115206.

1. Introduction

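We consider sequences over a totally ordered alphabet $\Omega$ of size $q$. A linear sequence is an ordinary sequence of elements of $\Omega$, denoted in string notation as $a_1 a_2 \ldots a_n$. Define the cyclic shift of a linear sequence by $\rho(a_1 a_2 \ldots a_{n-1} a_n) = a_n a_1 a_2 \ldots a_{n-1}$. In a cyclic sequence, we treat all rotations of a given linear sequence as equivalent:

$(a_1 \ldots a_n) = \{\rho^i(a_1 \ldots a_n) : i = 0, \ldots, n-1\}$

Each rotation $\rho^i(a_1 \ldots a_n)$ is called a linearization of the cycle $(a_1 \ldots a_n)$.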

A $k$-mer is a sequence of length $k$ over $\Omega$. The set of all $k$-mers over $\Omega$ is $\Omega^k$. A cyclic de Bruijn sequence is a cyclic sequence over alphabet $\Omega$ (of size $q$) in which all $k$-mers occur exactly once. The length of such a sequence is $q^k$, because each of the $q^k$ $k$-mers accounts for one starting position.

In 1894, the problem of counting cyclic de Bruijn sequences over a binary alphabet (the case $q=2$) was proposed by de Rivière [7] and solved by Sainte-Marie [16]. In 1946, the same problem was solved by de Bruijn [4], unaware of Sainte-Marie’s work. In 1951, the solution was extended to any size alphabet (general $q$) by van Aardenne-Ehrenfest and de Bruijn [19, p. 203]:

(1)  number of cyclic de Bruijn sequences $= \dfrac{q!^{\,q^{k-1}}}{q^k}$

The sequences were subsequently named de Bruijn sequences, and work continued on them for decades before the 1894 publication by Sainte-Marie was rediscovered in 1975 [5]. Table 1 summarizes the cases considered and notation used in these and other papers.

Reference Multiplicity Alphabet size Word size
[7] de Rivière (1894) 1 2
[16] Sainte-Marie (1894) 1 2
[4] de Bruijn (1946) 1 2
[19] van A.-E. & de Bruijn (1951) 1
[5] de Bruijn (1975) 1
[3] Dawson & Good (1957)
[8] Fredricksen (1982) 1
[11] Kandel et al (1996) variable any (uses 4)
[17] Stanley (1999) 1
[10] Higgins (2012) 1
[15] Osipov (2016)
Tesler (2016) [this paper]
Table 1. Notation or numerical values considered for de Bruijn sequences and related problems in the references.

We introduce a cyclic multi de Bruijn sequence: a cyclic sequence over alphabet $\Omega$ (of size $q$) in which all $k$-mers occur exactly $m$ times, with $m, q, k \ge 1$. Let $\mathcal{C}(m,q,k)$ denote the set of all such sequences. The length of such a sequence is $n = mq^k$, because each of the $q^k$ $k$-mers accounts for $m$ starting positions. For $m=2$, $q=2$, $k=3$, $\Omega=\{0,1\}$, one such sequence is $(1111011000101000)$. The length is $mq^k = 2 \cdot 2^3 = 16$. Each 3-mer 000, 001, 010, 011, 100, 101, 110, 111 occurs twice, including overlapping occurrences and occurrences that wrap around the end.
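The definition can be checked mechanically. The following Python sketch is only an illustration (the function names `cyclic_kmer_counts` and `is_cyclic_multi_de_bruijn` are our own and are not taken from the software mentioned later); it counts $k$-mers with wrap-around and tests whether each one occurs exactly $m$ times.

```python
from collections import Counter
from itertools import product

def cyclic_kmer_counts(seq, k):
    """Count k-mers of the cyclic sequence (seq), wrapping around the end."""
    n = len(seq)
    return Counter("".join(seq[(i + j) % n] for j in range(k)) for i in range(n))

def is_cyclic_multi_de_bruijn(seq, alphabet, k, m):
    """True if every k-mer over `alphabet` occurs exactly m times in cycle (seq)."""
    if len(seq) != m * len(alphabet) ** k:
        return False  # wrong length: some k-mer count must differ from m
    counts = cyclic_kmer_counts(seq, k)
    return all(counts["".join(w)] == m for w in product(alphabet, repeat=k))
```

For instance, `is_cyclic_multi_de_bruijn("00011011", "01", 2, 2)` accepts the cycle $(00011011)$ from the abstract, while `"00001111"` is rejected because 00 occurs three times.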

Let denote the set of linearizations of cyclic sequences in . These will be called linearized or linearized cyclic multi de Bruijn sequences. Let denote the set of linearizations that start with -mer . In the example above, the two linearizations starting with are and .

For a sequence $s$ and $i \ge 0$, let $s^i$ denote concatenating $i$ copies of $s$.

A cyclic sequence (or a linearization ) of length has a th order rotation for some (positive integer divisor of ) iff iff for some sequence of length . The order of cycle (or linearization ) is the largest such that . For example, has order , while if all rotations of are distinct, then has order . Since a cyclic multi de Bruijn sequence has exactly copies of each -mer, the order must divide into . Sets , , and denote multi de Bruijn sequences of order that are cyclic, linearized cyclic, and linearized cyclic starting with .
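The order of a linearization can be computed by finding its smallest rotation period. The Python sketch below is illustrative only; the name `rotational_order` is our own choice.

```python
def rotational_order(seq):
    """Return the rotational order of seq: len(seq) divided by its smallest period.

    A period p is a divisor of len(seq) such that rotating seq by p
    positions gives seq back; the smallest period gives the largest order.
    """
    n = len(seq)
    for period in range(1, n + 1):
        if n % period == 0 and seq == seq[period:] + seq[:period]:
            return n // period
```

With this sketch, `rotational_order("00110011")` is 2 (the sequence is `"0011"` repeated twice), and a sequence whose rotations are all distinct has order 1.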

A linear multi de Bruijn sequence is a linear sequence of length $mq^k + k - 1$ in which all $k$-mers occur exactly $m$ times. Let $\mathcal{L}(m,q,k)$ denote the set of such sequences and $\mathcal{L}_y(m,q,k)$ denote those starting with $k$-mer $y$. A linear sequence does not wrap around and is not the same as a linearization of a cyclic sequence. For $m=2$, $q=2$, $k=3$, an example is 111101100010100011, with length $2 \cdot 2^3 + 2 = 18$.
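The linear case differs from the cyclic case only in that $k$-mers do not wrap around. A minimal Python check, with a function name of our own choosing, might look as follows.

```python
from collections import Counter
from itertools import product

def is_linear_multi_de_bruijn(seq, alphabet, k, m):
    """True if every k-mer over `alphabet` occurs exactly m times in the
    linear sequence seq (no wrap-around)."""
    q = len(alphabet)
    if len(seq) != m * q ** k + k - 1:
        return False  # a linear multi de Bruijn sequence has this exact length
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return all(counts["".join(w)] == m for w in product(alphabet, repeat=k))
```

Applied to the example above, `is_linear_multi_de_bruijn("111101100010100011", "01", 3, 2)` holds.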

For $m, q, k \ge 1$, we will compute the number of linearized cyclic (Sec. 2), cyclic (Sec. 3), and linear (Sec. 4) multi de Bruijn sequences. We also compute the number of cyclic and linearized cyclic sequences with a $d$-th order rotational symmetry. We give examples in Sec. 5. In Sec. 6, we use brute force to generate all multi de Bruijn sequences for small values of $m, q, k$, confirming our formulas in certain cases. In Sec. 7, we show how to select a uniform random linear, linearized cyclic, or cyclic multi de Bruijn sequence. A summary of our notation and formulas is given in Table 2.

Definition Notation
Alphabet
Alphabet size
Word size
Multiplicity of each -mer
Rotational order of a sequence ; must divide into
Specific -mer that sequences start with
Complete de Bruijn multigraph
Cyclic shift
Power of a linear sequence ( times)
Möbius function
Euler’s totient function
Set of permutations of

Multi de Bruijn Sequence Set Size
Linear
Cyclic
Linearized cyclic
Multicyclic
Linear, starts with -mer
Linearized, starts with
Cyclic, order
Linearized, starts with ,
  order
Table 2. Notation and summary of formulas for the number of multi de Bruijn sequences of different types.

We also consider another generalization: a multicyclic de Bruijn sequence is a multiset of aperiodic cyclic sequences such that every $k$-mer occurs exactly $m$ times among all the cycles. For example, $(00011)(011)$ has two occurrences of each 2-mer $00, 01, 10, 11$. This generalizes results of Higgins [10] for the case $m=1$. In Sec. 8, we develop this generalization using the “Extended Burrows-Wheeler Transform” of Mantaci et al [13, 14]. In Sec. 9, we give another method to count multicyclic de Bruijn sequences by counting the number of ways to partition a balanced graph into aperiodic cycles with prescribed edge multiplicities.
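A multicyclic candidate can be checked in the same spirit as a single cycle. The sketch below is illustrative Python with our own function names. It assumes the convention (stated later for short cycles) that a $k$-mer may reuse positions of a cycle shorter than $k$, so the cycle (0) contributes one occurrence of 00.

```python
from collections import Counter
from itertools import product

def is_aperiodic(cycle):
    """True if all rotations of the cycle are distinct."""
    return all(cycle != cycle[i:] + cycle[:i] for i in range(1, len(cycle)))

def is_multicyclic_de_bruijn(cycles, alphabet, k, m):
    """True if the cycles are all aperiodic and every k-mer over `alphabet`
    occurs exactly m times in total (positions wrap, possibly repeatedly)."""
    counts = Counter()
    for c in cycles:
        if not is_aperiodic(c):
            return False
        n = len(c)
        for i in range(n):
            counts["".join(c[(i + j) % n] for j in range(k))] += 1
    kmers = ["".join(w) for w in product(alphabet, repeat=k)]
    total_ok = sum(counts.values()) == m * len(kmers)
    return total_ok and all(counts[w] == m for w in kmers)
```

For example, `is_multicyclic_de_bruijn(["00011", "011"], "01", 2, 2)` holds, as does the entry (0)(0)(01)(01)(1)(1) of Table 3(D).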

We implemented these formulas and algorithms in software available at
http://math.ucsd.edu/gptesler/multidebruijn .

Related work. The methods van Aardenne-Ehrenfest and de Bruijn [19] developed in 1951 to generalize de Bruijn sequences to alphabets of any size potentially could have been adapted to ; see Sec. 3. In 1957, Dawson and Good [3] counted “circular arrays” in which each -mer occurs times (in our notation). Their formula corresponds to an intermediate step in our solution, but overcounts cyclic multi de Bruijn sequences; see Sec. 2.3. Very recently, in 2016, Osipov [15] introduced the problem of counting “-ary -fold de Bruijn sequences.” However, Osipov only gives a partial solution, which appears to be incorrect; see Sec. 6.

2. Linearizations of cyclic multi de Bruijn sequences

2.1. Multi de Bruijn graph

We will compute the number of cyclic multi de Bruijn sequences by counting Eulerian cycles in a graph and adjusting for multiple Eulerian cycles corresponding to each cyclic multi de Bruijn sequence. For now, we assume that $k \ge 2$; we will separately consider $k = 1$ in Sec. 2.4. We will also clarify details for the case $q = 1$ in Sec. 2.5.

Define a multigraph $G = G(m,q,k)$ whose vertices are the $(k-1)$-mers over alphabet $\Omega$ of size $q$. There are $n = q^{k-1}$ vertices. For each $k$-mer $y = c_1 c_2 \ldots c_k \in \Omega^k$, add $m$ directed edges $c_1 \ldots c_{k-1} \to c_2 \ldots c_k$, each labelled by $y$. Every vertex has outdegree $mq$ and indegree $mq$. Further, we will show in Sec. 2.2 that the graph is strongly connected, so $G$ is Eulerian.

Consider a walk through the graph:

This walk corresponds to a linear sequence , which can be determined either from vertex labels (when ) or edge labels (when ). If the walk is a cycle (first vertex equals last vertex), then it also corresponds to a linearization of cyclic sequence . Starting at another location in the cycle will result in a different linearization of the same cyclic sequence.

Any Eulerian cycle in this graph gives a cyclic sequence in . Conversely, each sequence in corresponds to at least one Eulerian cycle.

Cyclic multi de Bruijn sequences may also be obtained through a generalization of Hamiltonian cycles. Consider cycles in , with repeated edges and vertices allowed, in which every vertex (-mer) is visited exactly times (not double-counting the initial vertex when it is used to close the cycle). Each such graph cycle starting at vertex corresponds to the linearization it spells in , and vice-versa. However, we will use the Eulerian cycle approach because it leads to an enumeration formula.

2.2. Matrices

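Let $n = q^{k-1}$ and form the $n \times n$ adjacency matrix $A$ of directed graph $G(m,q,k)$. When $k \ge 2$,

(2)  $A_{wx} = m$ if $w_2 \ldots w_{k-1} = x_1 \ldots x_{k-2}$, and $A_{wx} = 0$ otherwise.

For every pair of $(k-1)$-mers $w = a_1 \ldots a_{k-1}$ and $x = b_1 \ldots b_{k-1}$, the walks of length $k-1$ from $w$ to $x$ have this form:

$a_1 \ldots a_{k-1} \to a_2 \ldots a_{k-1} b_1 \to \cdots \to a_{k-1} b_1 \ldots b_{k-2} \to b_1 \ldots b_{k-1}$

For each of the $k-1$ arrows, there are $m$ parallel edges to choose from. Thus, there are $m^{k-1}$ walks of length $k-1$ from $w$ to $x$, so the graph is strongly connected and $(A^{k-1})_{wx} = m^{k-1}$ for all pairs of vertices $w, x$. This gives $A^{k-1} = m^{k-1} J$, where $J$ is the $n \times n$ matrix of all 1’s.

$J$ has an eigenvalue $n$ of multiplicity $1$ and an eigenvalue $0$ of multiplicity $n-1$. So $A^{k-1}$ has an eigenvalue $m^{k-1} n = (mq)^{k-1}$ of multiplicity $1$ and an eigenvalue $0$ of multiplicity $n-1$.

Thus, $A$ has one eigenvalue of the form $mq\,\omega$, where $\omega$ is a $(k-1)$th root of unity, and $n-1$ eigenvalues equal to $0$. In fact, the first eigenvalue of $A$ is $mq$, because the all 1’s column vector $\mathbf{1}$ is a right eigenvector with eigenvalue $mq$ (that is, $A\mathbf{1} = mq\,\mathbf{1}$): for every vertex $w$, we have

$(A\mathbf{1})_w = \sum_x A_{wx} = \mathrm{outdeg}(w) = mq$

Similarly, the all 1’s row vector is a left eigenvector with eigenvalue $mq$.

The degree matrix of an Eulerian graph is an $n \times n$ diagonal matrix $D$ with $D_{vv} = \mathrm{indeg}(v) = \mathrm{outdeg}(v)$ on the diagonal and $D_{vw} = 0$ for $v \ne w$. All vertices in $G$ have indegree and outdegree $mq$, so $D = mq\,I$. The Laplacian matrix of $G$ is $L = D - A = mq\,I - A$. It has one eigenvalue equal to $0$ and $n-1$ eigenvalues equal to $mq$.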
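The matrix identities in this subsection are easy to confirm numerically for small parameters. The following pure-Python sketch (our own helper names, with vertices represented as tuples of digits, which is an assumption of this illustration) builds the adjacency matrix for $k \ge 2$ and checks that its $(k-1)$th power is the all 1's matrix scaled by $m^{k-1}$, and that every row sums to $mq$.

```python
from itertools import product

def adjacency(m, q, k):
    """Adjacency matrix of the multi de Bruijn graph for k >= 2.
    Vertices are (k-1)-mers, represented as tuples of digits 0..q-1."""
    verts = list(product(range(q), repeat=k - 1))
    # m parallel edges from w to x exactly when w without its first letter
    # equals x without its last letter.
    return [[m if w[1:] == x[:-1] else 0 for x in verts] for w in verts]

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def power_is_scaled_all_ones(m, q, k):
    """Check A^(k-1) == m^(k-1) * J and that every row of A sums to m*q."""
    a = adjacency(m, q, k)
    p = a
    for _ in range(k - 2):
        p = mat_mul(p, a)
    target = m ** (k - 1)
    return (all(entry == target for row in p for entry in row)
            and all(sum(row) == m * q for row in a))
```

The row-sum check corresponds to the all 1's vector being an eigenvector with eigenvalue $mq$.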

2.3. Number of Eulerian cycles in the graph

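Choose any edge $e = (x, z)$ in $G$. Let $y$ be the $k$-mer represented by $e$. The number of spanning trees of $G$ with root $x$ (with a directed path from each vertex to root $x$) is given by Tutte’s Matrix-Tree Theorem for directed graphs [18, Theorem 3.6] (also see [19, Theorem 7] and [17, Theorem 5.6.4]). For a directed Eulerian graph, the formula can be expressed as follows [17, Cor. 5.6.6]: the number of spanning trees of $G$ rooted at $x$ is

(3)  $\dfrac{1}{n} \cdot (\text{product of the nonzero eigenvalues of } L) = \dfrac{(mq)^{q^{k-1}-1}}{q^{k-1}} = m^{q^{k-1}-1}\, q^{q^{k-1}-k}$

By the BEST Theorem [19, Theorem 5b] (also see [17, pp. 56, 68] and [8]), the number of Eulerian cycles with initial edge $e$ is

(4)  $(\#\text{ spanning trees rooted at } x) \cdot \prod_{v} (\mathrm{outdeg}(v) - 1)!$
(5)  $\quad = m^{q^{k-1}-1}\, q^{q^{k-1}-k} \cdot (mq-1)!^{\,q^{k-1}}$

Each Eulerian cycle spells out a linearized multi de Bruijn sequence that starts with $k$-mer $y$. However, when $m \ge 2$, there are multiple cycles generating each such sequence. Let $C$ be an Eulerian cycle spelling out linearization $s$. For $k$-mer $y$, the first edge of the cycle, $e$, was given; the other $m-1$ edges labelled by $y$ are parallel to $e$ and may be permuted in $C$ in any of $(m-1)!$ ways. For the other $q^k - 1$ $k$-mers, the $m$ edges representing each $k$-mer can be permuted in $C$ in $m!$ ways. Thus, each linearization $s$ starting in $y$ is generated by $(m-1)! \cdot m!^{\,q^k-1}$ Eulerian cycles that start with edge $e$. Divide Eq. (5) by this factor to obtain the number of linearized multi de Bruijn sequences starting with $k$-mer $y$:

(6)  $|\mathcal{LC}_y(m,q,k)| = \dfrac{m^{q^{k-1}-1}\, q^{q^{k-1}-k} \cdot (mq-1)!^{\,q^{k-1}}}{(m-1)! \cdot m!^{\,q^k-1}} = W(m,q,k)$

where we set

(7)  $W(m,q,k) = \dfrac{1}{q^k} \left( \dfrac{(mq)!}{m!^{\,q}} \right)^{q^{k-1}} = \dfrac{(mq)!^{\,q^{k-1}}}{q^k \, m!^{\,q^k}}$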
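As a sanity check on the algebra, the count can be computed both ways: from the spanning tree and BEST Theorem factors divided by the number of Eulerian cycles per linearization, and from the closed form in which each of the $q^{k-1}$ vertices contributes a multinomial factor $(mq)!/m!^q$, with an overall division by $q^k$. The Python sketch below uses exact integer arithmetic; the function names are our own and the block is an illustration rather than the author's implementation.

```python
from math import factorial

def w_via_best(m, q, k):
    """Linearizations starting with a fixed k-mer, via the BEST Theorem count."""
    n = q ** (k - 1)                      # number of vertices
    trees = (m * q) ** (n - 1) // n       # spanning trees rooted at a vertex
    cycles = trees * factorial(m * q - 1) ** n   # Eulerian cycles from a fixed edge
    # Each linearization arises from (m-1)! * m!^(q^k - 1) Eulerian cycles.
    return cycles // (factorial(m - 1) * factorial(m) ** (q ** k - 1))

def w_closed_form(m, q, k):
    """Closed form: (mq)!^(q^(k-1)) / (q^k * m!^(q^k))."""
    return factorial(m * q) ** (q ** (k - 1)) // (q ** k * factorial(m) ** (q ** k))
```

Both functions give 9 for $m = q = k = 2$, matching the nine linearizations starting with 00 listed in Table 3(A).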

Related work. Dawson and Good [3, p. 955] computed the number of “circular arrays” containing each $k$-mer exactly $m$ times (in our notation; theirs is different), and obtained an answer equivalent to (5). They counted graph cycles in $G(m,q,k)$ starting on a specific edge, but when $m \ge 2$, this does not equal $|\mathcal{C}(m,q,k)|$; their count includes each cyclic multi de Bruijn sequence multiple times. This is because they use the convention that the multiple occurrences of each symbol are distinguishable [3, p. 947]. We do additional steps to count each cyclic multi de Bruijn sequence just once: we adjust for multiple graph cycles corresponding to each linearization (6) and for multiple linearizations of each cyclic sequence (Sec. 3).

2.4. Special case $k = 1$

When $k = 1$, the number of sequences of length $mq$ with each symbol in $\Omega$ occurring exactly $m$ times is $(mq)!/m!^{\,q}$. Divide by $q$ to obtain the number of linearized de Bruijn sequences starting with each 1-mer $y \in \Omega$:

$|\mathcal{LC}_y(m,q,1)| = \dfrac{1}{q} \cdot \dfrac{(mq)!}{m!^{\,q}}$

This agrees with plugging $k = 1$ into (6), so (6) holds for $k = 1$ as well.

The derivation of (6) in Secs. 2.1–2.3 requires $k \ge 2$, due to technicalities. The above derivation uses a different method, but we can also adapt the first derivation as follows.

$G(m,q,1)$ has a single vertex ‘$\emptyset$’ (a $0$-mer) and $mq$ loops on $\emptyset$, because for each $a \in \Omega$, there are $m$ loops $\emptyset \to \emptyset$ labelled $a$. The sequence spelling out a walk can be determined from edge labels but not vertex labels, since they’re null.

The adjacency matrix is $A = [mq]$ rather than Eq. (2). The Laplacian matrix $L = mq\,I - A = [0]$ has one eigenvalue: $0$. In (3), the product of nonzero eigenvalues is vacuous; on plugging in $k = 1$, the formula correctly gives that there is $1$ spanning tree (it is the vertex $\emptyset$ without any edges). The remaining steps work for $k = 1$ as-is.

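2.5. Special case $q = 1$

For $q = 1$ (say $\Omega = \{0\}$), the only cyclic sequence of length $m$ is $(0^m)$. We regard cycle $(0^m)$ and linearization $0^m$ as having an occurrence of $0^k$ starting at each position, even if $k > m$, in which case some positions of $(0^m)$ are used multiple times to form $0^k$. With this convention, there are exactly $m$ occurrences of $0^k$ in $(0^m)$. Thus, for $q = 1$ and $m, k \ge 1$,

$|\mathcal{C}(m,1,k)| = |\mathcal{LC}(m,1,k)| = |\mathcal{LC}_{0^k}(m,1,k)| = 1$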
Using positions of a cycle multiple times to form a -mer will also arise in multicyclic de Bruijn sequences in Sec. 8.

3. Cyclic multi de Bruijn sequences

In a classical de Bruijn sequence ($m = 1$), each $k$-mer occurs exactly once, so the number of cyclic de Bruijn sequences equals the number of linearizations that begin with any fixed $k$-mer $y$:

$|\mathcal{C}(1,q,k)| = |\mathcal{LC}_y(1,q,k)| = \dfrac{q!^{\,q^{k-1}}}{q^k}$

This agrees with (1). But a multi de Bruijn sequence has $m$ cyclic shifts starting with $y$, and the number of these that are distinct varies by sequence. Let $d$ be the order of the cyclic symmetry of the sequence; then there are $m/d$ distinct linearizations starting with $y$, so

(8)  $|\mathcal{C}^{(d)}(m,q,k)| = \dfrac{d}{m}\, |\mathcal{LC}^{(d)}_y(m,q,k)|$

Further, we have the following:

Lemma 3.1.

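For $m, q, k \ge 1$ and $d \mid m$:

(a) $\mathcal{LC}^{(d)}(m,q,k) = \{\, t^d : t \in \mathcal{LC}^{(1)}(m/d,q,k) \,\}$

(b) For each $k$-mer $y$, $\mathcal{LC}^{(d)}_y(m,q,k) = \{\, t^d : t \in \mathcal{LC}^{(1)}_y(m/d,q,k) \,\}$.

(c) $|\mathcal{C}^{(d)}(m,q,k)| = |\mathcal{C}^{(1)}(m/d,q,k)|$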

Proof.

At , all sets in (a)–(c) have size . Below, we consider .

(a) Let . Since has order , it splits into equal parts, . The occurrences of each -mer in reduce to occurrences of each -mer in , and the order of is , so .

Conversely, if then has occurrences of each -mer and has order , so .

(b) Continuing with (a), the length of is at least (since for and ), so the initial -mer of and must be the same.

(c) The map in (a) induces a bijection as follows. Let . For any linearization of , let and set . This is well-defined: the distinct linearizations of are where . Rotating by also rotates by and gives the same cycle in , so all linearizations of give the same result.

Conversely, given , the unique inverse is . As above, this is well-defined. ∎

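Partitioning the cyclic multi de Bruijn sequences by order gives the following, where the sums run over positive integers $d$ that divide into $m$:

(9)  $|\mathcal{C}(m,q,k)| = \sum_{d \mid m} |\mathcal{C}^{(d)}(m,q,k)| = \sum_{d \mid m} |\mathcal{C}^{(1)}(m/d,q,k)|$

Each cyclic sequence of order $d$ has $m/d$ distinct linearizations starting with $y$, so

$|\mathcal{LC}_y(m,q,k)| = \sum_{d \mid m} \dfrac{m}{d}\, |\mathcal{C}^{(d)}(m,q,k)| = \sum_{d \mid m} \dfrac{m}{d}\, |\mathcal{C}^{(1)}(m/d,q,k)|$

By Möbius Inversion,

$m \, |\mathcal{C}^{(1)}(m,q,k)| = \sum_{d \mid m} \mu(d)\, |\mathcal{LC}_y(m/d,q,k)|$

Solve this for $|\mathcal{C}^{(1)}(m,q,k)|$:

(10)  $|\mathcal{C}^{(1)}(m,q,k)| = \dfrac{1}{m} \sum_{d \mid m} \mu(d)\, |\mathcal{LC}_y(m/d,q,k)|$

Combining this with (6) and Lemma 3.1(c) gives, writing $W(m,q,k)$ for the count $|\mathcal{LC}_y(m,q,k)|$ in (6)–(7),

(11)  $|\mathcal{C}^{(d)}(m,q,k)| = |\mathcal{C}^{(1)}(m/d,q,k)| = \dfrac{d}{m} \sum_{d' \mid (m/d)} \mu(d')\, W\!\left(\dfrac{m}{d d'}, q, k\right)$

Plug (10) into (9) to obtain

$|\mathcal{C}(m,q,k)| = \sum_{d \mid m} \dfrac{d}{m} \sum_{d' \mid (m/d)} \mu(d')\, W\!\left(\dfrac{m}{d d'}, q, k\right)$

Change variables: set $r = d d'$, so that

$|\mathcal{C}(m,q,k)| = \dfrac{1}{m} \sum_{r \mid m} W(m/r, q, k) \sum_{d \mid r} d\, \mu(r/d) = \dfrac{1}{m} \sum_{r \mid m} \varphi(r)\, W(m/r, q, k)$

where $\varphi(r)$ is the Euler totient function. Thus,

(12)  $|\mathcal{C}(m,q,k)| = \dfrac{1}{m} \sum_{r \mid m} \varphi(r)\, W(m/r, q, k)$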
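The totient sum is straightforward to evaluate. The Python sketch below is an illustration with our own function names; it assumes the closed form $(mq)!^{q^{k-1}} / (q^k\, m!^{q^k})$ for the number of linearizations starting with a fixed $k$-mer, and computes Euler's totient naively.

```python
from math import factorial, gcd

def w(m, q, k):
    """Number of linearizations starting with a fixed k-mer (closed form)."""
    return factorial(m * q) ** (q ** (k - 1)) // (q ** k * factorial(m) ** (q ** k))

def totient(r):
    """Euler's totient function, computed naively."""
    return sum(1 for i in range(1, r + 1) if gcd(i, r) == 1)

def num_cyclic(m, q, k):
    """Number of cyclic multi de Bruijn sequences via the totient sum."""
    total = sum(totient(r) * w(m // r, q, k)
                for r in range(1, m + 1) if m % r == 0)
    return total // m
```

For $m = q = 2$ this gives 5, 82, and 52496 for $k = 2, 3, 4$, the values discussed in Secs. 5 and 6.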

Related work. Van Aardenne-Ehrenfest and de Bruijn [19, Sec. 5] take a directed Eulerian multigraph and replace each edge by a “bundle” of parallel edges to obtain a multigraph . They compute the number of Eulerian cycles in in terms of the number of Eulerian cycles in , and obtain a formula related to (12). Their bundle method could have been applied to count cyclic multi de Bruijn sequences in terms of ordinary cyclic de Bruijn sequences by taking , , and considering , but they did not do this. Instead, they set and found a correspondence between Eulerian cycles in vs.  with , yielding a recursion in for . They derived (1) by this recursion rather than the BEST Theorem, which was also introduced in that paper. Dawson and Good [3, p. 955] subsequently derived (1) by the BEST Theorem.

4. Linear multi de Bruijn sequences

Now we consider $\mathcal{L}(m,q,k)$, the set of linear sequences in which every $k$-mer over $\Omega$ occurs exactly $m$ times. A linear sequence is not the same as a linearized representation of a circular sequence. Recall that a linearized sequence has length $n = mq^k$. A linear sequence has length $n + k - 1$, because there are $n$ positions at which $k$-mers start, and an additional $k-1$ positions at the end to complete the final $k$-mers. Below, assume $q \ge 2$.

Lemma 4.1.

For $q \ge 2$, for each sequence in $\mathcal{L}(m,q,k)$, the first and last $(k-1)$-mer are the same.

Proof.

Count the number of times each -mer occurs in as follows:

Each -mer occurs times in among starting positions . Since is the first letters of different -mers, there are occurrences of starting in this range. There is also a -mer with an additional occurrence at starting position (the last positions of ), so occurs times in while all other -mers occur times.

By a similar argument, the -mer at the start of has a total of occurrences in , and all other -mers occur exactly times.

So and each occur times in . But we showed that there is only one -mer occurring times, so . ∎

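For $q \ge 2$, this leads to a bijection $f : \mathcal{L}(m,q,k) \to \mathcal{LC}(m,q,k)$. Given $s \in \mathcal{L}(m,q,k)$, drop the last $k-1$ characters to obtain sequence $f(s)$. To invert, given $t \in \mathcal{LC}(m,q,k)$, form $f^{-1}(t)$ by repeating the first $k-1$ characters of $t$ at the end.

For $i = 1, \ldots, n$, the same $k$-mer occurs linearly at position $i$ of $s$ and circularly at position $i$ of $f(s)$, so every $k$-mer occurs the same number of times in both linear sequence $s$ and cycle $(f(s))$. Thus, $f$ maps $\mathcal{L}(m,q,k)$ to $\mathcal{LC}(m,q,k)$, and $f^{-1}$ does the reverse.

This also gives $|\mathcal{L}_y(m,q,k)| = |\mathcal{LC}_y(m,q,k)|$. Multiply (6) by $q^k$ choices of initial $k$-mer $y$ to obtain

(13)  $|\mathcal{L}(m,q,k)| = |\mathcal{LC}(m,q,k)| = q^k \, |\mathcal{LC}_y(m,q,k)| = \left( \dfrac{(mq)!}{m!^{\,q}} \right)^{q^{k-1}}$

Above, we assumed $q \ge 2$. Eq. (13) also holds for $q = 1$, since all four parts equal $1$ (see Sec. 2.5).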
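The bijection between linear sequences and linearizations amounts to trimming or re-appending the first $k-1$ characters. A minimal Python sketch, with function names of our own choosing:

```python
def linear_to_linearized(s, k):
    """Drop the last k-1 characters of a linear multi de Bruijn sequence."""
    return s[:len(s) - (k - 1)]

def linearized_to_linear(t, k):
    """Repeat the first k-1 characters of a linearization at its end."""
    return t + t[:k - 1]
```

For example, the linearization 00011011 with $k = 2$ maps to the linear sequence 000110110, which appears in Table 3(C).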

5. Examples

(A) $\mathcal{LC}_{00}(2,2,2)$: Linearizations starting with 00

Order 1: 00010111 00011011 00011101 00100111
00101110 00110110 00111010 00111001
Order 2: 00110011

(B) $\mathcal{C}(2,2,2)$: Cyclic multi de Bruijn sequences

Order 1: (00010111) (00011011) (00011101) (00100111)
Order 2: (00110011)

(C) $\mathcal{L}(2,2,2)$: Linear multi de Bruijn sequences

000101110 010001110 100010111 110001011
000110110 010011100 100011011 110001101
000111010 010111000 100011101 110010011
001001110 011000110 100100111 110011001
001011100 011001100 100110011 110100011
001100110 011011000 100111001 110110001
001101100 011100010 101000111 111000101
001110010 011100100 101100011 111001001
001110100 011101000 101110001 111010001

(D) $\mathcal{MC}(2,2,2)$: Multicyclic de Bruijn sequences

(0)(0)(01)(01)(1)(1) (0)(001011)(1) (0001)(01)(1)(1) (00011)(01)(1)
(0)(0)(01)(011)(1) (0)(0010111) (0001)(011)(1) (000111)(01)
(0)(0)(01)(0111) (0)(0011)(011) (0001)(0111) (00011011)
(0)(0)(01011)(1) (0)(00101)(1)(1) (0001011)(1) (001)(001)(1)(1)
(0)(0)(010111) (0)(001101)(1) (00010111) (001)(0011)(1)
(0)(0)(011)(011) (0)(0011101) (00011)(011) (001)(00111)
(0)(001)(01)(1)(1) (0)(0011)(01)(1) (000101)(1)(1) (0010011)(1)
(0)(001)(011)(1) (0)(00111)(01) (0001101)(1) (00100111)
(0)(001)(0111) (0)(0011011) (00011101) (0011)(0011)

Table 3. Multi de Bruijn sequences of each type, with $m = 2$ copies of each $k$-mer, alphabet $\Omega = \{0,1\}$ of size $q = 2$, and word size $k = 2$.

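The case $m = 1$, arbitrary $q, k$, is the classical de Bruijn sequence. Eq. (12) has just one term, $r = 1$, and agrees with (1):

$|\mathcal{C}(1,q,k)| = W(1,q,k) = \dfrac{q!^{\,q^{k-1}}}{q^k}$

When $m = p$ is prime, the sum in (12) for the number of cyclic multi de Bruijn sequences has two terms (divisors $r = 1, p$):

(14)  $|\mathcal{C}(p,q,k)| = \dfrac{1}{p} \left( W(p,q,k) + (p-1)\, W(1,q,k) \right) = \dfrac{1}{p\, q^k} \left( \left( \dfrac{(pq)!}{p!^{\,q}} \right)^{q^{k-1}} + (p-1)\, q!^{\,q^{k-1}} \right)$

For $m = q = k = 2$, Table 3 lists all multi de Bruijn sequences of each type. Table 3(A) shows linearizations beginning with $y = 00$. By (6),

$|\mathcal{LC}_{00}(2,2,2)| = W(2,2,2) = \dfrac{1}{2^2} \left( \dfrac{4!}{2!^{\,2}} \right)^{2} = \dfrac{36}{4} = 9$

Some of these are equivalent upon cyclic rotations, yielding 5 distinct cyclic multi de Bruijn sequences, shown in Table 3(B). By (12) or (14),

$|\mathcal{C}(2,2,2)| = \dfrac{1}{2} \left( W(2,2,2) + W(1,2,2) \right) = \dfrac{1}{2}(9 + 1) = 5$

There are 9 linearizations starting with each 2-mer 00, 01, 10, 11. Converting these to linear multi de Bruijn sequences yields $9 \cdot 4 = 36$ linear multi de Bruijn sequences in total, shown in Table 3(C). By (13),

$|\mathcal{L}(2,2,2)| = \left( \dfrac{4!}{2!^{\,2}} \right)^{2} = 6^2 = 36$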

Table 3(D) shows the 36 multicyclic de Bruijn sequences; see Sec. 8.

6. Generating multi de Bruijn sequences by brute force

Osipov [15] recently defined a “-ary -fold de Bruijn sequence”; this is the same as a cyclic multi de Bruijn sequence but in different terminology and notation. The description below is in terms of our notation , which respectively correspond to Osipov’s (see Table 2).

Osipov developed a method [15, Prop. 3] to compute the number of cyclic multi de Bruijn sequences for multiplicity $m = 2$, alphabet size $q = 2$, and word size $k$. This method requires performing a detailed construction for each $k$. The paper states the two sequences for $k = 1$, which agrees with our $\mathcal{C}(2,2,1)$ of size 2, and then carries out this construction to evaluate cases $k = 2, 3, 4$. It determines complete answers for those specific cases only [15, pp. 158–161]. For $k = 2$, Osipov’s answer agrees with ours, $|\mathcal{C}(2,2,2)| = 5$. But for $k = 3$ and $k = 4$ respectively, Osipov’s method gives 72 and 43768, whereas by (12) or (14), we obtain $|\mathcal{C}(2,2,3)| = 82$ and $|\mathcal{C}(2,2,4)| = 52496$.

For these parameters, it is feasible to find all cyclic multi de Bruijn sequences by a brute force search in (with ). This agrees with our formula (12), while disagreeing with Osipov’s results for and . Table 4 lists the 82 sequences in . The 52496 sequences for and an implementation of the algorithm below are on the website listed in the introduction. We also used brute force to confirm (12) for all combinations of parameters with at most a 32-bit search space ():

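Represent $\Omega^n$ (with $n = mq^k$) by $n$-digit base $q$ integers $N$ (range $[0, q^n - 1]$): the sequence $s = a_1 a_2 \ldots a_n$ with digits $a_i \in \{0, \ldots, q-1\}$ corresponds to

$N = \sum_{i=1}^{n} a_i \, q^{n-i}$

Since we represent each element of $\mathcal{C}(m,q,k)$ by a linearization that starts with $k$-mer $0^k$, we restrict to $N \in [0, q^{n-k} - 1]$. To generate $\mathcal{C}(m,q,k)$:

  (B1) for $N$ from $0$ to $q^{n-k} - 1$ (pruning as described later):

  (B2)   let $s$ be the $n$-digit base $q$ expansion of $N$

  (B3)   if (cycle $(s)$ is a cyclic multi de Bruijn sequence

  (B4)       and $s$ is smallest among its cyclic shifts)

  (B5)   then output cycle $(s)$

To generate $\mathcal{LC}_{0^k}(m,q,k)$, skip (B4) and output linearizations $s$ in (B5).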
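The search loop can be sketched in Python as follows. This is an illustrative version without the pruning described below, written with our own function name and restricted to alphabets of at most ten digit symbols; it is not the implementation provided on the website.

```python
from collections import Counter
from itertools import product

def brute_force_cyclic(m, q, k):
    """List the lexicographically least rotation of every cyclic multi
    de Bruijn sequence, by exhaustive search (no pruning)."""
    n = m * q ** k
    digits = "0123456789"[:q]          # assumes q <= 10
    found = []
    # Every candidate starts with k zeros; the tails are tried in increasing order.
    for tail in product(digits, repeat=n - k):
        s = digits[0] * k + "".join(tail)
        counts = Counter("".join(s[(i + j) % n] for j in range(k))
                         for i in range(n))
        # n = m * q^k positions in total, so all counts equal m
        # exactly when q^k distinct k-mers appear m times each.
        if len(counts) == q ** k and all(c == m for c in counts.values()):
            if s == min(s[i:] + s[:i] for i in range(n)):
                found.append(s)
    return found
```

With $m = q = k = 2$ this returns the five cycles of Table 3(B), and with $m = q = 2$, $k = 3$ it returns 82 sequences.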

To determine if is a cyclic multi de Bruijn sequence, examine its -mers (subscripts taken modulo ) in the order , counting how many times each -mer is encountered. Terminate the test early (and potentially prune as described below) if any -mer count exceeds . If the test does not terminate early, then all -mer counts must equal , and it is a multi de Bruijn sequence.

If when -mer counting terminates, then some -mer count exceeded without examining . Advance to instead of , skipping all remaining sequences starting with the same characters. This is equivalent to incrementing the digit prefix as a base number and then concatenating ’s to the end. However, if , then all characters have been examined, so advance to .

We may do additional pruning by counting -mers for . Each -mer is a prefix of -mers and thus should occur times in . If a -mer has more than occurrences in with , advance as described above.

We may also restrict the search space to length sequences with exactly occurrences of each of the characters. We did not need to implement this for the parameters considered.

(0000100101101111), (0000100101110111), (0000100101111011),
(0000100110101111), (0000100110111101), (0000100111010111),
(0000100111011101), (0000100111101011), (0000100111101101),
(0000101001101111), (0000101001110111), (0000101001111011),
(0000101011001111), (0000101011100111), (0000101011110011),
(0000101100101111), (0000101100111101), (0000101101001111),
(0000101101111001), (0000101110010111), (0000101110011101),
(0000101110100111), (0000101110111001), (0000101111001011),
(0000101111001101), (0000101111010011), (0000101111011001),
(0000110010101111), (0000110010111101), (0000110011110101),
(0000110100101111), (0000110100111101), (0000110101001111),
(0000110101111001), (0000110111100101), (0000110111101001),
(0000111001010111), (0000111001011101), (0000111001110101),
(0000111010010111), (0000111010011101), (0000111010100111),
(0000111010111001), (0000111011100101), (0000111011101001),
(0000111100101011), (0000111100101101), (0000111100110101),
(0000111101001011), (0000111101001101), (0000111101010011),
(0000111101011001), (0000111101100101), (0000111101101001),
(0001000101101111), (0001000101110111), (0001000101111011),
(0001000110101111), (0001000110111101), (0001000111010111),
(0001000111011101), (0001000111101011), (0001000111101101),
(0001010001101111), (0001010001110111), (0001010001111011),
(0001010110001111), (0001010111000111), (0001010111100011),
(0001011000101111), (0001011000111101), (0001011010001111),
(0001011100010111), (0001011100011101), (0001011101000111),
(0001011110001101), (0001011110100011), (0001100011110101),
(0001101000111101), (0001101010001111), (0001110001110101),
(0001110100011101)
Table 4. $\mathcal{C}(2,2,3)$: For multiplicity $m = 2$, alphabet size $q = 2$, and word size $k = 3$, there are 82 cyclic multi de Bruijn sequences. We show the lexicographically least rotation of each.

7. Generating a uniform random multi de Bruijn sequence

We present algorithms to select a uniform random multi de Bruijn sequence in the linear, linearized, and cyclic cases.

Kandel et al [11] present two algorithms to generate random sequences with a specified number of occurrences of each -mer: (i) a “shuffling” algorithm that permutes the characters of a sequence in a manner that preserves the number of times each -mer occurs, and (ii) a method to choose a uniform random Eulerian cycle in an Eulerian graph and spell out a cyclic sequence from the cycle. Both methods may be applied to select linear or linearized multi de Bruijn sequences uniformly. However, additional steps are required to choose a uniform random cyclic de Bruijn sequence; see Sec. 7.2.

Example 7.1.

For $m = q = k = 2$, Table 3(A) lists the nine linearizations starting with 00. The random Eulerian cycle algorithm selects each with probability $1/9$. Table 3(B) lists the five cyclic multi de Bruijn sequences. The four cyclic sequences of order 1 are each selected with probability $2/9$ since they have two linearizations starting with $00$, while (00110011) has one such linearization and thus is selected with probability $1/9$. Thus, the cyclic sequences are not chosen uniformly (which is probability $1/5$ for each).
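The probabilities in this example can be reproduced by grouping the nine linearizations of Table 3(A) by their cycle. The Python sketch below is illustrative; the helper names are our own, and each cycle is identified by its lexicographically least rotation.

```python
from collections import Counter
from fractions import Fraction

# The nine linearizations starting with 00, copied from Table 3(A).
LINEARIZATIONS = [
    "00010111", "00011011", "00011101", "00100111",
    "00101110", "00110110", "00111010", "00111001",
    "00110011",
]

def canonical(s):
    """Lexicographically least rotation, used as the name of the cycle (s)."""
    return min(s[i:] + s[:i] for i in range(len(s)))

def cycle_probabilities(linearizations):
    """Probability of each cycle when a linearization is chosen uniformly."""
    counts = Counter(canonical(s) for s in linearizations)
    total = len(linearizations)
    return {cycle: Fraction(c, total) for cycle, c in counts.items()}
```

Calling `cycle_probabilities(LINEARIZATIONS)` gives five cycles, four with probability 2/9 and one, (00110011), with probability 1/9.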

Kandel et al’s shuffling algorithm performs a large number of random swaps of intervals in a sequence and approximates the probabilities listed above. Below, we focus on the random Eulerian cycle algorithm.

7.1. Random linear(ized) multi de Bruijn sequences

Kandel et al [11] present a general algorithm for finding random cycles in a directed Eulerian graph, and apply it to the de Bruijn graph of a string: the vertices are the $(k-1)$-mers of the string and the edges are the $k$-mers repeated with their multiplicities in the string. As a special case, the de Bruijn graph of any cyclic sequence in $\mathcal{C}(m,q,k)$ is the same as our $G(m,q,k)$. We explain their algorithm and how it specializes to $G(m,q,k)$.

Let be a directed Eulerian multigraph and be an edge in . The proof of the BEST Theorem [19, Theorem 5b] gives a bijection between

  1. Directed Eulerian cycles of whose first edge is , and

  2. ordered pairs