Linear-size CDAWG: new repetition-aware indexing and grammar compression
Abstract
In this paper, we propose a novel approach to combine compact directed acyclic word graphs (CDAWGs) and grammar-based compression. This leads us to an efficient self-index, called the Linear-size CDAWG (LCDAWG), which can be represented with O(e log n) bits of space allowing for O(log n)-time random and O(1)-time sequential accesses to edge labels, and O(m log σ + occ)-time pattern matching. Here, e is the number of all extensions of maximal repeats in T, n and m are respectively the lengths of the text T and a given pattern P, σ is the alphabet size, and occ is the number of occurrences of the pattern in T. The repetitiveness measure e is known to be much smaller than the text length n for highly repetitive text. For constant alphabets, our LCDAWGs achieve O(m + occ) pattern matching time with O(e_R log n) bits of space, which improves the pattern matching time of Belazzougui et al.'s run-length BWT-CDAWGs by a factor of log log n, with the same space complexity. Here, e_R is the number of right extensions of maximal repeats in T. As a byproduct, our result gives a way of constructing a straight-line program (SLP) of size O(e) for a given text T in O(n + e log σ) time.
1 Introduction
Background: Text indexing is a fundamental problem in theoretical computer science, where the task is to preprocess a given text T so that subsequent pattern matching queries can be answered quickly. It has wide applications such as information retrieval, bioinformatics, and big data analytics [14, 10]. There has been a lot of recent research on compressed text indexes [4, 10, 14, 1, 13, 11, 16, 9] that store a text T supporting extract and find operations in space significantly smaller than the total size of the text. Operation extract returns any substring of the text T. Operation find returns the list of all occurrences of a given pattern P in T. For instance, Grossi, Gupta, and Vitter [9] gave a compressed text index based on compressed suffix arrays, which takes n H_k(T) + o(n log σ) bits of space and supports O(m log σ + polylog(n)) pattern matching time, where H_k(T) is the k-th order empirical entropy of T and m is the length of the pattern P.
Compression measures for highly repetitive text: Recently, there has been an increasing interest in indexed searches for highly repetitive text collections. Typically, the compressed size of such a text can be described in terms of some measure of repetition. The following are examples of such repetitiveness measures for a text T:

the number g of rules in a grammar (SLP) representing T,

the number z of phrases in the LZ77 parsing of T,

the number r of runs in the Burrows–Wheeler transform of T, and

the number e of right- and left-extensions of maximal repeats of T.
Belazzougui et al. [1] observed a close relationship among these measures. Specifically, the authors empirically observed that all of them show similar logarithmic growth behavior in n on a real biological sequence, and also theoretically showed that both r and z are upper bounded by the number e_R of right-extensions of maximal repeats. Highly repetitive texts are formed from many nearly identical repeated fragments. Therefore, one can expect that compressed indexes based on measures such as g, z, r, and e capture the redundancy inherent in highly repetitive texts more effectively than conventional entropy-based compressed indexes [14].
Repetition-aware indexes: There has been extensive research on a family of repetition-aware indexes [1, 4, 10, 11] since the seminal work by Claude and Navarro [4]. They proposed the first compressed self-index based on grammars, which takes O(g log n) bits supporting O(m(m + h) log g) pattern match time, where g and h are respectively the size and height of the grammar. Kreft and Navarro [10] gave the first compressed self-index based on LZ77, which takes O(z log n) bits supporting O(m^2 h + (m + occ) log z) pattern match time, where h here is the height of the LZ parsing. Mäkinen, Navarro, Sirén, and Välimäki [11] gave a compressed index based on RLBWT, which takes O(r log n) bits supporting O((m + occ) t_BSD) pattern match time, where t_BSD is the query time of a binary searchable dictionary, which is O(log log n), for example [11].
Previous approaches: Considering the above results, we notice that, in terms of compression ratio, all the indexes above achieve good performance depending on the repetitiveness measures, while in terms of operation time, most of them except the RLBWT-based one [11] have a quadratic dependency on the pattern size m. Hence, a challenge here is to develop repetition-aware text indexes that achieve good compression ratios for highly repetitive texts in terms of repetitiveness measures, while supporting faster extract and find operations. Belazzougui et al. [1] proposed a repetition-aware index which combines CDAWGs [3, 7] and the run-length encoded BWT [11], to which we refer as RLBWT-CDAWGs. For a given text T of length n and a pattern P of length m, their index uses O(e_R log n) bits of space and supports the find operation in O((m + occ) log log n) time.
Main results: In this paper, we propose a new repetition-aware index based on a combination of CDAWGs and grammar-based compression, called the Linear-size CDAWG (LCDAWG, for short). The LCDAWG of a text T of length n is a self-index for T which can be stored in O(e log n) bits of space, and supports O(log n)-time random access to the text, O(1)-time sequential character access from the beginning of each edge label, and O(m log σ + occ)-time pattern matching. For constant alphabets, our LCDAWGs use O(e_R log n) bits of space and support pattern matching in O(m + occ) time, hence improving the pattern matching time of Belazzougui et al.'s RLBWT-CDAWGs by a factor of log log n. We note that RLBWT-CDAWGs use hashing to retrieve the first character of a given edge label, and hence seem to require O((m + occ) log log n) time for pattern matching even for constant alphabets.
From the context of studies on suffix indices, our LCDAWGs can be seen as a successor of the linear-size suffix trie (LS-trie) by Crochemore et al. [5]. The LS-trie is a variant of the suffix tree [6] which need not keep the original text, thanks to an elegant scheme of linear-time decoding using suffix links and a set of auxiliary nodes. However, it is a challenge to generalize their result to the CDAWG, because the paths between a given pair of endpoints are not unique. By combining the idea of LS-tries with SLP-based compression with direct access [8, 2], we successfully devise a text index of O(e log n) bits by improving the functionalities of LS-tries. As a byproduct, our result gives a way of constructing an SLP of size O(e) for a text T. Moreover, since the LCDAWG of T retains the topology of the original CDAWG for T, the LCDAWG is a compact representation of all maximal repeats [15] that appear in T.
2 Preliminaries
In this section, we give the notations and definitions used in the following sections. In addition, we recall string data structures such as suffix tries, suffix trees, CDAWGs, linear-size suffix tries, and straight-line programs, which are the data structures considered in this paper.
2.1 Basic definitions and notations
Strings: Let Σ be a general ordered alphabet of size σ. An element w of Σ* is called a string, where |w| denotes its length. We denote the empty string by ε, which is the string of length 0, namely |ε| = 0. Let T ∈ Σ* be a string. If T = xyz, then x, y, and z are called a prefix, a substring, and a suffix of T, respectively. Let T be any string of length n. For any 1 ≤ i ≤ j ≤ n, let T[i..j] denote the substring of T that begins and ends at positions i and j in T, and let T[i] denote the i-th character of T. For any string w, we denote by w^R the reversed string of w, i.e., w^R = w[|w|] ⋯ w[1]. Let Suffix(T) denote the set of suffixes of T. For a string w, the number of occurrences of w in T means the number of positions i in T such that w = T[i..i+|w|−1].
Maximal repeats and other measures of repetition: A substring w of T is called a repeat if the number of occurrences of w in T is more than one. A right-extension (resp. a left-extension) of a repeat w of T is any substring of T of the form wb (resp. aw) for some character b ∈ Σ (resp. a ∈ Σ). A repeat w of T is a maximal repeat if every left- and right-extension of w occurs strictly fewer times in T than w. In what follows, we denote by M, e_R, e_L, and e the numbers of maximal repeats, right-extensions, left-extensions, and all extensions of maximal repeats appearing in T, respectively, so that e = e_L + e_R. Recently, it has been shown in [1] that the number e_R is an upper bound on the number r of runs in the Burrows–Wheeler transform of T and on the number z of factors in the Lempel–Ziv parsing of T. It is also known that e_R ≤ 2n − 2 and e_L ≤ 2n − 2, where n = |T| [3, 15].
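These definitions can be checked by brute force on short strings (a quadratic-space sketch for illustration only; the index structures discussed below derive maximal repeats from the CDAWG instead):

```python
from collections import defaultdict

def maximal_repeats(t):
    """Enumerate the maximal repeats of t by brute force.

    A substring w is a repeat if it occurs in t more than once; a repeat
    is maximal if every left-extension aw and every right-extension wb
    occurs strictly fewer times than w itself.
    """
    n = len(t)
    count = defaultdict(int)
    for i in range(n):                      # count every substring occurrence
        for j in range(i + 1, n + 1):
            count[t[i:j]] += 1
    alphabet = set(t)
    maximal = set()
    for w, c in count.items():
        if c < 2:
            continue                        # not a repeat
        left_max = all(count.get(a + w, 0) < c for a in alphabet)
        right_max = all(count.get(w + b, 0) < c for b in alphabet)
        if left_max and right_max:
            maximal.add(w)
    return maximal
```

For instance, the maximal repeats of "abracadabra" are exactly "a" and "abra": "abr" is not maximal because its right-extension "abra" occurs just as often.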
Notations on graphical indexes: All index structures dealt with in this paper, such as suffix tries, suffix trees, CDAWGs, linear-size suffix tries (LS-tries), and linear-size CDAWGs (LCDAWGs), are graphical indexes in the sense that an index is a pointer-based structure built on an underlying DAG with a root and a mapping that assigns a label string to each edge. For an edge e = (u, v), we denote its end points by orig(e) = u and dest(e) = v, respectively. The label string of e is label(e). The string length of e is |label(e)|. An edge e is called atomic if |label(e)| = 1, and thus its label is a single character. For a path p = e_1 ⋯ e_k of length k, we extend its end points, label string, and string length by orig(p) = orig(e_1), dest(p) = dest(e_k), label(p) = label(e_1) ⋯ label(e_k), and |label(p)|, respectively.
2.2 Suffix tries and suffix trees
The suffix trie [6] for a text T of length n, denoted STrie(T), is a trie which represents Suffix(T). The size of STrie(T) is O(n^2). The path label of a node v, denoted str(v), is the string formed by concatenating the edge labels on the unique path from the root to v. If str(v) = w, we denote v by w. We may identify a node with its label if it is clear from context. A substring w of T is said to be branching if there exist two distinct characters a, b ∈ Σ such that both wa and wb are substrings of T. For any a ∈ Σ and w ∈ Σ*, we define the suffix link of node aw by slink(aw) = w if the node w is defined.
The suffix tree [17, 6] for a text T, denoted STree(T), is a compacted trie which also represents Suffix(T). STree(T) can be obtained by compacting every path of STrie(T) which consists of non-branching internal nodes (see Fig. 1). Since every internal node of STree(T) is branching, and since there are at most n leaves in STree(T), the numbers of edges and nodes are O(n). The edges of STree(T) are labeled by non-empty substrings of T. By representing each edge label x with a pair (i, j) of integers such that x = T[i..j], STree(T) can be stored in O(n log n) bits of space.
2.3 CDAWGs
The compact directed acyclic word graph [3, 6] for a text T, denoted CDAWG(T), is the minimal compact automaton which represents Suffix(T). CDAWG(T) can be obtained from STree(T$) by merging isomorphic subtrees and deleting the associated endmarker $. Since CDAWG(T) is an edge-labeled DAG, we represent a directed edge from node u to node v with label string x by a triple (u, x, v). For any node of CDAWG(T), the label strings of its outgoing edges start with mutually distinct characters.
Formally, CDAWG(T) is defined as follows. For any strings x and y, we write x ≡_L y (resp. x ≡_R y) iff the sets of beginning positions (resp. ending positions) of x and y in T are equal. Let [x]_L (resp. [x]_R) denote the equivalence class of x w.r.t. ≡_L (resp. ≡_R). All strings that are not substrings of T form a single equivalence class, and in the sequel we will consider only the substrings of T. Let long([x]_L) (resp. long([x]_R)) denote the longest member of the equivalence class [x]_L (resp. [x]_R). Notice that each member of [x]_L (resp. [x]_R) is a prefix of long([x]_L) (resp. a suffix of long([x]_R)). We write x ≡ y iff long([x]_L) ≡_R long([y]_L), and let [x] denote the equivalence class of x w.r.t. ≡. The longest member of [x] is long([long([x]_L)]_R), and we will also denote it by long([x]). We define CDAWG(T) as the edge-labeled DAG with node set V = { [x] : x is a substring of T } and edge set E = { ([x], aβ, [xaβ]) : a ∈ Σ, β ∈ Σ*, x = long([x]), xaβ = long([xa]_L) }. The operator long([·]_L) corresponds to compacting non-branching edges (as in the conversion from STrie(T) to STree(T)), and the operator [·]_R corresponds to merging isomorphic subtrees of STrie(T). For simplicity, we abuse notation so that when we refer to a node of CDAWG(T) as w, this implies that w = long([x]) for the corresponding class [x].
Let [x] be any node of CDAWG(T), and consider the suffixes of long([x]) which correspond to the suffix tree nodes that are merged into [x] when STree(T) is transformed into the CDAWG. We define the suffix link of node [x] by slink([x]) = [y] iff y is the longest suffix of long([x]) that does not belong to [x].
It is shown that all nodes of CDAWG(T) except the sink correspond to the maximal repeats of T: actually, long([x]) is a maximal repeat in T for every non-sink node [x] [15]. Following this fact, one can easily see that the numbers of edges of CDAWG(T) and CDAWG(T^R) coincide with the numbers e_R and e_L of right- and left-extensions of maximal repeats of T, respectively [1, 15].
By representing each edge label x with a pair (i, j) of integers such that x = T[i..j], CDAWG(T) can be stored in O(e_R log n) bits of space.
2.4 LS-tries
Recently, Crochemore et al. [5] proposed a compact variant of the suffix trie, called the linear-size suffix trie (or LS-trie, for short), denoted LSTrie(T). It is a compacted tree with topology and size similar to STree(T), but with no indirect references to the text (see Fig. 2). LSTrie(T) is obtained from STree(T) by adding all nodes of STrie(T) whose suffix links point to nodes that also appear in STree(T). Unlike STree(T), each edge of LSTrie(T) stores only the first character and the length of the corresponding suffix tree edge label (see Fig. 2). Using auxiliary links called jump pointers, the following theorem is proved.
Proposition 1 (Crochemore et al. [5]).
For a text T of length n, the linear-size suffix trie for T can be stored in O(n log n) bits of space supporting reconstruction of the label of a given edge in O(ℓ) time, where ℓ is the length of the edge label.
Crochemore et al.'s method [5] does not control the order in which the characters of an edge label are decoded. This implies that LSTrie(T) needs O(ℓ) worst-case time to read any prefix of an edge label of length ℓ. This may cause trouble in some applications, including pattern matching. In particular, it does not seem straightforward to match a pattern against a prefix of length c of the label of an edge in O(c) time when c < ℓ. We will solve these problems in Section 3.
2.5 Straight-line programs
A straight-line program (SLP) is a context-free grammar (CFG) in the Chomsky normal form generating a single string. SLPs are often used in grammar compression algorithms [14].
Consider an SLP with g variables X_1, …, X_g. Each production rule is either of the form X_i → a with a ∈ Σ, or X_i → X_l X_r with l, r < i, so that the derivation has no loops. Thus an SLP produces a single string. The phrase of each variable X_i, denoted val(X_i), is the string that X_i produces. The string defined by the SLP is val(X_g). We will use the following results.
Proposition 2 (Gasieniec et al. [8]).
For an SLP of size g for a text of length n, there exists a data structure of O(g log n) bits of space which supports expansion of a prefix of val(X) for any variable X in O(1) time per character, and can be constructed in O(g) time.
Proposition 3 (Bille et al. [2]).
For an SLP of size g representing a text of length n, there exists a data structure of O(g log n) bits of space which supports access to ℓ consecutive characters at an arbitrary position of val(X) for any variable X in O(log n + ℓ) time, and can be constructed in O(g) time.
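The mechanism behind both propositions can be illustrated with a plain SLP stored as a mapping from variables to rules. The rule encoding below, and the O(height)-time access, are simplifications for illustration only, not the actual O(1)-per-character and O(log n) structures of [8, 2]:

```python
def slp_expand(rules, start):
    """Fully expand an SLP into the string val(start).

    rules maps each variable to a terminal character (rule X -> a) or to
    a pair of variables (rule X -> Y Z); an explicit stack avoids deep
    recursion on tall grammars."""
    out, stack = [], [start]
    while stack:
        v = stack.pop()
        r = rules[v]
        if isinstance(r, str):
            out.append(r)
        else:
            stack.append(r[1])      # right child is expanded after the left
            stack.append(r[0])
    return "".join(out)

def phrase_lengths(rules):
    """Compute |val(X)| for every variable X (memoized recursion)."""
    lengths = {}
    def length(v):
        if v not in lengths:
            r = rules[v]
            lengths[v] = 1 if isinstance(r, str) else length(r[0]) + length(r[1])
        return lengths[v]
    for v in rules:
        length(v)
    return lengths

def slp_access(rules, lengths, v, i):
    """Return val(v)[i] (0-based) by descending the derivation tree of v,
    steering left or right with the precomputed phrase lengths; this takes
    time proportional to the grammar height."""
    while True:
        r = rules[v]
        if isinstance(r, str):
            return r
        left, right = r
        if i < lengths[left]:
            v = left
        else:
            i -= lengths[left]
            v = right
```

For example, the rules X1 → a, X2 → b, X3 → X1 X2, X4 → X3 X3 derive the string abab, and slp_access retrieves any single position without full expansion.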
3 The proposed data structure: LCDAWG
In this section, we present the Linear-size CDAWG (LCDAWG, for short). The LCDAWG can support the CDAWG operations in the same time complexity without holding the original input text, and reduces the space complexity from O(e_R log n) bits plus the n log σ bits of the text to O(e log n) bits of space, where e is the number of all extensions of maximal repeats. From now on, we assume that the input text T terminates with a unique character $ which appears nowhere else in T.
3.1 Outline
The Linear-size CDAWG for a text T of length n, denoted LCDAWG(T), is a DAG whose edges are labeled with single characters. LCDAWG(T) can be obtained from CDAWG(T) by the following modifications. From now on, we refer to the original nodes appearing in CDAWG(T) as type-1 nodes, which are always branching except for the sink.

First, we add new non-branching nodes, called type-2 nodes, to CDAWG(T). Let w = long([x]) for any type-1 node [x] of CDAWG(T). If aw (a ∈ Σ) is a substring of T but the path spelling out aw ends in the middle of an edge, then we introduce a type-2 node representing aw. We add the suffix link from this new node to [x] as well. Adding type-2 nodes splits an edge into shorter ones. Note that more than one type-2 node can be inserted into a single edge of CDAWG(T).

Let e = (u, x, v) be any edge after all the type-2 nodes are inserted, where x ∈ Σ+. We represent this edge by (u, c, v), where c = x[1] is the first character of the original label x. We also store the original label length |x|.

We will augment LCDAWG(T) with a set of SLP production rules whose non-terminals correspond to the edges of LCDAWG(T). The definition and construction of this SLP will be described later in Section 3.3.
If the non-branching type-2 nodes are ignored, then the topology of LCDAWG(T) is the same as that of CDAWG(T). For ease of explanation, we denote by label(e) the original label of an edge e. Namely, for any edge e = (u, c, v) of LCDAWG(T), label(e) = x iff (u, x, v) is the original edge for e.
The following lemma gives an upper bound on the numbers of nodes and edges in LCDAWG(T). Recall that M is the number of maximal repeats in T, e_L and e_R are respectively the numbers of left- and right-extensions of maximal repeats in T, and e = e_L + e_R.
Lemma 1.
For any string T, let G = LCDAWG(T). Then |V(G)| = O(M + e_L) and |E(G)| = O(e_L + e_R) = O(e).
Proof.
Let V1 and V2 be the sets of type-1 and type-2 nodes in G, respectively. It is known that |V1| = O(M) and that CDAWG(T) has e_R edges (see [3] and [15]). Clearly, V(G) = V1 ∪ V2 and |V(G)| = |V1| + |V2|. Let u ∈ V2 be any type-2 node and let aw be the string it represents, where a ∈ Σ and w = long(v) for some type-1 node v. Note that w is a maximal repeat of T. Since aw is a substring of T, clearly aw is a left-extension of w. By the definition of LCDAWG(T), distinct type-2 nodes represent distinct left-extensions. Hence |V2| ≤ e_L. This implies |V(G)| = O(M + e_L). Since each type-2 node is non-branching, inserting it splits one edge into two; hence |E(G)| ≤ e_R + |V2| ≤ e_R + e_L. ∎
Corollary 4.
For any string T over a constant alphabet, |V(G)| = O(e_R) and |E(G)| = O(e_R), where G = LCDAWG(T).
Proof.
It clearly holds that M ≤ e_R and e_L ≤ σM. Thus we have e_L ≤ σ e_R. The corollary follows from Lemma 1 when σ = O(1). ∎
3.2 Constructing type2 nodes and edge suffix links
Lemma 2.
Given CDAWG(T) for a text T, we can compute all type-2 nodes of LCDAWG(T) in O(e log σ) time.
Proof.
We create a copy D of CDAWG(T). For each edge e = (u, x, v) of CDAWG(T), we compute the node s = slink(u) and the path P that spells out x from s. The number of type-1 nodes in the interior of this path equals the number of type-2 nodes that need to be inserted on edge e, and hence we insert these nodes into D. After the above operation is done for all edges, D contains all type-2 nodes of LCDAWG(T). Since there always exists such a path P, to find it, it suffices to check the first characters of the outgoing edges along the path. Hence we need only O(log σ) time for each node in P. Overall, it takes O(e log σ) time. ∎
The above lemma also suggests the notion of the following edge suffix links in LCDAWG(T), which are virtual links and will not actually be created in the construction.
Definition 1 (Edge suffix links).
For any edge e = (u, c, v) of LCDAWG(T) with label(e) = x, eslink(e) is the path, namely the list of edges, from slink(u) to the node that is reached from slink(u) by scanning x.
Edge suffix links have the following properties.
Lemma 3.
For any edge e such that |label(e)| ≥ 2, let p = eslink(e) be its edge suffix link. Then (1) both orig(p) and dest(p) are type-1 nodes, and (2) all internal nodes of the path p are type-2 nodes.
Proof.
From the definition of edge suffix links, we have orig(p) = slink(u) for u = orig(e), and the path p from slink(u) to dest(p) spells out label(e). (1) By the definitions of type-2 nodes and edge suffix links, orig(p) is always of type 1. Hence it suffices to show that dest(p) is of type 1. There are two cases: (a) If dest(e) is a type-2 node, then by the definition of type-2 nodes, dest(p) must be the node pointed to by slink(dest(e)). Therefore, dest(p) is a type-1 node. (b) If dest(e) is a type-1 node, then let aw (a ∈ Σ, w ∈ Σ*) be the shortest string represented by dest(e). Then the string w is spelled out by the path from the source that ends at dest(p). Since dest(e) is of type 1, its suffix-link target, which represents w, is a node of CDAWG(T); hence dest(p) is also of type 1. (2) If there were a type-1 node in the interior of the path p, then by the construction of type-2 nodes there would have to be a (type-1 or type-2) node between orig(e) and dest(e) in the middle of edge e, a contradiction. ∎
Lemma 3 says that the label of any edge e with |label(e)| ≥ 2 can be represented by the path eslink(e). In addition, since the path includes type-1 nodes only at its end points and since type-2 nodes are non-branching, eslink(e) is uniquely determined by the pair (slink(orig(e)), label(e)[1]). We can compute all the edges of eslink(e) in O(log σ + k) time per query, where k is the number of edges of eslink(e), as follows. Firstly, we compute slink(orig(e)) and then select its outgoing edge starting with the character label(e)[1] in O(log σ) time. Next, we blindly scan the downward path from there while the lower end of the current edge is of type 2. This scanning terminates when we reach an edge whose lower end is of type 1.
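The blind scan at the end of this argument can be sketched over a simple adjacency-map encoding of the graph. The encoding is an assumption for illustration: out_edges maps a node to a dictionary {first character: lower end}, and node_type marks each node as type 1 or type 2.

```python
def edge_suffix_path(out_edges, node_type, start, first_char):
    """Collect the path of edges from `start` that begins with first_char
    and keeps descending while the lower end of the current edge is a
    non-branching type-2 node; stops at the first type-1 lower end."""
    path = []
    node, ch = start, first_char
    while True:
        child = out_edges[node][ch]
        path.append((node, ch, child))
        if node_type[child] == 1:
            return path
        # a type-2 node has exactly one outgoing edge, so the next
        # character is forced; unpacking the single dict key reflects that
        (ch,) = out_edges[child]
        node = child
```

Because type-2 nodes are non-branching, the single-key unpacking in the loop is well defined, which is exactly why the pair of a start node and a first character determines the whole path.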
3.3 Construction of the SLP for LCDAWG
We give an SLP of size O(e) which represents T and all edge labels of LCDAWG(T), based on the jump links defined below.
Jumping from an edge to a path: First, we define jump links, by which we can jump from a given edge e with |label(e)| ≥ 2 to a path consisting of at least two edges and having the same label string. Although our jump links are based on those of LS-tries [5], we need a new definition, since a path in CDAWG(T) (and hence in LCDAWG(T)) cannot be uniquely determined by a pair of nodes, unlike in STree(T) (or LSTrie(T)).
Definition 2 (Jump links).
For an edge e with |label(e)| ≥ 2 and eslink(e) = (e_1, …, e_k), jump(e) is recursively defined as follows:

jump(e) = eslink(e) if k ≥ 2 (thus the path consists of at least two edges), and

jump(e) = jump(e_1) if k = 1.
Note that label(jump(e)) equals label(e) for any edge e with |label(e)| ≥ 2.
Lemma 4.
For any edge e with |label(e)| ≥ 2, jump(e) consists of at least two edges.
Proof.
Assume on the contrary that jump(e) = (e') for some edge e'. This implies |label(e')| = |label(e)| ≥ 2. By the definition of jump links, the string represented by orig(e') is a proper suffix of the string represented by orig(e), and repeated applications of eslink eventually reach the source. For any character a which appears in T, there is a (type-1 or type-2) node which represents a as a child of the source of LCDAWG(T). This implies that every outgoing edge of the source has string length 1, and hence the path from the source spelling out label(e') consists of at least two edges. This contradicts that jump(e) contains a single edge e' with |label(e')| ≥ 2. ∎
Theorem 5.
For a given CDAWG(T), there is an algorithm that computes all jump links in O(e log σ) time.
Proof.
We explain how to obtain jump(e) for an edge e with |label(e)| ≥ 2. For every such edge e, we manage a pointer first(e) to the first edge of jump(e). We initially set first(e) = nil for all e. For each edge e, let e' be the outgoing edge of slink(orig(e)) with the same first label character as e. We check whether |label(e')| = |label(e)| and, if so, we recursively compute first(e') and then set first(e) = first(e'); otherwise we set first(e) = e'. In this way, all first(e) can be computed in O(e log σ) time in total, where the log σ factor is needed for selecting the outgoing edge. From Lemma 3, since there is no branching node in the interior of each jump path, jump(e) can easily be obtained from first(e) by traversing the path until a type-1 node is encountered. ∎
An SLP for the LCDAWG: We build an SLP which represents all edge labels in LCDAWG(T), based on jump links. For each edge e, let X_e denote the variable which generates the string label(e). For any e with |label(e)| = 1, we construct a production X_e → c, where c is the label character. For any e with |label(e)| ≥ 2, let jump(e) = (e_1, …, e_k). We construct the productions Y_1 → X_{e_1} X_{e_2}, Y_2 → Y_1 X_{e_3}, …, Y_{k−2} → Y_{k−3} X_{e_{k−1}}, and X_e → Y_{k−2} X_{e_k}. We call a production whose left-hand side is some Y_i an intermediate production. It is clear that X_e generates label(e), and we have introduced k − 1 productions. If there is another edge e' (e' ≠ e) such that jump(e') = jump(e), then we construct only the new production X_{e'} → Y_{k−2} X_{e_k} and reuse the other productions. Let p be the path from the source to the sink that spells out the text T. We create the productions which generate T using the same technique as above for this path p. Overall, the total number of intermediate productions is linear in the number of type-2 nodes in LCDAWG(T). Since there are |E| non-intermediate productions, this SLP consists of O(e) productions.
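The chaining of intermediate productions over one jump path can be sketched as follows. The counter-based variable naming is an assumption for illustration, and the sharing of productions between edges with equal jump paths is omitted:

```python
import itertools

def chain_path(rules, edge_vars):
    """Add intermediate binary productions so that one variable derives the
    concatenation of the given edge variables, and return that variable.

    edge_vars is the list of SLP variables X_{e_1}, ..., X_{e_k} for the
    edges of a jump path (k >= 2); k - 1 new productions are created,
    each combining the prefix built so far with the next edge variable."""
    counter = itertools.count(len(rules))
    acc = edge_vars[0]
    for v in edge_vars[1:]:
        new = "Y%d" % next(counter)
        rules[new] = (acc, v)   # intermediate production Y -> (prefix, next edge)
        acc = new
    return acc
```

Starting the counter at len(rules) keeps the generated names distinct across successive calls on the same rule set.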
Now, we have the main result of this subsection.
Theorem 6.
For a given CDAWG(T), there is an algorithm that constructs an SLP which represents all the edge labels of LCDAWG(T) in O(e log σ) time.
Proof.
By the above algorithm, once the jump links are computed, we can obtain an SLP which represents all the edge labels in O(e) time. From Theorem 5, we can compute all jump links in O(e log σ) time. Overall, the total time of this algorithm is O(e log σ). ∎
Fig. 2 shows LSTrie(T) and LCDAWG(T) enhanced with the SLP for an example string T.
We associate with each edge label the corresponding variable of the SLP. By applying the algorithms of Gasieniec et al. [8] (Proposition 2) and Bille et al. [2] (Proposition 3), we can show the following theorems.
Theorem 7.
For a text T, LCDAWG(T) can support pattern matching for a pattern P of length m in O(m log σ + occ) time.
Proof.
We search for P by traversing LCDAWG(T) from the source. At each type-1 node, we select, in O(log σ) time, the outgoing edge whose first character equals the next character of P. Along an edge, we compare P against the edge label by expanding a prefix of the associated SLP variable in O(1) time per character (Proposition 2). Hence locating the locus of P takes O(m log σ) time, after which the occ occurrences of P are reported by traversing the part of the graph below the locus. ∎
Theorem 8.
For a text T of length n, LCDAWG(T) has an SLP that derives T. In addition, any substring of T of length ℓ can be read in O(log n + ℓ) time.
Proof.
The text T of LCDAWG(T) is represented by the path from the source to the sink that spells out T. Remembering the variable which generates T makes it possible to read any position of T by using Proposition 3. ∎
3.4 The main result
It is known that for a given string T of length n over an integer alphabet of size n^{O(1)}, CDAWG(T) can be constructed in O(n) time [12]. Combining this with the preceding discussions, we obtain the main result of this paper.
Theorem 9.
For a text T of length n, LCDAWG(T) supports pattern matching in O(m log σ + occ) time for a given pattern of length m and substring extraction in O(log n + ℓ) time for any substring of length ℓ, and can be stored in O(e log n) bits of space (or O(e) words of space). If CDAWG(T) is already constructed, then LCDAWG(T) can be constructed in O(e log σ) total time. If T is given as input, then LCDAWG(T) can be constructed in O(n + e log σ) total time for integer alphabets of size n^{O(1)}. After LCDAWG(T) has been constructed, the input string T can be discarded.
4 Conclusions and further work
In this paper, we presented a new repetition-aware data structure called the Linear-size CDAWG. LCDAWG(T) takes space linear in the number e of the left- and right-extensions of the maximal repeats in T, which is known to be small for highly repetitive strings. The key idea is to introduce type-2 nodes, following the LS-tries proposed by Crochemore et al. [5]. Using a small SLP induced from the edge suffix links, enhanced with random access and prefix extraction data structures, our LCDAWG(T) supports efficient pattern matching and substring extraction. This SLP is repetition-aware, i.e., its size is linear in the number of left- and right-extensions of the maximal repeats in T. We also showed how to construct LCDAWG(T) efficiently.
Our future work includes implementation of LCDAWGs and evaluation of their practical efficiency, compared with previous compressed indexes for repetitive texts. An interesting open question is whether we can efficiently construct LCDAWG(T) in an online manner for a growing text.
References
 [1] D. Belazzougui, F. Cunial, T. Gagie, N. Prezza, and M. Raffinot. Composite repetition-aware data structures. In Combinatorial Pattern Matching, pages 26–39. Springer, 2015.
 [2] P. Bille, G. M. Landau, R. Raman, K. Sadakane, S. R. Satti, and O. Weimann. Random access to grammar-compressed strings and trees. SIAM Journal on Computing, 44(3):513–539, 2015.
 [3] A. Blumer, J. Blumer, D. Haussler, R. McConnell, and A. Ehrenfeucht. Complete inverted files for efficient text retrieval and analysis. Journal of the ACM (JACM), 34(3):578–595, 1987.
 [4] F. Claude and G. Navarro. Self-indexed grammar-based compression. Fundamenta Informaticae, 111(3):313–337, 2011.
 [5] M. Crochemore, C. Epifanio, R. Grossi, and F. Mignosi. Linear-size suffix tries. Theoretical Computer Science, 638:171–178, 2016.
 [6] M. Crochemore and W. Rytter. Jewels of stringology: text algorithms. World Scientific, 2003.
 [7] M. Crochemore and R. Vérin. Direct construction of compact directed acyclic word graphs. In Combinatorial Pattern Matching, pages 116–129. Springer, 1997.
 [8] L. Gasieniec, R. M. Kolpakov, I. Potapov, and P. Sant. Real-time traversal in grammar-based compressed files. In Data Compression Conference, page 458, 2005.
 [9] R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In ACM-SIAM Symposium on Discrete Algorithms, pages 841–850. Society for Industrial and Applied Mathematics, 2003.
 [10] S. Kreft and G. Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, 483:115–133, 2013.
 [11] V. Mäkinen, G. Navarro, J. Sirén, and N. Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281–308, 2010.
 [12] K. Narisawa, S. Inenaga, H. Bannai, and M. Takeda. Efficient computation of substring equivalence classes with suffix arrays. Algorithmica, 2016.
 [13] G. Navarro. A self-index on block trees. arXiv preprint arXiv:1606.06617, 2016.
 [14] G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Computing Surveys (CSUR), 39(1):2, 2007.
 [15] M. Raffinot. On maximal repeats in strings. Information Processing Letters, 80(3):165–169, 2001.
 [16] Y. Takabatake, Y. Tabei, and H. Sakamoto. Improved ESP-index: a practical self-index for highly repetitive texts. In International Symposium on Experimental Algorithms, pages 338–350. Springer, 2014.
 [17] P. Weiner. Linear pattern-matching algorithms. In IEEE Annual Symposium on Switching and Automata Theory, pages 1–11, 1973.