Linear-size CDAWG: new repetition-aware indexing and grammar compression
In this paper, we propose a novel approach to combine compact directed acyclic word graphs (CDAWGs) and grammar-based compression. This leads us to an efficient self-index, called Linear-size CDAWGs (L-CDAWGs), which can be represented with bits of space allowing for -time random and -time sequential accesses to edge labels, and -time pattern matching. Here, is the number of all extensions of maximal repeats in , and are respectively the lengths of the text and a given pattern, is the alphabet size, and is the number of occurrences of the pattern in . The repetitiveness measure is known to be much smaller than the text length for highly repetitive text. For constant alphabets, our L-CDAWGs achieve pattern matching time with bits of space, which improves the pattern matching time of Belazzougui et al.’s run-length BWT-CDAWGs by a factor of , with the same space complexity. Here, is the number of right extensions of maximal repeats in . As a byproduct, our result gives a way of constructing a straight-line program (SLP) of size for a given text in time.
Background: Text indexing is a fundamental problem in theoretical computer science, where the task is to preprocess a given text so that subsequent pattern matching queries can be answered quickly. It has wide applications such as information retrieval, bioinformatics, and big data analytics [14, 10]. There has been a lot of recent research on compressed text indexes [4, 10, 14, 1, 13, 11, 16, 9] that store a text supporting extract and find operations in space significantly smaller than the total size of the text. Operation extract returns any substring of the text. Operation find returns the list of all occurrences of a given pattern in . For instance, Grossi, Gupta, and Vitter [9] gave a compressed text index based on compressed suffix arrays, which takes bits of space and supports pattern match time, where is the -th order entropy of and is the length of the pattern .
Compression measures for highly repetitive text: Recently, there has been an increasing interest in indexed searches for highly repetitive text collections. Typically, the compression size of such a text can be described in terms of some measure of repetition. The following are examples of such repetitiveness measures for :
the number of rules in a grammar (SLP) representing ,
the number of phrases in the LZ77 parsing of ,
the number of runs in the Burrows-Wheeler transform of , and
the number of right- and left-extensions of maximal repeats of .
Belazzougui et al. [1] observed a close relationship among these measures. Specifically, the authors empirically observed that all of them showed similar logarithmic growth behavior in on a real biological sequence, and also theoretically showed that both and are upper bounded by . Such repetitive texts are formed from many nearly identical repeated fragments. Therefore, one can expect that compressed indexes based on these measures such as , and can capture the redundancy inherent to these highly repetitive texts more effectively than conventional entropy-based compressed indexes .
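To make two of the measures listed above concrete, the following minimal Python sketch counts BWT runs and LZ77 phrases by brute force. The function names, the trailing `$` sentinel, and the particular greedy LZ77 variant (self-overlapping sources allowed, no explicit trailing character) are assumptions made for illustration and are not taken from the paper; the naive running times are only suitable for tiny inputs.

```python
def bwt(text: str) -> str:
    """Naive Burrows-Wheeler transform via sorted rotations.

    Assumes `text` ends with a unique sentinel such as '$'.
    """
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(rot[-1] for rot in rotations)


def bwt_runs(text: str) -> int:
    """Number of maximal runs of equal characters in the BWT of `text`."""
    b = bwt(text)
    return sum(1 for i in range(len(b)) if i == 0 or b[i] != b[i - 1])


def lz77_phrases(text: str) -> int:
    """Number of phrases in a greedy LZ77-style parse (illustrative variant).

    Each phrase is the longest prefix of the remaining text that also
    starts at an earlier position, or a single fresh character.
    """
    n, i, count = len(text), 0, 0
    while i < n:
        best = 0
        for j in range(i):  # candidate earlier starting positions
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            best = max(best, length)
        i += max(best, 1)
        count += 1
    return count


example = "abababab$"
print(bwt(example), bwt_runs(example), lz77_phrases(example))
```

On the periodic text `abababab$`, both counts stay well below the text length of 9, which is the kind of behavior the repetitiveness measures are meant to capture.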
Repetition-aware indexes: There has been extensive research on a family of repetition-aware indexes [1, 4, 10, 11] since the seminal work by Claude and Navarro [4]. They proposed the first compressed self-index based on grammars, which takes bits supporting pattern match time, where and are respectively the size and height of a grammar. Kreft and Navarro [10] gave the first compressed self-index based on LZ77, which takes bits supporting pattern match time. Here, is the height of the LZ parsing. Mäkinen, Navarro, Sirén, and Välimäki [11] gave a compressed index based on RLBWT, which takes bits supporting pattern match time, where is the time for a binary searchable dictionary which is and for example .
Previous approaches: Considering the above results, we notice that in terms of compression ratio, all indexes above achieve good performance depending on the repetitiveness measures, while in terms of operation time, most of them except the RLBWT-based one [11] have quadratic dependency on the pattern size . Hence, a challenge here is to develop repetition-aware text indexes that achieve good compression ratio for highly repetitive texts in terms of repetition measures, while supporting faster extract and find operations. Belazzougui et al. [1] proposed a repetition-aware index which combines CDAWGs [3, 7] and the run-length encoded BWT , to which we refer as RLBWT-CDAWGs. For a given text of length and a pattern of length , their index uses bits of space and supports the find operation in time.
Main results: In this paper, we propose a new repetition-aware index based on a combination of CDAWGs and grammar-based compression, called the Linear-size CDAWG (L-CDAWG, for short). The L-CDAWG of a text of length is a self-index for which can be stored in bits of space, and supports -time random access to the text, -time sequential character access from the beginning of each edge label, and -time pattern matching. For constant alphabets, our L-CDAWGs use bits of space and support pattern matching in time, hence improving the pattern matching time of Belazzougui et al.’s RLBWT-CDAWGs by a factor of . We note that RLBWT-CDAWGs use hashing to retrieve the first character of a given edge label, and hence RLBWT-CDAWGs seem to require time for pattern matching even for constant alphabets.
From the context of studies on suffix indices, our L-CDAWGs can be seen as a successor of the linear-size suffix tries (LSTries) by Crochemore et al. [5]. The LSTrie is a variant of the suffix tree , which need not keep the original text thanks to an elegant scheme of linear-time decoding using suffix links and a set of auxiliary nodes. However, it is a challenge to generalize their result to the CDAWG because the paths between a given pair of endpoints are not unique. By combining the idea of LSTries with SLP-based compression with direct access [8, 2], we successfully devise a text index of bits by improving functionalities of LSTries. As a byproduct, our result gives a way of constructing an SLP of size bits of space for a text . Moreover, since the L-CDAWG of retains the topology of the original CDAWG for , the L-CDAWG is a compact representation of all maximal repeats  that appear in .
In this section, we give some notations and definitions to be used in the following sections. In addition, we recall string data structures such as suffix tries, suffix trees, CDAWGs, linear-size suffix tries and straight-line programs, which are the data structures to be considered in this paper.
2.1 Basic definitions and notations
Strings: Let be a general ordered alphabet of size . An element of is called a string, where denotes its length. We denote the empty string by which is the string of length , namely, . Let . If , then , , and are called a prefix, a substring, and a suffix of , respectively. Let be any string of length . For any , let denote the substring of that begins and ends at positions and in , and let denote the th character of . For any string , we denote by the reversed string of , i.e., . Let denote the set of suffixes of . For a string , the number of occurrences of in means the number of positions where is a substring in .
Maximal repeats and other measures of repetition: A substring of is called a repeat if the number of occurrences of in is more than one. A right extension (resp. a left extension) of of is any substring of with the form (resp. ) for some letter . A repeat of is a maximal repeat if both left- and right-extensions of occur strictly fewer times in than . In what follows, we denote by , , , and the numbers of maximal repeats, right-extensions, left-extensions, and all extensions of maximal repeats appearing in , respectively. Recently, it has been shown in [1] that the number is an upper bound on the number of runs in the Burrows-Wheeler transform for and the number of factors in the Lempel-Ziv parsing of . It is also known that and , where [3, 15].
Notations on graphical indexes: All index structures dealt with in this paper, such as suffix tries, suffix trees, CDAWGs, linear-size suffix tries (LSTries), and linear-size CDAWGs (L-CDAWGs), are graphical indexes in the sense that an index is a pointer-based structure built on an underlying DAG with a root and a mapping that assigns a label to each edge . For an edge , we denote its end points by and , respectively. The label string of is . The string length of is . An edge is called atomic if , and thus, . For a path of length , we extend its end points, label string, and string length by , , , and , respectively.
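The definitions of repeats, extensions, and maximal repeats can be checked on small inputs with a brute-force sketch such as the one below. The function names are hypothetical, the empty string is deliberately ignored (whether it counts as a maximal repeat is a matter of convention that this sketch does not settle), and the cubic-or-worse running time makes it a teaching aid only.

```python
def occurrences(text: str, w: str) -> int:
    """Number of positions of `text` at which `w` occurs (overlaps allowed)."""
    return sum(1 for i in range(len(text) - len(w) + 1) if text.startswith(w, i))


def maximal_repeats(text: str) -> dict:
    """Map each non-empty maximal repeat to its (left, right) extensions.

    A repeat w is kept only if every left extension a+w and every right
    extension w+a that occurs in `text` occurs strictly fewer times than w.
    """
    alphabet = sorted(set(text))
    substrings = {
        text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)
    }
    result = {}
    for w in sorted(substrings):
        k = occurrences(text, w)
        if k < 2:  # not a repeat
            continue
        left = [a + w for a in alphabet if occurrences(text, a + w) > 0]
        right = [w + a for a in alphabet if occurrences(text, w + a) > 0]
        if all(occurrences(text, x) < k for x in left + right):
            result[w] = (left, right)
    return result


print(maximal_repeats("abab$"))
```

For `abab$`, the only non-empty maximal repeat is `ab`: both `a` and `b` are repeats, but each has an extension (`ab`) that occurs just as often.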
2.2 Suffix tries and suffix trees
The suffix trie  for a text of length , denoted , is a trie which represents . The size of is . The path label of a node is the string formed by concatenating the edge labels on the unique path from the root to . If , we denote by . We may identify with its label if it is clear from context. A substring of is said to be branching if there exist two distinct characters such that both and are substrings of . For any , , we define the suffix link of node by if is defined.
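As a small illustration of path labels and branching substrings, the following sketch builds a suffix trie naively as nested dictionaries. The representation and function names are assumptions chosen for brevity; the quadratic construction is only meant for toy inputs.

```python
def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of `text` into a trie of nested dicts."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root


def count_nodes(node: dict) -> int:
    """Number of trie nodes, including the root."""
    return 1 + sum(count_nodes(child) for child in node.values())


def branching_path_labels(node: dict, prefix: str = "") -> list:
    """Path labels of all nodes with at least two children."""
    labels = [prefix] if len(node) >= 2 else []
    for ch in sorted(node):
        labels.extend(branching_path_labels(node[ch], prefix + ch))
    return labels


trie = build_suffix_trie("abab$")
print(count_nodes(trie), branching_path_labels(trie))
```

Each trie node corresponds to one distinct substring (the root to the empty string), so for `abab$` the trie has 13 nodes, of which only the root, `ab`, and `b` are branching.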
The suffix tree [17, 6] for a text , denoted , is a compacted trie which also represents . can be obtained by compacting every path of which consists of non-branching internal nodes (see Fig. 1). Since every internal node of is branching, and since there are at most leaves in , the numbers of edges and nodes are . The edges of are labeled by non-empty substrings of . By representing each edge label with a pair of integers such that , can be stored in bits of space.
The compact directed acyclic word graph [3, 6] for a text , denoted , is the minimal compact automaton which represents . can be obtained from by merging isomorphic subtrees and deleting the associated endmarker . Since is an edge-labeled DAG, we represent a directed edge from node to with label string by a triple . For any node , the label strings of out-going edges from start with mutually distinct characters.
Formally, is defined as follows. For any strings , we denote (resp. ) iff the beginning positions (resp. ending positions) of and in are equal. Let (resp. ) denote the equivalence class of strings w.r.t. (resp. ). All strings that are not substrings of form a single equivalence class, and in the sequel we will consider only the substrings of . Let (resp. ) denote the longest member of the equivalence class (resp. ). Notice that each member of (resp. ) is a prefix of (resp. a suffix of ). Let . We denote iff , and let denote the equivalence class w.r.t. . The longest member of is and we will also denote it by . We define as an edge-labeled DAG such that and . The operator corresponds to compacting non-branching edges (like conversion from to ) and the operator corresponds to merging isomorphic subtrees of . For simplicity, we abuse notation so that when we refer to a node of as , this implies and .
Let be any node of and consider the suffixes of which correspond to the suffix tree nodes that are merged when transformed into the CDAWG. We define the suffix link of node by , iff is the longest suffix of that does not belong to .
It is shown that all nodes of except the sink correspond to the maximal repeats of . Actually, is a maximal repeat in . Following this fact, one can easily see that the numbers of edges of and coincide with the numbers and of right- and left- extensions of maximal repeats of , respectively [1, 15].
By representing each edge label with pairs of integers such that , can be stored in bits of space.
Recently, Crochemore et al. [5] proposed a compact variant of a suffix trie, called the linear-size suffix trie (or LSTrie, for short), denoted . It is a compacted tree with topology and size similar to , but has no indirect references to a text (see Fig. 2). is obtained from by adding all nodes such that their suffix links appear also in . Unlike , each edge of stores the first character and the length of the corresponding suffix tree edge label. Using auxiliary links called jump pointers, the following result is proved.
Proposition 1 (Crochemore et al. [5]).
For a text of length , the linear-size suffix trie for can be stored in bits of space supporting reconstruction of the label of a given edge in time, where is the length of the edge label.
Crochemore et al.’s method [5] does not regard the order of decoding characters on an edge label. This implies that needs worst case time to read any prefix of an edge label of length . This may cause trouble in some applications including pattern matching. In particular, it does not seem straightforward to match a pattern against a prefix of the label of an edge in time when . We solve these problems in Section 3.
2.5 Straight-line programs
A straight-line program (SLP) is a context-free grammar (CFG) in the Chomsky normal form generating a single string. SLPs are often used in grammar compression algorithms .
Consider an SLP with variables. Each production rule is either of form with or without loops. Thus an SLP produces a single string. The phrase of each , denoted , is the string that produces. The string defined by SLP is . We will use the following results.
Proposition 2 (Gasieniec et al. [8]).
For an SLP of size for a text of length , there exists a data structure of bits of space which supports expansion of a prefix of for any variable in time per character, and can be constructed in time.
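The kind of left-to-right prefix expansion that Proposition 2 refers to can be illustrated with a lazy generator over the derivation tree. This is only a sketch under stated assumptions: the dictionary encoding of rules (a character for a terminal rule, a pair of variables for a binary rule) and the names `iter_phrase` and `prefix` are hypothetical, and the code does not implement the data structure of the proposition; producing the next character here may cost time proportional to the grammar height.

```python
from itertools import islice


def iter_phrase(rules: dict, var: str):
    """Yield the characters of the phrase of `var` from left to right."""
    stack = [var]
    while stack:
        x = stack.pop()
        rhs = rules[x]
        if isinstance(rhs, str):  # terminal rule X -> a
            yield rhs
        else:  # binary rule X -> Y Z
            left, right = rhs
            stack.append(right)  # pushed first so that `left` is expanded first
            stack.append(left)


def prefix(rules: dict, var: str, k: int) -> str:
    """First k characters of the phrase of `var` (fewer if the phrase is shorter)."""
    return "".join(islice(iter_phrase(rules, var), k))


# A toy SLP deriving "ababab" (hypothetical example).
rules = {
    "X1": "a",
    "X2": "b",
    "X3": ("X1", "X2"),
    "X4": ("X3", "X3"),
    "X5": ("X4", "X3"),
}
print(prefix(rules, "X5", 4))
```

Because the generator is lazy, a caller that only needs a short prefix never expands the rest of the phrase, which is the access pattern pattern matching along an edge label needs.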
Proposition 3 (Bille et al. [2]).
For an SLP of size representing a text of length , there exists a data structure of bits of space which supports access to consecutive characters at an arbitrary position of for any variable in time, and can be constructed in time.
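For intuition about random access on an SLP, a much simpler scheme than the one in Proposition 3 is to store the phrase length of every variable and descend the derivation tree. The sketch below does exactly that; its access time is proportional to the grammar height rather than the bound of the proposition, and the rule encoding and the name `make_accessor` are assumptions for illustration.

```python
def make_accessor(rules: dict):
    """Return (length, char_at) helpers for an SLP given as a dict.

    A rule is either a single character (terminal) or a pair of variables.
    """
    lengths = {}

    def length(x: str) -> int:
        # Memoized phrase length of variable x.
        if x not in lengths:
            rhs = rules[x]
            if isinstance(rhs, str):
                lengths[x] = 1
            else:
                lengths[x] = length(rhs[0]) + length(rhs[1])
        return lengths[x]

    def char_at(x: str, i: int) -> str:
        # Walk down from x, going left or right according to phrase lengths.
        if not 0 <= i < length(x):
            raise IndexError(i)
        while True:
            rhs = rules[x]
            if isinstance(rhs, str):
                return rhs
            left, right = rhs
            if i < length(left):
                x = left
            else:
                i -= length(left)
                x = right

    return length, char_at


# Toy SLP deriving "ababab" (hypothetical example).
slp = {
    "X1": "a",
    "X2": "b",
    "X3": ("X1", "X2"),
    "X4": ("X3", "X3"),
    "X5": ("X4", "X3"),
}
length, char_at = make_accessor(slp)
print(length("X5"), "".join(char_at("X5", i) for i in range(length("X5"))))
```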
3 The proposed data structure: L-CDAWG
In this section, we present the Linear-size CDAWG (L-CDAWG, for short). The L-CDAWG can support CDAWG operations in the same time complexity without holding the original input text and can reduce the space complexity from bits of space to bits of space, where is the number of extensions of maximal repeats. From now on, we assume that an input text terminates with a unique character $ which appears nowhere else in .
The Linear-size CDAWG for a text of length , denoted , is a DAG whose edges are labeled with single characters. can be obtained from by the following modifications. From now on, we refer to the original nodes appearing in as type-1 nodes, which are always branching except the sink.
First, we add new non-branching nodes, called type-2 nodes, to . Let for any type-1 node of . If is a substring of but the path spelling out ends in the middle of an edge, then we introduce a type-2 node representing . We add the suffix link as well. Adding type-2 nodes splits an edge into shorter ones. Note that more than one type-2 node can be inserted into an edge of .
Let be any edge after all the type-2 nodes are inserted, where . We represent this edge by where is the first character of the original label. We also store the original label length .
We will augment with a set of SLP production rules whose nonterminals correspond to edges of . The definition and construction of this SLP will be described later in Section 3.3.
If non-branching type-2 nodes are ignored, then the topology of is the same as that of . For ease of explanation, we denote by the original label of edge . Namely, for any edge , iff is the original edge for .
The following lemma gives an upper bound of the numbers of nodes and edges in . Recall that is the number of maximal repeats in , and are respectively the number of left- and right-extensions of maximal repeats in , and .
Lemma 1.
For any string , let , then and .
Proof. Let and . It is known that , and (see  and ). Let and be the sets of type-1 and type-2 nodes in , respectively. Clearly, , , and . Let and . Note that is a maximal repeat of . For any character such that is a substring of , clearly is a left-extension of . By the definition of , it always has a (type-1 or type-2) node which corresponds to . Hence . This implies . Since each type-2 node is non-branching, clearly . ∎
Corollary 1.
For any string of over a constant alphabet, and , where .
Proof. It clearly holds that and . Thus we have . The corollary follows from Lemma 1 when . ∎
3.2 Constructing type-2 nodes and edge suffix links
Lemma 2.
Given for a text , we can compute all type-2 nodes of in time.
Proof. We create a copy of . For each edge of , we compute node and the path that spells out from . The number of type-1 nodes in this path is equal to the number of type-2 nodes that need to be inserted on edge , and hence we insert these nodes into . After the above operation is done for all edges, contains all type-2 nodes of . Since there always exists such a path , to find it suffices to check the first characters of out-going edges. Hence we need only time for each node in . Overall, it takes time. ∎
The above lemma also indicates the notion of the following edge suffix links in which are virtual links, and will not be actually created in the construction.
Definition 1 (Edge suffix links).
For any edge with , is the path, namely a list of edges, from to that is reachable from by scanning .
Edge suffix links have the following properties.
Lemma 3.
For any edge such that and its edge suffix link , (1) both and are type-1 nodes, and (2) all nodes in the path are type-2 nodes.
Proof. From the definition of edge suffix links, we have and the path from to spells out . (1) By the definitions of type-2 nodes and edge suffix links, is always of type-1. Hence it suffices to show that is of type-1. There are two cases: (a) If is a type-2 node, then by the definition of type-2 nodes, must be the node pointed to by . Therefore, is a type-1 node. (b) If is a type-1 node, then let be the shortest string represented by with and . Then, string is spelled out by a path from the source to , where either or . Since is of type-1, is also of type-1. (2) If there is a type-1 node in the path , then there has to be a (type-1 or type-2) node between and , a contradiction. ∎
Lemma 3 says that the label of any edge with can be represented by a path . In addition, since the path includes type-1 nodes only at the end points and since type-2 nodes are non-branching, is uniquely determined by a pair of . We can compute all edges for in per query, as follows. Firstly, we compute and then select the out-going edge starting with the character in time. Next, we blindly scan the downward path from while the lower end of the current edge has type-2. This scanning terminates when we reach an edge such that is of type-1.
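The scan described above can be sketched as follows. The `Node` class, its field names, and the function `edge_suffix_link` are hypothetical, and the small graph is hand-built to mirror what the construction of this section would plausibly give for the text `abab$`; it is an assumption for illustration, not an example taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(eq=False)
class Node:
    name: str
    is_type1: bool
    slink: Optional["Node"] = None
    # first character of the edge label -> (original label length, child)
    out: Dict[str, Tuple[int, "Node"]] = field(default_factory=dict)


Edge = Tuple[Node, str, int, Node]


def edge_suffix_link(u: Node, c: str) -> List[Edge]:
    """Edges of the path reached from slink(u) when scanning the label of
    the out-going edge of u that starts with character c."""
    start = u.slink
    if start is None:
        raise ValueError("the upper end point has no suffix link")
    length, child = start.out[c]  # select the edge by its first character
    path = [(start, c, length, child)]
    while not child.is_type1:  # type-2 nodes have exactly one out-going edge
        parent = child
        c2, (length, child) = next(iter(parent.out.items()))
        path.append((parent, c2, length, child))
    return path


# Hand-built toy graph (assumed to correspond to the text "abab$").
source, sink = Node("source", True), Node("sink", True)
ab = Node("ab", True)
t1, t2 = Node("t1", False), Node("t2", False)
source.out = {"a": (1, t1), "b": (1, ab), "$": (1, sink)}
t1.out = {"b": (1, ab)}
ab.out = {"a": (2, t2), "$": (1, sink)}
t2.out = {"$": (1, sink)}
ab.slink, t1.slink, t2.slink = source, source, ab

path = edge_suffix_link(ab, "a")
print([(p.name, c, n, q.name) for p, c, n, q in path])
```

In this toy graph, the edge of length 2 leaving node `ab` is mapped to a two-edge path from the source whose only interior node is of type-2, and whose label lengths sum to 2.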
3.3 Construction of the SLP for L-CDAWG
We give an SLP of size which represents and all edge labels of based on the jump links.
Jumping from an edge to a path: First, we define jump links, by which we can jump from a given edge with to the path consisting of at least two edges, and having the same string label. Although our jump link is based on that of LSTries [5], we need a new definition since a path in (and hence in ) cannot be uniquely determined by a pair of nodes, unlike (or ).
Definition 2 (Jump links).
For an edge with and , is recursively defined as follows:
if (thus ), and
Note that equals for .
Lemma 4.
For any edge with , consists of at least two edges.
Proof. Assume on the contrary that for some edge . This implies . By definition, is a proper suffix of , namely, there exists an integer such that . For any character which appears in , there is a (type-1 or type-2) node which represents as a child of the source of . This implies that there is an out-going edge of length from the source representing the first character of . This contradicts that only contains a single edge with . ∎
Theorem 5.
For a given , there is an algorithm that computes all jump links in time.
Proof. We explain how to obtain for an edge with . For every edge with , we manage a pointer to the first edge of by . We initially set for all . For all nodes with , let be an outgoing edge of with the same label character as . We check whether and, if so, we recursively compute , and then set . In this way all can be computed in time in total, where the is needed for selecting the outgoing edge. From Lemma 3, since there is no branching edge on each jump link, can be easily obtained from by traversing the path until a type-1 node is encountered. ∎
An SLP for the L-CDAWG: We build an SLP which represents all edge labels in based on jump links. For each edge , let denote the variable which generates the string label . Let . For any with , we construct a production where is the label. For any with , let . We construct productions , , …, , and . We call a production whose left-hand side is an intermediate production. It is clear that generates and we introduced productions. If there is another edge () such that , then we construct a new production and reuse the other productions. Let be the path that spells out the text . We create productions which generate using the same technique as above for this path . Overall, the total number of intermediate productions is linear in the number of type-2 nodes in . Since there are non-intermediate productions, this SLP consists of productions.
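The construction above can be sketched in Python as follows, assuming the jump links are already available as lists of edge identifiers. Everything named here (`build_slp`, `add_chain`, the `X_`/`Y_` variable names, the edge identifiers `e1` to `e7`, and the right-branching shape of the intermediate productions) is an assumption for illustration, and the toy input is hand-built for the text `abab$` rather than taken from the paper. The sharing of intermediate productions between edges with a common path suffix is omitted for brevity.

```python
def add_chain(head: str, symbols: list, rules: dict, counter: list) -> None:
    """Add binary productions so that `head` derives the concatenation of
    `symbols` (at least two), using fresh intermediate variables Y_k."""
    cur = head
    for k in range(len(symbols) - 2):
        nxt = "Y_%d" % counter[0]
        counter[0] += 1
        rules[cur] = (symbols[k], nxt)
        cur = nxt
    rules[cur] = (symbols[-2], symbols[-1])


def build_slp(atomic: dict, jump: dict, text_path: list) -> dict:
    """atomic: edge -> single character; jump: edge -> path of >= 2 edges;
    text_path: edges on the path spelling out the whole text."""
    rules, counter = {}, [0]
    for e, ch in atomic.items():
        rules["X_" + e] = ch  # production X_e -> c
    for e, path in jump.items():
        add_chain("X_" + e, ["X_" + p for p in path], rules, counter)
    add_chain("T", ["X_" + p for p in text_path], rules, counter)
    return rules


def expand(rules: dict, var: str) -> str:
    """Fully expand a variable (used here only to check the result)."""
    out, stack = [], [var]
    while stack:
        rhs = rules[stack.pop()]
        if isinstance(rhs, str):
            out.append(rhs)
        else:
            stack.extend([rhs[1], rhs[0]])
    return "".join(out)


# Toy input: e5 is the only non-atomic edge and jumps to the path e1, e2.
atomic = {"e1": "a", "e2": "b", "e3": "b", "e4": "$", "e6": "$", "e7": "$"}
jump = {"e5": ["e1", "e2"]}
slp_rules = build_slp(atomic, jump, ["e1", "e2", "e5", "e6"])
print(expand(slp_rules, "X_e5"), expand(slp_rules, "T"))
```

Each non-atomic edge contributes one non-intermediate production plus one intermediate production per interior node of its jump path, which matches the counting argument in the paragraph above.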
Now, we have the main result of this subsection.
For a given , there is an algorithm that constructs an SLP which represents all edge labels in time.
Proof. By the above algorithm, if jump links are computed, we can obtain an SLP which represents all edge labels in time. From Theorem 5, we can compute all jump links in time. Overall, the total time of this algorithm is . ∎
Fig. 2 shows and enhanced with the SLP for string .
We associate to each edge label the corresponding variable of the SLP. By applying algorithms of Gasieniec et al. [8] (in Proposition 2) and Bille et al. [2] (in Proposition 3), we can show the following theorems.
For a text , can support pattern matching for a pattern of length in time.
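A simplified view of how matching proceeds on such a structure is sketched below: at each node the out-going edge is selected by its first character, and the pattern is compared against the edge label decoded from the SLP. The graph encoding, the rule names, and the function `contains` are hypothetical; the toy graph is hand-built for the text `abab$`; labels are fully expanded instead of being decoded character by character as in Proposition 2; and only the existence of an occurrence is decided, not the list of occurrences.

```python
def expand(rules: dict, var: str) -> str:
    """Fully expand an SLP variable (a stand-in for prefix decoding)."""
    out, stack = [], [var]
    while stack:
        rhs = rules[stack.pop()]
        if isinstance(rhs, str):
            out.append(rhs)
        else:
            stack.extend([rhs[1], rhs[0]])
    return "".join(out)


def contains(graph: dict, rules: dict, source: str, pattern: str) -> bool:
    """Return True iff `pattern` labels a path prefix starting at `source`."""
    node, i = source, 0
    while i < len(pattern):
        edge = graph[node].get(pattern[i])  # select by first character
        if edge is None:
            return False
        var, child = edge
        label = expand(rules, var)
        k = min(len(label), len(pattern) - i)
        if pattern[i:i + k] != label[:k]:
            return False
        i += k
        node = child
    return True


# Toy structure: node -> {first character: (SLP variable, child)}.
label_rules = {"Xa": "a", "Xb": "b", "X$": "$", "Xab": ("Xa", "Xb")}
graph = {
    "source": {"a": ("Xa", "t1"), "b": ("Xb", "ab"), "$": ("X$", "sink")},
    "t1": {"b": ("Xb", "ab")},
    "ab": {"a": ("Xab", "t2"), "$": ("X$", "sink")},
    "t2": {"$": ("X$", "sink")},
    "sink": {},
}
print(contains(graph, label_rules, "source", "bab"),
      contains(graph, label_rules, "source", "ba$"))
```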
For a text of length , has an SLP that derives . In addition, any substring can be read in time.
Proof. The text of is represented by the longest path from the source to the sink. Remembering makes it possible to read any position of by using Proposition 3. ∎
3.4 The main result
It is known that for a given string of length over an integer alphabet of size , can be constructed in time . Combining this with the preceding discussions, we obtain the main result of this paper.
For a text of length , supports pattern matching in time for a given pattern of length and substring extraction in time for any substring of length , and can be stored in bits of space (or words of space). If is already constructed, then can be constructed in total time. If is given as input, then can be constructed in total time for integer alphabets of size . After has been constructed, the input string can be discarded.
4 Conclusions and further work
In this paper, we presented a new repetition-aware data structure called Linear-size CDAWGs. takes linear space in the number of the left- and right-extensions of the maximal repeats in , which is known to be small for highly repetitive strings. The key idea is to introduce type-2 nodes following LSTries proposed by Crochemore et al. [5]. Using a small SLP induced from edge-suffix links that is enhanced with random access and prefix extraction data structures, our supports efficient pattern matching and substring extraction. This SLP is repetition-aware, i.e., its size is linear in the number of left- and right-extensions of the maximal repeats in . We also showed how to efficiently construct .
Our future work includes implementation of and evaluation of its practical efficiency, when compared with previous compressed indexes for repetitive texts. An interesting open question is whether we can efficiently construct in an on-line manner for growing text.
[1] D. Belazzougui, F. Cunial, T. Gagie, N. Prezza, and M. Raffinot. Composite repetition-aware data structures. In Combinatorial Pattern Matching, pages 26–39. Springer, 2015.
[2] P. Bille, G. M. Landau, R. Raman, K. Sadakane, S. R. Satti, and O. Weimann. Random access to grammar-compressed strings and trees. SIAM Journal on Computing, 44(3):513–539, 2015.
[3] A. Blumer, J. Blumer, D. Haussler, R. McConnell, and A. Ehrenfeucht. Complete inverted files for efficient text retrieval and analysis. Journal of the ACM (JACM), 34(3):578–595, 1987.
[4] F. Claude and G. Navarro. Self-indexed grammar-based compression. Fundamenta Informaticae, 111(3):313–337, 2011.
[5] M. Crochemore, C. Epifanio, R. Grossi, and F. Mignosi. Linear-size suffix tries. Theoretical Computer Science, 638:171–178, 2016.
[6] M. Crochemore and W. Rytter. Jewels of stringology: text algorithms. World Scientific, 2003.
[7] M. Crochemore and R. Vérin. Direct construction of compact directed acyclic word graphs. In Combinatorial Pattern Matching, pages 116–129. Springer, 1997.
[8] L. Gasieniec, R. M. Kolpakov, I. Potapov, and P. Sant. Real-time traversal in grammar-based compressed files. In Data Compression Conference, page 458, 2005.
[9] R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In ACM-SIAM symposium on Discrete algorithms, pages 841–850. Society for Industrial and Applied Mathematics, 2003.
[10] S. Kreft and G. Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, 483:115–133, 2013.
[11] V. Mäkinen, G. Navarro, J. Sirén, and N. Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281–308, 2010.
[12] K. Narisawa, S. Inenaga, H. Bannai, and M. Takeda. Efficient computation of substring equivalence classes with suffix arrays. Algorithmica, 2016.
[13] G. Navarro. A self-index on block trees. arXiv preprint arXiv:1606.06617, 2016.
[14] G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Computing Surveys (CSUR), 39(1):2, 2007.
[15] M. Raffinot. On maximal repeats in strings. Information Processing Letters, 80(3):165–169, 2001.
[16] Y. Takabatake, Y. Tabei, and H. Sakamoto. Improved ESP-index: a practical self-index for highly repetitive texts. In International Symposium on Experimental Algorithms, pages 338–350. Springer, 2014.
[17] P. Weiner. Linear pattern-matching algorithms. In IEEE Annual Symposium on Switching and Automata Theory, pages 1–11, 1973.