Data Structure Lower Bounds for Document Indexing Problems

Abstract

We study data structure problems related to document indexing and pattern matching queries. Our main contribution is to show that the pointer machine model of computation can be extremely useful in proving high and unconditional lower bounds that cannot be obtained in any other known model of computation with current techniques. Often our lower bounds match the known space-query time trade-off curve, and in fact, for all the problems considered, there is a reasonably good match between our lower bounds and the known upper bounds, at least for some choice of input parameters.

The problems that we consider are set intersection queries (both the reporting variant and the semi-group counting variant), indexing a set of documents for two-pattern queries, or forbidden-pattern queries, or queries with wild-cards, and indexing an input set of gapped-patterns (or two-patterns) to find those matching a document given at the query time.

Keywords: Data Structure Lower Bounds, Pointer Machine, Set Intersection, Pattern Matching

Copyright: Peyman Afshani and Jesper Sindahl Nielsen

Subject classification: F.2.2 Nonnumerical Algorithms and Problems

1 Introduction

In this paper we study a number of data structure problems related to document indexing and prove space and query time lower bounds that in most cases are almost tight. Unlike many of the previous lower bounds, we disallow random accesses by working in the pointer machine model of computation. However, we obtain high and unconditional space and query time lower bounds that almost match the best known data structures for all the problems considered. At the moment, obtaining such unconditional and tight lower bounds in other models of computation is a hopelessly difficult problem. Furthermore, compared to the previous lower bounds in the area, our lower bounds probe deeper and thus are much more informative. Consequently, these results show the usefulness of the pointer machine model.

Document indexing is an important problem in the field of information retrieval. Generally, the input is a collection of documents with a total length of characters and usually the goal is to index them such that given a query, all the documents matching the query can be either found efficiently (the reporting variant) or counted efficiently (the searching variant). When the query is just one text pattern, the problem is classical and well-studied and there are linear space solutions with optimal query time [36]. Not surprisingly, there have been various natural extensions of this problem. We summarize the problems we study and our results below.

1.1 New and Previous Results

Problem | Query Bound | Space Lower Bound | Assumptions
2P, FP, SI, 2FP (counting) | | | Semi-group
2P, FP, SI, 2FP (reporting) | | | PM
2P, FP, SI, 2FP (reporting) | | | PM, a parameter
WCI (reporting) | | | PM, wild-cards,
WCI (reporting) | | | PM, wild-cards,
-GPI | | | PM,
Table 1: At every row, the 3rd cell presents our space lower bound for data structures that have a query time bounded by the 2nd cell. PM stands for the pointer machine model. is the input size and is the output size.

Two-pattern and the Related Queries

The two-pattern query problem (abbreviated as the 2P problem) was introduced in 2001 by Ferragina et al. [25] and since then it has attracted considerable attention. In the 2P problem, each query is composed of two patterns and a document matches the query if both patterns occur in the document. One can also define the Forbidden Pattern (FP) problem [26] where a document matches the query if it contains one pattern but not the other. For symmetry, we also introduce and consider the Two Forbidden Patterns (2FP) problem where a document matches the query if neither of the two patterns occurs in it.
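To make the three query types concrete, the following is a minimal brute-force sketch of their matching semantics. It scans every document and is not an index; the function names and the sample documents are hypothetical and chosen only for illustration.

```python
from typing import Callable, List


def occurs(pattern: str, document: str) -> bool:
    """True if `pattern` occurs as a substring of `document`."""
    return pattern in document


def matches_2p(document: str, p1: str, p2: str) -> bool:
    """2P: both patterns must occur in the document."""
    return occurs(p1, document) and occurs(p2, document)


def matches_fp(document: str, required: str, forbidden: str) -> bool:
    """FP: the first pattern occurs, the second one does not."""
    return occurs(required, document) and not occurs(forbidden, document)


def matches_2fp(document: str, p1: str, p2: str) -> bool:
    """2FP: neither pattern occurs in the document."""
    return not occurs(p1, document) and not occurs(p2, document)


def report(docs: List[str],
           predicate: Callable[[str, str, str], bool],
           p1: str, p2: str) -> List[int]:
    """Reporting variant: indices of all documents matching the query."""
    return [i for i, d in enumerate(docs) if predicate(d, p1, p2)]


# Hypothetical sample collection.
DOCS = ["abracadabra", "banana", "cabana"]
```

The counting variant would simply return the length of the reported list; the lower bounds in this paper concern how much space an index needs in order to avoid such a full scan.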

Previous Results. Ferragina et al. [25] presented a number of solutions for the 2P problem with space and query times that depend on the “average size” of each document but the worst case query time and space are and , for any , respectively. Here and are the sizes of the query patterns and is the output size (see also [36]). Cohen and Porat [21] offered a solution that uses space with query time. The space was improved to and the query time to by Hon et al. [27]. The query time was reduced by a  factor in [32] and finally the query time was achieved in [10].

The FP problem was introduced by Fischer et al. [26] and they presented a data structure that stores bits and answers queries in time. Another solution was given by Hon et al. [28] that uses space but has query time. For the searching variant (unweighted) their solution can answer queries in time. As Larsen et al. [32] remark, the factor can be removed by using range emptiness data structures and that the same data structure can be used to count the number of matching documents for the two pattern problem.

The difficulty of obtaining fast data structures using (near) linear space has led many to believe that very efficient solutions are impossible to obtain. Larsen et al. [32] specifically focus on proving such impossibility claims and they show that the 2P and FP problems are at least as hard as Boolean Matrix Multiplication, meaning, with current techniques, where and are the preprocessing and the query times of the data structure, respectively and is the exponent of the best matrix multiplication algorithm (currently ). If one assumes that there is no “combinatorial” matrix multiplication algorithm with better running time than , then the lower bound becomes . Other conditional lower bounds for the 2P and FP problems but from the integer 3SUM conjecture were obtained by Kopelowitz, et al. [30, 31].

The above results are conditional. Furthermore, they tell us nothing about the complexity of the space usage, , versus the query time, which is what we are truly interested in for data structure problems. Moreover, even under the relatively generous assumption that , the space and query lower bounds obtained from the above results have polynomial gaps compared with the current best data structures.

We need to remark that the only unconditional space lower bound is a pointer machine lower bound that shows with query time of the space must be  [26]. However this bound is very far away from the upper bounds (and also much lower than our lower bounds).

Our Results. We show that all the known data structures for 2P and FP problems are optimal within factors, at least in the pointer machine model of computation: Consider a pointer machine data structure that uses space and can report all the documents that match a given 2P query (or FP query, or 2FP query) in time. We prove that we must have . As a corollary of our lower bound construction, we also obtain that any data structure that can answer 2P query (or FP query, or 2FP query) in time, for any fixed constant must use super-linear space. As a side result, we show that surprisingly, the counting variant of the problem is in fact easier: in the semi-group model of computation (see [18] or Appendix G for a description of the semi-group model), we prove that we must have .

Set Intersection Queries.

The interest in set intersection problems has grown considerably in recent years and variants of the set intersection problem have appeared in many different contexts. Here, we work with the following variants of the problem. The input is sets, of total size , from a universe and each query is a pair of indices and . The decision variant asks whether . The reporting variant asks for all the elements in . In the counting variant, the result should be . In the searching variant, the input also includes a weight function where is a semi-group. The query asks for .
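The four variants can be illustrated with a small brute-force sketch. The function names, the sample sets, and the choice of returning `None` for an empty intersection in the searching variant (a semi-group need not have an identity element) are assumptions made for this illustration; the decision variant is written here as a test for a non-empty intersection.

```python
from functools import reduce
from operator import add
from typing import Callable, List, Optional, Set, TypeVar

W = TypeVar("W")


def si_decision(sets: List[Set[int]], i: int, j: int) -> bool:
    """Decision variant: do the two sets share at least one element?"""
    return not sets[i].isdisjoint(sets[j])


def si_reporting(sets: List[Set[int]], i: int, j: int) -> Set[int]:
    """Reporting variant: all elements of the intersection."""
    return sets[i] & sets[j]


def si_counting(sets: List[Set[int]], i: int, j: int) -> int:
    """Counting variant: the size of the intersection."""
    return len(sets[i] & sets[j])


def si_searching(sets: List[Set[int]], i: int, j: int,
                 weight: Callable[[int], W],
                 op: Callable[[W, W], W]) -> Optional[W]:
    """Searching variant: combine the weights of the common elements with
    a commutative, associative operation `op` (no identity is assumed)."""
    weights = [weight(e) for e in sets[i] & sets[j]]
    return reduce(op, weights) if weights else None


# Hypothetical sample input.
SETS = [{1, 2, 3, 5}, {2, 3, 7}, {8}]
```

With `op=add` and unit weights the searching variant reduces to counting; with `op=max` it returns the heaviest common element, which shows why a lower bound in the semi-group model covers a whole family of aggregate queries.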

Previous Results. The set intersection queries have appeared in many different formulations and variants (e.g., see [22, 39, 40, 1, 30, 21]). The most prominent conjecture is that answering the decision variant with constant query time requires space (see [39] for more details). For the reporting variant, Cohen and Porat [21] presented a data structure that uses linear space and answers queries in time, where is the output size. They also presented a linear-space data structure for the searching variant that answers queries in time. In [29] the authors study set intersection queries because of connections to dynamic graph connectivity problems. They offer very similar bounds to those offered by Cohen and Porat (with a factor worse space and query times) but they allow updates in time. It is commonly believed that all set intersection queries are hard. Explicitly stated conjectures on set intersection problems are used to obtain conditional lower bounds for problems such as distance oracles [39, 40, 22] while other well-known conjectures, such as the 3SUM conjecture, can be used to show conditional lower bounds for variants of set intersection problems [30, 1]. For other variants see [9, 37, 24].

Dietz et al. [24] considered set intersection queries in the semi-group model (a.k.a the arithmetic model) and they presented near optimal dynamic and offline lower bounds. They proved that given a sequence of updates and queries one must spend time (ignoring polylog factors); in the offline version a sequence of insertions and queries are used but in the dynamic version, the lower bound applies to a dynamic data structure that allows insertion and deletion of points, as well as set intersection queries.

Our Results. Our lower bounds for the 2P problem easily extend to the SI problem. Perhaps the most interesting revelation here is that the searching variant is much easier than the reporting variant ( for reporting versus for searching). Based on this, we make another conjecture that even in the RAM model, reporting the elements in for two given query indices and , in time, for any fixed constant , requires space. Such a separation between counting and reporting is a rare phenomenon; more often, counting is the more difficult variant.

Observe that conditional lower bounds based on the Boolean Matrix Multiplication or the integer 3SUM conjectures have limitations in distinguishing between the counting and the reporting variants: For example, consider the framework of Larsen et al. [32] and for the best outcome, assume and also that boolean matrix multiplication requires time; then their framework yields that . When this does not rule out the possibility of having (in fact the counting variant can be solved with linear space). However, our lower bound shows that even with the reporting variant requires space.

Wild Card Indexing

We study the document indexing problem of matching with “don’t cares”, also known as wild card matching or indexing (WCI). The setup is the same as for the 2P problem, except a query is a single pattern but it also contains wild cards denoted by a special character “”. A “” matches any one character. The task is to report all documents matched by the query. This is a well-studied problem from the upper bound perspective and there are generally two variations: either the maximum number of wild cards is bounded or it supports any number of wild cards. We consider the bounded version where patterns contain up to wild cards and is known in advance by the data structure.
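As an illustration of the query semantics, here is a brute-force sketch of wild card matching. The wild card symbol `*`, the parameter name `kappa` for the bound on the number of wild cards, and the sample documents are assumptions made for this sketch; as in the 2P setup, a pattern is taken to match a document if it occurs somewhere inside it.

```python
from typing import List

WILDCARD = "*"  # hypothetical choice of the special wild card character


def wc_occurs(pattern: str, document: str) -> bool:
    """True if `pattern` occurs in `document`, where each wild card in
    the pattern matches exactly one arbitrary character."""
    m = len(pattern)
    for start in range(len(document) - m + 1):
        window = document[start:start + m]
        if all(p == WILDCARD or p == c for p, c in zip(pattern, window)):
            return True
    return False


def wci_report(docs: List[str], pattern: str, kappa: int) -> List[int]:
    """Report the indices of all documents matched by a pattern that
    contains at most `kappa` wild cards (the bounded version)."""
    if pattern.count(WILDCARD) > kappa:
        raise ValueError("pattern has more wild cards than the bound allows")
    return [i for i, d in enumerate(docs) if wc_occurs(pattern, d)]


# Hypothetical sample collection.
DOCS = ["banana", "bandana", "cabana"]
```

This scan takes time proportional to the total document length per query; the data structures discussed below trade space that grows exponentially with the number of wild cards for much faster queries.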

Previous Results. Cole et al. [23] presented a data structure that uses words of space and answers queries in , where is the number of occurrences and is the length of the query pattern. The space has been improved to bits while keeping the same query time [33]. Another improvement came as a trade-off that increased query time to and reduced space usage for any where is the alphabet size and is the number of wild cards in the query [8]. In the same paper an alternate solution with query time and space usage was also presented. Other results have focused on reducing space while increasing the query time but their query time now depends on the alphabet size, e.g., in [34] the authors provide a data structure with space but with the query time of .

From these bounds we note three things, first that all solutions have some exponential dependency on and second the alternate solution by Bille et al. [8] has an odd factor, which is exponential on a quadratic function of as opposed to being exponential on a linear function of (such as , , or ). Third, when the query time is forced to be independent of , there is a discrepancy between query and space when varying : Increasing (when it is small) by one increases the space by approximately a factor, while to increase the query time by a factor, needs to be increased by . Based on the third point, it is quite likely that unlike SI or 2P problems, WCI does not have a simple trade-off curve.

Other results that are not directly comparable to ours include the following: One is an space index with query time [41] where is the number of occurrences of all the subpatterns separated by wild card characters. Note that could be much larger than and in fact, this can result in a worst case linear query time, even with small values of . Nonetheless, it could perform reasonably well in practice. Two, there are lower bounds for the partial match problem, which is a related problem (see [38] for more details).

Our Results. For WCI with wild cards, we prove two results, both in the pointer machine model. In summary, we show that the exponential dependency of space complexity or query time on generally cannot be avoided.

As our first result and for a binary alphabet (), we prove that for , any data structure that answers queries in time must consume space. This result rules out the possibility of lowering the factor in the alternate solution offered by Bille et al. [8], over all values of , to for any constant : by setting (and ), such a bound will be much smaller than our space lower bound (essentially involves comparing factor to a factor). While this does not rule out improving the space bound for small values of , it shows that the exponential dependency on is almost tight at least in a particular point in the range of parameters.

As our second result, we prove that answering WCI queries in time requires space, as long as . Note that this query time is assumed to be independent of . This result also has a number of consequences. One, it shows that any data structure with query time of requires space. Note that this is rather close to the space complexity of the data structure of Cole et al. [23] ( factor away). In other words, the data structure of Cole et al. cannot be significantly improved both in space complexity and query time; e.g., it is impossible to answer WCI queries in time using space, for any constant .

Two, is the smallest value where linear space becomes impossible with polylogarithmic query time. This is very nice since can be solved with almost linear space [33]. Furthermore this shows that the increase by a factor in the space complexity for every increase of is necessary (for small values of ).

Three, we can combine our second result with our first result when . As discussed before, our first result rules out fast queries (e.g., when ), unless the data structure uses large amounts of space, so consider the case when . In this case, we can rule out lowering the space usage of the data structure of Cole et al. to for any constant : apply our second lower bound with fewer wild cards, specifically, with wild cards, for a small enough constant that depends on . Observe that , so the second result lower bounds the space by , which for a sufficiently small is greater than .

As mentioned in the beginning, our results show that the exponential dependency of space or query time on cannot be improved in general. Furthermore, at a particular point in the range of parameters (when ), all the known exponential dependencies on are almost tight and cannot be lowered to an exponential dependency on (or on for the alternate solution) for any constant . Nonetheless, there are still gaps between our lower bounds and the known data structures. We believe it is quite likely both our lower bounds and the existing data structures can be improved to narrow the gap.

Gapped Pattern Indexing

A -gapped pattern is a pattern where and , are integers, and each , , is a string over an alphabet of size . Such a -gapped pattern matches a document in which one can find one occurrence of every such that there are at least and at most characters between the occurrence of and the occurrence of , .
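The matching rule can be sketched as follows. The representation (a list of subpatterns plus a list of inclusive `(low, high)` gap bounds between consecutive subpatterns) and all names are assumptions made for illustration; the gap is measured from the end of one occurrence to the start of the next.

```python
from typing import List, Set, Tuple


def gapped_match(doc: str, subpatterns: List[str],
                 gaps: List[Tuple[int, int]]) -> bool:
    """True if `doc` contains occurrences of all subpatterns, in order,
    with the number of characters between consecutive occurrences lying
    within the corresponding (low, high) gap bounds."""
    if len(gaps) != len(subpatterns) - 1:
        raise ValueError("need exactly one gap per consecutive pair")

    def starts(p: str, lo: int, hi: int) -> Set[int]:
        # All start positions s in [lo, hi] at which p occurs in doc.
        hi = min(hi, len(doc) - len(p))
        return {s for s in range(max(lo, 0), hi + 1) if doc.startswith(p, s)}

    # Candidate start positions of the current subpattern.
    frontier = starts(subpatterns[0], 0, len(doc))
    for prev, cur, (lo, hi) in zip(subpatterns, subpatterns[1:], gaps):
        nxt: Set[int] = set()
        for s in frontier:
            end = s + len(prev)  # first position after the previous match
            nxt |= starts(cur, end + lo, end + hi)
        frontier = nxt
    return bool(frontier)
```

For example, in the document `xxabyycdzz` the subpattern `ab` ends two characters before `cd` begins, so the pattern `ab`, `cd` matches with gap bounds `(0, 2)` but not with `(3, 5)`.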

Previous Results. The gapped pattern indexing is often considered both in the online and the offline version (e.g., see [6, 43]). However, the result most relevant to us is [5], where they consider the following data structure problem: given a set of -gapped patterns of total size , where all the patterns are in the form of , store them in a data structure such that given a document of length at the query time, one can find all the gapped patterns that match the query document (in general we call this the -gapped pattern indexing (-GPI) problem). In [5], the authors give a number of upper bounds and conditional lower bounds for the problem. Among a number of results, they can build a data structure of linear size that can answer queries in where notation hides polylogarithmic factors and is the output size. For the lower bound and among a number of results, they can show that with linear space query time is needed.

Our Results. We consider the general -GPI problem where for all , and prove a lower bound that is surprisingly high: any pointer machine data structure that can answer queries in time must use super-polynomial space of . By construction, this result also holds if the input is a set of patterns where they all need to match the query document (regardless of their order and size of the gaps). In this case, answering queries in requires the same super-polynomial space. Note that in this case is the “dual” of the 2P problem: store a set of two-patterns in a data structure such that given a query document, we can output the subset of two-patterns that match the document.

1.2 Technical Preliminaries

The Pointer Machine Model [42]. This models data structures that solely use pointers to access memory locations (e.g., any tree-based data structure). We focus on a variant that is the popular choice when proving lower bounds [17]. Consider an abstract “reporting” problem where the input includes a universe set where each query reports a subset, , of . The data structure is modelled as a directed graph with outdegree two (and a root node ) where each vertex represents one memory cell and each memory cell can store one element of ; edges between vertices represent pointers between the memory cells. All other information can be stored and accessed for free by the data structure. The only requirement is that given the query , the data structure must start at and explore a connected subgraph of and find its way to vertices of that store the elements of . The size of the subgraph explored is a lower bound on the query time and the size of is a lower bound on the space.
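The accounting used by this model can be illustrated with a small sketch. The example below is hypothetical and not taken from the paper: it stores integers in a balanced binary search tree (each cell holds one element and has at most two outgoing pointers), answers a one-dimensional range reporting query by walking pointers from the root, and counts the cells visited. The number of visited cells is what the model charges as query time, and the number of cells is what it charges as space.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Cell:
    """One memory cell: stores one element and at most two pointers."""
    element: int
    left: Optional["Cell"] = None
    right: Optional["Cell"] = None


def build(sorted_elems: List[int]) -> Optional[Cell]:
    """Build a balanced binary search tree over a sorted list."""
    if not sorted_elems:
        return None
    mid = len(sorted_elems) // 2
    return Cell(sorted_elems[mid],
                build(sorted_elems[:mid]),
                build(sorted_elems[mid + 1:]))


def count_cells(root: Optional[Cell]) -> int:
    """Space in this model: the number of cells in the graph."""
    if root is None:
        return 0
    return 1 + count_cells(root.left) + count_cells(root.right)


def range_query(root: Optional[Cell], lo: int, hi: int) -> Tuple[List[int], int]:
    """Report all stored elements in [lo, hi], starting at the root and
    following pointers only. Returns (output, number of cells visited)."""
    reported: List[int] = []
    visited = 0
    stack = [root] if root is not None else []
    while stack:
        cell = stack.pop()
        visited += 1
        if lo <= cell.element <= hi:
            reported.append(cell.element)
        if cell.left is not None and lo < cell.element:
            stack.append(cell.left)
        if cell.right is not None and cell.element < hi:
            stack.append(cell.right)
    return sorted(reported), visited
```

Every reported element lives in a visited cell, so the visit count is never smaller than the output size; the lower bound frameworks below exploit exactly this connection between the explored subgraph and the output.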

An important remark. The pointer-machine can be used to prove lower bounds for data structures with query time where is the output size and is “search overhead”. Since we can simulate any RAM algorithm on a pointer-machine with factor slow down, we cannot hope to get high unconditional lower bounds if we assume the query time is , since that would automatically imply RAM lower bounds for data structures with query time, something that is hopelessly impossible with current techniques. However, when restricted to query time of , the pointer-machine model is an attractive choice and it has an impressive track record of proving lower bounds that match the best known data structures up to very small factors, even when compared to RAM data structures; we mention two prominent examples here. For the fundamental simplex range reporting problem, all known solutions are pointer-machine data structures [35, 11, 20] and the known pointer machine lower bounds match these up to an factor [2, 19]. One can argue that it is difficult to use the power of RAM for simplex range reporting problem. However, for the other fundamental orthogonal range reporting, where it is easy to do various RAM tricks, the best RAM data structures save at most a factor compared to the best known pointer machine solutions (e.g., see [3, 4, 12]). Also, where cell-probe lower bounds cannot break the query barrier, very high lower bounds are known for the orthogonal range reporting problem in the pointer machine model [3, 4, 17].

Known Frameworks. The fundamental limitation in the pointer machine model is that starting from a memory cell , one can visit at most other memory cells using pointer navigations. There are two known methods that exploit this limitation and build two different frameworks for proving lower bounds.

The first lower bound framework was given by Bernard Chazelle [17, 19]. However, we will need a slightly improved version of his framework that is presented in the following theorem; essentially, we need a slightly tighter analysis on a parameter that was originally intended as a large constant. Due to lack of space we defer the proof to Appendix A.

Theorem. Let be a set of input elements and a set of queries where each query outputs a subset of . Assume there exists a data structure that uses space and answers each query in time, where is the output size. Assume (i) the output size of any query , denoted by , is at least , for a parameter and (ii) for integers and , and indices, , . Then, .

The second framework is due to Afshani [2] and it is designed for “geometric stabbing problems”: given an input set of geometric regions, the goal is to store them in a data structure such that given a query point , one can output the subset of regions that contain . The framework is summarized below.

Theorem [2]. Assume one can construct a set of geometric regions inside the -dimensional unit cube such that (i) every point of the unit cube is contained in at least regions, and (ii) the volume of the intersection of every regions is at most , for some parameters , , and . Then, for any pointer-machine data structure that uses space and can answer geometric stabbing queries on the above input in time , where is the output size and is some increasing function, if then .

These two frameworks are not easily comparable. In fact, for many constructions, often only one of them gives a non-trivial lower bound. Furthermore, as remarked by Afshani [2], Theorem 1.2 does not need to be operated in the -dimensional unit cube and in fact any measure could be substituted instead of the -dimensional Lebesgue measure.

Our Techniques. Our first technical contribution is to use Theorem 1.2 in a non-geometric setting by representing queries as abstract points under a discrete measure and each input object as a range that contains all the matching query points. Our lower bound for the -GPI problem and one of our WCI lower bounds are proved in this way. The second technical contribution is actually building and analyzing proper input and query sets to be used in the lower bound frameworks. In general, this is not easy and in fact for some problems it is highly challenging. Also see Section 5 (Conclusions) for a list of open problems.

In the rest of this article, we present the technical details behind our -GPI lower bound and most of the details of our first WCI lower bound. Due to lack of space, the rest of the technical details have been moved to the appendix.

We begin with the -GPI problem since it turns out for this particular problem we can get away with a simple deterministic construction. For WCI, we need a more complicated randomized construction to get the best result and thus it is presented next.

2 Gapped Pattern Lower Bound

In this section we deal with the data structure version of the gapped pattern problem. The input is a collection of -gapped patterns (typically called a dictionary), with total length (in characters). The goal is to store the input in a data structure such that given a document of size , one can report all the input gapped patterns that match the query. We focus on special -gapped patterns that we call standard: a standard -gapped pattern is of the form where each is a string (which we call a subpattern) and is an integer.

Theorem. For and in the pointer machine model, answering -GPI queries in time requires space.

To prove this theorem, we build a particular input set of standard -gapped patterns. We pick the alphabet , and the gapped patterns only use . Each subpattern in the input is a binary string of length . The subpatterns in any gapped pattern are in lexicographic order, and a subpattern occurs at most once in a pattern (i.e., no two subpatterns in a pattern are identical). The input set, , contains all possible gapped patterns obeying these conditions. Thus, . Each query is a text composed of concatenation of substrings (for simplicity, we assume is an integer) and each substring is in the form ’’. We restrict the query substrings to be in lexicographic order and without repetitions (no two substrings in a query are identical). The set of all query texts satisfying these constraints is denoted by . Thus, .

Lemma. For , any text matches gapped patterns in .

Proof.

Consider a text . To count the number of gapped patterns that can match it, we count the different ways of selecting positions that correspond to the starting positions of a matching subpattern. Each position starts immediately after ’#’, with at most characters between consecutive positions. Since , we have choices for picking the first position, i.e., the starting position of a gapped pattern matching . After fixing the first match, there are at most choices for the position of the next match between a subpattern and substring. However, if the first match happens in the first half of text , there are always choices for the position of each subpattern match (since ). Thus, we have choices. As input subpatterns are in lexicographically increasing order, different choices result in distinct gapped patterns that match the query. ∎

To apply Theorem 1.2, we consider each query text in as a “discrete” point with measure . Thus, the total measure of (i.e., the query space) is one and functions as the “unit cube” within the framework of Theorem 1.2. We consider an input gapped pattern in as a range that contains all the points of that match . Thus, to apply Theorem 1.2, we need to find a lower bound on the output size of every query (condition (i)) and an upper bound on , the measure of the intersection of inputs (condition (ii)). By the above lemma, meeting the first condition is quite easy: we pick (with the right hidden constant). Later we shall see that so this can be written as Thus, we only need to upper bound which we do below.

Lemma. Consider patterns . At most texts in can match all patterns .

Proof.

Collectively, must contain at least distinct subpatterns: otherwise, we can form at most different gapped patterns, a contradiction. This in turn implies that any text matching must contain all these at least distinct subpatterns. Clearly, the number of such texts is at most . ∎

As the measure of each query in is , by the above lemma, we have . We can now apply Theorem 1.2. Each pattern in the input has characters and thus the total input size, , is . By the framework and the two lemmas above, we know that the space usage of the data structure is at least

where to obtain the rightmost equation we expand the binomials, simplify, and constrain to lower bound with . Now we work out the parameters. We know that ; this is satisfied by setting for some constant . Observe that there is an implicit constraint on and : there should be sufficient bits in the subpatterns to fill out a query with distinct subpatterns, i.e. ; we pick for some other constant such that and thus . Using these values, the space lower bound is simplified to

where is another constant. We now optimize the lower bound by picking such that which solves to . Thus, for constant , the space complexity of the data structure is

where the last part follows from .

3 Wild Card Indexing

In this section we consider the wild card indexing (WCI) problem and prove both space and query lower bounds in the pointer machine model of computation. Note that our query lower bound applies to an alphabet size of two (i.e., binary strings).

3.1 The Query Lower Bound

Assume for any input set of documents of total size , we can build a data structure such that given a WCI query of length containing wild cards, we can find all the documents that match the query in time, where is the output size. Furthermore, assume is a fixed parameter known by the data structure and that the data structure consumes space. Our main result here is the following.

Theorem. If and , then .

To prove the lower bound, we build a particular set of documents and patterns and prove that if the data structure can answer the queries fast, then it must consume lots of space, for this particular input, meaning, we get lower bounds for the function . We now present the details. We assume , as otherwise the theorem is trivial.

Documents and patterns. We build the set of documents in two stages. Consider the set of all bit strings of length with exactly “1”s. In the first stage, we sample each such string independently with probability where . Let be the set of sampled strings. In the second stage, for every set of indices, , where , we perform the following operation, given another parameter : if there are more than strings in that have “1”s only among positions , then we remove all such strings from . Consequently, among the remaining strings in , “1”s in every subset of strings will be spread over at least positions. The set of remaining strings will form our input set of documents. Now we consider the set of all the patterns of length that have exactly wild cards and “0”s. We remove from any pattern that matches fewer than documents from . The remaining patterns in will form our query set of patterns. In Appendix B, we prove the following.

Lemma. With positive probability, we get a set of documents and a set of patterns such that (i) each pattern matches documents, and (ii) there are no documents whose “1”s are contained in a set of indices.

To prove the lower bound, we use Theorem 1.2. We use a discrete measure here: each pattern is modelled as a “discrete” point with measure , meaning, the space of all patterns has measure one. Each document forms a range: a document contains all the patterns (discrete points) that match . Thus, the measure of every document is , where is the number of patterns that match . We consider the measure of the intersection of documents (regions) . By Lemma 3.1, there are indices where one of these documents has a “1”; any pattern that matches all of these documents must have a wild card in all of those positions. This means, there are at most patterns that could match documents . This means, when we consider documents as ranges, the intersection of every documents has measure at most which is an upper bound for parameter in Theorem 1.2. For the two other parameters and in the theorem we have, (by Lemma 3.1) and . To obtain a space lower bound from Theorem 1.2, we must check if . Observe that since the binomials sum to for and is the largest one. As , we have . However, also involves an additive term. Thus, we must also have which will hold for our choice of parameters but we will verify it later.
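The counting step in this argument can be checked by brute force on tiny instances. In the sketch below, the wild card symbol `*`, the names `m` (common length of documents and patterns) and `kappa` (number of wild cards), and the sample documents are all assumptions made for illustration. It confirms that a pattern consisting of wild cards and “0”s matches a set of bit strings only if it has a wild card at every position where some document has a “1”, so the number of common patterns is a single binomial coefficient.

```python
from itertools import combinations
from math import comb
from typing import Iterator, List

WILDCARD = "*"  # hypothetical wild card symbol


def matches(pattern: str, doc: str) -> bool:
    """Equal-length match where a wild card matches any one character."""
    return len(pattern) == len(doc) and all(
        p == WILDCARD or p == c for p, c in zip(pattern, doc))


def patterns(m: int, kappa: int) -> Iterator[str]:
    """All length-m patterns with exactly kappa wild cards, rest '0'."""
    for positions in combinations(range(m), kappa):
        chosen = set(positions)
        yield "".join(WILDCARD if i in chosen else "0" for i in range(m))


def common_patterns(docs: List[str], m: int, kappa: int) -> List[str]:
    """Patterns that match every document in `docs`."""
    return [p for p in patterns(m, kappa)
            if all(matches(p, d) for d in docs)]


def ones_union_size(docs: List[str]) -> int:
    """Number of positions where at least one document has a '1'."""
    return len({i for d in docs for i, c in enumerate(d) if c == "1"})
```

If the “1”s of the documents are spread over `u` positions, exactly `comb(m - u, kappa - u)` patterns match all of them, which is the quantity that bounds the measure of the intersection in the argument above.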

By guaranteeing that , Theorem 1.2 gives a space lower bound of . However, we would like to create an input of size which means the number of sampled documents must be and thus we must have . As , it follows that . Thus, . Thus, we have that

where the last step follows from , and .

Now we bound in terms of . From we obtain that . Remember that . Based on this, we get that and since the term is dominated and we have . It remains to handle the extra factor in the space lower bound. From Lemma 3.1, we know that . Based on the value of , this means which means is also absorbed in factor. It remains to verify one last thing: previously, we claimed that we would verify that . Using the bound this can be written as which translates to which clearly holds if .

3.2 The Space Lower Bound

We defer the details of our space lower bound to Appendix C, where we prove the following.

Theorem. Any pointer-machine data structure that answers WCI queries with wild cards in time over an input of size must use space, as long as , where is the output size, and is the pattern length.

Refer to the introduction for a discussion of the consequences of these lower bounds.

4 Two Pattern Document Indexing and Related Problems

Due to lack of space we only state the primary results for 2P, FP, 2FP, and SI. The proofs for Theorems 4 and 4 can be found in Appendix D and H respectively.

Theorem. Any data structure on the pointer machine for the 2P, FP, 2FP, and SI problems with query time and space usage must obey . Also, if the query time is for a constant , then .

The above theorem is proved using Theorem 1.2 which necessitates a randomized construction involving various high probability bounds. Unlike our lower bound for -GPI we were unable to find a deterministic construction that uses Theorem 1.2.

We also prove the following lower bound in the semi-group model, which addresses the difficulty of the counting variants of 2P and the related problems.

Theorem. Answering 2P, FP, 2FP, and SI queries in the semi-group model requires .

5 Conclusions

In this paper we proved unconditional and high space and query lower bounds for a number of problems in string indexing. Our main message is that the pointer machine model remains an extremely useful tool for proving lower bounds that are close to the true complexity of many problems. We have successfully demonstrated this fact in the area of string and document indexing. Within the landscape of lower bound techniques, the pointer machine model, fortunately or unfortunately, is the only model where we can achieve unconditional, versatile, and high lower bounds, and we believe more problems from the area of string and document indexing deserve to be considered in this model. To this end, we outline a number of open problems connected to our results.

  1. Is it possible to generalize the lower bound for 2P to the case where the two patterns are required to match within distance ? This is essentially a “dual” of the -GPI problem.

  2. Recall that our space lower bound for the WCI problem (Subsection 3.2) assumes that the query time is independent of the alphabet size . What if the query time is allowed to increase with ?

  3. Our query lower bound for the WCI (Subsection 3.1) is proved for a binary alphabet. Is it possible to prove lower bounds that take into account? Intuitively the problem should become more difficult as increases, but we were unable to obtain such bounds.

  4. We require certain bounds on for the WCI problem. Is it possible to remove or at least loosen them? Or perhaps, can the upper bounds be substantially improved?

  5. What is the difficulty of the -GPI problem when is large?

Acknowledgements. We thank Karl Bringmann for simplifying the construction of our documents for the 2P problem. We thank Moshe Lewenstein for bringing the WCI problem to our attention. We also thank the anonymous referees for pointing us to [24].

Appendix A Proof of Lemma 1.2

The proof is very similar to the one found in [19], but we count slightly differently to get a better dependency on . Recall the data structure is a graph where each node stores pointers and some input element. At the query time, the algorithm must explore a subset of the graph. The main idea is to show that the subsets explored by different queries cannot overlap too much, which would imply that there must be many vertices in the graph, i.e., a space lower bound.

By the assumptions above, a large fraction of the nodes visited at query time will be output nodes (i.e., the algorithm must output the value stored in that memory cell). We count the number of nodes in the graph by partitioning each query into sets with output nodes. By assumption 2, each such set will be counted at most times. We need the following fact:

Fact 1 ([2], Lemma 2).

Any binary tree of size with marked nodes can be partitioned into subtrees such that there are subtrees each with marked nodes and size , for any .

In this way we decompose all queries into these sets and count them. There are different queries, each query gives us sets. Now we have counted each set at most times, thus there are at least distinct sets with output nodes. On the other hand we know that starting from one node and following at most pointers we can reach at most different sets (Catalan number). In each of those sets there are at most possibilities for having a subset with marked nodes. This gives us an upper bound of for the number of possible sets with marked nodes. In conclusion we get

Appendix B Proof of Lemma 3.1

Proof.

Observe that the second stage of our construction guarantees property (ii). So it remains to prove the rest of the claims in the lemma.

Consider a sampled document (bit string) . Conditioned on the event that has been sampled, we compute the probability that gets removed in the second stage of our sampling (due to conflict with other sampled documents).

Let be the set of indices that describe the position of “1”s in . We know that by construction. Consider a subset with . By our construction, we know that if there are other sampled documents whose “1”s are at positions , then we will remove (as well as all those documents). We first compute the probability that this happens for a fixed set and then use the union bound. The total number of documents that have “1”s in positions is

(1)

and thus we expect to sample at most of them. We bound the probability that instead documents among them are sampled. We use the Chernoff bound by picking , and and we obtain that the probability that of these documents are sampled is bounded by

We use the union bound now. The total number of possible ways for picking the set is which means the probability that we remove document at the second stage of our construction is less than

if we pick for a large enough constant . Thus, most documents are expected to survive the second stage of our construction. Now we turn our attention to the patterns.

For a fixed pattern , there are documents that could match and among them we expect to sample documents. An easy application of the Chernoff bound can prove that with high probability, we will sample a constant fraction of this expected value, for every pattern, in the first stage of our construction. Since the probability of removing every document in the second stage is at most , the probability that a pattern is removed at the second stage is less than and thus, we expect a constant fraction of them to survive the second stage. ∎

Appendix C Space Lower bound for wild card

For this problem we use Chazelle’s framework (Lemma 1.2) and give a randomized construction for the input and query sets. Note that here we no longer assume a binary alphabet. In fact, we will vary the alphabet size. To be more precise, we assume that, given any input of size over an alphabet of size , there exists a data structure that can answer WCI queries of size with wild cards in time where is the output size. Note that this query time is forced to be independent of .

Unlike the case for the query lower bound, we build the set of queries in two stages. In the first stage, we consider all documents of length over the alphabet (that is ) and independently sample of them (with replacement) to form the initial set of input documents. And for queries, we consider the set of all strings of length over the alphabet containing exactly wild cards (recall that is the wild card character). In total we have queries. In the second stage, for a parameter , we consider all pairs of queries and remove both queries if the number of documents they both match is or more. No document is removed in this stage. We now want to find a value of such that we retain almost all of our queries after the second stage.
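The construction above can be sketched on a toy scale as follows. This is a minimal illustration under stated assumptions, not the construction with the paper's parameter values: the names `alphabet`, `length`, `count`, `wild`, and `threshold` are hypothetical stand-ins, the wild card is written `*`, and all queries are enumerated explicitly, which is feasible only for very small inputs.

```python
import itertools
import random

WILD = "*"  # assumed wild card character for this sketch


def matches(query, doc):
    """A query matches a document if they agree on every non-wild-card position."""
    return all(q == WILD or q == c for q, c in zip(query, doc))


def sample_documents(alphabet, length, count, rng):
    """Stage 1 (input): sample `count` random strings, with replacement."""
    return ["".join(rng.choice(alphabet) for _ in range(length)) for _ in range(count)]


def all_queries(alphabet, length, wild):
    """Stage 1 (queries): every string with exactly `wild` wild cards."""
    for pos in itertools.combinations(range(length), wild):
        fixed = [i for i in range(length) if i not in pos]
        for letters in itertools.product(alphabet, repeat=len(fixed)):
            q = [WILD] * length
            for i, ch in zip(fixed, letters):
                q[i] = ch
            yield "".join(q)


def prune_queries(queries, documents, threshold):
    """Stage 2: drop both queries of any pair sharing `threshold` or more documents."""
    # Documents are identified by index, so repeated samples count separately.
    answers = {q: {i for i, d in enumerate(documents) if matches(q, d)} for q in queries}
    bad = set()
    for q1, q2 in itertools.combinations(queries, 2):
        if len(answers[q1] & answers[q2]) >= threshold:
            bad.update((q1, q2))
    return [q for q in queries if q not in bad]


if __name__ == "__main__":
    rng = random.Random(0)
    docs = sample_documents("abc", length=4, count=30, rng=rng)
    queries = list(all_queries("abc", length=4, wild=2))
    kept = prune_queries(queries, docs, threshold=3)
    print(len(queries), "queries before pruning,", len(kept), "after")
```

No document is removed in the second stage; only the query set shrinks, and the choice of `threshold` controls how many queries survive.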

The probability that a fixed query matches a random document is . There are in total documents, meaning we expect a query to output documents. By an easy application of the Chernoff bound we can prove that with high probability, all queries output documents.

We now bound the number of queries that survive the second stage. First observe that if two queries do not have any wild card in the same position, then there is at most document that matches both. Secondly, observe that for a fixed query there are other queries sharing wild cards. We say these other queries are at distance from . For a fixed query, we prove that with constant probability it survives the second stage. This is accomplished by considering each individually and using a high concentration bound on each, and then using a union bound. Since there are different values for we bound the probability for each individual value by .

Now consider a pair of queries at distance . The expected number of documents in their intersection is . Letting be the random variable indicating the number of documents in their intersection, we get

Recall that there are values for and there are “neighbours” at distance , we want the following condition to hold:

We immediately observe that the query time, , should always be greater than (just for reading the query) and that there are never more than wild cards in a query. Picking for some and letting be sufficiently large, we can disregard the factors and