# Tractable Lineages on Treelike Instances: Limits and Extensions

## Abstract

Query evaluation on probabilistic databases is generally intractable (#P-hard). Existing dichotomy results [DBLP:conf/pods/DalviS07a, dalvi2012dichotomy, DBLP:conf/pods/FinkO14] have identified which queries are tractable (or safe), and connected them to tractable lineages [jha2013knowledge]. In our previous work [amarilli2015provenance], using different tools, we showed that query evaluation is linear-time on probabilistic databases for arbitrary monadic second-order queries, if we bound the treewidth of the instance.

In this paper, we study limitations and extensions of this result. First, for probabilistic query evaluation, we show that MSO tractability cannot extend beyond bounded treewidth: there are even FO queries that are hard on any efficiently constructible unbounded-treewidth class of graphs. This dichotomy relies on recent polynomial bounds on the extraction of planar graphs as minors [chekuri2014polynomial], and implies lower bounds in non-probabilistic settings, for query evaluation and match counting in subinstance-closed families. Second, we show how to explain our tractability result in terms of lineage: the lineage of MSO queries on bounded-treewidth instances can be represented as bounded-treewidth circuits, polynomial-size OBDDs, and linear-size d-DNNFs. By contrast, we can strengthen the previous dichotomy to lineages, and show that there are even UCQs with disequalities that have superpolynomial OBDDs on all unbounded-treewidth graph classes; we give a characterization of such queries. Last, we show how bounded-treewidth tractability explains the tractability of the inversion-free safe queries: we can rewrite their input instances to have bounded-treewidth.

## 1 Introduction

Many applications must deal with data which may be erroneous. This makes it necessary to extend relational database instances, to allow for uncertain facts. One of the simplest such formalisms [suciu2011probabilistic] is that of tuple-independent databases (TID): each tuple in the database is annotated with an independent probability of being present. The semantics of a TID instance is to see it as a concise representation of a probability distribution on standard non-probabilistic instances.

An important challenge when dealing with probabilistic data is that data management tasks become intractable. The main one is query evaluation: given an input database query $q$, for instance a conjunctive query, and given a relational instance $I$, determine the answers to $q$ on $I$. When $q$ is Boolean, we just ask whether $I$ satisfies $q$. The corresponding problem in the TID setting asks for the probability of $q$, that is, the total probability weight of the possible subsets of the TID instance $I$ that satisfy $q$. The query $q$ is usually assumed to be fixed, and we look at the complexity of this problem as a function of the input instance (or TID) $I$, that is, the data complexity. Sadly, while this task is highly tractable and parallelizable (in the complexity class $\mathrm{AC}^0$) in the non-probabilistic context, exact computation is generally intractable (#P-hard) on TID instances, even for simple conjunctive queries; see [dalvi2007efficient].

Faced with this intractability, two natural directions are possible. The first is to restrict the language of queries to focus on queries that are tractable on all instances, called safe. This has proven a very fruitful direction [DBLP:conf/pods/DalviS07a], culminating in the dichotomy result of Dalvi and Suciu [dalvi2012dichotomy]: the data complexity of a given union of conjunctive queries (UCQ) is either in PTIME or #P-hard. More recently, the safe non-repeated CQs with negation were characterized in [DBLP:conf/pods/FinkO14].

The second approach is to restrict the instances, to focus on instance families that are tractable for all queries in highly expressive languages. In a recent work [amarilli2015provenance], going through the setting of semiring provenance [green2007provenance], and using a celebrated result by Courcelle [courcelle1990monadic], we have started to explore this direction. We showed that, for queries in MSO (monadic second-order, a highly expressive language capturing UCQs), data complexity is linear-time on treelike instances, i.e., instances of treewidth bounded by a constant. Of course, this result says nothing of non-treelike instances, but covers use cases previously studied in their own right, such as probabilistic XML [cohen2009running] (without data values).

This new direction raises several important questions:

• First, is this the best that one can hope for? For probability evaluation, could the tractability on bounded-treewidth instances be generalized, e.g., to bounded clique-width instances [courcelle1993handle], as for MSO in the non-probabilistic case? More ambitiously, could we separate tractable and intractable instances with a dichotomy theorem?

• Second, can our bounded-treewidth tractability result be explained in terms of lineage? The lineage of a query intuitively represents how it can be satisfied on the instance, and can be used to compute its probability: for many fragments of safe queries [jha2013knowledge], tractability can be shown via a tractable representation of their lineage. In [amarilli2015provenance], we build a bounded-treewidth circuit representation of Boolean provenance. How does this compare to the usual lineage classes of OBDDs and d-DNNFs in knowledge compilation?

• Third, can we link the query-based tractability approach to our instance-based one? Can we explain the tractability of some safe queries by reducing them to query evaluation on treelike instances?

This paper answers all of these questions.

#### Contributions

Our first main result (in Section 4) is that bounded treewidth characterizes the tractable families of graphs for MSO queries in the probabilistic context. More precisely, we construct a query for which probability evaluation is intractable on any unbounded-treewidth family of graphs satisfying mild constructibility requirements; query evaluation is precisely $\mathrm{FP}^{\#\mathrm{P}}$-complete under randomized polynomial-time (RP) reductions. Thus, tractability on bounded-treewidth instances is really the best we can get, on arity-2 signatures. Surprisingly, we show that this query can be taken to be a (non-monotone) FO query; this is in stark contrast with non-probabilistic query evaluation [kreutzer2010lower, ganian2014lower] where FO queries are fixed-parameter tractable under much milder conditions than bounded treewidth [kreutzer2008algorithmic]. This provides the lower bound of a dichotomy, the upper bound being our result in [amarilli2015provenance].

In Section 5, we explain how this dichotomy result can be adapted to non-probabilistic MSO query evaluation and match counting on subgraph-closed graph families. While the necessity of bounded treewidth for non-probabilistic query evaluation was studied before [kreutzer2010lower, ganian2014lower], our use of a recent polynomial bound on grid minors [chekuri2014polynomial] allows us to obtain stronger results in this context, which we review. Our work thus answers the conjecture of [grohe2008logic] (Conjecture 8.3) for MSO, which [kreutzer2010lower] answered for $\mathrm{MSO}_2$, under similar complexity-theoretic assumptions.

In Section LABEL:sec:lineages, we move from probability evaluation to the computation of tractable lineages. Our tractability result in [amarilli2015provenance] computes a bounded-treewidth lineage of linear size for MSO queries on bounded-treewidth instances. We revisit this upper bound and show that we can compute an OBDD lineage of polynomial size (by results in [jha2012tractability]) and a d-DNNF lineage of linear size (a new result). We show that on bounded-pathwidth instances (a notion more restrictive than that of bounded-treewidth), we obtain a bounded-pathwidth lineage, and hence a constant-width OBDD (by [jha2012tractability]). Further, all these representations can be efficiently constructed.

We then reexamine the choice of representing provenance as a circuit rather than a formula, because this is unusual in the semiring provenance context of [amarilli2015provenance]. We show in Section LABEL:sec:formulae that some of the previous tractability results for lineage representations cannot extend to formula representations, via conciseness bounds on Boolean circuits and formulae. This sheds some light on the conciseness gap between circuit and formula representations of lineage.

We then move in Section 6 to our second main result, which applies to tractable OBDD lineages rather than tractable query evaluation. It shows a dichotomy on arity-2 signatures, for the weaker query language of UCQs with disequalities: while bounded-treewidth instances admit efficient OBDDs for such queries, any constructible unbounded-treewidth instance family must have superpolynomial OBDDs for some query (depending only on the signature).

Last, in Section LABEL:sec:safe, we connect our approach to query-based tractability conditions [DBLP:conf/pods/DalviS07a, dalvi2012dichotomy]. We show that, for safe UCQs that admit a concise OBDD representation (that is, precisely inversion-free UCQs from [jha2013knowledge]), one can rewrite any instance to a bounded-treewidth instance (actually, to a bounded-pathwidth one), such that the query lineage, and hence the query probability, remain the same. Thus, in this sense, safe queries are tractable because their input instances may as well be bounded-pathwidth.

#### Related work

Bounded-treewidth has been shown to be a sufficient condition for tractability of query evaluation (this is by Courcelle’s theorem [courcelle1990monadic], generalized to arbitrary relational structures in [flum2002query]), counting of query matches [arnborg1991easy], and probabilistic query evaluation [amarilli2015provenance].

For MSO query evaluation on non-probabilistic instances, bounded-treewidth is known not to be necessary, e.g., query evaluation is tractable assuming bounded clique-width [courcelle1993handle]. FO query evaluation is tractable assuming milder conditions [kreutzer2008algorithmic]. Two separate lines of work investigated the necessity of bounding the treewidth of instances to ensure the tractability of other data management tasks.

First, in [DBLP:conf/focs/Marx07, marx2010can], Marx shows that treewidth-based algorithms for binary constraint-satisfaction problems (CSP) are, assuming the exponential-time hypothesis, almost optimal: they can only be improved by a logarithmic factor. These works do not rely on the graph minor theorem [robertson1986graph5] as we do, as they preceded the results of [chekuri2014polynomial] that provide polynomial bounds on the size of grid minors: see the discussion in the Introduction of [marx2010can]. Instead, they characterize high treewidth via embeddings of low depth. The results of [DBLP:conf/focs/Marx07, marx2010can] were further applied to inference in undirected [DBLP:conf/uai/ChandrasekaranSH08] and directed [DBLP:conf/ecai/KwisthoutBG10] graphical models. All these works are specific to the setting and problem that they study, namely CSP and inference.

Second, another line of work [makowsky2003tree, kreutzer2010lower, ganian2014lower] has shown necessity of bounded treewidth when a class of graphs is closed under some operations: extracting topological minors in [makowsky2003tree], extracting subgraphs in [kreutzer2010lower], and extracting subgraphs and vertex relabeling in [ganian2014lower]. This requires that there are sufficiently many instances of high treewidth, through notions of strong unboundedness [kreutzer2010lower] and dense unboundedness [ganian2014lower]. We strengthen the results of [ganian2014lower] in Section 5.2 of this paper, using our techniques. None of these works consider probabilistic evaluation or match counting, which we do here.

Other related work is discussed throughout the paper, where relevant; in particular works related to lineages in Sections LABEL:sec:lineages to 6 and to safe queries in Section LABEL:sec:safe.

The next section (Section 2) presents preliminaries, and Section 3 gives our formal problem statement. We then move to our new results in Section 4 onwards.

For space reasons, we omit the full proofs for Sections 4, 5, and 6: they can be found in Chapter 6 of [thesis]. Full proofs of the other sections can be found in the appendix.

## 2 Preliminaries

#### Instances

A relational signature $\sigma$ is a set of relations, each having an arity. The signature is arity-$k$ if $k$ is the maximum arity of a relation in $\sigma$.

A relational instance $I$ (or simply $\sigma$-instance or instance) is a finite set of ground facts on the signature $\sigma$, and a class or family of instances $\mathcal{I}$ is just a (possibly infinite) set of instances. A subinstance of $I$ is a subset of its facts. We follow the active domain semantics, where the domain $\mathrm{dom}(I)$ of $I$ is the finite set of elements that occur in facts. Hence, for $I' \subseteq I$, $\mathrm{dom}(I')$ is the (possibly strict) subset of $\mathrm{dom}(I)$ formed of the elements that occur in facts of $I'$. The size of $I$, denoted $|I|$, is its number of facts.

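A homomorphism from a $\sigma$-instance $I$ to a $\sigma$-instance $I'$ is a function $h$ from the domain of $I$ to the domain of $I'$ such that, for all facts $R(a_1, \ldots, a_n) \in I$, we have $R(h(a_1), \ldots, h(a_n)) \in I'$. A homomorphism is an isomorphism if it is bijective and its inverse is also a homomorphism.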

#### Graphs

Throughout the paper, a graph will always be undirected, simple, and unlabeled, unless otherwise specified. Formally, we see a graph as an instance of the graph signature with a single predicate of arity  such that: (i) ; and (ii) . As we follow the active domain semantics, this implies that we disallow isolated vertices in graphs. The facts of are called edges. The set of vertices (or nodes) of a graph , denoted , is its domain. Two vertices and  of a graph  are adjacent if , and are then called the endpoints of the edge, and the edge is incident to them; two edges are incident if they share a vertex.

The degree of a vertex is the number of its adjacent vertices. For $k \in \mathbb{N}$, a graph is $k$-regular if all vertices have degree $k$. More generally, it is $K$-regular, where $K$ is a finite set of integers, if every vertex has degree $k$ for some $k \in K$. Finally, a graph is degree-$k$ if $k$ is the maximum of the degree of all its vertices, i.e., if it is $\{1, \ldots, k\}$-regular. A graph is planar if it can be drawn on the plane without edge crossings, in the standard sense [diestel].

A path of length  in a graph  is a set of edges that are all in ; the path is simple if all ’s are distinct. A cycle is a path of length where all vertices are distinct except that ; a graph is cyclic if it has a cycle. A graph is connected if there is a path from every vertex to every other vertex. A subdivision of a graph  is a graph obtained by replacing each edge by an arbitrary non-empty simple path (every node on this path being fresh except the endpoints of the original edge).

#### Treewidth and pathwidth

A tree $T$ is an acyclic connected graph (remember that graphs are undirected). A tree decomposition of a graph $G$ is a tree $T$ with a labeling function $\lambda$ from its nodes (called bags) to sets of vertices of $G$, ensuring: (i) for every edge $\{u, v\}$ of $G$, there is a bag $b$ such that $\lambda(b)$ contains both $u$ and $v$; (ii) for every node $v$ of $G$, the subtree of $T$ formed of all bags whose $\lambda$-image contains $v$ must be connected. The width of $T$ is $\max_{b} |\lambda(b)| - 1$. The treewidth of a graph $G$, denoted $\mathrm{tw}(G)$, is the minimum width of any tree decomposition of $G$.

The treewidth of a relational instance $I$, denoted $\mathrm{tw}(I)$, is defined as usual as the treewidth of its Gaifman graph, namely, the graph on the domain of $I$ that connects any two elements that co-occur in a fact. When the signature is arity-2, we can see an instance as a labeled graph, and the treewidth of $I$ is then exactly the treewidth of this graph.

A path decomposition is a tree decomposition whose underlying tree is also a path. The pathwidth of a graph is the minimum width of any path decomposition of the graph. The pathwidth of a relational instance is the pathwidth of its Gaifman graph.
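As an illustration of the two conditions above, the following minimal Python sketch checks whether a candidate labeling is a tree decomposition and computes its width. The function names and the encoding (edges as pairs, bags as a dictionary from bag identifiers to vertex sets) are assumptions made for this example; the sketch assumes that `tree_edges` already forms a tree and does not verify it.

```python
def is_tree_decomposition(graph_edges, tree_edges, bags):
    """Check conditions (i) and (ii) of a tree decomposition.

    graph_edges: pairs of vertices of the graph
    tree_edges:  pairs of bag identifiers (assumed to form a tree)
    bags:        dict mapping each bag identifier to a set of vertices
    """
    graph_edges = list(graph_edges)
    # Adjacency between bags of the decomposition tree.
    neighbors = {b: set() for b in bags}
    for b1, b2 in tree_edges:
        neighbors[b1].add(b2)
        neighbors[b2].add(b1)

    # Condition (i): each edge of the graph lies inside some bag.
    for u, v in graph_edges:
        if not any(u in bag and v in bag for bag in bags.values()):
            return False

    # Condition (ii): the bags containing a given vertex are connected.
    vertices = {x for edge in graph_edges for x in edge}
    for v in vertices:
        holders = {b for b, bag in bags.items() if v in bag}
        start = next(iter(holders))  # non-empty thanks to condition (i)
        seen, stack = {start}, [start]
        while stack:
            for nxt in neighbors[stack.pop()] & holders:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if seen != holders:
            return False
    return True


def width(bags):
    """Width of a decomposition: largest bag size minus one."""
    return max(len(bag) for bag in bags.values()) - 1
```

For instance, the 4-cycle on `a, b, c, d` admits the decomposition with bags `{a, b, c}` and `{a, c, d}` joined by one tree edge, of width 2. A path decomposition is the special case where `tree_edges` forms a path.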

#### Queries

A query on the signature $\sigma$ is a formula in second-order logic over predicates of $\sigma$, with its standard semantics. All queries that we consider have no constants; unless otherwise specified, they are Boolean, i.e., they have no free variable. We write $I \models q$ whenever an instance $I$ satisfies the query $q$. We will be especially interested in the language FO of first-order logical sentences (where second-order quantifications are disallowed) and the language MSO of monadic second-order logical sentences (where the only second-order quantifications are over unary predicates).

We will also consider the language CQ of conjunctive queries, i.e., existentially quantified conjunctions of atoms over the signature; the language $\mathrm{CQ}^{\neq}$ of conjunctive queries where additional atoms of the form $x \neq y$ (called disequality atoms) are allowed, where $x$ and $y$ are variables appearing in some regular atom; the language UCQ of unions of conjunctive queries, namely, disjunctions of CQs; and the language $\mathrm{UCQ}^{\neq}$ of disjunctions of $\mathrm{CQ}^{\neq}$ queries. The size of a $\mathrm{UCQ}^{\neq}$ query $q$ is its total number of atoms, i.e., the sum of the number of atoms in each disjunct of $q$.

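A homomorphism from a CQ $q$ to an instance $I$ is a mapping $h$ from the variables of $q$ to the domain of $I$ such that for each atom $R(x_1, \ldots, x_n)$ of $q$ we have $R(h(x_1), \ldots, h(x_n)) \in I$. For $\mathrm{CQ}^{\neq}$ queries, we require that $h(x) \neq h(y)$ whenever $q$ contains the disequality atom $x \neq y$. A homomorphism from a $\mathrm{UCQ}^{\neq}$ query $q$ to $I$ is a homomorphism from some disjunct of $q$ to $I$: it witnesses that $I \models q$. A match of a $\mathrm{UCQ}^{\neq}$ query $q$ on an instance $I$ is a subset of $I$ which is the image of a homomorphism from $q$ to $I$; a minimal match is a match that is minimal for inclusion.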

A query $q$ is monotone if $I \models q$ and $I \subseteq I'$ imply $I' \models q$ for any two instances $I, I'$. A query $q$ is closed under homomorphisms if we have $I' \models q$ whenever $I \models q$ and $I$ has a homomorphism to $I'$, for any $I$ and $I'$. UCQ is an example of a query class that is both monotone and closed under homomorphisms, while $\mathrm{UCQ}^{\neq}$ is monotone but not closed under homomorphisms.
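To make the notions of homomorphism and match concrete, here is a small brute-force Python sketch for a single conjunctive query with disequalities. The encoding (facts and atoms as `(relation, tuple)` pairs) and all function names are assumptions chosen for this illustration; the enumeration is exponential in the number of variables and is only meant for tiny examples.

```python
from itertools import product


def homomorphisms(atoms, diseqs, instance):
    """Enumerate homomorphisms from a CQ with disequalities to an instance.

    atoms:    list of (relation, tuple of variables)
    diseqs:   list of (x, y) pairs standing for the atoms x != y
    instance: set of facts, each a (relation, tuple of elements)
    """
    variables = sorted({v for _, args in atoms for v in args})
    domain = sorted({a for _, args in instance for a in args})
    for values in product(domain, repeat=len(variables)):
        h = dict(zip(variables, values))
        atoms_ok = all(
            (rel, tuple(h[v] for v in args)) in instance for rel, args in atoms
        )
        if atoms_ok and all(h[x] != h[y] for x, y in diseqs):
            yield h


def minimal_matches(atoms, diseqs, instance):
    """Images of homomorphisms that are minimal for inclusion."""
    matches = {
        frozenset((rel, tuple(h[v] for v in args)) for rel, args in atoms)
        for h in homomorphisms(atoms, diseqs, instance)
    }
    return {m for m in matches if not any(other < m for other in matches)}
```

With the atoms `E(x, y), E(y, z)` and the disequality `x != z`, the instance `{E(a, b), E(b, c)}` has a match, but its homomorphic image `{E(a, b), E(b, a)}` (obtained by mapping `c` to `a`) has none, which illustrates why queries with disequalities need not be closed under homomorphisms.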

## 3 Problem Statement

We study the problem of probability evaluation:

###### Definition 3.1.

Given an instance $I$, a probability valuation is a function $\pi$ that maps each fact of $I$ to a value in $[0, 1]$. A probability valuation defines a probability distribution on subinstances of $I$, which we also write $\pi$ by a slight abuse of notation. The distribution is intuitively obtained by seeing each fact $F \in I$ as kept with probability $\pi(F)$ and removed with probability $1 - \pi(F)$, all such choices being independent. Formally, the probability of $I' \subseteq I$ in this distribution is:

$$\pi(I') := \prod_{F \in I'} \pi(F) \prod_{F \in I \setminus I'} (1 - \pi(F))$$

The probability evaluation problem for a query $q$ on a class $\mathcal{I}$ of relational instances asks, given an instance $I \in \mathcal{I}$ and a probability valuation $\pi$ on $I$, what is the probability that $q$ holds according to the probability distribution, i.e., it is the problem of computing $\sum_{I' \subseteq I,\ I' \models q} \pi(I')$.

In other words, probability evaluation asks for the probability of $q$ over a TID instance defined by $I$ and $\pi$. Note that we only consider classes of instances with no associated probabilities, and the probability valuation is given as an additional input: it is not indicated in $\mathcal{I}$. The complexity of the probability evaluation problem will always be studied in data complexity: the query and the class are fixed, and the input is the instance and the probability valuation.
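The definition can be illustrated by a naive Python sketch that enumerates all subinstances and sums the weights of those satisfying the query. This is only a reference implementation of the semantics, exponential in the number of facts, and not one of the algorithms studied in this paper; the encoding of facts, the function names, and the example query are assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import combinations


def probability(instance, valuation, query):
    """Probability that `query` holds on a random subinstance.

    instance:  iterable of facts
    valuation: dict mapping each fact to its probability (a Fraction)
    query:     function from a frozenset of facts to a bool
    """
    facts = list(instance)
    total = Fraction(0)
    for size in range(len(facts) + 1):
        for kept in combinations(facts, size):
            world = frozenset(kept)
            if query(world):
                # Weight of this possible world: product over kept and removed facts.
                weight = Fraction(1)
                for fact in facts:
                    p = valuation[fact]
                    weight *= p if fact in world else 1 - p
                total += weight
    return total


def q_rst(world):
    """Hypothetical example query: exists x, y with R(x), S(x, y), T(y)."""
    return any(
        rel == "S" and ("R", (args[0],)) in world and ("T", (args[1],)) in world
        for rel, args in world
    )
```

For the instance `{R(a), S(a, b), T(b)}` with respective probabilities 1/2, 1/3 and 1/4, the only possible world satisfying `q_rst` keeps all three facts, so the computed probability is 1/24.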

We also explore the problem of computing tractable lineages (or provenance), defined and studied in Section LABEL:sec:lineages onwards.

We rely on results of [amarilli2015provenance] that show the tractability in data complexity of provenance computation and probability evaluation on treelike (i.e., bounded-treewidth) instances. This holds for guarded second-order queries, but as such queries collapse to MSO under bounded treewidth [gradel2002back], we always use MSO queries here. First, [amarilli2015provenance] shows that we can construct Boolean circuits that represent the provenance of MSO queries on treelike instances; we can also construct monotone circuits for monotone queries. The results also apply to other semirings, but this will not be our focus here. Second, [amarilli2015provenance] shows that probability evaluation is then tractable, namely, ra-linear: in linear time up to the (polynomial) cost of arithmetic operations.

Our goal is thus to investigate to what extent we can generalize the following tractability result from [amarilli2015provenance]:

###### Theorem 3.2 ([amarilli2015provenance]).

For any signature $\sigma$, for any (monotone) MSO query $q$, for any $k \in \mathbb{N}$, there is an algorithm which, given an input instance $I$ of treewidth at most $k$:

• Computes a (monotone) Boolean provenance circuit of $q$ on $I$, in linear time in $|I|$;

• Given a probability valuation $\pi$ of $I$, computes the probability of $q$ on $I$, in ra-linear time.

We first focus on the second point (probability evaluation) in Section 4, followed by a digression about non-probabilistic evaluation in Section 5. We then study the first point (lineages) in Sections LABEL:sec:lineages to 6. We close with a connection to safe queries in Section LABEL:sec:safe.

## 4 Probability Evaluation

This section studies whether we can extend the above tractability result by lifting the bounded-treewidth requirement. We answer in the negative by a dichotomy result on arity-two signatures: there are queries for which probabilistic evaluation is tractable on bounded-treewidth families but is intractable on any efficiently constructible unbounded-treewidth family. A first technical issue is to formalize what we mean by efficiently constructible. We use the following notion:

###### Definition 4.1.

We call a class $\mathcal{I}$ of instances treewidth-constructible if for all $k \in \mathbb{N}$, if $\mathcal{I}$ contains instances of treewidth at least $k$, we can construct one in polynomial time given $k$ written in unary.

In particular, this implies that $\mathcal{I}$ must contain a subfamily of unbounded-treewidth instances that are small, i.e., have size polynomial in their treewidth. We discuss the impact of this choice of definition, and alternate definitions of efficiently constructible instances, in Section 5.

A second technical issue is that we need to restrict to signatures of arity 2. We will then show our dichotomy for any such signature. This suffices to show that Theorem 3.2 cannot be generalized: its generalization should apply to any signature, in particular arity-2 ones. Yet, we do not know whether the dichotomy applies to signatures of arity greater than 2.

Our main result on probability evaluation is as follows. In this result, $\mathrm{FP}^{\#\mathrm{P}}$ is the class of function problems which can be solved in PTIME with a deterministic Turing machine having access to a #P-oracle, i.e., an oracle for counting problems that can be expressed as the number of accepting paths for a nondeterministic PTIME Turing machine.

###### Theorem 4.2.

Let $\sigma$ be an arbitrary arity-2 signature. Let $\mathcal{I}$ be a treewidth-constructible class of $\sigma$-instances. Then the following dichotomy holds:

• If there is $k \in \mathbb{N}$ such that the treewidth of $I$ is at most $k$ for every $I \in \mathcal{I}$, then for every MSO query $q$, the probability evaluation problem for $q$ on instances of $\mathcal{I}$ is solvable in ra-linear time.

• Otherwise, there is an FO query $q_h$ (depending on $\sigma$ but not on $\mathcal{I}$) such that the probability evaluation problem for $q_h$ on $\mathcal{I}$ is $\mathrm{FP}^{\#\mathrm{P}}$-complete under randomized polynomial time (RP) reductions.

The first part of this result is precisely the second point of Theorem 3.2. We thus sketch the proof of the hardness result of the second part. Pay close attention to the statement: while some FO queries (in particular, unsafe CQs [dalvi2012dichotomy]) may have $\mathrm{FP}^{\#\mathrm{P}}$-hard probability evaluation when all input instances are allowed, our goal here is to build a query that is hard even when input instances are restricted to arbitrary families satisfying our conditions, a much harder claim.

We reduce from the problem of counting graph matchings, i.e., computing the number of edge subsets of a graph that have no pair of incident edges. This problem is known to be #P-hard on 3-regular planar graphs [xia2007computational]. We define an FO query $q_h$ that tests for matchings on such graphs (encoded in a certain way), and we rely on the connection between probability evaluation and model counting so that the probability of $q_h$ on (an encoding of) a graph $G$ reflects its number of matchings.
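The connection between model counting and probability evaluation can be illustrated on the source problem itself. The following Python sketch counts matchings by brute force and relates the count to the probability that a uniformly random edge subset is a matching, which is what one obtains when every edge has probability 1/2. It does not reproduce the encoding or the query used in the actual reduction; the function names are assumptions for this example, and the enumeration is exponential in the number of edges.

```python
from fractions import Fraction
from itertools import combinations


def is_matching(edge_subset):
    """True if no two edges of the subset share a vertex."""
    covered = set()
    for u, v in edge_subset:
        if u in covered or v in covered:
            return False
        covered.update((u, v))
    return True


def count_matchings(edges):
    """Number of edge subsets (including the empty one) that are matchings."""
    edges = list(edges)
    return sum(
        1
        for size in range(len(edges) + 1)
        for subset in combinations(edges, size)
        if is_matching(subset)
    )


def matching_probability(edges):
    """Probability that a random subset is a matching, each edge kept with 1/2."""
    edges = list(edges)
    return Fraction(count_matchings(edges), 2 ** len(edges))
```

On the complete graph on four vertices, which is 3-regular and planar, there are 10 matchings among the 64 edge subsets, so the probability is 10/64. Conversely, multiplying the probability by the number of possible worlds recovers the count.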

The main idea is that 3-regular planar graphs can be extracted from our family $\mathcal{I}$, using the following notion:

###### Definition 4.3.

An embedding of a graph $H$ in a graph $G$ is an injective mapping $f$ from the vertices of $H$ to the vertices of $G$ and a mapping $g$ that maps the edges $\{u, v\}$ of $H$ to paths in $G$ from $f(u)$ to $f(v)$, with all paths being vertex-disjoint. A graph $H$ is a topological minor of a graph $G$ if there is an embedding of $H$ in $G$.
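The definition of an embedding can be checked mechanically, as in the Python sketch below. It assumes that "vertex-disjoint" means that paths may only share their endpoints, i.e., that the internal vertices of a path avoid all other paths and all images of vertices; this reading, the encoding of paths as vertex lists, and the function name are assumptions made for the illustration.

```python
def is_embedding(h_edges, g_edges, f, paths):
    """Check that (f, paths) embeds a graph H into a graph G.

    h_edges: list of pairs (u, v), the edges of H
    g_edges: list of pairs, the edges of G
    f:       dict from vertices of H to vertices of G
    paths:   dict from each pair (u, v) of h_edges to a list of vertices of G
    """
    g_adjacent = {frozenset(edge) for edge in g_edges}
    images = set(f.values())
    if len(images) != len(f):
        return False  # f is not injective
    used_internal = set()
    for u, v in h_edges:
        path = paths[(u, v)]
        # The path must go from f(u) to f(v) without repeating vertices.
        if len(path) < 2 or path[0] != f[u] or path[-1] != f[v]:
            return False
        if len(set(path)) != len(path):
            return False
        # Consecutive vertices of the path must be adjacent in G.
        if any(frozenset(step) not in g_adjacent for step in zip(path, path[1:])):
            return False
        # Internal vertices must avoid vertex images and other paths.
        internal = set(path[1:-1])
        if internal & images or internal & used_internal:
            return False
        used_internal |= internal
    return True
```

For example, a triangle embeds in a 6-cycle by sending its three vertices to every other vertex of the cycle and each edge to the path of length 2 between the corresponding images; the 6-cycle is then a subdivision of the triangle.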

We then use the following lemma, that rephrases the recent polynomial bound [chekuri2014polynomial] on Robertson and Seymour’s grid minor theorem [robertson1986graph5] to the realm of topological minors; in so doing, we use the folklore observation that a degree-3 minor of a graph is always a topological minor:

###### Lemma 4.4.

There is $c \in \mathbb{N}$ such that for any degree-3 planar graph $H$, for any graph $G$ of treewidth at least $|V(H)|^c$, $H$ is a topological minor of $G$ and an embedding of $H$ in $G$ can be computed in randomized polynomial time in $|G|$.

Hence, intuitively, given an input 3-regular planar graph $G$ (the input to the hard problem), we can extract it in randomized polynomial time (RP) as a topological minor of (the Gaifman graph of) an instance $I$ of our family $\mathcal{I}$ that we obtain using treewidth-constructibility. Once it is extracted, we show that, by choosing the right probability valuation for $I$, the probability of $q_h$ on $I$ allows us to reconstruct the answer to the original hard problem, namely, the number of matchings of $G$. The minor extraction step is what complicates the design of $q_h$, as $q_h$ must then test for matchings in a way which is invariant under subdivisions: this is especially tricky in FO as we can only make local tests.

#### Choice of hard query

Not only is our query $q_h$ independent from the class of instances $\mathcal{I}$, but it is also an FO query, so, in the non-probabilistic setting, its data complexity on any instance is in $\mathrm{AC}^0$. In fact, our choice of $q_h$ also has linear-time data complexity: one can determine in linear time in an input instance $I$ whether $I \models q_h$. This contrasts sharply with the $\mathrm{FP}^{\#\mathrm{P}}$-completeness (under RP reductions) of probability evaluation for $q_h$ on any unbounded-treewidth instance class (if it is treewidth-constructible).

The query $q_h$, however, is not monotone. We can alternatively show Theorem 4.2 for an MSO query which is monotone, but not in FO: more specifically, we use a query in $\mathrm{C2RPQ}^{\neq}$, the class of conjunctive two-way regular path queries [DBLP:conf/kr/CalvaneseGLV00, DBLP:journals/tcs/CalvaneseGV05] where we additionally allow disequalities between variables.

We will show an analogue of Theorem 4.2 in the setting of tractable lineages in Section 6, which applies to $\mathrm{UCQ}^{\neq}$, an even weaker language. We do not know whether Theorem 4.2 itself can be shown with such queries, or with a monotone FO query. However, we know that Theorem 4.2 could not be shown with a query closed under homomorphisms; this is implied by Proposition 6.9.

#### Providing valuations with the instances

When we fix the instance family $\mathcal{I}$, the probability valuation is not prescribed as part of the family, but can be freely chosen. If the instances of $\mathcal{I}$ were provided with their probability valuations, or if probability valuations were forced to be $1/2$, then it is unlikely that an analogue of Theorem 4.2 would hold.

Indeed, fix any query $q$ such that, given any instance $I$, it is in #P to count how many subinstances of $I$ satisfy $q$; e.g., let $q$ be a CQ. Consider a family $\mathcal{I}$ of instances with valuations such that there is only one instance in $\mathcal{I}$ per encoding length: e.g., take the class of $R$-grids with probability $1/2$ on each edge, for some binary relation $R$. Consider the problem, given the length of the encoding of an instance $I \in \mathcal{I}$ (written in unary), of computing how many subinstances of $I$ satisfy $q$. This problem is in the class $\#\mathrm{P}_1$ [valiant1979complexity]. Hence, the probability computation problem for $q$ on $\mathcal{I}$ is in $\mathrm{FP}^{\#\mathrm{P}_1}$: rewrite the encoding of the input instance $I$ to a word of the same length in a unary alphabet, use the $\#\mathrm{P}_1$-oracle to compute the number of subinstances, and normalize the result by dividing by the number of possible worlds of $I$.

It thus seems unlikely that probabilistic evaluation of $q$ on $\mathcal{I}$ with its valuations is #P-hard, so that our dichotomy result probably does not adapt if input instance families are provided with their valuations.

## 5 Non-Probabilistic Evaluation

Theorem 4.2 in Section 4 uses the recent technology of [chekuri2014polynomial] that shows polynomial bounds for the grid minor theorem of [robertson1986graph5]. These improved bounds also yield new results in the non-probabilistic setting. We accordingly study in this section the problem of non-probabilistic query evaluation, again defined in terms of data complexity:

###### Definition 5.1.

The evaluation problem (or model-checking problem), for a fixed query $q$ on an instance family $\mathcal{I}$, is as follows: given an instance $I \in \mathcal{I}$, decide whether $I \models q$.

Observe that the probability evaluation problem in Section 4 allowed the valuation to set edges to have probability $0$. We could thus restrict to any subinstance of an instance in the class $\mathcal{I}$. In other words, the freedom to choose valuations in probability evaluation gave us at least the possibility of choosing subinstances for non-probabilistic query evaluation. This is why we will study in this section the non-probabilistic query evaluation problem on instance classes $\mathcal{I}$ which are closed under taking subinstances (or subinstance-closed), namely, for any $I \in \mathcal{I}$ and $I' \subseteq I$, we have $I' \in \mathcal{I}$.

As before, we will prove dichotomy results for this problem on unbounded-treewidth instance families, though we will use an MSO query rather than an FO query. We give two phrasings of our results. The first one, in Section 5.1, still requires treewidth-constructibility, and shows hardness for every level of the polynomial hierarchy, again under RP reductions. The second phrasing, in Section 5.2, is inspired by the results of [ganian2014lower], which it generalizes: it relies on complexity assumptions (namely, the non-uniform exponential time hypothesis) but works with a weaker notion of constructibility, namely, it requires treewidth to be strongly unbounded poly-logarithmically.

Last, we study in Section 5.3 the problem of match counting in the non-probabilistic setting, for which no analogous results seemed to exist.

As in Section 4, we restrict to signatures of arity 2.

### 5.1 Hardness Formulation

Our first dichotomy result for non-probabilistic MSO query evaluation is as follows; it is phrased using the notion of treewidth-constructibility. In this result, $\Sigma^{\mathrm{P}}_i$ denotes the complexity class at the $i$-th existential level of the polynomial hierarchy.

###### Theorem 5.2.

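Let $\sigma$ be an arbitrary arity-2 signature. Let $\mathcal{I}$ be a class of $\sigma$-instances which is treewidth-constructible and subinstance-closed. The following dichotomy holds:

• If there exists $k \in \mathbb{N}$ such that the treewidth of $I$ is at most $k$ for every $I \in \mathcal{I}$, then for every MSO query $q$, the evaluation problem for $q$ on $\mathcal{I}$ is solvable in linear time.

• Otherwise, for each $i \in \mathbb{N}$, there is an MSO query $q_i$ (depending only on $\sigma$, not on $\mathcal{I}$) such that the evaluation problem for $q_i$ on $\mathcal{I}$ is $\Sigma^{\mathrm{P}}_i$-hard under RP reductions.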

The upper bound is by Courcelle’s results [courcelle1990monadic, flum2002query], so our contribution is the hardness part, which we now sketch.

The main thing to change relative to the proof of Theorem 4.2 is the hard problems from which we reduce. We use hard problems on planar -regular graphs, which we obtain from the alternating coloring problem as [ganian2014lower, ganian2010there], restricted to such graphs using techniques shown there, plus an additional construction to remove vertex labellings. Here is our formal claim about the existence of such hard problems:

###### Lemma 5.3.

For any , there exists an MSO formula on the signature of graphs such that the evaluation of on planar -regular graphs is -hard. Moreover, for any such graph , we have iff for any subdivision of .

The rest of the proof of Theorem 5.2 proceeds similarly as that of Theorem 4.2.

#### Hypotheses

Theorem 5.2 relies crucially on the class $\mathcal{I}$ being subinstance-closed. Otherwise, consider the class of cliques of a single binary relation: this class clearly has unbounded treewidth and is treewidth-constructible, yet it has bounded clique-width, so MSO query evaluation has linear data complexity on this class [courcelle2000linear].

Further, the hypothesis of treewidth-constructibility is also crucial. Without this assumption, Proposition 32 of [makowsky2003tree] shows the existence of graph families of unbounded treewidth which are subinstance-closed yet for which MSO query evaluation is in PTIME.

### 5.2 Alternate Formulation

We now give an alternative phrasing of Theorem 5.2 which connects it to the existing results of [kreutzer2010lower, ganian2014lower]. Table 1 tersely summarizes their results in comparison to our own results and other related results. As [kreutzer2010lower, ganian2014lower] are phrased in terms of graphs, and not arbitrary arity-2 relational instances, we do so as well in this subsection. Before stating our result, we summarize these earlier works to explain how our work relates to them.

[kreutzer2010lower, ganian2014lower] show the intractability of MSO on any subgraph-closed unbounded-treewidth family of graphs, under finer notions than our treewidth-constructibility. Kreutzer and Tazari [kreutzer2010lower] proposed the notion of families of graphs with treewidth strongly unbounded poly-logarithmically and showed that MSO$_2$ (MSO with quantifications over both vertex- and edge-sets) over any such graph family is not fixed-parameter tractable in a strong sense (it is not in XP), unless the exponential-time hypothesis (ETH) fails. Ganian et al. [ganian2014lower] proved a related result, introducing the weaker notion of densely unbounded poly-logarithmically but requiring graph families to be closed under vertex relabeling; in such a setting, Theorem 4.1 of [ganian2014lower] shows that MSO$_1$ (with vertex labels) cannot be fixed-parameter quasi-polynomial unless the non-uniform exponential-time hypothesis fails.

These two results of [kreutzer2010lower] and [ganian2014lower] are incomparable: [kreutzer2010lower] requires a stronger unboundedness notion (strongly unbounded vs densely unbounded) and a stronger query language (MSO$_2$ vs MSO$_1$), but it does not require vertex relabeling, and makes a weaker complexity-theoretic assumption (ETH vs non-uniform ETH). See the Introduction of [ganian2014lower] for a detailed comparison.

Our Theorem 5.2 uses MSO and no vertex labeling, but it requires treewidth-constructibility, which is stronger than densely/strongly poly-logarithmic unboundedness: strong unboundedness only requires constructibility in and dense unboundedness does not require constructibility at all. The advantage of treewidth-constructibility is that we are able to show hardness of our problem (under RP reductions) without making any complexity assumptions. However, if we make the same complexity-theoretic hypotheses as [ganian2014lower], we now show that we can phrase our results in a similar way to theirs, and thus strengthen them.

We accordingly recall the notion of densely poly-logarithmic unboundedness, i.e., Definition 3.3 of [ganian2014lower]:

###### Definition 5.4.

A graph class  has treewidth densely unbounded poly-logarithmically if for all , for all , there exists a graph such that and .

We now state our intractability result on graph classes whose treewidth is densely unbounded poly-logarithmically. It is identical to Theorem 5.5 of [ganian2014lower] but applies to arbitrary MSO formulae, without a need for vertex relabeling; in the result, PH denotes the polynomial hierarchy. This result answers Conjecture 8.3 of [grohe2008logic] (as we pointed out in the Introduction).

###### Theorem 5.5.

Unless , there is no graph class satisfying all three properties:

1. is closed under taking subgraphs;

2. the treewidth of is densely unbounded poly-logarithmically;

3. the evaluation problem for any MSO query on is quasi-polynomial, i.e., in time for , an arbitrary constant , and some computable function .

The proof technique is essentially the same as in [ganian2014lower], up to using the newer results of [chekuri2014polynomial]. It is immediate that an analogous result holds for probabilistic query evaluation, as standard query evaluation obviously reduces to it (take the probability valuation giving probability $1$ to each fact).

### 5.3 Match Counting

We conclude this section by moving to the problem of match counting, i.e., counting how many assignments satisfy a non-Boolean MSO formula. Match counting should not be confused with model counting (counting how many subinstances satisfy a Boolean formula), which is closely related to probability evaluation.

To our knowledge, no dichotomy-like result on match counting for MSO queries was known. This section shows such a result; as in Section 5.1, we assume treewidth-constructibility, closure under subinstances, and arity-2 signatures.

We define the match counting problem as follows:

###### Definition 5.6.

The counting problem for an MSO formula (with free second-order variables) on an instance family is the problem, given an instance , of counting how many vectors of domain subsets are such that satisfies .

The restriction to free second-order variables is without loss of generality, as free first-order variables can be rewritten to free second-order ones, asserting in the formula that they must be interpreted as singletons.
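To make the definition concrete, here is a minimal brute-force sketch of match counting for a formula with one free second-order variable. The example formula (the set `X` is independent in the graph) and the names `count_matches` and `independent` are illustrative assumptions and do not come from the text; the enumeration is exponential and only meant for tiny instances.

```python
from itertools import combinations

def count_matches(domain, edges, phi):
    """Count the subsets X of the domain such that phi(X, edges) holds."""
    count = 0
    for size in range(len(domain) + 1):
        for subset in combinations(domain, size):
            if phi(set(subset), edges):
                count += 1
    return count

def independent(X, edges):
    """Illustrative MSO-definable property: no edge has both endpoints in X."""
    return not any(u in X and v in X for (u, v) in edges)

# Hypothetical instance: the path a - b - c.
domain = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]
matches = count_matches(domain, edges, independent)
```

On this path, the matching subsets are the empty set, the three singletons, and `{a, c}`.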

We show the following dichotomy result:

###### Theorem 5.7.

Let be an arbitrary arity-2 signature. Let be a subinstance-closed and treewidth-constructible class of -instances. The following dichotomy holds:

• If there exists such that for every , then for every MSO query with free second-order variables, the counting problem for on  is solvable in ra-linear time.

• Otherwise, there is an MSO query (depending only on , not on ) with one free second-order variable such that the counting problem for on  is -complete under RP reductions.

The first claim is shown in [arnborg1991easy]. The proof of the second claim proceeds as for Theorem 4.2. We reduce from the problem of counting Hamiltonian cycles in planar 3-regular graphs , which is #P-hard [LiskiewiczOT03], and which we express in MSO on the incidence graph of .

Unlike in Theorem 4.2, the query does not have tractable model checking (as opposed to probability evaluation). We do not know whether we can show a similar result with such a tractable query.

## Lineage Upper Bounds

From probability evaluation in Section 4 (and its non-probabilistic variants in Section 5), we now turn to our second problem: the study of tractable lineage representations.

Indeed, a common way to achieve tractable probability evaluation is to represent the lineage of queries on input instances in a tractable formalism [jha2013knowledge]. This section shows how the tractability of MSO probability evaluation on bounded-treewidth instances can be explained via lineages: Table 2 (upper part) summarizes the upper bounds that we prove.

Intuitively, the lineage of a query on an instance describes how the query depends on the facts of the instance. Formally:

###### Definition 5.8.

The lineage of a query on an instance  is a Boolean function whose variables are the facts of , such that, for any , iff the corresponding valuation makes true. If is monotone, then is a monotone Boolean function, in which case it can equivalently be called the $\mathrm{PosBool}[X]$-provenance [green2007provenance] of on .

Lineages are related to probability evaluation, because evaluating the probability of query  under a probability valuation of instance  amounts to evaluating the probability of the lineage , under the corresponding probability valuation on variables. Thus, if we can represent in a formalism that enjoys tractable probability computation, then we can tractably evaluate the probability of  on .
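The connection can be illustrated by a naive computation that sums the weights of all valuations satisfying the lineage. This is only a sketch under assumptions: the facts, their probabilities, and the query ("some fact is present") are hypothetical, and the enumeration is exponential, which is precisely what tractable lineage formalisms avoid.

```python
from itertools import product

def lineage_probability(facts, prob, lineage):
    """Sum the weights of all valuations of the facts that satisfy the lineage."""
    total = 0.0
    for bits in product([False, True], repeat=len(facts)):
        world = {f for f, b in zip(facts, bits) if b}
        if lineage(world):
            weight = 1.0
            for f, b in zip(facts, bits):
                # Facts are independent: multiply p if present, 1 - p if absent.
                weight *= prob[f] if b else 1.0 - prob[f]
            total += weight
    return total

# Hypothetical TID instance with two unary facts.
facts = ["R(a)", "R(b)"]
prob = {"R(a)": 0.5, "R(b)": 0.25}
# Lineage of the (assumed) query "there exists an R-fact".
some_fact = lambda world: len(world) > 0
p = lineage_probability(facts, prob, some_fact)
```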

In this section, we show that MSO queries on bounded-treewidth instances admit tractable lineage representations in many common formalisms: they have linear-size bounded-treewidth Boolean circuits (as shown in [amarilli2015provenance]), but also have polynomial-size OBDDs [bryant1992symbolic, olteanu2008using] (with a stronger claim for bounded-pathwidth), as well as linear-size d-DNNFs [darwiche2001tractable]. Further, as we show, all these lineage representations can be efficiently computed. Note that in all these results, as in [amarilli2015provenance], tractability only refers to data complexity, with large constant factors in the query and instance width: we leave to future work the study of query and instance classes for which lineage computation enjoys a lower combined complexity.

After our results on tractable lineage computations in this section, we will investigate in the next section whether we can represent the lineage as Boolean formulae (such as read-once formulae [jha2013knowledge]), and we will show superlinear lower bounds for them. Section 6 will then study in which sense bounded-treewidth is necessary to obtain tractable lineages.

This section applies to signatures of arbitrary arity.

#### Bounded-treewidth circuits

We first recall our results from [amarilli2015provenance] and introduce a first representation of lineages: Boolean circuits, called provenance circuits in [amarilli2015provenance] and expression DAGs in [jha2012tractability]:

###### Definition 5.9.

A lineage circuit for query and instance  is a Boolean circuit with input gates and with NOT, OR, and AND internal gates, whose inputs are the facts of the database, and which computes the lineage of on . A monotone lineage circuit has no NOT gate. The treewidth and pathwidth of a lineage circuit are those of the circuit’s graph.

As recalled in Theorem 3.2, [amarilli2015provenance] showed that we can compute (monotone) lineage circuits for (monotone) MSO queries on bounded-treewidth instances in linear time. Further, these circuits themselves have bounded-treewidth, which is why probability evaluation is tractable on them, using message passing algorithms [lauritzen1988local]. Hence:

###### Theorem 5.10 ([amarilli2015provenance], Theorems 4.4 and 5.3).

For any fixed MSO query and constant , given an input instance of treewidth , we can compute in linear time a bounded-treewidth lineage circuit of on .

If is monotone then we can take to be monotone.

We study how to adapt this to other tractable lineage representations.

#### OBDDs

We start by defining OBDDs, a common tractable representation of Boolean functions [bryant1992symbolic, olteanu2008using]:

###### Definition 5.11.

An ordered binary decision diagram (or OBDD) is a rooted directed acyclic graph (DAG) whose leaves are labeled or , and whose non-leaf nodes are labeled with a variable and have two outgoing edges labeled and . We require that there exists a total order on the variables such that, for every path from the root to a leaf, no variable occurs in two different internal nodes on the path, and the order in which the variables occur is compatible with .

An OBDD defines a Boolean function on its variables: each valuation is mapped to the value of the leaf reached from the root by following the path given by the valuation.

The size of an OBDD is its number of nodes, and its width is the maximum number of nodes at any level, where a level is the set of nodes reachable by enumerating all possible values of variables in a prefix of .
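As an illustration of these definitions, the following sketch encodes a small OBDD as a dictionary, evaluates it on a valuation by following a root-to-leaf path, and computes its probability under independent variable probabilities in one bottom-up pass. The encoding, the node names, and the example function $(x_1 \wedge x_2) \vee x_3$ are assumptions made for this example only.

```python
from functools import lru_cache

# Nodes: name -> (variable, child_if_false, child_if_true); leaves are booleans.
# Hypothetical OBDD for (x1 AND x2) OR x3 under the variable order x1 < x2 < x3.
OBDD = {
    "n1": ("x1", "n3", "n2"),
    "n2": ("x2", "n3", True),
    "n3": ("x3", False, True),
}
ROOT = "n1"

def evaluate(node, valuation):
    """Follow the path selected by the valuation until a leaf is reached."""
    while not isinstance(node, bool):
        var, lo, hi = OBDD[node]
        node = hi if valuation[var] else lo
    return node

def probability(node, prob):
    """Probability that the OBDD evaluates to true, one visit per node."""
    @lru_cache(maxsize=None)
    def go(n):
        if isinstance(n, bool):
            return 1.0 if n else 0.0
        var, lo, hi = OBDD[n]
        # The two branches are disjoint events on the tested variable.
        return (1.0 - prob[var]) * go(lo) + prob[var] * go(hi)
    return go(node)

half = {"x1": 0.5, "x2": 0.5, "x3": 0.5}
p = probability(ROOT, half)
```

Memoizing on nodes rather than on paths is what keeps the computation linear in the size of the OBDD.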

Probability evaluation for OBDDs is tractable [olteanu2008using]. Our result is that we can compute polynomial-size OBDDs for MSO queries on bounded-treewidth instances in PTIME:

###### Theorem 5.12.

For any fixed MSO query and constant , there is such that, given an input instance of treewidth , one can compute in time an OBDD (of size ) for the lineage of  on .

We show this using Corollary 2.14 of [jha2012tractability]: any bounded-treewidth Boolean circuit can be represented by an equivalent OBDD of polynomial width. We complete this result and show that the OBDD can also be computed in polynomial time, which clearly implies Theorem 5.12 (using Theorem 5.10):

###### Lemma (lem:makeobddtw).

For any , there is such that, given a Boolean circuit of treewidth , we can compute an equivalent OBDD in time .

###### Proof sketch.

The variable order is that of [jha2012tractability] and can be constructed in PTIME. We then show that we can build the polynomial-size OBDD level by level, by testing the equivalence of partial valuations in PTIME. We do it using message passing, thanks to the tree decomposition of . As in [jha2012tractability], is doubly exponential in .

###### Proof.

We rely on Corollary 2.14 of [jha2012tractability]: there is a doubly exponential function  such that, for any , there is such that, for any tree decomposition of width of , the OBDD obtained for a certain variable order has width .

The order on variables is defined following an in-order traversal of  where children are ordered by the number of variables in their subtree; clearly this quantity can be computed over the entire tree in PTIME, so the order can be computed in PTIME. We show that we can construct the OBDD in PTIME as well, in a level-wise manner inspired by [jha2013knowledge].

Write , and construct level-by-level in the following way. Assuming that we have constructed up to level , create two children for each node at level (depending on the value of variable ), and then merge all such children and that are equivalent. To define this, call equivalent two partial valuations and of variables if the Boolean function represented by on the other variables under is the same as under . Now, call and equivalent if, for any partial valuation leading to (represented by a path from the root of  to ) and any partial valuation leading to , these two partial valuations of are equivalent. As we will always ensure in the construction, any paths leading to the parent node of are equivalent partial valuations of , so and are equivalent iff, picking any two valuations for and for by following a path from the root to  and to  respectively, and are equivalent.

Hence, it suffices to show that there is a function such that we can test in time whether two partial valuations are equivalent. Indeed, we can then build in the indicated time, because the maximal number of node pairs to test at any level of the OBDD is : we had at most at the previous level, and each of them creates two children, before we merge the equivalent children. Hence, if we can test equivalence in the indicated time, then clearly we can construct in time for (the first term accounts for the linear number of levels, and the second term accounts for the linear time required to find a partial valuation for a node).

We thus show that the equivalence of partial valuations can be tested in time for some function . Considering two partial valuations and of the same set of variables , let and be the two circuits obtained from by substituting the input gates for with constant gates according to and respectively. Note that and have the same set of input gates , formed precisely of the variables not in . We rename the internal gates of  so that the only gates shared between and are the input gates . Now, let be the circuit obtained by taking the union of and (on the same set of variables), and adding an output gate and a constant number of gates such that the output gate is true iff the output gates of and  carry different values (this can be done with 5 additional gates in total). It is easy to see that there is a valuation of that makes the circuit evaluate to true iff the partial valuations and are not equivalent. Now, observe that we can immediately construct from a tree decomposition  of width of . Indeed, it is obvious that is a tree decomposition of , and we can rename gates to obtain from a tree decomposition of , such that and  both have the same width and the same skeleton. Now, construct that has the same skeleton as and where each bag is the union of the corresponding bags of  and , adding the 5 intermediate gates to each bag. The result clearly has width and it is immediate that it is a tree decomposition of .

We can then use message-passing techniques [lauritzen1988local, huang1996inference] to determine in time exponential in and polynomial in  whether the bounded-treewidth circuit has a satisfying assignment, from which we deduce whether and are equivalent. For details, see, e.g., Theorem D.2 of [amarilli2015provenance]. ∎
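The level-by-level construction with merging of equivalent partial valuations can be sketched as follows. Note the substitution: this sketch tests equivalence by comparing full truth tables of the residual functions, which is exponential, whereas the proof uses message passing on a tree decomposition to stay polynomial. The function name `build_obdd_levels` and the example function are hypothetical.

```python
from itertools import product

def build_obdd_levels(f, order):
    """Return the number of OBDD nodes at each level after merging.

    Level i keeps one representative partial valuation of order[:i] per
    equivalence class; two partial valuations are equivalent when they
    induce the same function on the remaining variables (brute-force test).
    """
    n = len(order)
    level = [()]          # the root: empty partial valuation
    widths = [1]
    for i in range(n):
        rest = n - i - 1  # number of variables still unassigned afterwards
        seen = {}         # residual truth table -> representative valuation
        for nu in level:
            for b in (False, True):
                child = nu + (b,)
                table = tuple(
                    f(dict(zip(order, child + tail)))
                    for tail in product((False, True), repeat=rest)
                )
                seen.setdefault(table, child)  # merge equivalent children
        level = list(seen.values())
        widths.append(len(level))
    return widths

# Hypothetical example: (x1 AND x2) OR x3 under the order x1, x2, x3.
order = ["x1", "x2", "x3"]
f = lambda v: (v["x1"] and v["x2"]) or v["x3"]
widths = build_obdd_levels(f, order)
```

Because each level keeps a single representative per class, any two paths reaching the same node are equivalent partial valuations, which is the invariant used in the proof.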

#### Bounded-pathwidth

We have explained the tractability of MSO probability evaluation on bounded-treewidth instances, showing that we could compute bounded-treewidth lineage circuits for them. We strengthen these results in the case of bounded-pathwidth instances, showing that we can compute constant-width OBDDs:

###### Theorem 5.13.

For any fixed MSO query and constant , given an input instance of pathwidth , one can compute in polynomial time an OBDD of constant width for the lineage of  on .

To prove the result, we first observe (adapting [amarilli2015provenance]) that we can compute bounded-pathwidth lineage circuits in linear time on bounded-pathwidth instances:

###### Proposition (prp:makecircuitpath).

For any fixed and (monotone) MSO query , for any -instance  of pathwidth , we can construct a (monotone) lineage circuit of  on  in time . The pathwidth of  only depends on  and  (not on ).

###### Proof.

Given a path decomposition of an instance , which is a tree decomposition with a linear tree, the resulting tree encoding  of  (see [amarilli2015provenance, amarilli2015provenanceb]) is clearly also a linear tree. From the proof of Theorem 4.4 of [amarilli2015provenanceb], we observe that the lineage circuit that we construct has a tree decomposition which can be made to be a path decomposition in this case, because it follows the structure of . Hence, the circuit has bounded pathwidth. ∎

By Corollary 2.13 of [jha2012tractability], this implies the existence of a constant-width OBDD representation, which we again show to be computable, proving Theorem 5.13.

###### Lemma (lem:makeobddpw).

For any , for any Boolean circuit of pathwidth , we can compute in polynomial time in  an OBDD equivalent to whose width depends only on .

###### Proof.

As in the proof of the lemma for bounded-treewidth circuits (lem:makeobddtw), we can compute in PTIME the order on variables, and we can compute the OBDD under this order in the same way. This uses the fact that a path decomposition of circuit  is in particular a tree decomposition of . ∎

#### d-DNNFs

We now turn to the more expressive tractable lineage formalism of d-DNNFs, introduced in [darwiche2001tractable]; we follow the definitions of [jha2013knowledge]:

###### Definition 5.14.

A deterministic, decomposable negation normal form (d-DNNF) is a Boolean circuit that satisfies the following conditions:

1. Negation is only applied to input gates: the input of any NOT gate must always be an input gate.

2. The inputs of AND-gates depend on disjoint sets of input gates. Formally, for any AND-gate , for any two gates which are inputs of , there is no input gate which is reachable (as a possibly indirect input) from both and .

3. The inputs of OR-gates are mutually exclusive. Formally, for any OR-gate , for any two gates which are inputs of , there is no valuation of the inputs of  under which and  both evaluate to true.
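The three conditions are what make probability computation a single bottom-up pass: decomposable AND gates multiply, deterministic OR gates add, and negated inputs complement. The sketch below illustrates this on a hand-built d-DNNF for exclusive OR; the gate encoding and names are assumptions made for the example.

```python
# Gates: name -> ("var", fact) | ("not", gate) | ("and", [gates]) | ("or", [gates]).
# Hypothetical d-DNNF for x XOR y: (x AND NOT y) OR (NOT x AND y).
DDNNF = {
    "x": ("var", "x"),
    "y": ("var", "y"),
    "nx": ("not", "x"),   # negation applied to input gates only
    "ny": ("not", "y"),
    "a1": ("and", ["x", "ny"]),   # inputs depend on disjoint variables
    "a2": ("and", ["nx", "y"]),
    "out": ("or", ["a1", "a2"]),  # inputs are mutually exclusive
}

def ddnnf_probability(gate, prob, cache=None):
    """Probability that the gate is true under independent input probabilities."""
    cache = {} if cache is None else cache
    if gate in cache:
        return cache[gate]
    kind, arg = DDNNF[gate]
    if kind == "var":
        value = prob[arg]
    elif kind == "not":
        value = 1.0 - ddnnf_probability(arg, prob, cache)
    elif kind == "and":
        value = 1.0
        for g in arg:  # independence follows from decomposability
            value *= ddnnf_probability(g, prob, cache)
    else:
        # Determinism makes the inputs disjoint events, so probabilities add.
        value = sum(ddnnf_probability(g, prob, cache) for g in arg)
    cache[gate] = value
    return value

p = ddnnf_probability("out", {"x": 0.5, "y": 0.25})
```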

It is tractable to evaluate the probability of a d-DNNF [darwiche2001tractable], and d-DNNFs capture the tractability of probability evaluation for many safe queries (see [jha2013knowledge]). We show that they also explain the ra-linearity of MSO probability evaluation on bounded-treewidth instances, as we can construct linear-size d-DNNFs for them:

###### Theorem (thm:makeddnnf).

For any fixed MSO query and constant , given an input instance of treewidth , one can compute in time a d-DNNF representation of the lineage of  on .

###### Proof sketch.

Our construction in [amarilli2015provenance] applies a tree automaton for the query to an annotated tree encoding of the instance. This yields a bounded-treewidth circuit representation of the lineage, in linear time in the instance (but with a constant factor that is nonelementary in the query). We show that, if the automaton is deterministic, the circuit that we obtain is already a d-DNNF. The result follows, as one can always make a tree automaton deterministic [tata], at the cost of an increased constant factor in the data complexity.

###### Proof.

We define a bottom-up deterministic tree automaton on alphabet (or -bDTA) in the standard manner. We start by adapting the proof of Proposition 3.1 of [amarilli2015provenanceb] to show the following result instead: a provenance d-DNNF of a deterministic -bDTA on a -tree can be constructed in time . We construct the circuit exactly as in the proof of Proposition 3.1 of [amarilli2015provenanceb] and show that it is a d-DNNF.

First, observe that the only NOT gates that we use are the , which are NOT gates of the , which are input gates; so we only apply negation to leaf nodes.

Second, we show that the sets of leaves reachable from the children of any AND gate are pairwise disjoint. The AND gates that we create and that have multiple inputs are:

• The , which are the AND of and ; now, only depends on the input gates for nodes of the subtree of  rooted at , and likewise only depends on input gates in the right subtree;

• The , which are the AND of and ; now, the do not depend on , only on input gates for a strict descendant of in ;

• The , which are the AND of and ; now, the do not depend on the sole input gate under , i.e., , but only on input gates for a strict descendant of in .

Third, we show that the children of any OR gate are mutually exclusive. The OR gates that we create and that have multiple inputs are the following:

• The when is a leaf node of , for which the claim is immediate, as the only two possible children are and which are clearly mutually exclusive.

• The when is an internal node of , which are the OR of gates of the form or over several pairs .

To observe that these gates are mutually exclusive, remember that, for a valuation of the tree , the gate is true iff there is a run  of on the subtree of  rooted at  such that . However, as is deterministic, for each , there is at most one state  for which this is possible. Hence, for any valuation of the circuit , for our node , there is at most one such that is true under valuation , and only at most one such that is true under . Hence, by definition of the , there is at most one of them which can be true under valuation , namely, , which also means that only the gate and the gate can be true under . But these two gates are clearly mutually exclusive (only one can evaluate to true, depending on the value of ), which proves the claim.

• The output gate which is the OR of gates of the form for the root node of . Again, as is deterministic, for any valuation  of , letting be the corresponding valuation of the -tree , there is only one state such that has a run on  with , so at most one state such that is true under .

Hence, the circuit constructed in the proof of Proposition 3.1 of [amarilli2015provenance] is a d-DNNF representation of the lineage of the automaton on the tree which has linear size.

We now adapt the proof of Theorem 4.2 of [amarilli2015provenanceb]. The theorem proceeds by constructing a bNTA for the query  [courcelle1990monadic] on the alphabet and modifying it to obtain a bNTA on . We now additionally convert to a bDTA on the same alphabet, which we can do using standard techniques [tata]. All of this is performed independently of the instance.

Now, we conclude using the rest of the proof of Theorem 4.2 of [amarilli2015provenance]. The resulting circuit is the result of (bijectively) renaming the input gates, and replacing some input gates by constant gates, on the circuit produced by Proposition 3.1 of [amarilli2015provenance]. However, by our previous observation, is actually a d-DNNF circuit, so also is (up to evaluating negations of constant gates as constant gates). Hence, we have produced the desired d-DNNF, which by the statement of Theorem 4.2 of [amarilli2015provenance] is of linear size and is computed in linear time. ∎

## Formula Lower Bounds

We have shown that MSO queries on bounded-treewidth instances have tractable lineages, and even linear-sized ones (bounded-treewidth circuits and d-DNNFs). All lineage representations that we have studied, however, are based on DAGs (circuits or OBDDs). In this section, we study whether we could obtain linear lineage representations as Boolean formulae, e.g., as read-once formulae [jha2013knowledge].

This question is natural because existing work on probabilistic data seldom represents query lineages as Boolean circuits: they tend to use Boolean formulae, or other representations such as OBDDs, FBDDs and d-DNNFs [jha2013knowledge]. Further, most prior works on provenance focus on formula representations of provenance (with the notable exception of [deutch2014circuits], see below).

Intuitively, an important difference between formula and circuit representations is that circuits can share common subformulae whereas formulae cannot. We show in this section that this difference matters: formula-based representations of the lineage of MSO queries on bounded-treewidth instances cannot be linear in general, because of superlinear lower bounds. More specifically, we show that formula-based representations are superlinear even for some CQ≠ queries, and that they are quadratic for some MSO queries.

A similar result was already known for lineage representations ([deutch2014circuits], Theorem 1), which showed that circuit representations of provenance can be more concise than formulae; but this result applies to arbitrary instances, not bounded-treewidth ones. Hence, this section also sheds additional light on the compactness gain offered by the recent circuit representations of provenance in [deutch2014circuits].

Our results in this section rely on classical lower bounds on the size of formulae expressing certain Boolean functions [wegener1987complexity]. They apply to signatures of arbitrary arity. We summarize them in the lower part of Table 2.

#### CQ≠ queries

We first show a mild conciseness gap for the comparatively simple language of CQ≠. Formally, we exhibit a query whose lineage cannot be represented by a linear-size formula, even on treelike instance families. By contrast, from the previous section, we know that its lineage has linear-size circuit representations.

###### Proposition (prp:provlowercq).

There is a query , and a family  of relational instances with treewidth , such that, for any , for any Boolean formula capturing the lineage of  on , we have .

###### Proof sketch.

We show we can express the threshold function over variables, for which [wegener1987complexity] gives a lower bound on the formula size.

###### Proof.

Consider the signature with a single unary predicate , and consider the . Consider the family of instances  defined as for all . Clearly, for any , the lineage of  on  is the threshold function checking whether at least two of its inputs are true.

By Chapter 8, Theorem 5.2 of [wegener1987complexity], any formula using , , expressing the threshold function over variables has size , the desired bound. ∎
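The claim that the lineage is a threshold function can be checked exhaustively on small instances. The sketch below assumes that the query asks for two distinct elements carrying the unary predicate and that the instance consists of $n$ unary facts on distinct elements; these specifics are assumptions reconstructed from the description of the lineage, and the element names are hypothetical.

```python
from itertools import product

def query_holds(world):
    """Assumed query: there exist x, y with R(x), R(y) and x != y."""
    return any(x != y for x in world for y in world)

def threshold2(bits):
    """Threshold function: at least two inputs are true."""
    return sum(bits) >= 2

def lineage_is_threshold2(n):
    """Compare the query and the threshold function on all 2^n possible worlds."""
    elements = [f"a{i}" for i in range(1, n + 1)]
    for bits in product((False, True), repeat=n):
        world = {e for e, b in zip(elements, bits) if b}
        if query_holds(world) != threshold2(bits):
            return False
    return True

ok = lineage_is_threshold2(5)
```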

As CQ≠ queries are monotone, we can also ask for monotone lineage representations. From the previous section, we still have linear representations of the lineage as a monotone circuit. By contrast, if we restrict to monotone Boolean formulae, we obtain an improved lower bound:

###### Proposition (prp:provlowercqpos).

There is a query , and a family  of relational instances with treewidth , such that, for any , for any monotone Boolean formula capturing the lineage of  on , we have .

###### Proof.

We use the same proof as for the previous proposition (prp:provlowercq) but relying on [hansel1964nombre] (also Chapter 8, Theorem 1.2 of [wegener1987complexity]), which shows that, for built over the monotone basis and , we have . ∎

#### MSO queries

For the more expressive language of MSO queries, we show a wider gap: there is a query for which formula-based lineage representations must be quadratic, whereas we know that there are linear circuit representations of the lineage:

###### Proposition (prp:provlowertree).

There is an MSO query , and a family  of relational instances with treewidth , such that, for any , for any Boolean formula capturing the lineage of  on , we have .

###### Proof sketch.

The proof is more subtle and constructs an MSO formula expressing the parity of the number of facts of a unary predicate, using a second auxiliary relation. The lower bound for parity comes again from [wegener1987complexity].

We leave open the question of whether this bound can be further improved, but we note that even an lower bound would require new developments in the study of Boolean formulae. Indeed, the best currently known lower bound on formula size, for any explicit function of  variables with linear circuits, is in  [hastad1998shrinkage].

###### Proof.

Consider the signature with a unary predicate and binary predicate . We define the family with having domain and facts for each and for each . Clearly, all instances in  have treewidth . We consider the MSO formula that intuitively uses the -facts to test whether the number of -facts is odd. Formally, we define as follows, inspired by the definition of an automaton:

$$q \mathrel{:=} \forall X_0 X_1 \;\; \mathrm{Part}(X_0, X_1) \wedge \mathrm{Tr}(X_0, X_1) \wedge \mathrm{Init}(X_0, X_1) \Rightarrow \forall x \, (\neg \exists y \, E(y, x) \Rightarrow x \in X_1)$$

where asserts that and partition the domain (where denotes exclusive OR):

$$\mathrm{Part}(X_0, X_1) \mathrel{:=} \forall x \, (x \in X_0) \oplus (x \in X_1)$$

is the conjunction of the following transition rules, for each in :

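$$\begin{aligned}
\forall x y \;\; E(x, y) \wedge y \in X_b \wedge L(x) &\Rightarrow x \in X_{b'} \\
\forall x y \;\; E(x, y) \wedge y \in X_b \wedge \neg L(x) &\Rightarrow x \in X_b
\end{aligned}$$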

and asserts the initial states:

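$$\begin{aligned}
\forall x \, (\neg \exists y \, E(x, y)) \wedge \neg L(x) &\Rightarrow x \in X_0 \\
\forall x \, (\neg \exists y \, E(x, y)) \wedge L(x) &\Rightarrow x \in X_1
\end{aligned}$$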

Intuitively, on any possible world of  where all -facts are present, is true whenever the number of -facts is odd. Indeed, it is clear that there is a unique choice of and in such worlds, defined by putting in or depending on whether holds, and, for , letting such that and be  or  depending on whether holds or not, putting in . Hence, in worlds containing all -facts, there is a unique choice of and where we have iff the number of facts with is odd. Hence, is satisfied iff, in this unique assignment, (the only node with no incoming -edge) is in , that is, if the overall number of -facts is odd.

Hence, pick and let be a formula representation of the lineage of  on . Replacing the input gates for the -facts by constant -gates, we obtain a formula of the same size that computes the parity function of the inputs corresponding to the -gates, the number of which is .

Now, by Theorem 8.2, Chapter 8 of [wegener1987complexity], any formula using , , expressing the parity function over variables has a number of variable occurrences that is at least . Hence, we deduce that , as claimed. ∎
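The automaton-like behaviour of the query can be simulated directly. The sketch below rests on assumptions not fully recoverable from the text: the binary facts form a chain $E(a_i, a_{i+1})$ for $1 \le i < n$, all of them are kept in the possible world, and the transition rule flips the state ($b' = 1 - b$) whenever the unary fact on $x$ is present. Under these assumptions, the unique admissible partition is computed from the last element backwards, and the query holds iff $a_1$ lands in $X_1$.

```python
from itertools import product

def query_holds(n, l_present):
    """Simulate the unique choice of (X0, X1) on the chain a_1 -> ... -> a_n.

    l_present[i] tells whether the unary fact on a_{i+1} is in the world.
    The returned value says whether a_1 (no incoming edge) is in X1.
    """
    # Initial rule: a_n has no outgoing edge, so its state is fixed by its L-fact.
    state = 1 if l_present[n - 1] else 0
    # Transition rules: going from a_{i+1} back to a_i, flip when L(a_i) holds.
    for i in range(n - 2, -1, -1):
        if l_present[i]:
            state = 1 - state
    return state == 1

def matches_parity(n):
    """Check on all 2^n worlds that the query computes the parity of the L-facts."""
    return all(
        query_holds(n, bits) == (sum(bits) % 2 == 1)
        for bits in product((False, True), repeat=n)
    )

ok = matches_parity(6)
```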

## 6 OBDD Size Bounds

We have shown in the previous sections that MSO queries on bounded-treewidth instances have tractable lineage representations as circuits and OBDDs. This section focuses on OBDDs and shows our second main dichotomy result: bounded-treewidth is necessary for MSO query lineages to have polynomial OBDDs.

We first state this result in Section 6.1. Its upper bound is Theorem 5.12, and its lower bound applies to a specific (which only depends on the signature). We show that has no polynomial-width OBDDs on any arity-2 instance family with treewidth densely unbounded poly-logarithmically. This second dichotomy result thus shows that bounded-treewidth is necessary for some queries to have tractable OBDDs; it applies to a more restricted class than the FO query of our first main dichotomy result (Theorem 4.2), but applies to a different task (the computation of OBDD lineages, rather than probability evaluation).

We then study in Section 6.2 the language of connected . For this language, we show that queries can be classified in a meta-dichotomy result: we characterize the intricate queries, such as , which have no polynomial OBDDs on any unbounded treewidth family in the sense above; and we show that non-intricate queries actually have constant-width OBDDs on some well-chosen unbounded-treewidth instance family. Hence, if a connected has polynomial OBDDs on some unbounded-treewidth instance family, then it must have constant-width OBDDs on some other such family.

Finally, we investigate in Section 6.3 whether our second dichotomy result holds for more restricted fragments than UCQ≠. First, we show that connected CQ≠ queries are never intricate, so we cannot show our dichotomy result with such queries. Second, we show the same for connected UCQ queries; in fact, we show that no query closed under homomorphisms could be used. Last, we show that our meta-dichotomy fails for disconnected queries.

As in Sections 4 and 5, we limit ourselves to arity-2 signatures in this section.

### 6.1 A Dichotomy on OBDD Size

This section shows that our Theorem 5.12 on the existence of tractable lineage representations as OBDDs is unlikely to extend to milder conditions than bounded-treewidth. Indeed, there are even queries that have no polynomial-width OBDDs on any input instance family with treewidth densely unbounded poly-logarithmically, again on arity-2 signatures. Our second main dichotomy result shows this:

###### Theorem 6.1.

There exists a constant such that the following holds. Let be an arbitrary arity-2 signature and be a class of -instances. Assume there is a function such that, for all , if contains instances of treewidth , one of them has size . We have the following dichotomy:

• If there is such that for every , then for every MSO query , an OBDD of on  can be computed in time polynomial in .

• Otherwise, there is a query (depending on but not on ) such that the width of any OBDD of on  cannot be bounded by any polynomial in .

This does not require treewidth-constructibility, and imposes instead a slight weakening of densely unbounded poly-logarithmic treewidth. It does not require the class of instances to be subinstance-closed either, unlike in Section 5.

The first part of the theorem is by Theorem 5.12, so we sketch the proof of the second part. Our choice of query intuitively tests the existence of a path of length 2 in the Gaifman graph of the instance, i.e., a violation of the fact that the possible world is a matching of the original instance. Again, while we know that probability evaluation for this query is #P-hard if we allow arbitrary input instances (as counting matchings reduces to it), our task is to show that it has no polynomial-width OBDDs when restricted to any instance family that satisfies the conditions, which is a much harder task.

To show this, we draw a link between treewidth and OBDD width for this query on individual instances, with the following result (which is specific to this query):

###### Lemma 6.2.

Let be an arity-2 signature. There exist constants  such that for any instance on  of treewidth , the width of an OBDD for on  is .

Proof sketch.

The proof is technical and uses Lemma 4.4 to extract high-treewidth topological minors of a specific shape: skewed grids. We then show that any variable order that enumerates the edges of the skewed grid must have a prefix that shatters the grid, i.e., sufficiently many independent grid nodes have both an enumerated incident edge and a non-enumerated one. This forces any OBDD for  to remember the exact configurations of the enumerated edges at the level for this shattering prefix of its variable order. Thus, the OBDD has superpolynomial width. We formalize this via the structure of the prime implicants of the lineage.
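To make the notion of OBDD width concrete, here is a small brute-force sketch in Python. It assumes that the lineage of the query is the Boolean function, over one variable per edge, that is true exactly when two kept edges share an endpoint (i.e., when the possible world is not a matching), and it measures the width of a complete, levelled OBDD as the maximal number of distinct subfunctions obtained by fixing a prefix of the variable order. The names `lineage` and `obdd_width` are hypothetical, the procedure is exponential in the number of edges, and it is only meant for experimenting on tiny graphs, not as part of the proof.

```python
from itertools import product

def lineage(edges, world):
    """True iff two kept edges share an endpoint (the world is not a matching).

    Assumes a simple graph: `edges` is a list of distinct pairs (u, v).
    """
    seen = set()
    for (u, v), keep in zip(edges, world):
        if not keep:
            continue
        if u in seen or v in seen:
            return True
        seen.update((u, v))
    return False

def obdd_width(edges, order):
    """Width of the complete OBDD of the lineage under the variable order `order`.

    `order` is a permutation of range(len(edges)). For each prefix length k,
    we count the distinct subfunctions induced by the 2^k prefix assignments,
    each subfunction being represented by its full truth table.
    """
    ordered = [edges[i] for i in order]
    n = len(ordered)
    width = 1
    for k in range(n + 1):
        subfunctions = set()
        for prefix in product((False, True), repeat=k):
            table = tuple(
                lineage(ordered, prefix + suffix)
                for suffix in product((False, True), repeat=n - k)
            )
            subfunctions.add(table)
        width = max(width, len(subfunctions))
    return width

# A path with four edges, enumerated from one end to the other.
path = [(1, 2), (2, 3), (3, 4), (4, 5)]
print(obdd_width(path, [0, 1, 2, 3]))
```

On a path enumerated from one end, the OBDD only needs to remember whether a violation was already found and whether the last edge was kept, which is why the width stays constant. On grid-like instances, one can try several orders with the same function to observe the growth that the shattering argument predicts.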

### 6.2 A Meta-Dichotomy for UCQ≠

For which queries does Theorem 6.1 adapt? It does not extend to all unsafe [dalvi2012dichotomy] queries, as a query may be unsafe and still be tractable on some unbounded-treewidth instance family: for instance, the standard unsafe query from [dalvi2007efficient] has trivial OBDDs on the family of -grids without unary relations.

We answer this question, again on arity-2 signatures, by introducing a notion of intricate queries. We show that it precisely characterizes the connected UCQ≠ queries for which the dichotomy of Theorem 6.1 applies. Let us first recall the definition of connected queries:

###### Definition 6.3.

A CQ≠ is connected if, building the graph on its atoms that connects those that share a variable (ignoring ≠-atoms), this graph is connected (in particular it has no isolated vertices, unless it consists of a single isolated vertex). A UCQ≠ is connected if all its disjuncts are connected.
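The connectedness test of this definition can be sketched as a graph search over atoms. The encoding below is an assumption made for illustration: an atom is a pair of a relation name and a tuple of variables, disequality atoms are marked with the hypothetical relation name `"!="` and are ignored when building the graph, and a query with at most one relational atom is treated as connected by convention.

```python
from collections import defaultdict

NEQ = "!="  # hypothetical marker for disequality atoms

def is_connected_cq(atoms):
    """atoms: list of (relation, variables) pairs, e.g. ("R", ("x", "y"))."""
    # Keep only the relational atoms; disequalities do not create edges.
    relational = [variables for rel, variables in atoms if rel != NEQ]
    if len(relational) <= 1:
        return True
    # Index atoms by variable: two atoms are adjacent iff they share a variable.
    by_var = defaultdict(list)
    for i, variables in enumerate(relational):
        for v in variables:
            by_var[v].append(i)
    # Depth-first search from the first atom.
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for v in relational[i]:
            for j in by_var[v]:
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
    return len(seen) == len(relational)

def is_connected_ucq(disjuncts):
    """A union is connected iff every disjunct is connected."""
    return all(is_connected_cq(d) for d in disjuncts)
```

For instance, `is_connected_cq([("R", ("x", "y")), ("S", ("y", "z"))])` holds because the two atoms share `y`, whereas adding a disequality between variables of two otherwise unrelated atoms does not make them connected.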

We now give our definition of intricate queries. We characterize them by looking at line instances:

###### Definition 6.4.

A line instance is an instance  of the following form: a domain , and, for , one single binary fact between and : either for some or for some binary . (Recall that, as is arity-two, its maximal arity is two, so it must include at least one binary relation.)

The intuition is that a query is intricate if, on any sufficiently long line instance, it must have a minimal match that contains the two middle facts (i.e., the ones that are incident to the middle element). Here is the formal definition of intricate queries:

###### Definition 6.5.

A is -intricate for if, for every line instance with , letting and be the two facts of  incident to the middle element , there is a minimal match of  on  that includes both  and .

We call intricate if it is -intricate.

Observe that queries with clearly cannot be intricate. Further, if a query has no matches that include only binary facts, then it cannot be intricate; in other words, any disjunct that contains an atom for a unary relation can be ignored when determining whether a query is intricate. By contrast, our query of Theorem 6.1 was designed to be intricate, in fact is -intricate. Also note that an -intricate query is always -intricate for any : consider the restriction of any line instance of size to a line instance of size , and find a match in the restriction.

We note that we can decide whether queries are intricate or not, by enumerating line instances. We do not know the precise complexity of this task:

###### Lemma 6.6.

Given a connected UCQ≠ query, we can decide in PSPACE whether it is intricate.
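The enumeration of line instances can be illustrated by the following brute-force sketch, which is not the PSPACE procedure of the lemma. It checks the middle-facts condition on all line instances with a fixed even number `m` of facts, under several assumptions: facts are triples `(relation, source, target)` over elements `1..m+1`, a disjunct is a pair of binary atoms and disequalities between variables that occur in those atoms, a match is a set of facts on which the query holds, and minimality is with respect to inclusion. All function names are hypothetical, and the exact instance size used in Definition 6.5 is left to the caller.

```python
from itertools import product

def line_instances(relations, m):
    """All line instances with m facts: one binary fact per consecutive pair."""
    choices = [(r, fwd) for r in relations for fwd in (True, False)]
    for pick in product(choices, repeat=m):
        yield [(r, i, i + 1) if fwd else (r, i + 1, i)
               for i, (r, fwd) in enumerate(pick, start=1)]

def match_images(cq, facts):
    """Images of all homomorphisms of one disjunct into `facts`."""
    atoms, diseqs = cq
    variables = sorted({v for _, x, y in atoms for v in (x, y)})
    domain = sorted({e for _, a, b in facts for e in (a, b)})
    fact_set = set(facts)
    for values in product(domain, repeat=len(variables)):
        h = dict(zip(variables, values))
        if any(h[x] == h[y] for x, y in diseqs):
            continue  # a disequality is violated
        image = frozenset((r, h[x], h[y]) for r, x, y in atoms)
        if image <= fact_set:
            yield image

def minimal_matches(ucq, facts):
    """Inclusion-minimal matches; valid because the query is monotone."""
    images = {img for cq in ucq for img in match_images(cq, facts)}
    return {m for m in images if not any(other < m for other in images)}

def middle_facts_in_minimal_match(ucq, facts):
    """Does some minimal match contain both facts incident to the middle element?"""
    mid = len(facts) // 2
    f1, f2 = facts[mid - 1], facts[mid]
    return any(f1 in m and f2 in m for m in minimal_matches(ucq, facts))

def passes_on_length(ucq, relations, m):
    """Check the middle-facts condition on every line instance with m facts."""
    return all(middle_facts_in_minimal_match(ucq, inst)
               for inst in line_instances(relations, m))

# Two facts sharing an element, in every orientation (one binary relation R).
Q_PATH = [
    ([("R", "x", "y"), ("R", "y", "z")], [("x", "z")]),
    ([("R", "x", "y"), ("R", "z", "y")], [("x", "z")]),
    ([("R", "y", "x"), ("R", "y", "z")], [("x", "z")]),
]
# A single directed join pattern, which fails on alternating directions.
Q_JOIN = [([("R", "x", "y"), ("R", "y", "z")], [])]

print(passes_on_length(Q_PATH, ["R"], 4), passes_on_length(Q_JOIN, ["R"], 2))
```

The search is exponential both in `m` and in the number of variables, so it is only usable on very small queries. It is meant to make the definitions tangible rather than to suggest a practical decision procedure.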

We can now state our meta-dichotomy: a dichotomy such as Theorem 6.1 holds for a connected UCQ≠ query if and only if it is intricate. Further, non-intricate queries must actually have constant-width OBDDs on some counterexample unbounded-treewidth family:

###### Theorem 6.7.

For any connected UCQ≠ query on an arity-2 signature:

• If the query is not intricate, there is a treewidth-constructible and unbounded-treewidth family of instances on which it has constant-width OBDDs; the OBDDs can be computed in PTIME from the input instance.

• If the query is intricate, then Theorem 6.1 applies to it: in particular, on any unbounded-treewidth family of instances satisfying the hypotheses, it does not have polynomial-width OBDDs.

Proof sketch.

For a non-intricate query, we construct the instance family as a family of grids built from a line instance that is a counterexample to intricacy. As we can disconnect facts that do not co-occur in a match, we can disconnect the grids into bounded-pathwidth instances in a lineage-preserving fashion.

Conversely, we adapt the hardness proof of Theorem 6.1 to any intricate query, extracting independent matches from any sufficiently subdivided skewed grid minor thanks to intricacy.

### 6.3 Other Query Classes

We finish by investigating the status of other query classes relative to our meta-dichotomy, to see whether Theorem 6.1 could be shown for queries in an even less expressive class than UCQ≠, such as CQ≠ or UCQ.

#### Connected CQ≠ queries

We classify the connected CQ≠ queries relative to Theorem 6.7, by showing that a connected CQ≠ query can never be intricate. This explains why, for instance, the query is not intricate, as is witnessed by the family of -grids.

###### Proposition 6.8.

A connected CQ≠ query is never intricate.

Proof sketch.

The signature must contain at most one binary relation , as otherwise we can find for any a family of grids where the query never holds, so that would have trivial constant-width OBDDs. Now, if contains a join pattern of the form , then has no matches on line instances with -facts of alternating directions. If does not contain such a pattern, we consider line instances with a path of -facts in the same direction, and show that has no match that involves the two middle facts.

By Theorem 6.7, this implies that any query has an unbounded-treewidth, treewidth-constructible family of instances