Tractable Lineages on Treelike Instances: Limits and Extensions
Abstract
Query evaluation on probabilistic databases is generally intractable (#P-hard). Existing dichotomy results [DBLP:conf/pods/DalviS07a, dalvi2012dichotomy, DBLP:conf/pods/FinkO14] have identified which queries are tractable (or safe), and connected them to tractable lineages [jha2013knowledge]. In our previous work [amarilli2015provenance], using different tools, we showed that query evaluation is linear-time on probabilistic databases for arbitrary monadic second-order queries, if we bound the treewidth of the instance.
In this paper, we study limitations and extensions of this result. First, for probabilistic query evaluation, we show that MSO tractability cannot extend beyond bounded treewidth: there are even FO queries that are hard on any efficiently constructible unbounded-treewidth class of graphs. This dichotomy relies on recent polynomial bounds on the extraction of planar graphs as minors [chekuri2014polynomial], and implies lower bounds in non-probabilistic settings, for query evaluation and match counting in subinstance-closed families. Second, we show how to explain our tractability result in terms of lineage: the lineage of MSO queries on bounded-treewidth instances can be represented as bounded-treewidth circuits, polynomial-size OBDDs, and linear-size d-DNNFs. By contrast, we can strengthen the previous dichotomy to lineages, and show that there are even UCQs with disequalities that have superpolynomial OBDDs on all unbounded-treewidth graph classes; we give a characterization of such queries. Last, we show how bounded-treewidth tractability explains the tractability of the inversion-free safe queries: we can rewrite their input instances to have bounded treewidth.
1 Introduction
Many applications must deal with data which may be erroneous. This makes it necessary to extend relational database instances, to allow for uncertain facts. One of the simplest such formalisms [suciu2011probabilistic] is that of tuple-independent databases (TID): each tuple in the database is annotated with an independent probability of being present. The semantics of a TID instance is to see it as a concise representation of a probability distribution on standard non-probabilistic instances.
An important challenge when dealing with probabilistic data is that data management tasks become intractable. The main one is query evaluation: given an input database query $q$, for instance a conjunctive query, and given a relational instance $I$, determine the answers to $q$ on $I$. When $q$ is Boolean, we just ask whether $I$ satisfies $q$. The corresponding problem in the TID setting asks for the probability that $q$ holds, that is, the total probability weight of the possible subsets of the TID instance that satisfy $q$. The query $q$ is usually assumed to be fixed, and we look at the complexity of this problem as a function of the input instance (or TID) $I$, that is, the data complexity. Sadly, while this task is highly tractable and parallelizable (in the complexity class AC$^0$) in the non-probabilistic context, exact computation is generally intractable (#P-hard) on TID instances, even for the simple conjunctive query $\exists x\, y\; R(x) \wedge S(x, y) \wedge T(y)$ [dalvi2007efficient].
Faced with this intractability, two natural directions are possible. The first is to restrict the language of queries to focus on queries that are tractable on all instances, called safe. This has proven a very fruitful direction [DBLP:conf/pods/DalviS07a], culminating in the dichotomy result of Dalvi and Suciu [dalvi2012dichotomy]: the data complexity of a given union of conjunctive queries (UCQ) is either in PTIME or #P-hard. More recently, the safe non-repeated CQs with negation were characterized in [DBLP:conf/pods/FinkO14].
The second approach is to restrict the instances, to focus on instance families that are tractable for all queries in highly expressive languages. In a recent work [amarilli2015provenance], going through the setting of semiring provenance [green2007provenance], and using a celebrated result by Courcelle [courcelle1990monadic], we have started to explore this direction. We showed that, for queries in MSO (monadic second-order, a highly expressive language capturing UCQs), data complexity is linear-time on treelike instances, i.e., instances of treewidth bounded by a constant. Of course, this result says nothing of non-treelike instances, but covers use cases previously studied in their own right, such as probabilistic XML [cohen2009running] (without data values).
This new direction raises several important questions:

First, is this the best that one can hope for? For probability evaluation, could the tractability on bounded-treewidth instances be generalized, e.g., to bounded clique-width instances [courcelle1993handle], as for MSO in the non-probabilistic case? More ambitiously, could we separate tractable and intractable instances with a dichotomy theorem?

Second, can our bounded-treewidth tractability result be explained in terms of lineage? The lineage of a query intuitively represents how it can be satisfied on the instance, and can be used to compute its probability: for many fragments of safe queries [jha2013knowledge], tractability can be shown via a tractable representation of their lineage. In [amarilli2015provenance], we build a bounded-treewidth circuit representation of Boolean provenance. How does this compare to the usual lineage classes of OBDDs and d-DNNFs in knowledge compilation?

Third, can we link the query-based tractability approach to our instance-based one? Can we explain the tractability of some safe queries by reducing them to query evaluation on treelike instances?
This paper answers all of these questions.
Contributions
Our first main result (in Section 4) is that bounded treewidth characterizes the tractable families of graphs for MSO queries in the probabilistic context. More precisely, we construct a query $q_{\mathrm{h}}$ for which probability evaluation is intractable on any unbounded-treewidth family of graphs satisfying mild constructibility requirements; query evaluation is precisely $\mathrm{FP}^{\#\mathrm{P}}$-complete under randomized polynomial-time (RP) reductions. Thus, tractability on bounded-treewidth instances is really the best we can get, on arity-2 signatures. Surprisingly, we show that $q_{\mathrm{h}}$ can be taken to be a (non-monotone) FO query; this is in stark contrast with non-probabilistic query evaluation [kreutzer2010lower, ganian2014lower] where FO queries are fixed-parameter tractable under much milder conditions than bounded treewidth [kreutzer2008algorithmic]. This provides the lower bound of a dichotomy, the upper bound being our result in [amarilli2015provenance].
In Section 5, we explain how this dichotomy result can be adapted to non-probabilistic MSO query evaluation and match counting on subgraph-closed graph families. While the necessity of bounded treewidth for non-probabilistic query evaluation was studied before [kreutzer2010lower, ganian2014lower], our use of a recent polynomial bound on grid minors [chekuri2014polynomial] allows us to obtain stronger results in this context, which we review. Our work thus answers the conjecture of [grohe2008logic] (Conjecture 8.3) for MSO, which [kreutzer2010lower] answered for MSO$_2$, under similar complexity-theoretic assumptions.
In Section LABEL:sec:lineages, we move from probability evaluation to the computation of tractable lineages. Our tractability result in [amarilli2015provenance] computes a bounded-treewidth lineage of linear size for MSO queries on bounded-treewidth instances. We revisit this upper bound and show that we can compute an OBDD lineage of polynomial size (by results in [jha2012tractability]) and a d-DNNF lineage of linear size (a new result). We show that on bounded-pathwidth instances (a notion more restrictive than that of bounded treewidth), we obtain a bounded-pathwidth lineage, and hence a constant-width OBDD (by [jha2012tractability]). Further, all these representations can be efficiently constructed.
We then reexamine the choice of representing provenance as a circuit rather than a formula, because this is unusual in the semiring provenance context of [amarilli2015provenance]. We show in Section LABEL:sec:formulae that some of the previous tractability results for lineage representations cannot extend to formula representations, via conciseness bounds on Boolean circuits and formulae. This sheds some light on the conciseness gap between circuit and formula representations of lineage.
We then move in Section 6 to our second main result, which applies to tractable OBDD lineages rather than tractable query evaluation. It shows a dichotomy on arity-2 signatures, for the weaker query language of UCQs with disequalities: while bounded-treewidth instances admit efficient OBDDs for such queries, any constructible unbounded-treewidth instance family must have superpolynomial OBDDs for some query (depending only on the signature).
Last, in Section LABEL:sec:safe, we connect our approach to query-based tractability conditions [DBLP:conf/pods/DalviS07a, dalvi2012dichotomy]. We show that, for safe UCQs that admit a concise OBDD representation (that is, precisely inversion-free UCQs from [jha2013knowledge]), one can rewrite any instance to a bounded-treewidth instance (actually, to a bounded-pathwidth one), such that the query lineage, and hence the query probability, remain the same. Thus, in this sense, safe queries are tractable because their input instances may as well be bounded-pathwidth.
Related work
Bounded treewidth has been shown to be a sufficient condition for tractability of query evaluation (this is by Courcelle’s theorem [courcelle1990monadic], generalized to arbitrary relational structures in [flum2002query]), counting of query matches [arnborg1991easy], and probabilistic query evaluation [amarilli2015provenance].
For MSO query evaluation on non-probabilistic instances, bounded treewidth is known not to be necessary, e.g., query evaluation is tractable assuming bounded clique-width [courcelle1993handle]. FO query evaluation is tractable assuming milder conditions [kreutzer2008algorithmic]. Two separate lines of work investigated the necessity of bounding the treewidth of instances to ensure the tractability of other data management tasks.
First, in [DBLP:conf/focs/Marx07, marx2010can], Marx shows that treewidth-based algorithms for binary constraint-satisfaction problems (CSP) are, assuming the exponential-time hypothesis, almost optimal: they can only be improved by a logarithmic factor. These works do not rely on the graph minor theorem [robertson1986graph5] as we do, as they preceded the results of [chekuri2014polynomial] that provide polynomial bounds on the size of grid minors: see the discussion in the Introduction of [marx2010can]. Instead, they characterize high treewidth via embeddings of low depth. The results of [DBLP:conf/focs/Marx07, marx2010can] were further applied to inference in undirected [DBLP:conf/uai/ChandrasekaranSH08] and directed [DBLP:conf/ecai/KwisthoutBG10] graphical models. All these works are specific to the setting and problem that they study, namely CSP and inference.
Second, another line of work [makowsky2003tree, kreutzer2010lower, ganian2014lower] has shown the necessity of bounded treewidth when a class of graphs is closed under some operations: extracting topological minors in [makowsky2003tree], extracting subgraphs in [kreutzer2010lower], and extracting subgraphs and vertex relabeling in [ganian2014lower]. This requires that there are sufficiently many instances of high treewidth, through notions of strong unboundedness [kreutzer2010lower] and dense unboundedness [ganian2014lower]. We strengthen the results of [ganian2014lower] in Section 5.2 of this paper, using our techniques. None of these works consider probabilistic evaluation or match counting, which we do here.
Other related work is discussed throughout the paper, where relevant; in particular works related to lineages in Sections LABEL:sec:lineages to 6 and to safe queries in Section LABEL:sec:safe.
2 Preliminaries
Instances
A relational signature $\sigma$ is a set of relations $R$, each having an arity denoted $\mathrm{arity}(R)$. The signature $\sigma$ is arity-$k$ if $k$ is the maximum arity of a relation in $\sigma$.
A relational instance $I$ (or $\sigma$-instance, or simply instance) is a finite set of ground facts on the signature $\sigma$, and a class or family of instances is just a (possibly infinite) set of instances. A subinstance of $I$ is a subset of its facts. We follow the active domain semantics, where the domain $\mathrm{dom}(I)$ of $I$ is the finite set of elements that occur in facts. Hence, for $I' \subseteq I$, $\mathrm{dom}(I')$ is the (possibly strict) subset of $\mathrm{dom}(I)$ formed of the elements that occur in facts of $I'$. The size of $I$, denoted $|I|$, is its number of facts.
A homomorphism from a $\sigma$-instance $I$ to a $\sigma$-instance $I'$ is a function $h : \mathrm{dom}(I) \to \mathrm{dom}(I')$ such that, for all $R(a_1, \ldots, a_k) \in I$, we have $R(h(a_1), \ldots, h(a_k)) \in I'$. A homomorphism is an isomorphism if it is bijective and its inverse is also a homomorphism.
Graphs
Throughout the paper, a graph will always be undirected, simple, and unlabeled, unless otherwise specified. Formally, we see a graph $G$ as an instance of the graph signature with a single predicate $E$ of arity $2$ such that: (i) $E(x, x) \notin G$ for every $x$; and (ii) $E(x, y) \in G$ if and only if $E(y, x) \in G$. As we follow the active domain semantics, this implies that we disallow isolated vertices in graphs. The facts of $G$ are called edges. The set of vertices (or nodes) of a graph $G$, denoted $V(G)$, is its domain. Two vertices $u$ and $v$ of a graph $G$ are adjacent if $E(u, v) \in G$, and are then called the endpoints of the edge, and the edge is incident to them; two edges are incident if they share a vertex.
The degree of a vertex is the number of its adjacent vertices. For $k \in \mathbb{N}$, a graph is $k$-regular if all vertices have degree $k$. More generally, it is $K$-regular, where $K$ is a finite set of integers, if every vertex has degree $k$ for some $k \in K$. Finally, a graph is degree-$k$ if $k$ is the maximum of the degree of all its vertices, i.e., if it is $\{1, \ldots, k\}$-regular. A graph is planar if it can be drawn on the plane without edge crossings, in the standard sense [diestel].
A path of length $n$ in a graph $G$ is a set of edges $E(v_1, v_2), E(v_2, v_3), \ldots, E(v_n, v_{n+1})$ that are all in $G$; the path is simple if all $v_i$’s are distinct. A cycle is a path of length $n \geq 3$ where all vertices are distinct except that $v_1 = v_{n+1}$; a graph is cyclic if it has a cycle. A graph is connected if there is a path from every vertex to every other vertex. A subdivision of a graph $G$ is a graph obtained by replacing each edge by an arbitrary non-empty simple path (every node on this path being fresh except the endpoints of the original edge).
Treewidth and pathwidth
A tree is an acyclic connected graph (remember that graphs are undirected). A tree decomposition of a graph $G$ is a tree $T$ with a labeling function $\lambda$ from its nodes (called bags) to sets of vertices of $G$, ensuring: (i) for every edge $E(u, v) \in G$, there is a bag $b \in T$ such that $\lambda(b)$ contains both $u$ and $v$; (ii) for every node $v$ of $G$, the subtree of $T$ formed of all bags whose image contains $v$ must be connected. The width of $T$ is $\max_{b \in T} |\lambda(b)| - 1$. The treewidth of a graph $G$, denoted $\mathrm{tw}(G)$, is the minimum width of any tree decomposition of $G$.
The treewidth of a relational instance $I$, denoted $\mathrm{tw}(I)$, is defined as usual as the treewidth of its Gaifman graph, namely, the graph on the domain of $I$ that connects any two elements that co-occur in a fact. When the signature is arity-2, we can see an instance $I$ as a labeled graph, and the treewidth of $I$ is then exactly the treewidth of this graph.
A path decomposition is a tree decomposition where $T$ is also a path. The pathwidth of a graph is the minimum width of any path decomposition of the graph. The pathwidth of a relational instance is the pathwidth of its Gaifman graph.
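To make the two conditions of a tree decomposition concrete, here is a minimal Python sketch that builds the Gaifman graph of an instance and checks whether a candidate decomposition is valid. The representation is an assumption made for illustration only: facts are pairs `(relation, args)`, bags are given as a dictionary from bag identifiers to vertex sets, and the function names `gaifman_edges`, `is_tree_decomposition` and `width` are hypothetical. The sketch only verifies a given decomposition; it does not compute the treewidth.

```python
from collections import defaultdict
from itertools import combinations


def gaifman_edges(instance):
    """Edges of the Gaifman graph: pairs of distinct elements co-occurring in a fact."""
    edges = set()
    for _relation, args in instance:
        edges.update(combinations(sorted(set(args)), 2))
    return edges


def _connected(nodes, adjacency):
    """Check that `nodes` induce a connected subgraph of the tree."""
    nodes = set(nodes)
    if not nodes:
        return True
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for neighbor in adjacency[current]:
            if neighbor in nodes and neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == nodes


def is_tree_decomposition(graph_edges, tree_edges, bags):
    """bags: dict bag id -> set of graph vertices; tree_edges: distinct pairs of bag ids."""
    adjacency = defaultdict(set)
    for a, b in tree_edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # The bags must form a tree: connected, with one fewer edge than nodes.
    if len(tree_edges) != len(bags) - 1 or not _connected(bags, adjacency):
        return False
    # Condition (i): every edge is covered by some bag.
    for u, v in graph_edges:
        if not any(u in bag and v in bag for bag in bags.values()):
            return False
    # Condition (ii): the bags containing a given vertex form a connected subtree.
    vertices = {x for edge in graph_edges for x in edge}
    for v in vertices:
        holders = [b for b, bag in bags.items() if v in bag]
        if not _connected(holders, adjacency):
            return False
    return True


def width(bags):
    """Width of a decomposition: largest bag size minus one."""
    return max(len(bag) for bag in bags.values()) - 1


# Illustrative instance with a binary fact R(a, b) and a ternary fact S(b, c, d).
instance = {("R", ("a", "b")), ("S", ("b", "c", "d"))}
edges = gaifman_edges(instance)
good_bags = {0: {"a", "b"}, 1: {"b", "c", "d"}}
bad_bags = {0: {"a", "b"}, 1: {"c", "d"}, 2: {"b", "c", "d"}}
print(is_tree_decomposition(edges, [(0, 1)], good_bags), width(good_bags))
print(is_tree_decomposition(edges, [(0, 1), (1, 2)], bad_bags))
```

In the second candidate, the vertex `b` occurs in bags 0 and 2 but not in bag 1, which lies between them, so condition (ii) fails.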
Queries
A query $q$ on the signature $\sigma$ is a formula in second-order logic over predicates of $\sigma$, with its standard semantics. All queries that we consider have no constants; unless otherwise specified, they are Boolean, i.e., they have no free variable. We write $I \models q$ whenever an instance $I$ satisfies the query $q$. We will be especially interested in the language FO of first-order logical sentences (where second-order quantifications are disallowed) and the language MSO of monadic second-order logical sentences (where the only second-order quantifications are over unary predicates).
We will also consider the language CQ of conjunctive queries, i.e., existentially quantified conjunctions of atoms over the signature; the language CQ$^{\neq}$ of conjunctive queries where additional atoms of the form $x \neq y$ (called disequality atoms) are allowed, where $x$ and $y$ are variables appearing in some regular atom; the language UCQ of unions of conjunctive queries, namely, disjunctions of CQs; the language UCQ$^{\neq}$ of disjunctions of CQ$^{\neq}$ queries. The size $|q|$ of a query $q$ is its total number of atoms, i.e., the sum of the number of atoms in each of its disjuncts.
A homomorphism from a CQ $q$ to an instance $I$ is a mapping $h$ from the variables of $q$ to $\mathrm{dom}(I)$ such that for each atom $R(x_1, \ldots, x_k)$ of $q$ we have $R(h(x_1), \ldots, h(x_k)) \in I$. For CQ$^{\neq}$ queries, we require that $h(x) \neq h(y)$ whenever $q$ contains the disequality atom $x \neq y$. A homomorphism from a UCQ$^{\neq}$ $q$ to $I$ is a homomorphism from some disjunct of $q$ to $I$: it witnesses that $I \models q$. A match of a UCQ$^{\neq}$ $q$ on an instance $I$ is a subset of $I$ which is the image of a homomorphism from $q$ to $I$; a minimal match is a match that is minimal for inclusion.
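The notions of homomorphism, match and minimal match can be illustrated by a naive enumeration. The following Python sketch is only an illustration under assumed conventions: a conjunctive query with disequalities is given as a list of atoms `(relation, variables)` plus a list of disequality pairs, facts are pairs `(relation, args)`, and the function names are hypothetical. It tries every assignment of variables to domain elements, so it is exponential in the number of variables.

```python
from itertools import product


def homomorphisms(atoms, disequalities, instance):
    """Yield every homomorphism (as a dict) from the query to the instance."""
    variables = sorted({x for _rel, args in atoms for x in args})
    domain = sorted({a for _rel, args in instance for a in args})
    for values in product(domain, repeat=len(variables)):
        h = dict(zip(variables, values))
        atoms_ok = all(
            (rel, tuple(h[x] for x in args)) in instance for rel, args in atoms
        )
        diseq_ok = all(h[x] != h[y] for x, y in disequalities)
        if atoms_ok and diseq_ok:
            yield h


def matches(atoms, disequalities, instance):
    """Set of matches: images of the homomorphisms, as sets of facts."""
    return {
        frozenset((rel, tuple(h[x] for x in args)) for rel, args in atoms)
        for h in homomorphisms(atoms, disequalities, instance)
    }


def minimal_matches(atoms, disequalities, instance):
    """Matches that do not strictly contain another match."""
    found = matches(atoms, disequalities, instance)
    return {m for m in found if not any(other < m for other in found)}


# Illustrative query E(x, y), E(y, z), optionally with the disequality x != z.
atoms = [("E", ("x", "y")), ("E", ("y", "z"))]
instance = {("E", ("a", "b")), ("E", ("b", "a")), ("E", ("b", "c"))}
plain = list(homomorphisms(atoms, [], instance))
with_diseq = list(homomorphisms(atoms, [("x", "z")], instance))
print(len(plain), len(with_diseq))
```

On this instance, adding the disequality rules out the two homomorphisms that map `x` and `z` to the same element.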
A query $q$ is monotone if $I \models q$ and $I \subseteq I'$ imply $I' \models q$ for any two instances $I, I'$. A query $q$ is closed under homomorphisms if we have $I' \models q$ whenever $I \models q$ and $I$ has a homomorphism to $I'$, for any $I$ and $I'$. UCQ is an example of a query class that is both monotone and closed under homomorphisms, while UCQ$^{\neq}$ is monotone but not closed under homomorphisms.
3 Problem Statement
We study the problem of probability evaluation:
Definition 3.1.
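Given an instance $I$, a probability valuation is a function $\pi$ that maps each fact of $I$ to a value in $[0, 1]$.
The probability evaluation problem for a query $q$ on a class $\mathcal{I}$ of relational instances asks, given an instance $I \in \mathcal{I}$ and a probability valuation $\pi$ on $I$, what is the probability that $q$ holds according to the probability distribution, i.e., it is the problem of computing $\sum_{I' \subseteq I,\ I' \models q} \prod_{F \in I'} \pi(F) \prod_{F \in I \setminus I'} (1 - \pi(F))$.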
In other words, probability evaluation asks for the probability of $q$ over a TID instance defined by $I$ and $\pi$. Note that we only consider classes of instances with no associated probabilities, and the probability valuation is given as an additional input; it is not indicated in $\mathcal{I}$. The complexity of the probability evaluation problem will always be studied in data complexity: the query and the class are fixed, and the input is the instance and the probability valuation.
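As a point of reference for the definition, the probability can always be computed naively by enumerating all subinstances, at exponential cost. The following Python sketch does exactly this; it is a brute-force baseline under assumed conventions (facts as pairs `(relation, args)`, the query as a Python predicate on sets of facts, the hypothetical names `probability` and `q_rst`), and not the tractable algorithm discussed in this paper.

```python
from fractions import Fraction
from itertools import combinations


def probability(instance, valuation, query):
    """Sum the weights of the subinstances (possible worlds) satisfying the query."""
    facts = list(instance)
    total = Fraction(0)
    for size in range(len(facts) + 1):
        for chosen in combinations(facts, size):
            world = frozenset(chosen)
            if not query(world):
                continue
            # Weight of the world: product over kept facts and dropped facts.
            weight = Fraction(1)
            for fact in facts:
                p = valuation[fact]
                weight *= p if fact in world else 1 - p
            total += weight
    return total


def q_rst(world):
    """Illustrative Boolean CQ: there are x, y with R(x), S(x, y) and T(y)."""
    return any(
        rel == "S" and ("R", (args[0],)) in world and ("T", (args[1],)) in world
        for rel, args in world
    )


instance = {("R", ("a",)), ("S", ("a", "b")), ("T", ("b",))}
valuation = {fact: Fraction(1, 2) for fact in instance}
result = probability(instance, valuation, q_rst)
print(result)
```

Here the only possible world satisfying the query is the full instance, so the result is the product of the three fact probabilities.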
We also explore the problem of computing tractable lineages (or provenance), defined and studied in Section LABEL:sec:lineages onwards.
We rely on results of [amarilli2015provenance] that show the tractability in data complexity of provenance computation and probability evaluation on treelike (i.e., bounded-treewidth) instances. This holds for guarded second-order queries, but as such queries collapse to MSO under bounded treewidth [gradel2002back], we always use MSO queries here. First, [amarilli2015provenance] shows that we can construct Boolean circuits that represent the provenance of MSO queries on treelike instances; we can also construct monotone circuits for monotone queries. The results also apply to other semirings, but this will not be our focus here. Second, [amarilli2015provenance] shows that probability evaluation is then tractable, namely, ra-linear: in linear time up to the (polynomial) cost of arithmetic operations.
Our goal is thus to investigate to what extent we can generalize the following tractability result from [amarilli2015provenance]:
Theorem 3.2 ([amarilli2015provenance]).
For any signature $\sigma$, for any (monotone) MSO query $q$, for any $k \in \mathbb{N}$, there is an algorithm which, given an input instance $I$ of treewidth $\leq k$:

Computes a (monotone) Boolean provenance circuit of $q$ on $I$, in linear time in $|I|$;

Given a probability valuation $\pi$ of $I$, computes the probability of $q$ on $I$, in ra-linear time.
4 Probability Evaluation
This section studies whether we can extend the above tractability result by lifting the bounded-treewidth requirement. We answer in the negative by a dichotomy result on arity-two signatures: there are queries for which probabilistic evaluation is tractable on bounded-treewidth families but is intractable on any efficiently constructible unbounded-treewidth family. A first technical issue is to formalize what we mean by efficiently constructible. We use the following notion:
Definition 4.1.
We call $\mathcal{I}$ treewidth-constructible if for all $k \in \mathbb{N}$, if $\mathcal{I}$ contains instances of treewidth $\geq k$, we can construct one in polynomial time given $k$ written in unary.
In particular, this implies that $\mathcal{I}$ must contain a subfamily of unbounded-treewidth instances that are small, i.e., have size polynomial in their treewidth. We discuss the impact of this choice of definition, and alternate definitions of efficiently constructible instances, in Section 5.
A second technical issue is that we need to restrict to signatures of arity $2$. We will then show our dichotomy for any such signature. This suffices to show that our Theorem 3.2 cannot be generalized: its generalization should apply to any signature, in particular arity-2 ones. Yet, we do not know whether the dichotomy applies to signatures of arity $> 2$.
Our main result on probability evaluation is as follows. In this result, $\mathrm{FP}^{\#\mathrm{P}}$ is the class of function problems which can be solved in PTIME with a deterministic Turing machine having access to a #P-oracle, i.e., an oracle for counting problems that can be expressed as the number of accepting paths for a nondeterministic PTIME Turing machine.
Theorem 4.2.
Let $\sigma$ be an arbitrary arity-2 signature. Let $\mathcal{I}$ be a treewidth-constructible class of $\sigma$-instances. Then the following dichotomy holds:

If there is $k \in \mathbb{N}$ such that $\mathrm{tw}(I) \leq k$ for every $I \in \mathcal{I}$, then for every MSO query $q$, the probability evaluation problem for $q$ on instances of $\mathcal{I}$ is solvable in ra-linear time.

Otherwise, there is an FO query $q_{\mathrm{h}}$ (depending on $\sigma$ but not on $\mathcal{I}$) such that the probability evaluation problem for $q_{\mathrm{h}}$ on $\mathcal{I}$ is $\mathrm{FP}^{\#\mathrm{P}}$-complete under randomized polynomial time (RP) reductions.
The first part of this result is precisely the second point of Theorem 3.2. We thus sketch the proof of the hardness result of the second part. Pay close attention to the statement: while some FO queries (in particular, unsafe CQs [dalvi2012dichotomy]) may have hard probability evaluation when all input instances are allowed, our goal here is to build a query that is hard even when input instances are restricted to arbitrary families satisfying our conditions, a much harder claim.
We reduce from the problem of counting graph matchings, namely, the number of edge subsets of a graph that have no pair of incident edges. This problem is known to be #P-hard on 3-regular planar graphs [xia2007computational]. We define an FO query $q_{\mathrm{h}}$ that tests for matchings on such graphs (encoded in a certain way), and we rely on the connection between probability evaluation and model counting so that the probability of $q_{\mathrm{h}}$ on (an encoding of) a graph reflects its number of matchings.
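The connection between probability evaluation and model counting used here can be seen on a toy example: if every edge is kept independently with probability $1/2$, then the probability that the kept edges form a matching is the number of matchings divided by $2^{|E|}$. The Python sketch below checks this by brute force on the complete graph on four vertices, which is 3-regular and planar. It only illustrates this general connection; it does not reproduce the encoding or the FO query of the proof, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def is_matching(edge_subset):
    """True if no two edges of the subset share an endpoint."""
    used = set()
    for u, v in edge_subset:
        if u in used or v in used:
            return False
        used.update((u, v))
    return True


def edge_subsets(edges):
    """Enumerate all subsets of the edge list."""
    for size in range(len(edges) + 1):
        yield from combinations(edges, size)


def count_matchings(edges):
    return sum(1 for subset in edge_subsets(edges) if is_matching(subset))


def matching_probability(edges):
    """Probability that a uniformly random edge subset is a matching."""
    world_weight = Fraction(1, 2) ** len(edges)
    return sum(world_weight for subset in edge_subsets(edges) if is_matching(subset))


# Complete graph on 4 vertices: 6 edges, every vertex has degree 3.
k4_edges = list(combinations(range(4), 2))
number = count_matchings(k4_edges)
prob = matching_probability(k4_edges)
print(number, prob, prob * 2 ** len(k4_edges))
```

Multiplying the probability by the number of possible worlds recovers the count, which is how a probability evaluation oracle can answer a counting question.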
The main idea is that 3-regular planar graphs can be extracted from our family $\mathcal{I}$, using the following notion:
Definition 4.3.
An embedding of a graph $H$ in a graph $G$ is an injective mapping $f$ from the vertices of $H$ to the vertices of $G$ and a mapping $g$ that maps the edges $(u, v)$ of $H$ to paths in $G$ from $f(u)$ to $f(v)$, with all paths being vertex-disjoint. A graph $H$ is a topological minor of a graph $G$ if there is an embedding of $H$ in $G$.
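As an illustration of this definition, the following Python sketch checks whether a candidate pair of mappings is an embedding. Two points are assumptions of the sketch rather than statements of the definition: paths are given as vertex lists, and "vertex-disjoint" is read as meaning that paths may only share their endpoints, so interior vertices are neither images of vertices of the embedded graph nor shared between paths. The function name `is_embedding` is hypothetical.

```python
def is_embedding(h_edges, g_edges, vertex_map, path_map):
    """Check a candidate embedding of H in G.

    vertex_map: dict from vertices of H to vertices of G.
    path_map: dict from edges (u, v) of H to lists of vertices of G.
    """
    g_edge_set = {frozenset(edge) for edge in g_edges}
    # The vertex mapping must be injective.
    if len(set(vertex_map.values())) != len(vertex_map):
        return False
    branch_vertices = set(vertex_map.values())
    interior_seen = set()
    for u, v in h_edges:
        path = path_map[(u, v)]
        # The path must go from the image of u to the image of v.
        if len(path) < 2 or path[0] != vertex_map[u] or path[-1] != vertex_map[v]:
            return False
        # It must be simple and follow edges of G.
        if len(set(path)) != len(path):
            return False
        if any(frozenset(step) not in g_edge_set for step in zip(path, path[1:])):
            return False
        # Interior vertices must be fresh (assumed reading of vertex-disjointness).
        interior = set(path[1:-1])
        if interior & branch_vertices or interior & interior_seen:
            return False
        interior_seen |= interior
    return True


# H is a triangle; G is a 6-cycle, i.e., a subdivision of the triangle.
triangle = [(1, 2), (2, 3), (1, 3)]
six_cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
vertex_map = {1: 0, 2: 2, 3: 4}
good_paths = {(1, 2): [0, 1, 2], (2, 3): [2, 3, 4], (1, 3): [0, 5, 4]}
bad_paths = {(1, 2): [0, 1, 2], (2, 3): [2, 3, 4], (1, 3): [0, 1, 2, 3, 4]}
print(is_embedding(triangle, six_cycle, vertex_map, good_paths))
print(is_embedding(triangle, six_cycle, vertex_map, bad_paths))
```

The first candidate witnesses that the triangle is a topological minor of the 6-cycle; the second reuses vertices of the other two paths and is rejected.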
We then use the following lemma, which rephrases the recent polynomial bound [chekuri2014polynomial] on Robertson and Seymour’s grid minor theorem [robertson1986graph5] to the realm of topological minors; in so doing, we use the folklore observation that a degree-3 minor of a graph is always a topological minor:
Lemma 4.4.
There is $c \in \mathbb{N}$ such that for any degree-3 planar graph $H$, for any graph $G$ of treewidth $\geq |V(H)|^c$, $H$ is a topological minor of $G$ and an embedding of $H$ in $G$ can be computed in randomized polynomial time in $|G|$.
Hence, intuitively, given an input 3-regular planar graph $G$ (the input to the hard problem), we can extract it in randomized polynomial time (RP) as a topological minor of (the Gaifman graph of) an instance $I$ of our family $\mathcal{I}$ that we obtain using treewidth-constructibility. Once it is extracted, we show that, by choosing the right probability valuation for $I$, the probability of the hard query $q_{\mathrm{h}}$ on $I$ allows us to reconstruct the answer to the original hard problem, namely, the number of matchings of $G$. The minor extraction step is what complicates the design of $q_{\mathrm{h}}$, as this query must then test for matchings in a way which is invariant under subdivisions: this is especially tricky in FO as we can only make local tests.
Choice of hard query
Not only is our hard query $q_{\mathrm{h}}$ independent from the class of instances $\mathcal{I}$, but it is also an FO query, so, in the non-probabilistic setting, its data complexity on any instance is in AC$^0$. In fact, our choice of $q_{\mathrm{h}}$ also has linear-time data complexity: one can determine in linear time in an input instance $I$ whether $I \models q_{\mathrm{h}}$. This contrasts sharply with the $\mathrm{FP}^{\#\mathrm{P}}$-completeness (under RP reductions) of probability evaluation for $q_{\mathrm{h}}$ on any unbounded-treewidth instance class (if it is treewidth-constructible).
The query $q_{\mathrm{h}}$, however, is not monotone. We can alternatively show Theorem 4.2 for an MSO query which is monotone, but not in FO: more specifically, we use a query in C2RPQ$^{\neq}$, the class of conjunctive two-way regular path queries [DBLP:conf/kr/CalvaneseGLV00, DBLP:journals/tcs/CalvaneseGV05] where we additionally allow disequalities between variables.
We will show an analogue of Theorem 4.2 in the setting of tractable lineages in Section 6, which applies to UCQ$^{\neq}$, an even weaker language. We do not know whether Theorem 4.2 itself can be shown with such queries, or with a monotone FO query. However, we know that Theorem 4.2 could not be shown with a query closed under homomorphisms; this is implied by Proposition 6.9.
Providing valuations with the instances
When we fix the instance family $\mathcal{I}$, the probability valuation is not prescribed as part of the family, but can be freely chosen. If the instances of $\mathcal{I}$ were provided with their probability valuations, or if probability valuations were forced to be $1/2$, then it is unlikely that an equivalent to Theorem 4.2 would hold.
Indeed, fix any query $q$ such that, given any instance $I$, it is in #P to count how many subinstances of $I$ satisfy $q$; e.g., let $q$ be a CQ. Consider a family $\mathcal{I}$ of instances with valuations such that there is only one instance in $\mathcal{I}$ per encoding length: e.g., take the class of $R$-grids with probability $1/2$ on each edge, for some binary relation $R$. Consider the problem, given the length of the encoding of an instance $I$ (written in unary), of computing how many subinstances of $I$ satisfy $q$. This problem is in the class #P$_1$ [valiant1979complexity]. Hence, the probability computation problem for $q$ on $\mathcal{I}$ is in $\mathrm{FP}^{\#\mathrm{P}_1}$: rewrite the encoding of the input instance to a word of the same length in a unary alphabet, use the #P$_1$ oracle to compute the number of subinstances, and normalize the result by dividing by the number of possible worlds of $I$.
It thus seems unlikely that probabilistic evaluation of $q$ on $\mathcal{I}$ with its valuations is #P-hard, so that our dichotomy result probably does not adapt if input instance families are provided with their valuations.
5 Non-Probabilistic Evaluation
Theorem 4.2 in Section 4 uses the recent technology of [chekuri2014polynomial] that shows polynomial bounds for the grid minor theorem of [robertson1986graph5]. These improved bounds also yield new results in the non-probabilistic setting. We accordingly study in this section the problem of non-probabilistic query evaluation, again defined in terms of data complexity:
Definition 5.1.
The evaluation problem (or model-checking problem), for a fixed query $q$ on an instance family $\mathcal{I}$, is as follows: given an instance $I \in \mathcal{I}$, decide whether $I \models q$.
Observe that the probability evaluation problem in Section 4 allowed the valuation to set edges to have probability $0$. We could thus restrict to any subinstance of an instance in the class $\mathcal{I}$. In other words, the freedom to choose valuations in probability evaluation gave us at least the possibility of choosing subinstances for non-probabilistic query evaluation. This is why we will study in this section the non-probabilistic query evaluation problem on instance classes $\mathcal{I}$ which are closed under taking subinstances (or subinstance-closed), namely, for any $I \in \mathcal{I}$ and $I' \subseteq I$, we have $I' \in \mathcal{I}$.
As before, we will prove dichotomy results for this problem on unbounded-treewidth instance families, though we will use an MSO query rather than an FO query. We give two phrasings of our results. The first one, in Section 5.1, still requires treewidth-constructibility, and shows hardness for every level of the polynomial hierarchy, again under RP reductions. The second phrasing, in Section 5.2, is inspired by the results of [ganian2014lower], which it generalizes: it relies on complexity assumptions (namely, the non-uniform exponential time hypothesis) but works with a weaker notion of constructibility, namely, it requires treewidth to be strongly unbounded polylogarithmically.
Last, we study in Section 5.3 the problem of match counting in the non-probabilistic setting, for which no analogous results seemed to exist.
As in Section 4, we restrict to signatures of arity 2.
5.1 Hardness Formulation
Our first dichotomy result for non-probabilistic MSO query evaluation is as follows; it is phrased using the notion of treewidth-constructibility. In this result, $\Sigma_i^{\mathrm{P}}$ denotes the complexity class at the $i$-th existential level of the polynomial hierarchy.
Theorem 5.2.
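Let $\sigma$ be an arbitrary arity-2 signature. Let $\mathcal{I}$ be a class of $\sigma$-instances which is treewidth-constructible and subinstance-closed. The following dichotomy holds:

If there exists $k \in \mathbb{N}$ such that $\mathrm{tw}(I) \leq k$ for every $I \in \mathcal{I}$, then for every MSO query $q$, the evaluation problem for $q$ on $\mathcal{I}$ is solvable in linear time.

Otherwise, for each $i \in \mathbb{N}$, there is an MSO query $q_{\mathrm{h}}^{i}$ (depending only on $\sigma$, not on $\mathcal{I}$) such that the evaluation problem for $q_{\mathrm{h}}^{i}$ on $\mathcal{I}$ is $\Sigma_i^{\mathrm{P}}$-hard under RP reductions.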
The upper bound is by Courcelle’s results [courcelle1990monadic, flum2002query], so our contribution is the hardness part, which we now sketch.
The main thing to change relative to the proof of Theorem 4.2 is the hard problems from which we reduce. We use hard problems on planar regular graphs, which we obtain from the alternating coloring problem as in [ganian2014lower, ganian2010there], restricted to such graphs using techniques shown there, plus an additional construction to remove vertex labellings. Here is our formal claim about the existence of such hard problems:
Lemma 5.3.
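For any $i \in \mathbb{N}$, there exists an MSO formula $\psi_i$ on the signature of graphs such that the evaluation of $\psi_i$ on planar regular graphs is $\Sigma_i^{\mathrm{P}}$-hard. Moreover, for any such graph $G$, we have $G \models \psi_i$ iff $G' \models \psi_i$ for any subdivision $G'$ of $G$.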
Hypotheses
Theorem 5.2 relies crucially on the class being subinstance-closed. Otherwise, considering the class of cliques of a single binary relation, this class is clearly unbounded-treewidth and treewidth-constructible, yet it has bounded clique-width so MSO query evaluation has linear data complexity on this class [courcelle2000linear].
Further, the hypothesis of treewidth-constructibility is also crucial. Without this assumption, Proposition 32 of [makowsky2003tree] shows the existence of graph families of unbounded treewidth which are subinstance-closed yet for which MSO query evaluation is in PTIME.
5.2 Alternate Formulation
We now give an alternative phrasing of Theorem 5.2 which connects it to the existing results of [kreutzer2010lower, ganian2014lower]. Table 1 tersely summarizes their results in comparison to our own results and other related results. As [kreutzer2010lower, ganian2014lower] are phrased in terms of graphs, and not arbitrary arity-2 relational instances, we do so as well in this subsection. Before stating our result, we summarize these earlier works to explain how our work relates to them.
[kreutzer2010lower, ganian2014lower] show the intractability of MSO on any subgraph-closed unbounded-treewidth families of graphs, under finer notions than our treewidth-constructibility. Kreutzer and Tazari [kreutzer2010lower] proposed the notion of families of graphs with treewidth strongly unbounded polylogarithmically and showed that MSO$_2$ (MSO with quantifications over both vertex- and edge-sets) over any such graph families is not fixed-parameter tractable in a strong sense (it is not in XP), unless the exponential-time hypothesis (ETH) fails. Ganian et al. [ganian2014lower] proved a related result, introducing the weaker notion of densely unbounded polylogarithmically but requiring graph families to be closed under vertex relabeling; in such a setting, Theorem 4.1 of [ganian2014lower] shows that MSO (with vertex labels) cannot be fixed-parameter quasi-polynomial unless the non-uniform exponential-time hypothesis fails.
These two results of [kreutzer2010lower] and [ganian2014lower] are incomparable: [kreutzer2010lower] requires a stronger unboundedness notion (strongly unbounded vs densely unbounded) and a stronger query language (MSO$_2$ vs MSO), but it does not require vertex relabeling, and makes a weaker complexity theory assumption (ETH vs non-uniform ETH). See the Introduction of [ganian2014lower] for a detailed comparison.
Our Theorem 5.2 uses MSO and no vertex labeling, but it requires treewidthconstructibility, which is stronger than densely/strongly polylogarithmic unboundedness: strongly unboundedness only requires constructibility in and densely unboundedness does not require constructibility at all. The advantage of treewidthconstructibility is that we were able to show hardness of our problem (under RP reductions), without making any complexity assumptions. However, if we make the same complexitytheoretic hypotheses as [ganian2014lower], we now show that we can phrase our results in a similar way to theirs, and thus strengthen them.
We accordingly recall the notion of densely polylogarithmic unboundedness, i.e., Definition 3.3 of [ganian2014lower]:
Definition 5.4.
A graph class has treewidth densely unbounded polylogarithmically if for all , for all , there exists a graph such that and .
We now state our intractability result on densely unbounded polylogarithmically graph classes. It is identical to Theorem 5.5 of [ganian2014lower] but applies to arbitrary MSO formulae, without a need for vertex relabeling: in the result, denotes the polynomial hierarchy. This result answers Conjecture 8.3 of [grohe2008logic] (as we pointed out in the Introduction).
Theorem 5.5.
Unless , there is no graph class satisfying all three properties:

is closed under taking subgraphs;

the treewidth of is densely unbounded polylogarithmically;

the evaluation problem for any MSO query on is quasipolynomial, i.e., in time for , an arbitrary constant , and some computable function .
The proof technique is essentially the same as in [ganian2014lower], up to using the newer results of [chekuri2014polynomial]. It is immediate that an analogous result holds for probabilistic query evaluation, as standard query evaluation obviously reduces to it (take the probability valuation giving probability 1 to each fact).
Upper bounds (Section LABEL:sec:lineages): computation bounds imply size bounds  
Instance  Queries  Representation  Note  Time  Source  
boundedpw  MSO  OBDD  width  here  Thm. 5.13  
boundedpw  (monotone) MSO  (monotone) circuit  boundedpw  here  Prop. LABEL:prp:makecircuitpath  
boundedtw  MSO  OBDD  here  Thm. 5.12  
boundedtw  (monotone) MSO  (monotone) circuit  boundedtw  [amarilli2015provenance]  Thm. 4.2  
boundedtw  MSO  dDNNF  here  Thm. LABEL:thm:makeddnnf  
any  inversionfree UCQ  OBDD  width  [jha2013knowledge]  Prop. 5  
any  positive relational algebra  monotone formula  [DBLP:journals/jacm/ImielinskiL84]  Thm. 7.1  
any  Datalog  monotone circuit  [deutch2014circuits]  Thm. 2  
Lower bounds (Section LABEL:sec:formulae)  
Instance  Queries  Representation  Size  Source  
tree  CQ  formula  here  Prop. LABEL:prp:provlowercq  
tree  CQ  monotone formula  here  Prop. LABEL:prp:provlowercqpos  
tree  MSO  formula  here  Prop. LABEL:prp:provlowertree  
any  Datalog  monotone formula  [deutch2014circuits]  Thm. 1 
5.3 Match Counting
We conclude this section by moving to the problem of match counting, i.e., counting how many assignments satisfy a non-Boolean MSO formula. Match counting should not be confused with model counting (counting how many subinstances satisfy a Boolean formula), which is closely related to probability evaluation.
To our knowledge, no dichotomy-like result on match counting for MSO queries was known. This section shows such a result; as in Section 5.1, we assume treewidth-constructibility, closure under subinstances, and arity-2 signatures.
We define the match counting problem as follows:
Definition 5.6.
The counting problem for an MSO formula (with free secondorder variables) on an instance family is the problem, given an instance , of counting how many vectors of domain subsets are such that satisfies .
The restriction to free secondorder variables is without loss of generality, as free firstorder variables can be rewritten to free secondorder ones, asserting in the formula that they must be interpreted as singletons.
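To make the counting problem concrete, here is a minimal brute-force sketch for a single free second-order variable. It is only an illustration of the definition, exponential in the domain size, and not the algorithm behind the tractability claim. The helper names (`all_subsets`, `count_matches`) and the example formula ("the set is independent in the graph") are assumptions chosen for the illustration.

```python
from itertools import combinations

def all_subsets(domain):
    """Yield every subset of the domain as a frozenset."""
    items = sorted(domain)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def count_matches(domain, predicate):
    """Count the domain subsets X for which predicate(X) holds."""
    return sum(1 for subset in all_subsets(domain) if predicate(subset))

# Hypothetical instance: the path a - b - c, seen as a set of edges.
EDGES = [("a", "b"), ("b", "c")]

def is_independent(subset):
    """Stand-in for an MSO formula with one free set variable:
    no edge has both endpoints in the set."""
    return not any(u in subset and v in subset for u, v in EDGES)

print(count_matches({"a", "b", "c"}, is_independent))
```

On this path the matches are the empty set, the three singletons, and the set containing both endpoints of the path.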
We show the following dichotomy result:
Theorem 5.7.
Let be an arbitrary arity2 signature. Let be a subinstanceclosed and treewidthconstructible class of instances. The following dichotomy holds:

If there exists such that for every , then for every MSO query with free second-order variables, the counting problem for on is solvable in ra-linear time.

Otherwise, there is an MSO query (depending only on , not on ) with one free second order variable such that the counting problem for on is complete under RP reductions.
The first claim is shown in [arnborg1991easy]. The proof of the second claim proceeds as for Theorem 4.2. We reduce from the problem of counting Hamiltonian cycles in planar 3-regular graphs, which is #P-hard [LiskiewiczOT03], and which we express in MSO on the incidence graph of the input graph.
Unlike in Theorem 4.2, the query does not have tractable model checking (as opposed to probability evaluation). We do not know whether we can show a similar result with such a tractable query.
Lineage Upper Bounds
From probability evaluation in Section 4 (and its nonprobabilistic variants in Section 5), we now turn to our second problem: the study of tractable lineage representations.
Indeed, a common way to achieve tractable probability evaluation is to represent the lineage of queries on input instances in a tractable formalism [jha2013knowledge]. This section shows how the tractability of MSO probability evaluation on boundedtreewidth instances can be explained via lineages: Table 2 (upper part) summarizes the upper bounds that we prove.
Intuitively, the lineage of a query on an instance describes how the query depends on the facts of the instance. Formally:
Definition 5.8.
The lineage of a query on an instance is a Boolean function whose variables are the facts of , such that, for any , iff the corresponding valuation makes true. If is monotone, then is a monotone Boolean function, in which case it can equivalently be called the provenance [green2007provenance] of on .
Lineages are related to probability evaluation, because evaluating the probability of query under a probability valuation of instance amounts to evaluating the probability of the lineage , under the corresponding probability valuation on variables. Thus, if we can represent in a formalism that enjoys tractable probability computation, then we can tractably evaluate the probability of on .
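The link between lineage and probability evaluation can be sketched by brute force: sum the weights of the possible worlds on which the lineage is true. This is an exponential-time illustration of the semantics only, not one of the tractable methods studied in this section. The fact names, the probabilities, and the example lineage (a conjunction of three facts) are hypothetical.

```python
from itertools import product

def lineage_probability(facts, prob, lineage):
    """Sum the weights of the possible worlds in which the lineage is true.

    facts:   list of fact names (the variables of the lineage)
    prob:    dict mapping each fact to its independent probability
    lineage: function from a world (dict fact -> bool) to bool
    """
    total = 0.0
    for bits in product([False, True], repeat=len(facts)):
        world = dict(zip(facts, bits))
        weight = 1.0
        for fact in facts:
            weight *= prob[fact] if world[fact] else 1.0 - prob[fact]
        if lineage(world):
            total += weight
    return total

# Hypothetical TID instance with three facts.
FACTS = ["R(a)", "S(a,b)", "T(b)"]
PROB = {"R(a)": 0.5, "S(a,b)": 0.5, "T(b)": 0.8}

def conjunction(world):
    """Example lineage: all three facts must be present."""
    return world["R(a)"] and world["S(a,b)"] and world["T(b)"]

print(lineage_probability(FACTS, PROB, conjunction))
```

Since the facts are independent, the result for the conjunction is the product of the three probabilities.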
In this section, we show that MSO queries on boundedtreewidth instances admit tractable lineage representations in many common formalisms: they have linearsize boundedtreewidth Boolean circuits (as shown in [amarilli2015provenance]), but also have polynomialsize OBDDs [bryant1992symbolic, olteanu2008using] (with a stronger claim for boundedpathwidth), as well as linearsize dDNNFs [darwiche2001tractable]. Further, as we show, all these lineage representations can be efficiently computed. Note that in all these results, as in [amarilli2015provenance], tractability only refers to data complexity, with large constant factors in the query and instance width: we leave to future work the study of query and instance classes for which lineage computation enjoys a lower combined complexity.
After our results on tractable lineage computations in this section, we will investigate in the next section whether we can represent the lineage as Boolean formulae (such as readonce formulae [jha2013knowledge]), and we will show superlinear lower bounds for them. Section 6 will then study in which sense boundedtreewidth is necessary to obtain tractable lineages.
This section applies to signatures of arbitrary arity.
Boundedtreewidth circuits
We first recall our results from [amarilli2015provenance] and introduce a first representation of lineages: Boolean circuits, called provenance circuits in [amarilli2015provenance] and expression DAGs in [jha2012tractability]:
Definition 5.9.
A lineage circuit for query and instance is a Boolean circuit with input gates and with NOT, OR, and AND internal gates, whose inputs are the facts of the database, and which computes the lineage of on . A monotone lineage circuit has no NOT gate. The treewidth and pathwidth of a lineage circuit are that of the circuit’s graph.
As recalled in Theorem 3.2, [amarilli2015provenance] showed that we can compute (monotone) lineage circuits for (monotone) MSO queries on boundedtreewidth instances in linear time. Further, these circuits themselves have boundedtreewidth, which is why probability evaluation is tractable on them, using message passing algorithms [lauritzen1988local]. Hence:
Theorem 5.10 ([amarilli2015provenance], Theorems 4.4 and 5.3).
For any fixed MSO query and constant , given an input instance of treewidth , we can compute in linear time a boundedtreewidth lineage circuit of on .
If is monotone then we can take to be monotone.
We study how to adapt this to other tractable lineage representations.
OBDDs
We start by defining OBDDs, a common tractable representation of Boolean functions [bryant1992symbolic, olteanu2008using]:
Definition 5.11.
An ordered binary decision diagram (or OBDD) is a rooted directed acyclic graph (DAG) whose leaves are labeled 0 or 1, and whose nonleaf nodes are labeled with a variable and have two outgoing edges labeled 0 and 1. We require that there exists a total order on the variables such that, for every path from the root to a leaf, no variable occurs in two different internal nodes on the path, and the order in which the variables occur is compatible with this total order.
An OBDD defines a Boolean function on its variables: each valuation is mapped to the value of the leaf reached from the root by following the path given by the valuation.
The size of an OBDD is its number of nodes, and its width is the maximum number of nodes at any level, where a level is the set of nodes reachable by enumerating all possible values of the variables in a prefix of the variable order.
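As an illustration of this definition, the following sketch encodes a small OBDD as a dictionary, evaluates it on a valuation, and computes its probability in one memoized bottom-up pass, which is why probability evaluation is linear in the OBDD size. The encoding (node names, tuple layout) and the example function, the disjunction of two variables, are assumptions made for the example.

```python
from functools import lru_cache

# node id -> (variable, child when the variable is false, child when it is true)
# Leaves are the Python constants False and True.
OBDD = {
    "n1": ("x1", "n2", True),
    "n2": ("x2", False, True),
}
ROOT = "n1"

def evaluate(obdd, root, valuation):
    """Follow the path selected by the valuation down to a leaf."""
    node = root
    while not isinstance(node, bool):
        var, low, high = obdd[node]
        node = high if valuation[var] else low
    return node

def probability(obdd, root, prob):
    """Probability that the OBDD evaluates to true under independent variables."""
    @lru_cache(maxsize=None)
    def visit(node):
        if isinstance(node, bool):
            return 1.0 if node else 0.0
        var, low, high = obdd[node]
        # The two branches are disjoint events, so their weights add up.
        return (1.0 - prob[var]) * visit(low) + prob[var] * visit(high)
    return visit(root)

print(evaluate(OBDD, ROOT, {"x1": False, "x2": True}))
print(probability(OBDD, ROOT, {"x1": 0.5, "x2": 0.5}))
```

The memoization ensures that each node is processed once even when it is shared between several paths.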
Probability evaluation for OBDDs is tractable [olteanu2008using]. Our result is that we can compute polynomialsize OBDDs for MSO queries on boundedtreewidth instances in PTIME:
Theorem 5.12.
For any fixed MSO query and constant , there is such that, given an input instance of treewidth , one can compute in time an OBDD (of size ) for the lineage of on .
We show this using Corollary 2.14 of [jha2012tractability]: any boundedtreewidth Boolean circuit can be represented by an equivalent OBDD of polynomial width. We complete this result and show that the OBDD can also be computed in polynomial time, which clearly implies Theorem 5.12 (using Theorem 5.10):
Lemma (lem:makeobddtw).
For any , there is such that, given a Boolean circuit of treewidth , we can compute an equivalent OBDD in time .
The variable order is that of [jha2012tractability] and can be constructed in PTIME. We then show that we can build the polynomialsize OBDD level by level, by testing the equivalence of partial valuations in PTIME. We do it using message passing, thanks to the tree decomposition of . As in [jha2012tractability], is doubly exponential in .
Proof.
We rely on Corollary 2.14 of [jha2012tractability]: there is a doubly exponential function such that, for any , there is such that, for any tree decomposition of width of , the OBDD obtained for a certain variable order has width .
The order on variables is defined following an inorder traversal of where children are ordered by the number of variables in this subtree; clearly this quantity can be computed over the entire tree in PTIME, so the order can be computed in PTIME. We show that we can construct the OBDD in PTIME as well, in a levelwise manner inspired by [jha2013knowledge].
Write , and construct levelbylevel in the following way. Assuming that we have constructed up to level , create two children for each node at level (depending on the value of variable ), and then merge all such children and that are equivalent. To define this, call equivalent two partial valuations and of variables if the Boolean function represented by on the other variables under is the same as under . Now, call and equivalent if, for any partial valuation leading to (represented by a path from the root of to ) and any partial valuation leading to , these two partial valuations of are equivalent. As we will always ensure in the construction, any paths leading to the parent node of are equivalent partial valuations of , so and are equivalent iff, picking any two valuations for and for by following a path from the root to and to respectively, and are equivalent.
Hence, it suffices to show that there is a function such that we can test in time whether two partial valuations are equivalent. Indeed, we can then build in the indicated time, because the maximal number of node pairs to test at any level of the OBDD is : we had at most at the previous level, and each of them creates two children, before we merge the equivalent children. Hence, if we can test equivalence in the indicated time, then clearly we can construct in time for (the first term accounts for the linear number of levels, and the second term accounts for the linear time required to find a partial valuation for a node).
We thus show that the equivalence of partial valuations can be tested in time for some function . Considering two partial valuations and of the same set of variables , let and be the two circuits obtained from by substituting the input gates for with constant gates according to and respectively. Note that and have the same set of input gates , formed precisely of the variables not in . We rename the internal gates of so that the only gates shared between and are the input gates . Now, let be the circuit obtained by taking the union of and (on the same set of variables), and adding an output gate and a constant number of gates such that the output gate is true iff the output gates of and carry different values (this can be done with 5 additional gates in total). It is easy to see that there is a valuation of that makes the circuit evaluate to true iff the partial valuations and are not equivalent. Now, observe that we can immediately construct from a tree decomposition of width of . Indeed, it is obvious that is a tree decomposition of , and we can rename gates to obtain from a tree decomposition of , such that and both have the same width and the same skeleton. Now, construct that has the same skeleton as and where each bag is the union of the corresponding bags of and , adding the 5 intermediate gates to each bag. The result clearly has width and it is immediate that it is a tree decomposition of .
We can then use messagepassing techniques [lauritzen1988local, huang1996inference] to determine in time exponential in and polynomial in whether the boundedtreewidth circuit has a satisfying assignment, from which we deduce whether and are equivalent. For details, see, e.g., Theorem D.2 of [amarilli2015provenance]. ∎
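The level-by-level construction used in this proof can be sketched as follows. For simplicity, the sketch tests the equivalence of two partial valuations by comparing full truth tables on the remaining variables, which is exponential; the proof replaces this test by message passing on a bounded-treewidth circuit. The function names and the example function (parity) are assumptions made for the illustration.

```python
from itertools import product

def cofactor_table(func, order, prefix_values):
    """Truth table of func on the remaining variables once the prefix is fixed."""
    rest = order[len(prefix_values):]
    fixed = dict(zip(order, prefix_values))
    table = []
    for bits in product([False, True], repeat=len(rest)):
        valuation = dict(fixed)
        valuation.update(zip(rest, bits))
        table.append(func(valuation))
    return tuple(table)

def obdd_level_widths(func, order):
    """Build the levels one by one, merging equivalent partial valuations.

    Each level maps a subfunction (its truth table) to one representative
    prefix, so the number of entries is the number of nodes at that level.
    """
    level = {cofactor_table(func, order, ()): ()}
    widths = [1]
    for _ in order:
        next_level = {}
        for prefix in level.values():
            for bit in (False, True):
                child = prefix + (bit,)
                signature = cofactor_table(func, order, child)
                # Equivalent children share a signature and are merged here.
                next_level.setdefault(signature, child)
        level = next_level
        widths.append(len(level))
    return widths

def parity(valuation):
    """Example function: true iff an odd number of variables are true."""
    return sum(valuation.values()) % 2 == 1

print(obdd_level_widths(parity, ["x1", "x2", "x3"]))
```

As in the proof, only one representative prefix per node is kept, so each level requires at most twice as many equivalence tests as the previous level has nodes.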
Boundedpathwidth
We have explained the tractability of MSO probability evaluation on boundedtreewidth instances, showing that we could compute boundedtreewidth lineage circuits for them. We strengthen these results in the case of boundedpathwidth instances, showing that we can compute constantwidth OBDDs:
Theorem 5.13.
For any fixed MSO query and constant , given an input instance of pathwidth , one can compute in polynomial time an OBDD of constant width for the lineage of on .
To prove the result, we first observe (adapting [amarilli2015provenance]) that we can compute boundedpathwidth lineage circuits in linear time on boundedpathwidth instances:
Proposition (prp:makecircuitpath).
For any fixed and (monotone) MSO query , for any instance of pathwidth , we can construct a (monotone) lineage circuit of on in time . The pathwidth of only depends on and (not on ).
Proof.
Given a path decomposition of an instance , which is a tree decomposition with a linear tree, the resulting tree encoding of (see [amarilli2015provenance, amarilli2015provenanceb]) is clearly also a linear tree. From the proof of Theorem 4.4 of [amarilli2015provenanceb], we observe that the lineage circuit that we construct has a tree decomposition which can be made to be a path decomposition in this case, because it follows the structure of . Hence, the circuit has bounded pathwidth. ∎
By Corollary 2.13 of [jha2012tractability], this implies the existence of a constantwidth OBDD representation, which we again show to be computable, proving Theorem 5.13.
Lemma (lem:makeobddpw).
For any , for any Boolean circuit of pathwidth , we can compute in polynomial time in an OBDD equivalent to whose width depends only on .
Proof.
As in the proof of Lemma LABEL:lem:makeobddtw, we can compute in PTIME the order on variables, and we can compute the OBDD under this order in the same way. This uses the fact that a path decomposition of circuit is in particular a tree decomposition of . ∎
dDNNFs
We now turn to the more expressive tractable lineage formalism of dDNNFs, introduced in [darwiche2001tractable]; we follow the definitions of [jha2013knowledge]:
Definition 5.14.
A deterministic, decomposable negation normal form (dDNNF) is a Boolean circuit that satisfies the following conditions:

Negation is only applied to input gates: the input of any NOT gate must always be an input gate.

The inputs of ANDgates depend on disjoint sets of input gates. Formally, for any ANDgate , for any two gates which are inputs of , there is no input gate which is reachable (as a possibly indirect input) from both and .

The inputs of ORgates are mutually exclusive. Formally, for any ORgate , for any two gates which are inputs of , there is no valuation of the inputs of under which and both evaluate to true.
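The three conditions of this definition are what makes probability evaluation a single bottom-up pass: decomposability lets AND gates multiply, determinism lets OR gates add, and negations only touch inputs. The sketch below illustrates this on a nested-tuple encoding of circuits, which is an assumption of the example (it does not share subcircuits, so a real implementation would memoize gates). The example circuit encodes the exclusive OR of two variables.

```python
def ddnnf_probability(gate, prob):
    """Probability of a dDNNF given independent input probabilities.

    gate is a nested tuple: ("var", name), ("not", gate),
    ("and", gate, ...), or ("or", gate, ...).
    """
    kind = gate[0]
    if kind == "var":
        return prob[gate[1]]
    if kind == "not":
        # Negation is only applied to input gates in a dDNNF.
        return 1.0 - ddnnf_probability(gate[1], prob)
    if kind == "and":
        # Decomposability: the children depend on disjoint inputs,
        # so they are independent and their probabilities multiply.
        result = 1.0
        for child in gate[1:]:
            result *= ddnnf_probability(child, prob)
        return result
    if kind == "or":
        # Determinism: the children are mutually exclusive,
        # so their probabilities add up.
        return sum(ddnnf_probability(child, prob) for child in gate[1:])
    raise ValueError(f"unknown gate kind: {kind}")

# Hypothetical circuit: (x AND NOT y) OR (NOT x AND y).
XOR = ("or",
       ("and", ("var", "x"), ("not", ("var", "y"))),
       ("and", ("not", ("var", "x")), ("var", "y")))

print(ddnnf_probability(XOR, {"x": 0.3, "y": 0.6}))
```

Applying the same recursion to a circuit that violates determinism or decomposability would in general give a wrong value, which is why the conditions matter.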
It is tractable to evaluate the probability of a dDNNF [darwiche2001tractable], and dDNNFs capture the tractability of probability evaluation for many safe queries (see [jha2013knowledge]). We show that they also explain the ra-linearity of MSO probability evaluation on bounded-treewidth instances, as we can construct linear-size dDNNFs for them:
Theorem (thm:makeddnnf).
For any fixed MSO query and constant , given an input instance of treewidth , one can compute in time a dDNNF representation of the lineage of on .
Our construction in [amarilli2015provenance] applies a tree automaton for the query to an annotated tree encoding of the instance. This yields a boundedtreewidth circuit representation of the lineage, in linear time in the instance (but with a constant factor that is nonelementary in the query). We show that, if the automaton is deterministic, the circuit that we obtain is already a dDNNF. The result follows, as one can always make a tree automaton deterministic [tata], at the cost of an increased constant factor in the data complexity.
Proof.
We define a bottomup deterministic tree automaton on alphabet (or bDTA) in the standard manner. We start by adapting the proof of Proposition 3.1 of [amarilli2015provenanceb] to show the following result instead: a provenance dDNNF of a deterministic bDTA on a tree can be constructed in time . We construct the circuit exactly as in the proof of Proposition 3.1 of [amarilli2015provenanceb] and show that it is a dDNNF.
First, observe that the only NOT gates that we use are the , which are NOT gates of the , which are input gates; so we only apply negation to leaf nodes.
Second, we show that the sets of leaves reachable from the children of any AND gate are pairwise disjoint. The AND gates that we create and that have multiple inputs are:

The , which are the AND of and ; now, only depends on the input gates for nodes of the subtree of rooted at , and likewise only depends on input gates in the right subtree;

The , which are the AND of and ; now, the do not depend on , only on input gates for a strict descendant of in ;

The , which are the AND of and ; now, the do not depend on the sole input gate under , i.e., , but only on input gates for a strict descendant of in .
Third, we show that the children of any OR gate are mutually exclusive. The OR gates that we create and that have multiple inputs are the following:

The when is a leaf node of , for which the claim is immediate, as the only two possible children are and which are clearly mutually exclusive.

The when is an internal node of , which are the OR of gates of the form or over several pairs .
To observe that these gates are mutually exclusive, remember that, for a valuation of the tree , the gate is true iff there is a run of on the subtree of rooted at such that . However, as is deterministic, for each , there is at most one state for which this is possible. Hence, for any valuation of the circuit , for our node , there is at most one such that is true under valuation , and only at most one such that is true under . Hence, by definition of the , there is at most one of them which can be true under valuation , namely, , which also means that only the gate and the gate can be true under . But these two gates are clearly mutually exclusive (only one can evaluate to true, depending on the value of ), which proves the claim.

The output gate which is the OR of gates of the form for the root node of . Again, as is deterministic, for any valuation of , letting be the corresponding valuation of the tree , there is only one state such that has a run on with , so at most one state such that is true under .
Hence, the circuit constructed in the proof of Proposition 3.1 of [amarilli2015provenance] is a dDNNF representation of the lineage of the automaton on the tree which has linear size.
We now adapt the proof of Theorem 4.2 of [amarilli2015provenanceb]. The theorem proceeds by constructing a bNTA for the query [courcelle1990monadic] on the alphabet and modifying it to obtain a bNTA on . We now additionally convert to a bDTA on the same alphabet, which we can do using standard techniques [tata]. All of this is performed independently of the instance.
Now, we conclude using the rest of the proof of Theorem 4.2 of [amarilli2015provenance]. The resulting circuit is the result of (bijectively) renaming the input gates, and replacing some input gates by constant gates, on the circuit produced by Proposition 3.1 of [amarilli2015provenance]. However, by our previous observation, is actually a dDNNF circuit, so also is (up to evaluating negations of constant gates as constant gates). Hence, we have produced the desired dDNNF, which by the statement of Theorem 4.2 of [amarilli2015provenance] is of linear size and is computed in linear time. ∎
Formula Lower Bounds
We have shown that MSO queries on bounded-treewidth instances have tractable lineages, and even linear-size ones (bounded-treewidth circuits and dDNNFs). All lineage representations that we have studied, however, are based on DAGs (circuits or OBDDs). In this section, we study whether we could obtain linear lineage representations as Boolean formulae, e.g., as read-once formulae [jha2013knowledge].
This question is natural because existing work on probabilistic data seldom represents query lineages as Boolean circuits: they tend to use Boolean formulae, or other representations such as OBDDs, FBDDs and dDNNFs [jha2013knowledge]. Further, most prior works on provenance focus on formula representations of provenance (with the notable exception of [deutch2014circuits], see below).
Intuitively, an important difference between formula and circuit representations is that circuits can share common subformulae whereas formulae cannot. We show in this section that this difference matters: formula-based representations of the lineage of MSO queries on bounded-treewidth instances cannot be linear in general, because of superlinear lower bounds. More specifically, we show that formula-based representations are superlinear even for some $\mathrm{CQ}^{\neq}$ queries, and that they are quadratic for some MSO queries.
A similar result was already known for lineage representations ([deutch2014circuits], Theorem 1), which showed that circuit representations of provenance can be more concise than formulae; but this result applies to arbitrary instances, not boundedtreewidth ones. Hence, this section also sheds additional light on the compactness gain offered by the recent circuit representations of provenance in [deutch2014circuits].
Our results in this section rely on classical lower bounds on the size of formulae expressing certain Boolean functions [wegener1987complexity]. They apply to signatures of arbitrary arity. We summarize them in the lower part of Table 2.
$\mathrm{CQ}^{\neq}$ queries
We first show a mild conciseness gap for the comparatively simple language of $\mathrm{CQ}^{\neq}$ (conjunctive queries with disequalities). Formally, we exhibit a query whose lineage cannot be represented by a linear-size formula, even on treelike instance families. By contrast, from the previous section, we know that its lineage has linear-size circuit representations.
Proposition (prp:provlowercq).
There is a query , and a family of relational instances with treewidth , such that, for any , for any Boolean formula capturing the lineage of on , we have .
We show we can express the threshold function over variables, for which [wegener1987complexity] gives a lower bound on the formula size.
Proof.
Consider the signature with a single unary predicate , and consider the . Consider the family of instances defined as for all . Clearly, for any , the lineage of on is the threshold function checking whether at least two of its inputs are true.
By Chapter 8, Theorem 5.2 of [wegener1987complexity], any formula using , , expressing the threshold function over variables has size , the desired bound. ∎
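The claim that the lineage is the threshold function can be checked by brute force on small instances. The sketch below assumes that the query asks for two distinct elements that both carry the unary predicate, which is our reading of the proof since the query itself is not legible in this text; the element names and function names are hypothetical.

```python
from itertools import product

def query_holds(world):
    """Assumed query: two distinct elements both carry the unary predicate.

    world maps each element to True iff its fact is kept in the possible world.
    """
    present = [element for element, kept in world.items() if kept]
    return any(x != y for x in present for y in present)

def threshold_2(world):
    """Threshold function: at least two inputs are true."""
    return sum(world.values()) >= 2

def lineage_is_threshold(n):
    """Compare the query and the threshold function on all 2^n possible worlds."""
    elements = [f"a{i}" for i in range(1, n + 1)]
    for bits in product([False, True], repeat=n):
        world = dict(zip(elements, bits))
        if query_holds(world) != threshold_2(world):
            return False
    return True

print(all(lineage_is_threshold(n) for n in range(1, 8)))
```

The check only confirms the identification of the lineage with the threshold function; the lower bound itself comes from [wegener1987complexity].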
As $\mathrm{CQ}^{\neq}$ queries are monotone, we can also ask for monotone lineage representations. From the previous section, we still have linear-size representations of the lineage as a monotone circuit. By contrast, if we restrict to monotone Boolean formulae, we obtain an improved lower bound:
Proposition (prp:provlowercqpos).
There is a query , and a family of relational instances with treewidth , such that, for any , for any monotone Boolean formula capturing the lineage of on , we have .
Proof.
We use the same proof as for Proposition LABEL:prp:provlowercq but relying on [hansel1964nombre] (also Chapter 8, Theorem 1.2 of [wegener1987complexity]), which shows that, for built over the monotone basis and , we have . ∎
MSO queries
For the more expressive language of MSO queries, we show a wider gap: there is a query for which formulabased lineage representations must be quadratic, whereas we know that there are linear circuit representations of the lineage:
Proposition (prp:provlowertree).
There is an MSO query , and a family of relational instances with treewidth , such that, for any , for any Boolean formula capturing the lineage of on , we have .
The proof is more subtle and constructs an MSO formula expressing the parity of the number of facts of a unary predicate, using a second auxiliary relation. The lower bound for parity comes again from [wegener1987complexity].
We leave open the question of whether this bound can be further improved, but we note that even an lower bound would require new developments in the study of Boolean formulae. Indeed, the best currently known lower bound on formula size, for any explicit function of variables with linear circuits, is in [hastad1998shrinkage].
Proof.
Consider the signature with a unary predicate and binary predicate . We define the family with having domain and facts for each and for each . Clearly, all instances in have treewidth . We consider the MSO formula that intuitively uses the facts to test whether the number of facts is odd. Formally, we define as follows, inspired by the definition of an automaton:
where asserts that and partition the domain (where denotes exclusive OR):
is the conjunction of the following transition rules, for each in :
and asserts the initial states:
Intuitively, on any possible world of where all facts are present, is true whenever the number of facts is odd. Indeed, it is clear that there is a unique choice of and in such worlds, defined by putting in or depending on whether holds, and, for , letting such that and be or depending on whether holds or not, putting in . Hence, in worlds containing all facts, there is a unique choice of and where we have iff the number of facts with is odd. Hence, is satisfied iff, in this unique assignment, (the only node with no incoming edge) is in , that is, if the overall number of facts is odd.
Hence, pick and let be a formula representation of the lineage of on . Replacing the input gates for the facts by constant gates, we obtain a formula of the same size that computes the parity function of the inputs corresponding to the gates, the number of which is .
Now, by Theorem 8.2, Chapter 8 of [wegener1987complexity], any formula using , , expressing the parity function over variables has a number of variable occurrences that is at least . Hence, we deduce that , as claimed. ∎
6 OBDD Size Bounds
We have shown in the previous sections that MSO queries on boundedtreewidth instances have tractable lineage representations as circuits and OBDDs. This section focuses on OBDDs and shows our second main dichotomy result: boundedtreewidth is necessary for MSO query lineages to have polynomial OBDDs.
We first state this result in Section 6.1. Its upper bound is Theorem 5.12, and its lower bound applies to a specific (which only depends on the signature). We show that has no polynomialwidth OBDDs on any arity2 instance family with treewidth densely unbounded polylogarithmically. This second dichotomy result thus shows that boundedtreewidth is necessary for some queries to have tractable OBDDs; it applies to a more restricted class than the FO query of our first main dichotomy result (Theorem 4.2), but applies to a different task (the computation of OBDD lineages, rather than probability evaluation).
We then study in Section 6.2 the language of connected . For this language, we show that queries can be classified in a metadichotomy result: we characterize the intricate queries, such as , which have no polynomial OBDDs on any unbounded treewidth family in the sense above; and we show that nonintricate queries actually have constantwidth OBDDs on some wellchosen unboundedtreewidth instance family. Hence, if a connected has polynomial OBDDs on some unboundedtreewidth instance family, then it must have constantwidth OBDDs on some other such family.
Finally, we investigate in Section 6.3 whether our second dichotomy result holds for more restricted fragments than . First, we show that connected queries are never intricate, so we cannot show our dichotomy result with such queries. Second, we show the same for connected ; in fact, we show that no query closed under homomorphisms could be used. We last show that our metadichotomy fails for disconnected queries.
6.1 A Dichotomy on OBDD Size
This section shows that our Theorem 5.12 on the existence of tractable lineage representations as OBDDs is unlikely to extend to milder conditions than boundedtreewidth. Indeed, there are even queries that have no polynomialwidth OBDDs on any unboundedtreewidth input instance with treewidth densely unbounded polylogarithmically, again on aritytwo signatures. Here is our second main dichotomy result which shows this:
Theorem 6.1.
There exists a constant such that the following holds. Let be an arbitrary arity2 signature and be a class of instances. Assume there is a function such that, for all , if contains instances of treewidth , one of them has size . We have the following dichotomy:

If there is such that for every , then for every MSO query , an OBDD of on can be computed in time polynomial in .

Otherwise, there is a query (depending on but not on ) such that the width of any OBDD of on cannot be bounded by any polynomial in .
This does not require treewidth-constructibility, and imposes instead a slight weakening
The first part of the theorem follows from Theorem 5.12, so we sketch the proof of the second part. Our choice of intuitively tests the existence of a path of length 2 in the Gaifman graph of the instance, i.e., a violation of the fact that the possible world is a matching of the original instance. Again, while we know that probability evaluation for is hard if we allow arbitrary input instances (as counting matchings reduces to it), our task is to show that has no polynomial-width OBDDs when restricting to any instance family that satisfies the conditions, which is a much harder task.
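The link between this query and matchings can be illustrated on a toy instance. The following is only a minimal sketch under stated assumptions: facts are simplified to undirected edges given as pairs of distinct endpoints, every fact has probability 1/2, and the function names are hypothetical. The query is false on a possible world exactly when that world is a matching, so its probability is one minus the fraction of sub-instances that are matchings.

```python
from itertools import combinations

def violates_matching(world):
    """True if two distinct edges of `world` share an endpoint
    (assumption: edges are undirected pairs of distinct endpoints)."""
    return any(set(e) & set(f) for e, f in combinations(world, 2))

def count_matchings(edges):
    """Count the possible worlds (subsets of `edges`) on which the query is false."""
    total = 0
    for size in range(len(edges) + 1):
        for world in combinations(edges, size):
            if not violates_matching(world):
                total += 1
    return total

def query_probability(edges):
    """Probability that the query holds when each edge is kept with probability 1/2."""
    return 1 - count_matchings(edges) / 2 ** len(edges)

# A path a-b-c-d with three edges.
path = [("a", "b"), ("b", "c"), ("c", "d")]
print(count_matchings(path))    # 5
print(query_probability(path))  # 0.375
```

This brute-force enumeration is exponential in the number of facts and is only meant to make the reduction concrete on small inputs.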
To show this, we draw a link between treewidth and OBDD width for on individual instances, with the following result (which is specific to ):
Lemma 6.2.
Let be an arity-2 signature. There exist constants such that for any instance on of treewidth , the width of an OBDD for on is .
The proof is technical and uses Lemma 4.4 to extract high-treewidth topological minors of a specific shape: skewed grids. We then show that any variable order that enumerates the edges of the skewed grid must have a prefix that shatters the grid, i.e., sufficiently many independent grid nodes have both an enumerated incident edge and a non-enumerated one. This forces any OBDD for to remember the exact configurations of the enumerated edges at the level for this shattering prefix of its variable order. Thus, the OBDD has superpolynomial width. We formalize this via the structure of the prime implicants of the lineage.
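The idea that an OBDD must "remember" configurations of already-enumerated edges can be explored by brute force on tiny instances. The sketch below is not the construction of the proof: it assumes undirected edges, takes the lineage to be "two kept edges share an endpoint", and counts, for each prefix length of a given variable order, the number of distinct subfunctions obtained by fixing the prefix. This count is the number of nodes a complete (non-skipping) OBDD needs at that level; all function names are hypothetical.

```python
from itertools import combinations, product

def has_length2_path(world):
    """Lineage of the toy query: True if two kept edges share an endpoint."""
    return any(set(e) & set(f) for e, f in combinations(world, 2))

def level_widths(edges, lineage):
    """For each prefix length i of the order `edges`, count the distinct
    subfunctions of the remaining variables over all prefix assignments."""
    n = len(edges)
    widths = []
    for i in range(n + 1):
        cofactors = set()
        for prefix in product([False, True], repeat=i):
            # Truth table of the subfunction obtained by fixing the prefix.
            table = tuple(
                lineage([e for e, keep in zip(edges, prefix + suffix) if keep])
                for suffix in product([False, True], repeat=n - i)
            )
            cofactors.add(table)
        widths.append(len(cofactors))
    return widths

path = [("a", "b"), ("b", "c"), ("c", "d")]
print(level_widths(path, has_length2_path))  # [1, 2, 3, 2]
```

On a path the counts stay small because only the last enumerated edge matters; the shattering argument shows that on skewed grids no variable order can achieve this.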
6.2 A Meta-Dichotomy for UCQ
To which queries does Theorem 6.1 adapt? It does not extend to all unsafe [dalvi2012dichotomy] queries, as a query may be unsafe and still be tractable on some unbounded-treewidth instance family: for instance, the standard unsafe query from [dalvi2007efficient] has trivial OBDDs on the family of grids without unary relations.
We answer this question, again on arity-2 signatures, by introducing a notion of intricate queries. We show that it precisely characterizes the connected queries for which the dichotomy of Theorem 6.1 applies. Let us first recall the definition of connected queries:
Definition 6.3.
A is connected if, building the graph on its atoms that connects those that share a variable (ignoring atoms), is connected (in particular it has no isolated vertices, unless it consists of a single isolated vertex). A is connected if all its disjuncts are connected.
We now give our definition of intricate queries. We characterize them by looking at line instances:
Definition 6.4.
A line instance is an instance of the following form: a domain , and, for , one single binary fact between and : either for some or for some binary . (Recall that, as is arity-two, its maximal arity is two, so it must include at least one binary relation.)
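To make the definition concrete, line instances over a small domain can be enumerated mechanically. The sketch below is an illustration under assumptions: domain elements are represented by the integers 1 to n, a fact is a hypothetical triple (relation name, source, target), and each consecutive pair of elements carries exactly one binary fact in one of the two directions.

```python
from itertools import product

def line_instances(n, relations):
    """Yield every line instance on elements 1..n as a list of
    (relation, source, target) facts, one fact per consecutive pair."""
    choices = [(rel, direction) for rel in relations for direction in ("fwd", "bwd")]
    for combo in product(choices, repeat=n - 1):
        facts = []
        for i, (rel, direction) in enumerate(combo, start=1):
            if direction == "fwd":
                facts.append((rel, i, i + 1))
            else:
                facts.append((rel, i + 1, i))
        yield facts

for instance in line_instances(3, ["R"]):
    print(instance)
```

With r binary relations there are (2r)^(n-1) line instances on n elements, which is why enumeration-based checks are only practical for short lines.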
The intuition is that a query is intricate if, on any sufficiently long line instance, it must have a minimal match that contains the two middle facts (i.e., the ones that are incident to the middle element). Here is the formal definition of intricate queries:
Definition 6.5.
A is intricate for if, for every line instance with , letting and be the two facts of incident to the middle element , there is a minimal match of on that includes both and .
We call intricate if it is intricate.
Observe that queries with clearly cannot be intricate. Further, if a query has no matches that include only binary facts, then it cannot be intricate; in other words, any disjunct that contains an atom for a unary relation can be ignored when determining whether a query is intricate. By contrast, our query of Theorem 6.1 was designed to be intricate, in fact is intricate. Also note that an intricate query is always intricate for any : consider the restriction of any line instance of size to a line instance of size , and find a match in the restriction.
We note that we can decide whether queries are intricate or not, by enumerating line instances. We do not know the precise complexity of this task:
Lemma 6.6.
Given a connected , we can decide in PSPACE whether is intricate.
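The enumeration of line instances suggests a naive check, sketched below for a fixed line length. This is not the PSPACE procedure of the lemma, only an exponential-time illustration; the size convention of Definition 6.5 is not reproduced, and the parameter `num_facts` (an even number of facts, so that a middle element exists), the triple encoding of facts, and the two example match functions are all assumptions made for this sketch.

```python
from itertools import combinations, product

def line_instances(num_facts, relations):
    """Yield line instances with `num_facts` facts on elements 1..num_facts+1."""
    choices = [(rel, fwd) for rel in relations for fwd in (True, False)]
    for combo in product(choices, repeat=num_facts):
        yield [(rel, i, i + 1) if fwd else (rel, i + 1, i)
               for i, (rel, fwd) in enumerate(combo, start=1)]

def incident_pair_matches(facts):
    """Minimal matches of a toy query asking for two facts sharing an element."""
    return [frozenset((f, g)) for f, g in combinations(facts, 2)
            if set(f[1:]) & set(g[1:])]

def directed_path_matches(facts):
    """Minimal matches of a toy query asking for a directed path of two facts."""
    return [frozenset((f, g)) for f in facts for g in facts
            if f is not g and f[2] == g[1]]

def is_intricate_for(num_facts, relations, minimal_matches):
    """Check that on every line instance of this length, some minimal match
    contains both facts incident to the middle element."""
    mid = num_facts // 2  # facts[mid - 1] and facts[mid] touch the middle element
    for facts in line_instances(num_facts, relations):
        left, right = facts[mid - 1], facts[mid]
        if not any(left in m and right in m for m in minimal_matches(facts)):
            return False
    return True

print(is_intricate_for(4, ["R"], incident_pair_matches))  # True
print(is_intricate_for(4, ["R"], directed_path_matches))  # False
```

The second query fails the check because a line instance whose facts alternate in direction contains no directed path of two facts.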
We can now state our meta-dichotomy: a dichotomy such as Theorem 6.1 holds for a connected if and only if it is intricate. Further, non-intricate queries must actually have constant-width OBDDs on some counterexample unbounded-treewidth family:
Theorem 6.7.
For any connected on an arity-2 signature:

If is not intricate, there is a treewidth-constructible and unbounded-treewidth family of instances such that has constant-width OBDDs on ; the OBDDs can be computed in PTIME from the input instance.

If is intricate, then Theorem 6.1 applies to : in particular, for any unbounded-treewidth family of instances satisfying the hypotheses, does not have polynomial-width OBDDs on .
We construct for non-intricate as a family of grids from a line instance which is a counterexample to intricacy. As we can disconnect facts that do not co-occur in a match, we can disconnect the grids to bounded-pathwidth instances in a lineage-preserving fashion.
Conversely, we adapt the hardness proof of Theorem 6.1 to any intricate query , extracting independent matches from any sufficiently subdivided skewed grid minor thanks to intricacy.
6.3 Other Query Classes
We finish by investigating the status of other query classes relative to our meta-dichotomy, to see whether Theorem 6.1 could be shown for queries in an even less expressive class than , such as or .
Connected queries
We classify the connected queries relative to Theorem 6.7, by showing that a connected can never be intricate. This explains why, for instance, the query is not intricate, as is witnessed by the family of grids.
Proposition 6.8.
A connected is never intricate.
The signature must contain at most one binary relation , as otherwise we can find for any a family of grids where the query never holds, so that would have trivial constant-width OBDDs. Now, if contains a join pattern of the form , then has no matches on line instances with facts of alternating directions. If does not contain such a pattern, we consider line instances with a path of facts in the same direction, and show that has no match that involves the two middle facts.
By Theorem 6.7, this implies that any query has an unbounded-treewidth, treewidth-constructible family of instances