# Stratified Negation in Limit Datalog Programs

Mark Kaminski, Bernardo Cuenca Grau, Egor V. Kostylev, Boris Motik, Ian Horrocks
Department of Computer Science, University of Oxford, UK
{mark.kaminski, bernardo.cuenca.grau, egor.kostylev, boris.motik, ian.horrocks}@cs.ox.ac.uk
###### Abstract

There has recently been an increasing interest in declarative data analysis, where analytic tasks are specified using a logical language, and their implementation and optimisation are delegated to a general-purpose query engine. Existing declarative languages for data analysis can be formalised as variants of logic programming equipped with arithmetic function symbols and/or aggregation, and are typically undecidable. In prior work, the language of limit programs was proposed, which is sufficiently powerful to capture many analysis tasks and has a decidable entailment problem. Rules in this language, however, do not allow for negation. In this paper, we study an extension of limit programs with stratified negation-as-failure. We show that the additional expressive power makes reasoning computationally more demanding, and provide tight data complexity bounds. We also identify a fragment with tractable data complexity and sufficient expressivity to capture many relevant tasks.

## 1 Introduction

Data analysis tasks are becoming increasingly important in information systems. Although these tasks are currently implemented using code written in standard programming languages, in recent years there has been a significant shift towards declarative solutions where the definition of the task is clearly separated from its implementation [??????].

Languages for declarative data analysis are typically rule-based, and they have already been implemented in reasoning engines such as BOOM [?], DeALS [?], Myria [?], SociaLite [?], Overlog [?], Dyna [?], and Yedalog [?].

Formally, such declarative languages can be seen as variants of logic programming equipped with means for capturing quantitative aspects of the data, such as arithmetic function symbols and aggregates. It is, however, well-known since the ’90s that the combination of recursion with numeric computations in rules easily leads to semantic difficulties [????????], and/or undecidability of reasoning [??]. In particular, undecidability carries over to the languages underpinning the aforementioned reasoning engines for data analysis.

? [?] have recently proposed the language of limit Datalog programs, a decidable variant of negation-free Datalog equipped with arithmetic functions over the integers that is expressive enough to capture many data analysis tasks. The key feature of limit programs is that all intensional predicates with a numeric argument are limit predicates, the extension of which represents minimal ($\min$) or maximal ($\max$) bounds of numeric values. For instance, if we encode a weighted directed graph as facts over a ternary predicate $\mathit{edge}$ and a unary predicate $\mathit{node}$ in the obvious way, then the following rules encode the all-pairs shortest path problem, where the ternary limit predicate $d$ is used to encode the distance from any node to any other node in the graph as the length of a shortest path between them.

$$\mathit{node}(x) \to d(x, x, 0) \tag{1}$$

$$d(x, y, m) \land \mathit{edge}(y, z, n) \to d(x, z, m + n) \tag{2}$$

The semantics of $\min$ limit predicates is defined such that a fact $d(u, v, k)$ is entailed from these rules and a dataset if and only if the distance from $u$ to $v$ is at most $k$; as a result, all facts $d(u, v, k')$ with $k \le k'$ are also entailed. This is in contrast to standard first-order predicates, where there is no semantic relationship between $d(u, v, k)$ and $d(u, v, k')$. The intended semantics of limit predicates can be axiomatised using rules over standard predicates; in particular, our example limit program is equivalent to a standard logic program consisting of rules (1), (2), and the following rule (3), where $d$ is now treated as a regular first-order predicate:

$$d(x, y, k) \land (k \le k') \to d(x, y, k'). \tag{3}$$
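The effect of rule (3) can be made concrete with a small sketch. The Python code below is only an illustration under assumptions of our own: the function names and the dictionary representation are hypothetical, and the naive fixpoint loop is not the reasoning algorithm of the cited work. It stores a single limit value per pair of nodes instead of the infinitely many facts implied by rule (3), and answers entailment queries by comparing against that value.

```python
def limit_shortest_paths(nodes, edges):
    """Naive fixpoint for rules (1) and (2), keeping one limit value per pair.

    `edges` is a list of (y, z, n) triples standing for edge(y, z, n) facts.
    Assumption: there are no negative-weight cycles, so all limits are finite.
    """
    # Rule (1): node(x) -> d(x, x, 0)
    d = {(x, x): 0 for x in nodes}
    changed = True
    while changed:
        changed = False
        # Rule (2): d(x, y, m) and edge(y, z, n) -> d(x, z, m + n)
        for (x, y), m in list(d.items()):
            for (src, z, n) in edges:
                if src != y:
                    continue
                candidate = m + n
                # A min limit predicate only needs its smallest derived bound.
                if candidate < d.get((x, z), float("inf")):
                    d[(x, z)] = candidate
                    changed = True
    return d


def entails(d, x, y, k):
    """d(x, y, k) is entailed iff the stored limit value is at most k (rule (3))."""
    return (x, y) in d and d[(x, y)] <= k
```

For a graph with edges `a→b` (1), `b→c` (2) and `a→c` (5), the stored limit for the pair `(a, c)` is 3, so `entails(d, "a", "c", 4)` holds while `entails(d, "a", "c", 2)` does not.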

? [?] showed that, under certain restrictions on the use of multiplication, reasoning (i.e., fact entailment) over limit programs is decidable and coNP-complete in data complexity; then, they proposed a practical fragment with tractable data complexity.

Limit Datalog programs as defined in prior work are, however, positive and hence do not allow for negation-as-failure in the body of rules. Non-monotonic negation applied to limit atoms can be useful, not only to express a wider range of data analysis tasks, but also to declaratively obtain solutions to problems where the cost of such solutions is defined by a positive limit program. For instance, our example limit program consisting of rules (1) and (2) provides the length of a shortest path between any two nodes, but does not provide access to any of the paths themselves—an issue that we will be able to solve using non-monotonic negation.

In this paper, we study the language of limit programs with stratified negation-as-failure. Our language extends both positive limit Datalog as defined in prior work and plain (function-free) Datalog with stratified negation. We argue that our language provides useful additional expressivity, but at the expense of increased complexity of reasoning; for programs with restricted use of multiplication, complexity jumps from coNP-completeness in the case of positive programs to $\Delta_2^P$-completeness for programs with stratified negation. We also show that the tractable fragment of positive limit programs defined in [?] can be seamlessly extended with stratified negation while preserving tractability of reasoning; furthermore, the extended fragment is sufficiently expressive to capture the relevant data analysis tasks.

The proofs of all our results are given in the appendix.

## 2 Preliminaries

In this section we recapitulate the syntax and semantics of Datalog programs with integer arithmetic and stratified negation (see e.g., [?] for an excellent survey).

Syntax  We assume a fixed vocabulary of countably infinite, mutually disjoint sets of predicates equipped with non-negative arities, objects, object variables, and numeric variables. Each position of an $n$-ary predicate is of either object or numeric sort. An object term is an object or an object variable. A numeric term is an integer, a numeric variable, or of the form $s_1 + s_2$, $s_1 - s_2$, or $s_1 \times s_2$, where $s_1$ and $s_2$ are numeric terms and $+$, $-$, and $\times$ are the standard arithmetic functions. A constant is an object or an integer. A standard atom is of the form $B(t_1, \dots, t_n)$, with $B$ an $n$-ary predicate and each $t_i$ a term matching the sort of the $i$-th position of $B$. A (standard) positive literal is a standard atom, and a (standard) negative literal is of the form $\mathsf{not}\ \alpha$, for $\alpha$ a standard atom. A comparison atom is of the form $(s_1 < s_2)$ or $(s_1 \le s_2)$, with $<$ and $\le$ the usual comparison predicates over the integers, and $s_1$ and $s_2$ numeric terms. We write $(s_1 \doteq s_2)$ as an abbreviation for $(s_1 \le s_2) \land (s_2 \le s_1)$. A term, atom or literal is ground if it has no variables.

A rule $r$ has the form $\varphi \to \alpha$, where the body $\varphi$ is a possibly empty conjunction of standard literals and comparison atoms, and the head $\alpha$ is a standard atom. We assume without loss of generality that standard body literals are function-free; indeed, a conjunction with a functional term $s$ can be equivalently rewritten by replacing $s$ with a fresh variable $x$ and adding $(x \doteq s)$ to the conjunction. A rule $r$ is safe if each object variable in $r$ occurs in a positive literal in the body of $r$. A ground instance of $r$ is obtained from $r$ by substituting each variable by a constant of the right sort.

A fact is a rule with empty body and a function-free standard atom in the head that has no variables in object positions and no repeated variables in numeric positions. Intuitively, a variable in a fact says that the fact holds for every integer in the position. As a convention, we will omit $\to$ and use symbol $\infty$ instead of variables when writing facts. A dataset $D$ is a finite set of facts. Dataset $D$ is ordered if (i) it contains facts $\mathit{first}(a_1)$, $\mathit{next}(a_1, a_2)$, $\dots$, $\mathit{next}(a_{n-1}, a_n)$, $\mathit{last}(a_n)$, for some repetition-free enumeration $a_1, \dots, a_n$ of all objects in $D$; and (ii) it contains no other facts over predicates $\mathit{first}$, $\mathit{next}$, and $\mathit{last}$. A program is a finite set of safe rules; without loss of generality we assume that distinct rules do not share variables. A predicate $B$ is intensional (IDB) in a program $P$ if $B$ occurs in $P$ in the head of a rule that is not a fact; otherwise, $B$ is extensional (EDB) in $P$. Program $P$ is positive if it has no negative literals, and it is semi-positive if negation occurs only in front of EDB atoms. A stratification of $P$ is a function $\sigma$ mapping each predicate to a positive integer such that, for each rule $r \in P$ with the head over a predicate $A$ and each standard body literal $\lambda$ over $B$, we have $\sigma(B) \le \sigma(A)$ if $\lambda$ is positive, and $\sigma(B) < \sigma(A)$ if $\lambda$ is negative. Program $P$ is stratified if it admits a stratification. Given a stratification $\sigma$, we write $P[i]$ for the $i$-th stratum of $P$ over $\sigma$, that is, the set of all rules in $P$ whose head predicates $A$ satisfy $\sigma(A) = i$. Note that each stratum is a semi-positive program.
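To illustrate the notion of stratification, the following Python sketch computes a least stratification at the level of predicate names, or reports that none exists. The rule encoding as triples of predicate names and the function name `stratify` are assumptions made for this example only.

```python
def stratify(rules):
    """Return a stratification (predicate -> positive integer), or None.

    Each rule is a triple (head, positive_body, negative_body) of predicate
    names; atoms' arguments are irrelevant for stratification.
    """
    preds = set()
    for head, pos, neg in rules:
        preds.add(head)
        preds.update(pos)
        preds.update(neg)
    stratum = {p: 1 for p in preds}
    changed = True
    while changed:
        changed = False
        for head, pos, neg in rules:
            # Positive body predicates may share the head's stratum,
            # negative ones must lie in a strictly lower stratum.
            need = max(
                [stratum[head]]
                + [stratum[b] for b in pos]
                + [stratum[b] + 1 for b in neg]
            )
            # A least stratification never needs more strata than predicates,
            # so exceeding that bound signals recursion through negation.
            if need > len(preds):
                return None
            if need > stratum[head]:
                stratum[head] = need
                changed = True
    return stratum
```

For instance, a recursive predicate `d` defined positively and a predicate `sp` that negates `d` are placed in strata 1 and 2, whereas two predicates that negate each other admit no stratification.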

Semantics  A (Herbrand) interpretation $I$ is a possibly infinite set of ground facts (i.e., facts without $\infty$). Interpretation $I$ satisfies a ground atom $\alpha$, written $I \models \alpha$, if either (i) $\alpha$ is a standard atom such that evaluation of the arithmetic functions in $\alpha$ under the usual semantics over integers produces a fact in $I$; or (ii) $\alpha$ is a comparison atom that evaluates to true under the usual semantics. Interpretation $I$ satisfies a ground negative literal $\mathsf{not}\ \alpha$, written $I \models \mathsf{not}\ \alpha$, if $I \not\models \alpha$. The notion of satisfaction is extended to conjunctions of ground literals, rules, and programs as in first-order logic, with all variables in rules implicitly universally quantified. If $I$ satisfies a program $P$, then $I$ is a model of $P$. For a Herbrand interpretation $I$ and a (possibly infinite) semi-positive set $\Pi$ of rules, let $S_\Pi(I)$ be the set of facts $\alpha$ such that $\varphi \to \alpha$ is a ground instance of a rule in $\Pi$ and $I \models \varphi$. Given a program $P$ and a stratification $\sigma$ of $P$, for each $i, j \ge 0$ we define interpretation $I_i^j$ by induction on $i$ and $j$:

$$I_0^j = I_i^0 = \emptyset; \qquad I_{i+1}^{j+1} = S_{P[i+1] \cup I_i^\infty}\bigl(I_{i+1}^j\bigr); \qquad I_i^\infty = \bigcup_{j \ge 0} I_i^j.$$

The materialisation $M(P)$ of $P$ is the interpretation $I_k^\infty$, for $k$ the greatest number such that $P[k] \ne \emptyset$. The materialisation of a program does not depend on the chosen stratification. A stratified program $P$ entails a fact $\alpha$, written $P \models \alpha$, if $M(P) \models \alpha'$ for every ground instance $\alpha'$ of $\alpha$. For positive programs, this definition coincides with the usual first-order notion of entailment: for $P$ positive and $\alpha$ a fact, $P \models \alpha$ if and only if $I \models \alpha$ holds for all $I \models P$.

Reasoning  We study the computational properties of checking whether $P \cup D \models \alpha$, for $P$ a program, $D$ a dataset, and $\alpha$ a fact. We are interested in data complexity, which assumes that only $D$ and $\alpha$ form the input while $P$ is fixed. Unless otherwise stated, all numbers in the input are coded in binary, and the size of $D$ is the size of its representation. Checking $P \cup D \models \alpha$ is undecidable even if the only arithmetic function in $P$ is $+$ [?] and predicates have at most one numeric position [?].

We use standard definitions of the basic complexity classes such as P, NP, coNP, and FP. Given a complexity class $C$, $\mathrm{P}^C$ is the class of decision problems solvable in polynomial time by deterministic Turing machines with an oracle for a problem in $C$; functional class $\mathrm{FP}^C$ is defined similarly. Finally, $\Delta_2^P$ is a synonym for $\mathrm{P}^{\mathrm{NP}}$.

## 3 Stratified Limit Programs

We introduce stratified limit programs as a language that can be seen as either a semantic or a syntactic restriction of Datalog with integer arithmetic and stratified negation. Our language is also an extension of that in [?] with stratified negation.

###### Definition 1.

A stratified limit program is a pair $(P, \tau)$ where

• $P$ is a stratified program where each predicate either has no numeric position, in which case it is an object predicate, or only its last position is numeric, in which case it is a numeric predicate, and

• $\tau$ is a partial function from numeric predicates to $\{\min, \max\}$ that is total on the IDB predicates in $P$ and on predicates occurring in non-ground facts.

A numeric predicate $A$ is a $\min$ (or $\max$) limit predicate if $\tau(A) = \min$ (or $\tau(A) = \max$, respectively). Numeric predicates that are not limit predicates are ordinary. An atom, fact or literal is numeric, limit, etc. if its predicate is.

All notions defined on ordinary Datalog programs (such as EDB and IDB predicates, stratification, etc.) transfer to limit programs by applying them to $P$. We often abuse notation and write $P$ instead of $(P, \tau)$ when $\tau$ is clear from the context or immaterial. Whenever we consider a union of two limit programs, we silently assume that they coincide on $\tau$. Finally, we denote $\le$ (or $\ge$) by $\preceq_A$ if $A$ is a $\max$ (or, respectively, $\min$) limit predicate.

Intuitively, a limit fact $A(\vec{a}, k)$ says that the value of $A$ for a tuple of objects $\vec{a}$ is $k$ or more, if $A$ is $\max$, or $k$ or less, if $A$ is $\min$. For example, a limit fact $d(u, v, k)$ in our all-pairs shortest path example says that node $v$ is reachable from node $u$ via a path with cost $k$ or less. The intended semantics of limit predicates can be axiomatised using standard rules as given next.

###### Definition 2.

An interpretation satisfies a limit program if it satisfies the program , where contains the following rule for each limit predicate in :

$$A(\vec{x}, m) \land (n \preceq_A m) \to A(\vec{x}, n).$$

The materialisation of is ; and entails , written , if .

We next demonstrate the use of stratified negation on examples. One of the main uses of negation of a limit atom is to ‘access’ the limit value (e.g., the length of a shortest path) attained by the atom in the materialisation of previous strata, and then exploit such values in further computations. To facilitate such use of negation in examples, we introduce a new operator as syntactic sugar in the language.

###### Definition 3.

The least upper bound expression of a (or ) limit atom is the conjunction where (or , respectively) and is a fresh variable.

Clearly, for an interpretation and a ground atom if is the limit integer such that .

###### Example 4.

An input of the single-pair shortest path problem can be encoded in the obvious way as a dataset using a ternary ordinary numeric predicate $\mathit{edge}$ to represent the graph’s weighted edges, and unary facts $\mathit{source}(s)$ and $\mathit{target}(t)$ to identify the source and target nodes $s$ and $t$, respectively. The stratified limit program given next computes, together with this dataset (where all edge weights are positive), a DAG over a binary object predicate $\textit{sp-edge}$ such that every maximal path in the DAG is a shortest path from $s$ to $t$.

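$$\mathit{source}(x) \to \mathit{ds}(x, 0) \tag{4}$$

$$\mathit{ds}(x, m) \land \mathit{edge}(x, y, n) \to \mathit{ds}(y, m + n) \tag{5}$$

$$\lceil \mathit{ds}(x, m_1) \rceil \land \lceil \mathit{ds}(y, m_2) \rceil \land \mathit{edge}(x, y, n) \land \mathit{target}(y) \land (m_1 + n \doteq m_2) \to \textit{sp-edge}(x, y) \tag{6}$$

$$\lceil \mathit{ds}(x, m_1) \rceil \land \lceil \mathit{ds}(y, m_2) \rceil \land \mathit{edge}(x, y, n) \land \textit{sp-edge}(y, z) \land (m_1 + n \doteq m_2) \to \textit{sp-edge}(x, y) \tag{7}$$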

The first stratum consists of rules (4) and (5), and computes the length of a shortest path from $s$ to all other nodes using the limit predicate $\mathit{ds}$; in particular, $\lceil \mathit{ds}(v, k) \rceil$ holds if and only if $k$ is the length of a shortest path from $s$ to $v$. Then, in a second stratum, the program computes the predicate $\textit{sp-edge}$ such that $\textit{sp-edge}(u, v)$ holds if and only if the edge $(u, v)$ is part of a shortest path from $s$ to $t$. ∎
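The two strata of Example 4 can be mimicked procedurally. The Python sketch below is a hand-written imitation of rules (4)–(7) under assumptions of our own (the function name, the edge-triple encoding, and the naive fixpoint loops are all hypothetical); it is not produced by any limit Datalog engine.

```python
def shortest_path_dag(edges, source, target):
    """Imitate rules (4)-(7): returns (ds limits, sp-edge facts).

    `edges` is a list of (x, y, n) triples with positive weights n.
    """
    # Stratum 1, rules (4) and (5): limit values of the min predicate ds.
    ds = {source: 0}
    changed = True
    while changed:
        changed = False
        for (x, y, n) in edges:
            if x in ds and ds[x] + n < ds.get(y, float("inf")):
                ds[y] = ds[x] + n
                changed = True

    # Stratum 2, rules (6) and (7): the limit values are now fixed, which is
    # what the least upper bound expressions give access to.
    sp_edge = set()
    changed = True
    while changed:
        changed = False
        for (x, y, n) in edges:
            if (x, y) in sp_edge or x not in ds or y not in ds:
                continue
            tight = ds[x] + n == ds[y]                  # (m1 + n = m2)
            continues = y == target or any(u == y for (u, _) in sp_edge)
            if tight and continues:
                sp_edge.add((x, y))
                changed = True
    return ds, sp_edge
```

Edges that are tight with respect to the distances but do not lead towards the target (such as a side branch of the same length) are not derived, because neither rule (6) nor rule (7) applies to them.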

###### Example 5.

The closeness centrality of a node in a strongly connected weighted directed graph is a measure of how central the node is in the graph [?]; variants of this measure are useful, for instance, for the analysis of market potential. Most commonly, closeness centrality of a node $v$ is defined as $1 / \sum_{u} d(v, u)$, where $d(v, u)$ is the length of a shortest path from $v$ to $u$; the sum in the denominator is often called the farness centrality of $v$. We next give a limit program computing a node of maximal closeness centrality in a given directed graph. We encode a graph as an ordered dataset using, as before, a unary object predicate $\mathit{node}$ and a ternary ordinary numeric predicate $\mathit{edge}$. The program consists of rules (8)–(16), where $d$, $\mathit{fness}'$ and $\mathit{fness}$ are $\min$ predicates, and $\mathit{centre}'$ and $\mathit{centre}$ are object predicates.

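$$\mathit{node}(x) \to d(x, x, 0) \tag{8}$$

$$d(x, y, m) \land \mathit{edge}(y, z, n) \to d(x, z, m + n) \tag{9}$$

$$\mathit{first}(y) \land d(x, y, n) \to \mathit{fness}'(x, y, n) \tag{10}$$

$$\mathit{next}(y, z) \land \mathit{fness}'(x, y, m) \land d(x, z, n) \to \mathit{fness}'(x, z, m + n) \tag{11}$$

$$\mathit{fness}'(x, y, n) \land \mathit{last}(y) \to \mathit{fness}(x, n) \tag{12}$$

$$\mathit{first}(x) \to \mathit{centre}'(x, x) \tag{13}$$

$$\mathit{next}(x, y) \land \mathit{centre}'(x, z) \land \lceil \mathit{fness}(z, n) \rceil \land \lceil \mathit{fness}(y, m) \rceil \land (m \ldots$$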

The first stratum consists of rules (8)–(12). Rules (8) and (9) compute the distance (length of a shortest path) between any two nodes. Rules (10)–(12) then compute the farness centrality of each node based on the aforementioned distances; for this, the program exploits the order predicates to iterate over the nodes in the graph while recording the best value obtained so far in the iteration using an auxiliary predicate $\mathit{fness}'$. In the second stratum (rules (13)–(16)), the program uses negation to compute the node of minimum farness centrality (and hence of maximum closeness centrality), which is recorded using the $\mathit{centre}$ predicate; the order is again exploited to iterate over nodes, and an auxiliary predicate $\mathit{centre}'$ is used to record the current node of the iteration and the node with the best centrality encountered so far. ∎
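The computation described in Example 5 can be paraphrased in a few lines of Python. This is a sketch under stated assumptions rather than a translation of the program: the function name is hypothetical, the list `nodes` plays the role of the order given by the $\mathit{first}$/$\mathit{next}$/$\mathit{last}$ facts, and ties are broken in favour of the earlier node, which is our own choice since the tie-breaking rule is not recoverable from the text above.

```python
def most_central_node(nodes, edges):
    """Return (centre, farness) for a strongly connected weighted digraph.

    `nodes` is an ordered list; `edges` is a list of (y, z, n) triples.
    """
    inf = float("inf")
    # Rules (8)-(9): all-pairs distances as limit values of the predicate d.
    d = {(x, x): 0 for x in nodes}
    changed = True
    while changed:
        changed = False
        for (x, y), m in list(d.items()):
            for (u, z, n) in edges:
                if u == y and m + n < d.get((x, z), inf):
                    d[(x, z)] = m + n
                    changed = True

    # Rules (10)-(12): accumulate farness by walking the nodes in order.
    farness = {}
    for x in nodes:
        total = 0
        for y in nodes:
            total += d[(x, y)]
        farness[x] = total

    # Second stratum: walk the order again, remembering the best node so far.
    centre = nodes[0]
    for y in nodes[1:]:
        if farness[y] < farness[centre]:   # assumed tie-breaking: keep earlier node
            centre = y
    return centre, farness
```

On the path-like graph with edges `a↔b` and `b↔c`, all of weight 1, node `b` has farness 2 while `a` and `c` have farness 3, so `b` is returned as the centre.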

## 4 Stratified Limit-Linear Programs

By results in [?], checking fact entailment is undecidable even for positive limit programs. Essentially, this follows from the fact that checking rule applicability over a set of facts requires solving arbitrary non-linear inequalities over integers, that is, solving Hilbert's tenth problem, which is undecidable. To regain decidability, the authors of [?] proposed a restriction on positive limit programs, called limit-linearity, which ensures that every program satisfying the restriction can be transformed using a grounding technique so that all numeric terms in the resulting program are linear. In particular, this implies that rule applicability can be determined by solving a system of linear inequalities, which is feasible in NP. As a result, fact entailment for positive limit-linear programs is coNP-complete in data complexity.

We next extend the notion of limit-linearity to programs with stratified negation, and define semi-grounding as a way to simplify a limit-linear program by replacing certain types of variables with constants. We then prove that fact entailment is $\Delta_2^P$-complete in data complexity for such programs. All programs in our previous examples are limit-linear as per the definition given next.

###### Definition 6.

A numeric variable is guarded in a rule of a stratified limit program if

• either occurs in a positive ordinary literal in ;

• or the body of contains the literals

$$A(\vec{s}, n_1), \quad \mathsf{not}\ A(\vec{s}, n_2), \quad (n_2 \doteq n_1 + t),$$

where is a (or ) predicate, (or , respectively), and .

Rule is limit-linear if each numeric term in is of the form , where each is a distinct numeric variable not occurring in in a (positive or negative) ordinary numeric literal, term uses only variables occurring in a positive ordinary literal in , and terms with use only variables that are guarded in and do not use . A limit-linear program contains only limit-linear rules.

A rule is semi-ground if all variables in are numeric and occur only in limit and comparison atoms. The semi-grounding of a program is obtained by replacing, in every rule in , each object variable and each numeric variable occurring in an ordinary numeric atom in with a constant in  in all possible ways.
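As a rough illustration of semi-grounding, the Python sketch below instantiates a chosen set of variables of one rule in all possible ways. The encoding is an assumption of this example: atoms and terms are nested tuples, `domains` maps each variable to be replaced (object variables and numeric variables of ordinary numeric atoms) to its candidate constants, and variable names are assumed to be distinct from predicate and function names.

```python
from itertools import product


def substitute(term, sub):
    """Apply a substitution to a term or atom encoded as nested tuples."""
    if isinstance(term, tuple):
        return tuple(substitute(t, sub) for t in term)
    return sub.get(term, term)


def semi_ground(rule, domains):
    """Instantiate the variables listed in `domains` in all possible ways.

    `rule` is a pair (head, body); variables not listed in `domains`
    (numeric variables of limit and comparison atoms) are left untouched.
    """
    head, body = rule
    variables = sorted(domains)
    instances = []
    for values in product(*(domains[v] for v in variables)):
        sub = dict(zip(variables, values))
        instances.append((substitute(head, sub), [substitute(a, sub) for a in body]))
    return instances
```

Applied to rule (2) with two objects and two edge weights, this yields sixteen semi-ground rules in which only the numeric variable `m` of the limit predicate remains.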

It is easily seen that the semi-grounding of a limit-linear program entails the same facts as for every dataset. Furthermore, as in prior work, Definition 6 ensures that the semi-grounding of a positive limit-linear program contains only linear numeric terms; finally, for programs with stratified negation, it ensures that negation can be eliminated while preserving limit-linearity when the program is materialised stratum-by-stratum, as we will discuss in detail later on.

Decidability of fact entailment for positive limit-linear programs is established by first semi-grounding the program and then reducing fact entailment over the resulting program to the validity problem of Presburger formulas [?], that is, first-order formulas interpreted over the integers and composed using only variables, constants $0$ and $1$, functions $+$ and $-$, and the comparisons.

The extension of such a reduction to stratified limit programs, however, is complicated by the fact that in the presence of negation-as-failure, entailment no longer coincides with classical first-order entailment. We thus adopt a different approach, where we show decidability and establish data complexity upper bounds according to the following steps.

Step 1. We extend the results in [?] for positive programs by showing that, for every positive limit-linear program $P$ and dataset $D$, we can compute in $\mathrm{FP}^{\mathrm{NP}}$ a finite representation of the (possibly infinite) materialisation of $P \cup D$ (see Lemma 8 and Corollary 9). This representation is called the pseudo-materialisation of $P \cup D$.

Step 2. We further extend the results in Step 1 to semi-positive limit-linear programs, where negation occurs only in front of EDB predicates. For this, we show that fact entailment for such programs can be reduced in polynomial time in the size of the data to fact entailment over semi-ground positive limit-linear programs by exploiting the notion of a reduct (see Definition 10 and Lemma 11). Thus, we can assume the existence of an oracle for computing the pseudo-materialisation of a semi-positive limit-linear program.

Step 3. We provide an algorithm (see Algorithm 1) that decides entailment of a fact by a stratified limit-linear program using the oracle from Step 2. The algorithm maintains a pseudo-materialisation, which is initially empty and is constructed bottom-up stratum by stratum. In each step $i$, the algorithm updates the pseudo-materialisation by applying the oracle to the union of the pseudo-materialisation for stratum $i - 1$ and the rules in the $i$-th stratum. The final pseudo-materialisation, from which entailment of the fact is obtained, is computed using a constant number of oracle calls in the size of the data, which yields a $\Delta_2^P$ data complexity upper bound (Proposition 13 and Theorem 15).

In what follows, we specify each of these steps. We start by formally defining the notion of a pseudo-materialisation of a stratified limit program , which compactly represents the materialisation . Intuitively, can be infinite because it can contain, for any limit predicate and tuple of objects of suitable arity, an infinite number of facts of the form . However, if the materialisation has facts of this form, then either there is a limit value such that for each and for each , or for every integer . As argued in prior work, it then suffices for the pseudo-materialisation to contain only a single fact in the former case, or in the latter case.

###### Definition 7.

A pseudo-interpretation is a set of facts such that occurs only in facts over limit predicates and holds for all facts and in with limit .

The pseudo-materialisation of a limit program , written , is the (unique) pseudo-interpretation such that

1. an object or ordinary numeric fact is contained in if and only if it is contained in ; and

2. for each limit predicate , object tuple , and integer ,

• if and only if and for all , and

• if and only if for all integers .
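The idea of keeping a single fact per limit predicate and object tuple can be sketched as follows. The Python code is only an illustration with an assumed encoding: facts are triples `(predicate, objects, value)`, the string `"inf"` stands for the symbol $\infty$, and `limit_type` maps limit predicates to `"min"` or `"max"`. It compacts a finite set of derived facts and does not compute a materialisation itself.

```python
INF = "inf"  # stands for the symbol ∞ used in facts


def to_pseudo(facts, limit_type):
    """Compact facts into a pseudo-interpretation.

    Ordinary and object facts are copied; for each limit predicate and object
    tuple only the tightest bound (or ∞) is kept.
    """
    ordinary = set()
    best = {}
    for pred, objs, value in facts:
        kind = limit_type.get(pred)
        if kind is None:
            ordinary.add((pred, objs, value))
            continue
        key = (pred, objs)
        current = best.get(key)
        if current == INF or value == INF:
            best[key] = INF
        elif current is None:
            best[key] = value
        elif kind == "min":
            best[key] = min(current, value)
        else:
            best[key] = max(current, value)
    return ordinary | {(p, o, v) for (p, o), v in best.items()}
```

With a min predicate `d` and a max predicate `w`, the facts `d(a,b,5)`, `d(a,b,3)`, `w(a,7)` and `w(a,9)` are compacted to `d(a,b,3)` and `w(a,9)`.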

We now strengthen the results in [?] by establishing a bound on the size of pseudo-materialisations of positive, limit-linear programs.

###### Lemma 8.

Let be a semi-ground, positive, limit-linear program, and let be a limit dataset. Then and the magnitude of each integer in is bounded polynomially in the largest magnitude of an integer in , exponentially in , and double-exponentially in , where stands for the size of the representation of assuming that all numbers take unit space.

By Lemma 8, the pseudo-materialisation of $P \cup D$ contains at most linearly many facts; furthermore, the size of each such fact is bounded polynomially once $P$ is considered fixed. Hence, the pseudo-materialisation of $P \cup D$ can be computed in $\mathrm{FP}^{\mathrm{NP}}$ in data complexity, even if $P$ is not semi-ground.

###### Corollary 9.

Let $P$ be a positive, limit-linear program. Then the function mapping each limit dataset $D$ to the pseudo-materialisation of $P \cup D$ is computable in $\mathrm{FP}^{\mathrm{NP}}$ in the size of $D$.

In our second step, we extend this result to semi-positive programs. For this, we start by defining the notion of a reduct of a semi-positive limit-linear program . The reduct is obtained by first computing a semi-ground instance of and then eliminating all negative literals in while preserving fact entailment. Intuitively, negative literals can be eliminated because they involve only EDB predicates; as a result, their extension can be computed in polynomial time from the facts in alone. To eliminate a ground negative literal , it suffices to check whether is entailed by the facts in and simplify all rules containing accordingly; in turn, limit literals involving a numeric variable can be rewritten as comparisons of with a constant computed from the facts in .

###### Definition 10.

Let be a semi-positive, limit-linear program and let be the subset of all facts in . The reduct of is obtained by first computing the semi-grounding of and then applying the following transformations to each rule and each negative body literal in :

1. if , for a ground atom, delete if , and delete from otherwise,

2. if is a non-ground limit literal, then

• delete if for each integer ;

• delete from if for each ; and

• replace in with otherwise, where .
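The elimination of negative literals in a reduct can be sketched for a single semi-ground rule. The Python code below rests on assumptions of our own: rules are tuples `(head, positive_body, negative_body, comparisons)`, literals are triples `(predicate, objects, value)`, `edb_limits` holds the limit value (or `"inf"`) of each EDB limit predicate and object tuple, and the comparison introduced for a non-ground limit literal is derived directly from the min/max semantics rather than copied from Definition 10.

```python
INF = "inf"  # stands for the symbol ∞ used in facts


def reduce_rule(rule, edb_facts, edb_limits, limit_type):
    """Eliminate negative EDB literals; return the new rule or None if deleted."""
    head, pos, neg, comps = rule
    comps = list(comps)
    for pred, objs, v in neg:
        kind = limit_type.get(pred)
        if kind is None:
            # Ground object or ordinary numeric literal: plain membership test.
            if (pred, objs, v) in edb_facts:
                return None          # the negated atom holds, so the rule never fires
            continue                 # the literal is true and can be dropped
        limit = edb_limits.get((pred, objs))
        if limit is None:
            continue                 # no fact for any integer: literal is true
        if limit == INF:
            return None              # fact holds for every integer: delete the rule
        if isinstance(v, int):
            holds = v >= limit if kind == "min" else v <= limit
            if holds:
                return None
            continue
        # Non-ground case: `not A(a, m)` holds exactly for the m beyond the limit.
        comps.append((v, "<", limit) if kind == "min" else (v, ">", limit))
    return (head, list(pos), [], comps)
```

For a min predicate `d` with limit value 4 for object `a`, the literal `not d(a, m)` is rewritten into the comparison `m < 4`, while a rule containing `not d(a, 5)` is deleted.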

Note that semi-ground programs disallow non-ground negative literals over ordinary numeric predicates, which is why these are not considered in Definition 10. As shown by the following lemma, reducts allow us to reduce fact entailment for semi-positive, limit-linear programs to semi-ground, positive, limit-linear programs.

###### Lemma 11.

For a semi-positive, limit-linear program and a limit dataset, the reduct of , and a fact, we have if and only if . Moreover can be computed in polynomial time in , is polynomially bounded in , and .

The results in Lemma 8 and Lemma 11 imply that the pseudo-materialisation of a semi-positive, limit-linear program can be computed in $\mathrm{FP}^{\mathrm{NP}}$ in data complexity.

###### Lemma 12.

Let $P$ be a semi-positive, limit-linear program. Then the function mapping each limit dataset $D$ to the pseudo-materialisation of $P \cup D$ is computable in $\mathrm{FP}^{\mathrm{NP}}$ in the size of $D$.

We are now ready to present Algorithm 1, which decides entailment of a fact by a stratified limit-linear program. The algorithm uses an oracle for computing the pseudo-materialisation of a semi-positive program. The existence of such an oracle and its computational bounds are ensured by Lemma 12. Algorithm 1 constructs the pseudo-materialisation of the program stratum by stratum in a bottom-up fashion. For each stratum, the algorithm uses the oracle to compute the pseudo-materialisation of the program consisting of the rules in the current stratum and the facts in the pseudo-materialisation computed for the previous stratum. Once the final pseudo-materialisation has been constructed, entailment of the fact is checked directly over it.

Correctness of the algorithm is immediate by the properties of the oracle and the correspondence between pseudo-materialisations and materialisations. Moreover, if the oracle runs in $\mathrm{FP}^C$ in data complexity, for some complexity class $C$, then it can only return a pseudo-interpretation that is polynomially bounded in data complexity; as a result, Algorithm 1 runs in $\mathrm{P}^C$ since the number of strata of the program does not depend on the input dataset.
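The stratum-by-stratum loop described above can be sketched in Python as follows. This is not Algorithm 1 itself: the function names are hypothetical, atoms are plain strings, and `toy_oracle` is a stand-in that only handles ground, function-free, semi-positive rules, whereas the oracle assumed in the text computes pseudo-materialisations of limit-linear programs.

```python
def toy_oracle(rules, facts):
    """Least model of ground semi-positive rules on top of `facts`.

    Each rule is (head, positive_body, negative_body) over ground atoms.
    Negated atoms are assumed to be EDB in this stratum, so their truth is
    fixed by the incoming facts and the facts listed in the stratum itself.
    """
    edb = set(facts) | {h for h, pos, neg in rules if not pos and not neg}
    known = set(edb)
    changed = True
    while changed:
        changed = False
        for head, pos, neg in rules:
            if head in known:
                continue
            if all(b in known for b in pos) and all(b not in edb for b in neg):
                known.add(head)
                changed = True
    return known


def entails_stratified(strata, fact, oracle):
    """Build the (pseudo-)materialisation bottom-up, one oracle call per stratum."""
    pseudo = set()                      # initially empty
    for rules in strata:
        pseudo = oracle(rules, pseudo)  # previous result plus current stratum
    return fact in pseudo
```

The number of oracle calls equals the number of strata, which is fixed by the program and independent of the dataset; this is the observation behind the complexity bound stated next.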

###### Proposition 13.

If the oracle is computable in $\mathrm{FP}^C$ in data complexity, then Algorithm 1 runs in $\mathrm{P}^C$ in data complexity.

The following upper bound immediately follows from the correctness of Algorithm 1 and Proposition 13.

###### Lemma 14.

For $P$ a stratified, limit-linear program and $\alpha$ a fact, deciding $P \models \alpha$ is in $\Delta_2^P$ in data complexity.

The matching lower bound is obtained by reduction from the OddMinSAT problem [?]. An instance of OddMinSAT consists of a repetition-free tuple of variables and a satisfiable propositional formula over these variables. The question is whether the truth assignment satisfying for which the tuple is lexicographically minimal, assuming , among all satisfying truth assignments of has . In our reduction, is encoded as a dataset using object predicates and to encode the structure of and numeric predicates to encode the order of variables in ; a fixed, two-strata program then goes through all assignments in the ascending lexicographic order and evaluates the encoding of on until it finds some that makes true; then derives fact if and only if . Thus, if and only if belongs to the language of OddMinSAT.
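To make the source problem of the reduction concrete, here is a brute-force Python sketch of OddMinSAT. It reflects our reading of the problem statement, with the assumptions that false is ordered before true and that the question asks whether the last variable of the tuple is true in the lexicographically minimal satisfying assignment; the function name and the representation of formulas as Python callables are hypothetical. The enumeration takes exponential time and only mirrors the order in which the two-strata program is described as visiting assignments.

```python
from itertools import product


def odd_min_sat(variables, formula):
    """Decide OddMinSAT by enumerating assignments in ascending lexicographic order.

    `formula` is a callable taking a dict from variable names to booleans.
    The instance is assumed to be satisfiable, as required by the problem.
    """
    # product([False, True], ...) yields tuples in lexicographic order
    # with False < True, the first variable being the most significant.
    for bits in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        if formula(assignment):
            return bits[-1]   # value of the last variable in the minimal model
    raise ValueError("formula is unsatisfiable")
```

For the formula `x1 or x3` over `(x1, x2, x3)`, the minimal satisfying assignment is (false, false, true), so the answer is positive; for the formula `x2` it is (false, true, false), so the answer is negative.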

###### Theorem 15.

For $P$ a stratified, limit-linear program and $\alpha$ a fact, deciding $P \models \alpha$ is $\Delta_2^P$-complete in data complexity. The lower bound holds already for programs with two strata.

## 5 A Tractable Fragment

Tractability in data complexity is an important requirement in data-intensive applications. In this section, we propose a syntactic restriction on stratified, limit-linear programs that is sufficient to ensure tractability of fact entailment in data complexity. Our restriction extends that of type consistency in prior work to account for negation. The programs in Examples 4 and 5 are type-consistent.

###### Definition 16.

A semi-ground, limit-linear rule is type-consistent if

• each numeric term in is of the form where is an integer and each , , is a nonzero integer, called the coefficient of variable in ;

• each numeric variable occurs in exactly one standard body literal;

• each numeric variable in a negative literal is guarded;

• if the head of is a limit atom, then each unguarded variable occurring in with a positive (or negative) coefficient also occurs in the body in a (unique) positive limit literal that is of the same (or different, respectively) type (i.e., vs. ) as ;

• for each comparison or in , each unguarded variable occurring in with a positive (or negative) coefficient also occurs in a (unique) positive (or , respectively) body literal, and each unguarded variable occurring in with a positive (or negative) coefficient occurs in a (unique) positive (or , respectively) body literal.

A semi-ground, stratified, limit-linear program is type-consistent if all of its rules are type-consistent. A stratified limit-linear program is type-consistent if the program obtained by first semi-grounding and then simplifying all numeric terms as much as possible is type-consistent.

Similarly to type-consistency for positive programs, Definition 16 ensures that divergence of limit facts to $\infty$ can be detected in polynomial time when constructing a pseudo-materialisation (see [?] for details). Furthermore, the conditions in Definition 16 have been crafted such that the reduct of a semi-positive type-consistent program (and hence of any intermediate program considered while materialising a stratified program) can be trivially rewritten into a positive type-consistent program. For this, it is essential to require a guarded use of negation (see the third condition in Definition 16).

###### Lemma 17.

For a semi-positive, type-consistent program and a limit dataset, the reduct of is polynomially rewritable to a positive, semi-ground, type-consistent program such that, for each fact , if and only if .

Lemma 17 allows us to extend the polytime algorithm in [?] for computing the pseudo-materialisation of a positive type-consistent program to semi-positive programs, thus obtaining a tractable implementation of the oracle restricted to type-consistent programs. This suffices since Algorithm 1, when given a type-consistent program as input, only applies the oracle to type-consistent programs. Thus, by Proposition 13, we obtain a polynomial time upper bound on the data complexity of fact entailment for type-consistent programs with stratified negation. Since plain Datalog is already P-hard in data complexity, this upper bound is tight.

###### Theorem 18.

For $P$ a stratified, type-consistent program and $\alpha$ a fact, deciding $P \models \alpha$ is P-complete in data complexity.

Finally, as we show next, our extended notion of type consistency can be efficiently recognised.

###### Proposition 19.

Checking whether a stratified, limit-linear program is type-consistent is in LogSpace.

## 6 Conclusion and Future Work

Motivated by declarative data analysis applications, we have extended the language of limit programs with stratified negation-as-failure. We have shown that the additional expressive power provided by our extended language comes at a computational cost, but we have also identified sufficient syntactic conditions that ensure tractability of reasoning in data complexity. There are many avenues for future work. First, it would be interesting to formally study the expressive power of our language. Since type-consistent programs extend plain (function-free) Datalog with stratified negation, it is clear that they capture P on ordered datasets [?], and we conjecture that the full language of stratified limit-linear programs captures . From a more practical perspective, we believe that limit programs can naturally express many tasks that admit a dynamic programming solution (e.g., variants of the knapsack problem, and many others). Conceptually, a dynamic programming approach can be seen as a three-stage process: first, one constructs an acyclic ‘graph of subproblems’ that orders the subproblems from smallest to largest; then, one computes a shortest/longest path over this graph to obtain the value of optimal solutions; finally, one backwards-computes the actual solution by tracing back in the graph. Capturing the third stage seems to always require non-monotonic negation (as illustrated in our path computation example), whereas the first stage may or may not require it depending on the problem. Finally, the second stage can be realised with a (recursive) positive program. Second, our formalism should be extended with aggregate functions. Although certain forms of aggregation can be simulated using arithmetic functions and iterating over the object domain by exploiting the ordering, having aggregation explicitly would allow us to express certain tasks in a more natural way. 
Third, we would like to go beyond stratified negation and investigate the theoretical properties of limit Datalog under the well-founded [?] or the stable model semantics [?]. Finally, we plan to implement our reasoning algorithms and test them in practice.
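The three-stage view of dynamic programming described above can be illustrated with a minimal Python sketch for the 0/1 knapsack problem. This is ordinary imperative code rather than a limit program, and the function name `knapsack`, the node encoding `(i, c)`, and the assumption of non-negative integer weights and values are all choices made here purely for illustration.

```python
def knapsack(items, capacity):
    """items: list of (weight, value) pairs with non-negative integers.
    Returns (best total value, sorted indices of the chosen items)."""
    n = len(items)

    # Stage 1: build the acyclic graph of subproblems. Node (i, c) means
    # "the first i items have been decided and c units of capacity are used".
    edges = {}  # node -> list of (successor, gain, item index or None)
    for i, (w, v) in enumerate(items):
        for c in range(capacity + 1):
            out = [((i + 1, c), 0, None)]            # skip item i
            if c + w <= capacity:
                out.append(((i + 1, c + w), v, i))   # take item i
            edges[(i, c)] = out

    # Stage 2: longest path from (0, 0), visiting nodes in topological order.
    best = {(0, 0): 0}
    pred = {}  # node -> (predecessor node, item taken on that edge or None)
    for i in range(n):
        for c in range(capacity + 1):
            if (i, c) not in best:
                continue  # subproblem not reachable from (0, 0)
            for succ, gain, item in edges[(i, c)]:
                cand = best[(i, c)] + gain
                if cand > best.get(succ, float("-inf")):
                    best[succ] = cand
                    pred[succ] = ((i, c), item)

    # Stage 3: trace back from the best final node to recover a solution.
    end = max((node for node in best if node[0] == n), key=lambda node: best[node])
    chosen = []
    node = end
    while node in pred:
        node, item = pred[node]
        if item is not None:
            chosen.append(item)
    return best[end], sorted(chosen)
```

For instance, `knapsack([(2, 3), (3, 4), (4, 5), (5, 6)], 5)` selects the first two items. In this sketch the second stage is the only one that propagates numeric values along the graph, which matches the observation that it is the stage realisable by a recursive positive program.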

## Acknowledgments

This research was supported by the EPSRC projects DBOnto, MaSI, and ED.

## References

• [Alvaro et al., 2010] Peter Alvaro, Tyson Condie, Neil Conway, Khaled Elmeleegy, Joseph M. Hellerstein, and Russell Sears. BOOM analytics: exploring data-centric, declarative programming for the cloud. In EuroSys 2010, pages 223–236. ACM, 2010.
• [Beeri et al., 1991] Catriel Beeri, Shamim A. Naqvi, Oded Shmueli, and Shalom Tsur. Set constructors in a logic database language. J. Log. Program., 10(3&4):181–232, 1991.
• [Chin et al., 2015] Brian Chin, Daniel von Dincklage, Vuk Ercegovac, Peter Hawkins, Mark S. Miller, Franz Josef Och, Christopher Olston, and Fernando Pereira. Yedalog: Exploring knowledge at scale. In SNAPL 2015, volume 32 of LIPIcs, pages 63–78. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2015.
• [Chistikov and Haase, 2016] Dmitry Chistikov and Christoph Haase. The taming of the semi-linear set. In ICALP, volume 55 of LIPIcs, pages 128:1–128:13. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2016.
• [Consens and Mendelzon, 1993] Mariano P. Consens and Alberto O. Mendelzon. Low complexity aggregation in GraphLog and Datalog. Theor. Comput. Sci., 116(1):95–116, 1993.
• [Dantsin et al., 2001] Evgeny Dantsin, Thomas Eiter, Georg Gottlob, and Andrei Voronkov. Complexity and expressive power of logic programming. ACM Comput. Surv., 33(3):374–425, 2001.
• [Eisner and Filardo, 2011] Jason Eisner and Nathaniel Wesley Filardo. Dyna: Extending datalog for modern AI. In Datalog 2010, volume 6702 of LNCS, pages 181–220. Springer, 2011.
• [Ganguly et al., 1995] Sumit Ganguly, Sergio Greco, and Carlo Zaniolo. Extrema predicates in deductive databases. J. Comput. Syst. Sci., 51(2):244–259, 1995.
• [Gelfond and Lifschitz, 1988] Michael Gelfond and Vladimir Lifschitz. The stable model semantics for logic programming. In ICLP/SLP 1988, pages 1070–1080. MIT Press, 1988.
• [Kaminski et al., 2017] Mark Kaminski, Bernardo Cuenca Grau, Egor V. Kostylev, Boris Motik, and Ian Horrocks. Foundations of declarative data analysis using limit datalog programs. In IJCAI 2017, pages 1123–1130. ijcai.org, 2017.
• [Kemp and Stuckey, 1991] David B. Kemp and Peter J. Stuckey. Semantics of logic programs with aggregates. In ISLP 1991, pages 387–401. MIT Press, 1991.
• [Krentel, 1988] Mark W. Krentel. The complexity of optimization problems. J. Comput. System Sci., 36(3):490–509, 1988.
• [Loo et al., 2009] Boon Thau Loo, Tyson Condie, Minos N. Garofalakis, David E. Gay, Joseph M. Hellerstein, Petros Maniatis, Raghu Ramakrishnan, Timothy Roscoe, and Ion Stoica. Declarative networking. Commun. ACM, 52(11):87–95, 2009.
• [Markl, 2014] Volker Markl. Breaking the chains: On declarative data analysis and data independence in the big data era. PVLDB, 7(13):1730–1733, 2014.
• [Mazuran et al., 2013] Mirjana Mazuran, Edoardo Serra, and Carlo Zaniolo. Extending the power of datalog recursion. VLDB J., 22(4):471–493, 2013.
• [Mumick et al., 1990] Inderpal Singh Mumick, Hamid Pirahesh, and Raghu Ramakrishnan. The magic of duplicates and aggregates. In VLDB 1990, pages 264–277. Morgan Kaufmann, 1990.
• [Ross and Sagiv, 1997] Kenneth A. Ross and Yehoshua Sagiv. Monotonic aggregation in deductive databases. J. Comput. System Sci., 54(1):79–97, 1997.
• [Sabidussi, 1966] Gert Sabidussi. The centrality index of a graph. Psychometrika, 31(4):581–603, 1966.
• [Seo et al., 2015] Jiwon Seo, Stephen Guo, and Monica S. Lam. SociaLite: An efficient graph query language based on datalog. IEEE Trans. Knowl. Data Eng., 27(7):1824–1837, 2015.
• [Shkapsky et al., 2016] Alexander Shkapsky, Mohan Yang, Matteo Interlandi, Hsuan Chiu, Tyson Condie, and Carlo Zaniolo. Big data analytics with datalog queries on Spark. In SIGMOD 2016, pages 1135–1149. ACM, 2016.
• [Van Gelder et al., 1991] Allen Van Gelder, Kenneth A. Ross, and John S. Schlipf. The well-founded semantics for general logic programs. J. ACM, 38(3):620–650, 1991.
• [Van Gelder, 1992] Allen Van Gelder. The well-founded semantics of aggregation. In PODS 1992, pages 127–138. ACM Press, 1992.
• [Wang et al., 2015] Jingjing Wang, Magdalena Balazinska, and Daniel Halperin. Asynchronous and fault-tolerant recursive datalog evaluation in shared-nothing engines. PVLDB, 8(12):1542–1553, 2015.

## Appendix A Proofs for Section 4

Before proceeding to the proofs of our theorems in the main body of the paper, we restate some notions from [?]. All models of a limit program are easily seen to satisfy the following closure property.

###### Definition A.1.

An interpretation is limit-closed (for a limit program ) if, for each fact where is a limit predicate, holds for each integer with .

There is a one-to-one correspondence between pseudo-interpretations and limit-closed interpretations, and thus each model of a program can be equivalently represented by a pseudo-interpretation.

###### Definition A.2.

A limit-closed interpretation corresponds to a pseudo-interpretation if the following conditions hold:

• an object or ordinary numeric fact is contained in if and only if it is contained in ; and

• for each limit predicate , each tuple of objects , and each integer , (i) for all if and only if , and (ii) and for all and is a limit predicate if and only if .

Let and be pseudo-interpretations corresponding to interpretations and . Then, satisfies a ground atom , written , if ; is a pseudo-model of a program , written , if ; finally, holds if .
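To make the correspondence concrete, the following Python sketch shows one possible way to store a pseudo-interpretation and to test satisfaction of a ground limit fact. The representation is an assumption made here for illustration only: a dictionary keeps a single bound per limit predicate and object tuple, `math.inf` stands for ∞, and the names `satisfies` and `kinds` are hypothetical.

```python
from math import inf


def satisfies(J, kinds, pred, objs, k):
    """Check whether the limit fact pred(objs, k) holds in the limit-closed
    interpretation represented by the pseudo-interpretation J.

    J     : dict mapping (pred, objs) to the single stored bound (int or inf)
    kinds : dict mapping each limit predicate name to "min" or "max"
    """
    bound = J.get((pred, objs))
    if bound is None:
        return False          # no fact for this predicate and tuple
    if bound == inf:
        return True           # assumed reading of ∞: every integer holds
    if kinds[pred] == "max":
        return k <= bound     # max predicates are closed downwards
    return k >= bound         # min predicates are closed upwards
```

With `J = {("sp", ("a", "b")): 3}` and `kinds = {"sp": "min"}`, the call `satisfies(J, kinds, "sp", ("a", "b"), 5)` returns `True`, whereas the same call with `2` in place of `5` returns `False`. The point of the sketch is that one stored bound stands for an infinite set of facts in the corresponding limit-closed interpretation.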

? [?] then define an immediate consequence operator for positive limit programs that works on pseudo-interpretations and show that the pseudo-materialisation of a positive limit program can be computed as the pseudo-interpretation inductively defined as follows, where , for a set of pseudo-interpretations, is the supremum of w.r.t. :

$$J_0 = \emptyset, \qquad J_{j+1} = T_P(J_j), \qquad J_\infty = \sup_{j \in \mathbb{N}} J_j$$

We call pseudo-interpretations partial pseudo-materialisations of .
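The inductive construction above can be sketched in Python for one small, hard-coded positive program. Everything specific here is an assumption for illustration: the program consists of the two hypothetical rules "`dist(source, 0)`" and "`dist(v, m + w)` if `dist(u, m)` and `edge(u, v, w)`" for a min limit predicate `dist`, and a pseudo-interpretation is stored as a dictionary from nodes to their current bound. The sketch only covers the case where the sequence stabilises after finitely many steps; it does not compute the supremum when values diverge to ∞.

```python
def immediate_consequence(edges, source, J):
    """One application of the immediate consequence operator for the
    hard-coded rules, keeping only the smallest derived value per node."""
    derived = {source: 0}
    for u, v, w in edges:
        if u in J:
            cand = J[u] + w
            derived[v] = min(derived.get(v, cand), cand)
    return derived


def pseudo_materialisation(edges, source, max_rounds=1000):
    """Iterate from the empty pseudo-interpretation until a fixpoint is
    reached. Returns the fixpoint and the list of partial stages."""
    J = {}
    stages = [J]
    for _ in range(max_rounds):
        nxt = immediate_consequence(edges, source, J)
        if nxt == J:
            return J, stages
        J = nxt
        stages.append(J)
    # Divergent values (e.g. a negative cycle) are not handled in this sketch.
    raise RuntimeError("no fixpoint within max_rounds")
```

On the edges `[("a", "b", 4), ("a", "c", 1), ("c", "b", 2), ("b", "d", 5)]` with source `"a"`, the bound for `"b"` first appears as 4 and is later improved to 3, so each stage refines the previous one until nothing changes.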

The coNP upper bound for fact entailment in [?] is shown by a reduction to validity of Presburger formulas of a certain shape. We next extend this reduction as given in [?] for a (semi-ground and positive) limit-linear program to account for datasets involving .

###### Definition A.3.

For each -ary object predicate , each -ary ordinary numeric predicate , each -ary limit predicate , each -tuple of objects , and each integer , let , , and be distinct propositional variables, and let a distinct integer variable.

For a semi-ground, positive, limit-linear program, is the Presburger formula where is the formula (with the same quantifier block as ) that is obtained by replacing each atom in with its encoding defined as follows:

• if is a comparison atom;

• if is an object atom of the form ;

• if is an ordinary numeric atom of the form where is a ground numeric term evaluating to (note that all ordinary numeric atoms in have this form since is semi-ground);

• if is a limit atom of the form where ; and

• if is a limit atom of the form .

Let be a pseudo-interpretation, and let be an assignment of Boolean and integer variables. Then, corresponds to if all of the following conditions hold for all , , , and as specified above, for each integer :

• if and only if ;

• if and only if ;

• if and only if or there exists such that ;

• and if and only if .

Note that in Definition A.3 ranges over all integers (which excludes ), is equal to some integer , and is a pseudo-interpretation and thus cannot contain both and ; thus, implies .

The key property of the Presburger encoding in [?] is established by the following lemma, which we easily re-prove for our variant of the encoding.

###### Lemma A.4.

Let be a pseudo-interpretation and let be a variable assignment such that corresponds to . Then,

1. if and only if for each ground atom , and

2. if and only if for each semi-ground, positive rule .

###### Proof.

Claim 1 follows analogously to the respective argument in [?], except that there is an extra case, namely , for a limit predicate. The proof of this case is analogous to, but simpler than, the case for where . Claim 2 then follows from Claim 1 in the same way as before. ∎

Using Lemma A.4, ? [?] establish the following correspondence between entailment for positive limit-linear programs and validity of Presburger sentences.

###### Lemma A.5.

For a semi-ground, positive, limit-linear program and a fact, there exists a Presburger sentence that is valid if and only if . Each is a conjunction of possibly negated atoms. Moreover, and each are bounded polynomially by . Number is bounded polynomially by and exponentially by . Finally, the magnitude of each integer in is bounded by the maximal magnitude of an integer in and .

By a more precise analysis of the Presburger formulas in the proof of Lemma A, we can sharpen the bounds provided by the lemma as follows, where (resp. , , etc.) stands for the size of the representation of (resp. , , etc.) assuming that all numbers take unit space.

###### Lemma A.6.

For a semi-ground, positive, limit-linear program and a fact, there exists a Presburger sentence that is valid if and only if . Each is a conjunction of possibly negated atoms. Moreover, is bounded polynomially in and each is bounded polynomially in . Number is bounded polynomially in and exponentially in . Finally, the magnitude of each integer in is bounded by the maximal magnitude of an integer in and .

Analogously to the notion of a model for an interpretation, we call With Lemma A at hand, ? [?] then show the following theorem, which bounds the magnitude of integers in counter-pseudo-models for entailment (the proof of the theorem adapts to our setting as is).

###### Theorem A.7.

For a semi-ground, positive, limit-linear program, a limit dataset, and a fact, if and only if a pseudo-model of exists where , , and the magnitude of each integer in is bounded polynomially in the largest magnitude of an integer in , exponentially in , and double-exponentially in .

Furthermore, the double-exponential bound in can be trivially sharpened to by employing Lemma A in place of Lemma A. Building on the proof of Theorem A.7, we next prove the following stronger version, which bounds the size of pseudo-materialisations of semi-ground, positive, limit-linear programs.

###### Proof.

Let be the maximal magnitude of an integer in , , and . Let be obtained from by removing each fact that does not unify with an atom in and let be a fresh nullary predicate.

Clearly, we have where is the least pseudo-interpretation w.r.t.  such that for each . Let be obtained from and fact analogously to the construction in the proof of Lemma A, but where each disjunct in is replaced by if and by if for some . It is easy to see that every assignment corresponding to is a countermodel of . Therefore, since satisfies the same structural constraints as the formula in Lemma A, by an argument analogous to the one in the proof of Theorem A.7 we obtain that has a pseudo-model such that , the magnitude of each integer in is bounded by some number that is polynomial in , exponential in , and double-exponential in , and where it holds that if and only if for each limit predicate and objects . Consequently, we have established that has a pseudo-model that satisfies the required bounds in the lemma. In what follows, we use the fact that to show that also satisfies the bounds in the lemma.

Let us denote by the partial pseudo-materialisation of for any and hence, . We start with the observation that the value of a number in a limit fact can only increase with respect to during the construction of . For instance, if , with a predicate, and , then . Let, and, for , be the maximum of

• ,

• the maximal magnitude of a negative integer occurring in a fact in , and

• the maximal magnitude of a positive integer occurring in a fact in .

Numbers allow us to bound the integers produced by the immediate consequence operator applied to pseudo-interpretation . Specifically, we argue that for each and rule with head for some , we have

• if ,

• if is a predicate, and

• if is a predicate.

To see why this holds, consider a pseudo-interpretation obtained from by replacing each IDB fact with , and each IDB fact with . By construction, we have and hence whenever is defined. But since the magnitude of all numbers in is bounded by , by Proposition 3 in [?], has a solution where the maximal magnitude of all numbers is bounded by , and hence the magnitude of the value of for this solution is bounded by (unless the value of is unbounded in , in which case and we are done). The last two subclaims are immediate since