Stratified Negation in Limit Datalog Programs
Abstract
There has recently been an increasing interest in declarative data analysis, where analytic tasks are specified using a logical language, and their implementation and optimisation are delegated to a general-purpose query engine. Existing declarative languages for data analysis can be formalised as variants of logic programming equipped with arithmetic function symbols and/or aggregation, and reasoning in them is typically undecidable. In prior work, the language of limit programs was proposed, which is sufficiently powerful to capture many analysis tasks and has a decidable entailment problem. Rules in this language, however, do not allow for negation. In this paper, we study an extension of limit programs with stratified negation-as-failure. We show that the additional expressive power makes reasoning computationally more demanding, and provide tight data complexity bounds. We also identify a fragment with tractable data complexity and sufficient expressivity to capture many relevant tasks.
Mark Kaminski, Bernardo Cuenca Grau, Egor V. Kostylev, Boris Motik and Ian Horrocks Department of Computer Science, University of Oxford, UK {mark.kaminski, bernardo.cuenca.grau, egor.kostylev, boris.motik, ian.horrocks}@cs.ox.ac.uk
1 Introduction
Data analysis tasks are becoming increasingly important in information systems. Although these tasks are currently implemented using code written in standard programming languages, in recent years there has been a significant shift towards declarative solutions where the definition of the task is clearly separated from its implementation [?; ?; ?; ?; ?; ?].
Languages for declarative data analysis are typically rule-based, and they have already been implemented in reasoning engines such as BOOM [?], DeALS [?], Myria [?], SociaLite [?], Overlog [?], Dyna [?], and Yedalog [?].
Formally, such declarative languages can be seen as variants of logic programming equipped with means for capturing quantitative aspects of the data, such as arithmetic function symbols and aggregates. It has, however, been well known since the 1990s that the combination of recursion with numeric computations in rules easily leads to semantic difficulties [?; ?; ?; ?; ?; ?; ?; ?], and/or undecidability of reasoning [?; ?]. In particular, undecidability carries over to the languages underpinning the aforementioned reasoning engines for data analysis.
Kaminski et al. [?] have recently proposed the language of limit Datalog programs—a decidable variant of negation-free Datalog equipped with arithmetic functions over the integers that is expressive enough to capture many data analysis tasks. The key feature of limit programs is that all intensional predicates with a numeric argument are limit predicates, the extension of which represents minimal (min) or maximal (max) bounds of numeric values. For instance, if we encode a weighted directed graph as facts over a ternary predicate edge and a unary predicate node in the obvious way, then the following rules encode the all-pairs shortest path problem, where the ternary min limit predicate sp is used to encode the distance from any node to any other node in the graph as the length of a shortest path between them.
(1) node(x) → sp(x, x, 0)
(2) sp(x, y, m) ∧ edge(y, z, n) → sp(x, z, m + n)
The semantics of min predicates is defined such that a fact sp(a, b, k) is entailed from these rules and a dataset if and only if the distance from a to b is at most k; as a result, all facts sp(a, b, k′) with k′ ≥ k are also entailed. This is in contrast to standard first-order predicates, where there is no semantic relationship between sp(a, b, k) and sp(a, b, k′). The intended semantics of limit predicates can be axiomatised using rules over standard predicates; in particular, our example limit program is equivalent to a standard logic program consisting of rules (1), (2), and the following rule (3), where sp is now treated as a regular first-order predicate:
(3) sp(x, y, m) ∧ (m ≤ n) → sp(x, y, n)
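To make the min-limit reading concrete, the following is a minimal Python sketch (not from the paper; the function names, data encoding, and the assumption of non-negative weights are ours) that computes the least entailed bound for each pair of nodes and then checks fact entailment in the style of rule (3):

```python
from itertools import product

def shortest_path_limits(nodes, edges):
    """Naive least-fixpoint iteration for rules (1)-(2): for every pair
    (a, b), compute the least k such that sp(a, b, k) is entailed (the
    limit value).  Assumes non-negative edge weights, so the fixpoint is
    reached after finitely many rounds."""
    INF = float("inf")
    dist = {(a, b): (0 if a == b else INF) for a, b in product(nodes, repeat=2)}
    changed = True
    while changed:
        changed = False
        for x, y in product(nodes, repeat=2):
            for z in nodes:
                if (y, z) in edges and dist[(x, y)] + edges[(y, z)] < dist[(x, z)]:
                    dist[(x, z)] = dist[(x, y)] + edges[(y, z)]  # rule (2)
                    changed = True
    return dist

def entails_sp(dist, a, b, k):
    """Rule (3): sp(a, b, k) is entailed iff the limit value is at most k."""
    return dist[(a, b)] <= k
```

For instance, on a graph with edges a→b of weight 1 and b→c of weight 2, the fact sp(a, c, k) is entailed exactly for the integers k ≥ 3.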
Kaminski et al. [?] showed that, under certain restrictions on the use of multiplication, reasoning (i.e., fact entailment) over limit programs is decidable and coNP-complete in data complexity; they also proposed a practical fragment with tractable data complexity.
Limit Datalog programs as defined in prior work are, however, positive and hence do not allow for negation-as-failure in the body of rules. Non-monotonic negation applied to limit atoms can be useful, not only to express a wider range of data analysis tasks, but also to declaratively obtain solutions to problems where the cost of such solutions is defined by a positive limit program. For instance, our example limit program consisting of rules (1) and (2) provides the length of a shortest path between any two nodes, but does not provide access to any of the paths themselves—an issue that we will be able to solve using non-monotonic negation.
In this paper, we study the language of limit programs with stratified negation-as-failure. Our language extends both positive limit Datalog as defined in prior work and plain (function-free) Datalog with stratified negation. We argue that our language provides useful additional expressivity, but at the expense of increased complexity of reasoning; for programs with restricted use of multiplication, complexity jumps from coNP-completeness in the case of positive programs to Δ^p_2-completeness for programs with stratified negation. We also show that the tractable fragment of positive limit programs defined in [?] can be seamlessly extended with stratified negation while preserving tractability of reasoning; furthermore, the extended fragment is sufficiently expressive to capture the relevant data analysis tasks.
The proofs of all our results are given in the appendix.
2 Preliminaries
In this section we recapitulate the syntax and semantics of Datalog programs with integer arithmetic and stratified negation (see e.g., [?] for an excellent survey).
Syntax We assume a fixed vocabulary of countably infinite, mutually disjoint sets of predicates equipped with nonnegative arities, objects, object variables, and numeric variables. Each position of an n-ary predicate is of either object or numeric sort. An object term is an object or an object variable. A numeric term is an integer, a numeric variable, or of the form s + t, s − t, or s × t, where s and t are numeric terms and +, −, and × are the standard arithmetic functions. A constant is an object or an integer. A standard atom is of the form A(t1, …, tn), with A an n-ary predicate and each ti a term matching the sort of the i-th position of A. A (standard) positive literal is a standard atom, and a (standard) negative literal is of the form ¬A(t1, …, tn), for A(t1, …, tn) a standard atom. A comparison atom is of the form (s < t) or (s ≤ t), with < and ≤ the usual comparison predicates over the integers, and s and t numeric terms. We write (s = t) as an abbreviation for (s ≤ t) ∧ (t ≤ s). A term, atom, or literal is ground if it has no variables.
A rule r has the form φ → α, where the body φ is a possibly empty conjunction of standard literals and comparison atoms, and the head α is a standard atom. We assume without loss of generality that standard body literals are function-free; indeed, a conjunction containing a standard literal with a functional term t can be equivalently rewritten by replacing t with a fresh variable m and adding (m = t) to the conjunction. A rule r is safe if each object variable in r occurs in a positive literal in the body of r. A ground instance of r is obtained from r by substituting each variable with a constant of the right sort.
A fact is a rule with empty body and a function-free standard atom in the head that has no variables in object positions and no repeated variables in numeric positions. Intuitively, a variable in a fact says that the fact holds for every integer in that position. As a convention, we will omit → and use the symbol ∗ instead of variables when writing facts. A dataset is a finite set of facts. A dataset D is ordered if (i) it contains facts first(a1), next(a1, a2), …, next(ak−1, ak), last(ak), for some repetition-free enumeration a1, …, ak of all objects in D; and (ii) it contains no other facts over the predicates first, next, and last. A program is a finite set of safe rules; without loss of generality, we assume that distinct rules do not share variables. A predicate A is intensional (IDB) in a program P if A occurs in P in the head of a rule that is not a fact; otherwise, A is extensional (EDB) in P. Program P is positive if it has no negative literals, and it is semipositive if negation occurs only in front of EDB atoms. A stratification of P is a function σ mapping each predicate to a positive integer such that, for each rule with head over a predicate A and each standard body literal over a predicate B, we have σ(B) ≤ σ(A) if the literal is positive, and σ(B) < σ(A) if the literal is negative. Program P is stratified if it admits a stratification. Given a stratification σ, we write P_i for the i-th stratum of P over σ, that is, the set of all rules in P whose head predicates A satisfy σ(A) = i. Note that each stratum is a semipositive program.
Semantics A (Herbrand) interpretation I is a possibly infinite set of ground facts (i.e., facts without ∗). Interpretation I satisfies a ground atom α, written I ⊨ α, if either (i) α is a standard atom such that evaluating the arithmetic functions in α under the usual semantics over the integers produces a fact in I; or (ii) α is a comparison atom that evaluates to true under the usual semantics. Interpretation I satisfies a ground negative literal ¬α, written I ⊨ ¬α, if I ⊭ α. The notion of satisfaction is extended to conjunctions of ground literals, rules, and programs as in first-order logic, with all variables in rules implicitly universally quantified. If I satisfies a program P, then I is a model of P. For I a Herbrand interpretation and S a (possibly infinite) semipositive set of rules, let T_S(I) be the set of facts α such that φ → α is a ground instance of a rule in S and I ⊨ φ. Given a program P and a stratification σ of P, for each i we define the interpretation I_i by induction on i and j:
I_0 = ∅ and, for i ≥ 1, I_i = ⋃_{j≥0} I_{i,j}, where I_{i,0} = I_{i−1} and I_{i,j+1} = I_{i,j} ∪ T_{P_i}(I_{i,j}).
The materialisation of P is the interpretation I_k, for k the greatest number such that σ maps some predicate of P to k. The materialisation of a program does not depend on the chosen stratification. A stratified program P entails a fact α, written P ⊨ α, if the materialisation of P satisfies every ground instance of α. For positive programs, this definition coincides with the usual first-order notion of entailment: for P positive and α a fact, P ⊨ α if and only if I ⊨ α holds for all models I of P.
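The stratum-by-stratum construction can be illustrated on ground, function-free rules. The sketch below (our own simplification, with rules given as (head, positive body, negative body) triples over ground atoms) mirrors the operator T and the inductive construction of the interpretations I_i:

```python
def t_operator(rules, interp):
    """One application of T: heads of rules whose positive body atoms all
    hold in `interp` and whose negated atoms do not."""
    return {head for head, pos, neg in rules
            if all(a in interp for a in pos) and all(a not in interp for a in neg)}

def materialise(strata, facts):
    """Bottom-up evaluation of a stratified ground program: iterate T within
    each stratum until a fixpoint, seeding stratum i with the interpretation
    computed for stratum i - 1.  Each stratum must be semipositive relative
    to the earlier ones for the result to be well defined."""
    interp = set(facts)
    for rules in strata:
        while True:
            new = interp | t_operator(rules, interp)
            if new == interp:
                break
            interp = new
    return interp
```

For example, with a first stratum deriving q from p and a second stratum deriving r from ¬s, the dataset {p} materialises to {p, q, r}; replacing ¬s by ¬q blocks the second stratum.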
Reasoning We study the computational properties of checking whether P ∪ D ⊨ α, for P a program, D a dataset, and α a fact. We are interested in data complexity, which assumes that only D and α form the input while P is fixed. Unless otherwise stated, all numbers in the input are coded in binary, and the size of D is the size of its representation. Checking P ∪ D ⊨ α is undecidable even if the only arithmetic function in P is + [?] and predicates have at most one numeric position [?].
We use standard definitions of the basic complexity classes such as P, NP, coNP, and FP. Given a complexity class C, P^C is the class of decision problems solvable in polynomial time by deterministic Turing machines with an oracle for a problem in C; the functional class FP^C is defined similarly. Finally, Δ^p_2 is a synonym for P^NP.
3 Stratified Limit Programs
We introduce stratified limit programs as a language that can be seen as either a semantic or a syntactic restriction of Datalog with integer arithmetic and stratified negation. Our language is also an extension of that in [?] with stratified negation.
Definition 1.
A stratified limit program is a pair (P, τ) where
– P is a stratified program where each predicate either has no numeric position, in which case it is an object predicate, or only its last position is numeric, in which case it is a numeric predicate; and
– τ is a partial function from numeric predicates to {min, max} that is total on the IDB predicates in P and on the predicates occurring in non-ground facts.
A numeric predicate A is a min (or max) limit predicate if τ(A) = min (or τ(A) = max, respectively). Numeric predicates that are not limit predicates are ordinary. An atom, fact, or literal is numeric, limit, etc., if so is its predicate.
All notions defined on ordinary Datalog programs (such as EDB and IDB predicates, stratification, etc.) transfer to limit programs (P, τ) by applying them to P. We often abuse notation and write P instead of (P, τ) when τ is clear from the context or immaterial. Whenever we consider a union of two limit programs, we silently assume that they coincide on τ. Finally, we denote ≤ (or ≥) by ⪯_A if A is a max (or, respectively, min) limit predicate.
Intuitively, a limit fact A(a, k) says that the value of A for a tuple of objects a is k or more, if A is max, or k or less, if A is min. For example, a limit fact sp(a, b, 5) in our all-pairs shortest path example says that node b is reachable from node a via a path of cost 5 or less. The intended semantics of limit predicates can be axiomatised using standard rules as given next.
Definition 2.
An interpretation satisfies a limit program (P, τ) if it satisfies the program P ∪ P_τ, where P_τ contains the following rule for each min (or max, respectively) limit predicate A in P:
A(x, m) ∧ (m ≤ m′) → A(x, m′)   (or A(x, m) ∧ (m′ ≤ m) → A(x, m′), respectively).
The materialisation of (P, τ) is the materialisation of P ∪ P_τ; and (P, τ) entails α, written (P, τ) ⊨ α, if P ∪ P_τ ⊨ α.
We next demonstrate the use of stratified negation with examples. One of the main uses of negation of a limit atom is to 'access' the limit value (e.g., the length of a shortest path) attained by the atom in the materialisation of previous strata, and then exploit such values in further computations. To facilitate such use of negation in examples, we introduce a new operator as syntactic sugar in the language.
Definition 3.
The least upper bound expression ⌈A(t, m)⌉ of a max (or min) limit atom A(t, m) is the conjunction A(t, m) ∧ ¬A(t, m′) ∧ (m′ = m + 1) (or (m′ = m − 1), respectively), where m′ is a fresh variable.
Clearly, I ⊨ ⌈A(t, k)⌉ for an interpretation I and a ground atom A(t, k) if and only if k is the limit integer for A(t) in I, that is, I ⊨ A(t, k) but I ⊭ A(t, k + 1) (or I ⊭ A(t, k − 1), respectively).
Example 4.
An input of the single-pair shortest path problem can be encoded in the obvious way as a dataset D using a ternary ordinary numeric predicate edge to represent the graph's weighted edges, and unary facts source(s) and target(t) to identify the source and target nodes s and t, respectively. The stratified limit program given next computes, together with D (where all edge weights are positive), a DAG over a binary object predicate sp such that every maximal path in the DAG is a shortest path from s to t.
(4) source(x) → dist(x, 0)
(5) dist(x, m) ∧ edge(x, y, n) → dist(y, m + n)
(6) ⌈dist(x, m)⌉ ∧ edge(x, y, n) ∧ ⌈dist(y, k)⌉ ∧ (k = m + n) ∧ target(y) → sp(x, y)
(7) ⌈dist(x, m)⌉ ∧ edge(x, y, n) ∧ ⌈dist(y, k)⌉ ∧ (k = m + n) ∧ sp(y, z) → sp(x, y)
The first stratum consists of rules (4) and (5), and computes the length of a shortest path from s to all other nodes using the min limit predicate dist; in particular, the materialisation satisfies ⌈dist(v, k)⌉ if and only if k is the length of a shortest path from s to v. Then, in a second stratum, the program computes the predicate sp such that sp(v, w) holds if and only if the edge (v, w) is part of a shortest path from s to t. ∎
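The two strata of Example 4 can be mimicked directly in Python. In the sketch below (our own names and encoding, assuming positive edge weights as in the example), the first function plays the role of the min limit predicate and the backward-marking loop plays the role of the negation-based second stratum:

```python
def single_source_dist(nodes, edges, s):
    """First stratum: least fixpoint of rules (4)-(5) from source s.
    `edges` maps (x, y) to a positive weight."""
    INF = float("inf")
    dist = {v: (0 if v == s else INF) for v in nodes}
    for _ in range(len(nodes)):          # |V| relaxation rounds suffice
        for (x, y), w in edges.items():
            dist[y] = min(dist[y], dist[x] + w)
    return dist

def shortest_path_dag(nodes, edges, s, t):
    """Second stratum: mark edge (x, y) when dist(x) + w = dist(y) -- the
    condition that the least upper bound expressions (i.e., negation)
    express -- and y already reaches t along marked edges."""
    dist = single_source_dist(nodes, edges, s)
    marked, frontier = set(), {t}
    while frontier:
        y = frontier.pop()
        for (x, y2), w in edges.items():
            if y2 != y or (x, y) in marked:
                continue
            if dist[y] != float("inf") and dist[x] + w == dist[y]:
                marked.add((x, y))       # sp(x, y) holds
                frontier.add(x)
    return marked
```

Every maximal path through the returned edge set is then a shortest path from s to t.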
Example 5.
The closeness centrality of a node in a strongly connected weighted directed graph is a measure of how central the node is in the graph [?]; variants of this measure are useful, for instance, for the analysis of market potential. Most commonly, the closeness centrality of a node v is defined as 1 / Σ_w d(v, w), where d(v, w) is the length of a shortest path from v to w; the sum in the denominator is often called the farness centrality of v. We next give a limit program computing a node of maximal closeness centrality in a given directed graph. We encode a graph as an ordered dataset using, as before, a unary object predicate node and a ternary ordinary numeric predicate edge. The program consists of rules (8)–(16), where d, a, and f are min limit predicates, and b and c are object predicates.
(8) node(x) → d(x, x, 0)
(9) d(x, y, m) ∧ edge(y, z, n) → d(x, z, m + n)
(10) node(x) ∧ first(y) ∧ d(x, y, m) → a(x, y, m)
(11) a(x, y, m) ∧ next(y, z) ∧ d(x, z, n) → a(x, z, m + n)
(12) a(x, y, m) ∧ last(y) → f(x, m)
(13) first(x) → b(x, x)
(14) b(x, y) ∧ next(x, z) ∧ ⌈f(z, m)⌉ ∧ ⌈f(y, n)⌉ ∧ (m < n) → b(z, z)
(15) b(x, y) ∧ next(x, z) ∧ ⌈f(z, m)⌉ ∧ ⌈f(y, n)⌉ ∧ (n ≤ m) → b(z, y)
(16) b(x, y) ∧ last(x) → c(y)
The first stratum consists of rules (8)–(12). Rules (8) and (9) compute the distance (length of a shortest path) between any two nodes. Rules (10)–(12) then compute the farness centrality of each node based on the aforementioned distances; for this, the program exploits the order predicates to iterate over the nodes in the graph while recording the partial sum accumulated so far in the iteration using the auxiliary predicate a. In the second stratum (rules (13)–(16)), the program uses negation to compute the node of minimum farness centrality (and hence of maximum closeness centrality), which is recorded using the predicate c; the order is again exploited to iterate over nodes, and the auxiliary predicate b is used to record the current node of the iteration and the node with the best centrality encountered so far. ∎
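The same computation, written directly in Python (our sketch, not the paper's; ties are broken by the given node order, just as the program's iteration over first/next/last would):

```python
def most_central(nodes, edges):
    """Return a node of minimal farness (hence maximal closeness centrality)
    in a strongly connected weighted digraph.  `nodes` is an ordered list,
    mirroring the ordered dataset; `edges` maps (u, v) to a weight."""
    INF = float("inf")

    def dist_from(x):                    # rules (8)-(9): distances from x
        d = {v: (0 if v == x else INF) for v in nodes}
        for _ in range(len(nodes)):
            for (u, v), w in edges.items():
                d[v] = min(d[v], d[u] + w)
        return d

    # rules (10)-(12): farness = sum of distances to all nodes
    farness = {x: sum(dist_from(x).values()) for x in nodes}
    # rules (13)-(16): scan the nodes in order, keeping the best so far
    return min(nodes, key=lambda x: farness[x])
```

On a bidirectional path a–b–c with unit weights, node b has farness 2 while a and c have farness 3, so b is returned.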
4 Stratified LimitLinear Programs
By results in [?], checking fact entailment is undecidable even for positive limit programs. Essentially, this follows from the fact that checking rule applicability over a set of facts requires solving arbitrary nonlinear inequalities over the integers—that is, solving Hilbert's 10th problem, which is undecidable. To regain decidability, they proposed a restriction on positive limit programs, called limit-linearity, which ensures that every program satisfying the restriction can be transformed using a grounding technique so that all numeric terms in the resulting program are linear. In particular, this implies that rule applicability can be determined by solving a system of linear inequalities, which is feasible in NP. As a result, fact entailment for positive limit-linear programs is coNP-complete in data complexity.
We next extend the notion of limit-linearity to programs with stratified negation, and define semigrounding as a way to simplify a limit-linear program by replacing certain types of variables with constants. We then prove that fact entailment is Δ^p_2-complete in data complexity for such programs. All programs in our previous examples are limit-linear as per the definition given next.
Definition 6.
A numeric variable m is guarded in a rule r of a stratified limit program if
– m occurs in a positive ordinary literal in r; or
– the body of r contains the literals A(t, m) and ¬A(t, m′) as well as the comparison (m′ = m + 1) (or (m′ = m − 1), respectively), where A is a max (or min) predicate and t is a tuple of object terms.
Rule r is limit-linear if each numeric term in r is of the form s0 + s1 × m1 + … + sj × mj, where each mi is a distinct numeric variable not occurring in r in a (positive or negative) ordinary numeric literal, term s0 uses only variables occurring in a positive ordinary literal in r, and the terms si with 1 ≤ i ≤ j use only variables that are guarded in r and do not use m1, …, mj. A limit-linear program contains only limit-linear rules.
A rule is semiground if all its variables are numeric and occur only in limit and comparison atoms. The semigrounding of a program P is obtained by replacing, in every rule in P, each object variable and each numeric variable occurring in an ordinary numeric atom with a constant in P in all possible ways.
It is easily seen that the semigrounding of a limit-linear program P ∪ D entails the same facts as P ∪ D for every dataset D. Furthermore, as in prior work, Definition 6 ensures that the semigrounding of a positive limit-linear program contains only linear numeric terms; finally, for programs with stratified negation, it ensures that negation can be eliminated while preserving limit-linearity when the program is materialised stratum by stratum, as we will discuss in detail later on.
Decidability of fact entailment for positive limit-linear programs is established by first semigrounding the program and then reducing fact entailment over the resulting program to the validity problem for Presburger formulas [?]—that is, first-order formulas interpreted over the integers and composed using only variables, the constants 0 and 1, the functions + and −, and the comparisons.
The extension of such a reduction to stratified limit programs, however, is complicated by the fact that, in the presence of negation-as-failure, entailment no longer coincides with classical first-order entailment. We thus adopt a different approach, in which we show decidability and establish data complexity upper bounds according to the following steps.
Step 1. We extend the results in [?] for positive programs by showing that, for every positive limit-linear program P and dataset D, we can compute in FP^NP a finite representation of its (possibly infinite) materialisation (see Lemma 8 and Corollary 9). This representation is called the pseudomaterialisation of P ∪ D.
Step 2. We further extend the results of Step 1 to semipositive limit-linear programs, where negation occurs only in front of EDB predicates. For this, we show that fact entailment for such programs can be reduced in polynomial time in the size of the data to fact entailment over semiground positive limit-linear programs by exploiting the notion of a reduct (see Definition 10 and Lemma 11). Thus, we can assume the existence of an oracle O for computing the pseudomaterialisation of a semipositive limit-linear program.
Step 3. We provide an algorithm (see Algorithm 1) that decides entailment of a fact α by a stratified limit-linear program using the oracle O from Step 2. The algorithm maintains a pseudomaterialisation J, which is initially empty and is constructed bottom-up stratum by stratum. In each step i, the algorithm updates the pseudomaterialisation by applying O to the union of the pseudomaterialisation for stratum i − 1 and the rules in the i-th stratum. The final J, from which entailment of α is obtained, is computed using a number of oracle calls that is constant in the size of the data, which yields a Δ^p_2 data complexity upper bound (Proposition 13 and Theorem 15).
In what follows, we specify each of these steps. We start by formally defining the notion of the pseudomaterialisation of a stratified limit program P, which compactly represents the materialisation I of P. Intuitively, I can be infinite because it can contain, for a limit predicate A and a tuple a of objects of suitable arity, an infinite number of facts of the form A(a, k). However, if the materialisation has facts of this form, then either there is a limit value ℓ such that I contains A(a, k) exactly for the integers k with k ≥ ℓ (if A is min) or with k ≤ ℓ (if A is max), or I contains A(a, k) for every integer k. As argued in prior work, it then suffices for the pseudomaterialisation to contain only the single fact A(a, ℓ) in the former case, or A(a, ∞) in the latter case.
Definition 7.
A pseudointerpretation J is a set of facts such that ∞ occurs only in facts over limit predicates and ℓ = ℓ′ holds for all facts A(a, ℓ) and A(a, ℓ′) in J with A a limit predicate.
The pseudomaterialisation of a limit program P, written pm(P), is the (unique) pseudointerpretation J such that
– an object or ordinary numeric fact is contained in J if and only if it is contained in the materialisation of P; and
– for each limit predicate A, object tuple a, and integer ℓ,
  – A(a, ℓ) ∈ J if and only if P ⊨ A(a, ℓ) and P ⊭ A(a, k) for all integers k < ℓ (if A is min) or k > ℓ (if A is max), and
  – A(a, ∞) ∈ J if and only if P ⊨ A(a, k) for all integers k.
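A pseudointerpretation is easy to represent concretely. The sketch below (ours, with float("inf") standing in for the symbol ∞) keeps one strongest fact per predicate and argument tuple, exactly as Definition 7 prescribes:

```python
INF = float("inf")   # stands for the symbol ∞ in pseudofacts

def add_fact(pseudo, pred, tup, k, polarity):
    """Record the limit fact pred(tup, k) in the pseudointerpretation,
    keeping only the strongest bound: the smallest value for a min
    predicate, the largest for a max predicate, or ∞ on divergence."""
    cur = pseudo.get((pred, tup))
    if cur == INF or k == INF:
        pseudo[(pred, tup)] = INF        # the value diverges
    elif cur is None:
        pseudo[(pred, tup)] = k
    else:
        pseudo[(pred, tup)] = min(cur, k) if polarity == "min" else max(cur, k)

def entails(pseudo, pred, tup, k, polarity):
    """Check pred(tup, k) against the pseudointerpretation: a min fact with
    limit v entails all k >= v, a max fact all k <= v, and ∞ entails all k."""
    v = pseudo.get((pred, tup))
    if v is None:
        return False
    if v == INF:
        return True
    return k >= v if polarity == "min" else k <= v
```

This finite structure is all that the algorithms in this section ever need to manipulate.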
We now strengthen the results in [?] by establishing a bound on the size of pseudomaterialisations of positive, limit-linear programs.
Lemma 8. Let P be a semiground, positive, limit-linear program, and let D be a limit dataset. Then the number of facts in pm(P ∪ D) and the magnitude of each integer in pm(P ∪ D) are bounded polynomially in the largest magnitude of an integer in P ∪ D, exponentially in ‖D‖, and double-exponentially in ‖P‖, where ‖·‖ stands for the size of the representation of its argument assuming that all numbers take unit space.
By Lemma 8, the pseudomaterialisation of P ∪ D contains at most linearly many facts in the size of D; furthermore, the size of each such fact is bounded polynomially once P is considered fixed. Hence, the pseudomaterialisation of P ∪ D can be computed in FP^NP in data complexity, even if P is not semiground.
Corollary 9. Let P be a positive, limit-linear program. Then the function mapping each limit dataset D to pm(P ∪ D) is computable in FP^NP in the size of D.
In our second step, we extend this result to semipositive programs. For this, we start by defining the notion of the reduct of a semipositive limit-linear program P. The reduct is obtained by first computing a semiground instance of P and then eliminating all negative literals in it while preserving fact entailment. Intuitively, negative literals can be eliminated because they involve only EDB predicates; as a result, their extension can be computed in polynomial time from the facts in P alone. To eliminate a ground negative literal ¬α, it suffices to check whether α is entailed by the facts in P and to simplify all rules containing ¬α accordingly; in turn, limit literals involving a numeric variable can be rewritten as comparisons of that variable with a constant computed from the facts in P.
Definition 10.
Let P be a semipositive, limit-linear program and let F be the subset of all facts in P. The reduct of P is obtained by first computing the semigrounding of P and then applying the following transformations to each rule r and each negative body literal ¬α in r:
– if α is a ground atom, delete r if F ⊨ α, and delete ¬α from r otherwise;
– if ¬α = ¬A(a, m) is a non-ground limit literal, then
  – delete r if F ⊨ A(a, k) for each integer k;
  – delete ¬α from r if F ⊭ A(a, k) for each integer k; and
  – replace ¬α in r with (m < ℓ) if A is a min predicate, or with (ℓ < m) if A is a max predicate, otherwise, where ℓ is the limit integer such that F ⊨ A(a, ℓ).
Note that semiground programs disallow nonground negative literals over ordinary numeric predicates, which is why these are not considered in Definition 10. As shown by the following lemma, reducts allow us to reduce fact entailment for semipositive, limitlinear programs to semiground, positive, limitlinear programs.
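For ground negative EDB literals, the transformation of Definition 10 amounts to a simple pass over the rules. A Python sketch (ours, covering only the ground case, with rules represented as (head, positive body, negative body) triples):

```python
def reduct(rules, edb_facts):
    """Eliminate ground negative EDB literals: a rule is discarded if some
    negated atom is among the EDB facts (the literal can never hold);
    otherwise all its negated atoms are absent, so the negative literals
    hold unconditionally and are simply dropped."""
    out = []
    for head, pos, neg in rules:
        if not any(a in edb_facts for a in neg):
            out.append((head, pos, []))   # all negated atoms are absent
    return out
```

The resulting rules are positive, so the machinery of Step 1 applies to them directly.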
Lemma 11. For P a semipositive, limit-linear program, D a limit dataset, P′ the reduct of P ∪ D, and α a fact, we have P ∪ D ⊨ α if and only if P′ ⊨ α. Moreover, P′ can be computed in polynomial time in the size of D, the size of P′ is polynomially bounded in the size of D, and the magnitudes of the integers in P′ are polynomially bounded in those of P ∪ D.
The results in Lemmas 8 and 11 imply that the pseudomaterialisation of a semipositive, limit-linear program can be computed in FP^NP in data complexity.
Lemma 12. Let P be a semipositive, limit-linear program. Then the function mapping each limit dataset D to pm(P ∪ D) is computable in FP^NP in the size of D.
We are now ready to present Algorithm 1, which decides entailment of a fact α by a stratified limit-linear program P. The algorithm uses an oracle O for computing the pseudomaterialisation of a semipositive program. The existence of such an oracle and its computational bounds are ensured by Lemma 12. Algorithm 1 constructs the pseudomaterialisation of P stratum by stratum in a bottom-up fashion. For each stratum i, the algorithm uses O to compute the pseudomaterialisation of the program consisting of the rules in the current stratum and the facts in the pseudomaterialisation computed for the previous stratum. Once the final pseudomaterialisation J has been constructed, entailment of α is checked directly over J.
Correctness of the algorithm is immediate by the properties of O and the correspondence between pseudomaterialisations and materialisations. Moreover, if oracle O runs in FP^C in data complexity, for some complexity class C, then it can only return a pseudointerpretation that is polynomially bounded in data complexity; as a result, Algorithm 1 runs in P^C since the number of strata of P does not depend on the input dataset.
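In outline, Algorithm 1 is just the following loop (a Python sketch with our own representation; the parameter `oracle` stands for O and must return the pseudomaterialisation of the semipositive program it is given):

```python
def entails_stratified(strata, dataset, fact, oracle):
    """Bottom-up pseudomaterialisation: one oracle call per stratum, then a
    direct check of the goal fact.  With constantly many strata and an
    FP^NP oracle, this runs in P^NP = Δ^p_2 in data complexity."""
    pseudo = set(dataset)
    for stratum in strata:
        pseudo = oracle(stratum, pseudo)   # stratum rules + facts so far
    return fact in pseudo
```

Any implementation of the oracle can be plugged in; the data complexity bound is inherited from it.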
Proposition 13. If oracle O is computable in FP^C in data complexity, then Algorithm 1 runs in P^C in data complexity.
The following upper bound follows immediately from the correctness of Algorithm 1 and Proposition 13.
Lemma 14. For P a stratified, limit-linear program and α a fact, deciding P ⊨ α is in Δ^p_2 in data complexity.
The matching lower bound is obtained by reduction from the OddMinSAT problem [?]. An instance of OddMinSAT consists of a repetition-free tuple x = (x1, …, xn) of variables and a satisfiable propositional formula φ over these variables. The question is whether the truth assignment σ satisfying φ for which the tuple (σ(x1), …, σ(xn)) is lexicographically minimal, assuming false < true, among all satisfying truth assignments of φ has σ(xn) = true. In our reduction, φ is encoded as a dataset D using object predicates to encode the structure of φ and numeric predicates to encode the order of the variables in x; a fixed, two-strata program P then goes through all assignments in ascending lexicographic order and evaluates the encoding of φ until it finds some σ that makes φ true; P then derives a fixed goal fact α if and only if σ(xn) = true. Thus, P ∪ D ⊨ α if and only if (x, φ) belongs to the language of OddMinSAT.
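For intuition, OddMinSAT itself is easy to state in code. Below is a brute-force Python sketch under our reading of the stripped definition (false < true, and the question concerns the value of the last variable):

```python
from itertools import product

def odd_min_sat(variables, formula):
    """Return the value of the last variable in the lexicographically
    minimal satisfying assignment of `formula` (a predicate over dicts),
    assuming false < true; raises if the formula is unsatisfiable.
    Exponential brute force -- the point of the reduction is that a fixed
    two-strata limit program performs this search within the Δ^p_2 bound."""
    for bits in product([False, True], repeat=len(variables)):
        sigma = dict(zip(variables, bits))
        if formula(sigma):               # first hit is the lex-minimal one
            return sigma[variables[-1]]
    raise ValueError("formula is unsatisfiable")
```

Note that `itertools.product` enumerates assignments in exactly the required lexicographic order.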
Theorem 15. For P a stratified, limit-linear program and α a fact, deciding P ⊨ α is Δ^p_2-complete in data complexity. The lower bound holds already for programs with two strata.
5 A Tractable Fragment
Tractability in data complexity is an important requirement in data-intensive applications. In this section, we propose a syntactic restriction on stratified, limit-linear programs that is sufficient to ensure tractability of fact entailment in data complexity. Our restriction extends the notion of type consistency from prior work to account for negation. The programs in Examples 4 and 5 are type-consistent.
Definition 16.
A semiground, limit-linear rule r is type-consistent if
– each numeric term in r is of the form k0 + k1 × m1 + … + kj × mj, where k0 is an integer and each ki, 1 ≤ i ≤ j, is a nonzero integer, called the coefficient of variable mi in r;
– each numeric variable occurs in exactly one standard body literal of r;
– each numeric variable in a negative literal of r is guarded;
– if the head of r is a limit atom with numeric argument s, then each unguarded variable occurring in s with a positive (or negative) coefficient also occurs in the body of r in a (unique) positive limit literal that is of the same (or different, respectively) type (i.e., min vs. max) as the head; and
– for each comparison (s < t) or (s ≤ t) in r, each unguarded variable occurring in s with a positive (or negative) coefficient also occurs in a (unique) positive min (or max, respectively) body literal, and each unguarded variable occurring in t with a positive (or negative) coefficient occurs in a (unique) positive max (or min, respectively) body literal.
A semiground, stratified, limit-linear program is type-consistent if all of its rules are type-consistent. A stratified limit-linear program P is type-consistent if the program obtained by first semigrounding P and then simplifying all numeric terms as much as possible is type-consistent.
Similarly to type-consistency for positive programs, Definition 16 ensures that divergence of limit facts to ∞ can be detected in polynomial time when constructing a pseudomaterialisation (see [?] for details). Furthermore, the conditions in Definition 16 have been crafted so that the reduct of a semipositive type-consistent program (and hence of any intermediate program considered while materialising a stratified program) can be trivially rewritten into a positive type-consistent program. For this, it is essential to require guarded use of negation (see the third condition in Definition 16).
Lemma 17. For P a semipositive, type-consistent program and D a limit dataset, the reduct of P ∪ D is polynomially rewritable into a positive, semiground, type-consistent program P′ such that, for each fact α, P ∪ D ⊨ α if and only if P′ ⊨ α.
Lemma 17 allows us to extend the polynomial-time algorithm in [?] for computing the pseudomaterialisation of a positive type-consistent program to semipositive programs, thus obtaining a tractable implementation of oracle O restricted to type-consistent programs. This suffices since Algorithm 1, when given a type-consistent program as input, only applies O to type-consistent programs. Thus, by Proposition 13, we obtain a polynomial-time upper bound on the data complexity of fact entailment for type-consistent programs with stratified negation. Since plain Datalog is already P-hard in data complexity, this upper bound is tight.
Theorem 18. For P a stratified, type-consistent program and α a fact, deciding P ⊨ α is P-complete in data complexity.
Finally, as we show next, our extended notion of type consistency can be efficiently recognised.
Proposition 19. Checking whether a stratified, limit-linear program is type-consistent is feasible in LogSpace.
6 Conclusion and Future Work
Motivated by declarative data analysis applications, we have extended the language of limit programs with stratified negation-as-failure. We have shown that the additional expressive power provided by our extended language comes at a computational cost, but we have also identified sufficient syntactic conditions that ensure tractability of reasoning in data complexity. There are many avenues for future work. First, it would be interesting to formally study the expressive power of our language. Since type-consistent programs extend plain (function-free) Datalog with stratified negation, it is clear that they capture P on ordered datasets [?], and we conjecture that the full language of stratified limit-linear programs captures Δ^p_2. From a more practical perspective, we believe that limit programs can naturally express many tasks that admit a dynamic programming solution (e.g., variants of the knapsack problem, and many others). Conceptually, a dynamic programming approach can be seen as a three-stage process: first, one constructs an acyclic 'graph of subproblems' that orders the subproblems from smallest to largest; then, one computes a shortest/longest path over this graph to obtain the value of optimal solutions; finally, one computes the actual solutions backwards by tracing back through the graph. Capturing the third stage seems to always require non-monotonic negation (as illustrated in our path computation example), whereas the first stage may or may not require it depending on the problem; the second stage can be realised with a (recursive) positive program. Second, our formalism should be extended with aggregate functions. Although certain forms of aggregation can be simulated using arithmetic functions and by iterating over the object domain by exploiting the ordering, having aggregation explicitly would allow us to express certain tasks in a more natural way.
Third, we would like to go beyond stratified negation and investigate the theoretical properties of limit Datalog under the well-founded [?] or the stable model semantics [?]. Finally, we plan to implement our reasoning algorithms and test them in practice.
Acknowledgments
This research was supported by the EPSRC projects DBOnto, MaSI, and ED.
References
 [Alvaro et al., 2010] Peter Alvaro, Tyson Condie, Neil Conway, Khaled Elmeleegy, Joseph M. Hellerstein, and Russell Sears. BOOM analytics: exploring data-centric, declarative programming for the cloud. In EuroSys 2010, pages 223–236. ACM, 2010.
 [Beeri et al., 1991] Catriel Beeri, Shamim A. Naqvi, Oded Shmueli, and Shalom Tsur. Set constructors in a logic database language. J. Log. Program., 10(3&4):181–232, 1991.
 [Chin et al., 2015] Brian Chin, Daniel von Dincklage, Vuk Ercegovac, Peter Hawkins, Mark S. Miller, Franz Josef Och, Christopher Olston, and Fernando Pereira. Yedalog: Exploring knowledge at scale. In SNAPL 2015, volume 32 of LIPIcs, pages 63–78. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2015.
 [Chistikov and Haase, 2016] Dmitry Chistikov and Christoph Haase. The taming of the semilinear set. In ICALP 2016, volume 55 of LIPIcs, pages 128:1–128:13. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2016.
 [Consens and Mendelzon, 1993] Mariano P. Consens and Alberto O. Mendelzon. Low complexity aggregation in GraphLog and Datalog. Theor. Comput. Sci., 116(1):95–116, 1993.
 [Dantsin et al., 2001] Evgeny Dantsin, Thomas Eiter, Georg Gottlob, and Andrei Voronkov. Complexity and expressive power of logic programming. ACM Comput. Surv., 33(3):374–425, 2001.
 [Eisner and Filardo, 2011] Jason Eisner and Nathaniel Wesley Filardo. Dyna: Extending datalog for modern AI. In Datalog 2010, volume 6702 of LNCS, pages 181–220. Springer, 2011.
 [Ganguly et al., 1995] Sumit Ganguly, Sergio Greco, and Carlo Zaniolo. Extrema predicates in deductive databases. J. Comput. Syst. Sci., 51(2):244–259, 1995.
 [Gelfond and Lifschitz, 1988] Michael Gelfond and Vladimir Lifschitz. The stable model semantics for logic programming. In ICLP/SLP 1988, pages 1070–1080. MIT Press, 1988.
 [Kaminski et al., 2017] Mark Kaminski, Bernardo Cuenca Grau, Egor V. Kostylev, Boris Motik, and Ian Horrocks. Foundations of declarative data analysis using limit datalog programs. In IJCAI 2017, pages 1123–1130. ijcai.org, 2017.
 [Kemp and Stuckey, 1991] David B. Kemp and Peter J. Stuckey. Semantics of logic programs with aggregates. In ISLP 1991, pages 387–401. MIT Press, 1991.
 [Krentel, 1988] Mark W. Krentel. The complexity of optimization problems. J. Comput. System Sci., 36(3):490–509, 1988.
 [Loo et al., 2009] Boon Thau Loo, Tyson Condie, Minos N. Garofalakis, David E. Gay, Joseph M. Hellerstein, Petros Maniatis, Raghu Ramakrishnan, Timothy Roscoe, and Ion Stoica. Declarative networking. Commun. ACM, 52(11):87–95, 2009.
 [Markl, 2014] Volker Markl. Breaking the chains: On declarative data analysis and data independence in the big data era. PVLDB, 7(13):1730–1733, 2014.
 [Mazuran et al., 2013] Mirjana Mazuran, Edoardo Serra, and Carlo Zaniolo. Extending the power of datalog recursion. VLDB J., 22(4):471–493, 2013.
 [Mumick et al., 1990] Inderpal Singh Mumick, Hamid Pirahesh, and Raghu Ramakrishnan. The magic of duplicates and aggregates. In VLDB 1990, pages 264–277. Morgan Kaufmann, 1990.
 [Ross and Sagiv, 1997] Kenneth A. Ross and Yehoshua Sagiv. Monotonic aggregation in deductive databases. J. Comput. System Sci., 54(1):79–97, 1997.
 [Sabidussi, 1966] Gert Sabidussi. The centrality index of a graph. Psychometrika, 31(4):581–603, 1966.
 [Seo et al., 2015] Jiwon Seo, Stephen Guo, and Monica S. Lam. SociaLite: An efficient graph query language based on datalog. IEEE Trans. Knowl. Data Eng., 27(7):1824–1837, 2015.
 [Shkapsky et al., 2016] Alexander Shkapsky, Mohan Yang, Matteo Interlandi, Hsuan Chiu, Tyson Condie, and Carlo Zaniolo. Big data analytics with datalog queries on Spark. In SIGMOD 2016, pages 1135–1149. ACM, 2016.
 [Van Gelder et al., 1991] Allen Van Gelder, Kenneth A. Ross, and John S. Schlipf. The wellfounded semantics for general logic programs. J. ACM, 38(3):620–650, 1991.
 [Van Gelder, 1992] Allen Van Gelder. The wellfounded semantics of aggregation. In PODS 1992, pages 127–138. ACM Press, 1992.
 [Wang et al., 2015] Jingjing Wang, Magdalena Balazinska, and Daniel Halperin. Asynchronous and fault-tolerant recursive datalog evaluation in shared-nothing engines. PVLDB, 8(12):1542–1553, 2015.
Appendix A Proofs for Section 4
Before proceeding to the proofs of our theorems in the main body of the paper, we restate some notions from [?]. All models of a limit program are easily seen to satisfy the following closure property.
Definition A.1.
An interpretation is limit-closed (for a limit program ) if, for each fact where is a limit predicate, holds for each integer with .
There is a one-to-one correspondence between pseudointerpretations and limit-closed interpretations, and thus each model of a program can be equivalently represented by a pseudointerpretation.
Definition A.2.
A limit-closed interpretation corresponds to a pseudointerpretation if the following conditions hold:

an object or ordinary numeric fact is contained in if and only if it is contained in ; and

for each limit predicate , each tuple of objects , and each integer , (i) for all if and only if , and (ii) and for all and is a limit predicate if and only if .
Let and be pseudointerpretations corresponding to interpretations and . Then, satisfies a ground atom , written , if ; is a pseudomodel of a program , written , if ; finally, holds if .
? [?] then define an immediate consequence operator for positive limit programs that works on pseudointerpretations and show that the pseudomaterialisation of a positive limit program can be computed as the pseudointerpretation inductively defined as follows, where , for a set of pseudointerpretations, is the supremum of w.r.t. :
We call pseudointerpretations partial pseudomaterialisations of .
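For intuition only, the bottom-up fixpoint computation underlying such (partial) pseudomaterialisations can be sketched for plain positive Datalog; the rule and fact encoding below is our own simplification and omits the numeric and limit machinery entirely, so it corresponds to the immediate consequence operator of ordinary Datalog rather than the operator on pseudointerpretations defined in [?]:

```python
# Naive bottom-up evaluation of a positive Datalog program: repeatedly apply
# the immediate consequence operator until a fixpoint is reached.
# A rule is (head, body); each atom is (predicate, tuple of terms), and
# terms starting with an uppercase letter are variables.

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def match(atom, fact, subst):
    """Extend subst so that atom matches fact, or return None."""
    pred, args = atom
    if pred != fact[0] or len(args) != len(fact[1]):
        return None
    s = dict(subst)
    for a, v in zip(args, fact[1]):
        if is_var(a):
            if s.setdefault(a, v) != v:
                return None
        elif a != v:
            return None
    return s

def materialise(rules, facts):
    facts = set(facts)
    while True:
        new = set()
        for head, body in rules:
            # Try all ways of matching the body against the known facts.
            substs = [{}]
            for atom in body:
                substs = [s2 for s in substs for f in facts
                          if (s2 := match(atom, f, s)) is not None]
            for s in substs:
                pred, args = head
                new.add((pred, tuple(s.get(a, a) for a in args)))
        if new <= facts:  # fixpoint reached
            return facts
        facts |= new

# Transitive closure of edge/2 as a two-rule program.
rules = [(("path", ("X", "Y")), [("edge", ("X", "Y"))]),
         (("path", ("X", "Z")), [("edge", ("X", "Y")), ("path", ("Y", "Z"))])]
facts = {("edge", ("a", "b")), ("edge", ("b", "c"))}
m = materialise(rules, facts)
print(("path", ("a", "c")) in m)  # prints True
```

The sets of facts obtained after each iteration of the loop play the role that the partial pseudomaterialisations play in the limit setting, where additionally the numeric arguments of limit facts may grow (or be flagged as unbounded) between iterations.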
The coNP upper bound for fact entailment in [?] is shown by a reduction to validity of Presburger formulas of a certain shape. We next extend this reduction, as given in [?] for a (semi-ground and positive) limit-linear program, to account for datasets involving .
Definition A.3.
For each ary object predicate , each ary ordinary numeric predicate , each ary limit predicate , each tuple of objects , and each integer , let , , and be distinct propositional variables, and let be a distinct integer variable.
For a semi-ground, positive, limit-linear program, is the Presburger formula where is the formula (with the same quantifier block as ) that is obtained by replacing each atom in with its encoding defined as follows:

if is a comparison atom;

if is an object atom of the form ;

if is an ordinary numeric atom of the form where is a ground numeric term evaluating to (note that all ordinary numeric atoms in have this form since is semi-ground);

if is a limit atom of the form where ; and

if is a limit atom of the form .
Let be a pseudointerpretation, and let be an assignment of Boolean and integer variables. Then, corresponds to if all of the following conditions hold for all , , , and as specified above, for each integer :

if and only if ;

if and only if ;

if and only if or there exists such that ;

and if and only if .
Note that in Definition A.3 ranges over all integers (which excludes ), is equal to some integer , and is a pseudointerpretation and thus cannot contain both and ; thus, implies .
The key property of the Presburger encoding in [?] is established by the following lemma, which we easily reprove for our variant of the encoding.
Lemma A.4.
Let be a pseudointerpretation and let be a variable assignment such that corresponds to . Then,

if and only if for each ground atom , and

if and only if for each semi-ground, positive rule .
Proof.
Claim 1 follows analogously to the respective argument in [?], except for one extra case, namely , for a limit predicate. The proof of this case is analogous to, but simpler than, the case for where . Claim 2 then follows from Claim 1 as before. ∎
Using Lemma A.4, Kaminski et al. [?] establish the following correspondence between entailment for positive limit-linear programs and validity of Presburger sentences.
Lemma A.5.
For a semi-ground, positive, limit-linear program and a fact, there exists a Presburger sentence that is valid if and only if . Each is a conjunction of possibly negated atoms. Moreover, and each are bounded polynomially by . Number is bounded polynomially by and exponentially by . Finally, the magnitude of each integer in is bounded by the maximal magnitude of an integer in and .
By a more precise analysis of the Presburger formulas in the proof of Lemma A.5, we can sharpen the bounds provided by the lemma as follows, where (resp. , , etc.) stands for the size of the representation of (resp. , , etc.), assuming that all numbers take unit space.
Lemma A.6.
For a semi-ground, positive, limit-linear program and a fact, there exists a Presburger sentence that is valid if and only if . Each is a conjunction of possibly negated atoms. Moreover, is bounded polynomially in and each is bounded polynomially in . Number is bounded polynomially in and exponentially in . Finally, the magnitude of each integer in is bounded by the maximal magnitude of an integer in and .
Analogously to the notion of a countermodel for an interpretation, we call a pseudomodel of a program that does not satisfy a given fact a counterpseudomodel. With Lemma A.5 at hand, Kaminski et al. [?] show the following theorem, which bounds the magnitude of integers in counterpseudomodels for entailment (the proof of the theorem adapts to our setting as is).
Theorem A.7.
For a semi-ground, positive, limit-linear program, a limit dataset, and a fact, if and only if a pseudomodel of exists where , , and the magnitude of each integer in is bounded polynomially in the largest magnitude of an integer in , exponentially in , and double-exponentially in .
Furthermore, the double-exponential bound in can be trivially sharpened to by employing Lemma A.6 in place of Lemma A.5. Building on the proof of Theorem A.7, we next prove the following stronger version, which bounds the size of pseudomaterialisations of semi-ground, positive, limit-linear programs.
Lemma A.8.
Proof.
Let be the maximal magnitude of an integer in , , and . Let be obtained from by removing each fact that does not unify with an atom in and let be a fresh nullary predicate.
Clearly, we have where is the least pseudointerpretation w.r.t. such that for each . Let be obtained from and fact analogously to the construction in the proof of Lemma A.6, but where each disjunct in is replaced by if and by if for some . It is easy to see that every assignment corresponding to is a countermodel of . Therefore, since satisfies the same structural constraints as the formula in Lemma A.6, by an argument analogous to the one in the proof of Theorem A.7, we obtain that has a pseudomodel such that , the magnitude of each integer in is bounded by some number that is polynomial in , exponential in , and double-exponential in , and where it holds that if and only if for each limit predicate and objects . Consequently, we have established that has a pseudomodel that satisfies the required bounds of the lemma. In what follows, we use the fact that to show that also satisfies the bounds of the lemma.
Let us denote with the partial pseudomaterialisation of for any ; hence, . We start with the observation that the value of a number in a limit fact can only increase with respect to during the construction of . For instance, if , with a predicate, and , then . Let , and, for , let be the maximum of

,

the maximal magnitude of a negative integer occurring in a fact in , and

the maximal magnitude of a positive integer occurring in a fact in .
Numbers allow us to bound the integers produced by the immediate consequence operator applied to pseudointerpretation . Specifically, we argue that for each and rule with head for some , we have

if ,

if is a predicate, and

if is a predicate.
To see why this holds, consider a pseudointerpretation obtained from by replacing each IDB fact with , and each IDB fact with . By construction, we have and hence whenever is defined. But since the magnitude of all numbers in is bounded by , by Proposition 3 in [?], has a solution where the maximal magnitude of all numbers is bounded by , and hence the magnitude of the value of for this solution is bounded by (unless the value of is unbounded in , in which case and we are done). The last two subclaims are immediate since