ILP Modulo Data

# ILP Modulo Data

Panagiotis Manolios, Vasilis Papavasileiou, and Mirek Riedewald
This research was supported in part by DARPA under AFRL Cooperative Agreement No. FA8750-10-2-0233 and by NSF grants CCF-1117184 and CCF-1319580. Northeastern University
{pete,vpap,mirek}@ccs.neu.edu
###### Abstract

The vast quantity of data generated and captured every day has led to a pressing need for tools and processes to organize, analyze and interrelate this data. Automated reasoning and optimization tools with inherent support for data could enable advancements in a variety of contexts, from data-backed decision making to data-intensive scientific research. To this end, we introduce a decidable logic aimed at database analysis. Our logic extends quantifier-free Linear Integer Arithmetic with operators from Relational Algebra, like selection and cross product. We provide a scalable decision procedure that is based on the BC() architecture for ILP Modulo Theories. Our decision procedure makes use of database techniques. We also experimentally evaluate our approach, and discuss potential applications.

## I Introduction

In 2010, enterprises and users stored more than 13 exabytes of new data [1]. Database Management Systems (DBMS’s) based on the Relational Model [3] are a key component in the computing infrastructure of virtually any organization. With big data playing a determining role in business and science, we are motivated to rethink data management and analysis.

Database systems capable of symbolic computation could enable powerful new methodologies for strategic planning, decision making, and scientific research. We propose database systems that (a) store symbolic (in addition to concrete) data, and at the same time (b) allow queries of a symbolic nature, e.g., with free variables. Such database systems can be dually thought of as constraint solvers that reason in the presence of data. Symbolic data allows us to encode partially specified or entirely speculative information, e.g., database entries that exist for the purpose of what-if analysis. Symbolic queries enable deductive reasoning about data.

Existing relational query languages (e.g., SQL) only allow concrete data and queries. Symbolic enhancements require a formalism that combines constraints and relational queries. We address this need by introducing the  logic.  extends quantifier-free Linear Integer Arithmetic (QFLIA) with database tables and operators from Relational Algebra, like selection (), union (), and cross product (). While is decidable, the logic in its general form gives rise to hard satisfiability problems, primarily because it allows universal quantification over cross products of big tables. We study unrestricted  (for it is a natural umbrella formalism), but also provide restrictions that enable an efficient decision procedure. In other words, we identify a class of database problems that are a realistic initial target for formal analysis.

We provide a scalable procedure based on the BC() architecture for ILP Modulo Theories (IMT) [10]. Our approach is dubbed ILP Modulo Data, because an ILP solver co-exists with a procedure that establishes a correspondence between integer variables and database tables. The latter contain a mix of concrete and symbolic data. ILP Modulo Data allows us to use a powerful ILP solver based on branch-and-cut (B&C) on the arithmetic side, while also utilizing database techniques that allow us to scale to realistic datasets.

The compositional nature of ILP Modulo Data is well-suited for potential applications. Organizations have access to vast amounts of data, but at the same time rely heavily on Mathematical Programming technology. We enhance Mathematical Programming tools with the ability to directly access data, thus assisting data-backed decision making. Such tools would also benefit scientists in fields ranging from ornithology [16] to astronomy [5], by providing immediate feedback on the consistency between models the scientists devise and datasets of observations they collect. Our paper outlines potential applications, while our experimental evaluation relies on benchmarks that characterize them. We experimentally demonstrate that our ILP Modulo Data framework provides better performance than the approach of eagerly reducing to QFLIA.

Paper Structure: Section II introduces our reasoning paradigm through a motivating example. Section III presents the  logic, while Section IV identifies a  fragment that yields scalable procedures. Section V describes our decision procedure. We experimentally evaluate our approach in Section VI. We provide an overview of related work in Section VII, and conclude with Section VIII.

## Ii Motivating Example

Our motivating example (formalized in Figure 1) concerns the problem of optimally investing a given amount of capital. This is an appropriate application for our techniques, because (a) investments are almost always data-driven as they take historical stock prices into account, and (b) financial institutions already rely on Mathematical Programming.

The problem involves investing in a portfolio of publicly traded stocks, with the goal of maximizing profit while following guidelines that minimize risk. A database provides information on these stocks, including stock prices from the New York Stock Exchange (NYSE). We would like to pick the stocks that would have yielded the highest profit over a period of time in the recent past, e.g., over the preceding year. This optimization problem is subject to risk-mitigation constraints that require us to pick companies from a variety of sectors. While investing in the exact solver-generated portfolio (which relies only on past performance) is not necessarily a good strategy, such a portfolio provides useful information for the analysts who make the final investment decisions.

The data is given in tables and (Figures (a)a and (b)b). Each company in is described by a unique ID (with the associated NYSE symbol parenthesized), its capitalization (small, medium, or large), and its sector (e.g., tech, retail, financials, automotive, energy, emerging-markets). While Figure 1 uses human-readable names, we can encode these fields with bounded integer quantities. Each entry in describes the observed movement of a certain stock in a given timeframe, assuming that dividends were reinvested. For example, the first row describes an increase of 28% in the price of EMC. is an application-specific abstraction, i.e., the actual database contains past stock prices and is a view produced by comparing data for two time periods.

The stock in the portfolio is characterized by a unique ID that corresponds to entries in the dataset, i.e., there exist entries and . To minimize risk, we force the IDs to be distinct, and allow no single sector to account for more than a third of the total capital. Additionally, no more than a fourth of the capital goes to smallcap companies. The objective function maximizes the capital at the end of the period, and thus the profit.

Note that if the amounts are variables, the objective function is non-linear. The problem can be circumvented by providing integer constants for , i.e., by specifying how the capital will be partitioned. With constants for , the non-table constraints are essentially in QFLIA. (The summations for that satisfy conditions like and are easy to encode as sums of if-then-else terms.) Conversely, the problem is essentially satisfiability of an arithmetic instance, where certain variables correspond to database contents. This is the kind of problem that we propose new techniques for. We cannot use a standalone DBMS, since DBMS’s do not handle constraints and optimization. Neither are existing solvers up to the task, since they do not provide ways of managing data.

The constraints we have described are meant to be representative. Clearly investors also have to consider other options, including investing in index funds, bonds, debt securities and derivative contracts. These financial instruments may have other characteristics that need to be modeled. Our constraints are also based on simplifying assumptions, e.g., that we can invest an arbitrary amount in any given stock at any time. It is not within the scope of our paper to model investment problems comprehensively. What matters is that these additional concerns also mix arithmetic with data, thus reinforcing the need for data-aware solving.

## Iii The Logic Δ

This Section introduces the logic . combines arithmetic with queries over tabular data. thus encompasses database problems like our motivating example of Section II.

The grammar of  is given in Figure 2. , , , and are the non-terminal symbols for integer constants, terms, tables, and formulas, respectively. The first line of productions for corresponds to pairs and their accessors; the second line is for variable symbols () and integer expressions. A table (non-terminal symbol ) is either an input table, a selection, a cross-product, or a union. The selection is a table that consists of only those entries in that satisfy , i.e., the variable ranges over the table entries; binds in , but not in . For formulas (non-terminal symbol ), should be read as “ is not empty”. All other constructs bear the obvious meaning. We assume that all variables not bound by are integer. We will freely use derived operators, e.g., conjunction and integer equality.

is typed. Each term is either of type int or of type , where and are types. and are only permissible when applied to a term of type for some type and some type ; if is of type , then is of type and is of type . The integer constants are of type int. The arithmetic operators (, , and ) only apply to terms of type int; and produce integers. Each table has a schema, which is the type of its entries. (Schemas are the table-level counterpart of types.) An input table is comprised of entries of the same type. If table has schema and table has schema , then has schema . For to be properly typed, should be a properly-typed formula under the assumption that the type of is the schema of ; the schema of is the same as the schema of . Union expects tables of the same schema and preserves it.

Clearly, is at least as powerful as QFLIA. At the same time, encompasses most features one would expect from a relational query language. We have left out certain operators usually present in query languages. First, note that projection () would not provide additional power, since it is possible to refer to any subset of the columns, without producing an intermediate table that leaves out the irrelevant ones. Also, the set difference can be encoded as , assuming that the schema of and has exactly one column; otherwise, in place of we would have a conjunction of equalities over all columns. Additionally, can express many forms of aggregation, including (when compared to a constant), , and .

###### Example 1.

The portfolio encoded by Figure 1 can be represented as the input table

 portfolio={(1,(x1,a1)),…,(n,(xn,an))}.

contains symbolic data, something which is not allowed by DBMS’s. The first column ensures that the entries are distinct, irrespective of the assignment. is of schema . Consider the following constraint:

The constraint states that there are no entries and in such that and , i.e., references distinct stocks (as was our intention in Figure 1). The constraint essentially involves universal quantification over .

### Iii-a Decidability

satisfiability can be reduced to QFLIA satisfiability. We explain the reduction briefly. We represent a table expression of schema as a set consisting of pairs , where is a term of type and is a QFLIA formula, with the intended meaning that is present in the table iff is true. We use the operator to distinguish the auxiliary pairs used for the reduction from the ones allowed by the syntax of . For a formula , denotes the corresponding formula in QFLIA; similarly for integer terms. stands for substituting with in , with appropriate care for occurrences of the symbol bound by inside . We define for tables and formulas below as two mutually recursive functions.

 \llbracket{r1,…,rn}\rrbracket= {r1⊘true,…,rn⊘true} (1) \llbracket⟨σx:F:D⟩\rrbracket= {r⊘(b∧\llbracketF[x/r]\rrbracket) | r⊘b∈\llbracketD\rrbracket} \llbracketD1×D2\rrbracket= {(r1,r2)⊘(b1∧b2) | r1⊘b1∈\llbracketD1\rrbracket,r2⊘b2∈\llbracketD2\rrbracket} \llbracketD1∪D2\rrbracket= \llbracketD1\rrbracket∪\llbracketD2\rrbracket
 \llbracketT1≤T2\rrbracket= \llbracketT1\rrbracket≤\llbracketT2\rrbracket (2) ⋁r⊘b∈\llbracketD\rrbracketb \llbracket¬F\rrbracket= ¬\llbracketF\rrbracket \llbracketF1∨F2\rrbracket= \llbracketF1\rrbracket∨\llbracketF2\rrbracket

For encoding integer terms as QFLIA terms (e.g., in Equation 2), all that needs to be done is elimination of pair constructors and accessors via the rules and . The reduction suffices to establish decidability of . The reduction also provides formal semantics for  by specifying its meaning in terms of QFLIA.

### Iii-B Complexity

###### Theorem 1.

The satisfiability problem for is in NEXPTIME.

###### Proof Sketch.

The reduction to QFLIA (Equations 1 and 2) produces a formula exponentially larger than the input. Since QFLIA is in NP, the reduction provides a non-deterministic exponential time procedure for -satisfiability. ∎

###### Theorem 2.

The satisfiability problem for is PSPACE-hard.

###### Proof Sketch.

We reduce the (PSPACE-complete) QBF problem to  satisfiability in polynomial time. We deal with Boolean quantification by quantifying over the input table . For example, the formula becomes

Complexity analysis of  beyond Theorems 1 and 2 is not within the scope of this paper, and has mostly theoretical significance. In practice, query size is orders of magnitude smaller than data size. Conversely, it is meaningful to study data complexity [18], i.e., complexity where only the amount of data varies. Instead of assuming a query of constant size, we provide a stronger result by limiting the number of tables that can participate in a cross product. (We also limit nested quantifiers, because the latter can simulate cross products.) We define below the function that characterizes this number.

 rank({r1,…,rn})= 1 (3) rank(⟨σx:F:D⟩)= rank(F)+rank(D) rank(D1×D2)= rank(D1)+rank(D2) rank(D1∪D2)= max(rank(D1),rank(D2))
 rank(T1≤T2)= 0 (4) rank(D) rank(¬F)= rank(F) rank(F1∨F2)= max(rank(F1),rank(F2))
###### Definition 1 (k-Δ).

For any natural number , is the set of formulas .

###### Theorem 3.

For any natural number , is NP-complete.

###### Proof Sketch.

is NP-hard , because any QFLIA formula can be reduced to a formula in polynomial time (). We obtain membership in NP from the reduction defined by Equation 2, which produces polynomially-sized QFLIA formulas. ∎

Given the class of formulas for some , the reduction produces QFLIA formulas of size , where is the input size. While the reduction is polynomial (since is fixed), it may not be practical even for , given that datasets of millions of entries are common. Conversely, we propose restrictions that yield a lazy solving architecture.

## Iv The Existential Fragment of Δ

We proceed to study the existential fragment of , which we denote by .

###### Definition 2 (∃Δ).

A  formula belongs to if the operator always appears below an even number of negations, i.e., only appears with positive polarity.

The motivation for studying is as follows. Universal quantification pushes for an approach similar to quantifier instantiation, e.g., Example 1 (which is not in ) inherently requires instantiating a constraint for every element in . This can be done incrementally by applying patterns that are standard in verification tools. In contrast, we are not aware of techniques that would be a good match for the kind of existential quantification that arises in . Therefore, the rest of this paper focuses on .

Formulas in  can be transformed into formulas in a convenient intermediate logic without cross products, selections, or unions. We rephrase in terms of a new membership operator. Each formula of the form is viewed as , where has the obvious semantics and is a properly shaped row comprised of fresh integer variables. We will refer to rows like that serve as witnesses for as witness rows. The next step is to translate membership in arbitrary table expressions to membership in input tables. becomes , while becomes . Finally, becomes . We eliminate all cross products, selections, and unions by repeated application of the above transformations.

###### Example 2.

The tables of Figures (a)a and (b)b can be easily encoded as  input tables of schemas and . Let small capitalization be represented by the constant . Consider the following constraint:

The constraint asserts the existence of some tuple that satisfies . (We have eliminated the accessors and .) This is equivalent to asserting that .

The procedure we outlined produces a decomposed formula consisting of a QFLIA part and membership constraints. We proceed to define these notions formally.

###### Definition 3 ((Conditional) Membership Constraint).

A membership constraint is a constraint of the form

 (x1,…,xk)∈{(y1,1,…,y1,k),…,(yl,1,…,yl,k)} (5)

for positive integers and and variable symbols , . A constraint of the form , where is a variable symbol and is a membership constraint, is called a conditional membership constraint.

A membership constraint may hold conditionally, either because it arises from an -atom that appears under propositional structure (and therefore holds conditionally), or because of a disjunction introduced by the union operator. We use conditions of the form because ILP necessitates -bounded integer variables in place of Boolean variables. Implication in the opposite direction is never needed, since always appears with positive polarity (as per Definition 2).

Membership constraints do not contain arbitrary arithmetic expressions, but only variable symbols. “Variable abstraction” [9] eliminates richer expressions. While variable abstraction allows for compositional reasoning and helps with theoretical analysis, a limited fragment of arithmetic in membership constraints yields more efficient implementation. Part of our discussion will involve tables that contain integer constants and terms of the form , where is a variable symbol and is an integer constant. (Everything we present is easy to generalize for such terms.) For convenience, we flatten out rows constructed using the pair constructor of Figure 2, and instead deal with -tuples of integers. This is only a matter of presentation and has no impact on the algorithms.

###### Definition 4.

A decomposed formula is a conjunction , where (a) is a QFLIA formula and (b) is a conjunction of possibly conditional membership constraints.

###### Theorem 4.

satisfiability is NP-complete.

###### Proof.

satisfiability is NP-hard, because is at least as powerful as QFLIA.  satisfiability is in NP, because we can reduce to QFLIA in polynomial time. The reduction first produces a formula in decomposed form (Definition 4). Equation 5 is equivalent to ; therefore, the membership operator can be eliminated. The result is a formula in QFLIA. ∎

The polynomial size of the reduction relies on the fact that  does not allow tables to be named and referenced from multiple places, i.e., table expressions are not DAG-shaped. Despite the polynomial reduction, a lazy scheme remains relevant. The reason is that QFLIA solvers are not meant for long disjunctions that essentially encode database tables.

## V Bc(T) for Δ

The decomposed form of Definition 4 is particularly suited for a scheme that combines separate procedures for QFLIA and table membership. Given that the QFLIA part can be encoded as a conjunction of integer linear constraints [10], it becomes possible to solve instances in decomposed form (and by extension  instances) by instantiating the BC() framework for IMT [10]. An ILP solver deals with the QFLIA constraints, and exchanges information with a procedure that checks membership in finite sets. Since database queries typically have simple propositional structure, we do not expect encoding the latter with linear constraints to be a bottleneck.

The membership procedure is confronted with a conjunction of membership constraints (Definition 3). Dealing with conditional constraints is essentially a matter of Boolean search. The membership procedure needs to understand equality atoms, equality being a primitive. (Our setting is standard first-order logic with equality.) In particular, the procedure keeps track of truth assignments to the equalities in:

 {xi=yj,i | j∈[1,l],i∈[1,k]} (6)

The symbols and have the same meaning as in Definition 3. In the presence of multiple membership constraints, the union of sets, like in Equation 6, is relevant. Given that membership constraints can be checked in isolation, our discussion proceeds with a single constraint. The variables and also appear in linear constraints. It simplifies our design to assume that all of them appear in ILP, even if they are unconstrained there. The BC() framework provides a mechanism (“difference constraints” [10]) for notifying background procedures about atoms like the ones in Equation 6. Given truth values for these atoms, we check that a membership constraint is satisfied by simply traversing the table and looking for a tuple that is column-wise equal to the witness row. The constraint is violated if for every , there exists some such that , i.e., there is no candidate tuple.

The arithmetic and membership parts share variables. It is vital that we systematically explore the space of (dis)equalities between these variables. This exchange of information resembles the non-deterministic Nelson-Oppen scheme (NO) for combining decision procedures [14]. We demonstrate that NO can accommodate membership constraints.

###### Definition 5 (Arrangement).

Let be an equivalence relation over a set of variables . The set

 α(V,E)={x=y | xEy} ∪ {x≠y | x,y∈V and not xEy}

is the arrangement of induced by .

###### Definition 6 (Stably-Infinite Theory).

A -theory is called stably-infinite if for every -satisfiable quantifier-free -formula there exists an interpretation satisfying whose domain is infinite.

###### Fact 1 (Nelson-Oppen for Stably-Infinite Theories [14, 9]).

Let be a stably-infinite -theory, for , and let . Also, let be a conjunction of -literals. is -satisfiable iff there exists an equivalence relation of the variables shared by and such that is -satisfiable, for .

###### Lemma 1 (Nelson-Oppen with Propositional Structure).

Let be a stably-infinite -theory, for , and let . Also, let be a quantifier-free -formula. is -satisfiable iff there exists an equivalence relation of the variables shared by and such that is -satisfiable, for .

###### Proof.
• If is -satisfiable, there exists a first-order model that satisfies it. The way interprets the variables in gives rise to an equivalence relation over such that satisfies , .

• If there exists an equivalence relation over such that is -satisfiable, then there exists a model that -satisfies , . Let

 Γi= {t | t is an atom in Φi,Li⊨t}  ∪ {¬t | t is an atom in Φi,Li⊨¬t}, i=1,2.

is -satisfiable, . By Fact 1, is -satisfiable. But . Therefore, is -satisfiable.

###### Lemma 2 (Nelson-Oppen for Membership Constraints).

Let be a stably-infinite -theory. Also, let be a conjunction of -literals, and be a conjunction of possibly negated membership constraints. is -satisfiable iff there exists an equivalence relation of the variables shared by and such that is -satisfiable and is satisfiable.

###### Proof.

Membership constraints can be viewed as disjunctions of conjunctions (Proof of Theorem 4) in which no function, predicate, and constant symbols appear, i.e., in the empty signature. The theory pertaining the membership constraints is the empty theory (), since no axioms are needed. is trivially stably-infinite. Our proof obligation follows by applying Lemma 1 with and . ∎

Note that Lemma 2 allows negated membership constraints. While the latter do not pose algorithmic difficulties, our discussion is limited to the positive occurrences needed for . The statement of Lemma 2 is structurally similar to that of Fact 1, with membership constraints replacing the constraints of some participating stably-infinite theory. It follows that a membership procedure can participate in NO as a black box, much like a theory solver, even though we have not formalized membership constraints by means of a theory. We can thus combine a form of set reasoning with any stably-infinite theory.

BC() guarantees completeness for the combination of ILP with a stably-infinite theory [10] by ensuring that the branching strategy explores all possible arrangements. We established that membership can be used much like a stably-infinite theory. All that is needed for completeness is a membership procedure capable of checking consistency of its constraints conjoined with a given arrangement (that contains all literals of Equation 6). As we have seen, this operation is simple and involves no arithmetic. In pursuit of efficiency, we proceed to describe branching and propagation techniques based on table contents. Meaningful branching and propagation involve the integer bounds of variables, i.e., necessitate limited arithmetic reasoning on the membership side.

### V-a Propagation

B&C-based ILP solvers keep track of variable lower and upper bounds, and heavily rely on bounds propagation algorithms. We describe how to enhance such propagation to exploit the structure of membership constraints.

We denote by an the current lower and upper bounds on variable . (respectively ) is either an integer constant, or (resp. ) if no bound is known. We use the notation and for bounds on that the membership procedure infers. We proceed with a membership constraint as per Definition 3. Let ; similarly, we denote by the tuple . Let be true if and only if for all , the sets and intersect.

 lb′(xi)=max(lb(xi), min{lb(yj,i) | j∈[1,l],match(x,yj)}) (7) ub′(xi)=min(ub(xi), max{ub(yj,i) | j∈[1,l],match(x,yj)}) (8)

We over-approximate the values of the variables by considering all candidate entries (inner and ). The outer and guarantee that we do not weaken bounds. If there exists exactly one value such that , it is sound to deduce the equalities , for all . If there is no candidate entry, inconsistency is reported.

###### Example 3 (Interleaved Propagation).

Consider the decomposed formula . The formula corresponds to a query over concrete tuples that any DBMS can evaluate in linear time. It is thus vital that our techniques yield acceptable performance. Equations 7 and 8 bound to and y to . Given the equality , ILP propagation deduces that , since is the intersection of permissible ranges for and . The membership procedure detects that now only holds for , and fixes to and to . The ILP solver in turn deduces unsatisfiability, since is violated. No branching was needed. Encoding the formula in QFLIA would hide its structure, leading to search. The example generalizes to other lengths and bounded symbolic data.

### V-B Branching and Arrangement Search

It follows from Lemma 2 that a branching strategy which exhaustively explores all possible arrangements of the shared variables guarantees completeness. To achieve better performance, we have to branch with the tabular structure of databases in mind, without overlooking symbolic data.

Figure 3 provides an example. The root node (Node ) describes a single membership constraint, which we assume to be part of a larger decomposed formula. We maintain integer constants in the table, instead of performing variable abstraction which would introduce auxiliary variables for them. According to Equation 6, the membership procedure needs truth assignments for the equalities in It would not be wise for the search strategy to overlook that this set originates from a table containing numbers, and treat the set members as if they were atomic propositions unrelated to each other.

In our example, branching on the condition produces two subproblems. Node 1 shows only the tuples that still apply under the condition , i.e., the ones that still satisfy the predicate ; similarly for Node 2. is a choice informed by the tabular structure. Since as the value of the first column is close to the “middle” of the table, branching on rules out approximately half of the candidates. is present in both subproblems (Nodes 1 and 2). Branching based on constant bounds is therefore not enough, for we will possibly have to deal with symbolic tuples. Figure 3 demonstrates further branching on to determine whether is a suitable witness for the membership constraint.

The example demonstrates the dual nature of the search strategy needed. The problem naturally pushes towards branch-and-bound (which is a restriction of B&C), e.g., branching on is meaningful. It remains necessary to also branch on equalities between shared variables (e.g., ), just like in any practical implementation of NO. (To be precise, in ILP we would have two separate nodes for and in place of .) Implementing NO with B&C enables both kinds of branching.

Branching is organically tied to propagation. Initially (Node ), assuming no previously known bounds for , the table contents only allow us to bound to the range ; if is unbounded, remains unbounded. The decisions and (i.e., Node ) tighten to . We also obtain the range for , i.e., branching on some column potentially leads to propagation across other columns.

### V-C Discussion

The analysis of this Section indicates that  formulas can be decomposed in such a way that a procedure for table lookup assumes part of the workload. BC() is particularly suited for implementing such a combination. BC() can easily accommodate data-aware propagation (Section V-A) and branching (Section V-B). Our techniques would be harder to implement within a DPLL()-style solver [15], given that the toplevel search of DPLL() is over the Booleans (and not the integers). A DPLL()-based implementation of our techniques would essentially require integrating branch-and-bound in DPLL(), which is beyond the scope of our work.

The table lookup procedure can be thought of as a small database engine within the solver. The employed database engine can be an actual DBMS, storing the concrete part of tables and possibly bounds on symbolic fields. A DBMS would provide multiple opportunities for improvements. Equations 7 and 8 essentially describe database aggregation, and thus provide a starting point for the kinds of queries that apply. DBMS queries can be over multiple tables at a time, and can involve conditions other than bounds. As a matter of fact, the predicate of Equations 7 and 8 can be strengthened with any condition on the data that follows from the formula (e.g., in Example 3), thus computing tighter bounds. Different kinds of database optimizations apply, e.g., materializing queries for better incremental behavior and smarter indexing based on user input.

(and its decomposed form) formally characterizes a relevant class of problems that can be solved by a compositional scheme which employs a database engine. Our scheme may actually apply to a superset of .

## Vi Applications and Experiments

We have implemented support for databases on top of the Inez constraint solver. Inez is our implementation of the BC() architecture for IMT on top of the SCIP (M)ILP solver [2]. We refer to the version of Inez that provides database extensions as InezDB. InezDB supports existential database constraints by means of the BC()-based combination described in Section V, but also universal quantification by eager instantiation. InezDB (like Inez) additionally supports objective functions.

We have produced a collection of InezDB input files that have the structure we expect in applications. Our benchmark suite is publicly available and can be used as a starting point towards a richer benchmark suite of problems that involve data and constraints. We provide a brief overview of the application areas that inspire our benchmarks.

### Vi-a How-To Analysis

Research in the general direction of reverse data management [11] proposes ways of obtaining the desired results out of a database query. We outline this class of problems through an example, which gives rise to some of our benchmarks.

###### Example 4 (emp_join.ml).

The management of a company is surprised to find out that (according to the corporate database) there is no employee younger than 30 whose yearly income exceeds $60000. Why not is not obvious, since income is a complicated function of multiple quantities including a base salary, benefits based on age, employee level (junior, middle, or senior), and bonuses. The management consults the database administrator on how to [12] ameliorate the seeming injustice. Together, they explore bonuses that would allow young employees to exceed the$60000 limit. This amounts to synthesizing tuples for the table of bonuses. An alternative is to adjust various parameters in the income computation, i.e., to modify the query instead of the data [17]. This can be done by replacing constants with variables, and letting the solver come up with suitable values.

### Vi-B Test-Case Generation

Test case generation is relevant for databases [19]. A family of benchmarks in our collection demonstrate test data generation by concretizing tables initially containing symbolic data.

###### Example 5 (emp_keys.ml).

The problem involves two tables, named and . has an ID column constrained to reference existing entries in , i.e., there is a foreign key constraint. contains thousands of tuples with symbolic IDs. A satisfying assignment corresponds to a generated database that meets the foreign key constraint, thus serving as meaningful test input.

### Vi-C Scientific Applications

Studying big datasets is a key aspect of scientific research in fields ranging from ornithology [16] to astronomy [5]. To demonstrate the applicability of our techniques, we provide benchmarks inspired by queries that ornithologists perform.

###### Example 6 (birds_box.ml).

An ornithologist wants to see a rare species in person, but has not decided on a good location. She has access to a database of observations. Each observation describes a bird and the geographic coordinates where it was seen. An area can be described as a symbolic rectangle . Our techniques allow the ornithologist to simply ask for observations of the species of interest that lie in . The query effectively concretizes .

### Vi-D Portfolio Management

We experimented with the portfolio optimization example of Section II. Our exact instance (portfolio.ml) encodes a more complex variant of the formalization in Section II. An additional table contains stock dividends; dividends are taken into account in the objective function. We tried a range of parameters with a timeout of one hour, and obtained a range of solutions. Notably, picking an optimal portfolio of 5 out of 50 stocks took 161 seconds; 5 out of 4000 stocks took 1510 seconds; and 6 out of 2000 stocks took 1172 seconds. Such table sizes are realistic, given that NYSE lists approximately 2800 companies.

### Vi-E Overview of Results

We compare InezDB against an Inez frontend that solves formulas by eagerly translating them to QFLIA via the encoding of Theorem 4. Inez in turn solves QFLIA formulas by reducing them to constraints that SCIP understands. (These constraints are not strictly ILP, since we utilize specialized constraint handlers [2].) We refer to this configuration simply as Inez, since the only addition to Inez is a new frontend. We also produce SMT-LIB versions of our QFLIA formulas, and run them against the latest available version of Z3 (4.3.1).

We provide 8 benchmark generators that allow different modes of operation (e.g., some of them are able to produce both satisfiable and unsatisfiable benchmarks), and are able to output benchmarks with different table sizes. Our input table sizes range from 60 tuples to 640000 tuples. In total, our parameters give rise to 166 benchmarks. We run all three solvers with a timeout of 1800 seconds and a memory limit of 12GB on a machine that provides 2 Intel Xeon X5677 CPUs of 4 cores each and 96GB of RAM. Figure 4 visualizes our experiments. Inez solves 25 satisfiable and 47 unsatisfiable benchmarks. InezDB solves 74 satisfiable and 81 unsatisfiable benchmarks. Finally, Z3 solves 57 satisfiable and 58 unsatisfiable benchmarks. Among the failures for Inez (resp. Z3), 37 (resp. 27) are due to the memory limit. InezDB runs out of memory only once. If we turn off the memory limits, the total numbers of failures don’t change much.

Figure (a)a indicates that InezDB outperforms Inez by a significant margin. This margin can be attributed to two factors. First, InezDB exploits the structure of database problems (e.g., for branching and propagation), while Inez has no knowledge of this structure. Second, our reduction to QFLIA (in the case of Inez) produces patterns that SCIP is not optimized for, since the latter is designed for MILP and not for QFLIA.

Figure (b)b compares Inez against a leading solver for QFLIA (Z3), and thus characterizes the tool’s performance in absolute terms. There is a cluster of 40 benchmarks for which InezDB is 2-8 times faster than Z3. (Note that the scale is logarithmic.) InezDB is at least 8 times faster for 31 of the benchmarks that both tools solve, and solves many benchmarks for which Z3 times out. All failures for InezDB are failures for Z3. Z3 outperforms InezDB for only 7 out of the 166 benchmarks, none of which take InezDB more than 4 seconds to solve.

We conclude the evaluation by pointing out that there is significant room for improvement in InezDB. As is the case with almost every first implementation of a new decision procedure, there is room for improvement, e.g., InezDB can benefit from better preprocessing and more sophisticated branching. InezDB can also be improved by adopting database techniques (as we outlined in Section V), or by integrating a DBMS. Our promising experimental results even without such optimizations constitute sufficient evidence that ILP Modulo Data is a viable design for data-enabled reasoning tools.

## Vii Related Work

The Constraint Database framework [6] provides a database perspective on constraint solving. The framework encompasses relations described by means of constraints, but not relations comprised of concrete tuples.

“Table constraints” [8, 4], as studied in Constraint Programming, resemble our membership constraints. Such tables are not meant as database tables. Our work differs in significant ways, e.g., our setup allows symbolic table contents. Also, the algorithms presented for table constraints rely on table contents from small domains (i.e., not the reals or the integers). This aligns with the overall emphasis of Constraint Programming, but conflicts with our intended applications.

Veanes et al. describe the Qex technique and tool that uses Z3 to generate tests for SQL queries [19]. Qex essentially encodes the relational operators via axioms, which are later instantiated via E-matching [13]. E-matching is a generic scheme that is not optimized in any way for database problems. Qex is geared towards relatively small tables that suffice as test cases, while our target applications involve bigger tables.

Other approaches tackle constraints arising in database applications with off-the-shelf generic solvers (via eager reductions). Notably, Khalek et al. use Alloy [7], while Meliou and Suciou use MILP [12]. In neither of these approaches does the core of the solver exploit the structure of database instances, e.g., for branching or propagation.

## Viii Conclusions and Future Work

We introduced the ILP Modulo Data framework for marrying data with symbolic reasoning. To that end, we introduced the decidable logic . We identified a fragment of  that can be solved efficiently by instantiating the BC() architecture. We developed a solver for , and evaluated this solver on a set of benchmarks that we made publicly available.

There are many interesting research directions to be explored in future work, including: (a) the design and implementation of solvers that include an actual DBMS, (b) efficiently handling universal quantification over big tables, say by partitioning input tables and using parallelization, (c) extending our techniques to allow mixed integer, real arithmetic, and other first-order theories, and (d) solving interesting business and scientific applications using the ILP Modulo Data framework.

## References

• [1] Challenges and Opportunities with Big Data, 2012. Computing Community Consortium White Paper.
• [2] Tobias Achterberg. Constraint Integer Programming. PhD thesis, Technische Universitat Berlin, 2007.
• [3] Edgar Codd. A Relational Model of Data for Large Shared Data Banks. CACM, 13(6):377–387, 1970.
• [4] Ian Gent, Chris Jefferson, Ian Miguel, and Peter Nightingale. Data Structures for Generalised Arc Consistency for Extensional Constraints. In AAAI, 2007.
• [5] Jim Gray, Alex Szalay, Ani Thakar, Peter Kunszt, Christopher Stoughton, Don Slutz, and Jan vandenBerg. Data Mining the SDSS SkyServer Database. arXiv preprint cs/0202014, 2002.
• [6] Paris Kanellakis, Gabriel Kuper, and Peter Revesz. Constraint Query Languages (Preliminary Report). In PODS, 1990.
• [7] Shadi Abdul Khalek, Bassem Elkarablieh, Yai Laleye, and Sarfraz Khurshid. Query-Aware Test Generation Using a Relational Constraint Solver. In ASE, 2008.
• [8] Christophe Lecoutre and Radoslaw Szymanek. Generalized Arc Consistency for Positive Table Constraints. In CP, 2006.
• [9] Zohar Manna and Calogero Zarba. Combining Decision Procedures. In 10th Anniversary Colloquium of UNU/IIST, 2002.
• [10] Panagiotis Manolios and Vasilis Papavasileiou. ILP Modulo Theories. In CAV, 2013.
• [11] Alexandra Meliou, Wolfgang Gatterbauer, and Dan Suciu. Reverse Data Management. In VLDB, 2011.
• [12] Alexandra Meliou and Dan Suciu. Tiresias: The Database Oracle for How-To Queries. In SIGMOD, 2012.
• [13] Leonardo De Moura and Nikolaj Bjorner. Efficient E-matching for SMT solvers. In CADE-21, 2007.
• [14] Greg Nelson and Derek C. Oppen. Simplification by Cooperating Decision Procedures. TOPLAS, 1:245–257, 1979.
• [15] Robert Nieuwenhuis, Albert Oliveras, and Cesare Tinelli. Solving SAT and SAT Modulo Theories: From an abstract Davis–Putnam–Logemann–Loveland procedure to DPLL(T). JACM, 53(6):937–977, 2006.
• [16] Daria Sorokina, Rich Caruana, Mirek Riedewald, Wesley Hochachka, and Steve Kelling. Detecting and Interpreting Variable Interactions in Observational Ornithology Data. In DDDM, pages 64–69. IEEE, 2009.
• [17] Quoc Trung Tran and Chee-Yong Chan. How to ConQueR Why-Not Questions. In SIGMOD, 2010.
• [18] Moshe Vardi. The Complexity of Relational Query Languages. In STOC, 1982.
• [19] Margus Veanes, Nikolai Tillmann, and Peli de Halleux. Qex: Symbolic SQL Query Explorer. In LPAR-16, 2010.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters