Effective and Complete Discovery of Order Dependencies
via Setbased Axiomatization
Abstract
Integrity constraints (ICs) provide a valuable tool for expressing and enforcing application semantics. However, formulating constraints manually requires domain expertise, is prone to human errors, and may be excessively time consuming, especially on large datasets. Hence, proposals for automatic discovery have been made for some classes of ICs, such as functional dependencies (FDs), and recently, order dependencies (ODs). ODs properly subsume FDs, as they can additionally express business rules involving order; e.g., an employee never has a higher salary while paying lower taxes compared with another employee.
We address the limitations of prior work on OD discovery which has factorial complexity in the number of attributes, is incomplete (i.e., it does not discover valid ODs that cannot be inferred from the ones found) and is not concise (i.e., it can result in “redundant” discovery and overly large discovery sets). We improve significantly on complexity, offer completeness, and define a compact canonical form. This is based on a novel polynomial mapping to a canonical form for ODs, and a sound and complete set of axioms (inference rules) for canonical ODs. This allows us to develop an efficient setcontainment, latticedriven OD discovery algorithm that uses the inference rules to prune the search space. Our algorithm has exponential worstcase time complexity, , in the number of attributes and linear complexity in the number of tuples. We prove that it produces a complete, minimal set of ODs (i.e., minimal with regards to the canonical representation). Finally, using real and synthetic datasets, we experimentally show ordersofmagnitude performance improvements over the current stateoftheart algorithm and demonstrate effectiveness of our techniques.
Effective and Complete Discovery of Order Dependencies
via Setbased Axiomatization
Jaroslaw Szlichta, Parke Godfrey, Lukasz Golab, Mehdi Kargar, Divesh Srivastava University of Ontario Institute of Technology, Canada 
York University, Canada 
University of Waterloo, Canada 
University of Windsor, Canada 
AT&T LabsResearch, USA 
jaroslaw.szlichta@uoit.ca, godfrey@yorku.ca, lgolab@uwaterloo.ca, mkargar@uwindsor.ca, divesh@research.att.com 
\@float
copyrightbox[b]
\end@floatWith the interest in data analytics at an alltime high, data quality and query optimization are being revisited to address the scale and complexity of modern dataintensive applications. Real data suffer from inconsistencies, duplicates and missing values [?, ?, ?]. A recent Gartner Research Report study [?] revealed that, by 2017, one third of the largest global companies will experience data quality crises due to their inability to trust and govern their enterprise information. Deep analytics on large data warehouses, spanning thousands of lines of SQL code, are no longer restricted to welltuned, canned batch reports [?]. Instead, complex adhoc queries are increasingly required for business operations to make timely datadriven decisions. Without clean data and effective query optimization, organizations will not be able to stay competitive and to take advantage of new datadriven opportunities.
Integrity constraints (ICs), which specify the intended semantics and attribute relationships, are commonly used to characterize data quality and to optimize business queries. Prior work has focused on functional dependencies (FDs) and extensions thereof, such as conditional FDs [?]. FDs cannot capture relationships among ordered attributes, however, such as between timestamps and numbers, which are common in business data [?]. For example, consider Table Effective and Complete Discovery of Order Dependencies via Setbased Axiomatization, which shows employee tax records in which tax is calculated as a percentage (perc) of salary (sal). Both tax and percentage increase with salary.
We study order dependencies (ODs) [?, ?], which can naturally express such semantics. The OD salary orders holds: if we sort the table by salary, we know it is also sorted by ; i.e., sorted by tax, with ties broken by percentage. Similarly, the OD salary orders holds in Table Effective and Complete Discovery of Order Dependencies via Setbased Axiomatization. The class of ODs subsumes the class of FDs [?]. For example, if salary orders tax, percentage, then the FD that salary functionally determines tax, percentage must also hold.
#  ID  yr  posit  bin  sal  perc  tax  grp  subg 
t1  10  16  secr  1  5K  20%  1K  A  III 
t2  11  16  mngr  2  8K  25%  2K  C  II 
t3  12  16  direct  3  10K  30%  3K  D  I 
t4  10  15  secr  1  4.5K  20%  0.9K  A  III 
t5  11  15  mngr  2  6K  25%  1.5K  C  I 
t6  12  15  direct  3  8K  25%  2K  C  II 
The additional expressiveness of ODs makes them particularly suitable for improving data quality, where ODs can describe intended semantics and business rules; and their violations point out possible data errors. ODs enable new optimization techniques for online analytical processing (OLAP) queries: eliminating expensive joins from query plans in data warehouse environments [?]; processing queries with orderby, groupby, distinct, and partition statements [?, ?]; and improving the performance of queries with SQL functions and algebraic expressions (e.g., date orders year(date) and date orders year(date)*100 + month(date)) [?, ?].
We illustrate some possible optimizations using Query 1 over a data warehouse schema such as the TPCDS^{1}^{1}1http://tpc.org/tpcds/ benchmark. Financial year is represented by d_year, d_quarter and d_month, and is stored in the table. The values of quarter divide the financial year into four threemonth periods. The fact table web_sales has a foreign key d_date_sk that references the dimension table date_dim, which is a surrogate key, a sequential number, assigned with increasing values as data is added. Assume date_dim has an index on d_year and d_month. Since d_month functionally determines d_quarter, the groupby on d_year, d_quarter, d_month is semantically equivalent to that on d_year, d_month. In fact, some optimizers would eliminate fiscal quarter via the relevant FDs [?]. However, this FD cannot simplify the orderby statement to match the given index. For this, we would need an OD d_month orders d_quarter [?, ?].
In a data warehouse, most queries reference fact tables which include foreign keys to dimension tables. These foreign keys are often surrogate keys. However, queries usually reference other dimension attributes, not surrogate keys; e.g., the “between” predicate on d_year in Query 1. This predicate may require a potentially expensive join between the fact table and the date dimension table to identify all surrogate key values falling between year 2012 and year 2016. However, if we know that d_date_sk orders d_year, then it suffices to make two probes into the date dimension table: one to find the minimum relevant d_date_sk value corresponding to January 1 2012; and one to find the maximum relevant d_date_sk value corresponding to December 31 2016. This allows us to restate the “between” predicate in terms of these two surrogate key values, rather than by years, therefore eliminating the join [?].
With the help of ODs, query optimizers can eliminate costly operators such as joins and sorts, and can identify additional interesting orders: ordered streams between query operators that exploit the available indices, enable pipelining, and eliminate intermediate sorts and partitioning steps [?]. Sorting and interesting orders are integral parts of relational query optimizers, not only for SQL orderby and groupby, but for instance also for sortmerge joins [?] and implementing indexes [?].
Date and time are richly supported in the SQL standard and frequently appear in decision support queries over historical data. For example, the widelyused TPCDS benchmark contains 99 queries, of which 85 involve date operators and predicates and five involve time operators and predicates. If the concept of ODs were only applicable to date and time, they would already be very beneficial. As seen in Table Effective and Complete Discovery of Order Dependencies via Setbased Axiomatization, ODs are also useful in other domains arising from business semantics over many types of measures such as sales, flight schedules, stock prices, salaries and taxes [?, ?, ?].
To use ODs for data cleaning and query optimization as outlined above, we need to know which ODs hold on a given dataset. While ODs can be obtained manually through consultation with domain experts, this is known to be an expensive, time consuming, and errorprone process that requires expertise in the data dependency language [?]. The problem we study in this paper is how to discover automatically ODs from data. Such automatically discovered ODs can be manually validated by domain experts, which is an easier task than manual specification.
This problem has been studied before, but is not well understood. Our aim is to provide a deeper understanding of OD discovery. An OD discovery algorithm was recently proposed by Langer and Naumann [?], which has a factorial worstcase time complexity in the number of attributes. (This is the only prior OD discovery work of which we are aware.) In contrast to FDs, ODs are naturally expressed with lists rather than sets of attributes. The first sound and complete axiomatization for ODs is expressed in a list notation [?]; Langer and Naumann use this. For instance, salary orders is different from salary orders whereas the two analogous FDs are equivalent.
An insight we present is that ODs can be expressed with sets of attributes via a polynomial mapping into a setbased canonical form. The mapping allows us to design a fast and effective OD discovery algorithm that has exponential worstcase complexity, , in the number of attributes (and linear complexity in the number of tuples). This complexity is similar to previous FD and inclusion dependency discovery algorithms such as TANE [?]. In one of our experimental runs, using a flight dataset with 1000 tuples and 20 attributes, our algorithm takes under one second whereas the algorithm from [?] does not terminate after five hours (Section LABEL:sec:experiments).
We also develop sound and complete setbased axioms (inference rules) for ODs that enable pruning of the search space, which can alleviate the worstcase complexity in practice. Our pruning rules do not affect the completeness of the discovered ODs. The listbased algorithm in [?] uses aggressive pruning rules to overcome their factorial complexity, which we show leads to incompleteness (counter to the claim of complete discovery made in [?]). They can miss important ODs. For example, the subclass of ODs in which the same attribute(s) are repeated in the lefthandside and the righthandside such as yr orders is missed (see Section LABEL:sec:pruning). Their canonical form designed to accommodate a listbased lattice, is incomplete for representing ODs.
Lastly, the listbased canonical form for ODs used in [?] is not compact. By introducing a setbased canonical form for ODs, we can achieve much greater compactness in our representation. Many ODs that are considered minimal by the algorithm in [?] are found to be redundant by our algorithm. This is quite important for efficiency in OD discovery. We do not need to rediscover the “same” ODs repeatedly–that is, ODs that can be inferred from ones we have already discovered–and we can more aggressively prune portions of the search space which would only have “repeats”.
In Section Effective and Complete Discovery of Order Dependencies via Setbased Axiomatization, we define ODs and provide additional examples. In this paper, we make the following contributions.

Mapping. We translate ODs (formally defined in Section Effective and Complete Discovery of Order Dependencies via Setbased Axiomatization) into a novel setbased canonical form (Section LABEL:section:canonical) which leads to a new and efficient approach to OD discovery (Section LABEL:sec:discovery). By mapping ODs into equivalent setbased canonical ODs, we illustrate that they can be discovered efficiently by traversing a setcontainment lattice with exponential worstcase complexity in the number of attributes (the same as for FDs [?]), and just linear in the number of tuples (Section LABEL:sec:compAnalyis).

Setbased Axiomatization. We introduce axioms for setbased canonical ODs and prove these are sound and complete (Section LABEL:sec:canUODAxioms). Our inference rules reveal insights into the nature of canonical ODs, which lead to optimizations of the OD discovery algorithm to avoid redundant computation (Section LABEL:sec:discovery).

Completeness and Minimality. We prove that our discovery algorithm produces a complete, minimal set of ODs (Section LABEL:sec:discovery).

Experiments. We provide an experimental study (Section LABEL:sec:experiments) of the performance and effectiveness of our discovery techniques using real datasets. We demonstrate the speedup due to our pruning rules, and report ordersofmagnitude performance improvements over previous work [?].
We discuss related work in Section LABEL:section:relatedWork. In Section LABEL:section:conclusions, we conclude and consider future work.
We begin by formally defining ODs, present the established listbased axiomatization [?], and explain the two ways in which they can be violated.
We use the following notational conventions.

Relations. R denotes a relation (schema) and r denotes a specific relation instance (table). A, B and C denote single attributes, s and t denote tuples, and denotes the value of an attribute A in a tuple t.

Sets. and denote sets of attributes. Let denote the projection of tuple t on . is shorthand for . The empty set is denoted as .

Lists. X, Y and Z denote lists of attributes. X may represent the empty list, denoted as . denotes an explicit list. denotes a list with head A and tail T; i.e., the remaining list when the first element is removed. Let XY be shorthand for (X concatenate Y). Set denotes the set of elements in list X. Any place a set is expected but a list appears, the list is cast to a set; e.g., denotes . Let denote some other permutation of elements of list X.
Let an order specification be a list of attributes defining a lexicographic order; i.e., sorting on the first attribute in the list, breaking ties by the second attribute, breaking further ties by the third attribute, etc. This corresponds to the semantics of the SQL orderby clause.
First, we define the operator ‘’, which defines a weak total order over any set of tuples, where X denotes an order specification. Unless otherwise specified, numbers are ordered numerically, strings are ordered lexicographically and dates are ordered chronologically (all ascending).

Let X be a list of attributes. For two tuples r and s, , if

; or

and ; or

, , and .
Let if but . \@endtheorem
Next, we define order dependencies [?, ?, ?].
