Discovery of Paradigm Dependencies

Jizhou Sun, Jianzhong Li, Hong Gao
School of Computer Science and Technology, Harbin Institute of Technology
92 West Dazhi Street, Nan Gang District, Harbin, China
 sjzh@hit.edu.cn,  lijzh@hit.edu.cn,  honggao@hit.edu.cn
Abstract

Missing and incorrect values often cause serious consequences. To deal with these data quality problems, a class of commonly employed tools is dependency rules, such as Functional Dependencies (FDs), Conditional Functional Dependencies (CFDs) and Editing Rules (eRs). The stronger a dependency's expressive power, the better the quality of the data that can be obtained. To the best of our knowledge, all previous dependencies treat each attribute value as a non-splittable whole. In many applications, however, part of a value may contain meaningful information, indicating that more powerful dependency rules for handling data quality problems are possible.

In this paper, we consider discovering such a type of dependencies, named Paradigm Dependencies (PDs), in which the left-hand side is part of a regular-expression-like paradigm. A PD states that if a string matches the paradigm, the element at a specified position determines the value of a certain other attribute. We propose a framework in which strings with similar coding rules and different lengths are clustered together and aligned vertically, from which PDs can be discovered directly. The aligning problem is the key component of this framework and is proved to be NP-complete. A greedy algorithm is introduced in which the clustering and aligning tasks are accomplished simultaneously. Because of the greedy algorithm's high time complexity, several pruning strategies are proposed to reduce the running time. In the experimental study, three real datasets as well as several synthetic datasets are employed to verify our methods' effectiveness and efficiency.

I Introduction

Statistics show that dirty data is increasingly inevitable and widespread[1, 2]; it often causes serious consequences[3, 4] and is expensive to clean. In recent years, the database community has extensively investigated the problem of dealing with dirty data. Inconsistency and incompleteness are two important aspects of dirty data. A database is inconsistent if it violates data quality rules such as functional dependencies[5], conditional functional dependencies[6, 7], extended conditional functional dependencies[8], editing rules[9] and fixing rules[10]. These rules are also helpful in imputing missing values in an incomplete database.

All these rules exploit relationships between entire attributes, just as defined in the relational model where attributes are non-splittable. In many real applications, however, part of an attribute value (especially a string-typed one) contains information useful for dealing with incomplete and inconsistent data: the manufacturer of a product may name it after its specifications, an ISBN number encodes the press, and the DOI of a paper indicates publishing information such as the organization, volume and issue number.

In this paper, we consider discovering such a type of dependencies (Paradigm Dependencies, PDs) from existing datasets. As a motivating example, on an online shopping website such as eBay, Amazon or Rakuten, all kinds of products are listed with their specifications on the product pages. However, the specification data may contain errors or even be incomplete. Erroneous information may mislead customers into buying goods they do not want, and commodities with incomplete information may be missed by search: for instance, if a customer is looking for computers with a particular memory size, a product meeting the requirement becomes invisible to the potential buyer when its memory size value is unknown. To avoid such undesirable situations, PDs state that part of the id (or type identifier, serial number, etc.) of a product can help in finding its correct specification values. We employ a real-world example to illustrate the feasibility:

Example 1

SL410, T520i and T560 are three notebook types of Thinkpad, as shown in Table I. They are not named arbitrarily but follow certain rules: the starting letter stands for the series, the first numeric digit for Screen Size (4 for 14 inch and 5 for 15 inch), and the second digits of T520i and T560 show that they were first sold in 2012 and 2016 respectively.

If the Screen Size value of T560 were unfortunately lost, then according to the rule 'the first numeric digit decides the screen size value' along with T520i's information, it would be easy to derive the missing value, which is not the case with traditional dependencies.

Type         | Aligned Type | Year | Screen Size
SL410        | SL410_       | 2010 | 14 inch
T520i        | T_520i       | 2012 | 15 inch
T560         | T_560_       | 2016 | 15 inch
New S1 2016  | New S1 2016  | 2016 | 12 inch
New S1 2017  | New S1 2017  | 2017 | 13 inch
TABLE I: Several notebook types of Thinkpad

From real world applications, several observations can be found in the naming laws:

  • The string values are of different lengths, and digits with the same meaning in different strings may occur at different positions.

  • The meaningful digits are often ordered, for example, the digits standing for screen size always appear before those for year.

  • Digits with similar meanings are also similar to each other: in the running example, the first numeric digit stands for screen size, and 4 and 5 are not far from each other in the alphabet.

  • Not all tuples obey the same naming law, e.g., New S1 2016 and New S1 2017 are named in a different way from the other three.

It can be seen that there is indeed useful information hidden in parts of a string-typed attribute. Meaningful naming laws exist in many areas and in different forms, making them hard to obtain manually, so discovering such rules automatically is of great significance.

An intuitive solution is to cluster and align the strings first, and then detect dependencies in each clustered group. By clustering, strings following similar naming laws should be grouped together, and by aligning, characters with similar meanings should be placed in the same column. In Table I, for instance, the first three types may be clustered together and aligned by inserting null values (shown as underscores in column Aligned Type). Dependency relationships between these aligned positions and other attributes can then be found in the detecting step and expressed by star-free regular expressions, where several measures such as support and confidence are needed, similar to those in [11].

Contributions of this paper include:

  • A framework of finding paradigm dependencies is introduced.

  • The problem of string aligning is analysed and proven to be NP-complete.

  • A greedy algorithm is proposed and optimized, by which the aligning and clustering tasks are combined seamlessly.

  • Experiments are conducted on both real and synthetic datasets to verify our methods’ performance.

In the rest of this paper, related work is summarized in Section II. In Section III, the problems of clustering and aligning are studied in detail. In Section IV we discuss how to find dependencies using the clustered and aligned attributes. Section V presents our experimental results, and the paper is concluded in Section VI.

II Related Work

II-A Data Quality Rules

Arenas et al. proposed the concept of consistent query answers with respect to violations of traditional FDs, which identifies answers that hold over all possible repairs of a dirty database[12]. In [7], Bohannon et al. extended FDs to CFDs by embedding constant values, which gives them stronger expressive power. Matching dependencies (MDs) [13] were introduced to express an expert's knowledge for improving data quality: MDs declare that if two tuples are similar enough on some attributes, their values should be equal on a certain other attribute. Fan et al. proposed editing rules in [14], which can find certain fixes based on already known clean information: given some certain information, editing rules tell which attributes to fix and how to update them. In [15], fixing rules were proposed; a fixing rule contains an evidence pattern, a set of negative patterns and a fact value, and when a tuple matches some negative patterns, the corresponding fixing rule captures the wrong values and knows how to fix them. For dirty timestamps, temporal constraints were introduced in [16], which declare distance restrictions between timestamps; guided by these constraints, the problem of repairing inconsistent timestamps was studied. In [17], regular expressions (REs) are used to repair dirty data, making the data obey the given REs with minimal revision cost.

II-B Rules Discovery

Numerous algorithms have been proposed for discovering FDs, including TANE[18], DepMiner[19] and FastFDs[20]. Both TANE and DepMiner search the attribute lattice in a levelwise manner and can directly obtain a minimal FD cover. FastFDs, on the other hand, searches for FDs in a greedy, depth-first manner and may produce non-minimal FDs, so a further checking step is required.

In [11], the problem of discovering CFDs was studied, and algorithms for discovering minimal CFDs and dirty values in a data instance were presented. Fan et al. proposed algorithms for discovering CFDs analogous to those for FDs in [21], i.e., CFDMiner for constant CFDs, and CTANE and FastCFD for general ones.

Besides the discovery of FDs and CFDs, there has been research focusing on discovering other kinds of rules. The discovery of Denial Constraints (DCs) was studied in [22], where an efficient instance-driven discovery algorithm (extended from FastFDs) was developed based on a set of inference rules. For temporal rules used in web data cleaning, [23] used machine learning methods such as association measures and outlier detection to handle the discovery problem, which is robust to noise.

III Clustering and Aligning

In this section, we introduce the clustering and aligning problems and analyze the difficulties in solving them. Table II summarizes the most important symbols used in the remainder of the paper.

Notation | Description
d | Distance function between characters
diam(S) | Diameter of a set of characters S
len(P) | Length of the strings in paradigm P
|P| | Number of strings (cardinality) of paradigm P
C_i(P) | The set of characters in the i-th column of P
size(P) | Sum of diam(C_i(P)) over all locations i
P1 ⊕ P2 | The new paradigm created by aligning (or merging) P1 and P2
lb | Lower bound of a merging size
ub | Upper bound of a merging size
[lb, ub] | Bound interval of a merging size
CR | The set of critical intervals
Φ(CR) | Paradigms involved in CR
TABLE II: Frequently used symbols

III-A Clustering

In a good clustering method, two properties are desired:

  • Strings with similar naming law should be clustered together.

  • Strings with different naming laws should be clustered into different groups.

For the first property, if strings with similar naming laws are separated, the number of strings in a group is reduced, which means low support. Concerning the second property, if strings with different naming laws are grouped together, existing dependencies would be discarded due to underestimated confidence. Both undesired conditions prevent true dependencies from being discovered.

In classical clustering, one or more parameters are often required to control the number or sizes of the clusters. Unfortunately, in our situation it is hard to give such a parameter. In addition, simply partitioning the strings into disjoint groups is insufficient: if dependencies can be discovered from two groups separately, it is possible that by merging the two groups a new dependency can be detected.

In this paper, we consider an adaptive method similar to hierarchical clustering [24, 25, 26]. In our proposed framework, no clustering parameters are required, and both merged groups and their subgroups are maintained.

III-B Aligning

Let a set of strings over a character set be given. For a string, its length is the number of characters (or elements) it contains, and its characters are referred to by their positions.

Before defining the problem, we introduce several notions. A distance function d assigns a non-negative value to every pair of characters over the character set.

In the rest of this paper, the distance function is assumed to be a metric unless specified otherwise.

Definition 1 (Diameter of Character Sets)

For a set of characters, its diameter is the largest distance between any pair of its elements.
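Written out explicitly in the notation of Table II (the symbol names d and diam are chosen here only to make the statement concrete), the diameter of a character set S is:

```latex
\mathrm{diam}(S) \;=\; \max_{a,\,b \,\in\, S} d(a, b).
```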

The diameter satisfies monotonicity and a triangle-like inequality:

Lemma 1

For two arbitrary character sets S1 and S2, if S1 ⊆ S2, then diam(S1) ≤ diam(S2) always holds. (Proof omitted.)

Lemma 2

For three arbitrary character sets S1, S2 and S3, the following inequality holds:

diam(S1 ∪ S2) ≤ diam(S1 ∪ S3) + diam(S3 ∪ S2).   (1)
Proof 1

Let a and b be the pair of farthest characters in S1 ∪ S2, so that diam(S1 ∪ S2) = d(a, b). We prove the inequality in all possible cases:

  • If both a and b belong to S1, then d(a, b) ≤ diam(S1 ∪ S3) and the inequality holds.

  • If both a and b belong to S2, the proof is similar.

  • In the remaining case, a and b must come from S1 and S2 respectively; without loss of generality, assume a ∈ S1 and b ∈ S2. For any character c ∈ S3, d(a, c) ≤ diam(S1 ∪ S3) and d(c, b) ≤ diam(S3 ∪ S2). Meanwhile, because d is a metric, d(a, b) ≤ d(a, c) + d(c, b), which yields Eq. 1.

Definition 2 (Paradigm)

A paradigm P is a set of strings of equal length. The common length of the strings in P is also called P's length, denoted len(P), and the cardinality of P, denoted |P|, is the number of strings in P. The size of P is defined as the sum of the diameters of its columns,

where C_i(P) is the set of characters in the i-th column of P.
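In the notation of Table II, the size can be written as:

```latex
\mathrm{size}(P) \;=\; \sum_{i=1}^{\mathrm{len}(P)} \mathrm{diam}\bigl(C_i(P)\bigr).
```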

Now we are ready to define the aligning problem formally.

Definition 3 (String Aligning Problem (SAP))

Given a set of strings, a distance function d and a threshold, find a paradigm P obtained from the strings by inserting null values such that size(P) does not exceed the threshold.

The intuition is that, if the distance function is well designed, similar characters will be aligned into the same column with high probability. Only insertions are allowed, because deleting or changing a character would cause loss of information, which is undesirable. The intuition of using the sum of column diameters as the size of a paradigm is that a smaller diameter means a higher degree of similarity; with the size minimized, similar characters are aligned together with high probability. Meanwhile, the number of distinct elements in a column should not influence the size too much. For example, if two close characters already appear in a column, a third character close to both is natural to add and should not enlarge the size much; the maximum pairwise distance has this property, whereas a measure that accumulates over all pairs of characters, for instance, does not.

Unfortunately, the SAP is intractable.

Theorem 1

SAP is NP-complete, even if the distance function is a metric.

Proof 2

The length of the aligned paradigm is bounded below by the length of the longest input string and above by the total length of all input strings. To show membership in NP, we simply guess the aligned length and the positions at which null values are inserted into each string, and then verify whether the size of the resulting paradigm is no more than the threshold. Obviously, this can be done by a nondeterministic algorithm in polynomial time, so SAP is in NP.

The NP-hardness can be proven by a reduction from the Shortest Common Supersequence (SCS) problem[27].

The SCS problem asks, for a set of strings and a length bound, for a string of at most that length such that every string in the set is a subsequence of it. Given such an instance of SCS, the construction of the SAP instance needs some care. Let the string set of the SAP instance consist of all the SCS strings plus one additional string,

where the additional string is made up of identical copies of a new character that does not appear in the SCS alphabet.

The distance function is defined as follows:

A corresponding threshold is fixed according to the SCS length bound. It is easy to verify that the constructed distance function satisfies the triangle inequality, and the construction can be done in polynomial time.

Now we prove that there exists a common supersequence of the required length for the SCS instance if and only if there exists a paradigm within the size threshold for the constructed SAP instance.

If. Because the all-new-character string is in the string set and only insertions are allowed, the aligned length is at least its length. There are exactly that many columns containing the new character, and every one of them also contains a character different from it (i.e., a null or an ordinary character). For each such column, the diameter of the corresponding character set is bounded from below according to the definition of the distance function. Thus we obtain a lower bound on the paradigm size.

Combined with the assumption that the paradigm size is within the threshold, this means the aligned length equals the length of the all-new-character string and there are no more than two different non-null characters in each column. By removing the all-new-character string from the paradigm, the remaining strings can be directly merged into a supersequence of the required length.

Only If. If there exists a common supersequence of the required length for the SCS instance, then each string in the string set can be transformed into it by inserting characters at some positions. By inserting nulls instead of the specified characters at these positions, a new string of the required length is obtained. These new strings, together with the all-new-character string, form a solution of the SAP instance. In each column, no two characters have distance larger than 1, so the diameter of each column is at most 1 and the total size is within the threshold.

Due to its intractability, we consider greedy solutions. The overall framework and algorithms are introduced in the rest of this section.

III-C Framework

Although the aligning problem is intractable when the number of strings is arbitrary, there exist deterministic algorithms giving the optimal solution for a constant number of strings, based on Dynamic Programming (DP). In fact, computing the Edit Distance is a special case of the aligning problem with two input strings, obtained by setting the distance between different characters to a constant value. The algorithm is almost the same, except for a difference in the recursive equation. In our setting, the recursive equation takes the form sketched below.
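A plausible form of the recursion is the following, where C_i and C'_j denote the i-th and j-th column character sets of the two paradigms being merged, ⊥ the inserted null, and D[i][j] the minimal size of aligning the first i columns of one paradigm with the first j columns of the other (this notation is ours, introduced for concreteness):

```latex
D[i][j] \;=\; \min
\begin{cases}
  D[i-1][j-1] \;+\; \mathrm{diam}\bigl(C_i \cup C'_j\bigr)        & \text{align the two columns}\\
  D[i-1][j]   \;+\; \mathrm{diam}\bigl(C_i \cup \{\perp\}\bigr)   & \text{insert a null column into the second paradigm}\\
  D[i][j-1]   \;+\; \mathrm{diam}\bigl(\{\perp\} \cup C'_j\bigr)  & \text{insert a null column into the first paradigm}
\end{cases}
```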

When calculating the Edit Distance, the distance function is replaced with a constant. In the rest of this paper, for the two-string case, we use the word 'merge' instead of 'align' where convenient, for ease of understanding.

Due to the tractability of the two-string case and the intractability of the general case, a straightforward idea is to merge the most similar strings at each step, i.e., to proceed in a greedy manner. The intuition is that strings following similar naming laws should be merged as early as possible: by doing so, different characters with similar meanings will be aligned together with high probability. For example, if two strings obviously obey the same naming law and share most of their characters, merging them first lets the identical characters help align the few differing characters together.

We say that the resulting paradigm is merged from the input strings. Merged paradigms can be further merged with other strings or paradigms in the same way; without loss of generality, a single string can be seen as an initial paradigm. We denote by P1 ⊕ P2 the paradigm merged from paradigms P1 and P2, call P1 ⊕ P2 the super-paradigm of P1 (or P2), and call P1 (P2, respectively) a sub-paradigm of P1 ⊕ P2.

There are two alternative greedy strategies:

In the first option, the two strings whose merged size is minimal are merged into a paradigm. Next, among the strings not yet selected, the one whose merge into the current paradigm yields the minimal size is selected and merged in. This process continues until all strings have been merged. Because a single paradigm keeps being enlarged during the run, we call this Single Merge, as shown in Algorithm 1.

Input: a set of strings over a charset and a metric d defined on it.
Output: an aligning of the strings with size as small as possible.
1:  P ← the merge of the two strings whose merged size is minimal; remove them from the string set;
2:  while unmerged strings remain do
3:     select the string s whose merge with P has minimal size;
4:     P ← the merge of P and s; remove s from the string set;
5:  return P;
Algorithm 1 Single Merge

In the alternative option, each string is initialized as a single-string paradigm, and all initial paradigms are added to a pool. The paradigms are then merged iteratively in a pairwise manner: at each step, the pair of paradigms whose merged size is minimal is selected and merged into a new one; after each merging operation, the two merged paradigms are removed from the pool and the new paradigm is added. This process continues until a single paradigm remains. Because a pair of paradigms is merged at each step, we call this Pairwise Merge, as shown in Algorithm 2.

Input: a set of strings over a charset and a metric d defined on it.
Output: an aligning of the strings with size as small as possible.
1:  Pool ← ∅;
2:  for each input string s do
3:     add the single-string paradigm {s} into Pool;
4:  while Pool contains more than one paradigm do
5:     find the two paradigms P and Q in Pool with size(P ⊕ Q) minimized;
6:     add P ⊕ Q into Pool;
7:     remove P and Q from Pool;
8:  return the single paradigm left in Pool;
Algorithm 2 Pairwise Merge
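As a concrete illustration of Algorithm 2 and of the column-level recursion above, the sketch below greedily merges a small set of strings. A paradigm is represented simply as one character set per aligned column; the character distance (0 for identical characters, 0.5 for characters of the same class, 1 otherwise), the null symbol '_' and all identifiers are assumptions made here for illustration, and the baseline recomputes all pair sizes in every round, without the pruning of Sections III-D and III-E.

```java
import java.util.*;

public class GreedyPairwiseMerge {
    static final char NULL_CH = '_';   // placeholder character inserted during alignment

    // Illustrative character distance (an assumption, not the paper's function).
    static double dist(char a, char b) {
        if (a == b) return 0;
        boolean sameClass = (Character.isDigit(a) && Character.isDigit(b))
                         || (Character.isLetter(a) && Character.isLetter(b));
        return sameClass ? 0.5 : 1.0;
    }

    // Diameter of one aligned column: the largest pairwise distance inside it.
    static double diam(Set<Character> col) {
        double best = 0;
        for (char a : col) for (char b : col) best = Math.max(best, dist(a, b));
        return best;
    }

    // Size of a paradigm: the sum of its column diameters.
    static double size(List<Set<Character>> p) {
        double s = 0;
        for (Set<Character> col : p) s += diam(col);
        return s;
    }

    static Set<Character> plusNull(Set<Character> col) {
        Set<Character> s = new HashSet<>(col); s.add(NULL_CH); return s;
    }

    static Set<Character> union(Set<Character> a, Set<Character> b) {
        Set<Character> s = new HashSet<>(a); s.addAll(b); return s;
    }

    // Column-level DP: align two paradigms, inserting null columns, so that the merged size is minimal.
    static List<Set<Character>> merge(List<Set<Character>> p, List<Set<Character>> q) {
        int n = p.size(), m = q.size();
        double[][] dp = new double[n + 1][m + 1];
        int[][] mv = new int[n + 1][m + 1];   // 0: align two columns, 1: null column for q, 2: null column for p
        for (int i = 1; i <= n; i++) { dp[i][0] = dp[i - 1][0] + diam(plusNull(p.get(i - 1))); mv[i][0] = 1; }
        for (int j = 1; j <= m; j++) { dp[0][j] = dp[0][j - 1] + diam(plusNull(q.get(j - 1))); mv[0][j] = 2; }
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++) {
                double both  = dp[i - 1][j - 1] + diam(union(p.get(i - 1), q.get(j - 1)));
                double skipQ = dp[i - 1][j] + diam(plusNull(p.get(i - 1)));
                double skipP = dp[i][j - 1] + diam(plusNull(q.get(j - 1)));
                dp[i][j] = Math.min(both, Math.min(skipQ, skipP));
                mv[i][j] = dp[i][j] == both ? 0 : (dp[i][j] == skipQ ? 1 : 2);
            }
        LinkedList<Set<Character>> merged = new LinkedList<>();   // rebuild the merged columns by traceback
        for (int i = n, j = m; i > 0 || j > 0; ) {
            if (mv[i][j] == 0)      merged.addFirst(union(p.get(--i), q.get(--j)));
            else if (mv[i][j] == 1) merged.addFirst(plusNull(p.get(--i)));
            else                    merged.addFirst(plusNull(q.get(--j)));
        }
        return merged;
    }

    // Algorithm 2 (Pairwise Merge): repeatedly merge the pair of paradigms with the minimal merged size.
    static List<Set<Character>> pairwiseMerge(List<String> strings) {
        List<List<Set<Character>>> pool = new ArrayList<>();
        for (String s : strings) {                 // every string starts as a single-string paradigm
            List<Set<Character>> p = new ArrayList<>();
            for (char c : s.toCharArray()) p.add(new HashSet<>(Collections.singleton(c)));
            pool.add(p);
        }
        while (pool.size() > 1) {
            int bi = 0, bj = 1;
            double best = Double.MAX_VALUE;
            List<Set<Character>> bestMerged = null;
            for (int i = 0; i < pool.size(); i++)
                for (int j = i + 1; j < pool.size(); j++) {
                    List<Set<Character>> m = merge(pool.get(i), pool.get(j));
                    if (size(m) < best) { best = size(m); bi = i; bj = j; bestMerged = m; }
                }
            pool.remove(bj);                       // bj > bi, so remove the larger index first
            pool.remove(bi);
            pool.add(bestMerged);
        }
        return pool.get(0);
    }

    public static void main(String[] args) {
        List<Set<Character>> result = pairwiseMerge(Arrays.asList("SL410", "T520i", "T560"));
        System.out.println(result + "  size=" + size(result));
    }
}
```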

The most time-consuming operation is the merging of two paradigms, whose dynamic programming takes time quadratic in the lengths of the input paradigms, with an additional factor depending on the charset size. It can be shown that both Single Merge and Pairwise Merge require a number of merging operations that is quadratic in the number of input strings.

Pairwise Merge is preferred in this paper, for it naturally fits the requirements of a hierarchical clustering algorithm without extra work: each paradigm can be seen as a group of similar strings, and if two paradigms are similar enough, they are merged into a larger one.

Another benefit is that no clustering parameter is required in the pairwise framework. By maintaining every merged paradigm together with its two sub-paradigms in a binary tree structure, dependency discovery becomes straightforward, as will be discussed in Section IV.

Due to the high complexity of the algorithm, we consider several pruning strategies to improve its efficiency, together with the necessary theoretical support.

As discussed before, the most time-consuming operation in the pairwise algorithm is the merging of paradigms. Moreover, merging is carried out many times, even though only a small fraction of the merges actually generate larger paradigms (Line 6 of Algorithm 2); all the rest are performed just to find the pair with the smallest merged size (Line 5). It is therefore possible to reduce the amount of useless merging work, and we propose two pruning techniques.

III-D Bound Based Pruning

Instead of calculating the exact size of merging each pair of paradigms, computing an interval for that size may require much less computation. We denote the lower and upper bounds of size(P1 ⊕ P2) by lb and ub respectively, and the corresponding interval by [lb, ub].

The basic idea is as follows: when searching for the pair of paradigms with the lowest merging size, we maintain an interval for each candidate pair. Every candidate pair whose lower bound is higher than the lowest upper bound among all intervals can be safely pruned in the current iteration. The remaining candidates are called critical intervals, denoted by CR, and tightening their bounds can help prune further candidates. The tightening continues until a single candidate is left in CR, which is exactly the pair to be merged. We call this the refining process and illustrate it with an example.
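A minimal sketch of this pruning rule, with hypothetical pair names and bound values (the refining step that repeatedly tightens the surviving intervals is omitted):

```java
import java.util.*;

public class BoundPruning {
    // Lower and upper bound of the merging size of one candidate pair of paradigms.
    static class Interval {
        final String pair; final double lb, ub;
        Interval(String pair, double lb, double ub) { this.pair = pair; this.lb = lb; this.ub = ub; }
        public String toString() { return pair + "[" + lb + "," + ub + "]"; }
    }

    // Keep only the critical intervals: a candidate whose lower bound exceeds the
    // lowest upper bound can never have the minimal merging size and is pruned.
    static List<Interval> criticalSet(List<Interval> candidates) {
        double lowestUb = Double.MAX_VALUE;
        for (Interval iv : candidates) lowestUb = Math.min(lowestUb, iv.ub);
        List<Interval> critical = new ArrayList<>();
        for (Interval iv : candidates)
            if (iv.lb <= lowestUb) critical.add(iv);
        return critical;
    }

    public static void main(String[] args) {
        List<Interval> cand = Arrays.asList(
                new Interval("P1-P2", 2.0, 5.0),
                new Interval("P1-P3", 3.0, 7.0),
                new Interval("P2-P3", 6.0, 9.0));   // lower bound 6.0 > 5.0, so this pair is pruned
        System.out.println(criticalSet(cand));      // prints [P1-P2[2.0,5.0], P1-P3[3.0,7.0]]
    }
}
```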

Example 2

Consider the candidate intervals shown in Fig. 1. One of them has the lowest upper bound, and several others have lower bounds below it, so it is hard to tell which of the corresponding pairs of paradigms has the lowest merging size. The remaining interval's lower bound is greater than the lowest upper bound, so the corresponding pair's merging size cannot be the lowest and it can be safely pruned (drawn with a dotted line). The critical intervals thus consist of the other three.

By refining the intervals in CR, as shown in Fig. 1, the lowest upper bound becomes lower and some lower bounds become higher; an interval whose lower bound now exceeds the lowest upper bound is removed from CR and pruned. At that point only two intervals remain in CR. After a further refining step, one of them is pruned and the other is identified as the pair with the minimal merging size, as shown in Fig. 1.

Fig. 1: An example of interval refining

Before describing the pruning based algorithm, we introduce several necessary inequalities and discuss how to evaluate the bounds using them.

Monotonicity

Size of a paradigm is monotonic.

Lemma 3

Merging another paradigm into a paradigm P1 does not decrease its size:

size(P1) ≤ size(P1 ⊕ P2).   (2)

Moreover, the size of merging a paradigm with a super-paradigm is no less than that of merging it with a sub-paradigm.

Lemma 4
size(P1 ⊕ (P2 ⊕ P3)) ≥ size(P1 ⊕ P2).   (3)

These two lemmas are obvious and straightforward to prove, so their proofs are omitted here.

Triangular Inequality

Theorem 2 (Triangular Inequality)

The paradigm size satisfies the following triangle-like inequalities under merging operations if d is a metric; that is, for three arbitrary paradigms P1, P2 and P3, Eqs. 4-6 always hold:

size((P1 ⊕ P3) ⊕ P2) ≤ size(P1 ⊕ P3) + size(P3 ⊕ P2).   (4)
size(P1 ⊕ P2) ≤ size(P1 ⊕ P3) + size(P3 ⊕ P2).   (5)
size(P1 ⊕ P2) ≥ size(P1 ⊕ P3) − size(P3 ⊕ P2).   (6)

Eq. 4 has a less obviously triangular form than Eq. 5, so we call it the Pseudo Triangular Inequality.

Proof 3

We introduce an auxiliary paradigm in which P1 and P3 are aligned so that their relative positions are the same as on the left-hand side of Eq. 4. Obviously, its size is no less than size(P1 ⊕ P3), so it suffices to prove the inequality with this auxiliary paradigm on the right-hand side.

Consider an arbitrary column position and denote the character sets at that position contributed by P1, P2 and P3 by S1, S2 and S3 respectively. Because the only change introduced by the auxiliary alignment is the insertion of null columns, the relevant diameters remain unchanged, and it suffices to prove diam(S1 ∪ S2) ≤ diam(S1 ∪ S3) + diam(S3 ∪ S2) at each column, which directly holds according to Eq. 1.

Eq. 5 can be proved directly from Eq. 4 and Eq. 3, and Eq. 6 can be inferred from Eq. 5 by substituting the paradigms.

In our framework, the exact merging size is often unavailable; instead, the bounds lb and ub are maintained. Thus, the previous inequalities should be adjusted before use, replacing exact sizes by the corresponding bounds; for example, the right-hand side of Eq. 4 is then evaluated with upper bounds. In the rest of this paper, we refer to the inequalities by their adjusted versions unless specified otherwise.

Eqs. 2-4 provide bounds when a newly merged paradigm is added to the pool. If P1 and P2 are identified as the pair with minimal merging size, then after adding P1 ⊕ P2 to the pool, lb and ub of size((P1 ⊕ P2) ⊕ Q) should be initialized for every other paradigm Q in the pool.

According to Eq. 2, size(P1 ⊕ P2) is a candidate value of the lower bound. With Eq. 3, size(P1 ⊕ Q) (and similarly size(P2 ⊕ Q)) is another candidate value of the lower bound. For the upper bound, Eq. 4 asserts that sums of already known sizes involving P1, P2 and Q are available upper bounds.

On the other hand, Eqs. 5 and 6 play an important role when the former three are not applicable. For instance, before the iterations begin, each paradigm contains a single string, while Eqs. 2-4 concern newly merged paradigms. In addition, while refining CR, no new paradigm is created, yet the critical intervals require tightening: by calculating exact merging sizes with a chosen pivot, the remaining intervals can be tightened.

Now we discuss how to find the pairs whose sizes should be evaluated exactly in order to refine the critical set CR. By Φ(CR) we denote the set of paradigms involved in CR. To tighten the bounds of every pair in CR, we propose a single-pivot-star refining technique: select a paradigm X in Φ(CR) as the pivot, then evaluate the exact value of size(X ⊕ P) for every other paradigm P in Φ(CR). With these values, the bounds of each pair of paradigms can be updated as shown below.
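A plausible form of the update rules, following from the adjusted Eqs. 5 and 6 (each rule is applied only when it tightens the current interval):

```latex
ub(P_i \oplus P_j) \;\leftarrow\; \min\Bigl( ub(P_i \oplus P_j),\; \mathrm{size}(P_i \oplus X) + \mathrm{size}(X \oplus P_j) \Bigr),
\qquad
lb(P_i \oplus P_j) \;\leftarrow\; \max\Bigl( lb(P_i \oplus P_j),\; \bigl|\mathrm{size}(P_i \oplus X) - \mathrm{size}(X \oplus P_j)\bigr| \Bigr).
```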

We demonstrate this with an example.

Example 3

In Fig. 2, the paradigms are presented as nodes, among which one is selected as the pivot X. The values size(X ⊕ P) for every other paradigm P are evaluated exactly, presented by solid lines and shaped like a star. All other pairs' bounds are then obtained from these exact values according to Eqs. 5 and 6.

The intuition of single-pivot-star refining is that the bounds of all pairs can be evaluated either exactly or directly via Eqs. 5 and 6. Meanwhile, it carries out the minimal number of merging operations necessary (one fewer than the number of involved paradigms), because connecting all paradigms requires that many edges.

Pivot Selection

Now we show how to select a pivot wisely. CR is a set of intervals overlapping with each other, and from Fig. 1 it can be seen that tightening intervals helps reduce CR. We therefore evaluate a pivot's ability to shorten intervals by introducing a score function on the involved paradigms:

score(X) = the sum of the widths ub − lb of the intervals in CR that involve X.   (7)

That is, score(X) is the sum of the widths of the intervals related to X. The paradigm with the maximal score in Φ(CR) is then selected as the pivot. In single-pivot-star refining, all intervals involving the pivot shrink to exact values, so the total width reduced equals the pivot's score, which should therefore be maximized.

Fig. 2: An example of single-pivot-star refining.

So far, the pruning techniques based on lower and upper bounds have been discussed. Next, we introduce Independency Based Pruning to improve the algorithm's efficiency further.

III-E Independency Based Pruning

When identifying CR, a large number of intervals may turn out to be critical in the worst case, namely when almost all paradigm pairs have nearly the same size. In that case, the refining process requires many actual merging operations and many intervals have to be refined, so the total amount of merging and refining work degrades efficiency seriously.

To reduce the cardinality of CR, it can be identified more wisely. Consider the interval from which the lowest upper bound is found (see Fig. 1). We consider verifying only those intervals that are dependent with it, where 'dependent' means that one interval shares a common paradigm with the other. We demonstrate this by an example:

Example 4

In Fig. 1 of Example 2, one interval holds the lowest upper bound. Two other intervals are dependent with it, because each of them shares a common paradigm with it, while a third interval is independent with it, since they share no common paradigm. With independency based pruning, the independent interval is not added into CR, and the refining terminates after a single refining step (see Fig. 1).

Under this pruning strategy, a paradigm pair with a higher merging size may be merged earlier than another pair with a lower one. Fortunately, this reordering does not affect the final result, which can be proved theoretically.

Theorem 3

If two paradigms P and Q have the minimal merging size among all pairs involving P or Q, they will eventually be merged with each other.

Proof 4

The relationship between P, Q and all other paradigms is illustrated in Fig. 3: P and Q have the minimal merging size with each other, although there may exist pairs not involving them with an even lower merging size.

Now it can be shown that P and Q will be merged eventually, no matter how late. When a merge occurs between other paradigms, say R1 and R2, and R1 ⊕ R2 is added to the pool, we have size(P ⊕ (R1 ⊕ R2)) ≥ size(P ⊕ R1) according to Eq. 3, so P and Q still have the minimal merging size with each other. Thus, when P (or Q) is merged, it must be merged with Q (P, respectively), and it makes no difference whether they are merged sooner or later.

Fig. 3: Relationship between P, Q and the other paradigms.

III-F Algorithm Flow

Our pruning based algorithm is shown in Algorithm 3. In Lines 1-3, the paradigm set and the size intervals are initialized. Then the paradigms are merged pairwise and iteratively, until a single one is obtained (Lines 4-15). In each iteration, the critical interval set CR is identified (Lines 5-6) and refined iteratively until a single interval is left (Lines 7-10): in Line 8 a pivot paradigm is selected according to Eq. 7, the intervals in CR are refined by calling Algorithm 4, and CR and the lowest upper bound are updated accordingly (Line 10).

After the refining process terminates, the single interval remaining in CR identifies the two paradigms to be merged into a new one (Line 11), and the paradigm set is updated (Lines 12 and 13). For the newly added paradigm, the bounds of merging it with the other paradigms are evaluated (Lines 14 and 15).

Input: a set of strings over a charset and a metric d defined on it.
Output: an aligning of the strings with size as small as possible.
1:  Initialize the paradigm set Pool with one single-string paradigm per input string.
2:  for each pair of paradigms in Pool do
3:     initialize the bound interval [lb, ub] of their merging size;
4:  while Pool contains more than one paradigm do
5:     find the interval with the lowest upper bound;
6:     identify the critical set CR by that interval;
7:     while CR contains more than one interval do
8:        identify the pivot paradigm X using Eq. 7;
9:        Refine(CR, X);
10:        update CR and the lowest upper bound accordingly;
11:     let P and Q be the pair identified by the single interval left in CR;
12:     add P ⊕ Q into Pool;
13:     remove P and Q from Pool;
14:     for each remaining paradigm R in Pool do
15:        update lb and ub of merging R with P ⊕ Q using Eqs. 2-4;
16:  return the single paradigm left in Pool;
Algorithm 3 Pruning Merge

Algorithm 4 gives the procedure that tightens the intervals in CR using the pivot X. For each paradigm involved in CR, the exact value of its merging size with X is calculated (Lines 2-3). With these exact values, the other paradigm pairs' bounds are updated using Eqs. 5 and 6 whenever this makes the intervals tighter (Lines 4-5).

Thanks to the pruning, the number of merging operations in Algorithm 3 is no more than that in Algorithm 2. Meanwhile, at least one merging operation is performed in Lines 7-10 of every iteration, so we can conclude that Algorithm 3 always terminates.

1:  let Φ(CR) be the set of paradigms involved in CR;
2:  for each paradigm P in Φ(CR) other than X do
3:     set lb and ub of merging P with X to the exact value size(P ⊕ X);
4:  for each pair of paradigms P and Q in Φ(CR) do
5:     update lb and ub of merging P and Q using Eqs. 6 and 5, according to size(P ⊕ X) and size(Q ⊕ X);
Algorithm 4 Refine(CR, X)

IV Finding Paradigm Dependencies

Given a paradigm P, its column character sets can be very large, making P hard to present. We therefore compact P by deduplicating the characters in each column. For instance, in Table I of Example 1, the Type values of the first three tuples are aligned into the new column Aligned Type (a paradigm). After deduplication, the compact form is in fact a star-free regular expression: exactly one of the elements in braces '{ }' is supposed to appear, and at most one of the elements in square brackets '[ ]' may appear. For instance, TL510i is a string matching the compacted form of that paradigm.
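A small sketch of this compaction step, assuming the merged paradigm is available as one character set per column and using '_' for the inserted null (identifiers and the omission of range compression such as {2-4} are simplifications made here):

```java
import java.util.*;

public class CompactParadigm {
    static final char NULL_CH = '_';

    // Render each merged column in the compact star-free form: a plain character if the
    // column is constant, {..} if exactly one of several characters must appear, and
    // [..] if the column may also be empty (it contains the null placeholder).
    static String compact(List<Set<Character>> paradigm) {
        StringBuilder out = new StringBuilder();
        for (Set<Character> col : paradigm) {
            List<Character> chars = new ArrayList<>(col);
            boolean optional = chars.remove((Character) NULL_CH);   // null means "may be absent"
            Collections.sort(chars);
            StringBuilder body = new StringBuilder();
            for (char c : chars) body.append(c);
            if (optional)              out.append('[').append(body).append(']');
            else if (chars.size() > 1) out.append('{').append(body).append('}');
            else                       out.append(body);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The first three Thinkpad types of Table I, already aligned column by column.
        List<Set<Character>> p = Arrays.asList(
                new HashSet<>(Arrays.asList('S', 'T')),
                new HashSet<>(Arrays.asList('L', NULL_CH)),
                new HashSet<>(Arrays.asList('4', '5')),
                new HashSet<>(Arrays.asList('1', '2', '6')),
                new HashSet<>(Arrays.asList('0')),
                new HashSet<>(Arrays.asList('i', NULL_CH)));
        System.out.println(compact(p));   // prints {ST}[L]{45}{126}0[i]
    }
}
```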

With the paradigms generated in the previous section at hand, the notion of Paradigm Dependencies can now be defined.

Definition 4 (Paradigm Dependency)

A paradigm dependency is of the form (P, i) → B, where P is a paradigm generated on a string-typed attribute A, i is an integer between 1 and len(P), and A and B are attributes in the attribute set of the relation.

The semantics of (P, i) → B is as follows: given an instance D of the relation, for any two tuples t1 and t2 in D whose A-values match P, let c1 and c2 be the characters of t1 and t2 aligned to the i-th column of P; if c1 = c2 always implies t1[B] = t2[B], we say that D satisfies the dependency.
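In first-order form, writing χ_i(t) for the character of tuple t's A-value aligned to the i-th column of P (a shorthand introduced here for convenience), satisfaction of (P, i) → B by an instance D reads:

```latex
\forall\, t_1, t_2 \in D:\;
\bigl( t_1[A] \text{ matches } P \,\wedge\, t_2[A] \text{ matches } P \,\wedge\, \chi_i(t_1) = \chi_i(t_2) \bigr)
\;\Rightarrow\; t_1[B] = t_2[B].
```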

Intuitively, (P, i) → B states that the characters aligned to the specified location decide the values of B.

Before defining the discovery problem of paradigm dependencies, several measures are needed for finding high-quality dependencies.

Support is defined as the maximum number of tuples that can jointly satisfy the dependency.

Here D is the dataset from which dependencies are discovered. Support is a frequency measure based on the idea that values occurring together frequently carry more evidence that they are correlated and hence are more interesting. A support threshold should be specified, and every dependency discovered on D must have a support no lower than it.

The confidence of a dependency over D can be stated as the ratio of its support to the number of tuples matching the paradigm;

it measures the degree to which D satisfies the dependency. It is employed in this paper for two reasons: 1) D may contain incorrect values, and 2) characters with the same meaning are not always aligned to the same column. Again, a confidence threshold should be specified to filter the discovered dependencies.

It is possible that the domain of the right-hand-side attribute is very small. As a real-world instance, there are only two mainstream CPU brands (AMD and Intel) for computers. Thus it may happen that only a single value of the right-hand-side attribute B occurs in D. In this situation, every dependency with B on the right-hand side is satisfied by D; such dependencies carry no meaningful information and should be discarded. To handle this, the Diversity measure is introduced:

it counts the number of distinct values of attribute B occurring in D.

On the other hand, it is also possible that so many different characters occur in the i-th column of the aligned strings that every pair of tuples differs on the left-hand side. In that case, the premise of Definition 4 is never triggered, so every dependency with that column on the left-hand side is trivially satisfied. Such dependencies are meaningless and should be discarded, which motivates the InnerSupport measure:

it is defined on D and the column position, and counts the occurrences of the most frequent character in that column. For example, for the screen-size column of Table I, 5 is the most frequent character and it occurs twice. As before, a corresponding threshold is required.

Now we are ready to define the discovery problem formally.

Definition 5 (Discovery of Paradigm Dependencies)

Given a dataset D on an attribute set, a paradigm P aligned on a string-typed attribute of that set, and four measure thresholds (on support, confidence, diversity and inner support), find all paradigm dependencies defined on P and a right-hand-side attribute that satisfy all four thresholds.

For a candidate dependency (P, i) → B, each tuple claims a key-value pair, whose key is the character aligned to the i-th column and whose value is the tuple's B-value; thus a multiset KV of key-value pairs is obtained. The support measure is then easy to obtain from KV: for each key, keep only the occurrences of its most frequent value and sum these counts over all keys.

The confidence and diversity measures can then be evaluated directly from their definitions.

Inner support can be calculated similarly, from the occurrence counts of the keys in KV.

In fact, support and inner support are maximized simultaneously, i.e., on the same subset of tuples.
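To make the computation concrete, the sketch below evaluates the four measures from the multiset of key-value pairs. The exact formulas are a plausible reading of the verbal definitions above (support keeps the dominant value per key, confidence normalizes by the number of pairs, diversity counts distinct right-hand values, and inner support is taken as the occurrence count of the most frequent key), not a verbatim transcription of the paper's definitions.

```java
import java.util.*;

public class PdMeasures {
    // Evaluate the four measures for one candidate dependency, given the multiset of
    // (key, value) pairs: key is the character aligned to the chosen column, value is
    // the right-hand-side attribute value. Formulas are assumptions, as explained above.
    static void measures(List<Map.Entry<Character, String>> kv) {
        Map<Character, Map<String, Integer>> byKey = new HashMap<>();
        for (Map.Entry<Character, String> e : kv)
            byKey.computeIfAbsent(e.getKey(), k -> new HashMap<>())
                 .merge(e.getValue(), 1, Integer::sum);

        int support = 0, innerSupport = 0;
        Set<String> values = new HashSet<>();
        for (Map.Entry<Character, Map<String, Integer>> e : byKey.entrySet()) {
            int keyCount = e.getValue().values().stream().mapToInt(Integer::intValue).sum();
            int bestValueCount = e.getValue().values().stream().mapToInt(Integer::intValue).max().orElse(0);
            support += bestValueCount;                       // keep only the dominant value of each key
            innerSupport = Math.max(innerSupport, keyCount); // occurrences of the most frequent key
            values.addAll(e.getValue().keySet());
        }
        double confidence = (double) support / kv.size();
        int diversity = values.size();                       // number of distinct right-hand values
        System.out.printf("support=%d confidence=%.2f diversity=%d innerSupport=%d%n",
                support, confidence, diversity, innerSupport);
    }

    public static void main(String[] args) {
        // Screen-size example of Table I: the first numeric digit decides the screen size.
        List<Map.Entry<Character, String>> kv = Arrays.asList(
                new AbstractMap.SimpleEntry<>('4', "14 inch"),
                new AbstractMap.SimpleEntry<>('5', "15 inch"),
                new AbstractMap.SimpleEntry<>('5', "15 inch"));
        measures(kv);
    }
}
```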

A straightforward method is to evaluate the measures on each paradigm generated in Algorithm 3 (Line 11) and then discard the unqualified ones.

There are two pruning strategies that reduce useless work: 1) a paradigm whose cardinality is below the support threshold can be discarded directly, because its support cannot be higher than its cardinality; and 2) if the dependencies on the corresponding columns of both sub-paradigms have been discarded due to low confidence, then the dependency on the column of the super-paradigm into which those columns are aligned has low confidence as well. These claims are not hard to prove, and we omit the detailed discussion due to space limitations.

V Experiments

V-A Settings

All experiments were conducted on a machine with 256 2.2GHz Intel CPUs (of which only one was used) and 3TB of RAM. All algorithms were implemented in Java with the heap size set to 128GB. The underlying operating system is CentOS.

Real Datasets

We used three real datasets, namely the Notebook, CPU and RAM datasets. All of them were manually collected from the corresponding official websites. Each record contains a string-typed attribute, which is the product type, id or serial number and can be seen as an identifier of the product. We call this attribute ID; its values are to be aligned. The other attributes describe the specification of a product. For example, the Notebook dataset has attributes such as Screen Size, Model of Video Card, CPU Frequency, etc. Not all attribute values can be obtained for every product, and when finding paradigm dependencies, key-value pairs containing nulls are simply discarded. Statistical information of the three datasets is listed in Table III.

Data Sets     | Notebooks | CPUs | RAMs
#Brands       | 8         | 2    | 26
#Records      | 2004      | 1592 | 1532
#Attributes   | 35        | 25   | 12
AVG ID length | 21.5      | 15.3 | 12.6
MAX ID length | 50        | 23   | 25
TABLE III: Real world datasets statistics

V-A1 Synthetic Datasets

Synthetic data were used in the efficiency evaluation of the aligning problem, so only ID values were generated. When synthesizing datasets, several parameters were used, namely the length of the IDs, the number of records, the number of clusters and the variation ratio. The charset for generating strings consists of numeric digits, letters and several other visible ASCII characters.

First, a single string of the given length was generated for each cluster, with characters randomly selected from the charset. This string was used as a seed to generate further similar strings for that cluster.

Next, the required number of strings was generated from the seeds. For each string, a seed was selected uniformly at random, and a new string similar to that seed was generated by copying it with variation: each character may be changed with a probability equal to the variation ratio. To preserve similarity, a character is more likely to be changed into one of the same type; for example, if the character 'f' is changed, it is more likely to become 'd' than '0' or '_'. To simulate real-world situations, a character may also be deleted from the string, so that similar strings of different lengths are obtained. Every newly generated string can itself be used as a seed, to avoid all strings in a cluster being generated from the same single seed.
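A compact sketch of such a generator; the concrete charset, the same-type replacement pools and the deletion probability (set here to one fifth of the variation ratio) are assumptions for illustration, not the exact values used in the experiments.

```java
import java.util.*;

public class SyntheticIds {
    static final String CHARSET = "0123456789abcdefghijklmnopqrstuvwxyz_-./";  // assumed charset

    // Generate n ID strings spread over k clusters: one random seed per cluster, then
    // copies of (randomly chosen) earlier strings of that cluster, where each character
    // mutates with probability rho and may occasionally be dropped to vary the length.
    static List<String> generate(int len, int n, int k, double rho, long seedVal) {
        Random rnd = new Random(seedVal);
        List<List<String>> clusters = new ArrayList<>();
        for (int c = 0; c < k; c++) {
            StringBuilder seed = new StringBuilder();
            for (int i = 0; i < len; i++) seed.append(CHARSET.charAt(rnd.nextInt(CHARSET.length())));
            clusters.add(new ArrayList<>(Collections.singletonList(seed.toString())));
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            List<String> cluster = clusters.get(rnd.nextInt(k));         // pick a cluster uniformly
            String base = cluster.get(rnd.nextInt(cluster.size()));      // any earlier string may serve as seed
            StringBuilder s = new StringBuilder();
            for (char c : base.toCharArray()) {
                if (rnd.nextDouble() < rho / 5) continue;                 // occasional deletion changes the length
                s.append(rnd.nextDouble() < rho ? mutate(c, rnd) : c);
            }
            cluster.add(s.toString());
            out.add(s.toString());
        }
        return out;
    }

    // Replace a character by another of the same type (digit stays digit, letter stays letter).
    static char mutate(char c, Random rnd) {
        String pool = Character.isDigit(c) ? "0123456789"
                    : Character.isLetter(c) ? "abcdefghijklmnopqrstuvwxyz" : "_-./";
        return pool.charAt(rnd.nextInt(pool.length()));
    }

    public static void main(String[] args) {
        generate(15, 10, 2, 0.05, 42L).forEach(System.out::println);
    }
}
```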

Table IV shows the parameters considered in the experiments. When not explicitly stated, we use the default configuration values (highlighted in bold).

Parameter          | Range
String Length      | [10, ..., 20]
Number of Strings  | [1000, ..., 5000]
Number of Clusters | [10, ..., 50, ..., 100]
Variation Ratio    | [0.01, ..., 0.05, 0.10]
TABLE IV: Experiment parameter configuration ranges.

V-A2 Algorithms

The Single Merge based algorithm (Algorithm 1) does not fit our framework of aligning and clustering simultaneously, so it was not evaluated experimentally. The Pairwise Merge algorithm (Algorithm 2) embodies the basic idea of this paper and was used as a baseline to evaluate the efficiency of the pruning techniques. Two pruning strategies were proposed on top of the baseline, namely the bound based and the independency based techniques. The independency based one is so effective that, when it is abandoned, the algorithm becomes extremely inefficient and hard to evaluate within an acceptable time period; therefore this technique is used by default in all versions of the pruning based method.

As discussed before, the inequalities for bound based pruning are used either in the refining phase (Eqs. 5 and 6) or when a new paradigm is created (Eqs. 2-4). The former are the basis of the pruning based method, so they were used by default.

We evaluated two versions of the pruning based algorithm, with and without Eqs. 2-4; the former is denoted by Pruning+ and the latter by Pruning-.

V-B Discovered Rules

Compacted Paradigm and the location | Attribute
CORE I5-{2-46-8}{3-6}{0-79}0 | Video Card
PHENOM X{34} {89}{1-9}{05}0[E] | Frequency
ATHLONII {XN}{2-46E}[O] {1246-8BK}{0-8}{0-2458}[5][TEKUX] | Core Number
OCZ3{PFGX}{12}{036}{03}{03}{FL}{BV}{2-468}G[K] | Memory Size
HX4{23}{01468}C1{2-6}{PF}BK{24}/{13}{26} | Memory Frequency
P{DVG}{CTV}{23}{2-468}G{12689}{03-6}{03}{03}{FL}LK | Transmission Standard
{AEGSTU}[H]{1579}{2-57}{5-7X} | Screen Size
{AEGSTU}[TH]{1579}{2-57PV}{5-7RX}[AO] | Video Card Model
VOSTRO-15-{35}56{28}–LAPTOP VOSTRO15-{35}56{28}-D{12}{1-35-8}{24}5{BRSL} | Turbo Boost Frequency
TABLE V: Example dependencies found from the three real datasets

When discovering paradigm dependencies from the three datasets, the four thresholds were set to fixed values. As to the distance function d, we specified zero distance for identical characters, a smaller distance for different characters of the same type, and a larger distance for characters of different types.

Eventually, a number of dependencies were discovered from the Notebook, RAM and CPU datasets respectively, and all of these rules were manually verified to be true. With lower thresholds, the result set grows much larger.

In Table V, several discovered dependencies are listed in three groups, each group corresponding to a single dataset. The paradigm of each dependency is shown in a compact format, which is in fact a star-free regular expression: exactly one of the elements in braces '{ }' is supposed to appear, and at most one of the elements in square brackets '[ ]' may appear; delimiters between optional elements are omitted to save space. The locations are marked by underlines for intuition.

Take the first dependency as an example: for the Core i5 CPU models, the second element from the end can be a numeric digit from 0 to 9 except 8, and its value tells the model of the integrated video card.

V-C Efficiency

On Real Datasets

Fig. 4: Number of Merging Operations of Different Methods.

We examine the running time, the number of merging operations and the number of refining operations of the different methods. From Fig. 4 we can see that Baseline performs the highest number of mergings, Pruning- reduces that number by about two-thirds, and Pruning+ reduces it far more.

However, the time consumption in Fig. 5 shows a quite different result: Pruning+ remains the most efficient, while Pruning- becomes very inefficient. The reason is that although some merging operations are saved, even more time is wasted in the refining phase, since maintaining the necessary data structures becomes very costly when refining is performed too frequently. Fig. 6 compares the numbers of refining operations of Pruning+ and Pruning-.

Fig. 5: Time Consumption of Different Methods.
Fig. 6: Number of Refine Operations Comparison.

Fig. 7 shows statistics on the number of refines needed to select the best pair of paradigms. On all three datasets, Pruning+ refined the critical set only once most of the time (Line 9 of Algorithm 3); even in the worst case, only six refining operations were required.

Fig. 7: Statistics of Refine Counts in Pruning+.

On Synthetic Datasets. In Fig. 8, the number of Baseline's merging operations remains constant because it is unrelated to the strings' length. The pruning based methods reduce the number of mergings to a great extent, especially Pruning+.

Fig. 8: Number of Merging Operations When Varying String Length.

As on the real datasets, Pruning- has the worst efficiency, as shown in Fig. 9.

Fig. 9: Time Consumption When Varying String Length.

Fig. 10 and Fig. 11 illustrate the number of mergings and the time consumed by the different algorithms when varying the number of input strings. It can be observed that Baseline's number of mergings grows rapidly with the number of strings, while Pruning+ has almost linear time cost and merging count, making it the most efficient algorithm.

Fig. 10: Number of Merging Operations When Varying Number of Strings.
Fig. 11: Time Consumption When Varying Number of Strings.

From Fig. 12, it can be seen that changing the number of clusters or the variation ratio does not change the running time of Baseline and Pruning+ in any regular way.

Fig. 12: Time Consumption When Varying the Number of Clusters and the Variation Ratio.

VI Conclusion

In this paper, a new type of dependency, the Paradigm Dependency, has been proposed, in which the left-hand side is part of a regular-expression-like paradigm. To discover such dependencies, a framework has been proposed to align and cluster meaningful strings simultaneously. The aligning problem has been proved to be NP-complete, and a greedy algorithm has been introduced in which the clustering and aligning tasks are combined seamlessly. Due to the greedy algorithm's high time complexity, several pruning strategies with theoretical support were proposed to reduce the running time. The discovery of paradigm dependencies on the generated paradigms has then been defined and discussed, based on four measures. Finally, our methods' effectiveness and efficiency have been verified on three real-world datasets as well as on synthetic datasets.

For future work, we consider several directions: 1) In our framework, dependencies are discovered on already merged paradigms; in fact, correlations between a paradigm's columns and an attribute's values could help align the strings more wisely, so changing our two-step framework into an iterative and interactive one may improve effectiveness. 2) Due to the high complexity, we consider redesigning the greedy algorithm, e.g., by parallelization, or by trading off effectiveness against efficiency. 3) In the discovery phase, we assumed a single position on the left-hand side; it is possible that multiple elements of a string together indicate another attribute's value, which makes the problem more complicated. 4) The purpose of introducing paradigm dependencies is to handle dirty data, so it is natural to study data cleaning problems, such as inconsistent data repairing and missing value imputation, using these newly proposed rules.

Acknowledgment

This paper was partially supported by the Key Research and Development Plan of the National Ministry of Science and Technology under grant No. 2016YFB1000703, the Key Program of the National Natural Science Foundation of China under Grants No. 61190115, 61472099, 61632010 and U1509216, the National Sci-Tech Support Plan 2015BAH10F01, the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province LC2016026, and the MOE-Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology.

References

  • [1] D. W. Miller Jr, J. D. Yeast, and R. L. Evans, “Missing prenatal records at a birth center: A communication problem quantified,” in AMIA Annual Symposium Proceedings, vol. 2005.   American Medical Informatics Association, 2005, p. 535.
  • [2] N. Swartz, “Gartner warns firms of ‘dirty data’,” Information Management Journal, vol. 41, no. 3, p. 6, 2007.
  • [3] D. P. Ray and Y. Ghahremani, “Credit card statistics, industry facts, debt statistics,” last modified December, vol. 26, 2013.
  • [4] B. Otto, “From health checks to the seven sisters: The data quality journey at bt,” Ph.D. dissertation, University of St. Gallen, 2009.
  • [5] E. F. Codd, “A relational model of data for large shared data banks,” Commun. ACM, vol. 13, no. 6, pp. 377–387, Jun. 1970.
  • [6] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, “Conditional functional dependencies for capturing data inconsistencies,” ACM Transactions on Database Systems (TODS), vol. 33, no. 2, p. 6, 2008.
  • [7] P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, “Conditional functional dependencies for data cleaning,” in Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on.   IEEE, 2007, pp. 746–755.
  • [8] L. Bravo, W. Fan, F. Geerts, and S. Ma, “Increasing the expressivity of conditional functional dependencies without extra complexity,” in Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on.   IEEE, 2008, pp. 516–525.
  • [9] W. Fan, J. Li, S. Ma, N. Tang, and W. Yu, “Towards certain fixes with editing rules and master data,” Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 173–184, 2010.
  • [10] J. Wang and N. Tang, “Towards dependable data repairing with fixing rules,” in Proceedings of the 2014 ACM SIGMOD international conference on Management of data.   ACM, 2014, pp. 457–468.
  • [11] F. Chiang and R. J. Miller, “Discovering data quality rules,” PVLDB, vol. 1, no. 1, pp. 1166–1177, 2008.
  • [12] M. Arenas, L. Bertossi, and J. Chomicki, “Consistent query answers in inconsistent databases,” in Proceedings of the Eighteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ser. PODS ’99, 1999, pp. 68–79.
  • [13] W. Fan, “Dependencies revisited for improving data quality,” in Proceedings of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2008, June 9-11, 2008, Vancouver, BC, Canada, 2008, pp. 159–170.
  • [14] W. Fan, J. Li, S. Ma, N. Tang, and W. Yu, “Towards certain fixes with editing rules and master data,” VLDB J., vol. 21, no. 2, pp. 213–238, 2012.
  • [15] J. Wang and N. Tang, “Towards dependable data repairing with fixing rules,” in International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, 2014, pp. 457–468.
  • [16] S. Song, Y. Cao, and J. Wang, “Cleaning timestamps with temporal constraints,” PVLDB, vol. 9, no. 10, pp. 708–719, 2016.
  • [17] Z. Li, H. Wang, W. Shao, J. Li, and H. Gao, “Repairing data through regular expressions,” PVLDB, vol. 9, no. 5, pp. 432–443, 2016.
  • [18] Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen, “Efficient discovery of functional and approximate dependencies using partitions,” in Proceedings of the Fourteenth International Conference on Data Engineering, Orlando, Florida, USA, February 23-27, 1998, 1998, pp. 392–401.
  • [19] S. Lopes, J. Petit, and L. Lakhal, “Efficient discovery of functional dependencies and armstrong relations,” in Advances in Database Technology - EDBT 2000, 7th International Conference on Extending Database Technology, Konstanz, Germany, March 27-31, 2000, Proceedings, 2000, pp. 350–364.
  • [20] C. Wyss, C. Giannella, and E. L. Robertson, “Fastfds: A heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances - extended abstract,” in Proceedings of the Third International Conference on Data Warehousing and Knowledge Discovery, ser. DaWaK ’01, 2001, pp. 101–110.
  • [21] W. Fan, F. Geerts, L. V. S. Lakshmanan, and M. Xiong, “Discovering conditional functional dependencies,” in Proceedings of the 25th International Conference on Data Engineering, ICDE 2009, March 29 2009 - April 2 2009, Shanghai, China, 2009, pp. 1231–1234.
  • [22] X. Chu, I. F. Ilyas, and P. Papotti, “Discovering denial constraints,” PVLDB, vol. 6, no. 13, pp. 1498–1509, 2013.
  • [23] Z. Abedjan, C. G. Akcora, M. Ouzzani, P. Papotti, and M. Stonebraker, “Temporal rules discovery for web data cleaning,” PVLDB, vol. 9, no. 4, pp. 336–347, 2015.
  • [24] S. Guha, R. Rastogi, and K. Shim, “Cure: An efficient clustering algorithm for large databases,” SIGMOD Rec., vol. 27, no. 2, pp. 73–84, Jun. 1998.
  • [25] T. Zhang, R. Ramakrishnan, and M. Livny, “Birch: An efficient data clustering method for very large databases,” SIGMOD Rec., vol. 25, no. 2, pp. 103–114, Jun. 1996.
  • [26] G. Karypis, E. Han, and V. Kumar, “Chameleon: Hierarchical clustering using dynamic modeling,” IEEE Computer, vol. 32, no. 8, pp. 68–75, 1999.
  • [27] D. Maier, “The complexity of some problems on subsequences and supersequences,” J. ACM, vol. 25, no. 2, pp. 322–336, 1978.