Correlated pattern mining has become an increasingly important task in data mining, since these patterns convey knowledge about meaningful and surprising relations among data. Frequent correlated patterns have been thoroughly studied in the literature.
In this thesis, we propose to benefit from both frequent correlated and rare correlated patterns according to the bond correlation measure. Nevertheless, a main criticism addressed to correlated pattern extraction approaches is the high number of extracted patterns, which hinders their extensive use. To overcome this limit, we propose to extract, without information loss, a subset of the sets of frequent correlated and of rare correlated patterns; this subset is called a “Condensed Representation”. In this regard, we rely on notions derived from Formal Concept Analysis (FCA), specifically the equivalence classes associated to a closure operator dedicated to the bond measure, to introduce new concise representations of both frequent correlated and rare correlated patterns. We then design a new mining approach, called Gmjp, allowing the extraction of the sets of frequent correlated patterns, of rare correlated patterns and of their associated concise representations. In addition, we present the Regenerate algorithm, which allows querying the condensed representation associated to the set of rare correlated patterns, as well as the RcpRegeneration algorithm, dedicated to the regeneration of the whole set of rare correlated patterns from the representation.
The experimental studies carried out highlight the very encouraging compactness rates offered by the proposed concise representations and show the good performance of the Gmjp algorithm. To improve this performance, we introduce and evaluate an optimized version of Gmjp, which performs much better than the initial version. In order to show the usefulness of the extracted condensed representation, we conduct a classification process based on correlated association rules derived from closed correlated patterns and their associated minimal generators. The obtained rules are applied to the context of intrusion detection and achieve encouraging results.
Key Words: Formal Concept Analysis, Constraint Data Mining, Monotonicity, Anti-monotonicity, bond Correlation Measure, Itemset Extraction, Condensed Representation, Classification, Associative Rule.
Université de Tunis El-Manar Faculté des Sciences de Tunis
Presented in view of obtaining the Doctoral Degree in Computer Science by: Souad BOUASKER
Characterization and Extraction of Concise Representations of Correlated Patterns Based on Formal Concept Analysis
Thesis Committee
|Faouzi MOUSSA||Professor, Faculté des Sciences de Tunis||Chair|
|Nadia ESSOUSSI||Associate Professor, F.S.E.G de Nabeul||Reviewer|
|Philippe LENCA||Professor, Telecom Bretagne||Reviewer|
|Amel TOUZI GRISSA||Professor, ESIG de Kairouan||Examiner|
|Sadok BEN YAHIA||Professor, Faculté des Sciences de Tunis||Supervisor|
2 November 2016
Laboratoire Informatique de Programmation, Algorithmique et Heuristiques LIPAH
I would like to thank the members of my thesis committee. I thank Professor Faouzi MOUSSA for agreeing to chair my thesis committee.
I would like to thank Associate-Professor Nadia ESSOUSSI and Professor Philippe LENCA for agreeing to review my thesis report, and for providing me with detailed corrections and interesting comments.
I would like to thank Professor Amel GRISSA TOUZI for participating in the thesis committee.
I want to express my deep thanks to my thesis supervisor, Professor Sadok BEN YAHIA, for trusting me and for allowing me to grow as a research scientist. During the whole period of study, Professor BEN YAHIA contributed by giving me intellectual freedom in my work, engaging me in new creative ideas, supporting my participation in various conferences, and requiring a high quality of work in all my efforts.
I want to express my special thanks to Assistant-Professor Tarek HAMROUNI for collaborating in the realization of different phases of my thesis project. I am very grateful for all the efforts he made to ensure the high quality of this research. I greatly benefited from his scientific insight, his high level of expertise in the field of data mining and his ability to explore possible improvements in order to deepen our research.
My sincere thanks go to all the members of the LIPAH Laboratory of the Faculty of Sciences of Tunis, for their friendship and for the encouraging and warm atmosphere during the last years.
I cannot finish without thanking my family. A special dedication goes to my precious treasure, my mother Radhia BENFRADJ BOUASKER, for supporting and encouraging me during my studies. I would also like to thank my brothers and my sisters for providing assistance in numerous ways. I want to express my gratitude to my husband, dear Mohamed; without his understanding and encouragement, I could not have accomplished this project. A particular thought goes to my lovely baby girl Meriam, my angel baby, the greatest joy of my life.
My Doctoral Graduation is dedicated to the memory of my beloved father, Miled BOUASKER. I am honored to have you as a father. Thank you for your great trust, for teaching me strength and patience, and for motivating me to always keep reaching for excellence.
Thank you for the father you were.
Correlated pattern mining is an increasingly attractive research direction in data mining, thanks to the quality and usefulness of the knowledge these patterns offer. More precisely, frequent correlated patterns have been widely studied in the literature.
Our objective in this thesis is to benefit from the knowledge offered by both frequent correlated patterns and rare correlated patterns according to the bond correlation measure. However, a main problem of correlated pattern mining is the often very high number of extracted correlated patterns. Such a number hinders an optimal and easy exploitation of the knowledge encapsulated in these patterns. To address this problem, we focus in this thesis on the extraction, without information loss, of a subset of the set of all correlated patterns. This subset, the core of itemsets, is called a “Concise Representation”, and all redundant patterns can be regenerated from it without information loss. The goal of such a representation is to minimize the number of extracted patterns while preserving the hidden and relevant knowledge.
To achieve this objective, we rely on notions derived from Formal Concept Analysis (FCA). More precisely, the condensed representations we propose stem from the notion of equivalence classes induced by the closure operator associated to the bond correlation measure. After characterizing the proposed condensed representations, we introduce the Gmjp algorithm, dedicated to the extraction of frequent correlated patterns, of rare correlated patterns and of their associated condensed representations. We also present the Regenerate algorithm for querying the representation associated to the set of rare correlated patterns, and we propose the RCPRegeneration algorithm, dedicated to the regeneration of the whole set of rare correlated patterns from the concise representation.
The experimental evaluation highlights the very interesting compactness rates offered by the different proposed concise representations and also confirms the encouraging performance of the Gmjp approach. In order to improve the performance of the Gmjp algorithm, we propose an optimized version of Gmjp. This optimized version shows much lower execution times than the initial version. Moreover, we conduct an associative classification process based on correlated association rules derived from closed correlated patterns and their minimal generators. The classification results on intrusion detection data are very encouraging and demonstrate the great usefulness of correlated pattern mining.
Key Words: Formal Concept Analysis, Constraint-Based Mining, Monotonicity, Anti-monotonicity, bond Measure, Pattern Extraction, Concise Representation, Classification, Association Rules.
- 1 Introduction
I Review of Correlated Patterns Mining
2 Basic Notions
- 2.1 Introduction
- 2.2 Search Space
- 2.3 Formal Concepts Analysis
- 2.4 Conclusion
- 3 Correlated Patterns Mining: Review of the Literature
II Condensed Representations of Correlated Patterns
4 Condensed Representations of Correlated Patterns
- 4.1 Introduction
- 4.2 Motivations behind our choice of the bond measure
- 4.3 Characterization of the Correlated patterns according to the bond measure
- 4.4 The closure operator
- 4.5 Condensed representations of rare correlated patterns
- 4.6 Condensed representation of frequent correlated patterns
- 4.7 Conclusion
5 Extraction Approach of Correlated Patterns and associated Condensed Representations
- 5.1 Introduction
- 5.2 Integration mechanism of the constraints
5.3 The Gmjp approach
- 5.3.1 Overview of the approach
- 5.3.2 First Step: The power of the bit vectors and of co-occurrent vectors
- 5.3.3 Second Step: Getting the Local Minimal and the Local Closed Rare Correlated Patterns without closure computations
- 5.3.4 Third Step: Filtering the Global Minimal and the Global Closed Rare Correlated patterns
- 5.3.5 Running example
- 5.4 Opt-Gmjp: The optimized version of Gmjp
- 5.5 Theoretical Time Complexity
- 5.6 The query and the regeneration strategies
- 5.7 Conclusion
III Experiments and Classification Process
6 Experimental Validation
- 6.1 Introduction
- 6.2 Experimental Environment
- 6.3 Evaluation of the compactness rates offered by the proposed representations
- 6.4 Evaluation of the running time of Gmjp
- 6.5 Optimizations and Evaluations
- 6.6 Conclusion
7 Associative-Classification Process based on Correlated Patterns
- 7.1 Introduction
- 7.2 Overview of association rules
- 7.3 Extraction of the generic bases of association rules
- 7.4 Association rule-based classification process
- 7.5 Classification of Intrusion Detection Data
- 7.6 Application of the representation on Micro-array gene expression data
- 7.7 Conclusion
- IV Conclusion
List of Figures
- 2.1 Characterization of a Conjunctive Equivalence Class.
- 4.1 Localization of the correlated patterns for minsupp = 4 and minbond = 0.2 according to the extraction context given in Table 2.1.
- 4.2 An example of rare correlated equivalence classes for minsupp = 4 and minbond = 0.2.
- 5.1 Effect of the integration of the correlation constraint for minsupp = 3 and minbond = 0.3.
- 5.2 Effect of the application of the rarity constraint for minsupp = 3 and minbond = 0.2.
- 5.3 Overview of Gmjp approach.
- 5.4 Overview of Gmjp when extracting the representation.
- 5.5 The BSVectors and the COVectors associated to the items of the extraction context .
- 5.6 Mining Local Minimal and Local Closed Rare Correlated Patterns for the item .
- 5.7 An Overview of the Opt-Gmjp algorithm.
- 5.8 The tree data-structure used within the Opt-Gmjp approach and associated to the extraction context given by Table 5.3.
- 6.1 Sizes of the different representations when minsupp varies and minbond is fixed.
- 6.2 Sizes of the different representations when minbond varies and minsupp is fixed.
- 7.1 The STRING compact network view.
List of Tables
- 2.1 An example of an Extraction Context .
- 3.1 Summary of the properties of the studied correlation measures and coefficients.
- 3.2 Comparison between the correlated patterns mining approaches.
- 5.1 The notations used within the OptGmjp algorithm.
- 5.2 The Initial extraction context .
- 6.1 Characteristics of the benchmark datasets.
- 6.2 Compactness Rates associated to the set vs. the set.
- 6.3 Performance Analysis of Gmjp on UCI benchmarks (Time in seconds).
- 6.4 Impact of the variation of minsupp, for the three steps of the Gmjp algorithm (Time in seconds).
- 6.5 Impact of the variation of minbond, for the three steps of the Gmjp algorithm (Time in seconds).
- 6.6 Impact of the rarity threshold minsupp variation.
- 6.7 Impact of the correlation threshold minbond variation.
- 6.8 Performance comparison of Opt-Gmjp vs. Gmjp while varying minsupp (Time in seconds).
- 6.9 Performance comparison of Opt-Gmjp vs. Gmjp while varying minbond (Time in seconds).
- 6.10 Summarized Comparison of the Performance of Gmjp vs. Optimized Gmjp (Time in seconds).
- 6.11 Performance comparison of our Improved Opt-Gmjp vs. Jim [Segond and Borgelt, 2011] (Time in seconds).
- 7.1 Evaluation of the classification accuracy versus minbond variation for frequent and rare correlated patterns.
- 7.2 Evaluation of the classification accuracy of frequent patterns vs rare patterns when minconf varies.
- 7.3 Evaluation of the rare correlated association rules for the KDD 99 dataset.
- 7.4 Comparison between the proposed rare correlated association rules based classifier versus the state of the art approaches.
- 7.5 An example of Micro-array data.
- 7.6 Description of a sample of genes of the GSE1379 dataset.
- 7.7 An example of the discretized Micro-array data.
- 7.8 Discretized values substituted by their identifiers.
- 7.9 Execution Times and number of extracted association rules.
- 7.10 Association rules: Expression levels Expression levels.
List of Algorithms
Chapter 1 Introduction
1.1 Introduction and Motivations
The development of new information and communication technologies and the globalization of markets have intensified competition among companies. As a result, the need for access to accurate information for decision-making is increasingly urgent. The actual problem lies in the lack of access to relevant information within large amounts of data. The data collected in various fields keep growing, which motivates the need to analyze and interpret them in order to extract useful knowledge.
In this context, knowledge discovery from databases (KDD) is a complete process aiming to extract useful, hidden knowledge from huge amounts of data [Agrawal and Srikant, 1994]. Data mining is one of the main steps of this process and is dedicated to offering the tools needed for an optimal exploration of data. Many state-of-the-art approaches have focused on frequent itemset extraction and association rule generation. Nevertheless, two main problems hinder the good use of the knowledge returned from the set of frequent itemsets. The first problem is related to the quality of the offered knowledge, since the degree of correlation of the extracted itemsets may not be interesting for the end user. The second problem is related to the often huge quantity of extracted knowledge.
To overcome these problems, many previous works propose to integrate correlation measures within the mining process [Brin et al., 1997, Lee et al., 2003, Omiecinski, 2003, Kim et al., 2004, Xiong et al., 2006]. Correlated pattern mining has been shown to be more complex but more informative than traditional frequent pattern mining. In fact, correlated patterns offer precise information about the degree of occurrence of the items composing a given itemset [Segond and Borgelt, 2011]. This key information specifies how frequently the items appear together, i.e., their co-occurrence, as well as how frequently they appear at all, i.e., their occurrence.
Other state-of-the-art approaches deal with the extraction, without information loss, of a subset of the whole set of correlated patterns. This subset is named a “Condensed Representation”, and all the redundant correlated patterns can be derived from it. Condensed representations have proved highly useful in different fields such as bioinformatics [Martinez et al., 2009] and data grids [Hamrouni et al., 2015].
The main objective behind defining such a condensed representation is to reduce the number of extracted patterns while preserving the same amount of pertinent knowledge. In addition, all of the association rules derived from correlated patterns fulfilling a correlation measure such as all-confidence or bond are valid with respect to the minimal support and minimal confidence thresholds [Omiecinski, 2003].
Frequent correlated itemset mining has thus been shown to be an interesting task in data mining. Since its inception, this key task has attracted the interest of many researchers, since it meets the needs of experts in several application fields [Ben Younes et al., 2010], such as market basket analysis. However, frequent correlated patterns are not an attractive solution for some other applications, e.g., intrusion detection, analysis of the genetic confusion from biological data, or detection of rare diseases from medical data, to cite but a few [Koh and Rountree, 2010, Mahmood et al., 2010, Romero et al., 2010, Szathmary et al., 2010, Manning et al., 2008]. As an illustration of the application of rare correlated patterns in the field of medicine, a rare combination of symptoms can provide useful insights for doctors.
To the best of our knowledge, no previous work has dealt with both frequent correlated and rare correlated patterns according to a specified correlation metric. Motivated by this issue, we propose in this thesis to benefit from the knowledge returned by both frequent correlated and rare correlated patterns according to the bond correlation measure. To solve this challenging problem, we propose an efficient algorithmic framework, called Gmjp, allowing the extraction of frequent correlated patterns, of rare correlated patterns and of their associated concise representations.
Our first contribution consists in defining and studying the characteristics of the condensed representations associated to frequent correlated patterns as well as those associated to rare correlated ones. In this respect, we rely on notions derived from Formal Concept Analysis (FCA) [Ganter and Wille, 1999], specifically the equivalence classes associated to a closure operator dedicated to the bond measure, to introduce our new concise representations of both frequent correlated and rare correlated patterns. The first concise representation associated to the set of rare correlated patterns is composed of the maximal elements of the rare correlated equivalence classes, called the “Closed Rare Correlated Patterns set”, together with their associated minimal generators, called the “Minimal Rare Correlated Patterns set”. Two other optimizations of this representation are also proposed. The first optimization is composed of the whole set of closed rare correlated patterns together with the minimal elements of the set of minimal rare correlated patterns. The second optimization is composed of the maximal elements of the set of closed rare correlated patterns together with the whole set of minimal rare correlated patterns. We prove that both of these representations are also concise and exact. Our third optimized representation is a condensed approximate representation. It is composed of the maximal elements of the set of closed rare correlated patterns together with the minimal elements of the set of minimal rare correlated patterns. As for the set of frequent correlated patterns, the condensed exact representation is composed of the Closed Correlated Frequent Patterns. We prove the theoretical properties of accuracy and compactness of all the proposed representations.
Our second contribution is the design and implementation of a new mining approach, called Gmjp, allowing the extraction of the sets of frequent correlated patterns, of rare correlated patterns and of their associated concise representations. Gmjp is a sophisticated mining approach that allows the simultaneous integration of two opposite paradigms, namely monotonic and anti-monotonic constraints. In addition, we present the Regenerate algorithm, which allows querying the condensed representation associated to the set of rare correlated patterns, as well as the RcpRegeneration algorithm, dedicated to the regeneration of the whole set of rare correlated patterns from the representation.
Our third contribution consists in proposing an optimized version of Gmjp, which performs much better than the initial version. In order to show the usefulness of the extracted condensed representation, we conduct a classification process based on correlated association rules derived from closed correlated patterns and their associated minimal generators. The obtained rules are applied to the context of intrusion detection and achieve promising results.
The evaluation protocol of our approaches consists of experimental studies carried out over dense and sparse benchmark datasets commonly used in evaluating data mining contributions. The evaluation of the classification process is based on the KDD 99 database of intrusion detection data. We also apply the representation to the extraction of rare correlated association rules from micro-array gene expression data related to breast cancer. The diverse association rules obtained reveal a variety of relationships between up-regulated and down-regulated gene expressions.
1.3 Thesis Organization
The remainder of this thesis is organized as follows:
Chapter 2 introduces the basic notions related to the itemset search space and to itemset extraction. We also define two distinct categories of constraints: monotonic and anti-monotonic. We then introduce the framework of Formal Concept Analysis (FCA), which offers the basis for our approaches, specifically the notions of closure operator, minimal generator, closed pattern, equivalence class and condensed representation of a set of patterns.
Chapter 3 offers an overview of the state-of-the-art approaches dealing with correlated pattern mining. We start this chapter by defining the most used correlation measures. Then, we continue with the approaches related to frequent correlated patterns, followed by the state of the art of rare correlated patterns, and finally an overview of the algorithms focusing on condensed representations of correlated patterns.
Chapter 4 focuses on characterizing the set of frequent correlated patterns as well as the set of rare correlated patterns. It introduces the condensed exact and approximate representations associated to the set of rare correlated patterns as well as the concise exact representation associated to the set of frequent correlated patterns. The main content of this chapter was published in [Bouasker et al., 2012b] and in [Bouasker et al., 2015].
Chapter 5 introduces the Gmjp approach, allowing the extraction of the sets of frequent correlated patterns, of rare correlated patterns and of their associated concise representations. The optimized version of Gmjp, named Opt-Gmjp, is also presented. This chapter also presents an approximation of the theoretical complexity of Gmjp. In addition, it describes the Regenerate algorithm, which allows querying the condensed representation associated to the set of rare correlated patterns, as well as the RcpRegeneration algorithm, dedicated to the regeneration of the whole set of rare correlated patterns from the representation. The main content of this chapter was published in [Bouasker et al., 2012a] and in [Bouasker et al., 2015].
Chapter 6 focuses on the experimental validation of the proposed approaches. The evaluation process is based on two main axes: the first is related to the compactness rates of the condensed representations, while the second concerns the running time. This chapter also evaluates the optimized version of Gmjp, which performs much better than Gmjp over different benchmark datasets. The content related to the optimizations and evaluations was published in [Bouasker and Ben Yahia, 2015].
Chapter 7 describes the classification process based on correlated patterns. This chapter starts by presenting the framework of association rule extraction and clarifies the properties of the generic bases of association rules. Then, we continue with a detailed presentation of the application of both frequent correlated and rare correlated patterns to the classification of some UCI benchmark datasets. We also present the application of rare correlated patterns to the classification of intrusion detection data from the KDD 99 dataset. The obtained results show the usefulness of our proposed classification method over four different intrusion classes. This chapter concludes with the application of the representation to the extraction of rare correlated association rules from micro-array gene expression data. These extracted rules aim to identify relations among up-regulated and down-regulated gene expressions. The main content of this chapter was published in [Bouasker et al., 2012c] and in [Bouasker and Ben Yahia, 2013].
Chapter 8 concludes the thesis and sketches out our perspectives for future work.
Part I Review of Correlated Patterns Mining
Chapter 2 Basic Notions
The extraction of correlated patterns has been shown to be more complex but more informative than traditional frequent pattern mining. In fact, correlated patterns exhibit a strong link among the items they are composed of, and they have proved highly useful in many real-life application fields.
This chapter is dedicated to the introduction of the basic notions needed for the presentation of our approaches. The second section deals with the basic notions related to the search space and to itemset extraction. The third section then presents the foundations of the Formal Concept Analysis (FCA) framework [Ganter and Wille, 1999]. The last section concludes the chapter.
2.2 Search Space
We begin by presenting the key notions related to itemset extraction that will be used throughout this thesis. First, let us define an extraction context.
2.2.1 Extraction Context
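An extraction context (also called Context or Dataset) is represented by a triplet $\mathcal{C} = (\mathcal{T}, \mathcal{I}, \mathcal{R})$, where $\mathcal{T}$ and $\mathcal{I}$ are, respectively, finite sets of transactions (or objects) and of items (or attributes), and $\mathcal{R} \subseteq \mathcal{T} \times \mathcal{I}$ is a binary relation between the transactions and the items. A couple $(t, i) \in \mathcal{R}$ if the transaction $t \in \mathcal{T}$ contains the item $i \in \mathcal{I}$.
An example of an extraction context $\mathcal{C} = (\mathcal{T}, \mathcal{I}, \mathcal{R})$ is given by Table 2.1. In this context, the transaction set (resp. the object set) is $\mathcal{T} = \{1, 2, 3, 4, 5\}$ and the item set is $\mathcal{I} = \{A, B, C, D, E\}$. The couple $(2, B) \in \mathcal{R}$ since the transaction 2 contains the item B.
We note, for the sake of accuracy, that the notions of transaction database and extraction context have the same meaning throughout this thesis. They are denoted by $\mathcal{C}$.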
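As an illustration, an extraction context can be sketched as a mapping from transaction identifiers to item sets. The dataset below is a hypothetical five-transaction context, chosen only so that it agrees with the facts quoted in the text (for example, transaction 2 contains item B); it is an assumption and not a reproduction of Table 2.1. The names `CONTEXT`, `transactions`, `items` and `relation` are illustrative.

```python
# Hypothetical extraction context: transaction id -> set of items.
# Assumption: this data is illustrative, not copied from Table 2.1.
CONTEXT = {
    1: {"A", "C", "D"},
    2: {"B", "C", "E"},
    3: {"A", "B", "C", "E"},
    4: {"B", "E"},
    5: {"A", "B", "C", "E"},
}


def transactions(context):
    """Return the transaction set T."""
    return set(context)


def items(context):
    """Return the item set I (every item occurring in some transaction)."""
    return set().union(*context.values())


def relation(context):
    """Return the binary relation R as a set of (transaction, item) couples."""
    return {(t, i) for t, contained in context.items() for i in contained}


if __name__ == "__main__":
    print(sorted(transactions(CONTEXT)))
    print(sorted(items(CONTEXT)))
    print((2, "B") in relation(CONTEXT))
```

Storing the relation row by row keeps membership tests such as `(t, i) in relation(context)` close to the formal definition, at the cost of rebuilding the couple set on each call.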
Itemset or Pattern
A transaction $t \in \mathcal{T}$, having an identifier denoted by TID (Tuple IDentifier), contains a non-empty set of items belonging to $\mathcal{I}$. A subset $I$ of $\mathcal{I}$ where $k = |I|$ is called a $k$-pattern or simply a pattern, and $k$ represents the cardinality of $I$. The number of transactions of a context $\mathcal{C}$ containing a pattern $I$, $|\{t \in \mathcal{T} \mid \forall i \in I: (t, i) \in \mathcal{R}\}|$, is called the absolute support of $I$ and is denoted $\mathit{Supp}(\wedge I)$. The relative support of $I$, or the frequency of $I$, denoted $\mathit{freq}(I)$, is the quotient of the absolute support by the total number of transactions of $\mathcal{C}$, i.e., $\mathit{freq}(I) = \mathit{Supp}(\wedge I) / |\mathcal{T}|$.
We point out that, throughout this thesis, we are mainly interested in itemsets, i.e., sets of items, as a kind of pattern. Consequently, we use a form without separators to denote an itemset. For example, BD stands for the itemset composed of the items B and D.
2.2.2 Supports of a Pattern
To evaluate an itemset, many interesting measures can be used. The most common ones are presented by Definition 3.
Supports of a Pattern
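Let $\mathcal{C} = (\mathcal{T}, \mathcal{I}, \mathcal{R})$ be an extraction context and $I \subseteq \mathcal{I}$ a non-empty itemset. We distinguish three kinds of supports for an itemset $I$:
- The conjunctive support: $\mathit{Supp}(\wedge I) = |\{t \in \mathcal{T} \mid \forall i \in I: (t, i) \in \mathcal{R}\}|$,
- The disjunctive support: $\mathit{Supp}(\vee I) = |\{t \in \mathcal{T} \mid \exists i \in I: (t, i) \in \mathcal{R}\}|$, and,
- The negative support: $\mathit{Supp}(\neg I) = |\{t \in \mathcal{T} \mid \forall i \in I: (t, i) \notin \mathcal{R}\}|$.
More explicitly, for an itemset $I$, the supports are defined as follows:
$\mathit{Supp}(\wedge I)$: is equal to the number of transactions containing all the items of $I$.
$\mathit{Supp}(\vee I)$: is equal to the number of transactions containing at least one item of $I$.
$\mathit{Supp}(\neg I)$: is equal to the number of transactions that do not contain any item of $I$.
It is important to note that the “De Morgan” law ensures the transition between the disjunctive and the negative support of an itemset $I$ as follows: $\mathit{Supp}(\neg I) = |\mathcal{T}| - \mathit{Supp}(\vee I)$.
Let us consider the extraction context given by Table 2.1, which will be used throughout the different examples. We have $\mathit{Supp}(\wedge AD) = |\{1\}| = 1$, $\mathit{Supp}(\vee AD) = |\{1, 3, 5\}| = 3$, and $\mathit{Supp}(\neg (AD)) = |\{2, 4\}| = 2$.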
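The three supports can be sketched directly from their definitions. The context used below is a hypothetical dataset (an assumption, not a copy of Table 2.1) chosen so that it reproduces the values quoted above for the itemset AD; the function names are illustrative.

```python
# Hypothetical extraction context: transaction id -> set of items.
# Assumption: illustrative data consistent with the supports quoted in the text.
CONTEXT = {
    1: {"A", "C", "D"},
    2: {"B", "C", "E"},
    3: {"A", "B", "C", "E"},
    4: {"B", "E"},
    5: {"A", "B", "C", "E"},
}


def conjunctive_support(context, itemset):
    """Number of transactions containing ALL the items of the itemset."""
    wanted = set(itemset)
    return sum(1 for contained in context.values() if wanted <= contained)


def disjunctive_support(context, itemset):
    """Number of transactions containing AT LEAST ONE item of the itemset."""
    wanted = set(itemset)
    return sum(1 for contained in context.values() if wanted & contained)


def negative_support(context, itemset):
    """Number of transactions containing NO item of the itemset."""
    wanted = set(itemset)
    return sum(1 for contained in context.values() if not (wanted & contained))


if __name__ == "__main__":
    for name, fn in [("conjunctive", conjunctive_support),
                     ("disjunctive", disjunctive_support),
                     ("negative", negative_support)]:
        print(name, fn(CONTEXT, "AD"))
```

An itemset is passed as a string such as `"AD"`, matching the separator-free notation used in the text. The De Morgan relation can be checked by comparing `negative_support` with `len(CONTEXT) - disjunctive_support`.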
In the following, if there is no risk of confusion, the conjunctive support will simply be called the support. Note that $\mathit{Supp}(\wedge \emptyset) = |\mathcal{T}|$ since the empty set is included in all transactions, while $\mathit{Supp}(\vee \emptyset) = 0$ since the empty set does not contain any item. Moreover, $\forall i \in \mathcal{I}$, $\mathit{Supp}(\wedge i) = \mathit{Supp}(\vee i)$, while in the general case, for $I \subseteq \mathcal{I}$ and $I \neq \emptyset$, $\mathit{Supp}(\wedge I) \leq \mathit{Supp}(\vee I)$. A pattern $I$ is said to be frequent if $\mathit{Supp}(\wedge I)$ is greater than or equal to a user-specified minimum support threshold, denoted minsupp [Agrawal and Srikant, 1994]. The following lemma shows the links that exist between the different supports of a non-empty pattern $I$. These links are based on the inclusion-exclusion identities [Galambos and Simonelli, 2000].
- Inclusion-exclusion identities - The inclusion-exclusion identities ensure the links between the conjunctive, disjunctive and negative supports of a non-empty pattern $I$.
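For reference, the inclusion-exclusion identities are usually written as follows. This is a sketch in the standard form of these identities, using the notation $\mathit{Supp}(\wedge I)$, $\mathit{Supp}(\vee I)$ and $\mathit{Supp}(\neg I)$ for the three supports; the exact typesetting of the lemma in the thesis is an assumption.

```latex
\begin{align*}
\mathit{Supp}(\wedge I) &= \sum_{\emptyset \subset I_1 \subseteq I} (-1)^{|I_1| - 1}\, \mathit{Supp}(\vee I_1) \\
\mathit{Supp}(\vee I)   &= \sum_{\emptyset \subset I_1 \subseteq I} (-1)^{|I_1| - 1}\, \mathit{Supp}(\wedge I_1) \\
\mathit{Supp}(\neg I)   &= |\mathcal{T}| - \mathit{Supp}(\vee I)
\end{align*}
```

For a two-item pattern $I = ab$, the second identity reduces to $\mathit{Supp}(\vee ab) = \mathit{Supp}(\wedge a) + \mathit{Supp}(\wedge b) - \mathit{Supp}(\wedge ab)$, which is the familiar counting rule for the union of two sets of transactions.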
2.2.3 Frequent Itemset - Rare Itemset - Correlated Itemset
Given a minimal threshold of support [Agrawal and Srikant, 1994], we distinguish between two kinds of patterns, frequent patterns and infrequent patterns (also called Rare patterns).
Frequent Itemset - Rare Itemset
Let $\mathcal{C} = (\mathcal{T}, \mathcal{I}, \mathcal{R})$ be an extraction context and minsupp a minimal threshold of the conjunctive support. An itemset $I \subseteq \mathcal{I}$ is said to be frequent if $\mathit{Supp}(\wedge I) \geq$ minsupp. Otherwise, $I$ is said to be infrequent or rare.
Let minsupp = 2. Since $\mathit{Supp}(\wedge BCE) = 3 \geq 2$, the pattern BCE is a frequent pattern. However, the pattern CD is a rare pattern since $\mathit{Supp}(\wedge CD) = 1 < 2$.
In the following, we need to define the smallest rare patterns with respect to set inclusion. They correspond to the rare patterns all of whose proper subsets are frequent, and are defined as follows:
Minimal rare patterns
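The set of minimal rare patterns is composed of rare patterns having no rare proper subsets. This set is defined as: $\mathit{MinRP} = \{I \subseteq \mathcal{I} \mid \mathit{Supp}(\wedge I) <$ minsupp and $\forall I_1 \subset I: \mathit{Supp}(\wedge I_1) \geq$ minsupp$\}$.
Let us consider the extraction context sketched by Table 2.1. For minsupp = 4, we have $\mathit{MinRP} = \{A, D, BC, CE\}$. $A$ and $D$ are minimal rare items, and $BC$ is a minimal rare itemset since it is composed of two frequent items: $B$ with $\mathit{Supp}(\wedge B) = 4$ and $C$ with $\mathit{Supp}(\wedge C) = 4$.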
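The definition of minimal rare patterns can be sketched as a brute-force enumeration. The context below is a hypothetical dataset (an assumption, not a copy of Table 2.1) consistent with the supports quoted in the text; `support` and `minimal_rare_patterns` are illustrative names, and the exhaustive enumeration is only meant for small examples, not as the mining strategy of the thesis.

```python
from itertools import combinations

# Hypothetical extraction context: transaction id -> set of items.
# Assumption: illustrative data consistent with the supports quoted in the text.
CONTEXT = {
    1: {"A", "C", "D"},
    2: {"B", "C", "E"},
    3: {"A", "B", "C", "E"},
    4: {"B", "E"},
    5: {"A", "B", "C", "E"},
}


def support(context, itemset):
    """Conjunctive support: transactions containing all items of the itemset."""
    wanted = set(itemset)
    return sum(1 for contained in context.values() if wanted <= contained)


def minimal_rare_patterns(context, minsupp):
    """Rare patterns whose immediate subsets are all frequent.

    Checking the subsets of size k-1 is enough: since frequency is
    anti-monotone, all smaller subsets are then frequent as well.
    """
    universe = sorted(set().union(*context.values()))
    result = []
    for k in range(1, len(universe) + 1):
        for candidate in combinations(universe, k):
            if support(context, candidate) >= minsupp:
                continue  # frequent, hence not rare
            subsets = combinations(candidate, k - 1)
            if all(support(context, s) >= minsupp for s in subsets):
                result.append("".join(candidate))
    return result


if __name__ == "__main__":
    print(minimal_rare_patterns(CONTEXT, minsupp=4))
```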
In fact, in order to reduce the high number of frequent itemsets and to improve the quality of the extracted frequent itemsets, other interestingness measures apart from the conjunctive support are introduced within the mining process. These are called “Correlation Measures”. The itemsets fulfilling a given correlation measure are called “Correlated Itemsets”. This type of itemset is defined in a generic way in what follows:
Let M be a correlation measure and minCorr a minimal correlation threshold. An itemset $I \subseteq \mathcal{I}$ is said to be correlated according to the measure M if M($I$) $\geq$ minCorr. Otherwise, $I$ is said to be non-correlated.
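As a concrete instance of this generic definition, the sketch below plugs in the bond measure, taken here in its usual form as the ratio of the conjunctive support to the disjunctive support. Both the formula as coded and the dataset are assumptions for illustration: the context is hypothetical (not a copy of Table 2.1), and `bond` and `is_correlated` are illustrative names.

```python
# Hypothetical extraction context: transaction id -> set of items.
# Assumption: illustrative data consistent with the supports quoted in the text.
CONTEXT = {
    1: {"A", "C", "D"},
    2: {"B", "C", "E"},
    3: {"A", "B", "C", "E"},
    4: {"B", "E"},
    5: {"A", "B", "C", "E"},
}


def bond(context, itemset):
    """bond(I) = Supp(AND I) / Supp(OR I), for a non-empty itemset I."""
    wanted = set(itemset)
    if not wanted:
        raise ValueError("bond is only defined for non-empty itemsets")
    conj = sum(1 for contained in context.values() if wanted <= contained)
    disj = sum(1 for contained in context.values() if wanted & contained)
    # If no item of the itemset occurs anywhere, report 0.0 instead of dividing by 0.
    return conj / disj if disj else 0.0


def is_correlated(context, itemset, min_corr, measure=bond):
    """Generic test M(I) >= minCorr for any measure M with the same signature."""
    return measure(context, itemset) >= min_corr


if __name__ == "__main__":
    print(bond(CONTEXT, "AD"))
    print(is_correlated(CONTEXT, "AD", min_corr=0.2))
```

Passing the measure as a parameter mirrors the generic definition: any other correlation measure with the same signature can be substituted for `bond`.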
2.2.4 Categories of Constraints
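Besides the minimal frequency constraint expressed by the minsupp threshold, other constraints can be integrated within the itemset extraction process. These constraints are of two distinct types, “monotonic constraints” and “anti-monotonic constraints” [Bonchi and Lucchese, 2006].
A constraint $Q$ is anti-monotone if $\forall I \subseteq \mathcal{I}$, $\forall I_1 \subseteq I$: $I$ fulfills $Q$ $\Rightarrow$ $I_1$ fulfills $Q$.
A constraint $Q$ is monotone if $\forall I \subseteq \mathcal{I}$, $\forall I_1 \supseteq I$: $I$ fulfills $Q$ $\Rightarrow$ $I_1$ fulfills $Q$.
The frequency constraint, i.e., having a support greater than or equal to minsupp, is an anti-monotonic constraint. In fact, $\forall I, I_1 \subseteq \mathcal{I}$, if $I_1 \subseteq I$ and $\mathit{Supp}(\wedge I) \geq$ minsupp, then $\mathit{Supp}(\wedge I_1) \geq$ minsupp since $\mathit{Supp}(\wedge I_1) \geq \mathit{Supp}(\wedge I)$.
Dually, the constraint of rarity, i.e., having a support strictly lower than minsupp, is a monotonic constraint. In fact, $\forall I, I_1 \subseteq \mathcal{I}$, if $I_1 \supseteq I$ and $\mathit{Supp}(\wedge I) <$ minsupp, then $\mathit{Supp}(\wedge I_1) <$ minsupp since $\mathit{Supp}(\wedge I_1) \leq \mathit{Supp}(\wedge I)$.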
The conjunction of anti-monotonic constraints (resp. monotonic) is an anti-monotonic (resp. monotonic) constraint.
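These properties can be checked exhaustively on a small example. The sketch below uses a toy context assumed for illustration, and hypothetical helper names (`is_anti_monotone`, `is_monotone`, `frequent_and_short`); it only verifies the properties on this particular context and does not replace the general proof.

```python
from itertools import combinations

# Toy extraction context assumed for illustration.
CONTEXT = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"),
           frozenset("BE"), frozenset("ABCE")]
ITEMS = sorted(set().union(*CONTEXT))
# Every itemset of the lattice, from the empty set up to the full item set.
LATTICE = [frozenset(c) for k in range(len(ITEMS) + 1)
           for c in combinations(ITEMS, k)]
MINSUPP = 4


def supp(itemset):
    return sum(1 for t in CONTEXT if itemset <= t)


def is_frequent(itemset):
    return supp(itemset) >= MINSUPP


def is_rare(itemset):
    return supp(itemset) < MINSUPP


def frequent_and_short(itemset):
    # Conjunction of two anti-monotonic constraints.
    return is_frequent(itemset) and len(itemset) <= 2


def is_anti_monotone(constraint):
    """Whenever an itemset fulfills the constraint, so do all its subsets."""
    return all(constraint(sub)
               for i in LATTICE if constraint(i)
               for sub in LATTICE if sub <= i)


def is_monotone(constraint):
    """Whenever an itemset fulfills the constraint, so do all its supersets."""
    return all(constraint(sup)
               for i in LATTICE if constraint(i)
               for sup in LATTICE if sup >= i)


print(is_anti_monotone(is_frequent), is_monotone(is_rare),
      is_anti_monotone(frequent_and_short))
```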
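A subset $S$ of the power-set $\mathcal{P}(\mathcal{I})$ of the set of items is an order ideal if it fulfills the following properties:
If $I \in S$, then $\forall I_1 \subseteq I$: $I_1 \in S$.
If $I \notin S$, then $\forall I_1 \supseteq I$: $I_1 \notin S$.
A subset $S$ of $\mathcal{P}(\mathcal{I})$ is an order filter if it fulfills the following properties:
If $I \in S$, then $\forall I_1 \supseteq I$: $I_1 \in S$.
If $I \notin S$, then $\forall I_1 \subseteq I$: $I_1 \notin S$.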
An anti-monotone constraint such as the frequency constraint induces an order ideal on the itemset lattice. Dually, a monotonic constraint such as the rarity constraint induces an order filter on the itemset lattice. The set of itemsets fulfilling a given constraint is called a Theory [Mannila and Toivonen, 1997]. This theory is delimited by two borders, the positive and the negative one, which are defined as follows:
Negative/Positive Border [Bonchi and Lucchese, 2006]
When considering an anti-monotonic constraint $Q$, the border corresponds to the set of itemsets whose subsets all fulfill this constraint and whose supersets all fail to fulfill it. Let $S$ be a set of itemsets fulfilling an anti-monotonic constraint $Q$; the border is formally defined as:
$\mathcal{B}d(S) = \{I \mid \forall I_1 \subset I: I_1 \in S$ and $\forall I_1 \supset I: I_1 \notin S\}$
In the case of a monotonic constraint $Q$, the border corresponds to the set of patterns whose supersets all fulfill this constraint and whose subsets all fail to fulfill it.
Let $S$ be a set of patterns fulfilling a monotonic constraint $Q$; the border is formally defined as follows:
$\mathcal{B}d(S) = \{I \mid \forall I_1 \supset I: I_1 \in S$ and $\forall I_1 \subset I: I_1 \notin S\}$
However, we have to distinguish, for a given constraint, between the positive and the negative borders. Let $S$ be a set of patterns fulfilling a constraint $Q$. The positive border is denoted by $\mathcal{B}d^{+}(S)$ and corresponds to the patterns belonging to the border $\mathcal{B}d(S)$ and fulfilling the constraint $Q$. The negative border is denoted by $\mathcal{B}d^{-}(S)$ and corresponds to the set of patterns belonging to the border $\mathcal{B}d(S)$ and not fulfilling the constraint $Q$. These two borders are formally expressed as follows:
$\mathcal{B}d^{+}(S) = \mathcal{B}d(S) \cap S$,
$\mathcal{B}d^{-}(S) = \mathcal{B}d(S) \setminus S$.
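As an illustration, the borders induced by the anti-monotonic frequency constraint can be computed by a literal transcription of the definitions. The following sketch relies on a toy context and on names (`border`, `THEORY`, `POSITIVE`, `NEGATIVE`) that are assumptions of ours.

```python
from itertools import combinations

# Toy extraction context assumed for illustration.
CONTEXT = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"),
           frozenset("BE"), frozenset("ABCE")]
ITEMS = sorted(set().union(*CONTEXT))
LATTICE = [frozenset(c) for k in range(len(ITEMS) + 1)
           for c in combinations(ITEMS, k)]
MINSUPP = 4


def supp(itemset):
    return sum(1 for t in CONTEXT if itemset <= t)


# Theory of the frequency constraint: the set of frequent itemsets.
THEORY = {i for i in LATTICE if supp(i) >= MINSUPP}


def border(theory):
    """Itemsets whose proper subsets all belong to the theory and whose
    proper supersets all lie outside of it."""
    return {i for i in LATTICE
            if all(s in theory for s in LATTICE if s < i)
            and all(s not in theory for s in LATTICE if s > i)}


BORDER = border(THEORY)
POSITIVE = BORDER & THEORY   # maximal frequent itemsets
NEGATIVE = BORDER - THEORY   # minimal rare itemsets

print(sorted("".join(sorted(i)) for i in POSITIVE))
print(sorted("".join(sorted(i)) for i in NEGATIVE))
```

On this toy context, the negative border of the frequent itemsets coincides with the set of minimal rare patterns.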
In the next sub-section, we focus on the definition and the presentation of the notions related to condensed representations associated to a set of patterns.
2.2.5 Condensed Representations of a set of Patterns
The extraction of interesting patterns may be a costly operation in execution time and in memory consumption. This is due to the high number of generated candidates. In this regard, an interesting issue consists in extracting sets of patterns of reduced size, from which it is possible to regenerate the whole set of patterns. These reduced sets are called “Condensed Representations”. When the regeneration is performed in an exact way, without information loss, the condensed representation is said to be exact. Otherwise, the condensed representation is said to be approximative. These representations are formally defined in what follows.
Condensed Representations [Mannila and Toivonen, 1997]
A concise representation of a set of interesting itemsets is a representative set allowing the characterization of the initial set in an exact or an approximative way.
A concise representation of a set of frequent patterns is said to be a concise exact representation if, starting from this representation alone, we are able to determine for a given pattern whether it is frequent or not and, if so, to determine its conjunctive support. For example, the closed frequent patterns [Pasquier et al., 2005] constitute a concise exact representation of the set of frequent itemsets.
Otherwise, the representation is a concise approximative representation of a set of patterns: it is not able to exactly determine the support values of all the itemsets belonging to the set, and returns approximate values of these supports. For example, the maximal frequent itemsets [Roberto and Bayardo, 1998] constitute an approximative concise representation of the set of frequent patterns. In fact, thanks to the maximal frequent itemsets we are able to determine whether a given itemset is frequent or rare, but it is not possible to exactly derive its conjunctive support value.
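The difference between the two kinds of representations can be sketched as follows, on a toy context assumed for illustration. The function names (`closure`, `is_frequent_from_maximal`, `support_from_closed`) are ours: the maximal frequent itemsets only answer the frequency question, whereas the closed frequent itemsets, stored with their supports, also return the exact support.

```python
from itertools import combinations

# Toy extraction context assumed for illustration.
CONTEXT = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"),
           frozenset("BE"), frozenset("ABCE")]
ITEMS = sorted(set().union(*CONTEXT))
LATTICE = [frozenset(c) for k in range(len(ITEMS) + 1)
           for c in combinations(ITEMS, k)]
MINSUPP = 2


def supp(itemset):
    return sum(1 for t in CONTEXT if itemset <= t)


def closure(itemset):
    """Items shared by all the transactions containing the itemset."""
    covering = [t for t in CONTEXT if itemset <= t]
    if not covering:
        return frozenset(ITEMS)
    return frozenset.intersection(*covering)


# Exact representation: closed frequent itemsets with their supports.
CLOSED = {closure(i): supp(i) for i in LATTICE if supp(i) >= MINSUPP}
# Approximative representation: maximal frequent itemsets, without supports.
MAXIMAL = [c for c in CLOSED if not any(c < d for d in CLOSED)]


def is_frequent_from_maximal(itemset):
    return any(itemset <= m for m in MAXIMAL)


def support_from_closed(itemset):
    """Support of the smallest closed superset, or None if the itemset is rare."""
    supports = [s for c, s in CLOSED.items() if itemset <= c]
    return max(supports) if supports else None


print(is_frequent_from_maximal(frozenset("AB")),
      support_from_closed(frozenset("AB")))
```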
In general, a representation constitutes “a perfect cover” if it fulfills the conditions established by the following definition:
A set $E_1$ is said to be a perfect cover of a set $E$ if and only if $E_1$ covers $E$ without information loss and the size of $E_1$ never exceeds that of the set $E$.
Various proposals aiming to reduce the size of a set of patterns are based on the foundations of formal concept analysis [Ganter and Wille, 1999]. The next section is dedicated to the presentation of the framework of formal concept analysis.
2.3 Formal Concepts Analysis
Formal concept analysis, initially introduced by Wille in 1982 [Wille, 1982], deals with formal concepts. A formal concept is a set of objects, the extension, to which a set of attributes, the intention, applies. Formal concept analysis provides a classification and analysis tool whose principal element is the itemset lattice, defined as follows:
An itemset lattice is a conceptual and hierarchical schema of patterns. It is also called the lattice of set inclusion. In fact, the power set of $\mathcal{I}$ is ordered by set inclusion in the itemset lattice.
This lattice shows the frequent itemsets, the rare ones as well as the set of minimal rare patterns composing the positive border of the whole set of rare patterns.
2.3.2 Galois Connection
2.3.2.1. Closure Operator
In what follows, we present the fundamental basis of a closure operator.
Let $E$ be a set. A partial order over the set $E$ is a binary relation $\leq$ over the elements of $E$ such that, for all $x, y, z \in E$, the following properties hold [Davey and Priestley, 2002]:
1. Reflexivity: $x \leq x$
2. Anti-symmetry: $x \leq y$ and $y \leq x$ $\Rightarrow$ $x = y$
3. Transitivity: $x \leq y$ and $y \leq z$ $\Rightarrow$ $x \leq z$
A set $E$ with a partial order $\leq$, denoted by $(E, \leq)$, is a partially ordered set [Davey and Priestley, 2002].
Through the following definition, we introduce the notion of closure operator.
Closure Operator [Ganter and Wille, 1999]
Let $(E, \leq)$ be a partially ordered set. An application $f$ from $(E, \leq)$ to $(E, \leq)$ is a closure operator if and only if $f$ fulfills the following properties. For all sub-sets $S, S' \subseteq E$:
1. Isotony: $S \leq S' \Rightarrow f(S) \leq f(S')$
2. Extensivity: $S \leq f(S)$
3. Idempotency: $f(f(S)) = f(S)$
We now define the closure operator related to the conjunctive search space, where the conjunctive support characterizes the associated patterns.
2.3.2.2. The Galois Connection
Galois Connection [Ganter and Wille, 1999]
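Let $\mathcal{C} = (\mathcal{T}, \mathcal{I}, \mathcal{R})$ be an extraction context. Let $\Psi$ be the application from the power-set of $\mathcal{T}$ (the power-set of a set $E$, constituted by the sub-sets of $E$, is denoted by $\mathcal{P}(E)$) to the power-set of items $\mathcal{I}$, which associates to a set of objects $T \subseteq \mathcal{T}$ the set of items $i \in \mathcal{I}$ that are common to all the objects $t \in T$:
$\Psi: \mathcal{P}(\mathcal{T}) \rightarrow \mathcal{P}(\mathcal{I})$, $T \mapsto \Psi(T) = \{i \in \mathcal{I} \mid \forall t \in T, (t, i) \in \mathcal{R}\}$
Let $\Phi$ be the application, from the power-set of $\mathcal{I}$ to the power-set of $\mathcal{T}$, which associates to each set of items (commonly called pattern) $I \subseteq \mathcal{I}$ the set of objects $t \in \mathcal{T}$ containing all the items $i \in I$:
$\Phi: \mathcal{P}(\mathcal{I}) \rightarrow \mathcal{P}(\mathcal{T})$, $I \mapsto \Phi(I) = \{t \in \mathcal{T} \mid \forall i \in I, (t, i) \in \mathcal{R}\}$
The couple of applications $(\Psi, \Phi)$ is a Galois connection between the power-set of $\mathcal{T}$ and the power-set of $\mathcal{I}$.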
The images of and of by as well as those of and of by the application are :
() ; ( ; (.
[Ganter and Wille, 1999]
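Given a Galois connection $(\Psi, \Phi)$, where $\Psi$ maps a set of objects to the items they have in common and $\Phi$ maps a set of items to the objects containing them, the following properties are fulfilled: $\forall I, I_1, I_2 \subseteq \mathcal{I}$ and $\forall T, T_1, T_2 \subseteq \mathcal{T}$:
1. $I_1 \subseteq I_2 \Rightarrow \Phi(I_2) \subseteq \Phi(I_1)$;
2. $T_1 \subseteq T_2 \Rightarrow \Psi(T_2) \subseteq \Psi(T_1)$;
3. $T \subseteq \Phi(I) \Leftrightarrow I \subseteq \Psi(T)$.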
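The two applications of the Galois connection and the three properties above can be illustrated on a toy context assumed for the purpose. The names `common_items` (objects to items) and `covering_objects` (items to objects) are hypothetical names of ours, and the exhaustive check only covers this small context.

```python
from itertools import combinations

# Toy extraction context assumed for illustration (object id -> items).
CONTEXT = {1: frozenset("ACD"), 2: frozenset("BCE"), 3: frozenset("ABCE"),
           4: frozenset("BE"), 5: frozenset("ABCE")}
ITEMS = sorted(set().union(*CONTEXT.values()))
OBJECTS = sorted(CONTEXT)


def powerset(elements):
    return [frozenset(c) for k in range(len(elements) + 1)
            for c in combinations(elements, k)]


def common_items(objects):
    """Items shared by all the given objects."""
    return frozenset(i for i in ITEMS if all(i in CONTEXT[o] for o in objects))


def covering_objects(items):
    """Objects containing all the given items."""
    return frozenset(o for o in OBJECTS if items <= CONTEXT[o])


ITEMSETS = powerset(ITEMS)
OBJECT_SETS = powerset(OBJECTS)

# Property 1: larger itemsets are contained in fewer objects.
prop1 = all(covering_objects(j) <= covering_objects(i)
            for i in ITEMSETS for j in ITEMSETS if i <= j)
# Property 2: larger object sets share fewer items.
prop2 = all(common_items(u) <= common_items(t)
            for t in OBJECT_SETS for u in OBJECT_SETS if t <= u)
# Property 3: the two inclusions are equivalent.
prop3 = all((t <= covering_objects(i)) == (i <= common_items(t))
            for i in ITEMSETS for t in OBJECT_SETS)

print(prop1, prop2, prop3)
```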
Thanks to Definition 18, we introduce the closure operators associated to the Galois connection.
Closure Operators of the Galois Connection [Ganter and Wille, 1999]
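Let us consider the power-sets $\mathcal{P}(\mathcal{I})$ and $\mathcal{P}(\mathcal{T})$ provided with the set inclusion relation $\subseteq$, i.e., the partially ordered sets $(\mathcal{P}(\mathcal{I}), \subseteq)$ and $(\mathcal{P}(\mathcal{T}), \subseteq)$. The operators $f_c = \Psi \circ \Phi$ from $(\mathcal{P}(\mathcal{I}), \subseteq)$ to $(\mathcal{P}(\mathcal{I}), \subseteq)$ and $g_c = \Phi \circ \Psi$ from $(\mathcal{P}(\mathcal{T}), \subseteq)$ to $(\mathcal{P}(\mathcal{T}), \subseteq)$ are the closure operators of the Galois connection, where $\Phi$ maps a set of items to the objects containing them and $\Psi$ maps a set of objects to the items they have in common. We use the index c since the closure operator gathers itemsets sharing the same conjunctive support.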
Let the extraction context illustrated by Table 2.1, we then have :
() ; () ; () .
() ; () ; () .
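The conjunctive closure of an itemset, i.e. the set of items common to all the transactions containing it, can be sketched as below. The context is a toy context assumed for illustration and the name `closure` is ours; the three properties of a closure operator are checked exhaustively on this context only.

```python
from itertools import combinations

# Toy extraction context assumed for illustration.
CONTEXT = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"),
           frozenset("BE"), frozenset("ABCE")]
ITEMS = sorted(set().union(*CONTEXT))
LATTICE = [frozenset(c) for k in range(len(ITEMS) + 1)
           for c in combinations(ITEMS, k)]


def closure(itemset):
    """Items shared by all the transactions containing the itemset."""
    covering = [t for t in CONTEXT if itemset <= t]
    if not covering:
        # No transaction contains the itemset: its closure is the full item set.
        return frozenset(ITEMS)
    return frozenset.intersection(*covering)


extensive = all(i <= closure(i) for i in LATTICE)
isotone = all(closure(i) <= closure(j)
              for i in LATTICE for j in LATTICE if i <= j)
idempotent = all(closure(closure(i)) == closure(i) for i in LATTICE)

print("".join(sorted(closure(frozenset("AB")))), extensive, isotone, idempotent)
```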
2.3.3 Equivalence Classes, Closed Patterns and Minimal Generators
The application of the conjunctive closure operator, denoted $f_c$, induces an equivalence relation on the power-set $\mathcal{P}(\mathcal{I})$, partitioning it into equivalence classes [Ayouni et al., 2010, Bastide et al., 2000], denoted by $f_c$-equivalence-classes and defined as follows.
An $f_c$-Equivalence-Class contains all the itemsets belonging exactly to the same transactions and sharing the same closure according to the closure operator $f_c$.
Within an $f_c$-Equivalence-Class, the maximal element, according to set inclusion, is called the “Closed Pattern”, whereas the minimal elements, which are incomparable according to set inclusion, are called “Minimal Generators”. They are defined in what follows.
Closed Pattern [Bastide et al., 2000]
An itemset $I \subseteq \mathcal{I}$ is a closed itemset iff $f_c(I) = I$, where $f_c$ denotes the conjunctive closure operator.
Minimal Generator [Bastide et al., 2000]
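An itemset $g \subseteq \mathcal{I}$ is a minimal generator of a closed pattern $I$ if $f_c(g) = I$ and $\forall g_1 \subseteq \mathcal{I}$, if $g_1 \subseteq g$ and $f_c(g_1) = I$ then $g_1 = g$, where $f_c$ denotes the conjunctive closure operator.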
The following proposition introduces an interesting property of the minimal generators set.
[Stumme et al., 2002] Let $\mathcal{MG}$ be the set of minimal generators extracted from a context $\mathcal{C}$. The set $\mathcal{MG}$ fulfills an order ideal property on the itemset lattice.
A conjunctive equivalence class is a set containing all the patterns having the same conjunctive closure. Thus, these patterns have the same conjunctive support value. The minimal generators are the smallest elements, according to set inclusion, in their equivalence class, whereas the largest element in this class corresponds to the closed pattern. An example of a conjunctive equivalence class is given by Figure 2.1. In this class, ABCE is the closed pattern whereas AB and AE are the associated minimal generators. All the elements belonging to this class share exactly the same conjunctive support, equal to 2.
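The partition of the itemset lattice into equivalence classes can be sketched as follows. The context is a toy context assumed for illustration (its class of ABCE agrees with the example above), and the names `CLASSES` and `minimal_generators` are ours.

```python
from collections import defaultdict
from itertools import combinations

# Toy extraction context assumed for illustration.
CONTEXT = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"),
           frozenset("BE"), frozenset("ABCE")]
ITEMS = sorted(set().union(*CONTEXT))
LATTICE = [frozenset(c) for k in range(len(ITEMS) + 1)
           for c in combinations(ITEMS, k)]


def supp(itemset):
    return sum(1 for t in CONTEXT if itemset <= t)


def closure(itemset):
    covering = [t for t in CONTEXT if itemset <= t]
    if not covering:
        return frozenset(ITEMS)
    return frozenset.intersection(*covering)


# Group the itemsets by closure: each key is the closed pattern of its class.
CLASSES = defaultdict(list)
for itemset in LATTICE:
    CLASSES[closure(itemset)].append(itemset)


def minimal_generators(closed):
    """Minimal elements, with respect to set inclusion, of the class of `closed`."""
    members = CLASSES[closed]
    return {g for g in members if not any(h < g for h in members)}


abce = frozenset("ABCE")
print(sorted("".join(sorted(g)) for g in minimal_generators(abce)), supp(abce))
```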
At this level, we have presented the basic notions related to itemset’s extraction and to condensed representations.
Different approaches, derived from Formal Concept Analysis (FCA), were proposed in order to reduce the size of the set of frequent itemsets. In addition, correlated pattern mining constitutes an interesting alternative to get more informative patterns with a manageable size and a high quality returned knowledge.
The next chapter is dedicated to the presentation, going from the general to the more specific, of the state of the art approaches related to correlated pattern mining. A comparative study of these approaches is also conducted.
Chapter 3 Correlated Patterns Mining: Review of the Literature
In this chapter, we present an overview of the approaches of the literature that are related to our topic of mining correlated patterns. Our study goes from the general to the more specific. In this respect, we present in Section 3.2 the approaches related to constraint-based data mining, where we deal with the two kinds of constraints. Then, in Section 3.3, we concentrate on correlated pattern mining. We start by introducing the most common correlation measures, then we continue with the state of the art of rare correlated pattern mining, followed by the frequent correlated pattern mining approaches. A synthetic summary of the studied approaches is proposed in Section 3.4. The chapter is concluded in Section 3.5.
3.2 Constraint-based Itemset Mining
Within a process of pattern extraction, it is more difficult to localize the set of patterns fulfilling a set of constraints of different natures than to extract theories associated to a conjunction of constraints of the same nature [Bonchi and Lucchese, 2006]. Indeed, because of the opposite natures of the constraints, the reduction strategies are applicable to only a part of the constraints and not to all of them. Therefore, the extraction process is more complicated and more expensive in terms of execution time and memory consumption.
Many approaches have paid attention to the extraction of interesting patterns under constraints [Boulicaut and Jeudy, 2010]. One of the first algorithms belonging to this context is DualMiner [Bucila et al., 2003]. The latter allows the reduction of the search space while considering both of the monotonic and the anti-monotonic constraints. However, as highlighted by [Boley and Gärtner, 2009], DualMiner suffers from a main drawback related to the high cost of constraints evaluation.
In [Lee et al., 2006a], the authors have proposed an approach of pattern extraction under constraints. The ExAMiner algorithm [Bonchi et al., 2005] was also proposed in order to mine frequent patterns under monotonic constraints. It is important to mention that the effective reduction strategy adopted by ExAMiner could not be of use in the case of the monotonic constraint of rarity that we treat in this work, since this latter is sensitive to the changes in the transactions of the extraction context.
Many other works have also emerged. We cite for example, the VST algorithm [De Raedt et al., 2002] which allows the extraction of all the strings satisfying the set of monotonic and anti-monotonic constraints. Later, the FAVST algorithm [Lee and De Raedt, 2004] was introduced in order to improve the performance of the VST algorithm by reducing the number of scans of the database. Other approaches, belonging to this framework, have also been proposed such as the DPC-COFI algorithm and the BifoldLeap algorithm [El-Hajj et al., 2005]. The strategy of these approaches consists in extracting the maximal frequent itemsets which fulfill all of the constraints and from which the set of all the frequent valid itemsets will be derived.
In [Guns et al., 2013], the authors proposed the MiningZinc framework dedicated to constraint programming for itemset mining. The constraints are defined, within the MiningZinc system, in a declarative way close to mathematical notation. The tasks solved within the proposed system concern closed frequent itemset mining, cost-based itemset mining, high utility itemset mining and discriminative pattern mining. More broadly, in [Guns, 2016], the author presented an overview of methods devoted to bridging the gap between the two fields of constraint-based itemset mining and constraint programming.
3.3 Correlated Pattern Mining
This section is dedicated to the study of the correlated pattern mining. First, we start by introducing the commonly used correlation measures, presenting their properties and comparing them.
3.3.1 Correlation Measures
The integration of correlation measures within the mining process makes it possible to reduce the number of extracted patterns while improving the quality of the retrieved knowledge. The quality is expressed by the degree of correlation between the items composing the resulting itemsets. To achieve this goal, different correlation measures were proposed in the literature. We start with the bond measure.
3.3.1.1 The bond measure
The bond measure [Omiecinski, 2003] is mathematically equivalent to Coherence [Lee et al., 2003], Tanimoto-coefficient [Tanimoto, 1958], and Jaccard. In [Ben Younes et al., 2010], the authors propose a new expression of bond in Definition 22.
The bond measure
The bond measure of a non-empty pattern $I \subseteq \mathcal{I}$ is defined as follows:
bond($I$) = $\frac{Supp(I)}{Supp(\vee I)}$, where Supp($I$) denotes the conjunctive support of $I$ and Supp($\vee I$) its disjunctive support.
This measure conveys the information about the correlation of a pattern $I$ by computing the ratio between the number of co-occurrences of its items and the cardinality of its universe, which is equal to the set of transactions containing a non-empty subset of $I$. It is worth mentioning that, in the previous works dedicated to this measure, the disjunctive support has never been used to express it.
The use of the disjunctive support makes it possible to reformulate the expression of the bond measure in order to bring out some pruning conditions for the extraction of the patterns fulfilling this measure. Indeed, as shown later, the bond measure fulfills several properties that offer interesting pruning strategies, reducing the number of patterns generated during the extraction process. Note that the value of the bond measure of the empty set is undefined since its disjunctive support is equal to 0. However, this value is positive since $\lim_{I \to \emptyset}$ bond($I$) = $\frac{|\mathcal{T}|}{0}$ = $+\infty$. As a result, the empty set will be considered as a correlated pattern for any minimal threshold of the bond correlation measure.
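The computation of the bond measure from the conjunctive and the disjunctive supports can be sketched as follows, on a toy context assumed for illustration. The function names are ours, and exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction

# Toy extraction context assumed for illustration.
CONTEXT = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"),
           frozenset("BE"), frozenset("ABCE")]


def conjunctive_support(itemset):
    """Number of transactions containing all the items of the itemset."""
    return sum(1 for t in CONTEXT if set(itemset) <= t)


def disjunctive_support(itemset):
    """Number of transactions containing at least one item of the itemset."""
    return sum(1 for t in CONTEXT if set(itemset) & t)


def bond(itemset):
    """Ratio between the conjunctive and the disjunctive supports
    (defined for non-empty itemsets whose items occur in the context)."""
    return Fraction(conjunctive_support(itemset), disjunctive_support(itemset))


for pattern in ("AB", "AD", "BE", "ABCE"):
    print(pattern, bond(pattern))
```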
It has been proved, in [Ben Younes et al., 2010], that the bond measure fulfills other interesting properties. In fact, bond is: (i) symmetric, since we have $\forall I, J \subseteq \mathcal{I}$, bond($IJ$) = bond($JI$); (ii) descriptive, i.e. it is not influenced by the variation of the number of transactions of the extraction context.
In addition, it has been shown in [Wu et al., 2010] that it is desirable to select a descriptive measure which is not influenced by the number of transactions that contain none of pattern items. The symmetric property fulfilled by the bond measure makes it possible not to treat all the combinations induced by the precedence order of items within a given pattern. Noteworthily, the anti-monotony property, fulfilled by the bond measure as proven in [Omiecinski, 2003], is of interest. Indeed, all the subsets of a correlated pattern are also necessarily correlated. Then, we can deduce that any pattern having at least one uncorrelated proper subset is necessarily uncorrelated. It will thus be pruned without computing the value of its bond measure. In the next definition, we introduce the relationship between the bond measure and the cross-support property.
Cross-support property of the bond measure [Xiong et al., 2006]
Thanks to the cross-support property, given a minimal threshold minbond and an itemset $I \subseteq \mathcal{I}$, if $\exists x, y \in I$ such that $\frac{Supp(x)}{Supp(y)} <$ minbond, then $I$ is not correlated since bond($I$) $<$ minbond.
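Both pruning properties of the bond measure, anti-monotonicity and cross-support, can be checked exhaustively on a small example. The sketch below relies on a toy context and on hypothetical helper names (`cross_support_prunes`, `pruning_is_safe`); it only verifies the properties on this context and is not a proof.

```python
from fractions import Fraction
from itertools import combinations

# Toy extraction context assumed for illustration.
CONTEXT = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"),
           frozenset("BE"), frozenset("ABCE")]
ITEMS = sorted(set().union(*CONTEXT))
NON_EMPTY = [frozenset(c) for k in range(1, len(ITEMS) + 1)
             for c in combinations(ITEMS, k)]


def supp(itemset):
    return sum(1 for t in CONTEXT if itemset <= t)


def bond(itemset):
    return Fraction(supp(itemset), sum(1 for t in CONTEXT if itemset & t))


def cross_support_prunes(itemset, minbond):
    """True when two items x, y of the itemset satisfy Supp(x)/Supp(y) < minbond."""
    supports = [supp(frozenset([x])) for x in itemset]
    return Fraction(min(supports), max(supports)) < minbond


# Anti-monotonicity: the bond value never increases when items are added.
anti_monotone = all(bond(sub) >= bond(i)
                    for i in NON_EMPTY for sub in NON_EMPTY if sub <= i)
# Cross-support: every itemset discarded by the test is indeed uncorrelated.
thresholds = [Fraction(k, 10) for k in range(1, 11)]
pruning_is_safe = all(bond(i) < m
                      for m in thresholds
                      for i in NON_EMPTY if cross_support_prunes(i, m))

print(anti_monotone, pruning_is_safe)
```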
We continue, in what follows, with the presentation of the all-confidence measure.
3.3.1.2 The all-confidence measure
The all-confidence measure [Omiecinski, 2003] is defined as follows:
The all-confidence measure
The all-confidence measure [Omiecinski, 2003] is defined for any non-empty set $I \subseteq \mathcal{I}$ as follows:
all-confidence($I$) = $\frac{Supp(I)}{\max\{Supp(i) \mid i \in I\}}$
For example, in the extraction context of Table 2.1 and for a minimal correlation threshold of 0.40, we have all-confidence(ABCE) = 0.50. The ABCE itemset is thus correlated according to the all-confidence measure. All the direct subsets of ABCE are also correlated. We have all-confidence(ABE) = all-confidence(ACE) = 0.50 and all-confidence(BCE) = 0.75.
For the itemset AD, the ratio between the smallest and the largest support of its items is equal to 0.33 < 0.4, and we have all-confidence(AD) = 0.33. The AD itemset does not fulfill the cross-support property; thus it is a non-correlated itemset. This example illustrates the preservation of the anti-monotonicity and the cross-support properties by the all-confidence measure.
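The values of this example can be reproduced with a short sketch. The context below is a toy context assumed for illustration (it is not claimed to be Table 2.1, although it yields the values quoted above), and the function names are ours.

```python
from fractions import Fraction

# Toy extraction context assumed for illustration.
CONTEXT = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"),
           frozenset("BE"), frozenset("ABCE")]


def supp(itemset):
    return sum(1 for t in CONTEXT if set(itemset) <= t)


def all_confidence(itemset):
    """Conjunctive support divided by the largest support of a single item."""
    return Fraction(supp(itemset), max(supp(x) for x in itemset))


for pattern in ("ABCE", "ABE", "ACE", "BCE", "AD"):
    print(pattern, float(all_confidence(pattern)))
```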
We continue in what follows with the hyper-confidence measure.
3.3.1.3 The hyper-confidence measure
The hyper-confidence measure denoted by h-conf of an itemset is defined as follows.
The hyper-confidence measure
The hyper-confidence measure of an itemset $I = \{i_1, i_2, \ldots, i_m\}$ is equal to:
h-conf($I$) = min$\{$Conf($i_1 \Rightarrow i_2, \ldots, i_m$), $\ldots$, Conf($i_m \Rightarrow i_1, \ldots, i_{m-1}$)$\}$,
where Conf stands for the Confidence measure associated to association rules.
The hyper-confidence measure is equivalent to the all-confidence measure; it thus fulfills the anti-monotonicity and the cross-support properties.
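The equivalence between the two measures follows from the fact that the confidence of a rule whose premise is a single item $i$ and whose conclusion is the rest of $I$ equals Supp($I$)/Supp($i$), so that the minimum is reached for the item of largest support. A minimal sketch, on a toy context assumed for illustration and with function names of ours, checks this equality on every itemset of at least two items.

```python
from fractions import Fraction
from itertools import combinations

# Toy extraction context assumed for illustration.
CONTEXT = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"),
           frozenset("BE"), frozenset("ABCE")]
ITEMS = sorted(set().union(*CONTEXT))
ITEMSETS = [frozenset(c) for k in range(2, len(ITEMS) + 1)
            for c in combinations(ITEMS, k)]


def supp(itemset):
    return sum(1 for t in CONTEXT if itemset <= t)


def confidence(premise, conclusion):
    """Confidence of the association rule premise => conclusion."""
    return Fraction(supp(premise | conclusion), supp(premise))


def h_conf(itemset):
    """Smallest confidence among the rules with a single item as premise."""
    return min(confidence(frozenset([x]), itemset - {x}) for x in itemset)


def all_confidence(itemset):
    return Fraction(supp(itemset), max(supp(frozenset([x])) for x in itemset))


equivalent = all(h_conf(i) == all_confidence(i) for i in ITEMSETS)
print(equivalent)
```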
We continue in what follows with the any-confidence measure.
3.3.1.4 The any-confidence measure
This measure is defined, for any non-empty set $I \subseteq \mathcal{I}$, as follows:
The any-confidence measure
any-confidence($I$) = $\frac{Supp(I)}{\min\{Supp(i) \mid i \in I\}}$
The any-confidence measure [Omiecinski, 2003] preserves neither the anti-monotonicity nor the cross-support property.
Let us consider the extraction context given by Table 2.1 and a minimal correlation threshold equal to 0.80. The any-confidence value of AB is equal to any-confidence(AB) = $\frac{Supp(AB)}{\min\{Supp(A), Supp(B)\}}$ = $\frac{2}{3}$ = 0.66. AB does not fulfill the minimal threshold of correlation; thus it is a non-correlated itemset according to the any-confidence measure. In contrast, the AD itemset is correlated and its correlation value is equal to 1. We also have a cross-support ratio equal to 0.75 < 0.80; however, any-confidence(AD) = 1 $\geq$ 0.80. This example illustrates the non-preservation of the anti-monotonicity as well as of the cross-support properties.
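The any-confidence measure can be sketched in the same way. The first context below is a toy context assumed for illustration, which reproduces the values 0.66 and 1 quoted above. The second context, `SECOND`, is a further hypothetical context of ours, chosen because it contains an explicit violation of anti-monotonicity: an itemset whose value is higher than that of one of its subsets.

```python
from fractions import Fraction

# Toy extraction context assumed for illustration.
TOY = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"),
       frozenset("BE"), frozenset("ABCE")]
# Second hypothetical context, chosen to exhibit a violation of anti-monotonicity.
SECOND = [frozenset("AB"), frozenset("A"), frozenset("B"), frozenset("ABC")]


def supp(context, itemset):
    return sum(1 for t in context if set(itemset) <= t)


def any_confidence(context, itemset):
    """Conjunctive support divided by the smallest support of a single item."""
    return Fraction(supp(context, itemset),
                    min(supp(context, x) for x in itemset))


print(any_confidence(TOY, "AB"), any_confidence(TOY, "AD"))
# With a threshold of 0.80, ABC is correlated whereas its subset AB is not.
print(any_confidence(SECOND, "ABC"), any_confidence(SECOND, "AB"))
```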
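We present in what follows the $\chi^2$ coefficient.
3.3.1.5 The $\chi^2$ Coefficient
The $\chi^2$ coefficient is defined as follows:
The $\chi^2$ Coefficient [Brin et al., 1997]
The $\chi^2$ coefficient of an itemset $Z = XY$, with $X \subseteq \mathcal{I}$ and $Y \subseteq \mathcal{I}$, is defined as follows:
$\chi^2(Z) = \sum_{z_1 \in \{X, \overline{X}\}} \sum_{z_2 \in \{Y, \overline{Y}\}} \frac{(Supp(z_1 z_2) - E(z_1 z_2))^2}{E(z_1 z_2)}$, where $E(z_1 z_2)$ is the support expected for $z_1 z_2$ under the independence of $X$ and $Y$.
Some relevant properties of the $\chi^2$ coefficient are given by the following proposition.
The $\chi^2$ coefficient is a statistical and symmetric measure [Brin et al., 1997].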
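For two single items, the standard Pearson $\chi^2$ statistic is computed over the four cells of their contingency table. The following sketch is restricted to this case; it uses a toy context assumed for illustration and a function name, `chi_square`, of ours.

```python
from fractions import Fraction

# Toy extraction context assumed for illustration.
CONTEXT = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"),
           frozenset("BE"), frozenset("ABCE")]


def chi_square(x, y):
    """Pearson chi-square over the 2x2 contingency table of items x and y.
    Assumes that each item occurs in some, but not all, of the transactions."""
    n = len(CONTEXT)
    total = Fraction(0)
    for has_x in (True, False):
        for has_y in (True, False):
            observed = sum(1 for t in CONTEXT
                           if (x in t) == has_x and (y in t) == has_y)
            row = sum(1 for t in CONTEXT if (x in t) == has_x)
            col = sum(1 for t in CONTEXT if (y in t) == has_y)
            expected = Fraction(row * col, n)  # count expected under independence
            total += (observed - expected) ** 2 / expected
    return total


print(chi_square("A", "D"), chi_square("D", "A"))
```

The two calls return the same value, which illustrates the symmetry of the measure.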
We recapitulate the different properties of the presented measures in Table 3.1. The “” symbol indicates that the measure fulfills the property.
In the previous study, we specifically focused on the correlation measures which are the most used in correlated pattern mining. However, the cosine and the kulczynski measures were not studied since these two measures are rarely used in correlated pattern mining, due to the non-preservation of the anti-monotonicity property [Wu et al., 2010]. The lift measure is used within the evaluation of association rules.
We conclude, according to this overview, that the most interesting measures are bond and all-confidence. This is justified by the fact that these two measures fulfill the pertinent properties of anti-monotonicity and cross-support.
We present, in what follows, the state of the art approaches dealing with correlated patterns mining. We precisely start with rare correlated pattern mining.
3.3.2 Rare Correlated Patterns Mining
Various approaches devoted to the extraction of correlated patterns under constraints have been proposed. However, the retrieval of all the patterns that are both highly correlated and infrequent is based on the naive idea of extracting the set of all frequent patterns for a very low minsupp threshold and then filtering these patterns by a correlation measure.
Another idea is to extract the whole set of the correlated patterns without any integration of the rarity constraint. The obtained set contains obviously all the frequent correlated as well as the rare correlated patterns. It is relevant to note that the application of these two ideas is very expensive in execution time and in memory consumption due to the explosion of the number of candidates to be evaluated.
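The second idea can be sketched as follows: all the correlated patterns are enumerated first, and the rarity constraint is only applied as a post-processing step. The sketch below is a generic illustration on a toy context assumed for the purpose, with the bond measure as correlation measure and hypothetical names of ours; it does not reproduce any of the cited algorithms.

```python
from fractions import Fraction
from itertools import combinations

# Toy extraction context assumed for illustration.
CONTEXT = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"),
           frozenset("BE"), frozenset("ABCE")]
ITEMS = sorted(set().union(*CONTEXT))
NON_EMPTY = [frozenset(c) for k in range(1, len(ITEMS) + 1)
             for c in combinations(ITEMS, k)]
MINSUPP = 4
MINBOND = Fraction(1, 5)


def supp(itemset):
    return sum(1 for t in CONTEXT if itemset <= t)


def bond(itemset):
    return Fraction(supp(itemset), sum(1 for t in CONTEXT if itemset & t))


# Step 1: every candidate is evaluated against the correlation constraint.
CORRELATED = [i for i in NON_EMPTY if bond(i) >= MINBOND]
# Step 2: post-processing, keeping only the rare correlated patterns.
RARE_CORRELATED = {i for i in CORRELATED if supp(i) < MINSUPP}

print(len(NON_EMPTY), len(CORRELATED), len(RARE_CORRELATED))
```

All the candidates are evaluated before the rarity filter is applied, which is the source of the cost discussed above.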
The approach proposed in [Cohen et al., 2000] is based on the previous principle. This approach extracts the pairs of items that are correlated according to the Similarity measure, without computing their support. In fact, the Similarity measure evaluates the similarity between two items and corresponds to the number of transactions in which both items appear divided by the number of transactions in which at least one of them appears. Consequently, the Similarity measure is semantically equivalent to the bond measure. However, no analysis of this measure has been conducted.
In fact, this approach proposes to assign to each item a signature composed of the identifier list of the transactions to which the item belongs. Then, the Similarity of two items is computed as the size of the intersection of their signatures divided by the size of the union of their signatures. We conclude that the frequency constraint was not integrated, in order to retrieve the highly correlated itemsets with a weak support. From these patterns, the association rules with a high confidence and a weak support are generated.
In this same context, we mention the DiscoverMPatterns algorithm [Ma and Hellerstein, 2001]. In fact, this latter is devoted to the extraction of the correlated patterns based on the all-confidence measure. Nevertheless, a first version of the approach was dedicated to the extraction of all the correlated patterns without any restriction of the support value in order to specifically get the rare correlated itemsets. Then, within the second version of the approach, the minimum support threshold constraint was integrated. Consequently, this constraint integration allows to extract the frequent correlated patterns.
Another way of solving the rare correlated pattern extraction problem consists in extracting all the frequent patterns for a very weak minimal support threshold. Evidently, the obtained set contains a subset of the infrequent correlated patterns. Xiong et al. relied on this idea to introduce the Hyper-CliqueMiner algorithm [Xiong et al., 2006]. The output of this algorithm is the set of frequent correlated patterns for a very low minsupp value. It is worth noting that the good performance of this algorithm is justified by the use of the anti-monotonic property of the correlation measure as well as of the cross-support property, which significantly reduces the number of evaluated candidates and thus the time needed.
The approach proposed in [Sandler and Thomo, 2010] also follows this principle. This approach extracts the frequent and frequently correlated 2-itemsets. It is judged as a naive approach since it is based on the extraction of the whole solution set for a very low minsupp value. Then, a post-processing is performed in order to keep only the highly correlated itemsets. The FT-Miner algorithm [Hu and Li, 2009] outputs the correlated infrequent itemsets according to the N-Confidence measure, which is semantically equivalent to all-Confidence. The all-Confidence measure was also treated in the Partition algorithm [Omiecinski, 2003], which extracts the correlated patterns according to both the all-Confidence and the bond measures. The choice of the measure to be considered depends on the user’s input preferences.
The approach proposed in [Okubo et al., 2010] also belongs to the same trend of approaches dealing with correlated infrequent itemsets. Indeed, it is based on the principle that the patterns which are weakly correlated according to the bond correlation measure are generally rare in the extraction context. The expressed constraint corresponds to a restriction of the maximum correlation value. This is a monotonic constraint since it corresponds to the opposite of the anti-monotonic constraint of minimal correlation. In order to get rid of the rare patterns that represent exceptions and are not informative, a minimal frequency constraint was also integrated. The idea then consists in extracting the top rare patterns, which are the most informative ones.
The problem of integrating constraints during the process of correlated pattern mining was also studied in the works proposed in [Brin et al., 1997] and in [Grahne et al., 2000], respectively. These approaches deal with constrained correlated pattern mining and rely on the $\chi^2$ correlation coefficient. They exploit the various pruning opportunities offered by these constraints and benefit from the selective power of each type of constraint. However, the $\chi^2$ coefficient does not fulfill the anti-monotonic property as the bond measure does. Besides, these approaches are limited to the extraction of a small subset which is composed only of the minimal valid patterns, i.e. the minimal patterns which fulfill all of the imposed constraints. Furthermore, the authors do not propose any concise representation of the extracted correlated patterns.
Also, in [Surana et al., 2010], a study of different properties of interesting measures was conducted in order to suggest a set of the most adequate properties to consider while mining rare associations rules.
It is deduced that for all these approaches, the monotonic constraint of rarity was never included within the mining process in order to retrieve all the rare highly correlated patterns.
3.3.3 Frequent Correlated Patterns Mining
In [Lee et al., 2003], the authors proposed the CoMine approach, which is dedicated to the extraction of frequent correlated patterns according to the all-confidence and to the bond measures. We distinguish two different versions of the CoMine approach. The first version treats the bond measure while the second treats the all-confidence measure. CoMine also constitutes the core of the I-IsCoMine-AP and I-IsCoMine-CT algorithms [SHEN et al., 2011].
The bond measure was also studied in [Le Bras et al., 2011], where the authors proposed an apriori-like algorithm for mining classification rules. Moreover, the authors in [Segond and Borgelt, 2011] proposed a generic approach for frequent correlated pattern mining. Indeed, the bond correlation measure and eleven other correlation measures were used. All of them fulfill the anti-monotonicity property. Correlated pattern mining was then shown to be more complex and more informative than frequent pattern mining [Segond and Borgelt, 2011].
Many other works have also emerged. In [Wu et al., 2010], the authors provide a unified definition of existing null-invariant correlation measures and propose the GAMiner approach, allowing the extraction of frequent highly correlated patterns according to the Cosine and to the Kulczynsky measures. In this same context, the NICOMiner algorithm was also proposed in [Kim et al., 2011]. It allows the extraction of correlated patterns according to the Cosine measure. We highlight that the Cosine measure has the specificity of being neither monotonic nor anti-monotonic.
In this same context, we also cite the Atheris approach [Soulet et al., 2011], which allows the extraction of a condensed representation of correlated patterns according to the user’s preferences. In [Barsky et al., 2012], the authors introduced the concept of flipping correlation patterns according to the Kulczynsky measure. However, the Kulczynsky measure does not fulfill the interesting anti-monotonic property as the bond measure does.
The all-confidence measure was handled within the work proposed in [Karim et al., 2012]. The approach outputs the correlated patterns (also called the associated patterns), the non correlated patterns (also called the independent patterns). Also, in [Kiran and Kitsuregawa, 2013] the authors propose a method to extract all-confidence frequent correlated patterns and they also discuss the impact of fixing the minsupp threshold value over the quality of the obtained itemsets and propose to fix a minimal correlation threshold for each item.
In the next subsection, we study the approaches of extracting the condensed representations of frequent correlated patterns.
3.3.4 Condensed Representations of Correlated Patterns Mining
The problem of mining concise representations of correlated patterns was not widely studied in the literature. We mention the Ccmine [Kim et al., 2004] approach for mining closed correlated patterns according to the all-confidence measure, which constitute a condensed representation of frequent correlated patterns. We also point out that the authors in [Ben Younes et al., 2010] proposed the CCPR-Miner algorithm, allowing the extraction of closed frequent correlated patterns according to the bond measure.
In this context, we also cite the Jim approach [Segond and Borgelt, 2011]. In fact, Jim extracts the closed correlated frequent patterns, which constitute a perfect cover of the whole set of frequent correlated patterns. The choice of the considered correlation measure is fixed by the user’s parameters within the Jim approach.
In fact, the Jim approach is, on the one hand, the most efficient state of the art approach extracting a condensed representation of frequent correlated patterns according to the bond measure. On the other hand, Jim is the only approach which dealt with the same kind of patterns as those treated by our mining approach, which we present in the following chapters. In this sense, in our experimental study, we will focus on comparing our mining approach with the Jim approach.
Based on the previous review of the literature, we conclude that most of the approaches dealt with the bond and the all-confidence measures. These measures fulfill the interesting anti-monotonic property, which makes it possible to reduce the search space by pruning irrelevant candidates early. Therefore, the set of frequent correlated patterns results from the conjunction of two constraints of the same type: the correlation and the frequency.
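Since both constraints are anti-monotonic, a candidate can be discarded as soon as one of its subsets has been discarded. The following level-wise sketch illustrates this pruning on a toy context assumed for illustration; it is a generic Apriori-like procedure written for the purpose, with names of ours, and it does not reproduce any of the cited algorithms.

```python
from fractions import Fraction
from itertools import combinations

# Toy extraction context assumed for illustration.
CONTEXT = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"),
           frozenset("BE"), frozenset("ABCE")]
ITEMS = sorted(set().union(*CONTEXT))


def supp(itemset):
    return sum(1 for t in CONTEXT if itemset <= t)


def bond(itemset):
    return Fraction(supp(itemset), sum(1 for t in CONTEXT if itemset & t))


def frequent_correlated(minsupp, minbond):
    """Level-wise extraction; assumes minbond <= 1, so that every frequent
    single item (whose bond value is 1) is correlated."""
    level = [frozenset([i]) for i in ITEMS if supp(frozenset([i])) >= minsupp]
    result = set()
    k = 1
    while level:
        result.update(level)
        kept = set(level)
        k += 1
        # Candidates of size k obtained by joining two patterns of size k - 1.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Pruning: every subset of size k - 1 must itself have been kept.
        level = [c for c in candidates
                 if all(frozenset(s) in kept for s in combinations(c, k - 1))
                 and supp(c) >= minsupp and bond(c) >= minbond]
    return result


patterns = frequent_correlated(2, Fraction(3, 5))
print(sorted("".join(sorted(p)) for p in patterns))
```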
In fact, the recuperation of all the patterns that are both highly correlated and infrequent is based on the naive idea to extract the set of all frequent patterns for a very low threshold minsupp and then to filter out these patterns by a measure of correlation. Another resolution strategy consists in extracting the whole set of the correlated patterns without any integration of the rarity constraint. Then, a post-processing is performed in order to uniquely retrieve the rare correlated itemsets.
In other words, the monotonic constraint of rarity was never integrated within the mining process, and thus the exploration of the part of the search space whose candidates do not fulfill the rarity constraint is obviously fruitless. In addition, another problem is related to the high consumption of memory and CPU resources, due to the combinatorial explosion of the number of candidates depending on the size of the mined dataset. We highlight that Jim [Segond and Borgelt, 2011] is the only approach that dealt with different anti-monotonic correlation measures. However, Jim is limited to frequent correlated patterns and does not consider the rare correlated ones.
Table 3.2 summarizes the characteristics of the reviewed approaches according to the following properties:
The correlation measure: This property describes the considered correlation measure.
The kind of the extracted patterns: This property describes the kind of patterns output by the mining algorithm.
The nature of constraints: This property describes the nature of the constraints included within the algorithm: anti-monotonic or monotonic.
To the best of our knowledge, no previous work was dedicated to the extraction of concise representations of patterns under a conjunction of constraints of distinct types. This problem is therefore a challenging task in data mining, which strengthens our motivation to address it. Accordingly, the work proposed in this thesis is the first one that focuses on mining concise representations of both frequent and rare correlated patterns according to the anti-monotonic bond measure.
|Extraction approach||Correlation measure||Kind of the extracted patterns||Nature of constraints|
|The approach of [Cohen et al., 2000]||bond||correlated 2-itemsets||anti-monotonic|
|DiscoverMPattern [Ma and Hellerstein, 2001]||all-confidence||all the correlated itemsets||anti-monotonic|
|Partition||all-confidence||all the correlated||anti-monotonic|
|[Lee et al., 2003]|| || || |
|[Kim et al., 2004]|| ||frequent correlated|| |
|[Xiong et al., 2006]|| ||and a subset of rare|| |
|The approach of [Sandler and Thomo, 2010]||all-confidence||correlated frequent and a subset of rare||anti-monotonic|
|The approach of [Okubo et al., 2010]||bond||weakly correlated||monotonic|
|[Ben Younes et al., 2010]|| ||frequent correlated|| |
|[Segond and Borgelt, 2011]||anti-monotonic measures||frequent correlated and frequent correlated|| |
In this chapter, we presented an overview of the state-of-the-art approaches dealing with correlated pattern mining, preceded by a presentation of the different correlation measures. We deduced that no previous work dealt with both frequent correlated and rare correlated patterns according to a specified correlation metric. Motivated by this issue, we propose in this thesis to benefit from the knowledge conveyed by both frequent correlated and rare correlated patterns according to the bond measure. To tackle this challenging task, we propose in the next chapter a characterization of frequent correlated patterns, rare correlated patterns and their associated concise representations.
Part II Condensed Representations of Correlated Patterns
Chapter 4 Condensed Representations of Correlated Patterns
The main criticism that can be addressed to frequent pattern mining approaches is that they do not provide information about the degree of correlation among the items of the extraction context. This motivates us to provide the user with key information about both the correlation between items and the frequency of their occurrence. This aim is reachable thanks to the integration of correlation measures within the mining process.
The correlation measure that we treat throughout this thesis is bond. Our motivations behind the choice of this measure are described in Section 4.2. In Section 4.3, we focus on the correlated patterns associated with the bond measure and characterize this set of patterns. Section 4.4 is devoted to the presentation of the closure operator associated with bond. We introduce the associated exact condensed representations in Sections 4.5 and 4.6. Section 4.7 concludes the chapter.
4.2 Motivations behind our choice of the bond measure
Based on the study of the state-of-the-art approaches presented in the previous chapter, we find that most of the existing approaches deal with the bond and the all-confidence measures. The bond measure fulfills the anti-monotonicity property, which is an interesting one. Indeed, this property reduces the search space by pruning unpromising candidates, thereby reducing both the extraction time and the memory consumption.
It has been proved in the literature that the bond measure presents many interesting properties. In fact, the bond measure:
is symmetric since $\forall I, J \subseteq \mathcal{I}$, bond($IJ$) = bond($JI$);
is descriptive since it is not influenced by the number of transactions that contain none of the items composing the pattern;
fulfills the cross-support property [Xiong et al., 2006]. Thanks to this property, given a minimal threshold minbond and an itemset $I \subseteq \mathcal{I}$, if $\exists\, x, y \in I$ such that $\frac{Supp(\wedge x)}{Supp(\wedge y)} <$ minbond, then $I$ is not correlated since bond($I$) $<$ minbond;
induces an anti-monotonic constraint for a fixed minimal threshold minbond. In fact, $\forall I, I_1 \subseteq \mathcal{I}$, if $I_1 \subseteq I$, then bond($I_1$) $\geq$ bond($I$). Therefore, the set of correlated patterns forms an order ideal. Indeed, all the subsets of a correlated pattern are necessarily correlated.
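The properties listed above can be checked numerically on a small example. The sketch below is only illustrative: the five-transaction context, the helper names (`conj_supp`, `disj_supp`, `bond`, `support_ratio`) and the brute-force enumeration are assumptions introduced here and are not part of the mining approach of this thesis.

```python
from fractions import Fraction
from itertools import combinations

# Toy context assumed for illustration only.
TRANSACTIONS = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
    {"A", "B", "C", "E"},
]
ITEMS = sorted(set().union(*TRANSACTIONS))

def conj_supp(pattern, transactions=TRANSACTIONS):
    """Number of transactions containing every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t)

def disj_supp(pattern, transactions=TRANSACTIONS):
    """Number of transactions containing at least one item of the pattern."""
    return sum(1 for t in transactions if pattern & t)

def bond(pattern, transactions=TRANSACTIONS):
    """Ratio of the conjunctive support to the disjunctive support."""
    return Fraction(conj_supp(pattern, transactions),
                    disj_supp(pattern, transactions))

# Every non-empty pattern over the items of the toy context.
PATTERNS = [frozenset(c) for k in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, k)]

# Descriptive: transactions containing none of the items leave bond unchanged.
padded = TRANSACTIONS + [set(), {"Z"}]
descriptive = all(bond(p) == bond(p, padded) for p in PATTERNS)

# Anti-monotonic: a subset is at least as correlated as any of its supersets.
anti_monotonic = all(bond(sub) >= bond(sup)
                     for sup in PATTERNS for sub in PATTERNS if sub <= sup)

def support_ratio(pattern):
    """Smallest single-item support divided by the largest one."""
    supports = [conj_supp(frozenset([x])) for x in pattern]
    return Fraction(min(supports), max(supports))

# Cross-support: bond never exceeds the smallest ratio of item supports.
cross_support = all(bond(p) <= support_ratio(p) for p in PATTERNS)

print(descriptive, anti_monotonic, cross_support)  # True True True
```

Exact fractions are used instead of floating-point numbers so that the comparisons between bond values are not affected by rounding.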
We present in the following an interesting relation between the value of the bond measure and the values of the conjunctive and disjunctive supports for each pair of patterns $I$ and $I_1$ such that $I_1 \subseteq I$ [Ben Younes et al., 2010].
Let $I, I_1 \subseteq \mathcal{I}$ and $I_1 \subseteq I$. If bond($I$) = bond($I_1$), then Supp($\wedge I$) = Supp($\wedge I_1$) and Supp($\vee I$) = Supp($\vee I_1$).
According to the previous proposition, if bond($I$) = bond($I_1$), then Supp($\neg I$) = Supp($\neg I_1$). In fact, both $I$ and $I_1$ have the same conjunctive and disjunctive supports and, according to De Morgan's law, the disjunctive and the negative supports of a pattern are related as follows: Supp($\neg I$) = $|\mathcal{T}|$ - Supp($\vee I$).
On the other hand, if bond($I$) $\neq$ bond($I_1$), then Supp($\wedge I$) $\neq$ Supp($\wedge I_1$) or Supp($\vee I$) $\neq$ Supp($\vee I_1$) (i.e. at least one of the two supports is different).
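The relation between equal bond values and equal supports can be verified by brute force on a small example. The following sketch assumes a toy five-transaction context and hypothetical helper names. It only considers nested pairs whose bond value is non-zero, since two patterns with a null conjunctive support both have a bond of 0 whatever their disjunctive supports are.

```python
from fractions import Fraction
from itertools import combinations

# Toy context assumed for illustration only.
TRANSACTIONS = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
    {"A", "B", "C", "E"},
]
ITEMS = sorted(set().union(*TRANSACTIONS))
PATTERNS = [frozenset(c) for k in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, k)]

def conj_supp(pattern):
    """Transactions containing all the items of the pattern."""
    return sum(1 for t in TRANSACTIONS if pattern <= t)

def disj_supp(pattern):
    """Transactions containing at least one item of the pattern."""
    return sum(1 for t in TRANSACTIONS if pattern & t)

def neg_supp(pattern):
    """Transactions containing none of the items of the pattern."""
    return sum(1 for t in TRANSACTIONS if not (pattern & t))

def bond(pattern):
    return Fraction(conj_supp(pattern), disj_supp(pattern))

# De Morgan: the negative support complements the disjunctive support.
de_morgan = all(neg_supp(p) == len(TRANSACTIONS) - disj_supp(p)
                for p in PATTERNS)

# Nested pairs (sub strictly included in sup) sharing a non-zero bond value.
equal_bond_pairs = [(sub, sup) for sup in PATTERNS for sub in PATTERNS
                    if sub < sup and bond(sub) == bond(sup) and bond(sup) > 0]

# For such pairs, the three kinds of support must coincide.
same_supports = all(conj_supp(a) == conj_supp(b)
                    and disj_supp(a) == disj_supp(b)
                    and neg_supp(a) == neg_supp(b)
                    for a, b in equal_bond_pairs)

print(de_morgan, same_supports)  # True True
```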
In this context, we propose to study the bond correlation measure in an integrated mining process aiming to extract both frequent and rare correlated patterns as well as their associated condensed representations. In this regard, we present in the next section the specification of the frequent correlated patterns as well as the rare correlated patterns according to the bond measure.
4.3 Characterization of the Correlated patterns according to the bond measure
4.3.1 Definitions and Properties
The bond measure [Omiecinski, 2003] is mathematically equivalent to Coherence [Lee et al., 2003], Tanimoto coefficient [Tanimoto, 1958], and Jaccard [Jaccard, 1901]. It was redefined in [Ben Younes et al., 2010] as follows:
The bond measure
The bond measure of a non-empty pattern $I \subseteq \mathcal{I}$ is defined as follows:
$$bond(I) = \frac{Supp(\wedge I)}{Supp(\vee I)}$$
The bond measure takes its values within the interval [0, 1]. Considering the universe of a pattern $I$ [Lee et al., 2003], i.e., the set of transactions containing a non-empty subset of $I$, the bond measure represents the simultaneous occurrence rate of the items of $I$ in its universe. Thus, the more the items of $I$ are dispersed in its universe (i.e. weakly correlated), the lower the value of the bond measure is, as Supp($\wedge I$) is much smaller than Supp($\vee I$). Conversely, the more the items of $I$ depend on each other (i.e. strongly correlated), the higher the value of the bond measure is, since Supp($\wedge I$) is closer to Supp($\vee I$).
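As a minimal sketch of this definition, the bond value of a pattern can be computed as the ratio between the number of transactions containing all its items and the number of transactions containing at least one of them. The five-transaction context and the function names below are assumptions made for illustration.

```python
from fractions import Fraction

# Toy context assumed for illustration only.
TRANSACTIONS = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
    {"A", "B", "C", "E"},
]

def conj_supp(pattern, transactions):
    """Conjunctive support: transactions containing every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t)

def disj_supp(pattern, transactions):
    """Disjunctive support: size of the universe of the pattern."""
    return sum(1 for t in transactions if pattern & t)

def bond(pattern, transactions):
    """bond = conjunctive support / disjunctive support, kept as an exact fraction.

    The pattern must be non-empty and share at least one item with the
    context, otherwise the disjunctive support is zero.
    """
    return Fraction(conj_supp(pattern, transactions),
                    disj_supp(pattern, transactions))

print(bond({"A", "B"}, TRANSACTIONS))       # 2/5
print(bond({"B", "C", "E"}, TRANSACTIONS))  # 3/5
print(bond({"B", "E"}, TRANSACTIONS))       # 1
```

In this toy context the items B and E always occur together, hence a bond of 1, whereas A and B co-occur in only two of the five transactions of their universe.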
The set of correlated patterns associated to the bond measure is defined as follows.
Considering a correlation threshold minbond, the set of correlated patterns, denoted $\mathcal{CP}$, is equal to: $\mathcal{CP} = \{I \subseteq \mathcal{I} \mid bond(I) \geq minbond\}$.
Let us consider the dataset given by Table 2.1. For minbond = 0.5, an itemset of this dataset whose bond value is equal to 0.4 is not a correlated one, since 0.4 $<$ 0.5, whereas an itemset whose bond value is equal to 0.6 is a correlated one, since 0.6 $\geq$ 0.5.
In the following, we define the set of maximal correlated patterns:
Maximal correlated patterns
The set of maximal correlated patterns constitutes the positive border of correlated patterns and is composed of the correlated patterns having no correlated proper superset. This set is defined as: $\mathcal{MCP} = \{I \in \mathcal{CP} \mid \forall I_1 \supset I: I_1 \notin \mathcal{CP}\}$, or equivalently: $\mathcal{MCP} = \{I \in \mathcal{CP} \mid \forall I_1 \supset I: bond(I_1) < minbond\}$.
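To make these two definitions concrete, the sketch below enumerates by brute force the correlated patterns and then keeps those without any correlated proper superset. The toy context, the threshold value and the function names are assumptions chosen for illustration, and the exhaustive enumeration is only practical for very small item sets.

```python
from fractions import Fraction
from itertools import combinations

# Toy context assumed for illustration only.
TRANSACTIONS = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
    {"A", "B", "C", "E"},
]
ITEMS = sorted(set().union(*TRANSACTIONS))
PATTERNS = [frozenset(c) for k in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, k)]

def bond(pattern):
    """Conjunctive support divided by disjunctive support."""
    conj = sum(1 for t in TRANSACTIONS if pattern <= t)
    disj = sum(1 for t in TRANSACTIONS if pattern & t)
    return Fraction(conj, disj)

def correlated_patterns(minbond):
    """All patterns whose bond value reaches minbond."""
    return {p for p in PATTERNS if bond(p) >= minbond}

def maximal_correlated(minbond):
    """Correlated patterns with no correlated proper superset."""
    cp = correlated_patterns(minbond)
    return {p for p in cp if not any(p < q for q in cp)}

def label(patterns):
    """Readable, sorted rendering of a family of patterns."""
    return sorted("".join(sorted(p)) for p in patterns)

cp = correlated_patterns(Fraction(1, 2))
mcp = maximal_correlated(Fraction(1, 2))
print(len(cp))     # 10
print(label(mcp))  # ['AC', 'BCE', 'D']
```

Because the set of correlated patterns is an order ideal, every correlated pattern of this example is a subset of one of the three maximal ones.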
Once the frequency constraint is combined with the correlation constraint, we can distinguish two sets of correlated patterns: the "Frequent correlated patterns" set and the "Rare correlated patterns" set. These two distinct sets are characterized separately in the remainder.
4.3.2 Frequent Correlated Patterns
The set of frequent correlated patterns
Considering the support threshold minsupp and the correlation threshold minbond, the set of frequent correlated patterns, denoted $\mathcal{FCP}$, is equal to: $\mathcal{FCP} = \{I \subseteq \mathcal{I} \mid Supp(\wedge I) \geq minsupp \text{ and } bond(I) \geq minbond\}$.
In fact, the set $\mathcal{FCP}$ is composed of the patterns fulfilling both the correlation and the frequency constraints. A pattern is said to be “Frequent Correlated” if its support is at least equal to the minimal frequency threshold minsupp and its correlation value is at least equal to the minimal correlation threshold minbond. The set $\mathcal{FCP}$ corresponds to the conjunction of two anti-monotonic constraints, namely correlation and frequency. Thus, it induces an order ideal on the itemset lattice.
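The conjunction of the two anti-monotonic constraints can be sketched as a simple filter, followed by a check that the resulting family is closed under taking non-empty subsets. The toy context, the threshold values and the helper names are assumptions made for illustration only.

```python
from fractions import Fraction
from itertools import combinations

# Toy context assumed for illustration only.
TRANSACTIONS = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
    {"A", "B", "C", "E"},
]
ITEMS = sorted(set().union(*TRANSACTIONS))
PATTERNS = [frozenset(c) for k in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, k)]

def conj_supp(pattern):
    return sum(1 for t in TRANSACTIONS if pattern <= t)

def bond(pattern):
    disj = sum(1 for t in TRANSACTIONS if pattern & t)
    return Fraction(conj_supp(pattern), disj)

def frequent_correlated(minsupp, minbond):
    """Patterns satisfying both the frequency and the correlation constraints."""
    return {p for p in PATTERNS
            if conj_supp(p) >= minsupp and bond(p) >= minbond}

def is_order_ideal(family):
    """True if every non-empty proper subset of a member is also a member."""
    return all(frozenset(s) in family
               for p in family
               for k in range(1, len(p))
               for s in combinations(p, k))

fcp = frequent_correlated(4, Fraction(1, 5))
print(sorted("".join(sorted(p)) for p in fcp))  # ['B', 'BE', 'C', 'E']
print(is_order_ideal(fcp))                      # True
```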
4.3.3 Rare Correlated Patterns
The set of rare correlated patterns associated to the bond measure is defined as follows.
The set of rare correlated patterns
Considering the support threshold minsupp and the correlation threshold minbond, the set of rare correlated patterns, denoted $\mathcal{RCP}$, is equal to: $\mathcal{RCP} = \{I \subseteq \mathcal{I} \mid Supp(\wedge I) < minsupp \text{ and } bond(I) \geq minbond\}$.
Consider the extraction context sketched by Table 2.1 (Page 2.1). For minsupp = 4 and minbond = 0.2, the set $\mathcal{RCP}$ consists of the following patterns, where each triplet gives the pattern, its conjunctive support value and its bond value: $\mathcal{RCP}$ = (, 3, ), (, 1, ), (, 2, ), (, 3, ), (, 1, ), (, 2, ), (, 3, ), (, 1, ), (, 3, ), (, 2, ), (, 2, ), (, 1, ), (, 2, ), (, 3, ), (, 2, ). This set, as well as the set of the previous example, is depicted in Figure 4.1. The support shown at the top left of each frame is the conjunctive one. As shown in Figure 4.1, the rare correlated patterns are located below the border induced by the anti-monotonic constraint of correlation and above the border induced by the monotonic constraint of rarity.
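The definition of rare correlated patterns can be sketched as a brute-force filter that lists each pattern with its conjunctive support and its bond value. The five-transaction context below is an assumption made for illustration, as are the helper names. The enumeration is exponential in the number of items and is not the mining strategy proposed in this thesis.

```python
from fractions import Fraction
from itertools import combinations

# Toy context assumed for illustration only.
TRANSACTIONS = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
    {"A", "B", "C", "E"},
]
ITEMS = sorted(set().union(*TRANSACTIONS))
PATTERNS = [frozenset(c) for k in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, k)]

def conj_supp(pattern):
    return sum(1 for t in TRANSACTIONS if pattern <= t)

def bond(pattern):
    disj = sum(1 for t in TRANSACTIONS if pattern & t)
    return Fraction(conj_supp(pattern), disj)

MINSUPP, MINBOND = 4, Fraction(1, 5)

# Correlated patterns, rare patterns, and their intersection.
cp = {p for p in PATTERNS if bond(p) >= MINBOND}
rp = {p for p in PATTERNS if conj_supp(p) < MINSUPP}
rcp = {p for p in PATTERNS
       if conj_supp(p) < MINSUPP and bond(p) >= MINBOND}

# One (pattern, conjunctive support, bond) triplet per line, by size then name.
for p in sorted(rcp, key=lambda p: (len(p), sorted(p))):
    print("".join(sorted(p)), conj_supp(p), bond(p))

print(len(rcp))        # 15
print(rcp == cp & rp)  # True
```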
We deduce from Definition 32 that the set $\mathcal{RCP}$ corresponds to the intersection between the set $\mathcal{CP}$ of correlated patterns and the set $\mathcal{RP}$ of rare patterns, i.e., $\mathcal{RCP} = \mathcal{CP} \cap \mathcal{RP}$. The following proposition derives from this result.
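Let $I \in \mathcal{RCP}$. We have:
Based on the order ideal of the set $\mathcal{CP}$ of correlated patterns, we have: $\forall I_1 \subseteq I$, $I_1 \in \mathcal{CP}$.
Based on the order filter of the set $\mathcal{RP}$ of rare patterns, we have: $\forall I_1 \supseteq I$, $I_1 \in \mathcal{RP}$.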
Proof. The proof follows from the properties induced by the constraints of rarity and correlation. The set $\mathcal{RCP}$, whose elements fulfill the constraint “being a rare correlated pattern”, results from the conjunction of two theories corresponding to constraints of distinct types.
So, the set $\mathcal{RCP}$ is neither an order ideal nor an order filter. The search space of this set is delimited by: (i) the maximal correlated elements which are also rare, i.e. the rare patterns among the set $\mathcal{MCP}$ of maximal correlated patterns (cf. Definition 30); and (ii) the minimal rare elements which are correlated, i.e. the correlated patterns among the set of minimal rare patterns (cf. Definition 5). Therefore, each rare correlated pattern is necessarily included between an element of each of the two aforementioned sets.
Consequently, the localization of these elements is more difficult than the localization of theories corresponding to constraints of the same nature. Indeed, the conjunction of anti-monotonic constraints (resp. monotonic) is an anti-monotonic one (resp. monotonic) [Bonchi and Lucchese, 2006]. For example, the constraint “being a correlated frequent pattern” is anti-monotonic, since it results from the conjunction of two anti-monotonic constraints, namely “being a correlated pattern” and “being a frequent pattern”. This constraint therefore induces an order ideal on the itemset lattice [Ben Younes et al., 2010]. Likewise, the constraint “being a rare and non-correlated pattern” is monotonic, since it results from the conjunction of two monotonic constraints, namely “being a non-correlated pattern” and “being a rare pattern”. This constraint therefore induces an order filter on the itemset lattice.
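The discussion above can be illustrated on a small example: the set of rare correlated patterns is closed neither under subsets nor under supersets, yet each of its elements lies between a correlated minimal rare pattern and a rare maximal correlated pattern. The toy context, the thresholds and the helper names in this sketch are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations

# Toy context assumed for illustration only.
TRANSACTIONS = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
    {"A", "B", "C", "E"},
]
ITEMS = sorted(set().union(*TRANSACTIONS))
PATTERNS = [frozenset(c) for k in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, k)]
MINSUPP, MINBOND = 4, Fraction(1, 5)

def conj_supp(pattern):
    return sum(1 for t in TRANSACTIONS if pattern <= t)

def bond(pattern):
    disj = sum(1 for t in TRANSACTIONS if pattern & t)
    return Fraction(conj_supp(pattern), disj)

rare = {p for p in PATTERNS if conj_supp(p) < MINSUPP}
corr = {p for p in PATTERNS if bond(p) >= MINBOND}
rcp = rare & corr

# Closure tests: subsets (order ideal) and supersets (order filter).
is_ideal = all(q in rcp for p in rcp for q in PATTERNS if q < p)
is_filter = all(q in rcp for p in rcp for q in PATTERNS if p < q)

# Upper border: maximal correlated patterns that are also rare.
upper = {p for p in corr if not any(p < q for q in corr)} & rare
# Lower border: minimal rare patterns that are also correlated.
lower = {p for p in rare if not any(q < p for q in rare)} & corr

# Each rare correlated pattern sits between the two borders.
sandwiched = all(any(lo <= p for lo in lower) and any(p <= up for up in upper)
                 for p in rcp)

def label(patterns):
    return sorted("".join(sorted(p)) for p in patterns)

print(is_ideal, is_filter, sandwiched)  # False False True
print(label(lower))                     # ['A', 'BC', 'CE', 'D']
print(label(upper))                     # ['ABCE', 'ACD']
```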
Given the opposite natures of the two constraints induced by the minimal thresholds of rarity and correlation, respectively minsupp and minbond, the size of the set $\mathcal{RCP}$ of rare correlated patterns varies as shown in the following proposition.
a) Let minsupp$_1$ and minsupp$_2$ be two minimal thresholds of conjunctive support and let $\mathcal{RCP}_{s_1}$ and $\mathcal{RCP}_{s_2}$ be the two sets of patterns associated to each threshold for the same value of minbond.
We have: if minsupp$_1$ $\leq$ minsupp$_2$, then $\mathcal{RCP}_{s_1} \subseteq \mathcal{RCP}_{s_2}$ and consequently $|\mathcal{RCP}_{s_1}| \leq |\mathcal{RCP}_{s_2}|$.
b) Let minbond$_1$ and minbond$_2$ be two minimal thresholds of the bond measure and let $\mathcal{RCP}_{b_1}$ and $\mathcal{RCP}_{b_2}$ be the two sets of patterns associated to each threshold for the same value of minsupp. We have: if minbond$_1$ $\leq$ minbond$_2$, then $\mathcal{RCP}_{b_2} \subseteq \mathcal{RCP}_{b_1}$, consequently $|\mathcal{RCP}_{b_2}| \leq |\mathcal{RCP}_{b_1}|$.
Proof. - The proof of a) derives from the fact that for $I \subseteq \mathcal{I}$, if Supp($\wedge I$) $<$ minsupp$_1$, then Supp($\wedge I$) $<$ minsupp$_2$
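Before turning to the proof, both inclusions can be observed on a small example. The sketch below assumes a toy five-transaction context, arbitrary threshold values and a hypothetical `rare_correlated` helper. It is a numerical illustration and not a substitute for the proof.

```python
from fractions import Fraction
from itertools import combinations

# Toy context assumed for illustration only.
TRANSACTIONS = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
    {"A", "B", "C", "E"},
]
ITEMS = sorted(set().union(*TRANSACTIONS))
PATTERNS = [frozenset(c) for k in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, k)]

def conj_supp(pattern):
    return sum(1 for t in TRANSACTIONS if pattern <= t)

def bond(pattern):
    disj = sum(1 for t in TRANSACTIONS if pattern & t)
    return Fraction(conj_supp(pattern), disj)

def rare_correlated(minsupp, minbond):
    """Patterns that are rare for minsupp and correlated for minbond."""
    return {p for p in PATTERNS
            if conj_supp(p) < minsupp and bond(p) >= minbond}

low_bond, high_bond = Fraction(1, 5), Fraction(1, 2)

# a) Same minbond: raising minsupp can only add rare correlated patterns.
grows_with_minsupp = rare_correlated(2, low_bond) <= rare_correlated(4, low_bond)

# b) Same minsupp: raising minbond can only remove rare correlated patterns.
shrinks_with_minbond = rare_correlated(4, high_bond) <= rare_correlated(4, low_bond)

print(grows_with_minsupp, shrinks_with_minbond)   # True True
print(len(rare_correlated(2, low_bond)),
      len(rare_correlated(4, low_bond)),
      len(rare_correlated(4, high_bond)))         # 4 15 6
```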