# Finding Good Itemsets by Packing Data

###### Abstract

The problem of selecting small groups of itemsets that represent the data well has recently gained a lot of attention. We approach the problem by searching for the itemsets that compress the data efficiently. As a compression technique we use decision trees combined with a refined version of MDL. More formally, assuming that the items are ordered, we create a decision tree for each item that may only depend on the previous items. Our approach allows us to find complex interactions between the attributes, not just co-occurrences of 1s. Further, we present a link between the itemsets and the decision trees and use this link to export the itemsets from the decision trees. In this paper we present two algorithms. The first one is a simple greedy approach that builds a family of itemsets directly from data. The second one, given a collection of candidate itemsets, selects a small subset of these itemsets. Our experiments show that these approaches result in compact and high quality descriptions of the data.

## 1 Introduction

One of the major topics in data mining research is the discovery of interesting patterns in data. Ever since the introduction of frequent itemset mining and association rules [2], the pattern explosion has been acknowledged: at high frequency thresholds only common knowledge is revealed, while at low thresholds prohibitively many patterns are returned.

Part of this problem can be solved by reducing these collections either losslessly or lossily; however, even then the resulting collections are often so large that they cannot be analyzed by hand or even by machine. Recently, it was therefore argued [14] that while the efficiency of the search process has received ample attention, there still exists a strong need for pattern mining approaches that deliver compact, yet high quality, collections of patterns (see Section 6 for a more detailed discussion). Our goal is to identify the family of itemsets that forms the best description of the data. Recent proposals to this end all consider just part of the data, by either only considering co-occurrences [30] or being lossy in nature [20, 5, 7]. In this paper, we present two methods that do describe all interactions in the data. Although different in approach, both methods return small families of itemsets, which are selected to provide high-quality lossless descriptions of the data in terms of local patterns. Importantly, our parameterless methods regard the data symmetrically. That is, we consider not just the 1s in the data, but also the 0s. Therefore, we are able to find patterns that describe all interactions between items in the data, not just co-occurrences.

As a measure of quality for the collection of itemsets we employ the practical variant of Kolmogorov Complexity [23], the Minimum Description Length (MDL) principle [13]. This principle implies that we should do induction through compression. It states that the best model is the model that provides the best compression of the data: it is the model that best captures the regularities of the data, with as little redundancy as possible.

The main idea of our approach is to use decision trees to determine the shortest possible encoding of an attribute, by using the values of already transmitted attributes. For example, let us assume two binary attributes and . Now say that for 90% of the time when the attribute has a value of , the attribute has a value of . If this situation occurs frequently, we recognize this dependency, and include the item in the tree deciding how to encode .

Using such trees allows us to find complex interactions between the items while at the same time MDL provides us with a parameter-free framework for removing fake interactions that are due to the noise in the data. The main outcome of our methods is not the decision trees, but the group of itemsets that form their paths: these are the important patterns in the data since they capture the dependencies between the attributes implied by the decision trees.

The two algorithms we introduce to this end are orthogonal in approach. Our first method builds the encoding decision trees directly from the data; it greedily introduces splits until no split can help to compress the data further. Just as naturally as we can extract itemsets from these trees, we can consider the trees that can be built from a collection of itemsets. That link is exploited by our second method, which tries to select the best itemsets from a larger collection.

Experimental evaluation shows that both methods return small collections of itemsets that provide high quality data descriptions. These sets allow for very short encoding of the data, which inherently shows that the most important patterns in the data are captured. As the number of itemsets is small, we can easily expose the resulting itemsets to further analysis, either by hand or by machine.

The rest of this paper is organized as follows. After covering the preliminaries in Section 2, we discuss how to use decision trees to encode the data succinctly in Section 3. Next, in Section 4, we explain the connection between decision trees and itemsets. Section 5 introduces our method for selecting good itemsets by weighing them through our decision tree encoding. Related work is discussed in Section 6, after which we present the experiments on our methods in Section 7. We round up with discussion and conclusions in Sections 8 and 9.

## 2 Preliminaries and Notation

In this section we introduce preliminaries and notations used in subsequent sections.

A binary dataset is a collection of transactions, binary vectors of length . The th element of a random transaction is represented by an attribute , a Bernoulli random variable. We denote the collection of all the attributes by . An itemset is a subset of attributes. We will often use the dense notation .

Given an itemset and a binary vector of length , we use the notation to express the probability of . If contains only 1s, then we will use the notation , if contains only 0s, then we will use the notation .

Given a binary dataset we define to be an empirical distribution,

We define the frequency of an itemset to be .

In the paper we use the common convention . All logarithms in the paper are of base .

In the subsequent sections we will need some knowledge of graphs. All the graphs in the paper are directed. Given a graph we denote by the set of vertices and by the edges of . A directed graph is said to be acyclic (a DAG) if there is no cycle in the graph. A directed graph is said to be a directed spanning tree if each node (except one special node) has exactly one outgoing edge. The special node has no outgoing edge and is called the sink.

## 3 Packing Binary Data with Decision Trees

In this section we present our model for packing the data and a greedy algorithm for searching for good models.

### 3.1 The Definition of the Model

Our goal in this section is to define a model that is used to transmit a binary dataset from a transmitter to a receiver. We do this by transmitting one transaction at a time, the order of which does not matter. Within a single transaction we transmit the items one at a time.

Assume that we are transmitting an attribute . As the attribute may have two values, we need to have two codes to indicate its value. We define the table in which these two codes are stored to be a coding table. Obviously, the codes need to be optimal, that is, as short as possible. From information theory [10], we have the optimal Shannon codes of length . Here, the optimal code lengths are thus and . We need to transmit the attribute times. The cost of these transmissions is

This is the simplest case of encoding . Note that we are not interested in the actual codes, but only in their lengths: they allow us to determine the complexity of a model.
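As a minimal sketch of this simplest case, the following computes the total code length for one attribute from its column of values, using empirical probabilities for the Shannon code lengths. The function name `trivial_cost` and the list-of-0/1 input format are illustrative assumptions, not part of the original description.

```python
import math

def trivial_cost(column):
    """Bits needed to send every value of one binary attribute
    with a single coding table (optimal Shannon code lengths)."""
    n = len(column)
    ones = sum(column)
    bits = 0.0
    for count in (ones, n - ones):
        if count > 0:  # convention: 0 log 0 = 0
            bits += -count * math.log2(count / n)
    return bits
```

A balanced column costs one bit per transaction, while a constant column costs nothing, which is why only the code lengths (and not the codes themselves) matter for comparing models.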

A more complex and more interesting approach to encode the attribute succinctly is to have several coding tables from which the transmitter chooses one for transmission. Choosing the coding table is done via a decision tree that branches on the values of other attributes in the same transaction. That is, we have a decision tree used for encoding the attribute, in which each leaf node is associated with a different coding table for it. The leaf is selected by testing the values of other attributes within the same transaction.

###### Example 1.

Assume that we have three attributes, , , and and consider the trees given in Figure 1. In Figure 1(a) we have the simplest tree, a simple coding table with no dependencies at all. A more complex tree is given in Figure 1(b) where the transmitter chooses from two coding tables for based on the value of . Similarly, in Figure 1(d) we have three different coding tables for . The choice of the coding table in this case is based on the values of and .

Let us introduce some notation. Let be a tree encoding . We use the notation . We set to be the set of all items used in for choosing the coding table.

To define the cost of transmitting we first define to be the set of all leaves in . Let be a leaf and be the probability of being chosen. Further, is the probability of given that is chosen. We now know that the optimal cost, denoted by , is
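Spelled out, this cost is the entropy of the attribute within each leaf, weighted by the probability of reaching that leaf. The sketch below computes it from per-leaf counts; the function names and the `(n_rows, n_ones)` leaf representation are assumptions made for illustration.

```python
import math

def entropy_bits(p):
    """Binary entropy in bits, with the convention 0 log 0 = 0."""
    return sum(-q * math.log2(q) for q in (p, 1.0 - p) if q > 0)

def tree_bits_per_transaction(leaves):
    """Expected bits to send the attribute in one random transaction.

    `leaves` is a list of (n_rows, n_ones) pairs, one per leaf:
    how many transactions reach the leaf, and in how many of those
    the encoded attribute equals 1.
    """
    total = sum(n for n, _ in leaves)
    bits = 0.0
    for n, ones in leaves:
        if n > 0:
            bits += (n / total) * entropy_bits(ones / n)
    return bits
```

A single-leaf tree reduces to the plain entropy of the attribute, so any split that makes the leaves purer lowers this value.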

###### Example 3.

The number of bits needed by in Figure 1(a) to transmit in a random transaction is

Similarly, if we assume that , the number of bits needed by to transmit in a random transaction is

In order for the receiver to decode the attribute, he must know which coding table was used. Thus, he must be able to use the same decision tree that the transmitter used for encoding it. To ensure this, the receiver must already know the values of the attributes tested by that tree when decoding the attribute. So, the attributes must have an order in which they are sent, and the decision trees may only use the attributes that have already been transmitted.

The aforementioned requirement is easily characterized by the following construction. Let be a directed graph with nodes, each node corresponding to an attribute. The graph contains all the edges of the form where , where is the tree encoding . We call the dependency graph. It is easy to see that there exists an order of the attributes if and only if is a directed acyclic graph (DAG). If constructed from a set of trees is indeed a DAG, we call the set a decision tree model.
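The acyclicity requirement can be checked with a standard topological sort. The sketch below uses Kahn's algorithm, which is a textbook technique chosen here for illustration rather than something the text prescribes; the name `transmission_order` and the dictionary format of `sources` are assumptions.

```python
from collections import deque

def transmission_order(sources):
    """Return a valid sending order, or None if the graph is cyclic.

    `sources[a]` is the set of attributes tested by the tree of `a`;
    each such attribute must be transmitted before `a`. Every
    attribute must appear as a key of `sources`.
    """
    indegree = {a: len(deps) for a, deps in sources.items()}
    dependents = {a: [] for a in sources}
    for a, deps in sources.items():
        for b in deps:
            dependents[b].append(a)
    ready = deque(a for a, d in indegree.items() if d == 0)
    order = []
    while ready:
        b = ready.popleft()
        order.append(b)
        for a in dependents[b]:
            indegree[a] -= 1
            if indegree[a] == 0:
                ready.append(a)
    # If some attribute was never released, it sits on a cycle.
    return order if len(order) == len(sources) else None
```

For instance, a set of trees in which two attributes test each other yields `None`, while a chain of dependencies yields the order in which the attributes can be sent.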

###### Example 4.

Consider a graph given in Figure 2(a) constructed from the trees , , and (Figure 1). We cannot use this combination of trees for encoding since there is a cycle in the graph. On the other hand if we use trees , , and , then the resulting graph (given in Figure 2(b)) is acyclic and thus these trees can be used for the transmission.

### 3.2 Encoding the Data

In order for the receiver to be able to decode the attributes, he must know both the coding tables and the trees. Hence, we need to transmit both of these. First, we cover how the coding tables, the leaves of the decision trees, are transmitted.

To transmit the coding tables we use the concept of Refined MDL [13]. Refined MDL is an improved version of the more traditional two-part MDL (sometimes referred to as the crude MDL). The basic idea of the refined variant is that instead of transmitting the coding tables, the transmitter and the receiver use so-called universal codes. Universal codes are the cornerstone of Refined MDL. As these codes can be derived without any further shared information, this allows for a good weighing of the actual complexity of the data and model, with virtually no overhead. While the practicality of applying such codes depends on the type of the model, our decision trees are particularly well-suited.

These universal codes provide a cost called the complexity of the model. This cost can be calculated as follows: let be a leaf in the decision tree (i.e. coding table), and be the number of transactions for which is used. Then the complexity of this leaf, denoted by , is

In general, there is no known closed formula for the complexity of the model. Hence estimates are usually employed [29]. However, for our tree models we can apply an existing linear-time algorithm that computes the complexity for multinomial models [21]. We should also point out that Refined MDL is asymptotically equivalent to the Bayesian Information Criterion (BIC) if the number of transactions goes to infinity and the number of free parameters stays fixed. However, for moderate numbers of transactions there may be significant differences [13].
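For a binary attribute, the complexity of a leaf can be illustrated with the standard normalized maximum likelihood sum for a Bernoulli model over `m` observations, namely the base-2 logarithm of the sum over `k` of `C(m, k) (k/m)^k ((m-k)/m)^(m-k)`. The sketch below evaluates that sum directly; it is an assumption that this matches the leaf complexity used here, and it is not the algorithm of [21]. The name `leaf_complexity` is hypothetical.

```python
import math

def leaf_complexity(m):
    """log2 of the normalising sum of the Bernoulli NML model for a
    leaf used by `m` transactions (direct evaluation, one term per k)."""
    total = 0.0
    for k in range(m + 1):
        # log of C(m, k) * (k/m)^k * ((m-k)/m)^(m-k), with 0^0 = 1
        log_term = (math.lgamma(m + 1) - math.lgamma(k + 1)
                    - math.lgamma(m - k + 1))
        if k > 0:
            log_term += k * math.log(k / m)
        if k < m:
            log_term += (m - k) * math.log((m - k) / m)
        total += math.exp(log_term)
    return math.log2(total)
```

The terms are computed in log space so that the binomial coefficients do not overflow for leaves covering thousands of transactions.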

Now that the coding tables can be transmitted, we need to know how to transmit the actual tree . To encode the tree we simply transmit the nodes of the tree in a sequence. We use one bit to indicate whether the node is a leaf or an intermediate node. For an intermediate node we additionally use $\log K$ bits, where $K$ is the total number of attributes, to indicate the item that is used for the split.

The combined cost of a tree , denoted by , is

that is, the cost is the number of bits needed to transmit the tree and the attribute in each transaction of .
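The pieces of this combined cost can be put together as in the following sketch, which assumes the three components described above: the bits for the tree shape, the complexities of the leaves, and the bits for the attribute values. All function and parameter names are hypothetical.

```python
import math

def tree_structure_bits(n_internal, n_leaves, n_attributes):
    """Bits to send the tree shape: one leaf/intermediate flag per node,
    plus log2(K) bits naming the split item of each intermediate node."""
    return (n_internal + n_leaves) + n_internal * math.log2(n_attributes)

def combined_cost(n_internal, n_leaves, n_attributes,
                  leaf_complexities, data_bits):
    """Structure bits + leaf complexities + bits for the attribute values.

    `leaf_complexities` holds one complexity value per leaf and
    `data_bits` is the total cost of the attribute over all transactions.
    """
    return (tree_structure_bits(n_internal, n_leaves, n_attributes)
            + sum(leaf_complexities) + data_bits)
```

Under these assumptions, a split only pays off when the reduction in `data_bits` exceeds the extra flag bit, the `log2(K)` bits for the split item, and the added leaf complexity.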

###### Example 5.

Assume that we have a dataset with transactions and items. Assume also that . We know that the complexity of the leaves in this case is . The cost of the tree (Figure 1(c)) is

Given a decision tree model we define the cost . The cost is the number of bits needed to transmit the trees, one for each attribute, and the complete dataset .

We should point out that for data with many items, the term grows and hence the threshold increases for selecting an attribute into any decision tree. This is an interesting behavior, as due to the finite number of transactions, for datasets with many items there is an increased probability that two items will correlate, even though they are independent according to the generative distribution.

### 3.3 Greedy Algorithm

Our goal is to find the decision tree model with the lowest cost. However, since many problems related to decision trees are NP-complete [26], we resort to a greedy heuristic to approximate this model. It is based on the ID3 algorithm.

To fully introduce the algorithm we need some notation: By we mean the simplest tree packing without any other attributes (see Figure 1(a)). Given a tree , a leaf , and an item not occurring in the path from to the root of , we define to be a new tree where is replaced by a non-leaf node testing the value of and having two leaves as the branches.

The algorithm GreedyPack starts with a tree model consisting only of trivial trees. The algorithm finds the tree which saves the most bits by splitting. To ensure that the decision tree model is valid, GreedyPack builds a dependency graph describing the dependencies of the trees and makes sure that is acyclic. The algorithm terminates when no further split can be made that saves any bits.
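The loop described above can be sketched as follows. This is a simplified illustration under several assumptions, not the authors' implementation: leaves are stored as `(path, rows)` pairs instead of an explicit tree, the leaf complexity is the direct Bernoulli NML sum, and the names `complexity`, `leaf_cost`, `depends_on`, and `greedy_pack` are all hypothetical.

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def complexity(m):
    """Bernoulli NML complexity (bits) of a leaf used by m transactions."""
    total = 0.0
    for k in range(m + 1):
        t = math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
        if k > 0:
            t += k * math.log(k / m)
        if k < m:
            t += (m - k) * math.log((m - k) / m)
        total += math.exp(t)
    return math.log2(total)

def leaf_cost(ones, m):
    """Value bits + complexity + one leaf-flag bit for a single leaf."""
    data = sum(-c * math.log2(c / m) for c in (ones, m - ones) if c > 0)
    return data + complexity(m) + 1.0

def depends_on(sources, x, y):
    """True if the tree of x (transitively) uses attribute y."""
    stack, seen = [x], set()
    while stack:
        u = stack.pop()
        if u == y:
            return True
        if u not in seen:
            seen.add(u)
            stack.extend(sources[u])
    return False

def greedy_pack(data):
    """data: list of equal-length 0/1 tuples. Returns (trees, sources)."""
    k = len(data[0])
    split_bits = 1.0 + math.log2(k)  # node flag + id of the split item
    # Trivial model: one single-leaf tree per attribute.
    trees = {a: [({}, list(range(len(data))))] for a in range(k)}
    sources = {a: set() for a in range(k)}
    while True:
        best = None
        for a in range(k):
            for i, (path, rows) in enumerate(trees[a]):
                old = leaf_cost(sum(data[r][a] for r in rows), len(rows))
                for b in range(k):
                    # b must be new on this path and must not create a cycle
                    if b == a or b in path or depends_on(sources, b, a):
                        continue
                    parts = ([], [])
                    for r in rows:
                        parts[data[r][b]].append(r)
                    new = split_bits + sum(
                        leaf_cost(sum(data[r][a] for r in p), len(p))
                        for p in parts)
                    if old - new > 0 and (best is None or old - new > best[0]):
                        best = (old - new, a, i, b, parts)
        if best is None:  # no split saves any bits
            return trees, sources
        _, a, i, b, parts = best
        path, _ = trees[a].pop(i)
        trees[a] += [({**path, b: v}, parts[v]) for v in (0, 1)]
        sources[a].add(b)
```

On data in which one column copies another, the sketch introduces exactly one split, while on independent columns it leaves all trees trivial. It rescans every candidate split in each round, so it is only suitable for small examples.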

## 4 Itemsets and Decision Trees

So far we have discussed how to transmit binary data by using decision trees. In this section we present how to select the itemsets representing the dependencies implied by the decision trees. We will use this link in Section 5. A similar link between itemsets and decision trees is explored in [27] although our setup and goals are different.

Given a leaf $L$ of a tree encoding an attribute $a$, the dependency of $a$ is captured in the coding table of $L$. Hence we are interested in finding itemsets that carry the same information, that is, itemsets from which we can compute the coding table. To derive the codes for the leaf $L$ it is sufficient to compute the probability

$$p(a = 1 \mid L) = \frac{p(a = 1, L)}{p(L)}. \tag{1}$$

Our goal is to express the probabilities on the right side of the equation using itemsets. In order to do that, consider the path from $L$ to its root. Let $P$ be the items along the path which are tested positive. Similarly, let $N$ be the attributes which are tested negative. Writing $\mathit{fr}(X)$ for the frequency of an itemset $X$ and using the inclusion-exclusion principle we see that

$$p(L) = p(P = 1, N = 0) = \sum_{V \subseteq N} (-1)^{|V|}\, \mathit{fr}(P \cup V). \tag{2}$$

We compute in a similar fashion. Let us define for a given leaf to be

Combining Eqs. 1–2 we see that the collection satisfies our goal.

###### Proposition 6.

The coding table associated with the leaf can be computed from the frequencies of .
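The proposition can be illustrated with a short sketch that recovers the coding table of a leaf from itemset frequencies alone, using inclusion-exclusion over the negatively tested items. The data format (tuples of 0/1 indexed by item) and the function names are assumptions made for this example.

```python
from itertools import combinations

def frequency(data, itemset):
    """Fraction of transactions in which every item of `itemset` is 1."""
    return sum(all(t[i] for i in itemset) for t in data) / len(data)

def path_probability(data, positive, negative):
    """p(all `positive` items = 1, all `negative` items = 0), computed
    only from itemset frequencies via inclusion-exclusion."""
    total = 0.0
    for size in range(len(negative) + 1):
        for subset in combinations(negative, size):
            sign = (-1) ** size
            total += sign * frequency(data, tuple(positive) + subset)
    return total

def coding_table(data, a, positive, negative):
    """p(a = 1 | leaf), where the leaf is reached by testing `positive`
    items as 1 and `negative` items as 0. Assumes the leaf is non-empty."""
    leaf = path_probability(data, positive, negative)
    joint = path_probability(data, tuple(positive) + (a,), negative)
    return joint / leaf
```

Only the frequencies of the itemsets formed from the positive items, any subset of the negative items, and optionally the encoded attribute are queried, which is the family of itemsets associated with the leaf.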

###### Example 7.

Let , , and be the leaves (from left to right) of in Figure 1(d). Then the corresponding families of itemsets are , , and .

We can easily see that the family is essentially the smallest family of itemsets from which the coding table can be derived uniquely.

###### Proposition 8.

Let be a family of itemsets. Then there are two data sets, say and , for which but .

Given a tree we define to be . We also define where is a decision tree model.

## 5 Choosing Good Itemsets

The connection between itemsets and decision trees made in the previous section allows us to consider an orthogonal approach to identify good itemsets. Informally, our goal is to construct decision trees from a family of itemsets , selecting the subset from that provides the best compression of the data. More formally, our new approach is as follows: given a downward closed family of itemsets , we build a decision tree model providing a good compression of the data, with .

Before we can describe our main algorithm, we need to introduce some further notation. Firstly, given two trees and not using attribute , we define to be the join tree with as the root node, as the positive branch of , and as the negative branch of . Secondly, to define our search algorithm we need to find the best tree

that is, , returns the best tree for for which the related sets are in and only splits on attributes in .

To compute the optimal tree , we use the exhaustive method (presented originally in [27]) given in Algorithm 2. The algorithm is straightforward: it tests each valid item as the root and recurses itself on both branches.

We can now describe the actual algorithm for constructing decision tree models with a low cost. Our method automatically discovers the order in which the attributes can be transmitted most succinctly. For this, it needs to find sets of attributes for each attribute such that these should be encoded before . The collection should define an acyclic graph and the actual trees are . We use as a shorthand for the total complexity of the best model built from .

We construct the set iteratively. At the beginning of the algorithm we have and we increase the sets one attribute at a time. We allow ourselves to mark the attributes. The idea is that once the attribute is marked, then we are not allowed to augment any longer. At the beginning none of the nodes are marked.

To describe a single step in the algorithm we consider a graph , where represent the attributes and is a special auxiliary node. We start by adding edges having the weight , that is, the cost of the best tree possible from using only the attributes in . Then, for each unmarked node we find out what other extra attribute will help most to encode it succinctly. To do this, we add the edge for each with the weight . Now, let be the minimum directed spanning tree of having as the sink. Consider an unmarked node such that . That node is now the best choice to be fixed, as it helps to encode the data best. We therefore mark attribute and add to each for each ancestor of in . This process is repeated until all attributes are marked. The details of the algorithm are given in Algorithm 3.

The marking of the attributes guarantees that there can be no cycles in . In fact, the marking order also tells us a valid order for transmitting the attributes. Further, as at least one attribute is marked at each step, this guarantees that the algorithm terminates in steps.

Let be the collection of sources. The following proposition tells us that the augmentation performed by SetPack does not compromise the optimality of collections next to .

###### Proposition 9.

Assume the collection of sources . Let be the collection of sources such that and . Let be the collection that Algorithm 3 produces from in a single step. Then there is a collection such that and that .

###### Proof.

Let be the graph constructed by Algorithm 3 for the collection . Construct the following graph : For each such that add the edge . For each add the edge , where . But is a directed spanning tree of . Let be the directed minimum spanning tree returned by the algorithm. Let if and if . Note that defines a valid model and because is optimal we must have . ∎

###### Corollary 10.

Assume that is a family of itemsets having at most 2 items. The algorithm SetPack returns the optimal tree model.

Let us consider the complexity of the algorithms. The algorithm SetPack runs in polynomial time. By using dynamic programming we can show that Generate runs in time. We also tested a faster variant of the algorithm in which the exhaustive search in Generate is replaced by a greedy approach similar to the ID3 algorithm. We call this variant SetPackGreedy.

## 6 Related Work

Finding interesting itemsets is a major research theme in data mining. To this end, many measures have been suggested over time. A classic measure for ranking itemsets is frequency, for which there exist efficient search algorithms [2, 15]. Other measures involve comparing how much an itemset deviates from the independence assumption [3, 1, 11, 4]. In yet other approaches more flexible models are used, such as Bayes networks [18, 17] and Maximum Entropy estimates [24, 31]. Related are also low-entropy sets: itemsets for which the entropy of the data is low [16].

Many of these approaches suffer from the fact that they require a user-defined threshold and further that at low thresholds extremely many itemsets are returned, many of which convey the same information. To address the latter problem we can use closed [28] or non-derivable [6] itemsets that provide a concise representation of the original itemsets. However, these methods deteriorate even under small amounts of noise.

As an alternative to these approaches of describing the pattern set, there are methods that instead pick groups of itemsets that describe the data well. As such, we are not the first to embrace the compression approach to data mining [12]. Recently, Siebes et al. [30] introduced the MDL-based Krimp algorithm to battle the frequent itemset explosion at low support thresholds. It returns small subsets of itemsets that together capture the distribution of the data well. These code tables have been successfully applied in classification [22], measuring the dissimilarity of data [33], and data generation [34]. While these applications show the practicality of the approach, Krimp can only describe the patterns between the items that are present in the dataset. On the other hand, we consider the 0s and the 1s in the data symmetrically and hence we are able to provide more detailed descriptions of the data, including patterns between the presence and absence of items.

More different from our methods are the lossy data description approaches. These strive to describe just part of the data, and as such may overlook important interactions. Summarization [7] is a compression approach that identifies a group of itemsets such that each transaction is summarized by one set with as little loss of information as possible. Yet different are pattern teams [20], which are groups of most-informative length- itemsets [19], selected through an external interestingness measure. As this approach is computationally intensive, the number of team members is typically . Bringmann et al. [5] proposed a similar selection method that can consider larger pattern sets. However, it also requires the user to choose a quality measure to which the pattern set has to be optimized, unlike our parameter-free and lossless method.

Alternatively we can view the approach in this paper as building a global model for data and then selecting the itemsets that describe the model. This approach then allows us to use MDL as a model selection technique. In a related work [32] the authors build decomposable models in order to select a small family of itemsets that model the data well.

The decision trees returned by our methods, and particularly the DAG that they form, have a passing resemblance to Bayes networks [9]. However, as both the model construction and complexity weighing differ strongly, so do the outcomes. To be more precise, in our case the distributions are modeled and weighted via decision trees whereas in the Bayes network setup any distribution is weighted equally. Furthermore, we use the correspondence between the itemsets and the decision trees to output local patterns, as opposed to Bayes networks which are traditionally used as global models.

## 7 Experiments

This section contains the results of the empirical evaluation of our methods using toy and real datasets.

### 7.1 Datasets

For the experimental validation of the two packing strategies we use a group of datasets with strongly differing statistics. From the LUCS/KDD repository [8] we took a number of often used databases to allow for comparison to other methods. To test our methods on real data we used the Mammals presence database and the Helsinki CS-courses dataset. The latter contains the enrollment records of students taking courses at the Department of Computer Science of the University of Helsinki. The mammals dataset consists of the absence/presence of European mammals [25] in geographical areas of 50x50 kilometers. (The full version of the dataset is available for research purposes upon request, http://www.european-mammals.org.) The details of these datasets are provided in Table 1.

Table 1: Details of the datasets.

Dataset | # transactions | # items | % of 1’s |
---|---|---|---|
anneal | 898 | 71 | 20.1 |
breast | 699 | 16 | 62.4 |
courses | 3506 | 98 | 4.6 |
mammals | 2183 | 40 | 46.9 |
mushroom | 8124 | 119 | 19.3 |
nursery | 12960 | 32 | 28.1 |
pageblocks | 5473 | 44 | 25.0 |
tic–tac–toe | 958 | 29 | 34.5 |

### 7.2 Experiments with Toy Datasets

To evaluate whether our method correctly identifies (in)dependencies, we start our experimentation using two artificial datasets of 2000 transactions and 10 items. For both databases, the data is generated per transaction, and the presence of the first item is based on a fair coin toss. For the first database, the other items are similarly generated. However, for the second database, the presence of an item is 90% dependent on the previous item. As such, both datasets have item densities of about 50%.
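A generator for two such toy datasets could look like the sketch below. The reading of "90% dependent" as "copies the previous item with probability 0.9 and takes the opposite value otherwise" is an assumption, as are the function name `toy_dataset`, its parameters, and the fixed seed.

```python
import random

def toy_dataset(n_rows=2000, n_items=10, copy_prob=None, seed=0):
    """Generate binary transactions as tuples of 0/1.

    copy_prob=None: every item is an independent fair coin toss.
    copy_prob=0.9: the first item is a fair coin toss and each later
    item copies the previous item with probability 0.9, taking the
    opposite value otherwise (assumed reading of "90% dependent").
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_rows):
        row = [rng.randint(0, 1)]
        for _ in range(n_items - 1):
            if copy_prob is None:
                row.append(rng.randint(0, 1))
            else:
                keep = rng.random() < copy_prob
                row.append(row[-1] if keep else 1 - row[-1])
        data.append(tuple(row))
    return data
```

Under this reading each dependent item carries about 0.47 bits of entropy given its predecessor, so an ideal encoder would need roughly half the bits of the independent case.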

Table 2: Compression results of GreedyPack and Krimp.

Dataset | GreedyPack initial cost (bits) | GreedyPack cost (bits) | GreedyPack ratio (%) | GreedyPack # trees | GreedyPack # sets | Krimp min-sup | Krimp # sets | Krimp # bits | Krimp ratio (%) |
---|---|---|---|---|---|---|---|---|---|
anneal | 23104 | 12342 | 53.4 | 71 | 1203 | 1 | 102 | 22154 | 34.6 |
breast | 8099 | 2998 | 37.0 | 16 | 17 | 1 | 30 | 4613 | 16.9 |
courses | 76326 | 61685 | 80.8 | 98 | 1230 | 2 | 148 | 71019 | 79.3 |
mammals | 78044 | 50068 | 64.2 | 40 | 845 | 200 | 254 | 90192 | 42.3 |
mushroom | 442062 | 115347 | 26.1 | 119 | 999 | 1 | 424 | 231877 | 20.9 |
nursery | 337477 | 180803 | 53.6 | 32 | 3409 | 1 | 260 | 258898 | 45.5 |
pageblocks | 15280 | 7611 | 49.8 | 44 | 219 | 1 | 53 | 10911 | 5.0 |
tic–tac–toe | 25123 | 14137 | 56.3 | 29 | 619 | 1 | 162 | 28812 | 62.3 |

If we apply GreedyPack, our greedy decision tree building method, to these datasets we see that it is unable to compress the independent database at all. In contrast, the dependently generated dataset can be compressed into only 50% of the original number of bits. Inspection of the resulting itemsets shows that the resulting model correctly describes the dependencies in detail: The resulting itemsets are .

### 7.3 The Greedy Method

Recall that our goal is to find high quality descriptions of the data. Following the MDL principle, the quality of the found descriptions can objectively be measured by the compression of the data. We present the compressed sizes for GreedyPack in Table 2. The encoding costs include the size of the encoded data and the decision trees. The initial costs, as denoted by , are those of encoding the data using naïve single-node TrivialTrees. Each of these experiments required 1–10 seconds runtime, with an exception of s for mushroom.

From Table 2, we see that all models returned by GreedyPack strongly reduce the number of bits required to describe the data; this implicitly shows that good models are returned. The quality can be gauged by taking the compression ratios into account. In general, our greedy method reduces the number of bits to roughly half of what the independent model requires, with ratios ranging from 26.1% for mushroom to 80.8% for courses. As two specific examples of the found dependencies, in the courses dataset the course Data Mining was packed using Machine Learning, Software Engineering, Information Retrieval Methods and Data Warehouses. Likewise, AI and Machine Learning were used to pack the Robotics course.

As discussed above, our approach and the Krimp [30] algorithm differ starkly in what part of the data is considered. However, as both methods use compression and result in good itemsets, it is insightful to compare the algorithms. We allow Krimp to compress as well as possible, and thus consider candidates down to min-sup thresholds as low as feasible.

Let us compare the outcomes of the two methods. For Krimp these are itemsets, for ours it is the combination of the decision trees and the related itemsets. We see that Krimp typically returns fewer itemsets than GreedyPack. However, our method returns itemsets that describe interactions between both present and absent items.

Next, we observe that especially the initial Krimp compression requires many more bits than ours, and as such Krimp attains better compression ratios. However, if we disregard the ratios and look at the raw number of bits the two methods require, we see that Krimp requires more bits, up to twice as many, to describe only the 1’s in the data than GreedyPack does to represent all of the data.

### 7.4 Validation through Classification

To further assess the quality of our models we use a simple classification scheme [22]. First, we split the training database into separate class-databases. We pack each of these. Next, the class label of each unseen transaction is assigned according to the model that compresses it best.
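The assignment rule can be sketched as follows. To keep the example self-contained, each class model is a smoothed independent-attribute model, which is a stand-in assumption and not the packed decision tree model; the names `fit_independent_model`, `code_length`, and `classify` are hypothetical.

```python
import math

def fit_independent_model(rows):
    """Stand-in class model: smoothed probability of a 1 for each item."""
    n = len(rows)
    return [(sum(col) + 1) / (n + 2) for col in zip(*rows)]

def code_length(model, row):
    """Bits needed to encode `row` under the stand-in model."""
    return sum(-math.log2(p if v else 1 - p) for p, v in zip(model, row))

def classify(models, row):
    """Assign the class whose model gives the shortest encoding."""
    return min(models, key=lambda label: code_length(models[label], row))
```

Any model that can report an encoded length for a single transaction could be plugged into `classify` in place of the stand-in.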

We ran these experiments for three databases, viz. mushroom, breast and anneal. A random 90% of the data was used to train the models, leaving 10% to test the accuracy on. The accuracy scores we noted, resp. 100%, 98.0% and 93.4%, are fully comparable to (and for the second, even better than) the classifiers considered in [22].

### 7.5 Choosing Good Itemsets

Table 3: Results of selecting itemsets from frequent-itemset candidates with SetPack, SetPackGreedy, and Krimp.

Dataset | Candidates min-sup | Candidates # sets | SetPack cost (bits) | SetPack ratio (%) | SetPack # sets | SetPackGreedy cost (bits) | SetPackGreedy ratio (%) | SetPackGreedy # sets | Krimp # bits | Krimp # sets |
---|---|---|---|---|---|---|---|---|---|---|
anneal | 175 | 8837 | 20777 | 89.9 | 103 | 20781 | 89.9 | 69 | 31196 | 53 |
breast | 1 | 9920 | 5175 | 63.7 | 42 | 5172 | 63.9 | 49 | 4613 | 30 |
courses | 55 | 5030 | 64835 | 84.9 | 268 | 64937 | 85.1 | 262 | 73287 | 93 |
mammals | 700 | 7169 | 65091 | 83.4 | 427 | 65622 | 84.1 | 382 | 124737 | 125 |
mushroom | 1000 | 123277 | 313428 | 70.9 | 636 | 262942 | 59.5 | 1225 | 474240 | 140 |
nursery | 50 | 25777 | 314081 | 93.0 | 276 | 314295 | 93.1 | 218 | 265064 | 225 |
pageblocks | 1 | 63599 | 11961 | 78.3 | 92 | 11967 | 78.3 | 95 | 10911 | 53 |
tic–tac–toe | 7 | 34019 | 23118 | 92.0 | 620 | 23616 | 94.0 | 277 | 28957 | 159 |

In this subsection we evaluate SetPack, our itemset selection algorithm. Recall that this algorithm selects itemsets such that they allow for building succinct encoding decision trees. The difference with GreedyPack is that in this setup the resulting itemsets should be a subset of a given candidate family. Here, we consider frequent itemsets as candidates. We set the support threshold such that the experiments with SetPack were finished within –2 hours, with an exception of 23 hours for considering the large candidate family for mushroom. For comparison we use the same candidates for Krimp. We also compare to SetPackGreedy, which required 1–12 minutes, 7 minutes typically, with an exception of hours for mushroom.

Comparing the results of this experiment (Table 3) with the results of GreedyPack in the previous experiment, we see that the selection process is stricter: now even fewer itemsets are regarded as interesting enough. Large candidate collections are strongly reduced in number: up to three orders of magnitude. On the other hand, the compression ratios are still very good. The reason that GreedyPack produces smaller compression ratios is that it is allowed to consider any itemset.
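The size of this reduction can be checked directly from Table 3. The short sketch below copies the candidate counts and the number of sets selected by SetPack from the table and expresses each reduction in orders of magnitude; the helper name `orders_of_magnitude` is introduced here for illustration only.

```python
import math

# (number of candidate itemsets, number of itemsets selected by SetPack),
# copied from Table 3.
table3 = {
    "anneal": (8837, 103),
    "breast": (9920, 42),
    "courses": (5030, 268),
    "mammals": (7169, 427),
    "mushroom": (123277, 636),
    "nursery": (25777, 276),
    "pageblocks": (63599, 92),
    "tic-tac-toe": (34019, 620),
}


def orders_of_magnitude(n_candidates, n_selected):
    """Base-10 logarithm of the reduction factor candidates / selected."""
    return math.log10(n_candidates / n_selected)


reductions = {name: orders_of_magnitude(c, s) for name, (c, s) in table3.items()}

# Print the datasets from strongest to weakest reduction.
for name, r in sorted(reductions.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {r:.2f}")
```

For SetPack the strongest reduction is on pageblocks, a factor of roughly 690, which is a little under three orders of magnitude; every dataset is reduced by more than one order.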

Further, the fact alone that even with this very strict selection the compression ratios are generally well below 90% shows that these few sets are indeed of high importance to describing the major interactions in the data.

If we compare the number of selected sets to Krimp, we see that our method returns a number of itemsets of the same order of magnitude. On most of the datasets in Table 3, our descriptions require fewer bits than those found by Krimp. As such, ours are a better approximation of the Kolmogorov complexity of the data.

The outcomes of SetPack and SetPackGreedy are very much alike; this holds for both the obtained compression and the number of returned itemsets. However, the greedy search of SetPackGreedy allows for much shorter running times.

## 8 Discussion

The experiments validate the quality of the models returned by our methods. The models correctly detect dependencies in the data while ignoring independencies. Only a small number of itemsets are returned, and these are shown to provide strong compression of the data. By the MDL principle we then know that these describe all important regularities in the data distribution in detail, efficiently and without redundancy. This claim is further supported by the high classification accuracies our models achieve.

The GreedyPack algorithm generally uses more itemsets and obtains better packing ratios than SetPack. While GreedyPack is allowed to use any itemset, SetPack may only use frequent itemsets. This suggests that we may be able to achieve better ratios if we use different candidates, for example, low-entropy sets [16].

The running times of the experiments reported in this work range from seconds to hours and depend mainly on the number of attributes and rows of the datasets. The exhaustive version, SetPack, may be slow on very large candidate sets; the greedy version, SetPackGreedy, however, can handle even such families well. Considering that our current implementation is rather naïve and that both methods are easily parallelized, both GreedyPack and SetPackGreedy are suited for the analysis of large databases.

The main outcomes of our models are the itemsets that identify the encoding paths. However, the decision trees from which these sets are extracted are also of interest, as they provide an easily interpretable view of the major interactions in the data. Further, simply treating the attributes used in such a tree as an itemset also allows for easy inspection of the main associations.

In this work we employ the MDL criterion to identify the optimal model. Alternatively, one could consider using either BIC or AIC, both of which can easily be applied to judge between our decision tree-based models.
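For reference, the standard definitions of these two criteria are given below; in both cases the model with the lowest score is preferred. The symbols are not taken from this paper: $k$ denotes the number of free parameters of a model, $n$ the number of transactions, and $\hat{L}$ the maximized likelihood of the data under the model. How $k$ would be counted for the decision tree-based models is left open here.

$$
\mathrm{BIC} = k \ln n - 2 \ln \hat{L}, \qquad \mathrm{AIC} = 2k - 2 \ln \hat{L}
$$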

## 9 Conclusions

In this paper we presented two methods that find compact sets of high quality itemsets. Both methods employ compression to select the group of patterns that best describes all interactions in the data. That is, the data is regarded symmetrically, and thus both the 0s and 1s are taken into account in these descriptions. Experimentation with our methods showed that high quality models are returned. Their compact size, typically tens to thousands of itemsets, allows for easy further analysis of the found interactions.

## References

- [1] C. C. Aggarwal and P. S. Yu. A new framework for itemset generation. In Proceedings of the ACM SIGACT-SIGMOD-SIGART symposium on Principles of Database Systems (PODS), pages 18–24. ACM Press, 1998.
- [2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI, 1996.
- [3] S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In ACM SIGMOD International Conference on Management of Data, pages 265–276. ACM Press, 1997.
- [4] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In ACM SIGMOD International Conference on Management of Data, pages 255–264, 1997.
- [5] B. Bringmann and A. Zimmermann. The chosen few: On identifying valuable patterns. In IEEE International Conference on Data Mining (ICDM), pages 63–72, 2007.
- [6] T. Calders and B. Goethals. Mining all non-derivable frequent itemsets. In Proceedings of the 6th European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pages 74–85, 2002.
- [7] V. Chandola and V. Kumar. Summarization - compressing data into an informative representation. In Proceedings of the IEEE Conference on Data Mining, pages 98–105, 2005.
- [8] F. Coenen. The LUCS-KDD discretised/normalised ARM and CARM data library. 2003.
- [9] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992.
- [10] T. Cover and J. Thomas. Elements of Information Theory, 2nd ed. John Wiley and Sons, 2006.
- [11] W. DuMouchel and D. Pregibon. Empirical Bayes screening for multi-item associations. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 67–76, 2001.
- [12] C. Faloutsos and V. Megalooikonomou. On data mining, compression and Kolmogorov complexity. In Data Mining and Knowledge Discovery, volume 15, pages 3–20. Springer, 2007.
- [13] P. D. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.
- [14] J. Han, H. Cheng, D. Xin, and X. Yan. Frequent pattern mining: Current status and future directions. In Data Mining and Knowledge Discovery, volume 15. Springer, 2007.
- [15] J. Han and J. Pei. Mining frequent patterns by pattern-growth: methodology and implications. SIGKDD Explorations Newsletter, 2(2):14–20, 2000.
- [16] H. Heikinheimo, E. Hinkkanen, H. Mannila, T. Mielikäinen, and J. K. Seppänen. Finding low-entropy sets and trees from binary data. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 350–359, 2007.
- [17] S. Jaroszewicz and T. Scheffer. Fast discovery of unexpected patterns in data, relative to a Bayesian network. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 118–127, 2005.
- [18] S. Jaroszewicz and D. A. Simovici. Interestingness of frequent itemsets using Bayesian networks as background knowledge. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 178–186, 2004.
- [19] A. J. Knobbe and E. K. Y. Ho. Maximally informative k-itemsets and their efficient discovery. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 237–244, 2006.
- [20] A. J. Knobbe and E. K. Y. Ho. Pattern teams. In Proceedings of the 10th European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pages 577–584, 2006.
- [21] P. Kontkanen and P. Myllymäki. A linear-time algorithm for computing the multinomial stochastic complexity. Information Processing Letters, 103(6):227–233, 2007.
- [22] M. van Leeuwen, J. Vreeken, and A. Siebes. Compression picks the item sets that matter. In Proceedings of the 10th European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pages 585–592, 2006.
- [23] M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and its Applications. Springer-Verlag, 1993.
- [24] R. Meo. Theory of dependence values. ACM Trans. Database Syst., 25(3):380–406, 2000.
- [25] A. J. Mitchell-Jones, G. Amori, W. Bogdanowicz, B. Krystufek, P. J. H. Reijnders, F. Spitzenberger, M. Stubbe, J. B. M. Thissen, V. Vohralik, and J. Zima. The Atlas of European Mammals. Academic Press, 1999.
- [26] K. V. S. Murthy. On growing better decision trees from data. PhD thesis, Johns Hopkins Univ., Baltimore, 1996.
- [27] S. Nijssen and É. Fromont. Mining optimal decision trees from itemset lattices. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 530–539, 2007.
- [28] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. Lecture Notes in Computer Science, 1540:398–416, 1999.
- [29] J. Rissanen. Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42(1):40–47, 1996.
- [30] A. Siebes, J. Vreeken, and M. van Leeuwen. Item sets that compress. In Proceedings of the SIAM Conference on Data Mining, pages 393–404, 2006.
- [31] N. Tatti. Maximum entropy based significance of itemsets. Knowledge and Information Systems (KAIS), 2008. Accepted for publication.
- [32] N. Tatti and H. Heikinheimo. Decomposable families of itemsets. In Proceedings of the 12th European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2008.
- [33] J. Vreeken, M. van Leeuwen, and A. Siebes. Characterising the difference. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 765–774, 2007.
- [34] J. Vreeken, M. van Leeuwen, and A. Siebes. Preserving privacy through data generation. In Proceedings of the IEEE Conference on Data Mining, pages 685–690, 2007.