Using Background Knowledge to Rank Itemsets
Abstract
Assessing the quality of discovered results is an important open problem in data mining. Such assessment is particularly vital when mining itemsets, since commonly many of the discovered patterns can be easily explained by background knowledge. The simplest approach to screen uninteresting patterns is to compare the observed frequency against the independence model. Since the parameters for the independence model are the column margins, we can view such screening as a way of using the column margins as background knowledge.
In this paper we study techniques for more flexible approaches for infusing background knowledge. Namely, we show that we can efficiently use additional knowledge such as row margins, lazarus counts, and bounds of ones. We demonstrate that these statistics describe forms of data that occur in practice and have been studied in data mining.
To infuse the information efficiently we use a maximum entropy approach. In its general setting, solving a maximum entropy model is infeasible, but we demonstrate that for our setting it can be solved in polynomial time. Experiments show that more sophisticated models fit the data better and that using more information improves the frequency prediction of itemsets.
1 Introduction
Discovering interesting itemsets from binary data is one of the most studied branches of pattern mining. The most common way of defining the interestingness of an itemset is by its frequency, the fraction of transactions in which all its items co-occur. This measure has a significant computational advantage, since frequent itemsets can be discovered using level-wise or depth-first search strategies [2, 21]. The drawback of frequency is that we cannot infuse any background knowledge into the ranking. For example, if we know that two items a and b occur often, then we should expect that the itemset ab also occurs relatively often.
Many approaches have been suggested in order to infuse background knowledge into ranking itemsets. The most common approach is to compare the observed frequency against the independence model. This approach has the advantage that we can easily compute the estimate and that the background knowledge is easily understandable. The downside of the independence model is that it contains relatively little information. For example, if we know that most of the data points contain a small number of ones, then we should infuse this information when ranking patterns. For a more detailed discussion see Section 6.
Assessing the quality of patterns can be seen as a part of the general idea where we are required to test whether a data mining result is statistically significant with respect to some background knowledge (see [10] as an example of such a framework). However, the need for such assessment is especially important in pattern mining due to two major problems: Firstly, the number of discovered patterns is usually large, so a screening/ranking process is needed. Secondly, many of the discovered patterns reflect already known information, so we need to incorporate this information such that we can remove trivial results.
Our goal in this work is to study how we can infuse background knowledge into pattern mining efficiently. We will base our approach on building a statistical model from the given knowledge. We set the following natural goals for our work:

- The background knowledge should be simple to understand.
- We should be able to infer the model from the data efficiently.
- We should be able to compute expected frequencies for itemsets efficiently.
While these goals seem simple, they are in fact quite strict. For example, consider modeling attribute dependencies with a Bayesian network. Inferring the expected frequency of an itemset from a Bayesian network is done using a message-passing algorithm, and is not guaranteed to be computationally feasible [5]. Moreover, understanding the parent-child relationships in a Bayesian network can be difficult.
In this work we will consider the following simple statistics: column margins, row margins, number of zeroes between ones (lazarus counts), and the boundaries of ones. We will use these statistics individually, but also consider different combinations. While these are simple statistics, we will show that they describe many specific dataset structures, such as banded datasets or nested datasets.
We will use these statistics and the maximum entropy principle to build a global model. In a general setting inferring such a model is an infeasible problem. However, we will demonstrate that for our statistics inferring the model can be done in polynomial time. Once this model is discovered we can use it to assess the quality of an itemset by comparing the observed frequency against the frequency expected by the model. The more the observed frequency differs from the expected value, the more significant is the pattern.
We should point out that while our main motivation is to assess itemsets, the discovered background model is a true statistical global model and is useful for other purposes, such as model selection or data generation.
2 Statistics as Background Knowledge
In this section we introduce several count statistics for transactional data. We also indicate for what kinds of datasets they are useful. We begin by presenting preliminary notations that will be used throughout the rest of the paper.
A binary dataset D is a collection of transactions, binary vectors of length K. The set of all possible transactions is written as Ω = {0, 1}^K. The i-th element of a random transaction is represented by an attribute a_i, a Bernoulli random variable. We denote the collection of all the attributes by A = {a_1, …, a_K}. An itemset X = {x_1, …, x_L} is a subset of attributes. We will often use the dense notation X = x_1 ⋯ x_L. Given an itemset X and a binary vector v of length L, we use the notation p(X = v) to express the probability p(x_1 = v_1, …, x_L = v_L). If v contains only 1s, then we will use the notation p(X = 1).
Given a binary dataset D we define q_D to be its empirical distribution,

q_D(t) = |{t' ∈ D : t' = t}| / |D|.
We define the frequency of an itemset X to be fr(X) = q_D(X = 1). A statistic S is a function mapping a transaction to an integer. All our statistics will be of the form q_D(S(t) = k), that is, our background knowledge will be the fraction of transactions in the data for which S(t) = k.
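As a concrete illustration, the empirical frequency of an itemset and the empirical fraction of a statistic can be computed directly from the data. This is a minimal sketch; the names (fr, statistic_fraction) and the toy dataset are illustrative, not from the paper.

```python
# Toy illustration of the definitions above; names and data are illustrative.

def fr(itemset, data):
    """Empirical frequency: fraction of transactions containing all items."""
    return sum(all(t[i] for i in itemset) for t in data) / len(data)

def statistic_fraction(S, data, k):
    """Fraction of transactions t for which S(t) == k."""
    return sum(1 for t in data if S(t) == k) / len(data)

data = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 0)]
print(fr({0, 1}, data))                  # items 0 and 1 co-occur in 2 of 4 rows
print(statistic_fraction(sum, data, 2))  # exactly one row has two ones
```

Here the row margin statistic is simply Python's built-in `sum` over a 0/1 tuple.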
Column margins
The simplest statistics one can consider are the column margins, or item probabilities. These probabilities can be used to define an independence model. This model has been used before in other works to estimate itemset frequencies [4, 1]. It has the advantage of being computationally fast and easy to interpret. However, it is a rather simplistic model and few datasets actually satisfy the independence assumption. We will include the item probabilities in all of our models.
Row margins
We define |t| to be a random variable representing the number of ones in a random transaction t. We immediately see that |t| obtains integer values between 0 and K. Consequently, q_D(|t| = k) is the probability of a random transaction having k ones. Given a concrete transaction t, we will also denote by |t| the number of ones in t, i.e. |t| = Σ_{i=1}^{K} t_i.
The use of row margins (in conjunction with column margins) has been proposed before by Gionis et al. [9] to assess the significance of (among others) frequent itemsets. However, their approach differs from ours (see Section 6). It was shown that for datasets with a very skewed row margin distribution, most frequent itemsets, clusterings, correlations, etc. can be explained entirely by the row and column margins alone. Supermarket basket data falls into this category, since the transactions there typically contain only a handful of items.
Lazarus counts
A lazarus event in a transaction is defined as the occurrence of a zero within a string of ones. This requires that a total order is specified on the attributes a_1, …, a_K. For simplicity, we assume that this order is a_1 < a_2 < ⋯ < a_K. Let t be a transaction, then the lazarus count of t is defined as

laz(t) = |{i : t_i = 0 and t_j = t_k = 1 for some j < i < k}|.
Clearly, the lazarus count of a transaction ranges from 0 to K − 2. If the lazarus count of t is 0, then t is said to satisfy the consecutive-ones property.
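The lazarus count can be computed by scanning the span between the first and the last one of a transaction. A minimal sketch; the function name is illustrative.

```python
def lazarus_count(t):
    """Number of zeros strictly between the first and last one of t."""
    ones = [i for i, x in enumerate(t) if x == 1]
    if not ones:
        return 0
    return sum(1 for i in range(ones[0], ones[-1] + 1) if t[i] == 0)

print(lazarus_count([0, 1, 0, 0, 1, 1]))  # two zeros inside the span of ones
print(lazarus_count([0, 1, 1, 1, 0, 0]))  # consecutive-ones property: count 0
```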
A specific case of datasets with the consecutive-ones property are banded datasets. These are datasets whose rows and columns can be permuted such that the non-zeroes form a staircase pattern. The properties of such banded binary matrices have been studied by Garriga et al., who presented algorithms to determine the minimum number of bit flips it would take for a dataset to be (fully) banded [8]. A banded dataset can be characterized by the following statistics: the major part of the transactions will have low lazarus counts, and typically the row margins are low as well.
Transaction bounds
For certain types of data it can be useful to see at which positions the ones in the transactions of a dataset begin and end. Given a transaction t, we define the first and last statistics as

first(t) = min{i : t_i = 1} and last(t) = max{i : t_i = 1}.
If t contains only 0s, then we define first(t) = last(t) = 0. A dataset is called nested if for each pair of rows, one is always a subset of the other [14]. For such data, the rows and columns can be permuted such that the rows have consecutive ones starting from the first column. Thus, assuming the permutation is done, transactions in nested datasets will have a low lazarus count and a low left bound first(t). Nestedness has been studied extensively in the field of ecology for absence/presence data of species (see, for example, [14] for more details).
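The bounds statistics can be sketched in the same fashion; the convention for the all-zeros transaction is the one stated above, and the names are illustrative.

```python
def bounds(t):
    """(first, last) positions (1-based) of the ones in t; (0, 0) if all zeros."""
    ones = [i + 1 for i, x in enumerate(t) if x == 1]
    if not ones:
        return (0, 0)   # convention for the all-zeros transaction
    return (ones[0], ones[-1])

print(bounds([0, 1, 1, 0, 1]))  # ones begin at position 2 and end at 5
print(bounds([0, 0, 0]))        # all-zeros transaction
```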
3 Maximum Entropy Model
The independence model can be seen as the simplest model that we can infer from binary data. This model has many nice computational properties: learning the model is trivial and making queries on this model is a simple and fast procedure. Moreover, a fundamental theorem shows that the independence model is the distribution maximizing the entropy among all the distributions that have the same frequencies for the individual attributes.
Our goal is to define a more flexible model. We require that we should be able to infer such a model efficiently, and that we should be able to make queries on it. In order to do that we will use the maximum entropy approach.
3.1 Definition of the Maximum Entropy Model
Say that we have computed a set of statistics from a dataset and we wish to build a distribution that satisfies these same statistics. The maximum entropy approach gives us the means to do that.
More formally, assume that we are given a function S mapping a transaction to an integer value. For example, this function can be S(t) = |t|, the number of ones in a transaction. Assume that we are given a dataset D with K attributes and let q_D be its empirical distribution. We associate a statistic q_S(k) with S for any k, namely the proportion of transactions in D for which S(t) = k, that is, q_S(k) = q_D(S(t) = k).
In addition to the statistics q_S we always wish to use the margins of the individual attributes, that is, the probability of an attribute having a value of 1 in a random transaction. We denote these column margins by q_M(i) = q_D(a_i = 1), for i = 1, …, K.
The maximum entropy model p* is derived from the statistics q_S and q_M. The distribution p* is the unique distribution maximizing the entropy among the distributions having the statistics q_S and q_M. To be more precise, we define P to be the set of all distributions having the statistics q_S and q_M,

P = {p : p(S(t) = k) = q_S(k) for all k, and p(a_i = 1) = q_M(i) for all i}.

The maximum entropy distribution p* maximizes the entropy in P,

p* = arg max_{p ∈ P} H(p).   (1)

Note that P and consequently p* depend on the statistics q_S and q_M, yet we have omitted them from the notation for the sake of clarity.
Our next step is to state a classic result showing that the maximum entropy distribution has a specific form. In order to do so we begin by defining indicator functions T_i : Ω → {0, 1} such that T_i(t) = t_i, that is, T_i indicates whether the attribute a_i has a value of 1 in a transaction t. We also define indicators U_k such that U_k(t) = 1 if and only if S(t) = k.
Theorem 3.1 (Theorem 3.1 in [6])
The maximum entropy distribution given in Eq. 1 has the form

p*(t) = u_0 ∏_{i=1}^{K} u_i^{T_i(t)} ∏_k v_k^{U_k(t)} for t ∉ Z, and p*(t) = 0 for t ∈ Z,   (2)

for some specific set of parameters u_0, u_1, …, u_K, v_k, and a set Z ⊆ Ω. Moreover, a transaction t ∈ Z if and only if p(t) = 0 for all distributions p in P.
It turns out that we can represent our maximum entropy model as a mixture model. Such a representation will be fruitful when we are solving and querying the model. To see this, let u_i and v_k be the parameters in Eq. 2. We define a distribution u for which we have

u(t) = Z_u^{-1} ∏_{i=1}^{K} u_i^{T_i(t)},   (3)

where Z_u is a normalization constant so that u is a proper distribution. Since u(t) depends only on the indicators T_i(t) we see that u is actually an independence distribution. This allows us to replace the parameters u_i with the more natural parameters r_i = u(a_i = 1). We should stress that r_i is not necessarily equal to q_M(i).
Our next step is to consider the parameters v_k. First, we define a distribution b over the values of S,

b(k) = Z_b^{-1} v_k u(S(t) = k),   (4)

where Z_b is a normalization constant such that b is a proper distribution. We can now express the maximum entropy model using u and b. By rearranging the terms in Eq. 2 we have

p*(t) = b(k) u(t | S(t) = k),   (5)

where k = S(t). The right side of the equation is a mixture model. According to this model, we first sample an integer, say k, from b. Once k is selected we sample the actual transaction from u(t | S(t) = k).
We should point out that we have some redundancy in the definition of b. Namely, we could divide each v_k by u(S(t) = k) and consequently remove the factor u(S(t) = k) from the equation. However, keeping it in the equation proves to be handy later on.
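The mixture reading of Eq. 5 suggests a simple two-stage sampler: draw a value k of the statistic from b, then draw a transaction from the independence distribution u conditioned on S(t) = k. The sketch below realizes the conditioning by rejection sampling, which is one possible (if naive) approach; the distributions b and r are illustrative placeholders, not fitted parameters.

```python
import random

def sample(b, r, S):
    """Draw k ~ b, then t ~ u( . | S(t) = k) by rejection sampling."""
    k = random.choices(range(len(b)), weights=b)[0]
    while True:
        t = [1 if random.random() < p else 0 for p in r]
        if S(t) == k:
            return t

b = [0.2, 0.5, 0.3]    # distribution over the statistic values 0, 1, 2
r = [0.4, 0.6]         # item probabilities of the independence part u
t = sample(b, r, sum)  # here S is the row margin, i.e. the number of ones
print(t)
```

Rejection sampling is only practical when u(S(t) = k) is not too small; an exact conditional sampler would follow the same dynamic programming used later for ComputeProb.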
3.2 Using Multiple Statistics
We can generalize our model by using multiple statistics together. By combining statistics we can construct more detailed models, which are based on these relatively simple statistics. In our case, building and querying the model remains polynomial in the number of attributes when combining multiple statistics.
We distinguish two alternatives to combine count statistics. Assume, for the sake of exposition, that we have two statistics S_1 and S_2. Then we can either consider using the joint probabilities p(S_1(t) = k_1, S_2(t) = k_2); or, we may use the marginal probabilities p(S_1(t) = k_1) and p(S_2(t) = k_2) separately. The joint case can be easily reduced to the case of a single statistic by considering S(t) = (S_1(t), S_2(t)). Solving and querying the model in the marginal case can be done using the same techniques and the same time complexity as with the joint case.
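One way to realize the joint reduction above in code is to pair the two statistic values into a single integer. This is a hypothetical sketch: the pairing scheme and the bound max2 on the second statistic are illustrative implementation choices, not prescribed by the text.

```python
def joint(S1, S2, max2):
    """Combine two statistics into one by pairing their values into an integer."""
    return lambda t: S1(t) * (max2 + 1) + S2(t)

# pair the row margin with the value of the first item (whose maximum is 1)
S = joint(sum, lambda t: t[0], 1)
print(S([1, 0, 1]))  # row margin 2, first item 1: encoded as 2 * 2 + 1 = 5
```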
4 Solving the Maximum Entropy Model
In this section we introduce an algorithm for finding the correct distribution. We will use the classic Iterative Scaling algorithm. For more details of this algorithm and the proof of correctness we refer to the original paper [7].
4.1 Iterative Scaling Procedure
The generic Iterative Scaling algorithm is a framework that can be used for discovering the maximum entropy distribution given any set of linear constraints. The idea is to search the parameters and update them in an iterative fashion so that the statistics we wish to constrain converge towards the desired values. In our case, we update the parameters r_i and v_k iteratively so that the statistics of the distribution p* converge to q_M and q_S. The sketch of the algorithm is given in Algorithm 1.
In order to use the Iterative Scaling algorithm we need techniques for computing the probabilities p*(a_i = 1) and p*(S(t) = k) (lines 3 and 7). The former is a special case of computing the frequency of an itemset and is detailed in the following subsection. For the latter, assume that we have an algorithm, say ComputeProb, which given a set of probabilities r_1, …, r_K will return the probabilities p(S(t) = k) for each k, where p is the independence model parameterized by the r_i, that is, p(a_i = 1) = r_i. We will now show that using only ComputeProb we are able to obtain the required probabilities.
Note that Eq. 5 implies that p*(S(t) = k) = b(k), which by Eq. 4 we can compute from v_k and u(S(t) = k). To compute the latter we call ComputeProb with the probabilities r_1, …, r_K as parameters.
4.2 Querying the Model
We noted before that computing p*(a_i = 1) is a special case of computing the frequency of an itemset X. In order to do that let us write

p*(X = 1) = Σ_k b(k) u(X = 1 | S(t) = k) = Σ_k b(k) u(X = 1, S(t) = k) / u(S(t) = k).

Let us write c = u(X = 1) = ∏_{a_i ∈ X} r_i. Note that since u is the independence model, we have u(X = 1, S(t) = k) = c u(S(t) = k | X = 1). To compute u(S(t) = k | X = 1) we call ComputeProb with parameters r_i if a_i ∉ X and 1 if a_i ∈ X.
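The clamping trick above can be sketched end to end. Here compute_prob is a naive, exponential-time stand-in for ComputeProb (which the paper implements by dynamic programming), and all names are illustrative.

```python
from itertools import product

def compute_prob(r, S, k):
    """p(S(t) = k) under the independence model with item probabilities r.
    Naive O(2^K) enumeration, for illustration only."""
    total = 0.0
    for t in product([0, 1], repeat=len(r)):
        p = 1.0
        for ti, ri in zip(t, r):
            p *= ri if ti == 1 else 1 - ri
        if S(t) == k:
            total += p
    return total

def itemset_frequency(X, b, r, S):
    """Estimate sum_k b(k) u(X = 1) u(S = k | X = 1) / u(S = k)."""
    r_clamped = [1.0 if i in X else ri for i, ri in enumerate(r)]
    u_x = 1.0
    for i in X:
        u_x *= r[i]                   # u(X = 1) under independence
    est = 0.0
    for k, bk in enumerate(b):
        u_sk = compute_prob(r, S, k)  # u(S = k)
        if u_sk > 0:
            est += bk * u_x * compute_prob(r_clamped, S, k) / u_sk
    return est

# sanity check: if b equals u's own margin distribution, the mixture is u
# itself, so the estimate must equal the independence frequency 0.4 * 0.6
r = [0.4, 0.6]
b = [compute_prob(r, sum, k) for k in range(3)]
print(itemset_frequency({0, 1}, b, r, sum))
```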
4.3 Computing Statistics
In order to use the Iterative Scaling algorithm we need an implementation of ComputeProb, a routine that returns the probabilities of a statistic with respect to the independence model. Since ComputeProb is called multiple times, its runtime is pivotal. Naturally, memory and time requirements depend on the statistic at hand. In Table 1 the complexities of ComputeProb are listed for several statistics and joint statistics. All statistics are computed using the same general dynamic programming idea: to compute the statistics for the first l items we first solve the problem for the first l − 1 items, and using that result we will be able to compute the statistics for l items efficiently. The details of these calculations are given in the appendix.
Statistic  memory  time 

row margins (from scratch)  
row margins (backward method)  
lazarus counts  
joint row margins and lazarus counts  
joint transaction bounds  
joint row margins, transaction bounds 
5 Experiments
In this section we present the results of experiments on synthetic and real data. The source code of the algorithm can be found at the authors' website (http://www.adrem.ua.ac.be/implementations/).
5.1 Datasets
The characteristics of the datasets we used are given in Table 2.
We created three synthetic datasets. The first one has independent items with randomly chosen frequencies. The second dataset contains two clusters of equal size. In each cluster the items are independent with a frequency of 25% and 75% respectively. Hence, the row margin distribution has two peaks. In the third synthetic dataset, the items form a Markov chain. The first item has a frequency of 50%, and then each subsequent item is a noisy copy of the preceding one, that is, the item is inverted with a 25% probability.
The real-world datasets we used are obtained from the FIMI Repository (http://fimi.cs.helsinki.fi/data/) and from [16]. The Chess data contains chess board descriptions. The DNA dataset [16] describes DNA copy number amplifications. The Retail dataset contains market basket data from an anonymous Belgian supermarket [3] and the Webview1 dataset contains clickstream data from an e-commerce website [13]. For both of the latter datasets the rare items with a frequency lower than 0.5% were removed. This resulted in some empty transactions, which were subsequently deleted.
Note that for all of the models, except the margin model, an order is needed on the items. In this paper we resort to using the order in which the items appear in the data. Despite the fact that this order is not necessarily optimal, we were able to improve over the independence model.
Dataset       |D|      K     Margins           Lazarus           Bounds
                            iter    time      iter    time      iter    time
Independent   100000   20      2    0.01 s       2    0.02 s       2    0.02 s
Clusters      100000   20      2    0.01 s       9    0.07 s      10    0.07 s
Markov        100000   20      2    0.01 s       8    0.05 s      12    0.07 s
Chess           3196   75      3    0.6 s      400    153 s       28    8 s
DNA             4590  391      8    313 s       96    90 m       119    66 m
Retail         81998  221      4    26 s        11    110 s       19    171 s
Webview1       52840  150      3    5 s         14    45 s        93    267 s
5.2 Model Performance
We begin by examining the log-likelihood of the datasets, given in Table 3. We train the model on the whole data and then compute its likelihood, giving it a BIC penalty [18], equal to (d/2) log |D|, where d is the number of free parameters of each model and |D| is the number of transactions. This penalty rewards models with few free parameters, while penalizing complex ones.
Compared to the independence model, the likelihoods of our models are better on all datasets, with the exception of the Independent data. This is expected, since the Independent data is generated from an independence distribution, so using more advanced statistics will not improve the BIC score.
When looking at the other two synthetic datasets, we see that the margin model has the highest likelihood for the Clusters data, and the lazarus model for the Markov data. The Clusters data has two clusters, in both of which the items are independent. The distribution of the row margins has two peaks, one for each cluster. This information cannot be explained by the independence model and adding this information improves the loglikelihood dramatically. The items in the Markov dataset are ordered since they form a Markov chain. More specifically, since each item is a (noisy) copy of the previous one, we expect the transactions to consist of only a few blocks of consecutive ones, which implies that their lazarus count is quite low, and hence the lazarus model performs well.
Chess is originally a categorical dataset, which has been binarized to contain one item for each attribute-value pair. Hence, it is a rather dense dataset, with constant row margins, and lazarus counts and bounds centered around a peak. That is, Chess does not obey the independence model, and this is reflected in the fact that the likelihoods of our models are better than that of the independence model.
The likelihood of the lazarus model for the DNA dataset, which is very close to being fully banded [8], is significantly lower than that of the independence model and the margin model, which indicates that using lazarus counts is a good idea in this case. The bounds model comes in second, also performing very well, which can again be explained by the bandedness of the data.
Finally, Retail and Webview1 are sparse datasets. The margin model performs well for both datasets. Therefore we can conclude that a lot of the structure of these datasets is captured in the row and column margins. Interestingly, although close to the margin model, the bounds model is best for Webview1.
Dataset  Independent  Margins  Lazarus  Bounds 

Independent  
Clusters  1 719 959  
Markov  1 861 046  
Chess  131 870  
DNA  107 739  
Retail  1 774 291  
Webview1  774 773 
5.3 Frequency Estimation of Itemsets
Next, we perform experiments on estimating the supports of a collection of itemsets. The datasets are split into two parts, a training set and a test set. We train the models on the training data, and use the test data to verify the frequency estimates of itemsets. We estimate the frequencies of the top closed frequent itemsets in the test data (or all closed itemsets if there are fewer).
Table 4 reports the average absolute and relative errors of the frequency estimates, for the independence, margins, lazarus and bounds models. For the Independent data, the independence model performs best, since the other models overlearn the data. For all other datasets, except Chess, we see that using more information reduces both the average absolute and relative error of the frequency estimates. For instance, for Clusters the average absolute error is reduced from 9.39% to 0.20% using row and column margins, and likewise for Markov the average relative error is reduced from 47.80% to 21.11%. The DNA, Retail and Webview1 data are sparse. Therefore, the itemset frequencies are very low, as are the absolute errors, even for the independence model. However, the relative errors are still quite high. In this case our models also outperform the independence model. For example, the relative error is reduced from 92.61% to 79.77% by using the margins model on the Webview1 dataset. For DNA, the average relative error drops from 85.89% to 80.24% using lazarus counts.
The only exception is the Chess dataset where the average absolute and relative errors do not improve over the independence model. Note, however, that this dataset is originally categorical, and contains a lot of dependencies between the items. Interestingly enough, our models perform better than the independence model in terms of (penalized) likelihood. This suggests that in this case using additional information makes some itemsets interesting.
Our final experiment (given in Table 5) considers the average improvement of the log-likelihood of each itemset compared to the independence model. Let fr(X) be the empirical frequency of itemset X in the test data, and let est(X) be the estimate given by one of the models; then the log-likelihood of X is computed as n (fr(X) log est(X) + (1 − fr(X)) log(1 − est(X))), where n is the number of test transactions. We compute the difference between the log-likelihoods for the independence model and the other models, and take the average over the top closed frequent itemsets. Again, for Independent and Chess, the independence model performs the best. For all the other datasets, we clearly see an improvement with respect to the independence model. For Clusters and Markov, the average log-likelihood increases greatly, and is the highest for the margin model. For DNA, Retail and Webview1, the increase in likelihood is somewhat lower. The reason for this is that both the estimates and the observed frequencies are small and close to each other. For DNA the average increase is highest when using lazarus counts, while for Retail and Webview1 the margin model is best.
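Assuming the Bernoulli form of the per-itemset log-likelihood (an itemset either occurs in a test transaction or it does not), the improvement over the independence model can be sketched as follows; all numbers are illustrative, not taken from Table 5.

```python
from math import log

def itemset_loglik(freq, est, n):
    """Bernoulli log-likelihood of observing empirical frequency `freq`
    in n test transactions, under a model estimate `est`."""
    return n * (freq * log(est) + (1 - freq) * log(1 - est))

# a model whose estimate (0.28) is closer to the observed frequency (0.30)
# scores higher than a cruder independence estimate (0.10)
improvement = itemset_loglik(0.30, 0.28, 1000) - itemset_loglik(0.30, 0.10, 1000)
print(improvement > 0)
```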
              Independent                        Margins
Dataset       absolute         relative          absolute         relative
Independent   0.11% ± 0.10%    1.54% ± 1.23%     0.11% ± 0.10%    1.52% ± 1.21%
Clusters      9.39% ± 0.73%    63.08% ± 11.81%   0.20% ± 0.11%    1.37% ± 0.86%
Markov        4.79% ± 2.29%    47.80% ± 20.86%   2.19% ± 1.64%    21.11% ± 13.41%
Chess         1.81% ± 1.35%    2.35% ± 1.80%     1.94% ± 1.43%    2.52% ± 1.91%
DNA           0.58% ± 1.08%    85.89% ± 30.91%   0.56% ± 1.04%    84.27% ± 31.76%
Retail        0.05% ± 0.11%    48.89% ± 27.95%   0.04% ± 0.09%    37.70% ± 28.64%
Webview1      0.11% ± 0.07%    92.61% ± 17.27%   0.10% ± 0.06%    79.77% ± 23.94%
              Lazarus                            Bounds
Dataset       absolute         relative          absolute         relative
Independent   0.11% ± 0.10%    1.54% ± 1.23%     0.12% ± 0.10%    1.61% ± 1.30%
Clusters      7.00% ± 1.12%    47.04% ± 10.77%   8.21% ± 0.93%    55.33% ± 11.92%
Markov        2.27% ± 1.61%    22.41% ± 14.88%   3.36% ± 2.24%    33.54% ± 21.12%
Chess         2.06% ± 1.49%    2.68% ± 1.99%     1.81% ± 1.35%    2.35% ± 1.79%
DNA           0.45% ± 0.75%    80.24% ± 72.73%   0.54% ± 0.94%    82.63% ± 31.38%
Retail        0.04% ± 0.10%    42.90% ± 27.81%   0.04% ± 0.09%    43.71% ± 27.27%
Webview1      0.11% ± 0.06%    88.35% ± 20.87%   0.11% ± 0.06%    90.60% ± 18.63%
Dataset       Margins            Lazarus            Bounds
Independent   0.03 ± 0.12        0.00 ± 0.09        0.10 ± 0.36
Clusters      4517.4 ± 1168.3    2412.6 ± 1030.1    1353.2 ± 660.4
Markov        1585.9 ± 1310.5    1536.9 ± 1283.2    931.5 ± 854.1
Chess         0.40 ± 0.52        0.81 ± 0.98        0.01 ± 0.17
DNA           119.9 ± 235.7      133.1 ± 274.5      106.2 ± 224.8
Retail        9.69 ± 22.49       4.62 ± 11.05       5.37 ± 24.94
Webview1      135.79 ± 90.32     64.72 ± 40.41      57.71 ± 62.95
6 Related Work
A special case of our framework greatly resembles the work done in [9]. In that work the authors propose a procedure for assessing the results of a data mining algorithm by sampling datasets having the same margins for the rows and columns as the original data. While the goals are similar, there is a fundamental difference between the two frameworks. The key difference is that we do not differentiate individual rows. Thus we do not know that, for example, the first row in the data has k ones, but instead we know how many rows in the data have k ones. The same key difference can be seen between our method and Rasch models, where each individual row and column of the dataset is given its own parameter [17].
Our approach and the approach given in [9] complement each other. When the results that we wish to assess do not depend on the order or identity of transactions, it is more appropriate to use our method. An example of such a data mining algorithm is frequent itemset mining. On the other hand, if the data is to be treated as a collection of transactions, e.g. for segmentation, then we should use the approach in [9]. Also, our approach stands on more solid theoretical ground, since for sampling datasets the authors of [9] rely on MCMC techniques with no theoretical guarantee that the chain has actually mixed.
Comparing frequencies of itemsets to estimates has been studied in several works. The most common approach is to compare the itemset against the independence model [4, 1]. A more flexible approach has been suggested in [12, 11], where the itemset is compared against a Bayesian network. In addition, approaches based on maximum entropy models derived from a given set of known itemsets have been suggested in [15, 20]. A common problem of these more general approaches is that deriving probabilities from the models is usually too complex. Hence, we need to resort to either estimating the expected value by sampling, or to building a local model using only the attributes occurring in the query. In the latter case, it is shown in [19] that using only local information can distort the frequency estimate and that the resulting frequencies are not consistent with each other. Our model does not suffer from these problems, since it is a global model from which the frequency estimates can be drawn efficiently.
7 Conclusions
In this paper we considered using count statistics to predict itemset frequencies. To this end we built a maximum entropy model from which we draw estimates for the frequency of itemsets and compare the observed value against the estimate. We introduced efficient techniques for solving and querying the model. Our experiments show that using these additional statistics improves the model in terms of likelihood and in terms of predicting itemset frequencies.
References
 [1] Charu C. Aggarwal and Philip S. Yu. A new framework for itemset generation. In PODS ’98: Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pages 18–24. ACM Press, 1998.
 [2] Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI/MIT Press, 1996.
 [3] Tom Brijs, Gilbert Swinnen, Koen Vanhoof, and Geert Wets. Using association rules for product assortment decisions: A case study. In Knowledge Discovery and Data Mining, pages 254–260, 1999.
 [4] Sergey Brin, Rajeev Motwani, and Craig Silverstein. Beyond market baskets: Generalizing association rules to correlations. In Joan Peckham, editor, SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, pages 265–276. ACM Press, May 1997.
 [5] Robert G. Cowell, A. Philip Dawid, Steffen L. Lauritzen, and David J. Spiegelhalter. Probabilistic Networks and Expert Systems. Statistics for Engineering and Information Science. Springer-Verlag, 1999.
 [6] Imre Csiszár. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146–158, Feb. 1975.
 [7] J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5):1470–1480, 1972.
 [8] Gemma C. Garriga, Esa Junttila, and Heikki Mannila. Banded structure in binary matrices. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24–27, 2008, pages 292–300. ACM New York, NY, USA, 2008.
 [9] Aristides Gionis, Heikki Mannila, Taneli Mielikäinen, and Panayiotis Tsaparas. Assessing data mining results via swap randomization. TKDD, 1(3), 2007.
 [10] Sami Hanhijärvi, Markus Ojala, Niko Vuokko, Kai Puolamäki, Nikolaj Tatti, and Heikki Mannila. Tell me something I don’t know: randomization strategies for iterative data mining. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2009), pages 379–388, 2009.
 [11] Szymon Jaroszewicz and Tobias Scheffer. Fast discovery of unexpected patterns in data, relative to a bayesian network. In KDD ’05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 118–127, New York, NY, USA, 2005. ACM.
 [12] Szymon Jaroszewicz and Dan A. Simovici. Interestingness of frequent itemsets using bayesian networks as background knowledge. In KDD ’04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 178–186, New York, NY, USA, 2004. ACM.
 [13] Ron Kohavi, Carla Brodley, Brian Frasca, Llew Mason, and Zijian Zheng. KDD-Cup 2000 organizers’ report: Peeling the onion. SIGKDD Explorations, 2(2):86–98, 2000. http://www.ecn.purdue.edu/KDDCUP.
 [14] H. Mannila and E. Terzi. Nestedness and segmented nestedness. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, page 489. ACM, 2007.
 [15] Rosa Meo. Theory of dependence values. ACM Trans. Database Syst., 25(3):380–406, 2000.
 [16] S. Myllykangas, J. Himberg, T. Böhling, B. Nagy, J. Hollmén, and S. Knuutila. DNA copy number amplification profiling of human neoplasms. Oncogene, 25(55):7324–7332, Nov. 2006.
 [17] G. Rasch. Probabilistic Models for Some Intelligence and Attainment Tests. Danmarks paedagogiske Institut, 1960.
 [18] Gideon Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.
 [19] Nikolaj Tatti. Safe projections of binary data sets. Acta Inf., 42(8–9):617–638, 2006.
 [20] Nikolaj Tatti. Maximum entropy based significance of itemsets. Knowledge and Information Systems, 17(1):57–77, 2008.
 [21] Mohammed Javeed Zaki. Scalable algorithms for association mining. IEEE Trans. Knowl. Data Eng., 12(3):372–390, 2000.
Appendix A Computing Statistics
To simplify notation, we define $K_i = \{a_1, \ldots, a_i\}$ to be the set of the first $i$ items.
Row Margins
Our first statistic is the number of ones in a transaction $t$, which we denote by $|t| = \sum_{i=1}^{N} t_i$, where $N$ is the number of items.
We consider a cut version of $|t|$ by defining $c(t) = \min(|t|, M)$ for a given threshold $M$. Hence, $p(c = k)$ is the probability of a random transaction having $k$ ones, if $k < M$. On the other hand, $p(c = M)$ is the probability of a random transaction having at least $M$ ones. We can exploit the fact that $\sum_{k=0}^{M} p(c = k) = 1$ to reduce computation time. The range of $|t|$ is $\{0, \ldots, N\}$, whereas the range of $c$ is $\{0, \ldots, M\}$.
Our next step is to compute the probabilities $u(c = k)$ when $u$ is the independence model. Let us write $u_i = u(a_i = 1)$, the probability of item $a_i$ attaining the value 1. We will first introduce a way of computing the probabilities from scratch. In order to do that, note the following identity
(6) $u\left(|t \cap K_i| = k\right) = u_i\, u\left(|t \cap K_{i-1}| = k - 1\right) + (1 - u_i)\, u\left(|t \cap K_{i-1}| = k\right).$
This identity holds since $u$ is the independence model. Hence, to compute $u(|t| = k)$ we start with $K_0 = \emptyset$, add $a_1$, then $a_2$, and so on, until we have processed all variables and reached $K_N$. Note that we are computing the probabilities simultaneously for all $k = 0, \ldots, M$. We can perform the computation in $O(NM)$ steps and $O(M)$ memory slots.
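The forward computation can be sketched as follows. This is a minimal Python sketch of our own; the function name `margin_probs` and the list-based representation are illustrative, not from the paper.

```python
def margin_probs(u, M=None):
    """Compute P(c = k) for k = 0..M under the independence model,
    where u[i] = P(a_{i+1} = 1) and c(t) = min(|t|, M).

    The top bucket k = M collects the probability of at least M ones."""
    N = len(u)
    if M is None:
        M = N
    p = [1.0] + [0.0] * M              # distribution for the empty prefix K_0
    for ui in u:                       # fold in a_1, a_2, ..., a_N (Eq. 6)
        q = [0.0] * (M + 1)
        for k in range(M + 1):
            q[k] = (1.0 - ui) * p[k]   # a_i absent: count stays at k
            if k > 0:
                q[k] += ui * p[k - 1]  # a_i present: count grows from k - 1
        q[M] += ui * p[M]              # counts beyond M stay in the top bucket
        p = q
    return p
```

For two items with $u_1 = u_2 = 1/2$ this yields the binomial distribution $(1/4, 1/2, 1/4)$; with the cut $M = 1$ the top bucket holds $P(|t| \geq 1) = 3/4$.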
We can improve this further by analyzing the flow in the Iterative Scaling algorithm. Consecutive calls of ComputeProb use parameter vectors that differ only in a single entry $u_i$.
Assume that we have the probabilities $u(|t| = k)$ from the previous computation. These probabilities will change only when we are updating $u_i$. To achieve a more efficient computation we will first compute $u(|t \setminus a_i| = k)$ from $u(|t| = k)$, update $u_i$, and then update $u(|t| = k)$. To do that we can reverse the identity given in Eq. 6 into
(7) $u\left(|t \setminus a_i| = k\right) = \frac{u\left(|t| = k\right) - u_i\, u\left(|t \setminus a_i| = k - 1\right)}{1 - u_i}$
if $u_i \neq 1$. When $u_i = 1$ we have $u(|t \setminus a_i| = k) = u(|t| = k + 1)$. Using these identities we can take a step back and remove $a_i$ from $K_N$. To compute $u(|t \setminus a_i| = k)$ we can apply the identity in Eq. 7 for $k = 0, 1, 2, \ldots$, starting with $u(|t \setminus a_i| = -1) = 0$. Once $u_i$ is updated we can again use Eq. 6 to update $u(|t| = k)$. All these computations can be done in $O(N)$ time. On the other hand, in order to use Eq. 7 we must remember the probabilities $u(|t| = k)$ for all $k = 0, \ldots, N$, so consequently our memory consumption rises from $O(M)$ to $O(N)$.
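The removal and re-insertion steps can be sketched as follows. The code is our own illustration, not the authors' implementation: `remove_item` inverts Eq. 6 (i.e., applies Eq. 7), `add_item` folds the item back in with its new parameter, and the lists hold the untruncated probabilities $u(|t| = k)$ for $k = 0, \ldots, N$.

```python
def remove_item(p, ui):
    """Eq. 7: recover P(|t \\ a_i| = k) from P(|t| = k), where ui = P(a_i = 1)."""
    n = len(p) - 1                     # number of items behind p
    q = [0.0] * n                      # |t \ a_i| ranges over 0 .. n-1
    if ui == 1.0:                      # degenerate case: a_i is always present
        for k in range(n):
            q[k] = p[k + 1]
    else:
        prev = 0.0                     # P(|t \ a_i| = -1) = 0
        for k in range(n):
            prev = q[k] = (p[k] - ui * prev) / (1.0 - ui)
    return q

def add_item(q, ui):
    """Eq. 6: fold the item back in with its updated parameter ui."""
    n = len(q)
    p = [0.0] * (n + 1)
    for k in range(n + 1):
        if k < n:
            p[k] += (1.0 - ui) * q[k]  # a_i absent
        if k > 0:
            p[k] += ui * q[k - 1]      # a_i present
    return p
```

A round trip removes an item and re-adds it with a new parameter in $O(N)$ time, exactly as described above.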
Lazarus Events
Our aim is to efficiently compute the probabilities $u(l(t) = k)$, where $l(t)$ is the Lazarus count of $t$, that is, the number of zeros between the first and the last one of $t$, and $u$ is an independence distribution. Again, let us write $u_i = u(a_i = 1)$. The desired probabilities are computed incrementally in $N$ steps, starting from $K_1$ and ending with $K_N$. In order to determine the Lazarus count probabilities, we use an auxiliary statistic, the last occurrence $z(t) = \max\{\, i \mid a_i \in t \,\}$, where $z(t) = 0$ for an empty transaction.
We will compute the probabilities $u(l(t \cap K_i) = k, z(t \cap K_i) = j)$ and then marginalize them to obtain $u(l(t) = k)$. To simplify the following notation, we define the probability $f(i; k, j) = u\left(l(t \cap K_i) = k, z(t \cap K_i) = j\right)$.
First, assume that $a_i \in t$, which implies that $z(t \cap K_i) = i$. In such case the Lazarus count increases by $i - 1 - z(t \cap K_{i-1})$, the number of zeros between the previous last occurrence and $a_i$. We can write the following identity
(8) $f(i; k, i) = u_i \sum_{j=1}^{i-1} f\left(i - 1;\, k - i + 1 + j,\, j\right)$
for $k > 0$. To handle the boundary cases, for $k = 0$ and $j = i$ we have
(9) $f(i; 0, i) = u_i \left( f(i - 1; 0, i - 1) + f(i - 1; 0, 0) \right)$
and finally when $k = 0$ and $j = 0$ we obtain
(10) $f(i; 0, 0) = (1 - u_i)\, f(i - 1; 0, 0) = \prod_{l=1}^{i} (1 - u_l).$
Secondly, consider the instances where $a_i \notin t$, in which case $z(t \cap K_i)$ must be equal to $z(t \cap K_{i-1})$. We obtain $f(i; k, j) = (1 - u_i)\, f(i - 1; k, j)$ for all $k$ and $j < i$.
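A direct implementation of Equations 8–10 maintains the full table of probabilities $f(i; k, j)$. The following sketch follows our reconstruction of the recurrences; the function name and table layout are illustrative.

```python
def lazarus_probs(u):
    """P(l(t) = k) under independence, where l counts the zeros between
    the first and the last one of t, and u[i-1] = P(a_i = 1)."""
    n = len(u)
    # f[k][j] = f(i; k, j); j = 0 encodes the empty transaction
    f = [[0.0] * (n + 1) for _ in range(n + 1)]
    f[0][0] = 1.0                      # K_0: empty prefix, Lazarus count 0
    for i in range(1, n + 1):
        ui = u[i - 1]
        g = [[0.0] * (n + 1) for _ in range(n + 1)]
        for k in range(n + 1):
            for j in range(i):         # a_i absent: neither z nor l changes
                g[k][j] = (1.0 - ui) * f[k][j]
        for k in range(n + 1):         # a_i present: z jumps to i
            acc = 0.0
            for j in range(1, i):      # Eq. 8: l grows by i - 1 - j
                if k - (i - 1 - j) >= 0:
                    acc += f[k - (i - 1 - j)][j]
            if k == 0:
                acc += f[0][0]         # Eq. 9: the prefix was empty
            g[k][i] = ui * acc
        f = g
    return [sum(row) for row in f]     # marginalize out the last occurrence
```

For three items with $u_i = 1/2$ the only transaction with a positive Lazarus count is $101$, so $P(l = 1) = 1/8$.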
Using the equations above, we are able to compute $f(N; k, j)$ using $O(N^2)$ memory slots. Specifically, out of all combinations of $k$ and $j$, $f(i; k, j)$ is potentially nonzero only when $k = 0$ or $k \leq j - 2$. Since in each step a quadratic number of probabilities is updated, $O(N^3)$ time is needed. However, we can reduce this to quadratic time and linear space. First of all, note that the right-hand side of Equation 8, being a sum of size linear in $i$, can be rewritten as a sum of constant size. Moreover, this sum only uses probabilities involving $K_{i-1}$, which were computed in the previous step. To see this, define the diagonal sums $h(i; k) = \sum_{j=1}^{i} f(i; k - i + j, j)$. Hence for $k > 0$, the probability $f(i; k, i)$ is equal to
$f(i; k, i) = u_i\, h(i - 1; k).$
For $k = 0$, $f(i; 0, i)$ is equal to
$f(i; 0, i) = u_i \left( h(i - 1; 0) + f(i - 1; 0, 0) \right).$
Finally, noting that
$h(i; k) = f(i; k, i) + (1 - u_i)\, h(i - 1; k - 1),$
we can compute the sums $h(i; k)$ by gradually adding the second terms. We can compute each $h(i; k)$ in constant time using only $h(i - 1; k)$ and $h(i - 1; k - 1)$. Hence we can discard the terms $f(i; k, j)$, where $j < i$. Since the marginals satisfy $u(l(t \cap K_i) = k) = f(i; k, i) + (1 - u_i)\, u(l(t \cap K_{i-1}) = k)$, they can be maintained in constant time per entry as well. Therefore, only $O(N)$ memory and $O(N^2)$ time is needed.
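The quadratic-time variant can be sketched as follows. It keeps only the diagonal sums $h(i; k) = \sum_{j=1}^{i} f(i; k - i + j, j)$, the running marginals, and the probability of an empty prefix; the code is our own reconstruction of the scheme, not the authors' implementation.

```python
def lazarus_probs_fast(u):
    """P(l(t) = k) under independence in O(N^2) time and O(N) space."""
    n = len(u)
    h = [0.0] * (n + 1)        # h[k] = h(i; k), the diagonal sums
    m = [1.0] + [0.0] * n      # m[k] = P(l(t ∩ K_i) = k), running marginals
    empty = 1.0                # f(i; 0, 0) = prod_{l <= i} (1 - u_l)
    for ui in u:
        # diagonal terms f(i; k, i): Eq. 9 for k = 0, Eq. 8 via h for k > 0
        diag = [ui * (h[0] + empty)] + [ui * h[k] for k in range(1, n + 1)]
        # h(i; k) = f(i; k, i) + (1 - ui) * h(i-1; k-1)
        h = [diag[k] + (1.0 - ui) * (h[k - 1] if k > 0 else 0.0)
             for k in range(n + 1)]
        # marginals: P(l = k) = f(i; k, i) + (1 - ui) * previous marginal
        m = [diag[k] + (1.0 - ui) * m[k] for k in range(n + 1)]
        empty *= (1.0 - ui)
    return m
```

Only three length-$N$ arrays survive between iterations, matching the $O(N)$ memory bound stated above.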
Joint Transaction Bounds
We need to compute the probabilities $u(\mathit{fst}(t) = i, \mathit{lst}(t) = j)$ for an independence distribution $u$, where $\mathit{fst}(t)$ and $\mathit{lst}(t)$ denote the positions of the first and the last one of $t$, with $\mathit{fst}(t) = \mathit{lst}(t) = 0$ for an empty transaction. For the sake of brevity, let us denote $q_{ij} = u(\mathit{fst}(t) = i, \mathit{lst}(t) = j)$. Note that $q_{ij}$ is nonzero only if $1 \leq i \leq j$ or $i = j = 0$. Hence there are $O(N^2)$ probabilities to compute. We distinguish three cases: the empty transaction, $q_{00} = \prod_{l=1}^{N} (1 - u_l)$; a single one at position $i$, $q_{ii} = \prod_{l < i} (1 - u_l)\; u_i \prod_{l > i} (1 - u_l)$; and $i < j$, for which $q_{ij} = \prod_{l < i} (1 - u_l)\; u_i\, u_j \prod_{l > j} (1 - u_l)$, since the items strictly between $a_i$ and $a_j$ are left free.
We can construct the probabilities $q_{ij}$ in quadratic time, by looping over $i$ and $j$, and maintaining the products $\prod_{l < i} (1 - u_l)$ and $\prod_{l > j} (1 - u_l)$, which can be updated in constant time in each iteration. Using these products and the individual item probabilities $u_i$, we can construct $q_{ij}$.
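The three cases can be sketched as follows (our own illustrative code; positions are 1-based, matching $a_1, \ldots, a_N$, and the pair $(0, 0)$ encodes the empty transaction):

```python
def bound_probs(u):
    """q[(i, j)] = P(first one at position i, last one at position j)
    under independence, with (0, 0) for the empty transaction."""
    n = len(u)
    # prefix[i] = prod_{l < i} (1 - u_l);  suffix[j] = prod_{l > j} (1 - u_l)
    prefix = [1.0] * (n + 1)
    for i in range(2, n + 1):
        prefix[i] = prefix[i - 1] * (1.0 - u[i - 2])
    suffix = [1.0] * (n + 1)
    for j in range(n - 1, 0, -1):
        suffix[j] = suffix[j + 1] * (1.0 - u[j])
    q = {(0, 0): prefix[n] * (1.0 - u[n - 1])}         # no ones at all
    for i in range(1, n + 1):
        q[(i, i)] = prefix[i] * u[i - 1] * suffix[i]   # a single one at i
        for j in range(i + 1, n + 1):
            # ones at i and j, zeros outside [i, j]; items in between are free
            q[(i, j)] = prefix[i] * u[i - 1] * u[j - 1] * suffix[j]
    return q
```

The two running products are exactly the quantities maintained in the quadratic-time loop described above.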