Abstract
Large contingency tables summarizing categorical variables arise in many areas. One example is in biology, where large numbers of biomarkers are cross-tabulated according to their discrete expression level. Interactions of the variables are of great interest and are generally studied with loglinear models. The structure of a loglinear model can be visually represented by a graph from which the conditional independence structure can then be easily read off. However, since the number of parameters in a saturated model grows exponentially in the number of variables, this generally comes with a heavy computational burden. Even if we restrict ourselves to models of lower-order interactions or other sparse structures, we are faced with the problem of a large number of cells, which plays the role of the sample size. This is in sharp contrast to high-dimensional regression or classification procedures because, in addition to a high-dimensional parameter, we also have to deal with the analogue of a huge sample size. Furthermore, high-dimensional tables naturally feature a large number of sampling zeros, which often leads to the nonexistence of the maximum likelihood estimate. We therefore present a decomposition approach, where we first divide the problem into several lower-dimensional problems and then combine these to form a global solution. Our methodology is computationally feasible for loglinear interaction models with many categorical variables, where some or all of them may have many levels. We demonstrate the proposed method on simulated data and apply it to a biomedical problem in cancer research.
Biometrical Journal, Volume 46, Issue 1, 2004, pages 1 ff. Received: 2 April 2009.
Decomposition and Model Selection for Large Contingency Tables
Corinne Dahinden, Markus Kalisch^1 and Peter Bühlmann
^1 Corresponding author: email: kalisch@stat.math.ethz.ch, Phone: +41 44 632 3435, Fax: +41 44 632 1228
1 Background
We consider the problem of estimation and model selection in loglinear models for large contingency tables involving many categorical variables. This problem encompasses the estimation of the graphical model structure for categorical variables. This structure learning task has lately received considerable attention as it plays an important role in a broad range of applications. The conditional independence structure of the distribution can be read off directly from the structure of a graphical model (a graph) and hence provides a graphical representation of the distribution that is easy to interpret (see Lauritzen (1996)). Graphical models for categorical variables correspond to a class of hierarchical loglinear interaction models for contingency tables. Thus, fitting a graph corresponds to model selection in a hierarchical loglinear model $\log(p) = X\theta$, where $p$ is the vector of cell probabilities of the table, $\theta$ is a parameter vector and $X$ is the design matrix (see section 2.2).
Fitting the loglinear model for large contingency tables in full detail turns out to be a very hard computational problem in practice. In the following, we list three possible goals for fitting loglinear models. The goals are ranked in increasing order according to their computational difficulty:
 Graphical Structure

Finding the graphical structure for discrete categorical variables is easiest, but it does not allow one to infer the magnitude of the coefficients in the loglinear model; see also formula (3).
 Parameter Vector

The next level of difficulty is the estimation of the unknown parameter vector $\theta$ in a loglinear model whose full dimension equals the number of cells in the contingency table. For large tables, the dimension of $\theta$ is huge, but under some sparsity assumptions it is possible to accurately estimate such a high-dimensional vector using suitable regularisation. The major problem here is that besides the high-dimensionality of $\theta$, the analogue of the sample size (the row dimension of $X$) is huge as well, e.g. $3^{40} \approx 1.2 \cdot 10^{19}$ for 40 categorical variables having 3 levels each.
 Probability Vector

The most difficult problem is the estimation of the probability vector $p$, whose dimension again equals the number of cells in the table. It is rather unrealistic to place sparsity assumptions on $p$ in the sense that many entries would equal exactly zero, which would enable feasible computation. Therefore, it is impossible to ever compute an estimate of the whole probability vector (e.g. having dimensionality $3^{40}$). Nevertheless, thanks to sparsity of the parameter vector $\theta$ and the junction tree algorithm, it is possible to compute accurate estimates for any reasonable-sized collection of cells in the contingency table.
There is hardly any method which can achieve all these goals for contingency tables involving many, say more than 20, variables. One approach to address the loglinear modeling problem for large contingency tables is presented in Jackson et al. (2007), where some dimensionality reduction is achieved by reducing the number of levels per variable. The reduction is accomplished by collapsing two categories, aggregating their counts, if the two categories behave sufficiently similarly. At best, this method reduces the problem to one with the same number of binary variables. For this special case with binary factors, an approach based on many logistic regressions can be used for fitting loglinear interaction models whose computational complexity is feasible even if the number of variables is large (Wainwright et al., 2007). Another method to address the loglinear modeling problem for large contingency tables is proposed in Kim (2005), where the variables are grouped such that they are highly connected within groups but less so between groups, and graphical models are fitted for these subgroups. The subgraph models are then combined using so-called graphs of prime separators. The implementation of the combination, however, is not an easy task and no exact algorithm is given for how to combine the models.
Our methodology achieves all the goals above for categorical variables having possibly different numbers of levels: inference of a graphical model for discrete variables, of a sparse parameter vector in a loglinear model, and of a collection of cell probabilities. Motivated by the approach in Kim (2005), we also propose a decomposition approach, where the dimensionality reduction is achieved by recursively collapsing the large contingency table on certain variables (decomposition) and thereby reducing the problem to smaller tables which can be handled more easily. All the fitted lower-dimensional loglinear models are combined appropriately to represent an estimate of the joint distribution of all variables. The procedure enables us to handle very large tables, e.g. up to hundreds of categorical variables, where some or all of them can have more than two categories. This multicategory framework is much more challenging than the approach in Wainwright et al. (2007) for large binary tables. In Ravikumar et al. (2009), an extension to the multicategory case within the class of pairwise Markov random fields is sketched: it is argued that higher-order interactions could be included in a pairwise Markov field, but no methodology or algorithm is described for actually carrying out this difficult computational task.
2 Definitions
In this section, we introduce several important theoretical concepts. First, we define the general loglinear interaction model. Subsequently, we introduce the hierarchical loglinear model and the graphical model as restricted versions of the loglinear interaction model. Then, we define collapsibility and decomposability, which are crucial concepts for breaking up a graphical model into smaller pieces.
2.1 Loglinear interaction model
We adopt here the notation of Darroch et al. (1980). Assume we have $d$ factors or categorical variables, indexed by a set $V = \{1, \dots, d\}$. Each factor $v$ has a set of possible levels $\mathcal{I}_v$. The contingency table is the cartesian product of the individual sets: $\mathcal{I} = \prod_{v \in V} \mathcal{I}_v$. An individual cell in the contingency table is denoted by $i = (i_v)_{v \in V}$ and the corresponding cell count by $n(i)$. A marginal index for a variable set $A \subseteq V$ is denoted by $i_A = (i_v)_{v \in A}$. For example, assume we have two binary variables; then $V = \{1, 2\}$, $\mathcal{I}_1 = \mathcal{I}_2 = \{1, 2\}$ and the contingency table is given by $\mathcal{I} = \{1, 2\} \times \{1, 2\}$. An individual cell is for example $i = (1, 2)$ and the index describing the margin along the second variable ($A = \{2\}$) is $i_A = 2$. The total number of cells in a contingency table is $|\mathcal{I}| = \prod_{v \in V} |\mathcal{I}_v|$. In our example, $|\mathcal{I}| = 4$. A natural way of representing the distribution of the cell counts is via a vector of probabilities $p = (p(i))_{i \in \mathcal{I}}$. In our example, this corresponds to defining four probabilities $p(1,1), p(1,2), p(2,1), p(2,2)$. If a total number of $N$ individuals are classified independently, then the distribution of the corresponding cell counts is multinomial with probability $p$. Finally, the general loglinear interaction model specifies the unknown distribution as follows:
(1)  $\log p(i) = \sum_{A \subseteq V} \lambda_A(i)$,
where the $\lambda_A$ are functions of the cell $i$ which depend only on the variables in $A$, i.e. on $i_A$. These functions are called interactions between the variables in $A$. If $|A| = 1$, $\lambda_A$ is called a main effect, if $|A| = 2$ a first-order interaction, and an interaction of order $k - 1$ if $|A| = k$. For identifiability purposes we impose constraints on the functions, namely that $k$-th order interaction functions are orthogonal to interaction functions of lower order.
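As a numerical illustration of this decomposition, a minimal sketch (with hypothetical cell probabilities of our choosing) recovers the interaction terms of a 2×2 table from the log probabilities by an ANOVA-style decomposition:

```python
import numpy as np

# Hypothetical cell probabilities p(i, j) for two binary variables.
p = np.array([[0.30, 0.20],
              [0.10, 0.40]])
L = np.log(p)

# ANOVA-style decomposition of log p into orthogonal interaction terms:
lam0  = L.mean()                                  # constant term (A = empty set)
lam1  = L.mean(axis=1) - lam0                     # main effect of variable 1
lam2  = L.mean(axis=0) - lam0                     # main effect of variable 2
lam12 = L - lam0 - lam1[:, None] - lam2[None, :]  # first-order interaction

# The terms sum back to log p, as in formula (1):
assert np.allclose(lam0 + lam1[:, None] + lam2[None, :] + lam12, L)
# Identifiability: each term sums to zero over each of its own variables,
# which makes it orthogonal to all lower-order terms.
assert abs(lam1.sum()) < 1e-12 and abs(lam2.sum()) < 1e-12
assert np.allclose(lam12.sum(axis=0), 0) and np.allclose(lam12.sum(axis=1), 0)
```

The zero-sum constraints shown in the assertions are one standard way of making the decomposition unique.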
2.2 Hierarchical loglinear models
A hierarchical loglinear model is a loglinear interaction model with the additional constraint that a vanishing interaction forces all interactions of higher order to be zero as well: $\lambda_A \equiv 0 \Rightarrow \lambda_B \equiv 0$ for all $B \supseteq A$.
Hierarchical models can be specified via the so-called generators or generating class $\mathcal{C}$, which is a set of subsets of $V$ consisting of the maximal interactions which are present. More precisely, the generating class has the following property:
(2)  $\lambda_A$ may be nonzero if and only if $A \subseteq C$ for some $C \in \mathcal{C}$.
Consider an example with three binary factors. An example of a hierarchical loglinear model is the model consisting of all main effects, an interaction between variables 1 and 2 and an interaction between variables 2 and 3: this corresponds to $\mathcal{C} = \{\{1,2\}, \{2,3\}\}$. However, the loglinear interaction model with main effect $\lambda_{\{1\}}$ and interaction between variables 1 and 2 (but no main effect $\lambda_{\{2\}}$) is not in the class of hierarchical loglinear models.
If we go back to formula (1) and rewrite it in matrix formulation, we get:
(3)  $\log(p) = X\theta$,
where $\theta$ is a vector of unknown coefficients and $X$ the design matrix. Each row of $X$ corresponds to a certain cell $i$ and the columns of $X$ correspond to the functions $\lambda_A$. The number of columns needed to represent the function $\lambda_A$ depends on the number of different states $i_A$ can take on. For example, consider a categorical variable 1 that can take on 3 levels. Then $\lambda_{\{1\}}$ is called a main effect (as $|A| = 1$) and $X_{\{1\}}$ (the columns of $X$ corresponding to $\lambda_{\{1\}}$) is 2-dimensional. Originally it would be 3-dimensional, but for identifiability purposes the subspace spanned by $X_{\{1\}}$ is chosen orthogonal to the already existing columns of lower-order interactions (here orthogonal to the intercept), and we further choose the columns orthogonal within the subspace. By choosing the identifiability constraints this way, the parameterization of the matrix $X$ used in (3) is equivalent to choosing a poly contrast in terms of ANOVA. If we go back to our example with two binary factors, where $\mathcal{C} = \{\{1,2\}\}$, (3) becomes:
(4)  $\log \begin{pmatrix} p(1,1) \\ p(1,2) \\ p(2,1) \\ p(2,2) \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix} \theta$,
where the first column is the intercept and belongs to $\lambda_\emptyset$, the second column belongs to $\lambda_{\{1\}}$ and has entry 1 whenever variable 1 takes on the first level and $-1$ else, and similarly for the third column and $\lambda_{\{2\}}$. The fourth column belongs to the interaction between variables 1 and 2. A description of $X$ for the general case can be found in Dahinden et al. (2007). In the following we will denote the components of $\theta$ belonging to $\lambda_A$ by $\theta_A$.
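For the binary case, such a design matrix can be built as a Kronecker product of one-variable contrast matrices. The following sketch (our construction, with a fixed cell ordering) also checks the orthogonality used for identifiability:

```python
import numpy as np

# Orthogonal contrast for a single binary variable: the first column is
# the intercept part, the second codes level 1 as +1 and level 2 as -1.
C = np.array([[1,  1],
              [1, -1]])

# Design matrix for two binary variables with cells ordered
# (1,1), (1,2), (2,1), (2,2) (variable 1 varying slowest).  Column order
# here: intercept, main effect of variable 2, main effect of variable 1,
# interaction between variables 1 and 2.
X = np.kron(C, C)
print(X)

# All columns are mutually orthogonal, which is the identifiability
# requirement that interaction blocks be orthogonal to lower-order ones:
assert np.allclose(X.T @ X, 4 * np.eye(4))
```

With more variables or more levels, the same Kronecker construction applies with one contrast matrix per variable.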
2.3 Graphical models
We first introduce some terminology. A graph is defined as a pair $G = (V, E)$, where $V$ is the set of vertices or nodes and $E$ is the set of edges linking the vertices. Each node represents a (categorical) random variable. Here we only consider undirected graphs, which means that $(u, v) \in E$ is equivalent to $(v, u) \in E$. A path from $u$ to $v$ is a sequence $u = v_0, v_1, \dots, v_k = v$ of distinct nodes with $(v_{j-1}, v_j) \in E$ for all $j$. Given three sets of variables $A, B, S$, we say that $S$ separates $A$ from $B$ in $G$ if all paths from vertices in $A$ to vertices in $B$ have to pass through $S$. Consider a random vector $X = (X_v)_{v \in V}$ with a given distribution. We say that the distribution of $X$ is globally Markov with respect to a graph $G$ if for any 3 disjoint subsets $A, B, S \subseteq V$ such that $S$ separates $A$ from $B$, the following property holds:
(5)  $X_A \perp X_B \mid X_S$,
where the symbol “$\perp$” denotes (conditional) independence. This states that we can read off conditional independence relations directly from the graph if the distribution is globally Markov with respect to the graph. Graphical models therefore provide a way to represent conditional (in)dependence relations between variables in terms of a graph structure. We say that a set of nodes of $G$ forms a complete subgraph of $G$ if every pair of nodes in that set is connected by an edge. A maximal complete subgraph is called a clique.
The undirected graphical model (from now on “graphical model” for short) represented by a graph corresponds to a hierarchical loglinear model where the cliques of the graph are the generators of the model. If we go back to our example in the previous section with $\mathcal{C} = \{\{1,2\}, \{2,3\}\}$, the hierarchical loglinear model can be represented by the graphical model in Fig. 1(a); if the three-way interaction is present, i.e. $\mathcal{C} = \{\{1,2,3\}\}$, then the corresponding graphical model is the one in Fig. 1(b).
Conversely, assume that the generators of a loglinear model are given by a set $\mathcal{C}$. By connecting all the vertices appearing in the same generator with each other and placing no other edges, the so-called interaction graph is built. By the definition of the interaction graph and by looking at formula (1), it becomes clear that the distribution induced by the loglinear model is Markov with respect to the interaction graph, and we can read off conditional independencies directly from the graph. It is also clear that $\mathcal{C}$ corresponds to a graphical model via its interaction graph if and only if $\mathcal{C}$ is the set of cliques of this graph. In that case we say that $\mathcal{C}$ is a graphical generating class. If there are cliques in the interaction graph which are not in $\mathcal{C}$, the hierarchical loglinear interaction model is not graphical and its interaction structure cannot be adequately represented by the graph alone. However, the graph may still completely represent all conditional independencies of the underlying distribution. The simplest example of a hierarchical loglinear model which is not graphical is $\mathcal{C} = \{\{1,2\}, \{1,3\}, \{2,3\}\}$. Its interaction graph is shown in Fig. 1(c), which has as its only clique the complete graph on the three vertices. Since the set of cliques and the set of generators are not identical, the model is not graphical.
Any joint probability distribution of discrete random variables can be expanded in terms of a loglinear interaction model. For some distributions it is possible to represent all (conditional) independencies in an undirected graphical model and these distributions are called faithful to their interaction graph or we say that the graph is a perfect map of the distribution. In other words, the graph captures all and only the conditional independence relations of the distribution.
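The construction of the interaction graph and the graphicality check can be sketched as follows (a brute-force clique search, adequate only for tiny graphs; the function names are ours):

```python
from itertools import combinations

def interaction_graph(generators):
    """Connect all pairs of vertices that appear together in a generator."""
    edges = set()
    for gen in generators:
        edges |= {frozenset(e) for e in combinations(sorted(gen), 2)}
    return edges

def cliques(nodes, edges):
    """Maximal complete subgraphs, by brute force (fine for tiny graphs)."""
    def complete(s):
        return all(frozenset(e) in edges for e in combinations(s, 2))
    comp = [set(s) for r in range(1, len(nodes) + 1)
            for s in combinations(nodes, r) if complete(s)]
    return [s for s in comp if not any(s < t for t in comp)]

nodes = [1, 2, 3]
gens  = [{1, 2}, {1, 3}, {2, 3}]        # all pairwise interactions
edges = interaction_graph(gens)          # -> the complete graph on 1, 2, 3

cl = cliques(nodes, edges)
print(cl)      # [{1, 2, 3}]: the only clique is the full triangle
# The set of cliques differs from the generating class, so this
# hierarchical model is NOT graphical:
assert cl != [set(g) for g in gens]
```

This reproduces the Fig. 1(c) example: the three pairwise generators yield a triangle whose single clique {1, 2, 3} is not a generator.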
2.4 Collapsibility
Collapsing over a variable simply means summing over that variable and thereby reducing (collapsing) the table to the remaining dimensions.
When collapsing over a variable, spurious associations between the remaining variables may be introduced and original associations can vanish. A criterion that addresses this problem is collapsibility. We say that a variable is collapsible with respect to a specific interaction $\lambda_A$ when the interaction in the original contingency table is identical to the interaction in the collapsed contingency table (i.e., $\lambda_A$ in equation (1) remains the same).
The general result regarding collapsibility, which goes back to a theorem stated in Bishop et al. (1975), can be summarized as follows. Suppose the variable to be collapsed over interacts only with the variables in a set $S$. Then:
(6)  Interactions $\lambda_A$ between the remaining variables with $A \not\subseteq S$ are not changed in the collapsed table. Conversely, lower-order interactions $\lambda_A$ with $A \subseteq S$ may be affected by collapsing.
Suppose, for example, that the variable $v$ to be collapsed over only interacts with one other variable $u$. If we collapse over $v$, no interaction changes, but the main effect of $u$ may change. Furthermore, suppose that $v$ is independent of all other variables. Then neither main effects nor interactions change when collapsing over $v$.
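This collapsibility result can be checked numerically. The sketch below (hypothetical probabilities and helper functions of our own) builds a three-way table in which variable 3 interacts with variable 2 only, collapses over variable 3, and verifies that the (1,2)-interaction is unchanged while the main effect of variable 2 does change:

```python
import numpy as np

def inter12(L):
    """(1,2)-interaction of a table of log probabilities (zero-sum ANOVA
    decomposition); trailing variables are averaged out first."""
    while L.ndim > 2:
        L = L.mean(axis=-1)
    return (L - L.mean(axis=1, keepdims=True)
              - L.mean(axis=0, keepdims=True) + L.mean())

def main2(L):
    """Main effect of variable 2 in the same decomposition."""
    return (L.mean(axis=(0, 2)) if L.ndim == 3 else L.mean(axis=0)) - L.mean()

# Variable 3 interacts with variable 2 only: p(i,j,k) = p12(i,j) * p(k|j).
p12  = np.array([[0.30, 0.20], [0.10, 0.40]])   # joint of variables 1, 2
pk_j = np.array([[0.70, 0.30], [0.40, 0.60]])   # p(k | j), rows indexed by j
p = p12[:, :, None] * pk_j[None, :, :]

collapsed = p.sum(axis=2)                        # collapse over variable 3

# The (1,2)-interaction is identical before and after collapsing ...
assert np.allclose(inter12(np.log(p)), inter12(np.log(collapsed)))
# ... whereas the main effect of variable 2 (which lies in the set of
# variables interacting with variable 3) is changed:
assert not np.allclose(main2(np.log(p)), main2(np.log(collapsed)))
```

Here the collapsed table equals the (1,2)-marginal exactly, so the unchanged interaction can also be seen analytically from $\log p(i,j,k) = \log p_{12}(i,j) + \log p(k \mid j)$.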
2.5 Decomposability
Assume a graphical model on the vertex set $V$. A triple $(A, B, S)$ of disjoint subsets of the vertex set forms a decomposition if (i) $V = A \cup B \cup S$, (ii) $S$ separates $A$ from $B$ and (iii) $S$ is complete.
Decomposability is defined recursively: a graph is decomposable if it is complete or if there exists a decomposition $(A, B, S)$ such that the subgraphs of $G$ restricted to the vertex sets $A \cup S$ and $B \cup S$ are decomposable.
Denote by $\mathcal{C}$ the set of all cliques of a decomposable graph and by $\mathcal{S}$ the set of all separators. For a decomposable graph with decomposition into cliques $\mathcal{C}$ and separators $\mathcal{S}$, the probability of a cell $i$ is given by the following formula (see for example Proposition 4.18 of Lauritzen (1996)):
(7)  $p(i) = \dfrac{\prod_{C \in \mathcal{C}} p(i_C)}{\prod_{S \in \mathcal{S}} p(i_S)^{\nu(S)}}$,
where $\nu(S)$ is the so-called index of the separator $S$. The formal definition of the index is a bit cumbersome and is given in Lauritzen (1996). However, intuitively it can be thought of as the number of times the set acts as a separator. For example: in Fig. 1(d), node 2 separates nodes 1 and 3 (the cliques consisting of the single nodes 1 and 3), nodes 1 and 4, and also nodes 3 and 4. Therefore the index of the separating node 2 is 3, as node 2 acts three times as separator. If we look at Fig. 1(e), we see that node 2 only separates node 1 from the clique $\{3, 4\}$, as the single nodes 3 and 4 are no longer cliques since there is an edge between them. Therefore, node 2 only acts once as separator and thus the corresponding index is 1.
It might not be possible to decompose the graph into decomposable components. By definition, this is the case for nondecomposable graphs. The simplest example of a nondecomposable graph is given in Fig. 1(f). For a nondecomposable graph one can always add a minimal number of edges to the graph, such that it becomes decomposable (this step is called minimal triangulation; see Olesen and Madsen (2002)). Formula (7) also holds for such triangulated graphs.
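Formula (7) can be verified numerically on the chain graph 1 – 2 – 3, whose cliques are {1,2} and {2,3} with separator {2} of index 1 (a sketch with hypothetical probabilities):

```python
import numpy as np

# A distribution that is Markov w.r.t. the chain 1 - 2 - 3:
# p(i,j,k) = p(i|j) * p(j) * p(k|j), so variables 1 and 3 are
# conditionally independent given variable 2.
pj   = np.array([0.4, 0.6])
pi_j = np.array([[0.7, 0.3], [0.2, 0.8]])   # rows: j, columns: i -> p(i|j)
pk_j = np.array([[0.5, 0.5], [0.9, 0.1]])   # rows: j, columns: k -> p(k|j)
p = pi_j.T[:, :, None] * pj[None, :, None] * pk_j[None, :, :]
assert abs(p.sum() - 1.0) < 1e-12

# Cliques {1,2}, {2,3} and separator {2} with index 1:
p12 = p.sum(axis=2)        # marginal over variable 3
p23 = p.sum(axis=0)        # marginal over variable 1
p2  = p.sum(axis=(0, 2))   # marginal of variable 2

# Formula (7): product over cliques divided by product over separators.
recon = p12[:, :, None] * p23[None, :, :] / p2[None, :, None]
assert np.allclose(recon, p)
```

The same computation, done clique by clique, is what allows cell probabilities of very large tables to be evaluated without ever storing the full probability vector.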
3 Estimating a LogLinear Model Using a Decomposition Approach
In this section, we propose our novel method for recovering both the graph structure and the parameters of a discrete graphical model. The underlying idea is to learn models over small sets of variables and stitch them together. More precisely, we first estimate an initial graph using nodewise regressions, add triangulations if necessary to make this graph decomposable and then use the decomposed components as the smaller sets of variables. The smaller sets are then analyzed one at a time and stitched together using formula (7). In the following, we will describe each step of our method in more detail. The details of loglinear model selection on the smaller sets of variables are deferred to section 4. An outline of our decomposition approach for loglinear model estimation is given in the display Algorithm 1.
[Algorithm 1: decomposition approach for loglinear model estimation (display omitted)]
3.1 Estimation of initial graph by nodewise regression
If we knew the underlying true graph and if it was sparse, we could use a decomposition and collapse the contingency table on subtables given by the cliques $\mathcal{C}$ and the separators $\mathcal{S}$ from the decomposition. Then we could perform model selection in the collapsed tables and combine the estimates according to formula (7). Of course, we do not know the graph, and therefore we know neither the cliques nor the separators for the decomposition. In this section we propose a method for obtaining an initial graph estimate.
A loglinear model measures the associations among the variables. The association between two variables can also be measured by regressing one variable on the others. It is thus reasonable to apply a regression method to find groups of variables which are highly associated within a group but only weakly associated between groups.
See for the following also Algorithm 1, lines 1 to 7. Assume a graphical model on the node set $V$ with corresponding random variables $(X_v)_{v \in V}$. For every node $v$ in the graph, we run a regression with $X_v$ as response variable and all remaining variables as covariates. We then draw an edge between nodes $u$ and $v$ if and only if the covariate $X_u$ has an influence on the response $X_v$. Ideally, the regression method involves interactions among the covariates (unless we assume a binary Ising model as in Wainwright et al. (2007); Ravikumar et al. (2009)).
Inspired by Kim (2005), we use a nonparametric regression approach. But instead of the single regression tree strategy used there, we use a Random Forest approach (see Breiman (2001)). Trees can naturally incorporate interactions between variables without running severely into the curse of dimensionality and are therefore ideal for our purposes. We prefer Random Forest over a single tree, since Random Forest often yields much more stable results for variable selection.
There are three common ways of measuring the importance of individual variables in Random Forest regression. First, the importance measure can be the number of times a variable has been chosen as split variable (selection frequency). Second, the decrease in the so-called Gini index can be used. Third, the permutation accuracy, which measures the change in prediction accuracy after permuting a variable, can be used.
By performing nodewise regression of each variable on all others and using any importance measure mentioned above, an importance matrix $M$ can be built whose element $M_{vu}$ describes the importance of variable $u$ (acting as covariate) for variable $v$ (acting as response). Note that this matrix is not symmetric.
There is one technical difficulty, which we would like to mention here. A high entry in the importance matrix indicates a strong association between the corresponding row and column variable. However, depending on the importance measure, the values between various predictor variables as well as between different regressions might not be comparable. It has been shown that popular importance criteria in Random Forest such as the Gini index, the selection frequency or the permutation accuracy are all strongly biased towards variables with more categories (see Strobl et al. (2007)). We therefore use the “cforest” method proposed by Strobl et al. (2007) which provides a variable importance measure that can be reliably used for variable selection even in situations where the predictor variables vary in their scale of measurement or their number of categories.
Still, however, the importance measures are only comparable within rows but not between rows, as the variable importance depends not only on the predictor variables in a regression but also on the response variable. Therefore, values cannot be directly compared between rows. For that reason we only consider the ranks of the importance matrix entries within rows. This yields the importance matrix of ranks $R$, where $R_{vu}$ is the rank of $M_{vu}$ within row $v$ and small numbers correspond to small ranks. Thus, two variables $u$ and $v$ are strongly (conditionally) dependent if and only if $R_{uv}$ and $R_{vu}$ are both large. In this context, we define a symmetrized importance matrix of ranks $\tilde{R}$ with $\tilde{R}_{uv} = \min(R_{uv}, R_{vu})$. An initial graph estimate could now be obtained by placing only those edges whose corresponding entry in $\tilde{R}$ is larger than a given cutoff.
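The ranking and symmetrization step can be sketched as follows (a toy importance matrix of our choosing stands in for the Random Forest output):

```python
import numpy as np

# Hypothetical importance matrix: row v = regression with variable v as
# response, entry [v, u] = importance of covariate u (the meaningless
# diagonal is set to -inf so it always gets the smallest rank).
M = np.array([[-np.inf,  5.0,     0.1],
              [ 9.0,    -np.inf,  8.0],
              [ 0.3,     4.0,    -np.inf]])

# Raw importances are only comparable within a row, so convert each row
# to ranks (small numbers = small ranks; values are distinct per row):
R = M.argsort(axis=1).argsort(axis=1)

# Symmetrize: u and v are strongly associated only if BOTH ranks are
# large, so take the elementwise minimum of the two directions.
R_sym = np.minimum(R, R.T)
print(R_sym)

# An initial graph keeps the edges whose entry exceeds a cutoff:
cutoff = 1
edges = [(u, v) for u in range(3) for v in range(u + 1, 3)
         if R_sym[u, v] > cutoff]
print(edges)   # [(0, 1)]
```

The double `argsort` is a standard idiom for within-row ranking when the entries are distinct; ties would need an explicit tie-breaking rule.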
3.2 Triangulation and recursive decomposition
Since it is not clear how to find a suitable cutoff for the initial graph, we employ a recursive scheme (see Algorithm 1, lines 8 to 26). First, we start with the complete graph. Then, we delete the edge with the smallest importance according to the symmetrized importance matrix of ranks. If the resulting graph is not decomposable, we extend it to its minimal triangulation, which is guaranteed to be decomposable (see section 2.5). If the triangulated graph has a clique that is small enough to be analyzed on its own (whether this is the case depends on the computational resources and, e.g., on whether the variables are binary), we split it off. If it has no such clique, we keep deleting edges according to smallest importance (and set the corresponding entry of deleted edges in the rank matrix to infinity, so that they are not considered for deletion again) until there is a small enough clique. Assume now that $C$ corresponds to such a clique in the (triangulated) thinned graph and that a separator $S \subseteq C$ separates $C \setminus S$ from the remaining nodes. We then split off this clique (and we then estimate a loglinear model on it, see section 4.1).
From here, we restart the whole procedure with the remaining graph, i.e. with the subgraph induced by the nodes that were not split off (the separator nodes are retained). As the corresponding importance matrix, we take only those rows and columns which correspond to the remaining nodes. This is repeated until the remaining graph consists of cliques whose cardinalities are less than or equal to a prespecified maximal clique size. Therefore, the number of edges which we recursively delete depends on the maximal clique size we allow, which is a tuning parameter of our initial graph estimation procedure. As mentioned above, this maximal clique size is usually chosen according to computational requirements, as the optimal size of the initial graph seems to be of minor importance.
Note that it is not crucial that we have the sparsest possible subgraphs to collapse on, as long as any reasonable loglinear model selection procedure can be applied to these subgraphs. Such a decomposition is implemented in the R package LLdecomp, available at http://www.r-project.org.
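A minimal sketch of the triangulation step is given below. It uses greedy vertex elimination with fill-in, a standard heuristic (not necessarily the minimal triangulation used by LLdecomp); the filled graph is chordal and hence decomposable:

```python
def is_simplicial(adj, v):
    """A vertex is simplicial if its neighbours form a complete subgraph."""
    nb = list(adj[v])
    return all(b in adj[a] for k, a in enumerate(nb) for b in nb[k + 1:])

def triangulate(adj):
    """Greedy elimination with fill-in: repeatedly eliminate a vertex
    (a simplicial one if available, which adds no edges), connecting its
    not-yet-adjacent neighbours.  Returns the set of added fill-in edges."""
    adj = {v: set(nb) for v, nb in adj.items()}    # work on a copy
    fill, remaining = set(), set(adj)
    while remaining:
        v = min(remaining,
                key=lambda u: 0 if is_simplicial(adj, u) else len(adj[u]))
        nb = list(adj[v])
        for k, a in enumerate(nb):                 # complete the neighbourhood
            for b in nb[k + 1:]:
                if b not in adj[a]:
                    adj[a].add(b); adj[b].add(a)
                    fill.add(frozenset((a, b)))
        remaining.discard(v)
        for u in adj[v]:
            adj[u].discard(v)
        adj[v] = set()
    return fill

# The 4-cycle 1-2-3-4-1 (as in Fig. 1(f)) is not decomposable;
# a single chord makes it so:
cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
added = triangulate(cycle)
print(added)   # exactly one fill-in edge, e.g. {2, 4}
```

If the elimination never needs a fill-in edge, the input graph was already decomposable, so the same routine doubles as a decomposability check.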
3.3 Combination of results
Assume we have collapsed the table on the cliques $\mathcal{C}$ and separators $\mathcal{S}$ of the graph induced by the recursive thinning and decomposition procedure. Furthermore, assume that we have fitted a model for each of these subtables (the collapsed tables on $C \in \mathcal{C}$ and $S \in \mathcal{S}$). We then get the loglinear model corresponding to the full graph by using formula (7):
(8)  $\log p = \sum_{C \in \mathcal{C}} X_C \theta^{(C)} - \sum_{S \in \mathcal{S}} \nu(S)\, X_S \theta^{(S)}$,
where $X_C$ and $X_S$ are the design matrices resulting from restricting the total design matrix $X$ to the nodes in $C$ and $S$, respectively; the same notation applies to $\theta^{(C)}$ and $\theta^{(S)}$. Formula (8) describes how to aggregate the results of the collapsed tables. In addition, one can derive from (8) that if we have 3 disjoint subsets $A, B, S$ where $S$ separates $A$ from $B$, then we can safely collapse over $B$ without changing an interaction between variables in $A$ or between a mix of variables from $A$ and $S$. The only interactions which might change are the ones between variables which lie exclusively in $S$ (in the following denoted by separator interactions). This is in accordance with the result stated in section 2.4. Moreover, as formula (8) holds, the spuriously introduced interactions have very small coefficients. We therefore expect that if we threshold the parameter vector $\hat{\theta}$, most of the introduced zeros belong to so-called separator interactions, i.e. components $\theta_A$ with $A \subseteq S$ for some $S \in \mathcal{S}$: interactions exclusively contained in a separator. Consequently, we set the threshold such that the introduced zeros belong in equal parts to separator and non-separator interactions. See Figure 2 for a graphical illustration of the procedure. In section 5.3 we will argue empirically that such a thresholding rule works well.
4 Graphical Model Selection Procedures
In this section, we state five methods for loglinear model selection. Three of them (DGL, DSF and DF) can be used for model selection within our proposed decomposition approach (see line 19 in Algorithm 1). The remaining two (WW and RF) are established alternative methods and will be used for comparison.
Our methods use up to three tuning parameters. For the decomposition of the graph, we use a bound on the size of the cliques (the maximal clique size; see line 16 of Algorithm 1). Furthermore, in section 3.3 we introduced a threshold for correcting the interaction-adding effect of collapsing to marginal models. Finally, in this section we use regularization parameters for selecting the model within components.
4.1 Model selection for decomposition approach
The first method we propose is inspired by the Lasso, originally formulated by Tibshirani (1996) for estimation and variable selection in linear regression. Extending this idea, a model selection approach for loglinear models has been developed in Dahinden et al. (2007). The coefficient vector $\theta$ is estimated with the group-lasso penalty (Yuan and Lin, 2006):
(9)  $\hat{\theta}(\lambda) = \operatorname{argmin}_{\theta} \big\{ -\ell(\theta) + \lambda \sum_A \|\theta_A\|_2 \big\}$,
where $\ell(\theta) = \sum_{i \in \mathcal{I}} n(i)\,(X\theta)_i$ and $\|\cdot\|_2$ denotes the Euclidean norm. Therefore $\ell(\theta)$ is, up to an additive constant which does not depend on $\theta$, the loglikelihood function. This minimization has to be calculated under the additional constraint that the cell probabilities add to 1. The group-lasso penalty has the property that the solution of (9) is independent of the choice of the orthogonal subspaces for the columns of $X$, and furthermore the penalty encourages sparsity at the interaction level. Thus, the vector $\theta_A$ corresponding to the interaction $\lambda_A$ (see section 2.1) has all components either nonzero or zero. Furthermore, by using group-lasso model selection we avoid the sampling zero problem, which is problematic regarding the existence of the MLE (see e.g. Christensen (1991)). The tuning parameter $\lambda$ can be assessed by cross-validation: we divide the individual counts into a number of equal parts and in turn leave out one part and use the rest to form a training contingency table.
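The key computational building block of group-lasso algorithms is the blockwise soft-thresholding (proximal) operator, which zeroes an interaction block as a whole. A sketch (not the solver of Dahinden et al. (2007) itself):

```python
import numpy as np

def group_soft_threshold(theta_A, lam):
    """Proximal operator of the group-lasso penalty lam * ||theta_A||_2:
    shrink the whole block towards zero and set it exactly to zero when
    its Euclidean norm is at most lam."""
    norm = np.linalg.norm(theta_A)
    if norm <= lam:
        return np.zeros_like(theta_A)
    return (1.0 - lam / norm) * theta_A

# The block for an interaction is either zeroed as a whole ...
print(group_soft_threshold(np.array([0.3, -0.4]), lam=1.0))  # -> [0. 0.]
# ... or kept with all components nonzero (shrunk towards zero):
print(group_soft_threshold(np.array([3.0, -4.0]), lam=1.0))  # -> [ 2.4 -3.2]
```

This all-or-nothing behaviour at the block level is exactly the "sparsity at the interaction level" described above.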
We abbreviate our decomposition approach using the group-lasso penalty for arbitrary values of $\lambda$ by DGL (“Decomposition Group Lasso”). When $\lambda$ is chosen by cross-validation, we abbreviate the method by DGL:CV. If, in addition, hard-thresholding of the parameter vector is used in the specific way described in section 3.3, we abbreviate this method by DGL:F (“Final model of suggested procedure”).
Our second method is a stepwise forward procedure and aims to minimize the AIC-type criterion $-2\log(\hat{L}) + c \cdot \mathrm{df}$, where $\hat{L}$ is the maximized value of the likelihood function for the corresponding model with $\mathrm{df}$ degrees of freedom; $c = 2$ corresponds to the genuine AIC. Here we also vary the parameter $c$; a large parameter leads to sparser models. We abbreviate this method for arbitrary values of $c$ by DSF (“Decomposition Stepwise Forward”) and by DSF:AIC if $c = 2$.
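The criterion itself can be sketched as follows (the candidate models and their likelihood values are hypothetical, chosen only to illustrate the comparison):

```python
def aic_type(loglik, df, c=2.0):
    """AIC-type criterion -2*log(L) + c*df; c = 2 is the genuine AIC,
    larger c yields sparser models."""
    return -2.0 * loglik + c * df

# Hypothetical candidates from a stepwise forward search, given as
# (maximized log-likelihood, degrees of freedom):
models = {
    "main effects":   (-1520.3, 40),
    "+ {1,2}":        (-1511.8, 41),
    "+ {1,2}, {2,3}": (-1510.9, 42),
}
best = min(models, key=lambda m: aic_type(*models[m]))
print(best)   # "+ {1,2}": the extra {2,3} term does not pay for its df
```

Raising `c` above 2 penalizes each added degree of freedom more heavily, which is how DSF is steered towards sparser models.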
Finally, we propose a third method where no model selection is performed after decomposition and the MLE of the decomposed model is used. This corresponds to DSF with $c = 0$. We abbreviate this method by DF (“Decomposition Full model”).
4.2 Alternative approaches for comparison
Our first method for comparison is given in Wainwright et al. (2007), where the problem of estimating the graph structure of binary-valued Markov networks is considered. They propose to estimate the neighborhood of any given node by performing $\ell_1$-penalized logistic regressions on the remaining variables using some penalty parameter $\lambda$. Assume we have $d$ binary random variables $X_1, \dots, X_d$ and $n$ observations thereof. Furthermore, we assume that the data are generated under the so-called Ising model:
(10)  $p_\Theta(x) = \frac{1}{Z(\Theta)} \exp\big( \sum_u \theta_{uu} x_u + \sum_{u < v} \theta_{uv} x_u x_v \big), \qquad x \in \{-1, 1\}^d$,
where $\Theta = (\theta_{uv})$ is a symmetric matrix and $Z(\Theta)$ is a normalizing constant which ensures that the probabilities add up to one. If we go back to the loglinear interaction model described in section 2.1 with binary variables whose levels are coded as $\{-1, 1\}$, then by comparing formula (1) to (10) we see that the Ising model is a loglinear model whose highest interactions are of order one, and the parameterization is, in terms of ANOVA, with Helmert instead of poly contrasts. Therefore, the interaction graph is built by connecting the nodes $u$ and $v$ for which $\theta_{uv} \neq 0$. Wainwright et al. (2007) prove that under certain sparsity assumptions their method correctly identifies the underlying graph structure. Note that both our method and Wainwright et al. (2007) use nodewise regression. The difference is that we estimate the node neighborhoods by performing nonparametric regressions and hard-thresholding the importance measures derived from them. We emphasize that the Wainwright et al. (2007) approach works for binary variables only, while our decomposition approach explained in section 3 works for general multicategory variables. For arbitrary values of the tuning parameter $\lambda$ we abbreviate this method by WW. If the tuning parameter is chosen by cross-validation, we abbreviate the method by WW:CV. It turns out that WW:CV sometimes yields dense models. Therefore, we abbreviate by WW:MIN the solution for the minimal $\lambda$ for which the normalisation constant could be computed (due to computational limitations connected with the junction tree algorithm).
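For a tiny example, the Ising model (10) can be evaluated by brute-force enumeration (a sketch with a hypothetical interaction matrix; the main-effect terms on the diagonal are set to zero here for simplicity):

```python
import numpy as np
from itertools import product

# Hypothetical symmetric interaction matrix Theta for d = 3 binary
# variables coded as x in {-1, +1}^3; theta[u, v] != 0 <=> edge u - v.
theta = np.array([[0.0,  0.8,  0.0],
                  [0.8,  0.0, -0.5],
                  [0.0, -0.5,  0.0]])

def unnormalized(x):
    x = np.asarray(x, dtype=float)
    return np.exp(0.5 * x @ theta @ x)   # factor 1/2: each pair counted twice

cells = list(product([-1, 1], repeat=3))
Z = sum(unnormalized(x) for x in cells)            # normalizing constant
p = {x: unnormalized(x) / Z for x in cells}
assert abs(sum(p.values()) - 1.0) < 1e-12          # probabilities add to one

# Interaction graph: connect u and v whenever theta[u, v] != 0.
edges = [(u, v) for u in range(3) for v in range(u + 1, 3)
         if theta[u, v] != 0]
print(edges)   # [(0, 1), (1, 2)]
```

Enumeration of $Z(\Theta)$ is only feasible for very small $d$; this exponential cost is precisely why Wainwright et al. (2007) avoid the normalizing constant by working with nodewise logistic regressions.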
Our second method for comparison is Random Forest. As explained in section 3.1, Random Forest together with a suitable cutoff on the rank importance matrix can be used to derive an initial graph estimate. This yields a graph but no estimation of the loglinear interaction model is provided. We abbreviate this method by RF.
For convenience, in Table 1 we give a short overview of the methods that were defined in this section and are going to be used in the simulation study in the next section.
Table 1: Overview of the methods compared in the simulation study.

Abbreviation   Description
DGL            Decomposition approach using the group-lasso penalty, without fixing the penalty parameter.
DGL:CV         As DGL, but with the penalty parameter fixed by cross-validation.
DGL:F          Final result of DGL after hard-thresholding as explained in section 3.3.
DSF            Decomposition approach using stepwise forward selection, without fixing the penalty parameter.
DSF:AIC        As DSF, but with the penalty parameter fixed according to AIC.
DF             Decomposition approach without model selection, using the MLE.
WW             Approach by Wainwright et al. (2007), without fixing the penalty parameter.
WW:CV          As WW, but with the penalty parameter fixed by cross-validation.
WW:MIN         As WW, using the minimal penalty parameter that is computationally feasible for the junction tree algorithm.
RF             Random Forest.
5 Simulation Study
We simulate from a loglinear interaction model corresponding to a graph with 40 nodes and 91 edges. Each node corresponds to a binary variable (and thus, we can compare with the method in Wainwright et al. (2007), explained in section 4.2). The graph is displayed in Fig. 3. This is the same simulation setting as was used in Kim (2005).
We generate 10 datasets, each consisting of 100,000 observations, according to the graph in Fig. 3.
In section 5.1, we investigate the effect of the maximal clique size (for splitting off a clique as explained in section 3.2) in our decomposition approach with respect to performance in estimating the correct model structure. In section 5.2 we compare different approaches with respect to performance in estimating the correct model structure, while in section 5.3 we compare in terms of accuracy for estimating the cell probabilities.
5.1 Optimal clique size
We estimate a model using DGL:CV with the maximal clique size equal to 3, 5 and 10. ROC curves are shown in Fig. 4, where the endpoints of the curves correspond to the selected models of DGL:CV. The curves then build up by successively eliminating the edge corresponding to the smallest estimated interaction coefficient norm. We see here that larger decomposition sizes lead to slightly more favorable ROC curves. The picture remains qualitatively the same if we use DSF instead of DGL. For the remainder of the simulations, we keep the maximal clique size in our approach fixed at the largest of the three values considered.
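The construction of such a ROC curve from edge scores can be sketched as follows; the scores stand in for the norms of the estimated interaction coefficient vectors, and the 3-node example data are hypothetical:

```python
import numpy as np

def roc_points(scores, truth):
    """Trace a ROC curve by thresholding the edge scores at every observed
    value, i.e. by successively removing the weakest remaining edge.

    scores: symmetric (p, p) array of edge strengths (e.g. coefficient norms)
    truth:  symmetric boolean (p, p) adjacency matrix of the true graph
    Returns (false positive rate, true positive rate) pairs, strongest
    threshold first."""
    iu = np.triu_indices_from(scores, k=1)   # count every edge once
    s, t = scores[iu], truth[iu]
    pts = []
    for thresh in np.sort(np.unique(s))[::-1]:
        pred = s >= thresh
        tpr = (pred & t).sum() / max(t.sum(), 1)
        fpr = (pred & ~t).sum() / max((~t).sum(), 1)
        pts.append((fpr, tpr))
    return pts

# Hypothetical example: true edges 0-1 and 1-2, scored 0.9 and 0.1;
# the spurious edge 0-2 is scored 0.5.
scores = np.zeros((3, 3))
scores[0, 1] = scores[1, 0] = 0.9
scores[0, 2] = scores[2, 0] = 0.5
scores[1, 2] = scores[2, 1] = 0.1
truth = np.zeros((3, 3), dtype=bool)
truth[0, 1] = truth[1, 0] = True
truth[1, 2] = truth[2, 1] = True
pts = roc_points(scores, truth)
```

Each threshold corresponds to one model on the curve; the endpoint of the selected model is simply the point at its own threshold.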
5.2 Performance for structure estimation
First, we compare DGL, DSF and DF. The performance for structure estimation is shown as ROC curves (see Fig. 5). For DSF the curve builds up by varying the tuning parameter (compare section 4.1). Note that the starting point of the DSF curve coincides with DF. DGL starts from DGL:CV and uses hard-thresholding of the estimated coefficients to obtain the values on the ROC curve. Note that during this hard-thresholding, the value of DGL:F is produced as well. We see that DSF and DGL lead to models which have approximately the same number of false positive and false negative edges, with DGL being slightly better. The solution of DF has the largest false positive rate (as expected, since no model selection was done).
Second, in Fig. 6 the alternative approaches WW and RF are compared with DGL. To keep the overview simple, the results for DSF are no longer shown. We see that our decomposition approach (DGL) slightly outperforms the alternative approaches WW and RF. Keep in mind that, in addition to the performance advantage of our method, RF does not yield an estimate of the parameter vector and WW does so only for binary variables, whereas our method is applicable to general multi-category variables.
Even though Fig. 5 and Fig. 6 represent only one simulated dataset, we observed a similar picture for other simulation settings. The lines for DGL and DSF are always very close, with DGL being slightly better and both methods being clearly superior to the global approaches. The reason why Fig. 5 and Fig. 6 display results from one dataset only is that the final models cannot be averaged over different datasets, as they occupy different positions on the curves for different datasets (different numbers of true and false positives); averaging over all these values would not be very meaningful. However, we can average the differences of true positive rates, e.g. for DGL and DSF, where both methods have the same number of false positives as the solution of DGL:F (dotted vertical line in Fig. 5). The results of such comparisons are summarized in Table 2. We see that DGL yields a significantly higher true positive rate than the corresponding solutions of DSF and WW. On the other hand, the comparison between the solution of DGL and RF (at the false positive rate of DGL:F) shows no significant difference.
5.3 Performance for estimation of cell probabilities
DGL, DSF, DF and WW yield estimates of the parameter vector which can be immediately transformed into estimates of cell probabilities. In this section we will compare these methods with respect to the performance for estimating the cell probabilities. Recall that RF does not yield an estimate of the parameter vector and is therefore not included in this comparison.
All approaches considered in this section yield the parameter vector only up to a constant. We need to ensure that the estimated cell probabilities add up to one. For sparse graphs, we can use the junction tree algorithm to calculate the normalizing constant of the probabilities. For a detailed description see Lauritzen (1996).
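For graphs small enough to enumerate all cells, the normalization step can also be carried out by brute force, which serves as a sanity check for toy examples (the junction tree algorithm is needed precisely when such enumeration is infeasible). The sketch below uses a hypothetical interaction coefficient theta12 on three binary variables:

```python
from itertools import product
import numpy as np

def normalize_cells(log_phi):
    """Convert unnormalized log cell probabilities (the linear predictor of a
    log-linear model) into a proper probability distribution via log-sum-exp."""
    cells = list(log_phi)
    vals = np.array([log_phi[c] for c in cells])
    log_z = vals.max() + np.log(np.exp(vals - vals.max()).sum())  # stable
    return {c: float(np.exp(v - log_z)) for c, v in zip(cells, vals)}

# Toy model: 3 binary variables with one pairwise interaction; full
# enumeration over the 2^3 cells replaces the junction tree here.
theta12 = 0.8  # hypothetical interaction coefficient
log_phi = {x: theta12 * x[0] * x[1] for x in product([0, 1], repeat=3)}
probs = normalize_cells(log_phi)
```

The log-sum-exp trick matters in practice: summing exponentiated linear predictors directly can overflow when coefficients are large.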
We compare the estimated probabilities, using cross-validation for the tuning parameters, by means of an expression which equals, up to an additive constant, the Kullback–Leibler divergence between the true and the estimated probabilities (non-normalized Kullback–Leibler divergence):
−Σ_x p(x) log p̂(x),        (11)
where p̂ denotes the estimated and p the true probability vector. As this sum requires the calculation of all 2^40 components of p and p̂ and a summation over these two huge vectors, it is computationally not feasible. To avoid this problem, we calculate an empirical version by simulating one million observations from the graph in Fig. 3 and summing over these values only.
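The Monte Carlo approximation of (11) can be sketched as follows: instead of summing over all cells, we draw cell indices from the true distribution and average the negative log estimated probability over the draws. The two toy distributions are hypothetical inputs:

```python
import numpy as np

def empirical_neg_loglik(p_true, p_hat, n_sim=1_000_000, seed=1):
    """Monte Carlo version of -sum_x p(x) log p_hat(x): draw cells from the
    true distribution and average -log p_hat over the draws."""
    rng = np.random.default_rng(seed)
    cells = rng.choice(len(p_true), size=n_sim, p=p_true)
    return float(-np.mean(np.log(p_hat[cells])))

# Hypothetical toy distributions over three cells.
p_true = np.array([0.5, 0.3, 0.2])
p_hat = np.array([0.4, 0.4, 0.2])
est = empirical_neg_loglik(p_true, p_hat)
exact = float(-np.sum(p_true * np.log(p_hat)))  # feasible for 3 cells
```

With one million draws the Monte Carlo error of the average is negligible relative to the differences between methods reported in Table 3.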
The results are summarized in Table 3. We see that DGL:F, DSF:AIC and DF perform similarly, and WW:MIN is clearly inferior. For WW:CV, which is indicated in Fig. 6, the normalizing constant could not be computed: the CV-optimal solution almost corresponds to the full model, so the junction tree algorithm is not feasible. The maximal computable solutions of WW still correspond to very large models, which on average involve a substantially larger fraction of all possible edges than DGL:F or the true graph.
Table 4 provides further insight into the significance of the differences in Table 3. All methods are compared against each other using paired t-tests on the empirical (non-normalized) Kullback–Leibler divergences, and the p-values are reported. One can see that there is no significant difference among the decomposition approaches, whereas they are all superior to WW. This provides evidence that the decomposition of the model is more crucial than the particular choice of the log-linear model fitting procedure afterwards.
Furthermore, it is worth noting that the thresholding of the coefficients (see section 3.3) hardly influences the likelihood, as can be seen in Fig. 7. On average, a substantial fraction of the coefficients is thresholded, as indicated by the dotted line. However, for these threshold-optimal solutions, calculated as described in section 3.3, the empirical (non-normalized) Kullback–Leibler divergence is approximately the same as for the non-thresholded model.
Table 3: Empirical (non-normalized) Kullback–Leibler divergence; mean and standard deviation over the 10 simulated datasets.

Method    Mean    SD
DGL:F     4.0080  0.0204
DSF:AIC   4.0242  0.0217
DF        4.0223  0.0099
WW:MIN    4.3360  0.1094
Table 4: p-values of paired t-tests comparing the empirical (non-normalized) Kullback–Leibler divergences.

          DSF:AIC  DF      WW:MIN
DGL:F     0.1030   0.0677  4.7
DSF:AIC   --       0.8080  6.0
DF        --       --      7.5
6 Application to Tissue Microarray Data
6.1 Tissue Microarray technology
The central motivation that led to this work was to fit a graphical model to discrete expression levels of biomarkers resulting from Tissue Microarray (TMA) experiments. Tissue Microarray technology allows rapid visualization of molecular targets in thousands of tissues at a time, either at the DNA, RNA or protein level. Tissue Microarrays are composed of hundreds of tissue sections from different patients arrayed on a single glass slide. With the use of immunohistochemical staining, they provide a high-throughput method for analyzing potential biomarkers in large patient samples. The assessment of the expression level of a biomarker is usually performed by a pathologist on a categorical scale: expressed/not expressed, or the graded level of expression.
Tissue Microarrays are powerful for validating and extending findings obtained from genomic surveys such as cDNA microarrays. cDNA microarrays are useful for analyzing a huge number of genes, e.g. a couple of thousand, in one specimen at a time. In contrast, TMAs are applicable to the analysis of one target at a time, denoted as a biomarker, but in up to 1000 tissues on each slide.
The analysis of the interaction pattern of these biomarkers and in particular the estimation of the graphical model associated with the underlying discrete random variables are of biomedical importance. These graphbased patterns can deliver valuable insight into the underlying biology. A detailed description of the Tissue Microarray technology can be found in Kallioniemi et al. (2001).
6.2 Estimation of graphical model
Our TMA dataset consists of Tissue Microarray measurements from renal cell carcinoma patients. We have information on 1116 patients, 831 of whom have a clear cell carcinoma tumor, which is the tumor of interest here. We identified 18 biomarkers for which we have information for the majority of the patients. Among the 831 ccRCC (clear cell renal cell carcinoma) observations, 527 are complete with all biomarker measurements available. For 87 observations one measurement was missing, and 64 and 30 observations had 2 or 3 missing values, respectively. 123 observations contained more than 3 missing values and were excluded from the analysis. For the observations with 1, 2 or 3 missing values, multiple imputation was applied using the R package mice (Van Buuren and Oudshoorn, 2007). Of the 18 biomarkers, 9 are binary and 9 have 3 levels.
Using DGL:F, we fitted a graphical model to the TMA data. The resulting graph is displayed in Fig. 8. The thickness of a line corresponds to the norm of the respective interaction coefficients; two biomarkers connected by a thick line, as is the case for nuclear p27 and cytoplasmic p27, thus show a strong interaction. The kinase inhibitor p27 exerts its function in the cell nucleus, and therefore recent studies have focused on nuclear p27 expression. Our graphical log-linear model, however, shows a tight association between nuclear and cytoplasmic expression of p27. It can therefore be speculated that both nuclear and cytoplasmic presence is required to ensure proper function of p27. It has been shown that in renal tumors, the von Hippel–Lindau protein (VHL protein) upregulates the expression of the tumor suppressor p27 (Osipov et al., 2002). The graphical model here provides supporting evidence that VHL indeed regulates p27, and the corresponding coefficient (not displayed here) implies that it is a positive regulation.
Furthermore, it has been shown in vitro by Roe et al. (2006) that VHL increases the expression of p53, which is a tumor suppressor. In our model, p53 appears to be conditionally independent of VHL. However, it has long been known that p53 activates the expression of p21 (e.g. Kim (1997)), and this dependence is displayed very clearly in the graphical model. We can therefore view the p53–p21 pathway, with its strong interaction, as one unit, and it is then very reasonable that nuclear VHL interacts with this unit. As nuclear VHL is expressed in only 14% of the tumors, and as it makes sense from a biological point of view that the strong interaction between VHL and the p21–p53 pathway is in fact a causal relation, we can indeed speculate that the loss of VHL deactivates the tumor suppressor p53, which in turn favors tumor development.
CA9, Glut1 and Cyclin D1 are all hypoxia-inducible transcription factor (HIF) target genes (Wenger et al., 2005). HIF itself has not been measured, but we can clearly see that all these HIF targets are connected by rather thick lines, implying that they might react to a common gene. In addition, CD10 strongly interacts with Glut1, a known HIF target, which suggests that CD10 might also be regulated by HIF. The reduction of E-Cadherin expression has been found to be negatively correlated with HIF expression by Imai et al. (2003). This is supported by a strong negative interaction between E-Cadherin and CA9, which is positively correlated with HIF expression (not measured).
Much supporting evidence has thus been delivered for already existing theories. However, two strong interactions, one between PAX2 and nuclear p21 and the other between PAX2 and Cyclin D1, cannot be immediately explained. PAX2 is absent in normal renal tubular epithelial cells but expressed in many clear cell renal cell carcinoma tumors (see Mazal et al. (2005)). Its frequent expression, together with the strong interactions with the p21–p53 pathway, Cyclin D1 and PTEN, makes PAX2 an interesting and possibly important molecular parameter whose exact function and role remain to be elucidated.
A more detailed discussion of biomedical implications is given in Dahinden et al. (2009).
7 Discussion
We have proposed a decomposition approach for estimating log-linear models for large contingency tables and for fitting discrete graphical models. In a simulation study we compared various algorithms and concluded that our procedures are very competitive. The decomposition of the problem appears to be much more crucial than the choice of the algorithm used to handle the smaller decomposed datasets: no matter whether DGL, DSF or DF is applied, the resulting models are superior to non-decomposition approaches such as WW and RF, for model selection as well as for probability or parameter estimation.
Perhaps most important is the computational feasibility of our procedure for large contingency tables. The proposed method is scalable to orders of realistic complexity (e.g. dozens up to hundreds of factors) where most or all other existing algorithms become infeasible. In particular, our procedure is not only capable of handling binary data but can easily deal with factors with more levels. Furthermore, with DGL one does not risk the non-existence of the parameter estimator in the case of sampling zeros in the contingency table, as can happen when using the MLE. Our procedure not only fits a graphical model but also yields an estimate of the parameter vector in a log-linear model and therefore of the cell probabilities. All this is achieved with good performance in comparison to other methods. As a drawback, if the true underlying graph has a clique which is larger than our decomposition size, then some of the edges in the graph are necessarily lost.
We applied the proposed approach to a problem in molecular biology and found supporting evidence for dependencies between biomarkers which had already been established in vitro, some even in renal tumors, the domain of our application. However, some strong interactions cannot be explained immediately, and therefore new biological hypotheses arise.
The R package LLdecomp for computing a decomposition as described in this paper (i.e., recursively finding cliques and separators) is available at http://www.r-project.org.
We would like to thank an Associate Editor and two referees for their constructive comments. We would also like to thank Dr. Peter Schraml for valuable comments, suggestions and discussions regarding biological interpretation.
Conflict of Interest Statement
The authors have declared no conflict of interest.
References
 Bishop et al. (1975) Bishop, Y., Fienberg, S. and Holland, P. (1975) Discrete Multivariate Analysis. MIT Press, Cambridge, Massachusetts and London, UK.
 Breiman (2001) Breiman, L. (2001) Random forests. Machine Learning, 45, 5–32.
 Christensen (1991) Christensen, R. (1991) Linear Models for Multivariate, Time Series, and Spatial Data. Springer-Verlag.
 Dahinden et al. (2009) Dahinden, C., Ingold, B., Wild, P., Boysen, G., Luu, V.D., Montani, M., Kristiansen, G., Bühlmann, P., Moch, H. and Schraml, P. (2009) Mining tissue microarray data to unhide dysregulated tumor suppressor pathways in clear cell renal cell cancer: loss of p27 and PTEN is critical for tumor progression. To appear in Clinical Cancer Research.
 Dahinden et al. (2007) Dahinden, C., Parmigiani, G., Emerick, M. and Bühlmann, P. (2007) Penalized likelihood for sparse contingency tables with an application to full-length cDNA libraries. BMC Bioinformatics, 8, 476.
 Darroch et al. (1980) Darroch, J., Lauritzen, S. and Speed, T. (1980) Markov fields and log-linear interaction models for contingency tables. Ann. Statist., 8, 522–539.
 Imai et al. (2003) Imai, T., Horiuchi, A., Wang, C., Oka, K., Ohira, S., Nikaido, T. and Konishi, I. (2003) Hypoxia attenuates the expression of E-Cadherin via up-regulation of SNAIL in ovarian carcinoma cells. The American Journal of Pathology, 163, 1437–1447.
 Jackson et al. (2007) Jackson, L., Gray, A. and Fienberg, S. (2007) Sequential category aggregation and partitioning approach for multi-way contingency tables based on survey and census data. Preprint.
 Kallioniemi et al. (2001) Kallioniemi, O., Wagner, U., Kononen, J. and Sauter, G. (2001) Tissue microarray technology for highthroughput molecular profiling of cancer. Hum. Mol. Genet., 10, 657–662.
 Kim (2005) Kim, S. (2005) Loglinear modelling for contingency tables by using marginal model structures. Research Report 05, Division of Applied Mathematics, Korea Advanced Institute of Science and Technology.
 Kim (1997) Kim, T. (1997) In vitro transcriptional activation of the p21 promoter by p53. Biochemical and Biophysical Research Communications, 234, 300–302.
 Lauritzen (1996) Lauritzen, S. (1996) Graphical Models. Oxford Statistical Science Series, 17. Oxford Clarendon Press.
 Mazal et al. (2005) Mazal, P., Stichenwirth, M., Koller, A., Blach, S., Haitel, A. and Susani, M. (2005) Expression of aquaporins and PAX2 compared to CD10 and cytokeratin 7 in renal neoplasms: a tissue microarray study. Modern Pathology, 18, 535–540.
 Olesen and Madsen (2002) Olesen, K. and Madsen, A. L. (2002) Maximal prime subgraph decomposition of Bayesian networks. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 32.
 Osipov et al. (2002) Osipov, V., Keating, J. T., Faul, P. N., Loda, M. and Dattas, M. W. (2002) Expression of p27 and VHL in renal tumors. Applied Immunohistochemistry & Molecular Morphology, 10, 344–350.
 Ravikumar et al. (2009) Ravikumar, P., Wainwright, M. and Lafferty, J. (2009) High-dimensional graphical model selection using regularized logistic regression. To appear in Annals of Statistics.
 Roe et al. (2006) Roe, J., Kim, H., Lee, S., Kim, S., Cho, E. and Youn, H. (2006) p53 stabilization and transactivation by a von Hippel–Lindau protein. Molecular Cell, 22, 395–405.
 Strobl et al. (2007) Strobl, C., Boulesteix, A., Zeileis, A. and Hothorn, T. (2007) Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8, 25.
 Tibshirani (1996) Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. J. Royal Statist. Soc., 58, 267–288.
 Van Buuren and Oudshoorn (2007) Van Buuren, S. and Oudshoorn, C. (2007) mice: Multivariate Imputation by Chained Equations. URL http://web.inter.nl.net/users/S.van.Buuren/mi/hmtl/mice.htm. R package version 1.16.
 Wainwright et al. (2007) Wainwright, M., Ravikumar, P. and Lafferty, J. (2007) High-dimensional graphical model selection using regularized logistic regression. In Advances in Neural Information Processing Systems 19 (eds. B. Schölkopf, J. Platt and T. Hoffman), 1465–1472. Cambridge, MA: MIT Press.
 Wenger et al. (2005) Wenger, R., Stiehl, D. and Camenisch, G. (2005) Integration of oxygen signaling at the consensus HRE. Science Signaling: Signal Transduction Knowledge Environment (STKE), 2005, re12.
 Yuan and Lin (2006) Yuan, M. and Lin, Y. (2006) Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, 68, 49–67.