Consistent Biclustering
Abstract
Biclustering, the process of simultaneously clustering the rows and columns of a data matrix, is a popular and effective tool for finding structure in a high-dimensional dataset. Many biclustering procedures appear to work well in practice, but most do not have associated consistency guarantees. To address this shortcoming, we propose a new biclustering procedure based on profile likelihood. The procedure applies to a broad range of data modalities, including binary, count, and continuous observations. We prove that the procedure recovers the true row and column classes as the dimensions of the data matrix tend to infinity, even if the functional form of the data distribution is misspecified. The procedure requires a combinatorial search, which can be expensive in practice. Rather than performing this search directly, we propose a new heuristic optimization procedure based on the Kernighan–Lin heuristic, which has attractive computational properties and performs well in simulations. We demonstrate our procedure with applications to congressional voting records and microarray analysis.
KEY WORDS: Biclustering; Block Model; Profile Likelihood; Congressional Voting; Microarray Data.
1 Introduction
Suppose we are given a data matrix , and our goal is to cluster the rows and columns of into meaningful groups. For example, can indicate whether or not user has an interest in product , and our goal is to segment users and products into relevant subgroups. Alternatively, could be the log activation level of gene in patient ; our goal is to seek groups of patients with similar genetic profiles, while at the same time finding groups of genes with similar activation levels. The general simultaneous clustering problem is known by many names, including direct clustering (Hartigan, 1972), block modeling (Arabie et al., 1978), biclustering (Mirkin, 1996), and coclustering (Dhillon, 2001).
Empirical results from a broad range of disciplines indicate that biclustering is useful in practice. Ungar and Foster (1998) and Hofmann (1999) found that biclustering helps identify structure in collaborative filtering contexts with heterogeneous users and sparsely observed preferences. Eisen et al. (1998) used microarray data to simultaneously cluster genes and conditions, finding that genes with similar functions often cluster together. Harpaz et al. (2010) applied biclustering methods to a Food and Drug Administration report database, identifying associations between certain active ingredients and adverse medical reactions. Several other applications of biclustering exist (Cheng and Church, 2000; Getz et al., 2000; Lazzeroni and Owen, 2002; Kluger et al., 2003); Madeira and Oliveira (2004) give a comprehensive survey.
Practitioners interested in biclustering have used a variety of different algorithms to achieve their results. Clearly, many of these algorithms work well in practice, but they are often ad hoc, and there are no rigorous guarantees as to their performance. Without such guarantees, practitioners cannot be assured that their discoveries from biclustering will generalize or be reproducible; collecting more data may lead to completely different clusters.
There are two approaches to evaluating the theoretical performance of these procedures. The first is to define a higher-level learning task, and evaluate procedures using a task-dependent measure of generalization performance (Seldin and Tishby, 2009, 2010). We instead take the alternative approach, treating the problem purely as an unsupervised learning task. In this case, the procedure is evaluated based on the identified biclusters, and a reasonable goal is consistent biclustering.
Our first contribution is to formalize biclustering as an estimation problem. We do this by introducing a probabilistic model for the data matrix , where up to permutations of the rows and columns, the expectation of has a block structure. Next, we make distributional assumptions on the elements of and derive a clustering objective function via profile likelihood (Murphy and van der Vaart, 2000). Finally, we show that the maximum profile likelihood estimator performs well even when the distributional assumptions are not satisfied, in the sense that it is still consistent. To our knowledge, this is the first general consistency result for a biclustering algorithm.
Unfortunately, it is computationally intractable to compute the maximum profile likelihood estimator. It is for this reason that Ungar and Foster (1998), who used a similar probabilistic model for the data matrix, dismissed likelihood-based approaches as computationally infeasible. Our second contribution, then, is to propose a new approximation algorithm for finding a local maximizer of the biclustering profile likelihood. Our algorithm is based on the Kernighan–Lin heuristic (Kernighan and Lin, 1970), which was employed by Newman (2006) for clustering network data. This is a greedy optimization procedure, so it is not guaranteed to find a global optimum. To mitigate this, we run the fitting procedure repeatedly with many random initializations; as we increase the number of initializations, the probability that the algorithm finds the global optimum increases. We show that this procedure has low computational complexity and that it performs well in practice.
Our work was inspired by recent developments in clustering methods for symmetric binary networks. In that context, is an by symmetric binary matrix, and the clusters for the rows of are the same as the clusters for the columns of . Bickel and Chen (2009) used methods similar to those used for proving consistency of M-estimators to derive results for network clustering when tends to infinity. This work was later extended by Choi et al. (2012), who allow the number of clusters to increase with ; Zhao et al. (2011), who allow for nodes not belonging to any cluster; and Zhao et al. (2012), who incorporate individual-specific effects. In parallel to this work, Rohe et al. (2011) study the performance of spectral clustering for symmetric binary networks; and Rohe and Yu (2012) study spectral methods for asymmetric binary networks.
In our report we have extended methods originally developed for an extremely specialized context (symmetric binary networks) to handle clustering for arbitrary data matrices. Using standard conditions, we have been able to generalize the Bickel and Chen (2009) results beyond Bernoulli random variables. To our knowledge, this is the first time methodologies for binary networks have been used to study general biclustering methods. Notably, our extensions can handle a variety of data distributions, and they can handle both dense and sparse data matrices.
The main text of the paper is organized as follows. First, Section 2 describes the theoretical setup, and Section 3 presents our main result with a heuristic proof. Then, Section 4 describes the formal theoretical framework and states the rigorous consistency results. Next, Section 5 presents our approximation algorithm. Using this algorithm, Section 6 corroborates the theoretical findings through a simulation study, and Section 7 presents applications to a microarray and a congressional voting dataset. Section 8 presents some concluding remarks. The supplementary appendices include additional proofs, empirical results, and an application to a movie review dataset.
2 Estimation problem and criterion functions
Our first task is to formalize biclustering as an estimation problem. To this end, let be a data matrix. We follow the network clustering literature and posit the existence of row classes and column classes, such that the mean value of entry is determined solely by the classes of row and column . That is, there is an unknown row class membership vector , an unknown column class membership vector , and an unknown mean matrix such that
(2.1) 
We refer to model (2.1) as a block model, after the related model for undirected networks proposed by Holland et al. (1983). When a block model is in force, biclustering the rows and columns of the data matrix is equivalent to estimating and .
Not all block models give rise to well-defined estimation problems. To ensure that and are well-defined, we require that each class has at least one member, and that no two classes have the same mean vector. Formally, define the row class proportion vector with element equal to the proportion of nodes with row class . Also, define the column class proportion vector with element equal to the proportion of nodes with column class . We require that every element of and be nonzero. To ensure that the mean vectors of the row classes are distinct, we require that no two rows of are identical. Similarly, we require that no two columns of are identical.
We estimate the clusters by assigning labels to the rows and columns of , codified in vectors and . Ideally, and match and . Note that we assume the true numbers of row and column clusters, and , are known, or have been correctly estimated by some model selection procedure. We measure the performance of a particular label assignment through the corresponding confusion matrix. Specifically, for row and column label assignments and , define normalized confusion matrices and by
Entry is the proportion of nodes with class and label ; entry is defined similarly. These matrices are normalized so that and are the class proportion vectors, and and are the label proportion vectors. If and are diagonal, then the assigned labels match the true classes. More generally, if and can be made diagonal by permuting their columns, then the partition induced by the labels matches the partition induced by the classes. The goal, then, is to find row and column labellings such that and are permutations of diagonal matrices.
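As a concrete illustration, a normalized confusion matrix of this kind can be computed directly from the class and label vectors; the function below is a hypothetical helper, not the paper's notation:

```python
import numpy as np

def confusion_matrix(true_classes, labels, K):
    """Normalized confusion matrix: entry (a, k) is the proportion of
    nodes with true class a that received label k."""
    n = len(true_classes)
    C = np.zeros((K, K))
    for cls, lab in zip(true_classes, labels):
        C[cls, lab] += 1.0 / n
    return C

# Example: three true classes; the labels permute classes 0 and 1.
true_r = np.array([0, 0, 1, 1, 2, 2])
lab_r = np.array([1, 1, 0, 0, 2, 2])
C = confusion_matrix(true_r, lab_r, 3)
# Row sums recover the class proportions; C becomes diagonal after
# permuting its columns, so the label partition matches the class
# partition exactly.
```

The check that a confusion matrix is a permutation of a diagonal matrix amounts to verifying that each row and each column has exactly one nonzero entry.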
In practice, we cannot estimate and directly, because we do not have knowledge of the true row and column classes. To evaluate the quality of a biclustering, we need a surrogate criterion function. Analogously to Bickel and Chen (2009), we employ profile likelihood for this purpose.
In Bickel and Chen’s setting, the data are binary, so there is a natural data likelihood which arises from the Bernoulli distribution. Our setting is more general, with allowed to be a count or a continuous measurement, so there are many possible choices for the element densities. We proceed by initially assuming that the elements of are sampled from distributions in a single-parameter exponential family. Conditional on and , the elements of are independent, and entry has density with respect to some base measure , where
is the cumulant generating function, and is the natural parameter. Later, we will relax the assumption of the specific distributional form.
With labels and , the complete data log-likelihood is
where and are the estimated class proportions and is the estimated cluster mean. We get the profile log-likelihood by maximizing the log-likelihood over the mean parameter matrix :
where is the convex conjugate of . We refer to as the relative entropy function since is equal to the Kullback–Leibler divergence of the base measure from the distribution in the exponential family with mean (Brown, 1986).
Following the above derivation, a natural criterion for the quality of a labeling is the profile log-likelihood . In the sequel, we consider a far more general setting, with criterion functions of the form
(2.2) 
where is any smooth convex function. Following the derivation above, we refer to as a profile likelihood and to as the corresponding relative entropy function. However, we do not assume that the likelihood has been correctly specified. In particular, as long as the block model (2.1) is in force, the elements of can have arbitrary distributional forms, not necessarily belonging to any exponential family; we explicitly allow for heteroscedasticity and distributional misspecification. We show that under mild technical conditions, the maximizer of is a consistent estimator of the true row and column classes.
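A criterion of the form (2.2) can be sketched as follows. The block-proportion weights and the Gaussian choice of relative entropy used here are illustrative assumptions, since the display itself is not reproduced above:

```python
import numpy as np

def profile_criterion(X, row_labels, col_labels, K, L, G):
    """Evaluate a criterion of the general form (2.2): a weighted sum
    of a convex function G applied to the block means induced by the
    candidate row and column labels."""
    F = 0.0
    for k in range(K):
        for l in range(L):
            rows = row_labels == k
            cols = col_labels == l
            if rows.any() and cols.any():
                block = X[np.ix_(rows, cols)]
                p = rows.mean()   # estimated row-cluster proportion
                q = cols.mean()   # estimated column-cluster proportion
                F += p * q * G(block.mean())
    return F

G_gauss = lambda x: 0.5 * x ** 2  # relative entropy for Gaussian data

# A labeling aligned with the block structure scores higher than a
# labeling that mixes the blocks together.
X = np.array([[5., 5., 0.], [5., 5., 0.], [0., 0., 5.]])
good = profile_criterion(X, np.array([0, 0, 1]), np.array([0, 0, 1]), 2, 2, G_gauss)
bad = profile_criterion(X, np.array([0, 1, 0]), np.array([0, 1, 0]), 2, 2, G_gauss)
```

By the convexity of G, averaging dissimilar blocks together can only lower the criterion, which is the mechanism behind the self-consistency result of Section 3.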
3 Heuristic justification
In Section 2, we defined a formal biclustering estimation problem and we motivated a class of criterion functions for this problem based on profile likelihood. In this section, we investigate the behavior of the criterion functions. In particular, we outline a heuristic argument which shows that the row and column labels found by maximizing these criterion functions are good estimates of the true row and column classes. Formal statements of the results and their proofs are given in Section 4 and Supplement A.
As noted in Section 1, the main thrust of our theoretical results is similar to that of the literature on clustering for symmetric binary networks initiated by Bickel and Chen (2009) and extended by Choi et al. (2012), Zhao et al. (2011), and Zhao et al. (2012). The main point of departure from this previous work is that we work with arbitrary data modalities instead of symmetric binary matrices.
Let be a data matrix drawn from an identifiable block model (2.1) with row and column classes and and mean matrix . Let , and be as defined in Section 2. For any row and column labeling and , let , , and be the corresponding estimates of , , and , and let and be the confusion matrices. Let be a profile likelihood criterion function as in (2.2) with corresponding relative entropy function , assumed to be smooth and strictly convex.
We now outline a series of results which show that the maximizers of are good estimates of the true row and column classes.
Proposition 3.1.
The criterion function is uniformly close to a “population criterion function” which only depends on the confusion matrices.
If and are large, then for any choice of and , the estimated cluster mean will be close to , the average value of over the block defined by labels and . This quantity can be computed in terms of the confusion matrices as
By applying Bernstein’s inequality, one can show that is close to uniformly over all choices of and . Thus, we get the population criterion function by replacing with .
For each nonnegative vector define to be the set of normalized confusion matrices with fixed row sums: The population version of is a function of the row and column confusion matrices, , with
Since is uniformly close to , under mild regularity conditions on , the criterion is uniformly close to . Proposition LABEL:L:popcrit (Supplement A) contains a rigorous statement of this result.
Proposition 3.2.
The population criterion function is selfconsistent.
Self-consistency is an important property for any criterion function: it implies that, in the absence of noise, the criterion function is maximized at the truth (Tarpey and Flury, 1996). In our context, self-consistency means that is maximized when and are permutations of diagonal matrices.
The self-consistency of follows from the strict convexity of :
If has no two identical rows and no two identical columns, then exact equality holds only when and are permutations of diagonal matrices. Thus, is maximized when the row and column class partitions match the label partitions. Proposition LABEL:L:popperturb (Supplement A) gives a refined selfconsistency result with a more complete characterization of the behavior of near its maxima.
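In assumed notation (the display above is not reproduced in this text), the Jensen-type argument runs as follows: write $M$ for the mean matrix, $C$ and $D$ for the confusion matrices with row sums $\pi_a$, $\lambda_b$ (the class proportions) and column sums $c_k$, $d_l$ (the label proportions). Then

$$
F_{\mathrm{pop}}(C, D)
= \sum_{k,l} c_k d_l \, G\!\left( \sum_{a,b} \frac{C_{ak} D_{bl}}{c_k d_l}\, M_{ab} \right)
\le \sum_{k,l} \sum_{a,b} C_{ak} D_{bl} \, G(M_{ab})
= \sum_{a,b} \pi_a \lambda_b \, G(M_{ab}),
$$

with the right-hand side attained when $C$ and $D$ are permutations of diagonal matrices. By strict convexity of $G$, equality forces each block average to coincide with a single $M_{ab}$; when no two rows and no two columns of $M$ are identical, this can happen only for permuted-diagonal $C$ and $D$. This is a sketch in our own notation, and the precise form of the population criterion in the paper may differ.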
Proposition 3.3.
Under enough regularity, the maximizer of the criterion function is close to the true row and column class partition.
This is a direct consequence of Propositions 3.1 and 3.2. The criterion is uniformly close to the population criterion , and is maximized at the true class partitions. Thus, the maximizer of is close to the maximizer of . Importantly, Proposition 3.3 does not require any distributional assumptions on the data matrix beyond its expectation satisfying the block model. In particular, this result can be applied to binary matrices, count data, and continuous data. Theorems 4.1 and 4.2 contain precise statements analogous to Proposition 3.3.
4 Rigorous Theoretical Results
Here we provide formal statements of the main results from Section 3. The proofs of these results are contained in Supplement A.
We work in an asymptotic framework, where the dimensions of the data matrix tend to infinity. Let be a sequence of data matrices indexed by , with and as . We will also suppose that for some finite constant ; this assumption is not essential, but it simplifies the statements of the assumptions and results.
Suppose that for each there exists a row class membership vector and a column class membership vector . We assume that there exist vectors and such that and as almost surely for all and ; this assumption is satisfied, for example, if the class labels are independently drawn from a multinomial distribution. When there is no ambiguity, we omit the subscript .
We define the mean matrix as in Section 3, but allow it to possibly vary with . To model sparsity in , we allow to tend to . To avoid degeneracy, we suppose that there exists a sequence and a fixed matrix such that . Denote by the convex hull of the entries of . Let be a neighborhood of .
To adjust for the sparsity, we redefine the criterion and population criterion functions as
We discuss these modifications and the role of in Section LABEL:S:rho.
We only consider nontrivial partitions; to this end, for , define , the set of nontrivial labellings as
4.1 Assumptions
We require the following regularity conditions:

The biclusters are identifiable: no two rows of the mean matrix are equal, and no two columns are equal.

The relative entropy function is locally strictly convex: when .

The third derivative of the relative entropy function is locally bounded: is bounded when .

The average variance of the elements is of the same order as :

The mean matrix does not converge to zero too quickly:

The elements satisfy a Lindeberg condition: for all ,
Condition (C1) is necessary for the biclusters to be identifiable, while (C2) and (C3) are mild regularity conditions on the entropy function.
Condition (C4) is trivially satisfied for dense data and is satisfied for Binomial and Poisson data so long as . However, this condition cannot handle arbitrary sparsity. For example, if the elements of are distributed as Negative Binomial random variables, then condition (C4) requires that the product of the mean and the dispersion parameter does not tend to infinity. In other words, for sparse count data, the counts cannot be too heterogeneous.
Condition (C5) places a sparsity restriction on the mean matrix. Zhao et al. (2012) used the same assumption to establish weak consistency for network clustering. A variant of Lyapunov's condition (Varadhan, 2001) implies (C6). That is, if
for some , then (C6) follows. In particular, for dense data ( bounded away from zero), uniformly bounded absolute central moments for any is sufficient. For many types of sparse data, including Bernoulli or Poisson data with converging to zero, (C5) is a sufficient condition for (C6).
Theorem 4.1.
Our focus is on the cluster assignments, but, using the methods of the proof of Theorem 4.1, it is possible to show that when the assumptions of Theorem 4.1 are in force, the scaled estimate of the mean converges in probability to the population quantity. (This follows from Theorem 4.1 and Lemma LABEL:L:meanresidconv from Supplement A.)
Under stronger distributional assumptions, we can use the methods of the proof to establish finite-sample results. For example, if we assume that the elements of are Gaussian, then the following result holds.
Theorem 4.2.
The proof of this finite-sample result follows the same outline as the asymptotic result; Appendix LABEL:S:finitesample (Supplement A) gives details.
5 Approximation algorithm
Proposition 3.3 shows that the maximizer of the profile log-likelihood gives a good estimate of the true clusters. Unfortunately, finding this maximizer is computationally intractable: maximizing is a combinatorial optimization problem with an exponentially large state space. To get around this, we settle for finding a local optimum rather than a global one. We present an algorithm for finding a local optimum that, in practice, has good estimation performance.
Our approach is based on the KernighanLin heuristic (Kernighan and Lin, 1970), which Newman (2006) used for a related problem, network community detection. After inputting initial partitions for the rows and columns, we iteratively update the cluster assignments in a greedy manner. The algorithm works as follows:

Initialize the row and column labels and arbitrarily, and compute .

Repeat until convergence:

For each row , determine which of the possible label assignments for this row is optimal, keeping all other row and column labels fixed. Do not perform this assignment, but record the optimal label and the local improvement to that would result if this assignment were to be made.

For each column , determine which of the possible label assignments for this column is optimal, keeping all other row and column labels fixed. As in step 2a, do not perform this assignment, but record the optimal label and the local improvement to that would result if this assignment were to be made.

In order of the local improvements recorded in steps 2a and 2b, sequentially perform the individual cluster reassignments determined in these steps, and record the profile likelihood after each reassignment. Note that these assignments are no longer locally optimal since the labels of many of the rows and columns change during this step. Thus, the profile likelihood could increase or decrease as we move sequentially through the assignments.

Out of the sequence of cluster assignments considered in step 2c, choose the one that has the highest profile likelihood.

Since the criterion function increases at each complete iteration, the algorithm will converge to a local optimum.
There is no guarantee that the local optimum found by the algorithm will be the global optimum of the objective function . To mitigate this deficiency, we run the algorithm repeatedly with many different random initializations for and . Each initialization can give rise to a different local optimum; we choose the cluster assignment with the highest value of among all local optima found by the procedure. As the number of random initializations increases, so does the probability that the global optimum is among them. We found that 100–250 initializations suffice in our simulations and data examples.
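The restart-wrapped local search can be sketched as follows. This is a simplified single-move variant (plain coordinate ascent) rather than the full staged sweep of steps 2a–2d, and the function names and the Gaussian relative entropy are illustrative:

```python
import numpy as np

def criterion(X, r, c, K, L, G):
    """Profile-likelihood criterion: weighted sum of G over block means."""
    F = 0.0
    for k in range(K):
        for l in range(L):
            rows, cols = r == k, c == l
            if rows.any() and cols.any():
                F += rows.mean() * cols.mean() * G(X[np.ix_(rows, cols)].mean())
    return F

def greedy_bicluster(X, K, L, G, n_restarts=20, seed=0):
    """Simplified local search: repeatedly sweep over all rows and
    columns, accepting any single relabeling that improves the
    criterion, until no move helps; restart from several random
    initializations and keep the best labeling found."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    best_F, best = -np.inf, None
    for _ in range(n_restarts):
        r = rng.integers(0, K, size=n)
        c = rng.integers(0, L, size=m)
        F = criterion(X, r, c, K, L, G)
        improved = True
        while improved:
            improved = False
            for i in range(n):                      # try relabeling each row
                old = r[i]
                for k in range(K):
                    r[i] = k
                    Fk = criterion(X, r, c, K, L, G)
                    if Fk > F:
                        F, old, improved = Fk, k, True
                r[i] = old
            for j in range(m):                      # try relabeling each column
                old = c[j]
                for l in range(L):
                    c[j] = l
                    Fl = criterion(X, r, c, K, L, G)
                    if Fl > F:
                        F, old, improved = Fl, l, True
                c[j] = old
        if F > best_F:
            best_F, best = F, (r.copy(), c.copy())
    return best

# Recover planted 2x2 blocks in a small noiseless matrix.
X = np.kron(np.array([[4., 0.], [0., 4.]]), np.ones((4, 4)))
r, c = greedy_bicluster(X, 2, 2, lambda x: 0.5 * x ** 2)
```

The full Kernighan–Lin variant described in steps 2a–2d instead records a whole sweep of tentative moves and keeps the best prefix, which allows it to escape some local optima that pure coordinate ascent cannot.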
The main computational bottleneck is updating the value of as we update the labels and . We can do this efficiently by storing and incrementally updating the cluster proportions and , the within-cluster row and column sums
and the block sums Given the values of these quantities, we can compute the criterion with operations.
If we reassign the label of row from to , then it is straightforward to update with operations. The values of the within-cluster row sums remain unchanged. The new values of the block sums are and for ; the other block sums are unchanged. The expensive part of the update is recomputing the within-cluster column sums for row labels and and each column . These computations require operations if is dense, and operations if is sparse with at most nonzero elements and .
Overall, we must perform operations to reassign the label of row . Reassigning the label of column has the same computational cost. Thus, one loop iteration in step 2 requires operations; for dense data, one iteration requires operations. We do not have an upper bound on the number of iterations until the algorithm converges, but in our experiments 25 to 30 iterations sufficed. These iteration counts may seem small, but each iteration performs row label assignments and column label assignments. The convergence here is not a result of early stopping: after 25–30 iterations, no further local improvement was possible.
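The incremental bookkeeping described above can be illustrated with a toy example. The cached arrays and the helper name are our own: we store, for each row, its sum within each column cluster, so that reassigning a row touches only the affected block sums:

```python
import numpy as np

def move_row(block, row_in_col_cluster, i, k_old, k_new):
    """Move row i from row-cluster k_old to k_new, updating the block
    sums in O(L) time instead of recomputing them from scratch."""
    block[k_old] -= row_in_col_cluster[i]
    block[k_new] += row_in_col_cluster[i]

X = np.arange(12.0).reshape(3, 4)
row_lab = np.array([0, 0, 1])
col_lab = np.array([0, 1, 0, 1])

# Cached quantities: sum of each row within each column cluster, and
# the block sums themselves.
row_in_col = np.stack([X[:, col_lab == l].sum(axis=1) for l in (0, 1)], axis=1)
block = np.zeros((2, 2))
for k in (0, 1):
    for l in (0, 1):
        block[k, l] = X[np.ix_(row_lab == k, col_lab == l)].sum()

move_row(block, row_in_col, 0, 0, 1)   # reassign row 0 to row-cluster 1
row_lab[0] = 1
```

After the move, `block` agrees with a from-scratch recomputation under the new labels, but the update cost is independent of the matrix dimensions.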
For comparison, a spectral biclustering algorithm requires the top singular vectors of the data matrix, which can be computed in roughly operations using Lanczos iteration or another indirect method (Golub and Van Loan, 1996).
6 Empirical evaluation
Here we evaluate the performance of the profile likelihood based biclustering algorithm from Section 5. We simulate data from a variety of regimes, including sparse binary data and dense heavy-tailed continuous measurements. In these settings, we employ the following three relative entropy functions:
(6.1a)  
(6.1b)  
(6.1c) 
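Since displays (6.1a)–(6.1c) are not reproduced here, the following are the standard relative entropy functions (convex conjugates of the cumulant functions) for the Bernoulli, Poisson, and Gaussian families, which we take to be the intended choices:

```python
import numpy as np

# Standard relative entropy functions for three exponential families.
# Small clipping constants guard against log(0) at the boundary.
def G_bernoulli(x):
    x = np.clip(x, 1e-12, 1 - 1e-12)
    return x * np.log(x) + (1 - x) * np.log1p(-x)

def G_poisson(x):
    x = np.maximum(x, 1e-12)
    return x * np.log(x) - x

def G_gaussian(x):
    return 0.5 * x ** 2

# All three are convex, as the theory in Sections 2-4 requires.
```

Any other smooth, strictly convex function could be substituted; the consistency results of Section 4 do not require the choice of G to match the data distribution.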
We evaluate performance both when the profile likelihood is correctly specified and when the relative entropy function does not match the data distribution.
In our simulations, we report the proportion of misclassified rows and columns for the profile-likelihood-based method (PL). We initialize partitions randomly, and then run the improvement algorithm from Section 5 until convergence. We use multiple random starting values to minimize the possibility of finding a non-optimal stationary point.
We compare our method to two other biclustering algorithms. The first is DI-SIM, a spectral biclustering algorithm motivated by a block model similar to ours (Rohe and Yu, 2012); it computes the singular value decomposition of the data matrix , and then applies k-means clustering to the top left and right singular vector loadings. The second algorithm, KM, ignores the interactions between the clusters and applies k-means separately to the rows and columns of .
In our first simulation, we generate sparse Poisson count data from a block model with row clusters and column clusters. We vary the number of columns, , between 200 and 1400, and we take the number of rows to be for . To assign the true row and column classes and , we sample independent multinomials with probabilities and . We choose the matrix of block parameters to be
where is chosen between 5 and 20; the entries of the matrix were chosen randomly, uniformly on the interval . We chose the scaling so that the data matrix would be sparse, with elements in each row. We generate the data conditional on the row and column classes as . We run all three methods with random starting values.
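A generator for data of this kind might look as follows. The block mean matrix, cluster proportions, and scaling below are illustrative stand-ins for the paper's parameter values, which are not reproduced in the text:

```python
import numpy as np

def simulate_poisson_blockmodel(n, m, B, row_probs, col_probs, seed=0):
    """Draw an n x m Poisson count matrix whose expectation has the
    block structure of model (2.1)."""
    rng = np.random.default_rng(seed)
    y = rng.choice(len(row_probs), size=n, p=row_probs)   # true row classes
    z = rng.choice(len(col_probs), size=m, p=col_probs)   # true column classes
    mean = B[np.ix_(y, z)]                                # block-structured E[X]
    X = rng.poisson(mean)
    return X, y, z

n, m = 300, 200
# Scale the block means by 1/m so that each row of X contains only a
# handful of counts, i.e. the matrix is sparse.
B = (5.0 / m) * np.array([[3.0, 1.0, 1.0],
                          [1.0, 3.0, 1.0],
                          [1.0, 1.0, 3.0]])
X, y, z = simulate_poisson_blockmodel(n, m, B, [0.3, 0.3, 0.4], [0.2, 0.4, 0.4])
```

Scaling the block means with the matrix dimension is what makes the sparse regime of condition (C5) relevant: the expected number of counts per row stays bounded as the matrix grows.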
Figure 1 presents the average bicluster misclassification rates for each sample size, and Table 1 reports the standard deviations. In all of the scenarios considered, biclustering based on the profile likelihood criterion performs at least as well as the other methods, even when the relative entropy function is misspecified (using PL-Norm instead of PL-Pois). Moreover, the standard deviations show that for the PL methods, the misclassification rate appears to converge to zero as we increase .
        PL-Pois                  PL-Norm                  KM                       DI-SIM
n       b=5     b=10    b=20     b=5     b=10    b=20     b=5     b=10    b=20     b=5     b=10    b=20
200     0.0356  0.0147  0.0047   0.0878  0.0239  0.0070   0.0957  0.0438  0.0097   0.0403  0.0286  0.0114
400     0.0182  0.0064  0.0013   0.0313  0.0091  0.0025   0.0478  0.0145  0.0033   0.0279  0.0144  0.0043
600     0.0103  0.0034  0.0004   0.0153  0.0045  0.0011   0.0253  0.0068  0.0013   0.0186  0.0083  0.0026
800     0.0076  0.0024  0.0003   0.0104  0.0039  0.0006   0.0188  0.0050  0.0009   0.0140  0.0069  0.0013
1000    0.0054  0.0018  0.0003   0.0076  0.0029  0.0003   0.0122  0.0033  0.0004   0.0111  0.0040  0.0008
1200    0.0041  0.0010  0.0001   0.0061  0.0017  0.0003   0.0092  0.0021  0.0003   0.0092  0.0034  0.0007
1400    0.0036  0.0009  0.0001   0.0047  0.0014  0.0001   0.0077  0.0016  0.0001   0.0071  0.0023  0.0005

        PL-Pois                  PL-Norm                  KM                       DI-SIM
n       b=5     b=10    b=20     b=5     b=10    b=20     b=5     b=10    b=20     b=5     b=10    b=20
200     0.0160  0.0039  0.0009   0.0415  0.0069  0.0010   0.0663  0.0119  0.0019   0.0339  0.0128  0.0030
400     0.0071  0.0017  0.0000   0.0101  0.0021  0.0004   0.0183  0.0034  0.0004   0.0152  0.0049  0.0009
600     0.0036  0.0005  0.0000   0.0051  0.0008  0.0000   0.0100  0.0014  0.0000   0.0088  0.0024  0.0002
800     0.0021  0.0003  0.0000   0.0035  0.0005  0.0000   0.0056  0.0009  0.0000   0.0064  0.0014  0.0001
1000    0.0017  0.0002  0.0000   0.0026  0.0004  0.0000   0.0043  0.0006  0.0000   0.0050  0.0009  0.0000
1200    0.0012  0.0001  0.0000   0.0020  0.0002  0.0000   0.0030  0.0004  0.0000   0.0036  0.0008  0.0000
1400    0.0007  0.0001  0.0000   0.0012  0.0001  0.0000   0.0018  0.0002  0.0000   0.0025  0.0005  0.0000

        PL-Pois                  PL-Norm                  KM                       DI-SIM
n       b=5     b=10    b=20     b=5     b=10    b=20     b=5     b=10    b=20     b=5     b=10    b=20
200     0.0059  0.0010  0.0000   0.0118  0.0018  0.0000   0.0204  0.0027  0.0000   0.0141  0.0039  0.0000
400     0.0022  0.0001  0.0000   0.0033  0.0003  0.0000   0.0071  0.0005  0.0000   0.0056  0.0010  0.0000
600     0.0010  0.0000  0.0000   0.0014  0.0001  0.0000   0.0024  0.0002  0.0000   0.0025  0.0004  0.0000
800     0.0004  0.0000  0.0000   0.0008  0.0000  0.0000   0.0015  0.0000  0.0000   0.0017  0.0001  0.0000
1000    0.0002  0.0000  0.0000   0.0005  0.0000  0.0000   0.0009  0.0000  0.0000   0.0013  0.0000  0.0000
1200    0.0002  0.0000  0.0000   0.0003  0.0001  0.0000   0.0006  0.0001  0.0000   0.0009  0.0001  0.0000
1400    0.0000  0.0000  0.0000   0.0002  0.0000  0.0000   0.0005  0.0000  0.0000   0.0006  0.0000  0.0000
Supplement B describes in detail the simulations for Bernoulli, Gaussian, and heavy-tailed data. The results are similar to the Poisson case: our method performs at least as well as the other procedures in all cases and shows signs of convergence.
Overall, the simulations confirm the conclusions of Proposition 3.3, and they show that our approximate maximization algorithm performs well. These results give us confidence that profile likelihood based biclustering can be used in practice.
7 Applications
In this section we use profile-likelihood-based biclustering to reveal structure in two high-dimensional datasets. For each example, we maximize the profile log-likelihood using the algorithm described in Section 5. An additional application example is provided in Supplement B.
7.1 GovTrack
In our first application of the proposed method, we cluster legislators and motions based on the roll-call votes from the 113th United States House of Representatives (years 2013–2014). We validate our method by showing that the clusters found by the method agree with the political parties of the legislators.
After downloading the roll-call votes from govtrack.org, we form a data matrix with rows corresponding to the 444 legislators who voted in the House of Representatives, and columns corresponding to the 545 motions voted upon. Although the House of Representatives has only 435 seats, 9 legislators resigned or died mid-session and were replaced. We code the non-missing votes as
Not all legislators vote on all motions, and 7% of the data matrix entries are missing, coded as NA. If we assume that the missing data mechanism is ignorable, then it is straightforward to extend our fitting procedure to handle incomplete data matrices. Specifically, to handle the missing data, we replace sums over all matrix entries with sums over all nonmissing matrix entries.
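Concretely, the modification amounts to replacing each block sum and block count with their non-missing counterparts. A minimal sketch, with NaN encoding missing votes and a hypothetical helper name:

```python
import numpy as np

def block_mean_missing(X, rows, cols):
    """Block mean over the non-missing (non-NaN) entries only, as
    required under an ignorable missing-data mechanism."""
    block = X[np.ix_(rows, cols)]
    observed = ~np.isnan(block)
    if not observed.any():
        return np.nan          # no observed entries in this block
    return block[observed].sum() / observed.sum()

votes = np.array([[1., 0., np.nan],
                  [1., np.nan, 0.],
                  [0., 1., 1.]])
m = block_mean_missing(votes, np.array([True, True, False]),
                       np.array([True, True, True]))
```

The rest of the fitting procedure is unchanged: the criterion is evaluated on these observed-entry block means, with the block counts likewise replaced by observed-entry counts.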
To choose the number of row clusters, , and the number of column clusters, , we fit all candidate models with and each ranging between and . Figure 2 shows the deviance (twice the negative log-likelihood) plotted as a function of and . The left “scree” plot shows that for most values of , increasing from to has a large effect on the deviance, but increasing it further has only a minor effect. Similarly, the right plot shows that increasing from to causes a large drop in the deviance, but increasing it further has only a minor effect. Together, these plots suggest that we should pick and .
To guard against the possibility of the biclustering algorithm finding a local rather than global optimum, we used random initializations, and chose the row and column cluster assignments with the highest log-likelihood. To check the robustness of this assignment, we increased the number of random initializations up to . Even with times as many restarts, we still found the same optimal log-likelihood.
Figure 3 shows a heatmap constructed after biclustering the data into row clusters and column clusters. The two row clusters found by the algorithm completely recover the political parties of the legislators (every legislator in row cluster is a Democrat, and every legislator in row cluster is a Republican). The fact that we were able to recover the political parties gives us some confidence that the algorithm can find meaningful clusters in practice. The column clusters reveal four types of motions: (1) motions with strong support from both parties; (2) motions with moderate support from both parties; (3) motions with strong Democratic support; and (4) motions with strong Republican support.
We compared the clusters found by our method with those found by the two competing methods, DISIM and KM. The competing methods do not directly handle missing data, so for these methods we coded "Yea" as , "Nay" as , and "Missing" as . DISIM returned the same row clusters, but classified 50 columns (9%) differently. KM placed 10 Republicans into the majority-Democrat cluster; it classified 45 motions (8%) differently from profile likelihood. The fact that all three methods give similar answers gives us confidence that the column clusters are meaningful.
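Comparing clusterings requires accounting for the fact that cluster indices are arbitrary labels. A minimal sketch of one way to compute such an agreement score (brute force over relabelings, so practical only for a small number of clusters; the function name is ours):

```python
import numpy as np
from itertools import permutations

def cluster_agreement(labels_a, labels_b):
    """Fraction of items assigned consistently by two clusterings,
    maximized over relabelings of the second clustering."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    ids = np.unique(labels_b)
    best = 0.0
    for perm in permutations(ids):
        # Relabel the second clustering and count matches.
        mapping = dict(zip(ids, perm))
        remapped = np.array([mapping[b] for b in labels_b])
        best = max(best, float(np.mean(labels_a == remapped)))
    return best
```

The percentages of differently classified rows and columns quoted above correspond to one minus such an agreement score.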
7.2 AGEMAP
Biclustering is commonly used with microarray data to visualize the activation patterns of thousands of genes simultaneously, and to identify patterns and distinguishing properties among genes and individuals. We use the AGEMAP dataset (Zahn et al., 2007) to illustrate this process.
AGEMAP is a large microarray data set containing the log expression levels for 40 mice across 8,932 genes measured on 16 different tissue types. For this analysis, we restrict attention to two tissue types: cerebellum and cerebrum. The 40 mice in the dataset belong to four age groups, with five males and five females in each group. One of the mice is missing data for the cerebrum tissue, so it has been removed from the dataset.
Our goal is to uncover structure in the gene expression matrix. We bicluster the 39 × 17,864 residual matrix computed from the least squares solution to the multiple linear regression model
$$y_{ij} = \beta_j^\top (1, a_i, s_i)^\top + \varepsilon_{ij},$$
where $y_{ij}$ is the log-activation of gene $j$ in mouse $i$, $a_i$ is the age of mouse $i$, $s_i$ indicates if mouse $i$ is male, $\varepsilon_{ij}$ is a random error, and $\beta_j$ is a gene-specific coefficient vector. Here, we bicluster the residual matrix rather than the raw gene expression matrix because we are interested in the structure remaining after adjusting for the observed covariates.
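The residual matrix can be computed gene-by-gene with ordinary least squares. The sketch below is illustrative (variable names are ours, not the paper's); it regresses every gene's expression on an intercept, age, and a sex indicator in one `lstsq` call:

```python
import numpy as np

def expression_residuals(Y, age, male):
    """Residuals of each gene's log expression after regressing on an
    intercept, age, and a sex indicator.

    Y    : (n_mice, n_genes) log expression matrix
    age  : (n_mice,) ages
    male : (n_mice,) 0/1 indicator of sex
    """
    # Design matrix shared by all genes; one coefficient column per gene.
    Z = np.column_stack([np.ones_like(age, dtype=float), age, male])
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return Y - Z @ B
```

The returned matrix is what gets biclustered in place of the raw expression levels.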
The entries of the residual matrix are not independent (for example, the sum of each column is zero). Also, the responses of many genes are likely correlated with each other. Thus, the block model required by Theorem 4.1 is not in force, so its conclusion will not hold unless the dependence between the residuals is negligible. In light of this caveat, the example should be considered exploratory data analysis.
We perform biclustering using profile likelihood based on the Gaussian criterion (6.1c) with 100 random initializations. To determine an appropriate number of mouse clusters and gene clusters, we experiment with values of each between and . Figure 4 presents the scree plots. The left plot shows that increasing the number of mouse clusters beyond has a relatively small impact on the deviance, and similarly, the right plot shows that increasing the number of gene clusters beyond has a relatively minor effect. This suggests we should set and . For this choice, we experimented with using up to 1000 random starting values, but found no change in the resulting log-likelihood.
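A minimal sketch of one plausible form of the Gaussian criterion, assuming a block-constant mean model with a single common variance profiled out (the exact expression in (6.1c) may differ in constants and weighting; the function name is ours):

```python
import numpy as np

def gaussian_profile_deviance(X, rows, cols, K, L):
    """Deviance (twice the negative profiled Gaussian log-likelihood,
    up to an additive constant) for a block-constant mean model with a
    common variance: each block is fit by its mean, the variance is
    profiled out, and the deviance is N * log(RSS / N)."""
    resid2 = 0.0
    N = X.size
    for k in range(K):
        for l in range(L):
            block = X[np.ix_(rows == k, cols == l)]
            if block.size:
                resid2 += np.sum((block - block.mean()) ** 2)
    return N * np.log(resid2 / N)
```

Lower deviance indicates a better-fitting bicluster assignment, which is what the scree plots in Figure 4 trace out as the number of clusters grows.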
The heatmap presented in Figure 5 shows the results. The expression levels for gene groups 1 and 3 appear to be fairly neutral across the three mouse groups, but the other three gene groups show a more visually apparent pattern. It appears that a mouse can have high expression levels for at most two of gene groups 2, 4, and 5: mouse group 1 has high expression for gene groups 2 and 4; mouse group 2 has high expression for gene group 4; and mouse group 3 has high expression for gene group 5.
We computed the bicluster assignments found by DISIM and KM. The methods failed to converge after 1000 iterations, but the resulting cluster assignments found by KM and the profile likelihood method generally agreed, with 89.6% cluster agreement. DISIM agreed less with the other two methods: the cluster agreement between DISIM and the profile likelihood method was 62.5%, and the cluster agreement between DISIM and KM was 67.8%.
The three clusters of mice found by the profile-likelihood method also agree with those found by Perry and Owen (2010). That analysis identified the mouse clusters, but could not attribute meaning to them. The bicluster-based analysis has deepened our understanding of the three mouse clusters while suggesting some interesting interactions between the genes.
8 Discussion
We have developed a statistical setting for studying the performance of biclustering algorithms. Under the assumption that the data follow a stochastic block model, we derived sufficient conditions for an algorithm that maximizes a profile-likelihood-based criterion function to be consistent. This is the first theoretical guarantee for a biclustering algorithm that applies to a broad range of data distributions and can handle both sparse and dense data matrices.
Since maximizing the criterion function exactly is computationally infeasible, we have proposed an approximate algorithm for obtaining a local optimum rather than a global one. We have shown through simulations that the approximation algorithm has good performance in practice. Our empirical comparisons demonstrated that the method performs well in a variety of situations and can outperform existing procedures.
Applying the profile-likelihood-based biclustering algorithm to real data revealed several interesting findings. Our results from the GovTrack dataset demonstrated our method's ability to recover ground-truth labels when available, and identified motion clusters that were robust across different methods. Biclustering the genes and mice in the AGEMAP data exposed an interesting pattern in the expression of certain genes: at most two gene groups can have high expression levels for any one mouse. The consistency theorem proved in this report gives conditions under which we can have confidence in the robustness of these findings.
Supplementary Materials
Supplement A Additional Proofs and Theoretical Results
(supptheory.pdf). Rigorous proofs of all theoretical results.
Supplement B Additional Empirical and Application Results
(suppempirical.pdf). Additional empirical results for Bernoulli, Gaussian, and Student's t distributed data, as well as an additional application example.
References
Arabie, P., Boorman, S. A., and Levitt, P. R. (1978). Constructing blockmodels: How and why. J. Math. Psych., 17(1):21–63.
Bickel, P. J. and Chen, A. (2009). A nonparametric view of network models and Newman-Girvan and other modularities. Proc. Nat. Acad. Sci. USA, 106:21068–21073.
Brown, L. D. (1986). Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory, volume 9 of Lecture Notes – Monograph Series. Institute of Mathematical Statistics, Hayward, CA.
Cheng, Y. and Church, G. M. (2000). Biclustering of expression data. Proc. Int. Conf. Intell. Syst. Mol. Biol. (ISMB), 8:93–103.
Choi, D., Wolfe, P. J., and Airoldi, E. M. (2012). Stochastic blockmodels with growing number of classes. Biometrika, 99:273–284.
Dhillon, I. S. (2001). Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 26–29, San Francisco.
Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Nat. Acad. Sci. USA, 95(25):14863–14868.
Getz, G., Levine, E., and Domany, E. (2000). Coupled two-way clustering analysis of gene microarray data. Proc. Nat. Acad. Sci. USA, 97:12079–12084.
Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations. Johns Hopkins University Press.
Harpaz, R., Perez, H., Chase, H. S., Rabadan, R., Hripcsak, G., and Friedman, C. (2010). Biclustering of adverse drug events in the FDA's spontaneous reporting system. Clinical Pharmacology & Therapeutics, 89(2):243–250.
Hartigan, J. A. (1972). Direct clustering of a data matrix. J. Amer. Statist. Assoc., 67(337):123–129.
Hofmann, T. (1999). Latent class models for collaborative filtering. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pages 688–693.
Holland, P. W., Laskey, K. B., and Leinhardt, S. (1983). Stochastic blockmodels: First steps. Social Networks, 5:109–137.
Kernighan, B. W. and Lin, S. (1970). An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 49(1):291–307.
Kluger, Y., Basri, R., Chang, J. T., and Gerstein, M. (2003). Spectral biclustering of microarray data: Coclustering genes and conditions. Genome Research, 13:703–716.
Lazzeroni, L. and Owen, A. (2002). Plaid models for gene expression data. Statist. Sinica, 12:61–86.
Madeira, S. C. and Oliveira, A. L. (2004). Biclustering algorithms for biological data analysis: A survey. IEEE T. Comput. Bi., 1:24–45.
Mirkin, B. (1996). Mathematical Classification and Clustering. Kluwer Academic Press.
Murphy, S. A. and van der Vaart, A. W. (2000). On profile likelihood. J. Amer. Statist. Assoc., 95(450):449–465.
Newman, M. E. J. (2006). Modularity and community structure in networks. Proc. Nat. Acad. Sci. USA, 103(23):8577–8582.
Perry, P. O. and Owen, A. B. (2010). A rotation test to verify latent structure. J. Mach. Learn. Res., 11:603–624.
Rohe, K., Chatterjee, S., and Yu, B. (2011). Spectral clustering and the high-dimensional stochastic blockmodel. Ann. Statist., 39(4):1878–1915.
Rohe, K. and Yu, B. (2012). Co-clustering for directed graphs: the stochastic co-blockmodel and a spectral algorithm. Preprint arXiv:1204.2296.
Seldin, Y. and Tishby, N. (2009). PAC-Bayesian generalization bound for density estimation with application to co-clustering. In 12th International Conference on Artificial Intelligence and Statistics (AISTATS).
Seldin, Y. and Tishby, N. (2010). PAC-Bayesian analysis of co-clustering and beyond. Journal of Machine Learning Research, 11:3595–3646.
Tarpey, T. and Flury, B. (1996). Self-consistency: A fundamental concept in statistics. Statist. Sci., 11(3):229–243.
Ungar, L. and Foster, D. P. (1998). A formal statistical approach to collaborative filtering. In CONALD'98.
Varadhan, S. R. S. (2001). Probability Theory (Courant Lecture Notes). American Mathematical Society.
Zahn, J. M., Poosala, S., Owen, A. B., Ingram, D. K., Lustig, A., Carter, A., Weeraratna, A. T., Taub, D. D., Gorospe, M., Mazan-Mamczarz, K., Lakatta, E. G., Boheler, K. R., Xu, X., Mattson, M. P., Falco, G., Ko, M. S. H., Schlessinger, D., Firman, J., Kummerfeld, S. K., Wood, W. H., Zonderman, A. B., Kim, S. K., and Becker, K. G. (2007). AGEMAP: A gene expression database for aging in mice. PLOS Genetics.
Zhao, Y., Levina, E., and Zhu, J. (2011). Community extraction for social networks. Proc. Nat. Acad. Sci. USA, 108:7321–7326.
Zhao, Y., Levina, E., and Zhu, J. (2012). Consistency of community detection in networks under degree-corrected stochastic block models. Ann. Statist., 40(4):2266–2292.