Goodness-of-fit Test for Latent Block Models

Abstract

Latent block models are used for probabilistic biclustering, which has been shown to be an effective method for analyzing various relational data sets. However, there has been no statistical test method for determining the row and column cluster numbers of latent block models. Recent studies have constructed statistical-test-based methods for stochastic block models, which assume that the observed matrix is a square symmetric matrix and that the cluster assignments are the same for rows and columns. In this study, we developed a new goodness-of-fit test for latent block models to test whether an observed data matrix fits a given set of row and column cluster numbers, or whether it consists of more clusters in at least one direction of the row and the column. To construct the test method, we used a result from random matrix theory on the largest eigenvalue of a sample covariance matrix. We experimentally demonstrated the effectiveness of the proposed method by showing the asymptotic behavior of the test statistic and measuring the test accuracy.

1 Introduction

Block modeling [14, 2] is known to be effective in representing various relational data sets, such as the data sets of movie ratings [40], customer-product transactions [40], congressional voting [23], document-word relationships [10], and gene expressions [35]. Latent block models or LBMs [13] are used for probabilistic biclustering of such relational data matrices, where rows and columns represent different objects. For instance, suppose that a matrix $A$ represents the relationship between users and movies, where entry $A_{ij}$ is the rating of the $j$-th movie by the $i$-th user. In LBMs, we assume a regular-grid block structure behind the observed matrix $A$; i.e., both rows (users) and columns (movies) of matrix $A$ are simultaneously decomposed into latent clusters. A block is defined as a combination of a row cluster and a column cluster, and the entries of the same block in matrix $A$ are assumed to be i.i.d. random variables.

An open problem in using LBMs is that there has been no statistical criterion for determining the numbers of row and column clusters. Recently, statistical-test-based approaches [4, 25, 16] have been proposed for estimating the cluster number of stochastic block models (SBMs) [15]. SBMs are similar to LBMs in the sense that they assume a block structure behind an observed matrix; however, they rely on assumptions that differ from those of LBMs, namely that the observed matrix is a square symmetric matrix and that the cluster assignments are the same for rows and columns [29]. In regard to the LBM setting, no statistical method has been constructed to determine the row and column cluster numbers.

Aside from the test-based methods, several model selection approaches have been proposed based on cross-validation [6] or an information criterion [22, 33, 23]. However, these approaches have several limitations. (1) First, they cannot provide knowledge about the reliability of the result beyond the finally estimated cluster numbers. Rather than minimizing the generalization error, in some cases it is more appropriate to provide a probabilistic guarantee of reliability for the purpose of knowledge discovery. (2) Second, both the cross-validation-based and information-criterion-based methods depend on the clustering algorithm used. For instance, we can employ the Bayesian information criterion (BIC) for estimating the marginal likelihood only if the Fisher information matrix of the model is regular, which is not the case for block models. Constructing an information criterion that estimates the expectation of the generalization error for a wider class of models is generally difficult. (3) Finally, the above methods require relatively large computational complexity. Computation of an information criterion requires the process of approximating the posterior distribution by the Markov chain Monte Carlo (MCMC) method, and cross-validation requires the iterative calculation of the test error with different sets of partitions of the training and test data sets.

In this study, we propose a new statistical test method for LBMs. To construct a hypothesis test with a theoretical guarantee, we used a result from random matrix theory. Recent studies on random matrix theory have revealed the asymptotic behavior of the singular values of an $n \times p$ random matrix [12, 41, 48, 3, 18, 19, 42, 32, 34]. Here, we assume that each entry of matrix $Z$, which is given by $Z_{ij} = (A_{ij} - B_{g_i h_j}) / S_{g_i h_j}$ (computed from the original matrix $A$, its block-wise mean $B$, and its block-wise standard deviation $S$), follows a distribution with a sub-exponential decay. From the result in [34], the normalized maximum eigenvalue of $Z^{\top} Z$ converges in law to the Tracy-Widom distribution with index $1$, under the above sub-exponential condition. Based on this result, we constructed a goodness-of-fit test for a given set of row and column cluster numbers of an LBM, using the maximum singular value of matrix $\hat{Z}$, which is an estimator of the matrix $Z$. We proved that under the null hypothesis (i.e., the observed matrix consists of a given set of row and column cluster numbers), the proposed test statistic $T$ converges in law to the Tracy-Widom distribution with index $1$ (Theorem 4.1). We also showed that under the alternative hypothesis, the test statistic $T$ increases in proportion to $N^{5/3}$ with a high probability, where $N$ is a number proportional to the matrix size (Theorems 4.2 and 4.3).

The proposed method overcomes the limitations of the other model selection approaches. (1) Our statistical test method enables us to obtain knowledge about the reliability of the test results. When testing a given set of row and column cluster numbers, we can explicitly set the probability of a Type I error (or false positive) as a significance level $\alpha$. (2) Unlike the other model selection methods, the proposed method does not depend on the clustering algorithm, as long as the algorithm satisfies the consistency condition (Section 2). It only uses the output of a clustering algorithm to test a given set of cluster numbers; there is no need to modify the test method according to the clustering algorithm. (3) The proposed test method requires relatively small computational complexity. It does not require the MCMC procedure or partitioning into training and test data sets. For these reasons, the proposed test-based method can be widely used for the purpose of knowledge discovery.

The remaining sections consist of a detailed explanation of the proposed test method for LBMs. In Section 2, we describe the proposed goodness-of-fit test and its theoretical guarantee, along with the assumptions required for the problem setting. Next, we briefly review the related works and their differences from the proposed method in Section 3. The main results are presented in Section 4, where we prove the asymptotic properties of the proposed test statistic. In Section 5, we experimentally demonstrate the effectiveness of the proposed test method by showing the asymptotic behavior of the test statistic and calculating the test accuracy. We discuss the results and limitations of the proposed method in Section 6 and conclude the paper in Section 7.

2 Problem setting and statistical model for goodness-of-fit test for latent block models

Let $A \in \mathbb{R}^{n \times p}$ be an observed matrix. We assume that each entry of matrix $A$ is independently generated, given its row and column clusters. Let $g_i$ be the cluster of the $i$-th row, and $h_j$ be the cluster of the $j$-th column of matrix $A$. To apply the result in [34], we assume that each entry of matrix $A$ is independently subject to a distribution with block-wise mean $B_{g_i h_j}$ and block-wise standard deviation $S_{g_i h_j}$:

$\mathbb{E}[A_{ij}] = B_{g_i h_j}, \qquad \sqrt{\mathbb{V}[A_{ij}]} = S_{g_i h_j}, \quad i = 1, \dots, n, \ j = 1, \dots, p,$   (1)

where $B_{kl}$ and $S_{kl}$, respectively, are the mean and the positive standard deviation of the entries in the $(k, l)$-th true block under the null hypothesis.

Let $(K_0, L_0)$ be the true set of cluster numbers for the rows and columns of an observed matrix $A$, which is unknown in advance. In this paper, we propose a goodness-of-fit test for selecting these cluster numbers from the observed matrix $A$. In such a test, we test whether a given set of cluster numbers $(K, L)$ is equal to $(K_0, L_0)$, or whether at least one of the given row and column cluster numbers $K$ and $L$ is smaller than the corresponding true cluster number $K_0$ or $L_0$. In other words, the null (N) and alternative (A) hypotheses are given by

$H_{\mathrm{N}}: K = K_0 \ \text{and} \ L = L_0, \qquad H_{\mathrm{A}}: K < K_0 \ \text{or} \ L < L_0.$   (2)

By sequentially testing the cluster numbers in the following order (Figure 1), we can select the cluster numbers of a given observed matrix $A$.

  1. Test $(K, L) = (1, 1)$.

  2. Test $(K, L) = (1, 2)$.

  3. Test $(K, L) = (2, 1)$.

  4. Test $(K, L) = (2, 2)$, and continue in the same order. Let $(K, L)$ be the row and column cluster numbers for which the null hypothesis is first accepted. The selected set of cluster numbers is $(\hat{K}_0, \hat{L}_0) = (K, L)$.

Based on the above sequentially ordered test, selection of the cluster numbers requires at most $\max(K_0, L_0)^2$ tests (a minimal sketch of this procedure is given in the code below).

Figure 1: The sequential order for testing row and column cluster numbers. For example, let the blue entry be the true cluster numbers $(K_0, L_0)$. Based on this sequentially ordered test, the given cluster numbers are always unrealizable (that is, at least one of $K < K_0$ and $L < L_0$ holds) until the test reaches $(K_0, L_0)$.
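
The following is a minimal sketch of the sequential selection procedure. It assumes two hypothetical helpers that are not part of the paper: cluster(A, K, L), a consistent biclustering algorithm returning row and column assignments, and test_statistic(A, g, h, K, L), which computes the statistic $T$ in (10) below. The constant 0.98 is the approximate 95% quantile of the Tracy-Widom distribution with index 1, corresponding to a significance level of $\alpha = 0.05$.

```python
TW1_95 = 0.98  # approximate 95% quantile of TW1 (alpha = 0.05)

def select_cluster_numbers(A, K_max, L_max, cluster, test_statistic):
    """Test (K, L) pairs in increasing order of max(K, L), as in Figure 1,
    and return the first pair for which the null hypothesis is accepted."""
    pairs = sorted(
        ((K, L) for K in range(1, K_max + 1) for L in range(1, L_max + 1)),
        key=lambda kl: (max(kl), kl),  # (1,1), (1,2), (2,1), (2,2), (1,3), ...
    )
    for K, L in pairs:
        g, h = cluster(A, K, L)            # estimated cluster assignments
        T = test_statistic(A, g, h, K, L)  # statistic T in (10)
        if T <= TW1_95:                    # null hypothesis accepted
            return K, L
    return None  # every tested pair was rejected
```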

Assumptions.

Throughout this paper, we make the following assumptions to derive the test statistics:

  1. We assume that the distribution of $Z_{ij}$, which is given by $Z_{ij} = (A_{ij} - B_{g_i h_j}) / S_{g_i h_j}$ as in (4) later, has a sub-exponential decay. That is, there exists some $\vartheta > 0$ such that for $x \ge 1$, $\mathrm{P}(|Z_{ij}| > x) \le \vartheta^{-1} \exp(-x^{\vartheta})$. From this assumption, note that the fourth moment of the random variable $Z_{ij}$ is finite (i.e., $\mathbb{E}[Z_{ij}^4] < \infty$).

  2. We denote the numbers of rows and columns of matrix $A$ as $n$ and $p$, respectively. We assume that both $n$ and $p$ increase in proportion to some sufficiently large number $N$ (i.e., $n = \Theta(N)$ and $p = \Theta(N)$).

  3. Let $K_0$ and $L_0$, respectively, be the minimum row and column cluster numbers required to represent the block structure of the observed matrix $A$ under the null hypothesis. We assume that both $K_0$ and $L_0$ are finite constants that do not increase with the matrix sizes $n$ and $p$. We also assume that the minimum row and column sizes of a block in the true block structure, which we denote as $n_{\min}$ and $p_{\min}$, respectively, satisfy $n_{\min} = \Omega_p(N)$ and $p_{\min} = \Omega_p(N)$, where we used the following notation:

    $X_N = \Omega_p(a_N) \iff \text{there exists } c > 0 \text{ such that } \lim_{N \to \infty} \mathrm{P}(X_N \ge c\, a_N) = 1.$   (3)

    In other words, we assume that with high probability, there is no “too small” block in matrix $A$.

  4. If the given set of cluster numbers $(K, L)$ is equal to the true cluster numbers $(K_0, L_0)$, then we call it a realizable case. Otherwise, we call it an unrealizable case ($K < K_0$ or $L < L_0$). Here, we only consider the cases where $K \le K_0$ and $L \le L_0$.

  5. In the realizable case, we assume that the clustering algorithm is consistent; that is, the probability that it outputs the correct block structure converges to $1$ in the limit of $N \to \infty$. Owing to this assumption, the proposed method does not depend on a specific clustering algorithm. Several clustering algorithms, including those of [11, 1, 5], have been proven to be consistent.
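
As a concrete example of the clustering step, the following sketch runs k-means on the top singular vectors of $A$, separately for rows and columns. This is only an illustrative stand-in for the role played by the clustering algorithm in assumption (v); it is not one of the specific consistent algorithms cited above, and its consistency would need to be verified for a given model.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_bicluster(A, K, L, seed=0):
    """Cluster the rows of A into K groups and its columns into L groups
    by running k-means on spectral embeddings taken from the SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = max(K, L)                    # number of leading singular vectors
    row_feat = U[:, :r] * s[:r]      # spectral embedding of the rows
    col_feat = Vt[:r, :].T * s[:r]   # spectral embedding of the columns
    _, g = kmeans2(row_feat, K, minit='++', seed=seed)  # row assignments
    _, h = kmeans2(col_feat, L, minit='++', seed=seed)  # column assignments
    return g, h
```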

3 Relation to existing works

In this section, we briefly review the related works and explain the differences between them and the proposed method.

3.1 Model selection for block models

Statistical-test-based methods (for SBM)

Recently, several methods have been proposed for testing the properties of a given observed matrix in relation to SBMs [4, 25, 20, 16, 49]. In particular, the methods proposed in [4, 25, 16] have enabled us to estimate the number of blocks for SBMs. However, these methods differ from ours in the problem setting; they can be applied only to an SBM setting, where an observed matrix is a square symmetric matrix and the cluster assignments are the same for rows and columns. There has been no method for estimating the block numbers for LBMs, where the rows and columns of a (not necessarily square) observed matrix are simultaneously decomposed into clusters.

Cross-validation-based methods

Cross-validation is a widely used method for model selection, where a data set is first split into training and test data sets, and then the best model with the minimum test error is selected. Recently, cross-validation methods for matrix data have been proposed [9, 26, 21, 6] to determine the number of clusters in network data. Although the purpose of these methods is similar to that of our method, they differ from ours in that their target is network data, where the observed matrix is square and its rows and columns represent the same node set; thus, the block structure is symmetric (regardless of whether the network itself is directed or undirected). Moreover, unlike a statistical test, these methods cannot provide quantitative knowledge about the reliability of the selected model. Furthermore, the computational cost of cross-validation is generally high because it requires the iterative calculation of the test error with different data set partitions.

Information-criterion-based methods

Another approach for determining the number of blocks in a matrix is to estimate the generalization error or the marginal likelihood by some information criterion for given sets of block numbers. By using such information criteria, we can select a model in a statistically meaningful (non-heuristic) way. In regard to block models, many variants of BIC have been proposed [22, 33, 23, 17, 39]. Unlike our test-based method, which only requires a clustering algorithm to satisfy the consistency condition (Section 2), an information criterion must be chosen carefully according to the given clustering algorithm for its theoretical guarantee to hold. For instance, BIC can be employed for estimating the marginal likelihood only if the Fisher information matrix of the model is regular, which is not the case for block models.

To solve this problem, as an alternative criterion to BIC, the integrated completed likelihood (ICL) criterion has been used in many studies for estimating the number of blocks in LBMs [27, 46, 8]. In ICL, we first derive a marginal likelihood for a given pair of an observed matrix and block assignments, and then substitute the set of estimated block assignments to approximate the marginal likelihood. However, since ICL is computed based on a single estimator of the block assignments, there is no guarantee for the goodness of the approximation of the marginal likelihood.

Similar to cross-validation-based methods, information-criterion-based methods cannot provide a probabilistic guarantee for the reliability of the selected model, which is a disadvantage for the purpose of knowledge discovery. The computational cost also becomes a problem because the computation of an information criterion requires the process of approximating the posterior distribution by MCMC.

Other model selection methods

Aside from the information criteria, several studies have proposed to determine the number of blocks in LBMs based on the co-clustering adjusted Rand index [37], the extended modularity for biclustering [24], or the expected posterior loss for a given loss function [36]. Another approach is to define the posterior distribution not only on the cluster assignments of rows and columns but also on the row and column cluster numbers [47, 31]. Unlike the model selection approaches, such nonparametric Bayesian methods can estimate the distribution of the block numbers. The best-fitted number of blocks can then be determined based on the posterior distribution (e.g., we can choose a MAP estimator [31]). However, in this case, the computational cost of MCMC is higher than that of the information-criterion-based methods, because a large number of iterations is required to approximate the posterior distribution over both the block assignments and the number of blocks.

4 Test statistic for determining the set of cluster numbers

To derive the test statistic for the proposed goodness-of-fit test, we first normalize each entry of an observed matrix $A$ by subtracting its block-wise mean $B_{g_i h_j}$ and dividing it by its block-wise standard deviation $S_{g_i h_j}$ in (1):

$Z_{ij} = \dfrac{A_{ij} - B_{g_i h_j}}{S_{g_i h_j}}.$   (4)

By definition, each entry of matrix $Z$ in (4) independently follows a distribution with zero mean and unit standard deviation. Therefore, according to the result in [34], if assumption (i) holds and $p / n \to \gamma \in (0, \infty)$ in the limit of $N \to \infty$, the scaled maximum eigenvalue of matrix $Z^{\top} Z$ converges in law to the Tracy-Widom distribution with index $1$ ($\mathrm{TW}_1$) in the limit of $N \to \infty$:

$\dfrac{\lambda_{\max}(Z^{\top} Z) - \mu_{np}}{\sigma_{np}} \xrightarrow{d} \mathrm{TW}_1,$   (5)

where $\lambda_{\max}(Z^{\top} Z)$ is the maximum eigenvalue of matrix $Z^{\top} Z$ and

$\mu_{np} = \left( \sqrt{n} + \sqrt{p} \right)^2, \qquad \sigma_{np} = \left( \sqrt{n} + \sqrt{p} \right) \left( \dfrac{1}{\sqrt{n}} + \dfrac{1}{\sqrt{p}} \right)^{1/3}.$   (6)
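
The constants in (6) and the convergence in (5) can be checked numerically. The sketch below uses the standard normalization for the largest eigenvalue of $Z^{\top} Z$ (finite-sample conventions, such as $n$ versus $n - 1$, vary across the literature), and the quoted $\mathrm{TW}_1$ mean of about $-1.2$ is an approximation.

```python
import numpy as np

def tw_constants(n, p):
    """Centering and scaling constants mu_np and sigma_np in (6)."""
    mu = (np.sqrt(n) + np.sqrt(p)) ** 2
    sigma = (np.sqrt(n) + np.sqrt(p)) * (1 / np.sqrt(n) + 1 / np.sqrt(p)) ** (1 / 3)
    return mu, sigma

# Quick check of (5) on an i.i.d. standard Gaussian matrix Z: the scaled
# maximum eigenvalue should fluctuate on the TW1 scale (within a few units
# of -1.2, the approximate TW1 mean).
rng = np.random.default_rng(0)
n, p = 600, 400
Z = rng.standard_normal((n, p))
lam_max = np.linalg.eigvalsh(Z.T @ Z)[-1]  # largest eigenvalue of Z^T Z
mu, sigma = tw_constants(n, p)
print((lam_max - mu) / sigma)
```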

In most cases, the true cluster numbers $(K_0, L_0)$ and the true cluster assignments $g$ and $h$ are unknown in advance. Therefore, we can only estimate the block structure based on the observed matrix $A$ and the given cluster numbers. Let $(K, L)$ be the given set of row and column cluster numbers, and let $\hat{g}$ and $\hat{h}$, respectively, be the estimated cluster assignments for the rows and columns. Based on such an estimated block structure $(\hat{g}, \hat{h})$, we estimate the block-wise mean and standard deviation by

$\hat{B}_{kl} = \dfrac{1}{|\hat{I}_k| |\hat{J}_l|} \sum_{i \in \hat{I}_k} \sum_{j \in \hat{J}_l} A_{ij}, \qquad \hat{S}_{kl} = \sqrt{ \dfrac{1}{|\hat{I}_k| |\hat{J}_l|} \sum_{i \in \hat{I}_k} \sum_{j \in \hat{J}_l} \left( A_{ij} - \hat{B}_{kl} \right)^2 },$   (7)

where $\hat{I}_k$ is the set of row indices of matrix $A$ that are assigned to the $k$-th row cluster, and $\hat{J}_l$ is the set of column indices of matrix $A$ that are assigned to the $l$-th column cluster:

$\hat{I}_k = \{ i : \hat{g}_i = k \}, \qquad \hat{J}_l = \{ j : \hat{h}_j = l \}.$   (8)

The consistency assumption (v) guarantees that if $(K, L) = (K_0, L_0)$, the probability that the cluster assignments $\hat{g}$ and $\hat{h}$ are correct converges to $1$ in the limit of $N \to \infty$.

We define an estimator $\hat{Z}$ of the normalized matrix $Z$ in (4) based on the estimated block-wise mean and standard deviation in (7):

$\hat{Z}_{ij} = \dfrac{A_{ij} - \hat{B}_{\hat{g}_i \hat{h}_j}}{\hat{S}_{\hat{g}_i \hat{h}_j}}.$   (9)

The test statistic of the proposed goodness-of-fit test is given by the scaled maximum eigenvalue of matrix $\hat{Z}^{\top} \hat{Z}$:

$T = \dfrac{\lambda_{\max}(\hat{Z}^{\top} \hat{Z}) - \mu_{np}}{\sigma_{np}},$   (10)

where $\lambda_{\max}(\hat{Z}^{\top} \hat{Z})$ is the maximum eigenvalue of matrix $\hat{Z}^{\top} \hat{Z}$, and $\mu_{np}$ and $\sigma_{np}$ are given by (6).
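
Putting (7)-(10) together, a minimal sketch of the test statistic computation is given below. The function name, the eps guard against zero estimated standard deviations, and the zero-based cluster labels are our own additions, and every estimated block is assumed to be non-empty.

```python
import numpy as np

def test_statistic(A, g, h, K, L, eps=1e-12):
    """Compute T in (10) from the observed matrix A and the estimated
    row/column cluster assignments g (values 0..K-1) and h (values 0..L-1)."""
    g, h = np.asarray(g), np.asarray(h)
    n, p = A.shape
    Zhat = np.empty((n, p))
    for k in range(K):
        for l in range(L):
            rows = np.flatnonzero(g == k)    # index set I_k in (8)
            cols = np.flatnonzero(h == l)    # index set J_l in (8)
            block = A[np.ix_(rows, cols)]
            B_hat = block.mean()             # block-wise mean in (7)
            S_hat = block.std()              # block-wise sd in (7)
            Zhat[np.ix_(rows, cols)] = (block - B_hat) / max(S_hat, eps)  # (9)
    lam_max = np.linalg.eigvalsh(Zhat.T @ Zhat)[-1]  # largest eigenvalue
    mu = (np.sqrt(n) + np.sqrt(p)) ** 2              # mu_np in (6)
    sigma = (np.sqrt(n) + np.sqrt(p)) * (1 / np.sqrt(n) + 1 / np.sqrt(p)) ** (1 / 3)
    return (lam_max - mu) / sigma                    # T in (10)
```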

Theorem 4.1 (Realizable case).

We assume that the following condition holds: $p / n \to \gamma \in (0, \infty)$ in the limit of $N \to \infty$. Under the consistency assumption (v) for the clustering algorithm, if $(K, L) = (K_0, L_0)$,

$T \xrightarrow{d} \mathrm{TW}_1$   (11)

in the limit of $N \to \infty$, where $T$ is defined as in (10).

Proof.

We denote the operator norm of a matrix $M$ by $\| M \|_{\mathrm{op}}$,

$\| M \|_{\mathrm{op}} = \max_{v \neq 0} \dfrac{\| M v \|_2}{\| v \|_2},$   (12)

which is equal to the largest singular value of $M$.

From now on, we use the following notation:

  • $B_{kl}$ and $S_{kl}$, respectively, are the constant elements in the $(k, l)$-th true blocks of the mean and standard deviation matrices $B$ and $S$.

  • $\bar{B}_{kl}$ and $\bar{S}_{kl}$, respectively, are the constant elements in the $(k, l)$-th true blocks of the mean and standard deviation matrices $\bar{B}$ and $\bar{S}$, which are calculated based on the observed matrix $A$ and the correct cluster assignments. That is, $\bar{B}_{kl}$ and $\bar{S}_{kl}$, respectively, are the sample mean and standard deviation of all the entries in the $(k, l)$-th true block of matrix $A$.

  • $\hat{B}_{kl}$ and $\hat{S}_{kl}$, respectively, are the constant elements in the $(k, l)$-th estimated blocks of the mean and standard deviation matrices $\hat{B}$ and $\hat{S}$. Here, the estimated block structure is output by a clustering algorithm that satisfies the consistency assumption (v).

First of all, we derive the difference between $(B_{kl}, S_{kl})$ and $(\bar{B}_{kl}, \bar{S}_{kl})$. Since the number of entries in each true block is proportional to $N^2$ by assumption (iii), $N (\bar{B}_{kl} - B_{kl})$ converges in law to a zero-mean Gaussian random variable from the central limit theorem. Therefore, from Prokhorov’s theorem [44], we have

$\bar{B}_{kl} - B_{kl} = O_p(N^{-1}).$   (13)

Also, the following equation holds (the proof is given in Appendix A):

$\bar{S}_{kl} - S_{kl} = O_p(N^{-1}).$   (14)
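
As a quick numerical illustration of the $O_p(N^{-1})$ rates in (13) and (14), the following simulation draws one Gaussian block with $N^2$ entries (the values $B = 1.5$ and $S = 2.0$ are arbitrary) and checks that the rescaled deviations $N |\bar{B} - B|$ and $N |\bar{S} - S|$ stay bounded as $N$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
B, S = 1.5, 2.0  # arbitrary population block mean and standard deviation
for N in [50, 100, 200, 400, 800]:
    X = rng.normal(B, S, size=(N, N))  # one true block with N^2 entries
    print(f"N={N:4d}  N|Bbar-B|={N * abs(X.mean() - B):6.2f}  "
          f"N|Sbar-S|={N * abs(X.std() - S):6.2f}")
# Both rescaled deviations remain O(1), consistent with (13) and (14).
```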

Let $\bar{Z}$ be a matrix whose entries are calculated based on the observed matrix $A$ and the sample mean and standard deviation for the correct block structure. Formally,

$\bar{Z}_{ij} = \dfrac{A_{ij} - \bar{B}_{g_i h_j}}{\bar{S}_{g_i h_j}}.$   (15)

From here, we derive the difference between the maximum eigenvalue of matrix $\bar{Z}^{\top} \bar{Z}$ and the maximum eigenvalue of matrix $Z^{\top} Z$. From (5), we have $\lambda_{\max}(Z^{\top} Z) = O_p(N)$. Therefore, the largest singular value of matrix $Z$, which is equal to $\lambda_{\max}(Z^{\top} Z)^{1/2}$, is in the order of $O_p(N^{1/2})$.

By the subadditivity of the operator norm, we have

$\left| \| \bar{Z} \|_{\mathrm{op}} - \| Z \|_{\mathrm{op}} \right| \le \| \bar{Z} - Z \|_{\mathrm{op}}.$   (16)

Let $Z^{(kl)}$, $\bar{Z}^{(kl)}$, $\bar{B}^{(kl)}$, and $\bar{S}^{(kl)}$, respectively, be the $(k, l)$-th true blocks of matrices $Z$, $\bar{Z}$, $\bar{B}$, and $\bar{S}$. We also denote the row and column sizes of the $(k, l)$-th true block as $n_k$ and $p_l$, respectively. From the definitions in (4) and (15), we have

$\bar{Z}^{(kl)} - Z^{(kl)} = \left( \dfrac{S_{kl}}{\bar{S}_{kl}} - 1 \right) Z^{(kl)} + \dfrac{B_{kl} - \bar{B}_{kl}}{\bar{S}_{kl}} \mathbf{1}_{n_k} \mathbf{1}_{p_l}^{\top},$   (17)

where $\mathbf{1}_m$ denotes the $m$-dimensional all-ones vector. Combining this with (13), (14), and the fact that the Frobenius norm upper-bounds the operator norm, we have

$\| \bar{Z}^{(kl)} - Z^{(kl)} \|_{\mathrm{op}} \le \| \bar{Z}^{(kl)} - Z^{(kl)} \|_{F} = O_p(1).$   (18)

Therefore, since the operator norm of a matrix is not larger than the sum of the operator norms of all of its blocks and the numbers of blocks are finite constants, we have

$\| \bar{Z} - Z \|_{\mathrm{op}} = O_p(1).$   (19)

By combining this with (16), we obtain

$\left| \lambda_{\max}(\bar{Z}^{\top} \bar{Z})^{1/2} - \lambda_{\max}(Z^{\top} Z)^{1/2} \right| = O_p(1).$   (20)

Next, we consider the joint probability of the event $E_1$ that the clustering algorithm outputs the correct block structure (i.e., $\hat{Z} = \bar{Z}$) and the event $E_2$ that $| \lambda_{\max}(\bar{Z}^{\top} \bar{Z})^{1/2} - \lambda_{\max}(Z^{\top} Z)^{1/2} | \le C$ holds for a given constant $C > 0$. Such a joint probability satisfies the following inequality:

$\mathrm{P}(E_1 \cap E_2) \ge 1 - \mathrm{P}(E_1^{\mathrm{c}}) - \mathrm{P}(E_2^{\mathrm{c}}),$   (21)

where $E^{\mathrm{c}}$ is the complement of event $E$. The consistency assumption (v) guarantees that if $(K, L) = (K_0, L_0)$, $\mathrm{P}(E_1)$ converges to $1$ in the limit of $N \to \infty$. By combining this fact with (20), we obtain

$\left| \lambda_{\max}(\hat{Z}^{\top} \hat{Z})^{1/2} - \lambda_{\max}(Z^{\top} Z)^{1/2} \right| = O_p(1),$   (22)

which results in

$\lambda_{\max}(\hat{Z}^{\top} \hat{Z}) - \lambda_{\max}(Z^{\top} Z) = o_p(\sigma_{np}).$   (23)

Here, note that $\mu_{np} = \Theta(N)$ and $\sigma_{np} = \Theta(N^{1/3})$. By combining these results and (5), from Slutsky’s theorem (note that both $\mu_{np}$ and $\sigma_{np}$ are constants that only depend on the matrix size),

$T = \dfrac{\lambda_{\max}(\hat{Z}^{\top} \hat{Z}) - \mu_{np}}{\sigma_{np}} \xrightarrow{d} \mathrm{TW}_1.$   (24)

This is equivalent to the statement of Theorem 4.1. ∎
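
The statement of Theorem 4.1 can also be checked by simulation. The following Monte Carlo sketch generates realizable $2 \times 2$-block Gaussian matrices (the block means and standard deviations below are arbitrary examples) with the correct assignments given, and compares the empirical rejection rate of $T$ at the approximate 95% quantile of $\mathrm{TW}_1$ (about 0.98) with the nominal 5% level; test_statistic refers to the sketch given after (10).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K0, L0 = 300, 200, 2, 2
B = np.array([[0.0, 1.0], [1.0, -0.5]])  # arbitrary block means
S = np.array([[1.0, 0.5], [2.0, 1.0]])   # arbitrary block sds
g = rng.integers(K0, size=n)             # true row assignments
h = rng.integers(L0, size=p)             # true column assignments

n_trials, rejections = 200, 0
for _ in range(n_trials):
    A = rng.normal(B[np.ix_(g, h)], S[np.ix_(g, h)])  # one realizable sample
    T = test_statistic(A, g, h, K0, L0)               # statistic (10)
    rejections += T > 0.98                            # approx. TW1 95% quantile
print("empirical size:", rejections / n_trials)       # expected near 0.05
```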

Theorem 4.2 (Unrealizable case, lower bound).

Suppose $K < K_0$ or $L < L_0$. Then, there exists a constant $c > 0$ that does not depend on $N$ such that

$\lim_{N \to \infty} \mathrm{P} \left( T \ge c N^{5/3} \right) = 1,$   (25)

where $T$ is defined as in (10).

Proof.

Let $\tilde{B}$ be a matrix that consists of the estimated block structure and whose entries are the population block-wise means, which can be calculated using the true mean matrix $B$ and the true and estimated cluster assignments $(g, h, \hat{g}, \hat{h})$ (see also Figure 2).

Figure 2: Difference between matrices $B$, $\tilde{B}$, and $\hat{B}$ in an unrealizable case.

To derive the difference between matrices $\tilde{B}$ and $\hat{B}$, we first focus on the relationship between matrices $B$ and $\tilde{B}$. In the unrealizable case (i.e., $K < K_0$ or $L < L_0$), we can assume $K < K_0$ without loss of generality.

Let $n_k$ be the number of rows in the $k$-th true row cluster. For every true row cluster index $k$, at least one estimated row cluster contains $n_k / K$ or more rows that are assigned to the $k$-th row cluster in the true block structure (otherwise, the total number of rows in the $k$-th true row cluster would be smaller than $n_k$). Since $K < K_0$, at least one estimated row cluster contains two or more sets of rows whose true row clusters are mutually different, both of which have row sizes of at least $n_{\min} / K$, where $n_{\min}$ is the minimum row size of a block in the true block structure. On the other hand, for every true column cluster index $l$, at least one estimated column cluster contains $p_l / L$ or more columns that are assigned to the $l$-th column cluster in the true block structure, where $p_l$ is the number of columns in the $l$-th true column cluster. By combining these facts, there exists at least one estimated block that contains two or more submatrices, both of which have sizes of at least $(n_{\min} / K) \times (p_{\min} / L)$ and whose true blocks are mutually different.

Let $X_1$ and $X_2$ be two such submatrices, whose true block-wise means are $B_1$ and $B_2$, respectively. We can assume $B_1 > B_2$ without loss of generality. In matrix $\tilde{B}$, which has the estimated block structure, both $X_1$ and $X_2$ have the same value $\tilde{B}_0$. Here, if $\tilde{B}_0 \le (B_1 + B_2) / 2$ holds, $|B_1 - \tilde{B}_0| \ge (B_1 - B_2) / 2$, and $|B_2 - \tilde{B}_0| \ge (B_1 - B_2) / 2$ if otherwise. Therefore, for any $\tilde{B}_0$, there exists at least one submatrix $X^{*}$ (which is either $X_1$ or $X_2$) with a size of at least $(n_{\min} / K) \times (p_{\min} / L)$, where all the entries are $B^{*}$ (which is either $B_1$ or $B_2$) in matrix $B$ and

$|B^{*} - \tilde{B}_0| \ge \dfrac{B_1 - B_2}{2} > 0.$   (26)

Let $(k', l')$ be the row and column cluster indices of the estimated block which contains the above submatrix $X^{*}$. We denote the row and column sizes of the $(k', l')$-th estimated block as $\hat{n}_{k'}$ and $\hat{p}_{l'}$, respectively. Let $A^{(k'l')}$, $\tilde{B}^{(k'l')}$, and $\hat{B}^{(k'l')}$, respectively, be the submatrices of $A$, $\tilde{B}$, and $\hat{B}$ with the same row and column indices as the $(k', l')$-th estimated block. In regard to the difference between matrices $\tilde{B}$ and $\hat{B}$ (both of which have the estimated block structure), we have

$\left| \hat{B}_{k'l'} - \tilde{B}_{k'l'} \right| = \dfrac{1}{\hat{n}_{k'} \hat{p}_{l'}} \left| \mathbf{1}_{\hat{n}_{k'}}^{\top} \left( A^{(k'l')} - B^{(k'l')} \right) \mathbf{1}_{\hat{p}_{l'}} \right| \le \dfrac{\left\| A^{(k'l')} - B^{(k'l')} \right\|_{\mathrm{op}}}{\sqrt{\hat{n}_{k'} \hat{p}_{l'}}} = O_p(N^{-1/2}),$   (27)

where $B^{(k'l')}$ is the submatrix of the true mean matrix $B$ with the same row and column indices as the $(k', l')$-th estimated block, and $\tilde{B}_{k'l'}$ and $\hat{B}_{k'l'}$, respectively, are the constant values of $\tilde{B}$ and $\hat{B}$ on this block. To derive the final equation in (27), we used the assumption that $\hat{n}_{k'} \hat{p}_{l'} = \Omega_p(N^2)$ and the fact that $\| A^{(k'l')} - B^{(k'l')} \|_{\mathrm{op}}$ is equal to the largest singular value of $A^{(k'l')} - B^{(k'l')}$, which is $O_p(N^{1/2})$ from (5).

Let $E$ be the event that $| \hat{B}_{k'l'} - \tilde{B}_{k'l'} | \le C N^{-1/2}$ holds. From (27), for all $\epsilon > 0$, there exist $C > 0$ and $N_0$ such that for all $N \ge N_0$, the following inequality holds:

$\mathrm{P}(E) \ge 1 - \epsilon.$   (28)

By combining (27) and (28), we obtain that on the event $E$,

$\left\| \hat{B}^{(k'l')} - \tilde{B}^{(k'l')} \right\|_{\mathrm{op}} = \left| \hat{B}_{k'l'} - \tilde{B}_{k'l'} \right| \sqrt{\hat{n}_{k'} \hat{p}_{l'}} \le C' N^{1/2}$   (29)

for some constant $C' > 0$.

From now on, we denote the row and column sizes of submatrix $X^{*}$ by $m_1$ and $m_2$, respectively. Let $A^{(*)}$, $B^{(*)}$, $\tilde{B}^{(*)}$, $\hat{B}^{(*)}$, and $\hat{Z}^{(*)}$, respectively, be the submatrices of matrices $A$, $B$, $\tilde{B}$, $\hat{B}$, and $\hat{Z}$ with the same row and column indices as submatrix $X^{*}$. We also denote the constant entries of the submatrices of $\hat{B}$ and $\hat{S}$ with the same row and column indices as submatrix $X^{*}$, respectively, as $\hat{B}_{k'l'}$ and $\hat{S}_{k'l'}$. From the definition (9) and since the operator norm of a submatrix is not larger than that of the original matrix, we have

$\left\| \hat{Z} \right\|_{\mathrm{op}} \ge \left\| \hat{Z}^{(*)} \right\|_{\mathrm{op}} = \dfrac{1}{\hat{S}_{k'l'}} \left\| A^{(*)} - \hat{B}^{(*)} \right\|_{\mathrm{op}}.$   (30)

First, the order of the estimated standard deviation is given by

$\hat{S}_{k'l'} = O_p(1).$   (31)

The proof of (31) is in Appendix B.

The only non-zero (and thus, the largest) singular value of matrix $(B^{*} - \tilde{B}_0) \mathbf{1}_{m_1} \mathbf{1}_{m_2}^{\top}$ is $|B^{*} - \tilde{B}_0| \sqrt{m_1 m_2}$. Since the largest singular value of a matrix is equal to its operator norm, we have

$\left\| B^{(*)} - \tilde{B}^{(*)} \right\|_{\mathrm{op}} = \left\| (B^{*} - \tilde{B}_0) \mathbf{1}_{m_1} \mathbf{1}_{m_2}^{\top} \right\|_{\mathrm{op}} = |B^{*} - \tilde{B}_0| \sqrt{m_1 m_2}.$   (32)

Therefore, by combining this fact with (26), if the statement of event $E$ holds, the following inequality also holds:

$\left\| A^{(*)} - \hat{B}^{(*)} \right\|_{\mathrm{op}} \ge \left\| B^{(*)} - \tilde{B}^{(*)} \right\|_{\mathrm{op}} - \left\| A^{(*)} - B^{(*)} \right\|_{\mathrm{op}} - \left\| \tilde{B}^{(*)} - \hat{B}^{(*)} \right\|_{\mathrm{op}},$   (33)

which results in that $\| A^{(*)} - \hat{B}^{(*)} \|_{\mathrm{op}} \ge \frac{B_1 - B_2}{2} \sqrt{m_1 m_2} - \| A^{(*)} - B^{(*)} \|_{\mathrm{op}} - C' N^{1/2}$, where $\sqrt{m_1 m_2} = \Omega_p(N)$ by assumption (iii).

Also, from (5), we have