# Co-clustering separately exchangeable network data\thanksrefT1

###### Abstract

This article establishes the performance of stochastic blockmodels in addressing the co-clustering problem of partitioning a binary array into subsets, assuming only that the data are generated by a nonparametric process satisfying the condition of separate exchangeability. We provide oracle inequalities with rate of convergence corresponding to profile likelihood maximization and mean-square error minimization, and show that the blockmodel can be interpreted in this setting as an optimal piecewise-constant approximation to the generative nonparametric model. We also show for large sample sizes that the detection of co-clusters in such data indicates with high probability the existence of co-clusters of equal size and asymptotically equivalent connectivity in the underlying generative process.

10.1214/13-AOS1173 \volume42 \issue1 2014 \firstpage29 \lastpage63 \newproclaimdefinitionDefinition[section] \newproclaimexampleExample[section] \newproclaimremarkRemark[section]

Co-clustering network data \thankstextT1Supported in part by the US Army Research Office under PECASE Award W911NF-09-1-0555 and MURI Award W911NF-11-1-0036; by the UK EPSRC under Mathematical Sciences Established Career Fellowship EP/K005413/1 and Institutional Sponsorship Award EP/K503459/1; by the UK Royal Society under a Wolfson Research Merit Award; and by Marie Curie FP7 Integration Grant PCIG12-GA-2012-334622 within the 7th European Union Framework Program.

a]\fnmsDavid \snmChoilabel=e1]davidch@andrew.cmu.edu and b]\fnmsPatrick J. \snmWolfe\correflabel=e2]p.wolfe@ucl.ac.uk

class=AMS] \kwd[Primary ]62G05 \kwd[; secondary ]05C80 \kwd60B20

Bipartite graph \kwdnetwork clustering \kwdoracle inequality \kwdprofile likelihood \kwdstatistical network analysis \kwdstochastic blockmodel and co-blockmodel

## 1 Introduction

Blockmodels are popular tools for network modeling that see wide and rapidly growing use in analyzing social, economic and biological systems; see Zhao, Levina and Zhu (2011) and Fienberg (2012) for recent overviews. A blockmodel dictates that the probability of connection between any two network nodes is determined only by their respective block memberships, parameterized by a latent categorical variable at each node.

Fitting a blockmodel to a binary network adjacency matrix yields a clustering of network nodes, based on their shared proclivities for forming connections. More generally, fitting a blockmodel to any binary array involves partitioning it into blocks. In this way, blockmodels represent a piecewise-constant approximation to a latent function that generates network connection probabilities. This in turn can be viewed as a histogram-like approximation to a nonparametric generative process for binary arrays; fitting such models is termed co-clustering [Flynn and Perry (2012), Rohe and Yu (2012)].

This article analyzes the performance of stochastic blockmodels for co-clustering under model misspecification, assuming only an underlying generative process that satisfies the condition of separate exchangeability [Diaconis and Janson (2008)]. This significantly generalizes known results for the blockmodel and its co-clustering variant, which have been established only recently under the requirement of correct model specification [Bickel and Chen (2009), Bickel, Chen and Levina (2011), Rohe, Chatterjee and Yu (2011), Chatterjee (2012), Choi, Wolfe and Airoldi (2012), Flynn and Perry (2012), Rohe and Yu (2012), Zhao, Levina and Zhu (2012), Fishkind et al. (2013)].

We show that blockmodels for co-clustering satisfy consistency properties and remain interpretable whenever separate exchangeability holds. Exchangeability is a natural condition satisfied by many network models: it characterizes permutation invariance, implying that the ordering of nodes carries no information [Bickel and Chen (2009), Hoff (2009)]. A blockmodel is an exchangeable model in which the connection probabilities are piecewise constant. Blockmodels also provide a simplified parametric approximation in the more general nonparametric setting [Bickel, Chen and Levina (2011)].

In addition to providing oracle inequalities for blockmodel -estimators corresponding to profile likelihood and least squares optimizations, we show that it is possible to identify clusterings in data—what practitioners term network communities—even when the actual generative process is far from a blockmodel. The main statistical application of our results is to enable co-clustering under model misspecification. Much effort has been devoted to the task of community detection [Newman (2006), Fortunato and Barthélemy (2007), Zhao, Levina and Zhu (2011), Fienberg (2012)], but the drawing of inferential conclusions in this setting has been limited by the need to assume a correctly specified model.

Our results imply that community detection can be understood as finding a best piecewise-constant or simple function approximation to a flexible nonparametric process. In settings where the underlying generative process is not well understood and the specification of models is thus premature, such an approach is a natural first step for exploratory data analysis. This has been likened to the use of histograms to characterize exchangeable data in nonnetwork settings [Bickel and Chen (2009)].

The article is organized as follows. In Section 2, we introduce our nonparametric setting and model. In Section 3 we present oracle inequalities for co-clustering based on blockmodel fitting. In Section 4 we give our main technical result, and discuss a concrete statistical application: quantifying how the collection of co-clusterings of the data approaches that of a generative nonparametric process. We prove our main result in Section 5, by combining a construction used to establish a theory of graph limits [Borgs et al. (2006, 2008, 2012)] with statistical learning theory results on -statistics [Clémençon, Lugosi and Vayatis (2008)]. In Section 6 we illustrate our results via a simulation study, and in Section 7 we relate them to other recent work. Appendices A–C contain additional proofs and technical lemmas.

## 2 Model elicitation

Recall that fitting a blockmodel to a binary array involves partitioning it into blocks. Denote by a bipartite graph with edge set and vertex sets , where assignments of vertices to or are known. For example, and might represent people and locations, with edge denoting that person frequents location . See Flynn and Perry (2012) and Rohe and Yu (2012) for additional examples.

### 2.1 Exchangeable graph models

For a bipartite graph represented as a binary array , the appropriate notion of exchangeability is as follows.

[(Separate exchangeability [Diaconis and Janson (2008)])] An array of binary random variables is separately exchangeable if

for all all permutations of for all all permutations of , and all .

If we identify a finite set of rows and columns of with the adjacency matrix of an observed bipartite graph , then it is clear that the notion of separate exchangeability encompasses a broad class of network models. Indeed, given a single observation of an unlabeled graph, it is natural to consider the class of all models that are invariant to permutation of its adjacency matrix; see Bickel and Chen (2009) and Hoff (2009) for discussion.

The assumption of separate exchangeability is the only one we will require for our results to hold. A representation of models in this class will be given by the Aldous–Hoover theorem for separately exchangeable binary arrays.

[(Exchangeable array model)] Fix a measurable mapping . Then the following model generates an exchangeable random bipartite graph through its adjacency matrix : {longlist}[(1)]

generate ;

fix and , and generate each element of and ;

for , and , generate , where denotes the function . If , then connect vertices and .

The Aldous–Hoover theorem states that this representation is sufficient to describe any separately exchangeable network distribution.

###### Theorem 2.1 ([Diaconis and Janson (2008)])

Let be a separately exchangeable binary array. Then there exists some , unique up to measure-preserving transformation, which generates .

The interpretation of the exchangeable graph model of Definition 2.1 is that each vertex has a latent parameter in ( for vertex in , and for vertex in ) which determines its affinity for connecting to other vertices, while is a network-wide connectivity parameter (nonidentifiable from a single network observation). Because and are latent, itself is identifiable only up to measure-preserving transformation, and is hence indistinguishable from any mapping for which are in the set of measure-preserving bijective maps of to itself.

### 2.2 The stochastic co-blockmodel

Many popular network models can be recognized as instances of Definition 2.1. For example, Hoff, Raftery and Handcock (2002), Airoldi et al. (2008) and Kim and Leskovec (2012) all present models in which the resulting is constant in , while Miller, Griffiths and Jordan (2009) require the full parameterization . The stochastic co-blockmodel specifies constant in and also piecewise-constant in and , and thus can be viewed as a simple function approximation to in Definition 2.1.

[(Stochastic co-blockmodel [Rohe and Yu (2012)])] Fix integers , a matrix and discrete probability measures and on and . Then the stochastic co-blockmodel generates an exchangeable bipartite graph through the matrix as follows: {longlist}[(1)]

Fix and , and generate and .

For , and , generate . If , then connect vertices and . Additionally, given co-blockmodel parameters , define

as the mapping corresponding to Definition 2.1, with the inverse distribution function corresponding to a given distribution .

Without loss of generality we assume in what follows, noting that our results do not depend in any crucial way on this assumption. Thus, a stochastic blockmodel’s vertices in belong to one of latent classes, as do those in . Vectors and of categorical variables specify these class memberships. The matrix indexes the corresponding connection affinities between classes in and . Because and are latent, the stochastic co-blockmodel is identifiable only up to a permutation of its class labels.

## 3 Oracle inequalities for co-clustering

If we assume that the separately exchangeable data model of Definition 2.1 is in force, then a natural first step is to approximate by way of some piecewise-constant , according to the stochastic co-blockmodel of Definition 2.2. This approximation task is equivalent to fixing and estimating by co-clustering the entries of an observed adjacency matrix .

### 3.1 Sets of co-clustering parameters

To accomplish this task, we consider -estimators that involve an optimization over the latent categorical variable vectors and . The resulting blockmodel estimates will reside in a set containing triples , where we define to be the set of all probability distributions over whose elements are integer multiples of ,

and likewise for . Note that and are subsets of the standard -simplex, chosen to contain all measures and that can be obtained by empirically co-clustering the elements of an -dimensional binary array. Thus, by construction, any estimator based on an empirical co-clustering of an observed binary array has codomain .

Given a specific and , let denote the set of all node-to-class assignment functions that partition the set into classes in a manner that respects the proportions dictated by ,

and likewise for .

### 3.2 Oracle inequalities

We now establish that, for risk and Kullback–Leibler divergence, there exist -estimators that enable us to determine, with rate of convergence , optimal piecewise-constant approximations of the generative , up to quantization due to the discreteness of .

###### Theorem 3.1 ((Oracle inequalities for co-clustering))

Let be a separately exchangeable array generated by some in accordance with Definition 2.1, and consider fitting a -class stochastic co-blockmodel parameterized by to . Then as , with and fixed:

[(1)]

For the least squares co-blockmodel -estimator

(1) |

relative to the risk

we have that

Given any , let . Consider the profile likelihood co-blockmodel -estimator

(2) | |||

relative to

If exists, and and are finite, then

(3) |

Theorem 3.1 can be viewed as analyzing maximum likelihood techniques in the context of model misspecification [White (1982)], and is proved in Appendix A. It establishes that minimization of the squared error between a fitted co-blockmodel and an observed binary array according to (1) serves as a proxy for approximation of by in mean square, and that fitting a stochastic co-blockmodel via profile likelihood according to (3.1) is equivalent to minimizing the average Kullback–Leibler divergence of the approximation from the generative .

The existence of a limiting object implies that we are in the dense graph regime, with expected network degree values increasing linearly as a function of or . Given a correctly specified generative blockmodel, profile likelihood estimators are known to be consistent even in the sparse graph setting of polynomial or poly-logarithmic expected degree growth [Bickel and Chen (2009)]. In our setting, however, the generative model is no longer necessarily a blockmodel; in this context, both Borgs et al. (2008) and Chatterjee (2012) leave open the question of consistently estimating sparse network parameters, while Bickel, Chen and Levina (2011) give an identifiability result extending to the sparse case. The simulation study reported in Section 6 below suggests that the behavior of blockmodel estimators is qualitatively similar across at least some families of dense and sparse models.

### 3.3 Additional remarks on Theorem 3.1

In essence, Theorem 3.1 implies that the binary array yields information on its underlying generative at a rate of at least . While the necessary optimizations in (1) and (3.1) are not currently known to admit efficient exact algorithms, they strongly resemble existing objective functions for community detection for which many authors have reported good heuristics [Newman (2006), Fortunato and Barthélemy (2007), Zhao, Levina and Zhu (2011)]. Furthermore, polynomial-time spectral algorithms are known in certain settings to find correct labelings under the assumption of a generative blockmodel [Rohe, Chatterjee and Yu (2011), Fishkind et al. (2013)], suggesting that efficient algorithms may exist when distinct clusterings or community divisions are present in the data. In this vein, Chatterjee (2012) has recently proposed a universal thresholding procedure based on the singular value decomposition.

We may replace the objective function of (3.1)with the full profile likelihood function . The same rate of convergence can then be established with respect to the corresponding term for , adapting the proofs in Appendices A and B.

Assume exists. Terms and in (3) show that elements of and must not approach or too quickly as ; otherwise can be much smaller than .

This is a natural consequence of the fact that the Kullback–Leibler divergence of from is finite if and only if is absolutely continuous with respect to . To see the implication, consider , and generated according to Definition 2.1 with . Let and . Then the maximum-likelihood two-class blockmodel fit to will yield , and so diverges to unless .

## 4 Convergence of co-cluster estimates

We now give our main technical result and show its statistical application in enabling us to interpret the convergence of co-cluster estimates. The estimators of Theorem 3.1 require optimizations over the set of all possible co-clusterings of the data; that is, over vectors and that map the observed vertices to . Analogously, one may also envision an uncountable set of co-clusterings of the generative model, which map the unit interval to . We define these two sets of co-clusterings more formally and then give a result showing in what sense they become close with increasing and , so that optimizing over co-clusters of the data is asymptotically equivalent to optimizing over co-clusters of the generative model. This result yields the rate of convergence appearing in Theorem 3.1, and also has a geometric interpretation that sheds light on the estimators defined by (1) and (3.1).

### 4.1 Relating co-clusterings of to those of

Given a bipartite graph with adjacency matrix , recall that the latent class vectors and respectively partition and into subsets each. To relate an empirical co-clustering of to a piecewise-constant approximation of some , we first define the matrix to index the proportion of edges spanning each of the subset pairs defined by and ,

Second, we define mappings , which will play a role analogous to and . Given some , this allows us to define a matrix which encodes the mass of assigned to each of the subset pairs defined by and as follows:

We will use the matrices and to index all possible co-clusterings that can be induced by partitioning an observed binary array into blocks. To link these sets of co-clusters, recall from Section 3 the sets and of all node-to-class assignment functions that partition and into classes in manners that respect the proportions dictated by and . Analogously, we define (resp., ) to be the set of partitions of into subsets whose cardinalities are of proportions :

We are now equipped to introduce sets and , which describe all possible co-clusterings that can be induced from and with respect to , and to define the related notion of a support function.

[(Sets and of admissible co-clusterings)] For fixed discrete probability distributions and over , we define the sets of all co-clustering matrices and , induced respectively by and , as follows:

[(Support functions of and )] Let be nonempty and with . Its support function is defined as for any , whence {subequations}

(4) | |||||

(5) |

We will show below that converges in probability to zero at a rate of at least , and this result in turn gives rise to Theorem 3.1. To see why, observe that for any , the least squares objective function of (1) can be expressed using as follows:

As we prove in Appendix A, this line of argument establishes the following.

### 4.2 A general result on consistency of co-clustering

From Lemma 4.1 we see that closeness of to implies closeness (up to constant terms) of the least squares objective function of (1) to the risk , and of the profile likelihood of (3.1) to the average Kullback–Leibler divergence of from the generative . Equipped with this motivation, we now state our main technical result, which serves to establish the rate of convergence in Theorem 3.1. Its proof follows in Section 5 below.

###### Theorem 4.1

Let be a separately exchangeable array generated by some in accordance with Definition 2.1. Then for each and each ratio , there exists a universal constant such that as ,

The support functions and also have a geometric interpretation: for any fixed , they define the supporting hyperplanes of the sets and in the direction specified by . Each supporting hyperplane is induced by a point in , or in the closure of respectively; these points are extremal in that they cannot be written as a convex combination of any other points in their respective sets. Evidently, it is only the extreme points which determine convergence properties for the risk functionals considered here. Equivalently, for any fixed parameter triple , the values of these functionals depend only on the maximizing choices of or .

Formally, Theorem 4.1 has the following geometric interpretation:

###### Corollary 4.1

The result of Theorem 4.1 is equivalent to the following: The Hausdorff distance between the convex hulls of and is .

Consider , and denote by the Frobenius norm (i.e., the Hilbert–Schmidt metric on induced by ). The Hausdorff distance between and , based on the metric , is then

This measures the maximal shortest distance between any two elements of and . If these subsets of are furthermore nonempty and bounded, then the Hausdorff distance between their convex hulls and can be expressed in terms of their support functions ,

see, for example, Schneider (1993), as applied to the convex hulls of the closures of and of . In this way, is a natural measure of distance between two convex bodies. Recalling the equivalence of norms on , we see that

Since Theorem 4.1 holds for , the leftmost inequality implies that it also holds for . Now suppose instead that Theorem 4.1 holds for ; by the rightmost inequality, it then also holds for . Thus the result of Theorem 4.1 is equivalent to the statement that

This geometric interpretation is helpful in relating our work to a series of papers by Borgs et al. (2006, 2008, 2012), which explore dense graph limits in depth and statistical applications thereof. Very broadly speaking, Borgs et al. (2008), Theorem 2.9 and Borgs et al. (2012), Theorem 4.6, analyze sets termed quotients, which resemble and . The authors show convergence of these sets in the Hausdorff metric at rate , based on a distance termed the cut metric, and detail implications that can also be related to those of Bickel, Chen and Levina (2011).

In fixing and through our -estimators, we are studying what Borgs et al. term the microcanonical quotients. Because our results require only convergence of the closed convex hulls of and , we are able to obtain an exponentially faster bound on the rate of convergence.

### 4.3 Interpreting convergence of blockmodel estimates

Recall that the -estimators of Theorem 3.1 each involve an optimization over the set by way of its support function, which in turn represents its convex hull. Suppose that optimizes either objective function in Theorem 3.1. Then the following corollary of Theorem 3.1 shows that is interpretable, in that there will exist a partition of yielding co-clusters of equal size and asymptotically equivalent connectivity.

###### Corollary 4.2

We show the latter result; parallel arguments yield the former. Since for the co-blockmodel, by letting and satisfy and we may express as

Thus, for any , there exists some choice of such that

If we now take for , we see by a similar argument that since , we have in turn that

Expanding in accordance with its definition, we then see that

Choosing and applying Theorem 3.1 completes the proof.

Corollary 4.2 ensures that co-blockmodel fits remain interpretable, even in the setting of model misspecification. It establishes that the identification of co-clusters in an observed exchangeable binary array indicates with high probability the existence of co-clusters of equal size and asymptotically equivalent connectivity in the underlying generative process .

## 5 Proof of Theorem 4.1

Our proof strategy is inspired by Borgs et al. (2008) and adapts certain of its tools, but also requires new techniques in order to attain polynomial rates of convergence. Most significantly, we do not use the Szemerédi regularity lemma, which typically features strongly in the graph-theoretic literature, and provides a means of partitioning any large dense graph into a small number of regular clusters. Results in this direction are possible, but instead we use a Rademacher complexity bound for -statistics adapted from Clémençon, Lugosi and Vayatis (2008), allowing us to achieve the improved rates of convergence described above.

### 5.1 Establishing pointwise convergence

The main step in proving Theorem 4.1 is to establish pointwise convergence of to for any fixed . We do this through Proposition 5.1 below, after which we may apply it to a union bound over a covering of all to deduce the result of Theorem 4.1. Appendix B provides a formal statement and proof of this argument, along with proofs of all supporting lemmas.

###### Proposition 5.1 ([Pointwise convergence of to ])

Assume the setting of Theorem 4.1, fixing . Then there exist constants such that, given any , , , , and generated from , it holds for all that

To obtain the claimed result, we must establish lower and upper bounds on the support function that show its convergence to at rate . Recalling the definitions of and in (4.1), we first require a statement of Lipschitz conditions on and . Its proof follows by direct inspection.

###### Lemma 5.1

Define for measurable mappings over the metric

and analogously the standard Hamming distance for sequences, with respect to normalized counting measure. Then for any and , with as defined in Section 4.1, we have that: {longlist}[(1)]

;

;

if differ by a single entry.

In conjunction with McDiarmid’s inequality, these Lipschitz conditions yield the following lower bound on