A Remaining proofs

# Optimal Bipartite Network Clustering

## Abstract

We consider the problem of bipartite community detection in networks, or more generally the network biclustering problem. We present a fast two-stage procedure based on spectral initialization followed by the application of a pseudo-likelihood classifier twice. Under mild regularity conditions, we establish the weak consistency of the procedure (i.e., the convergence of the misclassification rate to zero) under a general bipartite stochastic block model. We show that the procedure is optimal in the sense that it achieves the optimal convergence rate that is achievable by a biclustering oracle, adaptively over the whole class, up to constants. The optimal rate we obtain sharpens some of the existing results and generalizes others to a wide regime of average degree growth. As a special case, we recover the known exact recovery threshold in the regime of sparsity. To obtain the general consistency result, as part of the provable version of the algorithm, we introduce a sub-block partitioning scheme that is also computationally attractive, allowing for distributed implementation of the algorithm without sacrificing optimality. The provable version of the algorithm is derived from a general blueprint for pseudo-likelihood biclustering algorithms that employ simple EM type updates. We show the effectiveness of this general class by numerical simulations.

Keywords: Bipartite networks; stochastic block model; community detection; biclustering; network analysis; pseudo-likelihood, spectral clustering.

refs \renewbibmacroin: \DeclareFieldFormat[article,inbook,incollection,inproceedings,patent,thesis,unpublished]citetitle#1 \DeclareFieldFormat[article,inbook,incollection,inproceedings,patent,thesis,unpublished]title#1

## 1 Introduction

Network analysis has become an active area of research over the past few years, with applications and contributions from many disciplines including statistics, computer science, physics, biology and social sciences. A fundamental problem in network analysis is detecting and identifying communities, also known as clusters, to help better understand the underlying structure of the network. The problem has seen rapid advances in recent years with numerous breakthroughs in modeling, theoretical understanding, and practical applications [fortunato2016community]. In particular, there has been much excitement and progress in understanding and analyzing the stochastic block model (SBM) and its variants. We refer to [abbe2017community] for a recent survey of the field. Much of this effort, especially on the theoretical side has been focused on the univariate (or symmetric) case, while the bipartite counterpart, despite numerous practical applications, has received comparatively less attention. Of course, there has been lots of activity in terms of modeling and algorithm development for bipartite clustering both in the context of networks [Zhou2007, Larremore2014, Wyse2014, Rohe2015, Razaee2017] as well as other domains, such as topic modeling and text mining [Dhillon2001, Dhillon2003], as well as in biological applications [cheng2000biclustering, Madeira2010]. But much of this work either lacks theoretical investigations or has not considered the issue of statistical optimality.

In this paper, we provide a unified treatment of the community detection, or clustering, in the bipartite setting with a focus on deriving fundamental theoretical limits of the problem. The main goal is to propose computationally feasible algorithms for bipartite network clustering that exhibit provable statistical optimality. We will focus on the bipartite version of the SBM which is a natural model for bipartite networks with clusters. SBM is a stochastic network model where the probability of edge formation depends on the latent (unobserved) community assignment of the nodes, often referred to as node labels. The goal of the community detection problem is to recover these labels given an instance of the network. This is a non-trivial task since, for example, maximum likelihood estimation involves a search over exponentially many labels.

Community detection in bipartite SBM is closely related to the biclustering problem, for which many algorithms have been developed over the years [hartigan1972direct, cheng2000biclustering, tanay2002discovering, gao2016optimal]. On the other hand, in recent years, many algorithms have been proposed for clustering in univariate SBMs, including global approaches such as spectral clustering [rohe2011spectral, krzakala2013spectral, lei2013consistency, fishkind2013consistent, vu2014simple, massoulie2014community, yun2014accurate, bordenave2015non, gulikers2017spectral, pensky2017spectral] and convex relaxations via semidefinite programs (SDPs) [amini2014semidefinite, hajek2016achieving, bandeira2015random, guedon2016community, montanari2016semidefinite, ricci2016performance, agarwal2017multisection, perry2017semidefinite], as well as local methods such as belief propagation [decelle2011asymptotic], Bayesian MCMC [nowicki2001estimation] and variational Bayes [celisse2012consistency, bickel2013asymptotic], greedy profile likelihood [bickel2009nonparametric, zhao2012consistency] and pseudo-likelihood [amini2013pseudo] maximization, among others. A limitation of spectral clustering approaches is that they are often not optimal on their own, and the SDPs have the drawback of not being able to fit the full generality of SBMs. Various algorithms can further improve the clustering accuracy, and adapt to the generality of SBM. Profile likelihood maximization was proposed and analyzed in [bickel2009nonparametric], but the underlying optimization problem is computationally infeasible and the approach only applicable to networks of limited size. Pseudo-likelihood ideas were used in [amini2013pseudo] to derive EM type updates to maximize a surrogate to the likelihood of the SBM based on a block compression which is computed using initial labels obtained by spectral clustering.

The pseudo-likelihood approach belongs to the general class of algorithms based on “Good Initialization followed by Fast Local Updates” (GI-FLU) which has been a staple of recent developments in devising optimal clustering algorithms, as was pointed out by [gao2017achieving]. The GI-FLU strategies often use a spectral initialization and due to their often cheap local updates are scalable to very large networks. In cases where they are accompanied by optimality guarantees they seem to occupy the ideal sweet spot in the computational versus statistical trade-off. We build on these ideas and especially the approach in [amini2013pseudo] to extend the algorithms to the bipartite settings. Moreover, we provide modifications to the general blueprint suggested by [amini2013pseudo]—in addition to the natural modification required for the bipartite setup—which allows us to demonstrate optimality of the procedure under the full generality of the bipartite SBM.

In the univariate setting, there has been interesting recent advancements in understanding optimal recovery in what we refer to as the semi-sparse regime, where the (expected) average network degree is allowed to grow to infinity but rather slowly, as the number of nodes increases to infinity. In a series of papers [mossel2015consistency, abbe2016exact, hajek2016achieving, hajek2016achievingExt] the thresholds for optimal exact recovery, also known as strong consistency, were established in the context of simple planted partition models, and SDPs where shown to achieve the optimal threshold. In [abbe2015community], the problem of strong consistency was considered for a general SBM and the optimal threshold for strong consistency was established. In subsequent work [zhang2016minimax, gao2017achieving, gao2016community], the results were extended to include weak consistency, i.e., requiring the fraction of misclassified nodes to go to zero, rather than drop to exactly zero (as in strong consistency), and rates of optimal convergence where established (with some slack in the exponent). To achieve the more relaxed consistency results, [gao2017achieving] limited the model to what we refer to as strongly assortative SBM (see [sdp:sbm:aos] for a definition).

Our work from a theoretical perspective is mostly inspired by the insightful work of [abbe2015community, gao2017achieving]. We extend these ideas by presenting results that are sharper and more general that what has been obtained so far. In short, we only assume that the clusters are distinguishable (in the sense of Chernoff divergence) and the network is not very dense. Relative to [gao2017achieving], our results are shaper, removing the term in the exponents of the rates, and hold for the full generality of the SBM (i.e., no assortativity assumption needed). Compared to [abbe2015community], our work greatly relaxes the assumptions on degree growth. It is well-known that for strong consistency one needs at least , i.e., average degree should grow at least logarithmically, and this is the regime considered in [abbe2015community]. In our work, could grow to infinity arbitrarily slowly or at the other extreme as fast as . Thus, our results establish optimal rates of weak consistency below the sequare-root regime and above the sparse regime , and in particular, between the and regimes, where weak consistency is possible but not the strong consistency. In addition, in contrast to [abbe2015community] (and some results in [gao2017achieving]), we allow all the parameters of the model, including the number of communities, the spread of mean parameters () and the community balance parameter () to grow subject to two compact conditions (namely ((A3)) and ((A4))); these conditions encapsulate in a simple way how much cumulative growth these auxiliary parameters can exhibit relative to the information growth of the model () for the optimal rate to still be achieved by the algorithm we present. We make more detailed comparisons with the work of [abbe2015community, gao2017achieving] in Section 3.

Contributions. Establishing these results require a fair amount of technical and algorithmic novelty over the previous work. Here, we highlight some of these features and point out to the relevant parts of the paper for details:

[1., wide, labelwidth=!, itemsep=.5pt, topsep=2pt]
1. We introduce an efficient sub-block (or sub-graph) partitioning scheme for the provable version of the algorithm, Algorithm 3, which for example replaces the edge splitting idea of [abbe2015community] (cf. Section 3.1 and Remark 6.1 for the shortcomings of edge splitting in our setting). The subblock partitioning is mainly introduced to generate enough independence for the technical arguments to go through. However, the idea turns out to be computationally appealing as well. The computational bottleneck of GI-FLU approaches discussed earlier is often the spectral initialization. The subsequent (often likelihood based) local updates are usually quite fast, , computations. Our block partitioning scheme allows one to break the costly initial step into the application of spectral methods—as well as likelihood ratio classifiers—on smaller subblocks, without losing optimality. If done in parallel, spectral clustering on subblocks will be in fact computationally cheaper than performing a spectral decomposition of the entire matrix. Although, from a theoretical perspective breaking into blocks is enough (in the bipartite settings), our results allow for the number of blocks to even grow slowly to infinity. Thus, the provable version of our algorithm has computational appeal, esp. in distributed settings and for very large networks where it is prohibitive to perform eigendecomposition of the entire adjacency matrix. The algorithm is naturally parallelizable since it proceeds in stages and in each stage the operations on the underlying subblocks can be performed in parallel; see Section 5 for details.

2. Our algorithms being an extension of [amini2013pseudo], are modifications of a natural EM algorithm on mixtures of Poisson vectors, hence very familiar from a statistical perspective. In other words, they are not tailor-made to the community detection problem and are derived from fairly general well-known principals. Although other (optimal) algorithms in the literature are more or less preforming similar operations, the link to EM algorithms and mixture modeling is quite clear in our work. We provide in Section 4.2 the general blueprint of the algorithms based on the pseudo-likelihood idea and block compression (Algorithm 1). We then show how a provable version can be constructed by combining with the sub-block partitioning ideas in Section 5. It is worth noting that although we can provide no guarantees for the general algorithms of Section 4.2, empirically they perform very well, as illustrated in Section 8.

3. In order to get the sharper rate, analyzing a single step of an EM type algorithm is not enough, and thus we analyze the second step as well. We will show that the first step gets us from a good (but crude) initial rate to the fast rate which is in the vicinity of the optimal rate, and then repeating the iteration once more, with the more accurate labels obtained in the first step—hence more accurate parameter estimates—gets us to the optimal rate .

4. Among the technical contributions, are a uniform consistency result (Lemma 6.1) for the likelihood ratio classifier (LRC) over a subset of the parameters close to the truth, sharp approximations for the Poisson-binomial distributions (Section 9.4), and extension (and elucidation) of a novel technique of [abbe2015community] in deriving error exponents to general exponential families (cf. Section 9.3). The uniform consistency result for LRCs lets us tolerate some degree of dependence among the statistics from iteration to iteration, allowing the sub-block partioning idea to go through. That is, we can run LRCs on the same blocks used in estimating their parameters; see Sections 5 and 6.1 for details.

5. The bipartite clustering setup (as opposed to the symmetric case) allows us to introduce an oracle version of the problem which helps in understanding the nature of the optimal rates observed in community detection and their relation to classical hypothesis testing and mixuture modeling. That is, we try to answer the curious question of why or how the Chernoff exponent of a (simple binary) hypothesis testing problem seems to control the misclassification rate in community detection and network clustering. The oracle also provides a lower bound on the performance of any algorithm. See Section 3 and Proposition 3 for details.

The rest of the paper is organized as follows. We introduce the model and the biclustering oracle in Section 2, and then present our main results in Section 3. The general algorithms based on the pseduo-likelihood idea are presented in Section 4 and a provable version in Section 5. The proofs of the consistency results will appear in Sections 6 and 7. In Section 8, we demonstrate the numerical performance of the methods.

## 2 Network biclustering

We start by introducing the network biclustering problem based a stochastic block modeling, and set up some notation that will be used throughout the paper. We then discuss how a biclustering oracle with side information can optimally recover the labels. These ideas will be the basis of the algorithms discussed in this paper.

### 2.1 Bipartite block model

We will be working with a bipartite network which can be represented by a biadjacency matrix , where for simplicity we assume that the nodes on the two sides are indexed by the sets and . We assume that there are and communities for the two sides respectively, and the membership of the nodes to these communities are given by two vectors and . Thus, if node on side 1 belongs to community . We call and the labels of nodes and respectively. We often treat these labels as binary vectors as well, using the identification via the one-hot encoding, that is .

Given the labels and , and a connectivity matrix (also known as the edge probability matrix), the general bipartite stochastic block model (biSBM) assumes that: are Bernoulli variables, independent over with mean parameters,

 \ex[Aij]=yTiPzj=Pkℓ,ifyi=k,zj=ℓ. (1)

We denote this model compactly as . It is helpful to consider the Poisson version of the model as well which is denoted as . This is the same model as the Bernoulli SBM, with the exception that each entry is drawn (independently) from a Poisson variate with mean given in (1). These two models behave very closely when the entries of are small enough. Throughout, we treat , and as unknown deterministic parameters. The goal of network biclustering is to recover these three sets of parameters given an instance of .

In fact, as we will see, the parameters themselves are not that important. What matters is the set of (Poisson) mean parameters which are derived from and the sizes of the communities. In order to define these parameters, let , be the number of nodes in each of the communities of side 2. That is, . A similar notation, namely , denotes the community sizes of side 2. The row mean parameters are defined as

 Λ=(λkℓ)=(Pkℓnℓ(z))=P\diag(n(z))∈\realsK×L (2)

where for a vector is a diagonal matrix with diagonal entries . The column mean parameters can be defined in a similar fashion, namely,

 ΓT=(nk(y)Pkℓ)=\diag(n(y))P∈\realsK×L. (3)

Note the transpose in the above definition, i.e., , and this convention allows us to define information measures based on rows of matrices and in a similar fashion, as will be discussed in Section 3.

### 2.2 Biclustering oracle with side information

The key idea behind the algorithms discussed in this paper, as well as our consistency arguments is the following simple observation: Assume that we have prior knowledge of and the column labels , but not the row labels . For each row, we can sum the columns of according to their column memberships, i.e., we can perform the (ideal) block compression . The vector contains the same information for recovering the community of , as the original matrix —i.e., it is a sufficient statistic. Assume that we are under the model. Then, has the distribution of a vector of independent Poisson variables. More precisely,

 b∗i∗∼Qk:=L∏ℓ=1Poi(λkℓ),if,yi=k, (4)

where are the row mean parameters defined in (2). Note that the distributions are known under our simplifying assumptions. The problem of determining the row labels thus reduces to deciding from which of these known distributions it comes from. Whether node belongs to a particular community can be decided optimally by performing a likelihood ratio (LR) test of against each of .

The above LR test is the heart of the algorithms discussed in Sections 4 and 5. The difficulty of the biclustering problem (relative to a simple mixture modeling) is that in practice, we do not know in advance either or —hence neither the exact test statistics nor the distributions are known. We thus proceed by a natural iterative procedure: Based on the initial estimates of and , we obtain estimates of and , perform the approximate LR test to obtain better estimates of , and then repeat the procedure over the columns to obtain better estimates of . These new label estimates lead to better estimates of and , and we can repeat the process.

We refer to the algorithm that has access to the true column labels and parameters , and performs the optimal LR tests, as the oracle classifier. Note that the performance of this oracle gives a lower bound on the performance of any biclustering algorithm in our model. The performance of the oracle in turn is controlled by the error exponent of the simple hypothesis testing problems versus , as detailed in Proposition 3. This line of reasoning reveals the origin of the information quantities and —defined in  (8) and (9)—that control the optimal rate of the biclustering problem. Note that the bipartite setup has the advantage of disentangling the row and column labels, so that a non-trivial oracle exists. It does not make much sense to assume known column labels in the unipartite SBM, since by symmetry we then know the row labels as well, hence nothing left to estimate. On the other hand, due to the close relation between the bipartite and unipartite problems, the above argument also sheds light on why the error exponent of a hypothesis test is the key factor controlling optimal misclassification rates of community detection in unipartite SBM.

### 2.3 Notation on misclassification rates

Let the set of permutations on . The (average) misclassification rate between two sets of (column) labels and is given by

 Mis(ˆy,y):=minσ∈Πn1nn∑i=11{σ(ˆyi)≠yi}. (5)

Letting be a minimizer in (5), the misclassification rate over cluster is

 Misk(ˆy,y):=1nk(y)∑i:yi=k1{σ∗(ˆyi)≠yi}=|i:σ∗(ˆyi)≠k,yi=k|nk(y), (6)

using the cardinality notation to be discussed shortly. Note that (6) is not symmetric in its arguments. We will also use the notation to denote an optimal permutation in (5). When is sufficiently small, this optimal permutation will be unique. It is also useful to define the direct misclassification rate between and , denoted as , which is obtained by setting the permutation in (5) to the identity. With , we have . We note that

 Mis(ˆy,y)=∑k∈Kπk(y)Misk(ˆy,y)≤maxk∈KMisk(ˆy,y), (7)

as well as . We can similarly define the misclassification rate of an estimate relative to . Our goal is to derive efficient algorithms to obtain and that have minimal misclassification rates asymptotically (as the number of nodes grow).

Other notation. We write w.h.p. as an abbreviation for “with high probability”, meaning that the event holds with probability . To avoid ambiguity, we assume all parameters, including , are functions of . For example, denotes . We write to denote a cyclic group of order . Our convention regarding solutions of optimization problems, whenever more than one exist is to choose one uniformly at random. We use the shorthand notation for cardinality of sets, where is implicit, assuming the is a vector of length . For example, if , we have the identity . It is worth nothing that we use community and cluster interchangeably in this paper, although some authors prefer to use community for the assorative clusters, and use “cluster” to refer to any general group of nodes. We will not follow this convention and no assortativity will be implicitly assumed.

## 3 Main results

Let us start with some assumptions on the mean parameters. Recall the row and column mean parameter matrices and defined in (2) and (3). Let and be the minimum and maximum value of the entries of , respectively, and similarly for . We assume

 \infnormΛΛmin∨\infnormΓΓmin≤ω, (A1)

for some . That is, measures the deviation of the entries of the mean matrices from uniform. We assume that the sizes of the clusters are bounded as

 1βK≤πk(y)≤βKand1βL≤πℓ(z)≤βL (A2)

for all and . The following key quantity controls the misclassification rate:

 Ikr:=Ikr(Λ):=sups∈(0,1)L∑ℓ=1(1−s)λkℓ+sλrℓ−λ1−skℓλsrℓ, (8)

for . We can think of , as an operator acting on pairs of rows of a matrix , say and , producing a pairwise information matrix. We often refer to the function of being maximized in (8) as , with some abuse of notation assuming and are fixed, and we note that this function is strictly concave over whenever , and we have .

Recalling the product Poisson distributions , (8) is the Chernoff exponent in testing the two hypothesis and  [chernoff1952measure]. The difference with the classical setting, in which the Chernoff exponent appears, is that we work in the regime where we are effectively testing based on a sample of size of 1 and instead of the sample size, we let . We define the column information matrix similarly

 Icolℓℓ′:=Iℓℓ′(Γ)=sups∈(0,1)K∑k=1(1−s)Γℓk+sΓℓ′k−Γ1−sℓkΓsℓ′k, (9)

for all . Another set of key quantities in our analysis are:

 εkr:=maxℓ∈[L](λkℓλrℓ∨λrℓλkℓ)−1,εk:=minr∈[K]εkr, and ε:=mink∈[K]εk. (10)

The relation with hypothesis testing is formalized in the following proposition: {prop} Consider the likelihood ratio (LR) testing of the null hypothesis against , based on a sample of size . Let . Assume that as , (a) , and (b) . Then, there exist constants and such that

 Missing or unrecognized delimiter for \big (11)

See Corollary 9.3.2 and Appendix Section A.6 for the proof. Any hypothesis testing procedure can be turned into a classifier, and a bound on the error of the hypothesis test (for a sample of size 1) translates into a bound on the misclassification rate for the associated classifier. This might not be immediately obvious, and we provide a formal statement in Lemma 6.1. Proposition 3 thus provides a precise bound on the misclassification rate of the LR classifier for deciding between and .

The significance of the Chernoff exponent of the hypothesis test in controlling the rates is thus natural, given the full information about the and the test statistics. What is somewhat surprising is that almost the same bound holds when no such information is available a priori. Our main result below is a formalization of this claim. In our assumptions, we include a parameter that controls the number of sub-blocks when partitioning, the details of which are discussed in Section 5. Under the following two assumptions:

 (Q2logQ)β2ω3KL(K∨L)log(K∨L)(∥Λ∥∞∨∥Γ∥∞)2 =O(n∧m), and (A3) (QlogQ)2β3ω2(K∨L)3(α∨α−1)(∥Λ∥∞∨∥Γ∥∞) =o((Imin∧Icolmin)2), (A4)

where , there is an algorithm that achieves almost the same rate as the oracle: {thm}[Main result] Consider a bipartite SBM (Section 2.1) satisfying ((A1))–((A4)). Then, as and , the row labels output by Algorithm 3 in Section 5 satisfies for some ,

 Misk(ˆy,y)=O(ω∑r≠k(1+1εkr)exp(−Ikr−(12−ζ)logΛmin)) (12)

for every , with high probability. Similar bounds holds for w.r.t. .

One can replace the big  with the small  in (12) to obtain an equivalent result (due to the presence of ). Let us discuss the assumptions of Theorem 3. The only real assumptions are ((A3)) and ((A4)). The other two, namely ((A1)) and ((A2)) can be more or less thought of as definitions of and . For example, ((A2)) only imposes the mild constraint that no cluster is empty. Similarly ((A1)) imposes the mild assumption that no entry of or is zero. The main constraints on and are encoded in ((A3)) and ((A4)) in tandem with other parameters of the model. {rem} In the first reading, one can take , and . In this setting, ((A3)) is a very mild sparsity condition, implying that the degrees should not grow faster than . ((A4)) guarantees that the information quantities grow fast enough so that the clusters are distinguishable. We only need ((A4)) for Algorithm 3 which uses a spectral initialization. In Section 5.2.1, we present Theorem 5.2.1, for the likelihood-based portion of the algorithm, assuming that a good initialization is provided. Theorem 5.2.1 only requires a weakened version, ((A4${}^{\prime}$)), of assumption ((A4)).

Depending on the behavior of , the rate obtained in Theorem 3 can exhibit different regimes which are summarized in Corollary 3 below. Consider the additional assumption:

 Missing or unrecognized delimiter for \Big (A5)
{cor}

Under the same assumptions as Theorem 3, w.h.p., for all ,

 Misk(ˆy,y)=o(∑r≠kexp(−Ikr)). (13)

If in addition we assume ((A5)), then for some , w.h.p., for all ,

 Misk(ˆy,y)=O(∑r≠kexp(−Ikr−(12−ζ)logΛmin)). (14)
{rem}

Consider the oracle version of the biclustering problem where the connectivity matrix and the true column labels are given. Then, the optimal row clustering reduces to the likelihood ratio tests in Proposition 3. That is, given the row sums within blocks as sufficient statistics, we compare the likelihoods at two different mean parameters. By Proposition 3, the optimal misclassification rate for the oracle problem is

 Missing or unrecognized delimiter for \big (15)

where the sum over is due to the need to compare against all other clusters. The gap between and is not avoidable when stating high probability results, due to the Markov inequality; see Lemma 6.1 for the details. This error rate coincides with (14), which merely loses a constant due to the unknown mean parameters and column labels. The rate is sharp up to a factor of according to the lower bound in Proposition 3.

In order to understand the rates in Corollary 3 better, let us consider some examples which also clarify our results relative to the previous literature.

{exa}

Consider a simple planted partition model where

 n=m,K=L,Pkk=an,Pkℓ=bn,∀k≠ℓ.

Then, and when . Applying (8) with ,

 Ikr≥12∑ℓ(√λkℓ−√λrℓ)2≥(√a−√b)2βK.

Assume that ((A3)) and ((A4)) hold, that is (using )

 β4ω3(KlogK)a2=O(n∧m)andβ6ω2K4a=o((√a−√b)4).

and further assume that . Then w.h.p., we have

 Misk(ˆy,y)=o(exp(−(√a−√b)2βK)). (16)

For the details of , see Section 7.4. In particular, if

 liminfn→∞(√a−√b)2βKlogn≥1, (17)

we have w.h.p., that is, we have the exact recovery of the labels by Algorithm 3. (Whenever misclassification rate drops below , it should be exactly zero.) Note that this result holds without any assumption of assortativity, i.e., it holds whether or .

{exa}

Suppose that where is a symmetric constant matrix, , , and . Then and are constants. Then,

 λkℓ=˜λkℓlogn,% where˜λkℓ:=˜Pkℓπk(y),andIkr=~Ikrlogn

where is defined based on and as in (8). Assuming in addition that is constant, both and are constants. In this regime, our key assumptions ((A3)) and ((A4)) are satisfied. By Corollary 3, w.h.p., we have

 Misk(ˆy,y)=o(exp(−minr≠k~Ikrlogn))=o(n−minr≠k~Ikr). (18)

As a consequence if , then w.h.p., that is we have exact recovery by Algorithm 3.

### 3.1 Comparison with existing results

Let us now compare with [gao2017achieving] and [abbe2015community] whose results are closest to our work. Both papers consider the symmetric (unipartite) SBM, but the results can be argued to hold in the bipartite setting as well. The setup of Example 3 is more or less what is considered in [gao2017achieving]. They have shown that there is an algorithm with misclassification error bounded by

 Missing or unrecognized delimiter for \Big (19)

We have sharpened this rate to (16) under assumption ((A3)) (i.e., assuming the average degree grows slower than ). Bound (19) implies that when

 liminfn→∞(√a−√b)2βKlogn>1,

one has exact recovery. Our bound on the other hand, imposes the relaxed condition (17).

We note that the results in [gao2017achieving] are derived for a more general class of (assortative) models than that of Example 3, namely, the class with connectivity matrix satisfying and for . The rate obtained in [gao2017achieving] uniformly over this class is dominated by that of the hardest within this class which is the model of Example 3. For other members of this class, neither their rate (19) or the one we gave in (16) is optimal. The optimal rate in those cases is given by the general form of Theorem 3 and is controlled by the general form of in (8). In other words, Algorithm 3 that we present is rate adaptive over the class considered in [gao2017achieving], achieving the optimal rate simultaneously for each member of the class.

A key in our approach is to apply the likelihood-type algorithm twice, in contrast to the single application in [gao2017achieving]. After the second stage we obtain much better estimates of the labels and parameters relative to the initial values, allowing us to establish the sharper forms of the bounds. Another key is the result in Lemma 2 which provides a better error rate than the classical Chernoff bound, using a very innovative technique introduced in [abbe2015community]. Moreover, we keep track of the balance parameter in ((A2)) throughout, allowing it to go to infinity slowly. Last but not least, assortativity is a key assumption in [gao2017achieving], while our result does not rely on it. Besides consistency, our provable algorithm is more computationally efficient in a practical sense. To obtain initial labels, we will apply spectral clustering on very few subgraphs (8 to be exact). However, the provable version of the algorithm in [gao2017achieving] applies spectral clustering for each single node on the rest of the graph excluding that node. If the cost of running the spectral clustering on a network of nodes is , then our approach costs while that of [gao2017achieving] costs roughly . Our algorithm thus has a significant advantage in computational complexity when . To be fair, the algorithm introduced in [gao2017achieving] was for the symmetric SBM, which has the extra complication of dependency in due to symmetry. Our comparison here is mostly with Corollary 3.1 in [gao2017achieving]. In addition, [gao2017achieving] have a result (their Theorem 5) for when grows arbitrarily fast which is not covered by our result. See the discussion below for comments on the symmetric case and dependence on .

The problem of exactly recovery for a general SBM has been considered in [abbe2015community], again for the case of a symmetric SBM, though the results are applicable to the bipartite setting (with ) as well. The model and scaling considered in [abbe2015community] is the same as that of Example 3, and they show that exact recovery of all labels is possible if (and only if) which is the same result we obtain in Example 3 for Algorithm 3. Thus, our result contains that of [abbe2015community] as a special case, namely in the -degree regime with other parameters (such as and the normalized connectivity matrix) kept constant. The results and algorithms of [abbe2015community] do not apply to the general model in our paper; consider the following two points:

[1., wide, labelwidth=!, itemsep=.5pt, topsep=2pt]
1. Only the regime , i.e., the degree grows as fast as , is investigated in [abbe2015community], while we allow the degree to grow in the range from “arbitrarily slowly” up to “as fast as ”.

2. One needs independent versions of the adjacency matrix in different stages of the algorithm. To achieve this goal, edge splitting was introduced in [abbe2015community]. The idea is that one can regard the two (or more) graphs obtained from edge splitting to be nearly independent. To be specific, let be the joint probability measure corresponding to a pair of graphs and generated independently with connectivity matrices and . Let be the joint probability measure on and obtained by edge splitting from a single SBM with connectivity matrix , assigning every edge independently to either or with probabilities and . Then, and have the same marginal distributions. Having a vanishing total variation between and is necessary for further analysis which, as was pointed out by [abbe2015community, pp. 46-47], is equivalent to showing that under , and do no share any edge, with high probability. Letting ,

 P1(G1 and G2 do not share edges)≤(1−(1−q)q˜P2min(logn)2n2)n2

which is strictly bounded away from unless , that is, the connectivity matrix of either or should vanish faster than . Our consistency result will not hold in this regime. Thus, edge splitting cannot be used to derive the results in this paper, and we introduce the block partitioning idea to supply us with the independent copies necessary for analysis. Another technical issue about edge splitting is discussed in Remark 6.1.

### 3.2 Discussion

Our results do not directly apply to the symmetric case, due to the dependence between the upper and lower triangular parts of the adjacency matrix . However, a more sophisticated two-stage block partitioning scheme can be used to derive similar bounds under mild extra assumptions. One starts with an asymmetric partition into blocks of sizes , for very slowly. In the first stage, one applies a similar procedure as described in Algorithm 3 on the upper triangular portion of the large subblock , followed by the application of the LR classifier on the fat block to obtain very accurate row labels of the small block .. One then repeats the process using the “leave-one-out” of [gao2017achieving], but applied to small blocks rather than individual nodes. We leave the details for a future work.

It was also shown by [gao2017achieving, Theorem 5] that their equivalent of condition ((A1)) can be removed by modifying the algorithm. In their setting, without assuming , a misclassification rate of is achievable, where is a variable in the new version of their algorithm. If those arguments can be extended to the general block model, it will be possible to relax the requirements on in ((A3)) and ((A4)). When , one can completely remove sparsity condition ((A3)) using a much sharper Poisson-binomial approximation than what we have used in this paper. Finally, we suspect that our result could be generalized beyond SBMs to biclustering arrays where the row and column sums over clusters follow Poissonian central limit theorems. We will explore these ideas in the future.

## 4 Pseudo-likelihood approach

In this section, after introducing the local and global mean parameters which will be used throughout the paper, we present our general pseudo-likelihood approach to biclustering.

### 4.1 Local and global mean parameters

Let us define the following operator that takes an adjacency matrix and row and column labels and , and outputs the corresponding (unbiased) estimate of its mean parameters:

 [L(A,~y,~z)]kℓ=1nk(~y)n∑i=1m∑j=1Aij1{~yi=k,~zj=ℓ},k∈[K],ℓ∈[L]. (20)

Note that is a matrix with nonnegative entries. In general, we let

 ^Λ =(^λkℓ):=L(A,~y,~z), (21) Λ(~y,~z) =(λkℓ(~y,~z)):=L(\ex[A],~y,~z), (22)

for any row and column labels and . Here is the estimate of the true row mean matrix. Its expectation is due to the linearity of . We call , the (global) row mean parameters associated with labels and . (We do not explicitly show the dependence of on the labels, in contrast to the mean parameters.) We have the following key identity

 Λ(~y,~z)∣~y=y,~z=z=Λ (23)

where is the true (global) row mean parameter matrix defined in (2). In words, (23) states that the global row mean parameters associated with the true labels and , are the true such parameters. We will also use parameters such as which are obtained based on the true row labels and generic column labels .

We also need local versions of all these definitions which are obtained based on submatrices of . More precisely, let be a submatrix of , and let and be the corresponding subvectors of and (i.e., corresponding to the same row and column index sets used to extract the submatrix). Here range over creating a partition of into subblocks. We call

 Λ(q′,q)(~y,~z):=(λ(q′,q)kℓ(~y,~z)):=L(\ex[A(q′,q)],~y(q′),~z(q)), (24)

the local row mean parameters associated with submatrix and sublabels and . The corresponding estimates are defined similarly (by replacing with ). We will mostly work with submatrices obtained from a partition of into (nearly) equal-sized blocks—the details of which are described in Section 5. In such cases,

 Λ(q′,q)(~y,~z)≈1QΛ(~y,~z),∀q′,q∈[Q]

assuming the each subblock in the partition has nearly similar cluster proportions: . This is the case, for example, for a random partition as we show in Section 6.2. Of special interest is when we replace both and with true labels and . In such cases, does not depend on . More precisely, we have for any ,

 λ(q′,q)kℓ(y,z)=Pkℓnℓ(z(q)),∀q′∈[Q], (25)

where is the number of labels in class in , consistent with our notation for the full label vectors. We often write as a shorthand for which is justified by the above discussion. These will be called the true local row mean parameters (associated with column subblock in the partition).

### 4.2 General pseudo-likelihood algorithm

Let us now describe our main algorithm based on the pseudo-likelihood (PL) idea, which is a generalization of the approach in [amini2013pseudo] to the bipartite setup. The pseudo-likelihood algorithm (PLA) is effectively an EM algorithm applied to the approximate mixture of Poissons obtained from the block compression of the adjacency matrix . It relies on some initial estimates of the row and column labels to perform the first block compressions (for both rows and columns). The initialization is often done by spectral clustering and will be discussed in Section 5.2.2 once we introduce the provable version of the algorithm. Subsequent block compressions are performed based on the label updates at previous steps of PLA.

Let us assume that we have obtained labels and as estimates of the true labels and . We focus on the steps of PLA for recovering the row labels. Let us define an operator that takes approximate columns labels and produces the corresponding column compression of :

 B(A;~z):=b(~z):=(biℓ(~z))∈Zn×L+,biℓ(~z):=m∑j=1Aij1{~zj=ℓ}. (26)

The distribution of is determined by the row class of . It is not hard to see that

 \ex[biℓ(~z)]=λkℓ(y,~z)=λkℓ(y,~z)|~z=z,if yi=k, (27)

where are the (global) row mean parameters defined in (22).

Now consider an operator that given the column compression and the initial estimate of the row labels , produces estimates of the (row) mean parameters :

 L(b;~y):=^Λ:=[^λkℓ]∈\realsK×L+,^λkℓ:=1nk(~y)n∑i=1biℓ1{~yi=k}. (28)

Note that if , we have . The definition of the estimates in (28) are consistent with those of (21) due to the following identity:

 L(B(A;~z);~y)=L(A,~y,~z)

which holds for any row labels and column labels . Let us write

 π(~y):=(πk(~y)),πk(~y):=1nn∑i=11{~yi=k} (29)

for the estimate of (row) class priors based on . We note that the operation and remain valid even if and are soft labels with a minor modification. By a soft row label we mean a probability vector on : and , which denotes a soft assignment to each row cluster. To extend (26) to soft row labels, it is enough to replace with . Extending (28) to soft column labels is done similarly.

Now, given any block compression and any estimate of the (row) mean parameters and any estimate of the (row) class prior, consider the operator that outputs the (row) class posterior assuming that the rows of approximately follow :

 F(b,^Λ,~π):=(^πik)∈[0,1]n×K,^πik:=~πk∏Lℓ=1φ(biℓ,^λkℓ)∑Kk′=1~πk′∏Lℓ=1φ(biℓ,^λk′ℓ) (30)

where is the Poisson likelihood (up to constants). In practice, we only use or a flat prior as the estimated prior in this step; similarly, we only use a block compression which is based on estimates of row labels, i.e., for some . Note that outputs soft-labels which can be considered our new estimates of . We can convert to hard labels if needed.

Algorithm 1 summarizes the general blueprint of PLA, which proceeds by iterating the three operators (26), (28) and (30). Optional conversion from soft to hard labels is performed by MAP assignment per row. With option 2 in step 6, the inner loop on lines 4–9 is the EM algorithm for a mixture of Poisson vectors. We can also remove the inner loop and perform iterations 5–8 only once. In total, Algorithm 1 has (at least) 6 possible versions, depending on whether we include each of the steps 8 or 11 (for the soft to hard label conversion) and whether to implement the inner loop till convergence or only for one step. We provide empirical results for two of these versions in Section 8. In practice, we recommend to keep soft labels throughout, and only run the inner loop for a few iterations (maybe even one if the computational cost is of significance).

{rem}

[PL naming] We have borrowed the name pseudo-likelihood (PL) from [amini2013pseudo] based on which the algorithms in this paper are derived. In [amini2013pseudo], the setup is that of the symmetric SBM, and in order to treat the full likelihood as the product of independent (over nodes ) of the mixture of Poisson vectors, one has to ignore the dependence among the upper and lower triangular parts of the adjacency matrix, making the PL naming more inline with the traditional use of the term. In our bipartite setup, there is no such dependence to ignore, but we have kept the name PL for consistency with [amini2013pseudo] and ease of use. We interpret the “pseudo” nature of the likelihood as the approximation used in the block compression stage (with imperfect labels) and in replacing Poisson-binomial distribution with the Poisson.

### 4.3 Likelihood ratio classifier

A basic simplified building block of the PLA is given in Algorithm 2. This operation—which will play a key role in the development of the provable version of the algorithm in Section 5—can be equivalently described as a likelihood ratio classifier (LRC). Let us write the joint Poisson likelihood (up to a constant) as:

 Φ(x,λ)=L∏ℓ=1φ(xℓ,λℓ)=L∏ℓ=1exp(xℓlogλℓ−λℓ),x∈\realsL,λ∈\realsL+, (31)

and the corresponding likelihood ratio as:

 Ψ(x;λ∣λ′)=logΦ(x,λ)Φ(x,λ′)=L∑ℓ=1xℓlogλℓλ′ℓ+λ′ℓ−λℓ,x∈\realsL,λ,λ′∈\realsL+. (32)

Recalling the column compression (26), the likelihood ratio classifier, based on initial row labels and an estimate of the row mean parameter matrix, is

 [LR(A,~Λ,~z)]i∈\argmaxr∈[K]logΦ(bi∗(~z),~λr∗),i∈[n]. (33)

which gives us a refined estimate of the row labels (i.e., ). It is not hard to see that the output of Algorithm 2 is .

## 5 Provable version

When analyzing Algorithm 2, we need the initial labels to be independent of the adjacency matrix. Hence, we cannot apply the initialization method (e.g., the spectral clustering) and the likelihood ratio classifier (Algorithm 2) on the same adjacency matrix , iteratively. In this section, we introduce an algorithm, namely Algorithm 3, that partitions into sub-blocks and operates iteratively on collections of these blocks to maintain the desired independence. For this version of the pseudo-likelihood algorithm, our main result, Theorem 3, holds.

Let us assume that and are divisible by . This assumption is not necessary but helps simplify the notations. Let us write

 ˆy=rowSC(A),ˆz=colSC(A)

to denote labels obtained by applying the spectral clustering, respectively, on rows and columns of the adjacency matrix , the details of which are discussed in Section 5.2.2 below. We have . We also recall the LR classifier defined in (33). For matrices (or vectors) and , we use to denote column concatenation and to denote row concatenation.