Community Detection in Bipartite Networks with Stochastic Blockmodels
Abstract
In bipartite networks, community structures are restricted to being disassortative, in that nodes of one type are grouped according to common patterns of connection with nodes of the other type. This makes the stochastic block model (SBM), a highly flexible generative model for networks with block structure, an intuitive choice for bipartite community detection. However, typical formulations of the SBM do not make use of the special structure of bipartite networks. Here, we introduce a Bayesian nonparametric formulation of the SBM and a corresponding algorithm to efficiently find communities in bipartite networks which parsimoniously chooses the number of communities. The biSBM improves community detection results over general SBMs when data are noisy, improves the model resolution limit by a factor of , and expands our understanding of the complicated optimization landscape associated with community detection tasks. A direct comparison of certain terms of the prior distributions in the biSBM and a related highresolution hierarchical SBM also reveals a counterintuitive regime of community detection problems, populated by smaller and sparser networks, where nonhierarchical models outperform their more flexible counterpart.
pacs:
89.75.Hc 02.50.Tt 89.70.CfI Introduction
A bipartite network is defined as having two types of nodes, with edges allowed only between nodes of different types. For instance, a network in which edges connect people with the foods they eat is bipartite, as are other networks of associations between two classes of objects. Recent applications of bipartite networks include studies of plants and the pollinators that visit them Young et al. (2019), stock portfolios and the assets they comprise Squartini et al. (2017), and even U.S. Supreme Court justices and the cases they vote on Guimerà and SalesPardo (2011). More abstractly, bipartite networks also provide an alternative representation for hypergraphs in which the two types of nodes represent the hypergraph’s nodes and its hyperedges, respectively Ghoshal et al. (2009); Chodrow (2020).
Many networks exhibit community structure, meaning that their nodes can be divided into groups such that the nodes within each group connect to other nodes in other groups in statistically similar ways. Bipartite networks are no exception, but they exhibit a particular form of community structure because typeI nodes are defined by how they connect to typeII nodes, and vice versa. For example, in the bipartite network of people and the foods they eat, vegetarians belong to a group of nodes which are defined by the fact that they never connect to nodes in the group of meatcontaining foods; meatcontaining foods are defined by the fact that they never connect to vegetarians. While the group structure in this example comes from existing node categories, one can also ask whether statistically meaningful groups could be derived solely from the patterns of the edges themselves. This problem, typically called community detection, is the unsupervised task of partitioning the nodes of a network into statistically meaningful groups. In this paper, we focus on the community detection problem in bipartite networks.
There are many ways to find community structure in bipartite networks, including both general methods—which can be applied to any network—and specialized methods derived specifically for bipartite networks. We focus on a family of models related to the stochastic blockmodel (SBM), a generative model for community structure in networks Holland et al. (1983). Since one of the SBM’s parameters is a division of the nodes into groups, community detection with the SBM simply requires a method to fit the model to network data. With inference methods becoming increasingly sophisticated Peixoto (2019), many variants of the SBM have been proposed, including those that accommodate overlapping communities Airoldi et al. (2008); GodoyLorite et al. (2016), broad degree distributions Karrer and Newman (2011), multilayer networks TarrésDeulofeu et al. (2019), hierarchical community structures Peixoto (2014a), and networks with metadata Hric et al. (2016); Newman and Clauset (2016); Peel et al. (2017). SBMs have also been used to estimate network structure or related observational data even if the measurement process is incomplete and erroneous Young et al. (2019); Newman (2018a, b); Peixoto (2018). In fact, a broader class of socalled mesoscale structural inference problems, like coreperiphery identification and imperfect graph coloring, can also be solved using formulations of the SBM, making it a universal representation for a broad class of problems Young et al. (2018); Olhede and Wolfe (2014).
At first glance, the existing SBM framework is readily applicable to bipartite networks. This is because, at a high level, the two types of nodes should correspond naturally to two blocks with zero edges within each block, implying that SBMs should detect the bipartite split without that split being explicitly provided. However, past work has shown that providing node type information a priori improves both the quality of partitions and the time it takes to find them Larremore et al. (2014). Unfortunately those results, which relied on local search algorithms to maximize model likelihood Karrer and Newman (2011); Larremore et al. (2014), have been superseded by more recent results which show that fitting fully Bayesian SBMs using Markov chain Monte Carlo can find structures in a more efficient and nonparametric manner Peixoto (2017); Riolo et al. (2017); Peixoto (2019). These methods maximize a posterior probability, producing similar results to traditional cross validation by link predictions in many (but not all) cases Kawamoto and Kabashima (2017a); VallèsCatalà et al. (2018). In this sense, they avoid overfitting the data, i.e., they avoid finding a large number of communities whose predictions fail to generalize. This raises the question of whether the more sophisticated Bayesian SBM methods gain anything from being customized for bipartite networks, like the previous generation of likelihoodbased methods did Larremore et al. (2014).
In this paper, we begin by introducing a nonparametric Bayesian bipartite SBM (biSBM) and show that bipartitespecific adjustments to the prior distributions improve the resolution of community detection by a factor of , compared with the general SBM Peixoto (2013). As with the general SBM, the biSBM automatically chooses the number of communities and controls model complexity by maximizing the posterior probability.
After introducing a bipartite model, we also introduce an algorithm, designed specifically for bipartite data, that efficiently fits the model to data. Importantly, this algorithm can be applied to both the biSBM and its general counterpart, allowing us to isolate both the effects of our bipartite prior distributions and the effects of the search algorithm itself. As in the maximum likelihood case Larremore et al. (2014), the ability to customize the search algorithm for bipartite data provides both improved community detection results, as well as a more sophisticated understanding of the solution landscape, but unlike that previous work, this algorithm does more than simply require that blocks consist of only one type of node. Instead, the algorithm explores a twodimensional landscape of model complexity, parameterized by the number of typeI blocks and the number of typeII blocks. This contributes to the growing body of work that explores the solution space of community detection models, including methods to sample the entire posterior Riolo et al. (2017), count of the number of metastable states Kawamoto and Kabashima (2019), and determine the number of solution samples required to describe the landscape adequately Calatayud et al. (2019).
In the following sections, we introduce a degreecorrected version of the bipartite SBM Larremore et al. (2014), which combines and extends two recent advances. Specifically, we recast the bipartite SBM Larremore et al. (2014) in a microcanonical and Bayesian framework Peixoto (2017) by assuming that the number of edges between groups and degree sequence are fixed exactly, instead of only in expectation. We then derive its likelihood, introduce prior distributions that are bipartitespecific, and describe an algorithm to efficiently fit the combined nonparametric Bayesian model to data. We then demonstrate the impacts of both the bipartite priors and algorithm in synthetic and realworld examples, and explore their impact on the maximum number of communities that our method can find, i.e., its resolution limit, before discussing the broader implications of this work.
Ii The microcanonical bipartite SBM
Consider a bipartite network with nodes of type I and nodes of type II. The typeI nodes are divided into blocks and the typeII nodes are divided into blocks. Let and . Rather than indexing different types of nodes separately, we index the nodes by and annotate the block assignment of node by . A key feature of the biSBM is that each block consists of only one type of node.
Having divided nodes into blocks, we can now write down the propensities for nodes in each block to connect to nodes in the other blocks. Let be the total number of edges between blocks and . Then, let be the degree of node . Together, and specify the degrees of each node and the patterns by which edges are placed between blocks. The number of edges attached to a group must be equal to the sum of its degrees, such that for any . For bipartite networks, for all . We use to denote the number of nodes in block .
Given the parameters above, one can generate a network by placing edges that satisfy the constraints imposed by e and k. However, that network would be just one of an ensemble of potentially many networks, all of which satisfy the constraints, analogous to the configuration model Bollobás (1980); Fosdick et al. (2018). Peixoto showed how to count the number of networks in this ensemble Peixoto (2012), so that for a uniform distribution over that ensemble, the likelihood of observing any particular network is simply the inverse of the ensemble size. This means that, given e, k, and the group assignments , computing the size of the ensemble is tantamount to computing the likelihood of drawing a network with adjacency matrix A from the model, . Thus, treating networks as equiprobable microstates in a microcanonical ensemble leads to the microcanonical stochastic blockmodel, whose bipartite version we now develop, specifically to find communities in realworld bipartite networks. This derivation follows directly from combining the bipartite formulation of the SBM Larremore et al. (2014) with the microstate counting developed in Peixoto (2012). We introduce a new algorithm to fit the model in Sec. IV.
Iii Nonparametric Bayesian SBM for Bipartite Networks
We first formulate the community detection problem as a parametric inference procedure. The biSBM is parameterized by a partition of nodes into blocks b, the number of edges between blocks e, and the number of edges for each node, k. However, for empirical networks, we need only search the space of partitions b. This is because the microcanonical model specifies the degree sequence k exactly, so the only way that an empirical network can be found in the microcanonical ensemble is if the parameter k is equal to the empirically observed degree sequence. Note that, when k and b are both specified, e is also exactly specified. As a consequence, community detection requires only a search over partitions of the nodes into blocks b.
In the absence of constraints on b, the maximum likelihood solution is simply for the model to memorize the data, placing each node into its own group and letting . To counteract this tendency to dramatically overfit, we adapt the Bayesian nonparametric framework of Peixoto (2017), where the number of groups and other model parameters are determined from the data, and customize this framework for the situation in which the data are bipartite. We start by factorizing the joint distribution for the data and the parameters in this form,
(1) 
where , , and are prior probabilities that we will specify in later subsections. Thus, Eq. (1) defines a complete generative model for data and parameters.
The Bayesian formulation of the SBM is a powerful approach to community detection because it enables model comparison, meaning that we can use it to choose between different model classes (e.g., hierarchical vs flat) or to choose between parameterizations of the same model (e.g., to choose the number of communities). Two approaches to model comparison, producing equivalent formulations of the problem, are useful. The first formulation is that of simply maximizing Eq. (1), taking the view that the model which maximizes the joint probability of the model and data is, statistically most justified. The second formulation is that of minimizing the socalled description length Rissanen (2007), which has a variety of interpretations (for a reviews and update, see Grünwald (2007); Grünwald and Roos (2019)). Perhaps the most useful interpretation for our purposes is that of compression, which takes the view that the best model is one which allows us to most compress the data, while accounting for the cost to describe the model itself. In this phrasing, for a model class , the description length is given by . These two terms can be interpreted as the description cost of compressing the data A using the model and the cost of expressing the model itself, respectively. Therefore, the minimum description length (MDL) approach can be interpreted as optimizing the tradeoff between better fitting but larger models. Asymptotically, MDL is equivalent to the Bayesian Information Criterion (BIC) Schwarz (1978) for stochastic blockmodels under compatible prior assumptions Peixoto (2014a); Yan et al. (2014).
A complete and explicit formulation of model comparison will be provided in the context of our studies of empirical data in Sec. VII, using strict MDL approaches. For now, we proceed with calculating the likelihood and prior probabilities for the microcanonical biSBM and its parameters.
iii.1 Likelihood for microcanonical bipartite SBM
The observed network A is just one of networks in the microcanonical ensemble which match exactly. Assuming that each configuration in the network ensemble is equiprobable, computing the likelihood is equivalent to taking the inverse of the size of the ensemble. We compute the size of the ensemble by counting the number of networks that match the desired block structure and dividing by the number of equivalent network configurations without block structure , yielding,
(2) 
As detailed in Ref. Peixoto (2017), the number of networks that obey the desired block structure determined by e is given by,
(3) 
This counting scheme assumes that halfedges are distinguishable. In other words, it differentiates between permutations of the neighbors of the same node, which are all equivalent (i.e., correspond to the same adjacency matrix). To discount equivalent permutations of neighbors, we count the number of halfedge pairings that correspond to the bipartite adjacency matrix A,
(4) 
Note that while selfloops are forbidden, this formulation allows the possibility of multiedges.
iii.2 Prior for the degrees
The prior for the degree sequence follows directly from Ref. Peixoto (2017) because k is conditioned on e and b, which are bipartite. The intermediate degree distribution , with being the number of nodes with degree that belong to group , further factorizes the conditional dependency. This allows us to write
(5) 
where
(6) 
is a uniform distribution of degree sequences constrained by the overall degree counts, and
(7) 
is the distribution of the overall degree counts. The quantity is the number of restricted partitions of the integer into at most parts Andrews (1998). It can be computed via the following recurrence relation,
(8) 
with boundary conditions for , and for or . With this, computing for and requires additions of integers. In practice, we precompute using the exact Eq. (8) for (or when the network is smaller), and resort to approximations Peixoto (2017) only for larger arguments.
For sufficiently many nodes in each group, the hyperprior Eq. (7) will be overwhelmed by the likelihood, and the distribution of Eq. (5) will approach the actual degree sequence. In such cases, the prior and hyperprior naturally learn the true degree distribution, making them applicable to heterogeneous degrees present in realworld networks.
iii.3 Prior for the node partition
The prior for the partitions b also follows Ref. Peixoto (2017) in its general outline, but the details require modification for bipartite networks. We write the prior for b as the following Bayesian hierarchy
(9) 
where , the number of nodes in each group. We then assume that this prior can be factorized into independent priors for the partitions of each type of node, i.e., . This allows us to treat the terms of Eq. (9) as
(10) 
(11) 
and
(12) 
Equation (11) is a uniform hyperprior over all such histograms on the node counts n, while Eq. (12) is a prior for the number of nonempty groups itself. This Bayesian hierarchy over partitions accommodates heterogeneous group sizes, allowing it to model the group sizes possible in realworld networks.
iii.4 Prior for the bipartite edge counts
We now introduce the prior for edge counts between groups, e, which also requires modification for bipartite networks. While the edge count prior for general networks is parameterized by the number of groups , the analogous prior for bipartite networks is parameterized by and . We therefore modify the counting scheme of Ref. Peixoto (2017), written for general networks, to avoid counting nonbipartite partitions that place edges between nodes of the same type. Our prior for edge counts between groups is therefore
(13) 
where counts the number of grouptogroup combinations when edges are allowed only between typeI and typeII nodes. The notation counts the number of histograms with bins whose counts sum to . Similar to the uniform prior for general networks Peixoto (2017), it is unbiased and maximally noninformative, but by neglecting mixedtype partitions, this prior results in a more parsimonious description. In later sections, we show that this modified formulation enables the detection of smaller blocks, improving the socalled resolution limit, by reducing model complexity for larger and .
iii.5 Model summary
Having fully specified the priors in previous subsections, we now substitute our calculations into Eq. (1), the joint distribution for the biSBM, yielding,
(14) 
Inference of the biSBM reduces to the task of sampling this distribution efficiently and correctly. Although Eq. (14) is somewhat daunting, note that k and e are implicit functions of the partition b, meaning Eq. (14) depends only on the data and the partition b. This opens the door to efficient sampling of the posterior distribution via Markov chain Monte Carlo which we discuss in Sec. IV.
iii.6 Comparison with the hierarchical SBM
In deriving the biSBM, we replaced the SBM’s uniform prior for edge counts with a bipartite formulation Eq. (13). However, one can instead replace it with a Bayesian hierarchy of models (Eq. (23); Peixoto (2014a)). In this hierarchical SBM (hSBM), the matrix e is itself considered as an adjacency matrix of a multigraph with nodes and edges, allowing it to be modeled by a second SBM. Of course, the second SBM also has an edge count matrix with the same number of edges and fewer nodes, so the process of modeling each edge count matrix using another SBM can be done recursively until the model has only one block. In so doing, the hSBM typically achieves a higher posterior probability (which corresponds to higher compression, from a description length point of view) than nonhierarchical (or “flat”) models, and can therefore identify finerscale community structure.
The hSBM’s edge count prior allows it to find finer scale communities and more efficiently represent network data. However, as we will see, when the network is small and has no hierarchical structure, the hSBM can actually underfit the data, finding too few communities, due to the overhead of specifying a hierarchy even when none exists. The scenarios in which the flat bipartite prior has advantages over its hierarchical counterpart are explored in Sec. V.
Iv Fitting the model to data
The mathematical formulation of the biSBM takes full advantage of a network’s bipartite structure to arrive at a better model. Here, we again make use of that bipartite structure to accelerate and improve our ability to fit the model, Eq. (14), to network data.
At a high level, our algorithm for model fitting consists of two key routines. The first routine is typical of SBM inference, and uses Markov chain Monte Carlo importance sampling Metropolis et al. (1953); Hastings (1970); Peixoto (2014b), followed by simulated annealing, to explore the space of partitions, conditioned on fixed community counts. In this routine, we accelerate mixing time by making use of the bipartite constraint, specifying a Markov chain only over states (partitions) with one type of node in each block. Importantly, this constraint has the added effect that we must fix both block counts, and , separately.
The second routine of our algorithm consists of an adaptive search over the twodimensional space of possible , using the ideas of dynamic programming Cormen et al. (2009); Erickson (2019). It attempts to move quickly through those parts of the plane that are low probability under Eq. (14) without calling the MCMC routine, and instead allocating computation time for the regions that better explain the data. The result is an effective algorithm, with two separable routines, which makes full use of the network’s bipartite structure, allowing us to either maximize or sample from the posterior Eq. (14).
One advantage of having decoupled routines in this way is that the the partitioning engine is a modular component which can be swapped out for a more efficient alternative, should one be engineered or discovered. Reference implementations of two SBM partitioning algorithms, a KernighanLininspired local search Kernighan and Lin (1970); Karrer and Newman (2011); Larremore et al. (2014) and the MCMC algorithm, are freely available as part of the bipartiteSBM library Yen ().
Alternative methods for model fitting exist. For instance, it is possible to formulate a Markov chain over the entire space of partitions whose stationary distribution is the full posterior, without conditioning on the number of groups. In such a scheme, transitions in the Markov chain can create or destroy groups Riolo et al. (2017), and the MetropolisHastings principles guarantee that this chain will eventually mix. However, this approach turns out to be too slow to be practical because the chain gets trapped in metastable states, extending mixing times.
Another alternative approach is to avoid our twodimensional search over and , and instead search over . This is the approach of Ref. Peixoto (2013), where, after proving the existence of an optimal number of blocks , a goldenratio onedimensional search is used to efficiently find it.
iv.1 Inference routine
The task of the MCMC inference routine is to maximize Eq. (14), conditioned on fixed values of and . Starting from an initial partition , the MCMC algorithm explores the space of partitions with fixed and by proposing changes to the block memberships b, and then accepting or rejecting those moves with carefully specified probabilities. As is typical, those probabilities are chosen so that the probability that the algorithm is at any particular partition is equal to the posterior probability of that partition, given and , by enforcing the MetropolisHastings criterion.
Rather than initializing the MCMC procedure from a fully random initial partition, we instead use an agglomerative initialization Peixoto (2014a) which reduces burnin time and avoids getting trapped in metastable states that are common when group sizes are large. The agglomerative initialization amounts to putting each node in its own group and then greedily merging pairs of groups of matching types until the specified and remain.
After initialization, each step consists of proposing to move a node from its current group to a new group . Following Peixoto (2017), proposal moves are generated efficiently in a twostep procedure. First, we sample a random neighbor of node and inspect its group membership . Then, with probability we choose uniformly at random from ; otherwise, we choose with probability proportional to the number of edges leading to that group from group , i.e., proportional to .
A proposed move which would violate the bipartite structure by mixing node types, or which would leave group empty, is rejected with probability one. A valid proposed move is accepted with probability
(15) 
where
(16) 
Here, is the fraction of neighbors of node which belong to block and and is an arbitrary parameter that enforces ergodicity. The term is an inversetemperature parameter, and is the difference between the entropies of the biSBM’s microcanonical ensemble in its current state and in its proposed new state. With this in mind,
(17) 
where variables without primes represent the current state () and variables with primes correspond to the state being proposed ().
The initialization, proposal, and evaluation steps of the algorithm above are fast. With continuous bookkeeping of the incident edges to each group, proposals can be made in time , and are engineered to substantially improve the mixing times since they remove an explicit dependency on the number of groups which would otherwise be present with the fully random moves Peixoto (2014a). Then, when evaluating Eq. (17), we need only a number of terms proportional to . In combination, the cost of an entire “sweep,” consisting of one proposed move for each node in the network, is . The overall number of steps necessary for MCMC inference is therefore , where is the average mixing time of the Markov chain, independent of .
Our bipartiteSBM implementation Yen () has the following default settings, chosen to stochastically maximize Eq. (14) for fixed and via a simulated annealing process. We first let , and perform sweeps at to reach equilibrated partitions. Then we perform zerotemperature () sweeps, in which only moves leading to a strictly lower entropy are allowed. We keep track of the system’s entropy during this process and exit the MCMC routine when no recordbreaking event is observed within a sweeps window, or when the number of sweeps exceeds , whichever is earlier. The partition b at the end corresponds to the lowest entropy. Equivalently stated, this partition b corresponds to the minimum description length or highest posterior probability, for fixed and . The minimal entropy at each stage is bookmarked for future decisionmaking processes.
The bipartite MCMC formulation is more than just similar to its general counterpart. In fact, one can show that for fixed and , the Markov chain transition probabilities dictated by Eq. (17) are identical for the uniform bipartite edge count prior Eq. (13) and its general equivalent introduced in Peixoto (2017). This means that the MCMC algorithm explores the same entropic landscape for both bipartite and general networks when and are fixed. As we will demonstrate in Sec. V, however, by combining the MCMC routine with both the novel search routine over the block counts and the more sensitive biSBM priors, we can better infer model parameters in bipartite networks.
iv.2 Search routine
The task of the search routine is to maximize Eq. (14) over the plane, i.e., to find the optimal number of groups. However, maximizing Eq. (14) for any fixed choice of requires the MCMC inference introduced above, motivating the need for an efficient search. If we were to treat the network as unipartite, a onedimensional convex optimization on the total number of groups with a search cost of Peixoto (2013) could be used. On the other hand, exhaustively exploring the plane of possibilities would incur a search cost of , where is the maximum value of which can be detected. In fact, our experiments indicate that neither the general unipartite approach nor the naive bipartite approach is optimal. The plane search is too slow, while the line search undersamples local maxima of the landscape, which is typically multimodal. Instead, we present a recursive routine that runs much faster than exhaustive search, which parameterizes the tradeoff between search speed and search accuracy by rapidly finding the highprobability region of the plane without too many calls to the more expensive MCMC routine.
We provide only a brief outline of the search algorithm here, supplying full details in Appendix A. The search is initialized with each node in its own block. Blocks are rapidly agglomerated until . This is the socalled resolution limit, the maximum number of communities that our algorithm can reliably find, which we discuss in detail in Sec. VI. Equation (14) will never be maximized prior to reaching this frontier. During this initial phase, we also compute the posterior probability of the trivial bipartite partition with blocks, as a reference for the next phase.
Next, we search the region of the (,) plane within the resolution frontier to find a local maximum of Eq. (14) by adaptively reducing the number of communities. In this context, a local maximum is defined as an MCMCderived partition with exactly (,) blocks, whose posterior probability is larger than the posterior probabilities for MCMCderived partitions at nearby values (,), for a chosen neighborhood size . From the initial partition at the resolution frontier, we merge blocks, selected greedily from a stochastically sampled set of proposed merges. Here, because the posterior probability is a tiny value, it is computationally more convenient to work with the model entropy , which is related to the posterior probability by . Proposed merges are evaluated by their entropy after merging, but without calling the MCMC routine to optimize the postmerge partition. Because MCMC finds better (or no worse) fits to the data, this means that these postmerge entropies are approximate upper bounds of the bestfit entropy, given the postmerge number of blocks. We therefore use this approximate upper bound to make the search adaptive: whenever a merge would produce an upperbound approximation that is a factor higher than the current best , a full MCMC search is initialized at the current grid point. Otherwise, merges proceed rapidly since the approximate entropy is extremely cheap to compute. Throughout this process, the value of is estimated from the data to balance accuracy and efficiency, and it adaptively decreases as the search progresses (Appendix A). The algorithm exits when it finds a local minimum on the entropic landscape, returning the best overall partition explored during the search.
In practice, a typical call to the algorithm takes the form of (i) a rapid agglomerative merging phase from blocks to the resolution limit frontier; (ii) many agglomerative merges to move along candidate local minima that rely on approximated entropy; (iii) more deliberate and MCMCreliant neighborhood searches to examine candidate local minima. These phases are shown in Fig. 1. The algorithm has total complexity , where is the number of times that an exhaustive neighborhood search is performed. When , we find for most empirical networks examined. This algorithm is not guaranteed to find the global optimum, but due to the typical structure of the optimization landscape for bipartite networks, we have found it to perform well for many synthetic and empirical networks, and it tends consistently estimate the number of groups (see Sec. VI). An implementation is available in the bipartiteSBM library Yen ().
V Reconstruction performance
In this section, we examine our method’s ability to correctly recover the block structure in synthetic bipartite networks where known structure has been intentionally hidden. In each test, we begin by creating a bipartite network with unambiguous block structure, and then gradually mix that structure with noise until the planted blocks disappears entirely, creating a sequence of community detection problems that are increasingly challenging Moore (2017). The performance of a community detection method can then be measured by how well it recovers the known partition over this sequence of challenges.
The typical synthetic test for unipartite networks is the planted partition model Condon and Karp (2001) in which groups have assortative edges, and disassortative edges for . When the total expected degree for each group is fixed, the parameter controls the ambiguity of the planted blocks. Unambiguous assortative structure corresponds to while corresponds to a fully random graph. Here, we consider a straightforward translation of this model to bipartite networks in which the nodes are again divided into blocks according to a planted partition. As in the unipartite planted partition model, nonzero entries of the block affinity matrix take on one of two values but due to the fact that all edges are disassortative, we replace and with or to avoid confusion (see insets of Fig. 2). By analogy, we let while fixing the total expected degree for each group, so that corresponds to highly resolved communities which blend into noise as grows.
We present two synthetic tests using this bipartite planted partition model, designed to be easy and difficult, respectively. In the easy test, the unambiguous structure consists of nodes, divided evenly into blocks of 500 nodes each, with a mean degree . Each typeI block is matched with a typeII block so that the noisefree network consists of exactly 10 bipartite components, with zero edges placed between nodes in different components by definition. In the hard test, the unambiguous structure consists of nodes divided evenly into and blocks of approximately equal size, with mean degree . The relationships between the groups in the hard test are more complex, so the insets of Fig. 2 provide schematics of the adjacency matrices of both tests under a moderate amount of noise. In both cases, node degrees were drawn from a powerlaw distribution with exponent , and for a fixed , networks were drawn from the canonical degreecorrected stochastic blockmodel Karrer and Newman (2011); Larremore et al. (2014).
We test four methods’ abilities to recover the bipartite planted partitions, in combinations that allow us to separate the effects of using our bipartite model (Sec. III) and our bipartite search algorithm (Sec. IV), in comparison to existing methods. The first method maximizes the biSBM posterior using our 2D search algorithm. The second method keeps the 2D search algorithm, but examines the effects of the bipartitespecific edge count prior by replacing it with the general SBM’s edge count prior [i.e., replacing Eq. (13) with Eq. (22)]. The third method uses the same general SBM edge count prior as the second, but uses a 1D bisection search Peixoto (2013) to examine the effects of the 2D search. The fourth method maximizes the hierarchical SBM posterior using a 1D bisection search. For the first two cases, we use our bipartiteSBM library Yen (), while for the latter two, we use the graphtool library Peixoto (2014c). In all cases, we enforce typespecific MCMC move proposals to avoid mixedtype groups.
In the easy test, we find that the bipartite search algorithm introduced in Sec. IV performs better than the onedimensional searches (Fig. 2a). Because the onedimensional search algorithm assumes that the optimization landscape is unimodal, we reasoned that other modes may emerge as increases. To test this, we generated networks within the transition region () and then conducted an exhaustive survey of plausible values using MCMC with the general SBM. This revealed two basins of attraction, located at and , explaining the SBM’s performance. This bimodal landscape can therefore hinder search in one dimension by too quickly attracting the algorithm to the trivial bipartite partition. Perhaps surprisingly then, a similar exhaustive survey of the plane using the bipartite model revealed that near the transition , the biSBM has a local optimum with more than the planted blocks.
In the hard case, we find that it is not the bipartite search that enables the biSBM to outperform the other methods, but rather the bipartite posterior (Fig. 2b). An exploration of the outputs of the general searches shows that when they fail, they tend to find an incorrect number of blocks, which should total [corresponding to the planted blocks]. To understand this failure mode in more detail, we fixed and used MCMC to fit the general SBM Peixoto (2014c). This led to solutions in which , revealing that the performance degradation, relative to the biSBM, was due to a tendency for that particular algorithmic implementation of the SBM to find more balanced numbers of groups. Interestingly, near their respective transitions values of , both the SBM and biSBM tend to find more groups than were planted in the hard test, thus overfitting the data. To explore this further, we again conducted exhaustive surveys of the plane using MCMC and found that under both models, the posterior surfaces are consistently multimodal, with attractive peaks corresponding to more communities than the planted . However, only the bipartite search algorithm introduced in Sec. IV finds overfitted partitions with too many groups; the unipartite search algorithms instead return underfitted models with too few groups, balanced between the node types.
In sum, our synthetic network tests reveal two phenomena. First, the biSBM with bipartite search is able to extract structure from higher levels of noise than the alternatives, making it an attractive option for bipartite community detection with real data. However, our tests also reveal that the posterior surfaces of both the SBM and biSBM degenerate in unexpected ways near the detectability transition Decelle et al. (2011); Mossel et al. (2015); Kawamoto and Kabashima (2017b); RicciTersenghi et al. (2019).
Vi Resolution Limit
Community detection algorithms exhibit a resolution limit, an upper bound on the number of blocks that can be resolved in data, even when those blocks are seemingly unambiguous. For instance, using the general SBM, only groups can be detected Peixoto (2017), while the higher resolution of the hierarchical SBM improves this scaling to Peixoto (2014a). In this section we investigate the resolution limit of the biSBM numerically and analytically.
Our numerical experiment considers a network of bipartite cliques of equal size, with nodes of each type per biclique and therefore edges per biclique. To this network, we repeatedly apply the SBM, the hSBM, and biSBM, and record the number of blocks found each time, varying between and . For small values of , all three algorithms infer blocks, but as the number of blocks increases, solutions which merge pairs, then quartets, and then octets become favored (Fig. 3). The hSBM continues to find blocks, as expected.
The exact value of at which merging blocks into pairs becomes more attractive can be derived by asking when the corresponding posterior odds ratio, comparing a model with bicliques to a model with biclique pairs, exceeds one,
(18) 
When there are nodes of each type per biclique and edges, exceeds 1 when for the SBM and for the biSBM (Fig. 3; arrows). A similar calculation predicts the transition from biclique pairs to biclique quartets at for the SBM and for the biSBM (Fig. 3; arrows). Numerical experiments confirm these analytical predictions, but noisily, due to the stochastic search algorithms involved, and the fact the optimization landscapes are truly multimodal, particularly near points of transition.
The posterior odds ratio calculations above can be generalized, and show that the biSBM extends the resolution transitions twice as far as the SBM for the transitions from , and so on, but still undergoes the same transitions eventually. Thus, both models exhibit the same resolution limit scaling , but with resolution degradations that occur at for the SBM occurring at for the biSBM. Therefore, the resolution limit of the biSBM is larger than the SBM for the same number of nodes. One can alternatively retrace the analysis of Ref. Peixoto (2017), but for the biSBM applied to bicliques to derive the same resolution improvement.
This constantfactor improvement in resolution limit may seem irrelevant, given that the major contribution of the hierarchical SBM was to change the order of the limit to Peixoto (2014a). However, we find that, on the contrary, the factor improvement for the biSBM expands a previously uninvestigated regime in which flat models outperform their hierarchical cousin. When given the biclique data, the hSBM finds a hierarchical division where at each level , the number of groups decreases by a factor , except at the highest level where it finds a bipartite division. Assuming that , we have , where . The hSBM’s prior for edge counts Eq. (23) can be factored into uniform distributions over multigraphs at lower levels and over an SBM at the topmost level, leading to,
{IEEEeqnarray}rCl\IEEEeqnarraymulticol3l
P_lower( e) = ∏_l=1^log_σ~B((σ^2Eσl/~B))^~B/σ^l
& ×&
σ!2~B/σl(~B/σl1)!2(~B/σ^l11~B/σl1)^2 ,
and,
(18) 
By comparing with the corresponding terms from the biSBM [Eq. (13)] or the corresponding equation for the SBM [Eq. (22)], we can identify regimes in which a flat model better describes network data than the nested model.
Dataset  

Southern women interactions Jones et al. (1942)  18  14  5.56  (1, 1)  (1, 1)  (1, 1)  2.0  2.15  2.26 
Joern plantherbivore web Joern (1979)  22  52  4.97  (2, 2)  (1, 1)  (1, 1)  2.0  2.64  2.74 
Swingers and parties Niekamp et al. (2013)  57  39  4.83  (1, 1)  (1, 1)  (1, 1)  2.0  2.92  2.97 
McMullen pollination web McMullen (1993)  54  105  2.57  (2, 2)  (2, 2)  (1, 1)  2.0  2.87  3.02 
Ndrangheta criminals di Milano (2011)  156  47  4.48  (3, 4)  (3, 3)  (3, 4)  2.87  3.44  3.49 
Abu Sayyaf kidnappings Gerdes et al. (2014)  246  105  2.28  (2, 2)  (1, 1)  (1, 1)  2.0  4.50  4.54 
Virushost interactome RozenblattRosen et al. (2012)  53  307  2.52  (2, 2)  (1, 1)  (1, 1)  2.0  3.78  3.81 
ClementsLong plantpollinator Clements and Long (1923)  275  96  4.98  (1, 1)  (1, 1)  (1, 1)  2.0  3.45  3.47 
Human musculoskeletal system Murphy et al. (2018)  173  270  4.30  (7, 8)  (5, 5)  (8, 8)  4.01  3.94  3.94 
Mexican drug trafficking Coscia and Rios (2012)  765  10  16.1  (12, 8)  (8, 7)  (10, 6)  3.11  1.26  1.29 
Countrylanguage network Kunegis (2013)  254  614  2.89  (4, 5)  (2, 2)  (4, 3)  2.11  4.53  4.56 
Malaria gene similarity Larremore et al. (2013)  297  806  5.38  (15, 16)  (6, 6)  (25, 20)  4.95  4.73  4.67 
Protein complexdrug Nacher and Schwartz (2012)  739  680  5.20  (20, 22)  (14, 14)  (35, 39)  5.06  3.65  3.50 
Robertson plantpollinator Robertson ()  456  1428  16.2  (20, 18)  (11, 11)  (20, 19)  4.0  3.10  3.10 
Human genedisease network Goh et al. (2007)  1419  516  4.06  (13, 14)  (9, 9)  (35, 36)  5.04  5.02  4.80 
Food ingredientsflavors web Ahn et al. (2011)  1525  1107  27.9  (27, 69)  (20, 29)  (42, 130)  4.91  2.55  2.51 
Wikipedia docword network Gerlach et al. (2018)  63  3140  24.8  (22, 206)  (18, 23)  (29, 71)  4.17  1.58  1.51 
Foursquare checkins Yang et al. (2013)  2060  2876  11.0  (65, 66)  (40, 40)  (244, 248)  5.2  5.92  5.09 
Ancient metabolic network Goldford et al. (2017)  5651  5252  4.22  (18, 22)  (5, 5)  (17, 21)  4.26  5.68  5.82 
Marvel Universe characters Alberich et al. (2002)  6486  12942  9.95  (68, 72)  (67, 62)  (365, 314)  6.24  4.70  4.42 
Reuters news stories Lewis et al. (2004)  19757  38677  33.5  (396, 440)  (87, 108)  (294, 463)  6.25  4.22  4.16 
IMDb movieactor dataset  53158  39768  6.49  (91, 92)  (69, 68)  (264, 265)  6.22  7.40  7.30 
YouTube group memberships Mislove et al. (2007)  94238  30087  4.72  (62, 66)  (37, 38)  (221, 238)  5.9  7.07  7.13 
DBpedia writer network Auer et al. (2007)  89355  46213  2.13  (22, 26)  (2, 3)  (2, 3)  2.16  10.32  10.41 

Via the posterior odds ratio: ; ; .

Temporal data with timestamps are aggregated, making a multigraph.

Data available at https://www.imdb.com/interfaces. IMDb copyright permits redistribution of data only in unaltered form.
Figure 4 shows regimes in which the flat model is preferred for both the SBM and biSBM. These regimes are larger for the biSBM than the SBM, as expected, and are larger when the hierarchical branching factor decreases—indeed, if the data are less hierarchical, the hierarchical model is expected to have less of an advantage. The flatmodel description is also favored when there are fewer edges and more groups, suggesting that in order for the nested model to be useful, it requires sufficient data to support its more costly nested architecture. A number of realworld networks that fall into this flatmodel regime are described in the following section. We note that our definition of this regime relies on assumptions of perfect inference and a fixed branching factor at each level of the hSBM’s hierarchy. These assumptions may not always hold.
Vii Empirical networks
We now examine the application of the biSBM to a corpus of realworld networks ranging in size from to nodes, across social, biological, linguistic, and technological domains. While it was typical of past studies to measure a community detection method by its ability to recapitulate known metadata labels, we acknowledge that this approach is inadvisable for a number of theoretical and practical reasons Peel et al. (2017) and instead compare the biSBM to the SBM and hSBM using Bayesian model selection.
In general, to compare one partitionmodel pair and an alternative pair , we can compute the posterior odds ratio,
(19) 
Model is favored when and model is favored when , with the magnitude of difference from indicating the degree of confidence in model selection Jeffreys (1998). In the absence of any a priori preference for either model, , meaning that the ratio of probabilities can be alternatively expressed via the difference in description lengths, . [Recall that the description length for the combined model and data A can be written as the negative log of the posterior probability, as introduced in Sec. III.] In what follows, we compare the hSBM to the biSBM and without loss of generality choose to be whichever model is favored so that simply expresses the magnitude of the odds ratio. Note that by construction, the biSBM always outperforms the flat SBM.
As predicted in the previous section, the biSBM’s flat prior is better when networks are smaller and sparser, while for larger networks the hSBM generally performs better by building a hierarchy that results in a more parsimonious model (Table 1). Indeed, the majority of larger networks are better described using the hSBM (Table 1; rightmost columns), but exceptions do exist, including the ancient metabolic network Goldford et al. (2017), YouTube memberships Mislove et al. (2007), and DBpedia writer network Auer et al. (2007), which share the common feature of low density. The Robertson plantpollinator network Robertson (), on the other hand, is neither small nor particularly sparse, and yet the biSBM is still weakly preferred over the hSBM.
Differences between models, based only on their maximum a posteriori (i.e., minimum description length) estimates, may overlook additional complexity in the models’ full posterior distributions. We repeatedly sample from the posterior distributions of the SBM, biSBM, and hSBM for networks from Table 1, showing both posterior description length distributions and inferred block count distributions (Fig. 5). Generally, all three models exhibit similar descriptionlength variation, but due to the 2D search introduced in Sec. IV, the biSBM returns partitions with wider variation in and . For instance, the drug trafficking network Coscia and Rios (2012), a multigraph with , has a bimodal distribution of description lengths under the hSBM, while the biSBM finds plausible partitions for a wide variety of values (Fig. 5b). On the other hand, posterior distributions for the countrylanguage network Kunegis (2013) are all unimodal, but the biSBM finds probable states with wide variation in description length and block counts, while the hSBM samples from a small region (Fig. 5c). This can happen when the network is small, since the hSBM requires sufficiently complicated data to justify a hierarchy, while the biSBM finds a variety of lower description length partitions. In fact, viewing the same datasets through the lenses of these different models’ priors can quite clearly shift the location of posterior peaks. This is most clearly visible in the Reuters network Lewis et al. (2004), for which the models have unambiguous and nonoverlapping preferred states (Fig. 5f).
Briefly, we note that model comparison is possible here due to the fact that all of the models we considered are SBMs with clearly specified posterior distributions. Broader comparisons between community detection models of entirely different classes are also possible, for which we suggest Ref. Ghasemian et al. (2019).
Viii Discussion
This paper presented a bipartite microcanonical stochastic blockmodel (biSBM) and an algorithm to fit the model to network data. Our work is built on two foundations, developing a bipartite SBM Larremore et al. (2014) with a more sophisticated microstate counting approach Peixoto (2012). The model itself follows in the footsteps of Bayesian SBMs Peixoto (2014a, 2017) but with key modifications to the prior distribution and the search algorithm that more correctly account for the fact that some partitions are strictly prohibited when a network is bipartite. As a result, the biSBM is able to resolve community structure in bipartite networks better than the general SBM, demonstrated in tests with synthetic networks (Fig. 2).
The resolution limit of the biSBM is greater than the general SBM by a factor of . We demonstrated this mathematically and in a simple bicliquefinding test (Fig. 3). This analysis led us to directly compare the priors for the biSBM and the hierarchical SBM, which hinted at an unexpected regime in which the biSBM provides a better model than the hSBM. This regime, populated by smaller, sparser, and less hierarchical networks, was found in real data where model selection favored the biSBM (Table 1).
How should we understand these networks that are better described by our flat model than a hierarchical one? One possibility is that these networks are simply “flat” and so any hierarchical description simply wastes descriptionlength bits on a model which is too complex. Another possibility is that this result can be explained not by the mathematics of the models but by the algorithms used to fit the models. In fact, our tests with synthetic networks show clear differences between models and algorithms, with the 2D search algorithm introduced here providing better fits to data than a 1D search (Fig. 2). However, this finding alone does not actually differentiate between the two possible explanations, and so we constructed the following simple test.
To probe the differences between the biSBM and hSBM as models vs differences in their modelfitting algorithms, we combined both approaches in a twostep protocol: Fit the biSBM to network data and then build an optimal hierarchical model upon that fixed biSBM base. Unless the data are completely flat, this hierarchybuilding process will further reduce the description length, providing a more parsimonious model. If the hybrid hbiSBM provides a superior description length to the hSBM, our observations can be attributed to differences in modelfitting algorithms. In fact, this is precisely what we find.
Figure 6 shows repeated application of the biSBM, hSBM, and hybrid hbiSBM to the ancient metabolic network Goldford et al. (2017) and the malaria genes network Larremore et al. (2013). In the ancient metabolic network, the biSBM already outperformed the hSBM, so the hybrid model results in only marginal improvements in description length. However, doing so also creates hierarchies with an average depth of layers, compared with the layers found by hSBM natively. In other words, we can achieve a deeper hierarchy in addition to a more parsimonious model when using the flat biSBM partition at the lowest level. This suggests that, in fact, not all of the hSBM’s underperformance can be attributed to the ancient metabolic network’s being “flat,” since a hierarchy can be constructed upon the biSBM’s inferred structure. In the malaria genes network, although the hSBM outperformed the biSBM, the hybrid model was superior to both. Since the hybrid partitions are, in principle, available to the hSBM, our conclusion is that the 2D search algorithm we presented is actually finding better partitions. Put another way, there are further opportunities to improve the depth and speed of algorithms to fit stochastic blockmodels to realworld data, particularly when bipartite or other structure in the data can be exploited.
Finally, this work shows how both models and algorithms can reflect the structural constraints of realworld network data, and how doing so improves model quality. While our work addresses only community detection for bipartite networks, generalizations of both the mathematics and search algorithms could in principle be derived for multipartite networks in which more complicated rules exist for how node types are allowed to connect.
Acknowledgements.
The authors thank Tiago Peixoto, Tatsuro Kawamoto, Pan Zhang, Joshua Grochow, and JeanGabriel Young for stimulating discussions. DBL was supported in part by the Santa Fe Institute Omidyar Fellowship. The authors thank the BioFrontiers Institute at the University of Colorado Boulder and the Santa Fe Institute for the use of their computational facilities.Appendix A Recursive 2D search algorithm
In this appendix, we elaborate on the recursive search algorithm sketched in Sec. IV.2. Our overarching problem is to find the pair that minimizes the description length. We use dynamic programming to solve this problem efficiently, observing that it has the following two properties.
(1) Optimal substructure.—If we collectively inspect the solutions that lead to local minima, then the best of those determines the global minimum.
(2) Overlapping subproblems.—To verify the existence of a local minimum, we have to compute the description length for its neighborhood points. There are many subproblems which are solved again and again and their solutions can be stored in a table so that these need not to be recomputed.
Our recursive algorithm is summarized in Algms. A.1 and A.2. Due to its recursive construction, a base case (or smallest subproblem) represents a leaf node in the recursive search tree, at which we either terminate the algorithm or traceback and try a different node. The target point of the base case is only if its entropy is (i) minimal over all algorithmic history and (ii) is locally minimal compared with points within the neighborhood
(20) 
where is a userdefined parameter that controls the size of the subproblem.
In Phase I, we perform MCMC at the trivial bipartite partition to create a reference description length. Then, starting from an initial state in which each node belongs to its own group, we apply an AgglomerativeMerge algorithm (also summarized in the main text) to reach a partition at , where . The algorithm works as follows. In each sweep, we attempt block changes according to Eq. (16) for each block. These proposal moves are not uniformly random, but are instead based on the current block structure, treating the edge count matrix e as the adjacency matrix of a multigraph so that blocks can be thought of as nodes in this higherlevel representation. Potential merges of blocks are then ranked according to increasing , and exactly block merges are performed in that order, in each sweep. To minimize the impact of bad merges done in the earlier steps, at the end of each sweep we apply MCMC algorithm described in the main text at zero temperature, allowing block changes that strictly decrease the entropy. This algorithm has an overall complexity of Peixoto (2014b), which is dwarfed when compared with the MCMC calculation. Note that in Sec. IV.1, we perform the same agglomerative merge algorithm right before the MCMC inference but merge blocks to a specific number of groups , rather than to a threshold given by .
Phase II is the core recursive algorithm. Starting from the partition, we check whether it is a local minimum in the description length landscape, where the radius of the local neighborhood is , as defined in Eq. (20). If the current point is indeed a local minimum, the algorithm terminates. If it is not, the algorithm finds another candidate point in the grid by calling the RandMerge routine, which works by proposing many ways in which pairs of blocks could be merged, and then choosing the best merge. In particular, RandMerge proposes, for each block , other blocks to which could be merged, selected uniformly at random. From among those candidate merges, we choose the pair of blocks with the smallest relative entropy deviation . Here, is the minimum of all MCMCcalculated entropies explored globally and is the entropy that would result from a hypothetical merging of blocks and .
At this point, the algorithm gains its efficiency from avoiding calling the costly MCMC routine while still moving toward a local minimum in the plane. To do so requires that we accept the merge and entropy change from RandMerge without pausing to refit the model using MCMC at the new . If this process is repeated, the entropy after accumulating merges will deviate more and more from the optimal entropy, were we to refit the model using MCMC at the current . We therefore balance speed and efficiency by introducing a parameter that forces a full MCMC fit only when the accumulated entropy from repeated merges becomes intolerable. Let such that when the a block merge does not deviate from the entropy too much (), we accept the merge and attempt the next successive merges. Otherwise, we seek to terminate the algorithm by calling LocalMinimum_Check again at the current .
The key to efficiency is that computing the approximated partition by block merges from an optimized partition is faster than finding it from scratch. Note that when the state of the algorithm is far from a local minimum, is typically small and negative, meaning that a large number of merges can often be performed before a full MCMC is required. Thus, choosing is important. If we choose a large , the algorithm can overshoot the local minimum, requiring it to only gradually rediscover that minimum by inspecting many neighboring points. On the other hand, if we choose a small , there will be a larger number of MCMC calculations, which we also want to avoid. To this end, we determine from the data onthefly during the Adaptive_Search step. Namely, is the first outlier based on the Interquartile Rule,
(21) 
where collects the ’s at earlier sweeps and the IQR is the interquartile range, being equal to the difference between and percentiles. However, with this choice, we may still overshoot. In such cases, we reduce by a factor and relocate our attention to the whose entropy is minimal so far, and then call LocalMinimum_Check. During the neighborhood check, if we find an even better point nearby, we will relocate the tip to that point, and continue with the Adaptive_Search step. The algorithm ends if a local minimum is found.
Because of our dynamic programming approach, the time complexity of the total algorithm cannot be computed directly from the recursion, nor do we know the exact number of subproblems (local searches using MCMC) that the algorithm will need to call. Indeed, as we found for synthetic networks near the detectability limit, and for networks near transitions in the resolution limit, the optimization landscape becomes degenerate and multimodal, making a general algorithm complexity result hopeless.
Nevertheless, the time complexity of the search algorithm scales with the number of MCMC calculations. Heuristic arguments suggest the number of MCMC calculations should be on the order of , where is the number of times that the most expensive forloop [line 11 of LocalMinimum_Check] is called. However, even this is an approximation, due to the fact that, at times, a localminimum check reveals a point within the neighborhood that is better than the point currently being checked. In this way, subproblems may overlap, making the total cost somewhat cheaper. Empirically, for most networks in Table 1, .
Appendix B Prior for edge counts in the hierarchical biSBM
In this appendix, we provide the prior for edge counts in the hierarchical bipartite SBM, corresponding to Eqs. (VI) and Eq. (18). We begin with the flat SBM, whose prior for edge counts is,
(22) 
In the hSBM, it might seem as if should be written as a product of SBM likelihoods by repeatedly reusing Eq. (2) at each additional level. However, at higher levels, the networks are multigraphs, and the SBM likelihood does not generate multigraphs uniformly because it is based on a uniform generation of configurations (i.e., ) Peixoto (2017). Therefore, the correct way to build up the product is to directly count the number of multigraphs at each higher level using the dense ensemble Peixoto (2012), with each network instance occurring with the same probability.
Model variant  SBM  biSBM  hSBM  

Hierarchy level  
Eq. (22)  Eq. (13)  Eq. (22)  
Eq. (2)  Eq. (2)  Eq. (24)  
Eq. (9)  Eq. (9)  Eq. (25)  
Eq. (5)  Eq. (5)  
Dense ensemble?  no  no  yes 
Assuming that we have built higherlevel models, the prior for edge counts between groups can be rewritten as
(23) 
where
(24) 
and
(25) 
At the highest level (), denotes a singlenode multigraph with selfloops. Because there is no further block structure, we enforce and assume that the block is generated by a uniform prior, and reuse Eq. (22).
One peculiar consequence of forcing the hSBM, implemented in graphtool, to consider only typespecific blocks is that even when a network has no statistically justifiable structure, the hSBM finds the trivial bipartite partition and then builds a final hierarchical level on that trivial bipartite partition. In other words, it cannot help but find a single group at the topmost level. This explains the otherwise perplexing distribution of model description lengths shown in Fig. 5a: both the SBM and hSBM find the trivial partition, but this partition is more costly to express via the hSBM due to our having forced it to respect the network’s bipartite structure. Table 2 summarizes all model likelihood and prior functions pertinent to this paper, for reference.
References
 J.G. Young, F. S. Valdovinos, and M. E. J. Newman, Reconstruction of plant–pollinator networks from observational data, bioRxiv , 754077 (2019).
 T. Squartini, A. Almog, G. Caldarelli, I. van Lelyveld, D. Garlaschelli, and G. Cimini, Enhanced capitalasset pricing model for the reconstruction of bipartite financial networks, Physical Review E 96, 032315 (2017).
 R. Guimerà and M. SalesPardo, Justice Blocks and Predictability of U.S. Supreme Court Votes, PLOS ONE 6, e27188 (2011).
 G. Ghoshal, V. Zlatić, G. Caldarelli, and M. E. J. Newman, Random hypergraphs and their applications, Physical Review E 79, 066118 (2009).
 P. S. Chodrow, Configuration models of random hypergraphs, Journal of Complex Networks 8 (2020), cnaa018.
 P. W. Holland, K. B. Laskey, and S. Leinhardt, Stochastic blockmodels: First steps, Social Networks 5, 109 (1983).
 T. P. Peixoto, Bayesian stochastic blockmodeling, in Advances in Network Clustering and Blockmodeling (John Wiley & Sons, Hoboken, New Jersey, 2019) Chap. 11, pp. 289–332.
 E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing, Mixed membership stochastic blockmodels, Journal of Machine Learning Research 9, 1981 (2008).
 A. GodoyLorite, R. Guimerà, C. Moore, and M. SalesPardo, Accurate and scalable social recommendation using mixedmembership stochastic block models, Proceedings of the National Academy of Sciences 113, 14207 (2016).
 B. Karrer and M. E. J. Newman, Stochastic blockmodels and community structure in networks, Physical Review E 83, 016107 (2011).
 M. TarrésDeulofeu, A. GodoyLorite, R. Guimerà, and M. SalesPardo, Tensorial and bipartite block models for link prediction in layered networks and temporal networks, Physical Review E 99, 032307 (2019).
 T. P. Peixoto, Hierarchical Block Structures and HighResolution Model Selection in Large Networks, Physical Review X 4, 011047 (2014a).
 D. Hric, T. P. Peixoto, and S. Fortunato, Network Structure, Metadata, and the Prediction of Missing Nodes and Annotations, Physical Review X 6, 031038 (2016).
 M. E. J. Newman and A. Clauset, Structure and inference in annotated networks, Nature Communications 7, 1 (2016).
 L. Peel, D. B. Larremore, and A. Clauset, The ground truth about metadata and community detection in networks, Science Advances 3, e1602548 (2017).
 M. E. J. Newman, Estimating network structure from unreliable measurements, Physical Review E 98, 062321 (2018a).
 M. E. J. Newman, Network structure from rich but noisy data, Nature Physics 14, 542 (2018b).
 T. P. Peixoto, Reconstructing Networks with Unknown and Heterogeneous Errors, Physical Review X 8, 041011 (2018).
 J.G. Young, G. StOnge, P. Desrosiers, and L. J. Dubé, Universality of the stochastic block model, Physical Review E 98, 032309 (2018).
 S. C. Olhede and P. J. Wolfe, Network histograms and universality of blockmodel approximation, Proceedings of the National Academy of Sciences 111, 14722 (2014).
 D. B. Larremore, A. Clauset, and A. Z. Jacobs, Efficiently inferring community structure in bipartite networks, Physical Review E 90, 012805 (2014).
 T. P. Peixoto, Nonparametric Bayesian inference of the microcanonical stochastic block model, Physical Review E 95, 012317 (2017).
 M. A. Riolo, G. T. Cantwell, G. Reinert, and M. E. J. Newman, Efficient method for estimating the number of communities in a network, Physical Review E 96, 032310 (2017).
 T. Kawamoto and Y. Kabashima, Crossvalidation estimate of the number of clusters in a network, Scientific Reports 7, 1 (2017a).
 T. VallèsCatalà, T. P. Peixoto, M. SalesPardo, and R. Guimerà, Consistencies and inconsistencies between model selection and link prediction in networks, Physical Review E 97, 062316 (2018).
 T. P. Peixoto, Parsimonious Module Inference in Large Networks, Physical Review Letters 110, 148701 (2013).
 T. Kawamoto and Y. Kabashima, Counting the number of metastable states in the modularity landscape: Algorithmic detectability limit of greedy algorithms in community detection, Physical Review E 99, 010301 (2019).
 J. Calatayud, R. BernardoMadrid, M. Neuman, A. Rojas, and M. Rosvall, Exploring the solution landscape enables more reliable network community detection, Physical Review E 100, 052308 (2019).
 B. Bollobás, A Probabilistic Proof of an Asymptotic Formula for the Number of Labelled Regular Graphs, European Journal of Combinatorics 1, 311 (1980).
 B. K. Fosdick, D. B. Larremore, J. Nishimura, and J. Ugander, Configuring Random Graph Models with Fixed Degree Sequences, SIAM Review 60, 315 (2018).
 T. P. Peixoto, Entropy of stochastic blockmodel ensembles, Physical Review E 85, 056122 (2012).
 J. Rissanen, Information and Complexity in Statistical Modeling (Information Science and Statistics) (Springer New York, 2007).
 P. D. Grünwald, The Minimum Description Length Principle (Adaptive Computation and Machine Learning) (The MIT Press, 2007).
 P. D. Grünwald and T. Roos, Minimum Description Length Revisited, International Journal of Mathematics for Industry 10.1142/S2661335219300018 (2019).
 G. Schwarz, Estimating the Dimension of a Model, The Annals of Statistics 6, 461 (1978).
 X. Yan, C. Shalizi, J. E. Jensen, F. Krzakala, C. Moore, L. Zdeborová, P. Zhang, and Y. Zhu, Model selection for degreecorrected block models, Journal of Statistical Mechanics: Theory and Experiment 2014, P05007 (2014).
 G. E. Andrews, The Theory of Partitions (Cambridge University Press, 1998).
 N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, Equation of State Calculations by Fast Computing Machines, The Journal of Chemical Physics 21, 1087 (1953).
 W. K. Hastings, Monte Carlo Sampling Methods Using Markov Chains and Their Applications, Biometrika 57, 97 (1970).
 T. P. Peixoto, Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models, Physical Review E 89, 012804 (2014b).
 T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 3rd Edition, 3rd ed. (The MIT Press, Cambridge, Mass, 2009).
 J. Erickson, Algorithms (Independently published, S.L., 2019).
 B. W. Kernighan and S. Lin, An efficient heuristic procedure for partitioning graphs, The Bell System Technical Journal 49, 291 (1970).
 T.C. Yen, The bipartiteSBM Python library, available at https://github.com/junipertcy/bipartiteSBM (2020).
 D. B. Larremore, A. Clauset, and C. O. Buckee, A Network Approach to Analyzing Highly Recombinant Malaria Parasite Genes, PLOS Computational Biology 9, e1003268 (2013).
 C. Moore, The Computer Science and Physics of Community Detection: Landscapes, Phase Transitions, and Hardness, arXiv:1702.00467 [condmat, physics:physics] (2017), arXiv:1702.00467 [condmat, physics:physics] .
 A. Condon and R. M. Karp, Algorithms for graph partitioning on the planted partition model, Random Structures & Algorithms 18, 116 (2001).
 T. P. Peixoto, The graphtool Python library, 10.6084/m9.figshare 1164194 (2014c), available at https://graphtool.skewed.de.
 A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications, Physical Review E 84, 066106 (2011).
 E. Mossel, J. Neeman, and A. Sly, Reconstruction and estimation in the planted partition model, Probability Theory and Related Fields 162, 431 (2015).
 T. Kawamoto and Y. Kabashima, Detectability thresholds of general modular graphs, Physical Review E 95, 012304 (2017b).
 F. RicciTersenghi, G. Semerjian, and L. Zdeborová, Typology of phase transitions in Bayesian inference problems, Physical Review E 99, 042109 (2019).
 A. Clauset, E. Tucker, and M. Sainz, The Colorado Index of Complex Networks, available at https://icon.colorado.edu/ (2016).
 L. W. Jones, A. Davis, B. B. Gardner, and M. R. Gardner, Deep South: A Social Anthropological Study of Caste and Class, Southern Economic Journal 9, 159 (1942).
 A. Joern, Feeding patterns in grasshoppers (Orthoptera: Acrididae): Factors influencing diet specialization, Oecologia 38, 325 (1979).
 A.M. Niekamp, L. A. G. Mercken, C. J. P. A. Hoebe, and N. H. T. M. DukersMuijrers, A sexual affiliation network of swingers, heterosexuals practicing risk behaviours that potentiate the spread of sexually transmitted infections: A twomode approach, Social Networks Special Issue on Advances in TwoMode Social Networks, 35, 223 (2013).
 C. K. McMullen, Flowervisiting insects of the Galápagos Islands, PanPacific entomologist (USA) 69, 95 (1993).
 T. di Milano, Ordinanza di applicazione di misura coercitiva con mandato di catturaart. (Operazione Infinito), Ufficio del giudice per le indagini preliminari (2011).
 L. M. Gerdes, K. Ringler, and B. Autin, Assessing the Abu Sayyaf Group’s Strategic and Learning Capacities, Studies in Conflict & Terrorism 37, 267 (2014).
 O. RozenblattRosen, R. C. Deo, M. Padi, G. Adelmant, M. A. Calderwood, T. Rolland, M. Grace, A. Dricot, M. Askenazi, M. Tavares, S. J. Pevzner, F. Abderazzaq, D. Byrdsong, A.R. Carvunis, A. A. Chen, J. Cheng, M. Correll, M. Duarte, C. Fan, M. C. Feltkamp, S. B. Ficarro, R. Franchi, B. K. Garg, N. Gulbahce, T. Hao, A. M. Holthaus, R. James, A. Korkhin, L. Litovchick, J. C. Mar, T. R. Pak, S. Rabello, R. Rubio, Y. Shen, S. Singh, J. M. Spangle, M. Tasan, S. Wanamaker, J. T. Webber, J. RoeckleinCanfield, E. Johannsen, A.L. Barabási, R. Beroukhim, E. Kieff, M. E. Cusick, D. E. Hill, K. Münger, J. A. Marto, J. Quackenbush, F. P. Roth, J. A. DeCaprio, and M. Vidal, Interpreting cancer genomes using systematic host network perturbations by tumour virus proteins, Nature 487, 491 (2012).
 F. E. Clements and F. L. Long, Experimental Pollination: An Outline of the Ecology of Flowers and Insects, 336 (Carnegie Institution of Washington, 1923).
 A. C. Murphy, S. F. Muldoon, D. Baker, A. Lastowka, B. Bennett, M. Yang, and D. S. Bassett, Structure, function, and control of the human musculoskeletal network, PLOS Biology 16, e2002811 (2018).
 M. Coscia and V. Rios, Knowing Where and How Criminal Organizations Operate Using Web Content, in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM ’12 (ACM, Maui, Hawaii, USA, 2012) pp. 1412–1421.
 J. Kunegis, KONECT: The Koblenz Network Collection, in Proceedings of the 22Nd International Conference on World Wide Web, WWW ’13 Companion (ACM, Rio de Janeiro, Brazil, 2013) pp. 1343–1350.
 J. C. Nacher and J.M. Schwartz, Modularity in Protein Complex and Drug Interactions Reveals New Polypharmacological Properties, PLOS ONE 7, e30028 (2012).
 C. Robertson, Flowers and Insects; Lists of Visitors of Four Hundred and FiftyThree Flowers (Science Press, Lancaster, PA, USA, 1929).
 K.I. Goh, M. E. Cusick, D. Valle, B. Childs, M. Vidal, and A.L. Barabási, The human disease network, Proceedings of the National Academy of Sciences 104, 8685 (2007).
 Y.Y. Ahn, S. E. Ahnert, J. P. Bagrow, and A.L. Barabási, Flavor network and the principles of food pairing, Scientific Reports 1, 1 (2011).
 M. Gerlach, T. P. Peixoto, and E. G. Altmann, A network approach to topic models, Science Advances 4, eaaq1360 (2018).
 D. Yang, D. Zhang, Z. Yu, and Z. Yu, Finegrained Preferenceaware Location Search Leveraging Crowdsourced Digital Footprints from LBSNs, in Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp ’13 (ACM, Zurich, Switzerland, 2013) pp. 479–488.
 J. E. Goldford, H. Hartman, T. F. Smith, and D. Segrè, Remnants of an Ancient Metabolism without Phosphate, Cell 168, 1126 (2017).
 R. Alberich, J. MiroJulia, and F. Rossello, Marvel Universe looks almost like a real social network, arXiv:condmat/0202174 (2002), arXiv:condmat/0202174 .
 D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, RCV1: A New Benchmark Collection for Text Categorization Research, Journal of Machine Learning Research 5, 361 (2004).
 A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee, Measurement and Analysis of Online Social Networks, in Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, IMC ’07 (ACM, San Diego, California, USA, 2007) pp. 29–42.
 S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives, DBpedia: A Nucleus for a Web of Open Data, in The Semantic Web, Lecture Notes in Computer Science, Vol. 4825, edited by K. Aberer, K.S. Choi, N. Noy, D. Allemang, K.I. Lee, L. Nixon, J. Golbeck, P. Mika, D. Maynard, R. Mizoguchi, G. Schreiber, and P. CudréMauroux (Springer, Berlin, Heidelberg, 2007) pp. 722–735.
 S. H. Jeffreys, The Theory of Probability (Oxford University Press, New York, 1998).
 A. Ghasemian, H. Hosseinmardi, and A. Clauset, Evaluating overfit and underfit in models of network community structure, IEEE Transactions on Knowledge and Data Engineering 32, 1722 (2019).