Latent geometry of bipartite networks
Despite the abundance of bipartite networked systems, their organizing principles are less studied than those of unipartite networks. Bipartite networks are often analyzed after projecting them onto one of the two sets of nodes. As a result of the projection, nodes of the same set are linked together if they have at least one neighbor in common in the bipartite network. Even though these projections allow one to study bipartite networks using tools developed for unipartite networks, one-mode projections lead to significant loss of information and artificial inflation of the projected network with fully connected subgraphs. Here we pursue a different approach for analyzing bipartite systems that is based on the observation that such systems have a latent metric structure: network nodes are points in a latent metric space, while connections are more likely to form between nodes separated by shorter distances. This approach has been developed for unipartite networks, and relatively little is known about its applicability to bipartite systems. Here, we fully analyze a simple latent-geometric model of bipartite networks, and show that this model explains the peculiar structural properties of many real bipartite systems, including the distributions of common neighbors and bipartite clustering. We also analyze the geometric information loss in one-mode projections in this model, and propose an efficient method to infer the latent pairwise distances between nodes. Uncovering the latent geometry underlying real bipartite networks can find applications in diverse domains, ranging from constructing efficient recommender systems to understanding cell metabolism.
1 Introduction and Motivation
Many real-world networks have a bipartite structure: nodes can be separated into two disjoint sets and links exist only between nodes of different sets. Real bipartite networks are often characterized by two common properties: (i) heterogeneity in distributions of node degrees; and (ii) a large number of common neighbors shared between pairs of nodes. The heterogeneity of degree distributions has been studied extensively in both traditional (unipartite) and bipartite networks and comes as no surprise. In many bipartite systems heterogeneous degree distributions can be approximated by power laws, which are observed for at least one of the two sets of nodes. The second property, on the other hand, has not been studied extensively and deserves a thorough investigation.
In Fig. we show the distribution of the number of common neighbors between nodes in several real bipartite systems: the actor-film network derived from the International Movie Database (IMDb), condensed matter (Condmat) and high energy physics (HEP) collaboration networks derived from the arXiv, Wikipedia, and the network of metabolic reactions (see Appendix ?). For each of these networks, we calculated the distribution of the number of common neighbors shared between pairs of nodes. We see that the number of common neighbors in these networks is power-law distributed,
so that significant fractions of node pairs share many common neighbors. Similar observations have been made for other bipartite systems. For instance, the probability of two insect species pollinating different kinds of flowers in common has been shown to follow a truncated power law. Similarly, a fat-tail distribution of the number of shared requests between two users has been observed in peer-to-peer networks.
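The measurement described above can be sketched in a few lines. The following is a minimal illustration, not the procedure used to produce the figure: it counts common neighbors for every pair of nodes in one domain of a bipartite edge list and tabulates the resulting distribution. The function names and the toy edge list are hypothetical, and pairs that share no neighbors are left out of the tabulation.

```python
from collections import Counter
from itertools import combinations


def common_neighbor_counts(edges):
    """Map each pair of top nodes to its number of shared bottom neighbors.

    `edges` is an iterable of (top, bottom) pairs; labels are hypothetical.
    """
    tops_of = {}
    for top, bottom in edges:
        tops_of.setdefault(bottom, set()).add(top)
    counts = Counter()
    # Every bottom node contributes one common neighbor to each pair of
    # top nodes attached to it.
    for tops in tops_of.values():
        for pair in combinations(sorted(tops), 2):
            counts[pair] += 1
    return counts


def common_neighbor_distribution(edges):
    """Fraction of node pairs (with at least one shared neighbor) sharing m neighbors."""
    histogram = Counter(common_neighbor_counts(edges).values())
    total = sum(histogram.values())
    return {m: n / total for m, n in sorted(histogram.items())}


# Toy actor-film edge list (illustrative only).
toy_edges = [("a1", "f1"), ("a2", "f1"),
             ("a1", "f2"), ("a2", "f2"), ("a3", "f2")]
print(common_neighbor_distribution(toy_edges))
```

The cost grows with the square of the largest bottom-node degree, so for large networks with hubs one would probably sample pairs rather than enumerate them.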
To better understand the mechanisms leading to the abundance of common neighbors in bipartite systems, we first ask whether the observed fat-tail distributions of the number of common neighbors are a consequence of heterogeneous degree distributions. To answer this question we randomly rewire our real bipartite networks while preserving the degrees of individual nodes (see Appendix ?). We find that in the randomized networks the distribution of the number of common neighbors decays very fast, such that the maximum number of common neighbors between nodes is very small, see Fig. . This result suggests that the heterogeneity of the number of common neighbors is not caused by the heterogeneity of node degrees. Second, we check whether the heterogeneous shape of the common-neighbor distribution is driven by all pairs of nodes in the network or by a handful of high degree nodes. To this end, we focus on node pairs with a large number of common neighbors. We create a heatmap by counting pairs of HEP authors sharing at least publications, and whose publication record sizes, i.e., degrees, are and . As seen from Fig. , author pairs with common publications do not necessarily consist of authors that have published a large number of papers, as one would expect from a random collaboration pattern. On the contrary, we see that the majority of author pairs with at least common publications involve authors who barely published over publications each. This observation is not specific to the HEP collaboration network: we checked that in all considered networks both small and large degree nodes can have a large number of common neighbors.
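One common way to randomize a bipartite network while preserving all node degrees is the double edge swap. The sketch below is an assumption about how such a randomization could be carried out, not a reproduction of the procedure in the Appendix; the function names are hypothetical. Two edges (t1, b1) and (t2, b2) are picked at random and replaced by (t1, b2) and (t2, b1), unless this would create a duplicate edge.

```python
import random
from collections import Counter


def rewire_bipartite(edges, n_swaps, rng):
    """Degree-preserving randomization of a bipartite edge list by double edge swaps."""
    edge_list = list(dict.fromkeys(edges))  # drop duplicate edges, keep order
    edge_set = set(edge_list)
    if len(edge_list) < 2:
        return edge_list
    done, attempts = 0, 0
    max_attempts = 100 * n_swaps + 100  # guard against graphs with no valid swap
    while done < n_swaps and attempts < max_attempts:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (t1, b1), (t2, b2) = edge_list[i], edge_list[j]
        new1, new2 = (t1, b2), (t2, b1)
        # Reject swaps that change nothing or would create a multi-edge.
        if t1 == t2 or b1 == b2 or new1 in edge_set or new2 in edge_set:
            continue
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_list


def degrees(edges):
    """Degree sequences of the top and bottom domains."""
    return Counter(t for t, _ in edges), Counter(b for _, b in edges)
```

Each accepted swap keeps the degree of every top and bottom node unchanged, so comparing the common-neighbor statistics before and after rewiring isolates the effect of the degree sequence alone.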
The heterogeneity in the observed number of common neighbors implies the existence of a large number of 4-loops in real bipartite networks. Indeed, a pair of nodes sharing a large number of common neighbors will have a large number of 4-loops passing through them, that is, loops of the form . Supporting this observation, we also find that real bipartite systems are characterized by strong bipartite clustering, which quantifies the density of 4-loops in the network, see the definition in Section 3.7. Bipartite clustering is typically several orders of magnitude larger in the original networks compared to their degree-preserving randomized counterparts, Fig. . We note that similar clustering-related heterogeneity has also been observed in unipartite networks. Many real unipartite networks have been shown to exhibit power-law distributions of edge multiplicity, defined as the number of triangles shared by edges.
1.1 Latent geometry of bipartite networks
Here we show that the observed common properties of real bipartite networks can be explained by the existence of latent geometric spaces underlying these networks. That is, we assume that nodes in bipartite networks are points in some geometric space underlying the system. The latent coordinates of nodes in the space abstract node attributes, while latent distances between nodes play the role of a generalized similarity measure: the more similar two nodes are, the smaller the latent distance between them and the higher the probability that they are connected (Figure 2). To illustrate, consider the IMDb network, where actors are linked to films they starred in. Clearly, connections in this network are not random. Both actors and films can be characterized by numerous attributes, so that connections are the result of a certain mutual match between these attributes. These attributes include genres, geographic locations of films, film release dates, etc. Similar parallels can be drawn for other systems. For instance, connections between authors and manuscripts in collaboration networks are driven by many factors, including the expertise of the authors, their geographic location, their methodology, and so on. Collectively these attributes define similarity distances between nodes in a latent space.
The large numbers of common neighbors and strong clustering observed in real bipartite systems are then a reflection of the triangle inequality in the latent space. Consider, for example, two actors $a_1$ and $a_2$ and a film $f$ mapped to three points in a metric space. The triangle inequality in the space prescribes that $d(a_2, f) \leq d(a_1, a_2) + d(a_1, f)$, where $d(x, y)$ denotes the distance between nodes $x$ and $y$ in the space. If distances $d(a_1, a_2)$ and $d(a_1, f)$ are small, then $d(a_2, f)$ is also small, so that both actors are likely to co-star in film $f$, that is, $f$ is likely to be a common neighbor of $a_1$ and $a_2$. As the number of common films between two actors increases, so does bipartite clustering, that is, the number of loops of the form . We stress the importance of the metric property in a latent space: if latent distances do not satisfy the triangle inequality, then bipartite networks built using these distances do not have many common neighbors and strong bipartite clustering (Appendix ?).
1.2 Organization of the manuscript
To further support our geometric assumption we organize the rest of the manuscript as follows.
We begin with a review of related work in Section 2. In Section 3 we conduct a detailed analysis of a model of synthetic bipartite networks constructed in the latent space. We show that the model generates bipartite networks with either heterogeneous or homogeneous degree distributions in a given node domain, and with power-law distributions of the number of common neighbors and strong bipartite clustering. In Section 4, we investigate how the latent geometry of bipartite systems is transformed under one-mode projections, and prove that one-mode projections cannot fully preserve latent geometry. However, we also show that under certain conditions latent geometry can be preserved approximately. In Section 5, we propose a procedure for efficient estimation of the latent pairwise distances between pairs of nodes. Final remarks are in Section 6.
2 Related work
Bipartite networks have been successfully used to model a large array of complex systems including collaboration networks, metabolic reactions, peer-to-peer networks, and recommender systems. Bipartite networks can be represented as hypergraphs, generalizations of graphs where a single edge can connect multiple nodes. Hypergraphs, in turn, are further generalizable to multipartite hypergraphs, where hyperedges may connect several nodes of different types. Recently, tripartite hypergraphs have been proposed to model tagged social networks, also known as folksonomies.
The concept of latent space was initially introduced to model homophily and similarity in social networks. Lately, latent space models have attracted great interest in many diverse fields including sociology, statistical physics and computer science. Another closely related research area is that of random geometric graphs, well studied in mathematics and engineering, particularly due to their relevance to wireless networks.
Two models of random bipartite geometric graphs have been proposed recently. The first model is the AB random geometric graph (AB RGG), defined as two sets of points scattered as two independent Poisson processes in Euclidean space, with connections between points from different sets established if the distance between them is less than a threshold distance. AB RGGs are motivated by wireless networks where transmission and reception of a signal occur at different frequencies. Thus, of primary interest in AB RGGs are the connectivity and percolation properties.
The second model is inspired by the hidden variable formalism: network nodes map to points in the latent space and connections between them are drawn with probabilities depending on distances between the nodes in the underlying space. It has been shown that if latent geometry is hyperbolic, then random geometric graphs in it reproduce common structural and dynamical properties of unipartite networks: scale-free degree distribution, strong clustering, community structure, and large-scale growth dynamics. Equivalent to hyperbolic random graphs are random graphs in Euclidean space with power-law distributed hidden variables. This model has recently been generalized to bipartite networks, where it was called the model, and utilized to study cell metabolism.
3 Topological properties of bipartite networks as reflections of latent geometry
In this section we conduct a detailed analysis of the model of a bipartite network in the simplest compact latent space, the circle.
3.1 Definitions and the model
We refer to the two groups of nodes in a bipartite network as top and bottom nodes, and denote their number by and respectively. Within the modeling framework we consider, network nodes map to points in a geometric space, and as a result, every node of the network is characterized by its coordinates in this space. Both top and bottom nodes belong to the same space and, thus, distances are defined between all pairs of nodes. Yet, to generate bipartite networks, connections are allowed only between nodes of different domains. To achieve heterogeneity in node degrees and to allow some nodes to connect over large distances, every node is also assigned a hidden variable. To distinguish top and bottom nodes we denote these hidden variables as and respectively.
To form a bipartite network, every top-bottom pair of nodes is connected with a probability , which depends on the distance between the nodes and their hidden variables,
where is the connection probability function, is the distance between the nodes in the geometric space, and is a characteristic distance scale, allowing one to vary the importance of small distances depending on the nodes’ hidden variables. Even though any integrable function can, in principle, serve as the connection probability function, we use
Our choice for the connection probability function is dictated by the maximum entropy principle  and formalizes one’s intuition that similar nodes are more likely to be connected than dissimilar nodes. Indeed, is a decreasing function of , such that as and as . Further, parameter in Eq. (Equation 3) controls the abundance of long-distance connections: the larger the , the less preferred the longer-distance connections.
We focus on the simplest realization of the model, where top and bottom nodes are placed uniformly at random on a -dimensional Euclidean ring of radius , with probability density functions (pdfs) . Given , the densities of top and bottom nodes on the ring are and . The hidden variables of the top and bottom nodes are drawn at random from pdfs and . To ease notation, in the rest of the paper we drop the indices from the top and bottom hidden variable distributions, i.e., and . To achieve heterogeneous degree distributions, we choose the characteristic scale in Eq. (Equation 2) as , which allows nodes with large hidden variables to connect over large distances with higher probability. Parameter rescales all latent distances and controls the expected degrees of the top and bottom domains. Without loss of generality, we set , which corresponds to the unit density of top nodes, i.e., . We are interested in large and sparse bipartite networks, , where is the number of edges.
Even though nodes from both domains in the model belong to the same Euclidean ring , we refer to the model as the model to emphasize its bipartite structure and to distinguish it from the model for unipartite networks developed in . The model is fully specified by the number of top and bottom nodes and , the connection probability function , and the pdfs and . It can be summarized as follows:
Sample the angular coordinates of top nodes , , uniformly at random from , and their hidden variables , , from the pdf ;
Sample the angular coordinates of bottom nodes , , uniformly at random from , and their hidden variables , , from the pdf ;
Connect every top-bottom node pair with probability
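The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than a reference implementation. The connection probability is assumed to take the form 1 / (1 + x^beta) with x = R * dtheta / (mu * kappa * lambda), where dtheta is the angular distance; the ring radius is R = N / (2 pi) for N top nodes (unit density of top nodes); and mu = beta * sin(pi / beta) / (2 pi <kappa>) is chosen so that, for beta > 1 and large networks, the hidden variables approximate expected degrees. All function and variable names are hypothetical.

```python
import math
import random

TWO_PI = 2.0 * math.pi


def angular_distance(a, b):
    """Shortest angular separation between two points on the ring."""
    d = abs(a - b) % TWO_PI
    return min(d, TWO_PI - d)


def connection_probability(dtheta, kappa, lam, radius, mu, beta):
    """ASSUMED form of the connection probability: 1 / (1 + x**beta)."""
    x = radius * dtheta / (mu * kappa * lam)
    return 1.0 / (1.0 + x ** beta)


def generate_network(kappas, lambdas, beta, rng):
    """Sample coordinates and links for given top/bottom hidden variables.

    Assumes beta > 1 and sum(kappas) == sum(lambdas), i.e. both domains
    prescribe the same expected number of edges.
    """
    n_top = len(kappas)
    radius = n_top / TWO_PI                      # unit density of top nodes
    mean_kappa = sum(kappas) / n_top
    mu = beta * math.sin(math.pi / beta) / (TWO_PI * mean_kappa)
    thetas = [rng.uniform(0.0, TWO_PI) for _ in kappas]    # step 1
    phis = [rng.uniform(0.0, TWO_PI) for _ in lambdas]     # step 2
    edges = []
    for i, (theta, kappa) in enumerate(zip(thetas, kappas)):   # step 3
        for j, (phi, lam) in enumerate(zip(phis, lambdas)):
            p = connection_probability(angular_distance(theta, phi),
                                       kappa, lam, radius, mu, beta)
            if rng.random() < p:
                edges.append((i, j))
    return thetas, phis, edges
```

For example, `generate_network([4.0] * 300, [4.0] * 300, 2.5, random.Random(7))` should produce a network whose average degree is close to 4 in both domains. The double loop is quadratic in the network size, which is acceptable for a toy example but not for large networks.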
The hidden variables in the model allow long distance connections among some nodes and are necessary to achieve heterogeneous degree distributions. An alternative approach to achieve heterogeneity in node degrees is to consider latent spaces of non-zero curvature. Even though both approaches are fully equivalent (see Appendix ?), we utilize the former approach as it is more convenient for calculations.
The basic topological properties of synthetic networks constructed by the model can be obtained in a straightforward manner. Since angular node coordinates are sampled uniformly at random from , the expected degree of a top node with hidden variable and angular position , , is given by
Notice that due to the uniform angular distribution of nodes is independent of , . The evaluation of the inner integral in Eq. (Equation 5) leads to
where is the hypergeometric function. For sufficiently large networks, the expected degrees for nodes with values satisfying can be approximated as
and . The expected degree of the entire top node domain, , is given by averaging over all possible values,
where . Since the model is defined symmetrically for top and bottom nodes, the expected degrees for the bottom domain can be obtained by swapping top and bottom node variables in Eqs. (Equation 7) and (Equation 9),
It can be seen from Eqs. (Equation 7) and (Equation 10) that is a dummy parameter in the sense that for any particular value of , one can always rescale the hidden variables assigned to top and bottom nodes, , in order to obtain desired and values. Therefore, to simplify notation we set
Eqs. (Equation 12) and ( ?) indicate that the hidden variables of nodes are their expected degree values in the resulting topology, see Figure 3. Figure 3 also illustrates how the approximation in Eq. (Equation 7) becomes better for high values of as the size of the network increases.
To compute the degree distributions of the top and bottom nodes we consider the propagators and . The propagator () is defined as the probability that a top (bottom) node with hidden variables () forms exactly () connections to bottom (top) nodes. Since the angular coordinates of nodes are uniformly distributed, the propagators do not depend on the node angles and , and in the case of sparse bipartite networks they can be approximated by the Poisson distribution , that is,
The degree distributions of the top and bottom domains are obtained by averaging the corresponding propagators over the possible values of the hidden variables,
It can be seen from Eqs. (Equation 13- ?) that and are independent of one another; they depend only on and , respectively. Furthermore, the Poissonian character of the propagators and indicates that the resulting degree values of nodes in both domains are narrowly distributed around their hidden variables. This means that the functional forms of and will be similar to those of and , allowing one to construct different degree distributions by engineering proper pdfs of hidden variables and . Even though real bipartite systems are characterized by different degree distributions, we are primarily interested in scale-free and Poissonian distributions, which we discuss in Section 3.5 below.
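The averaging of Poisson propagators over hidden variables can be illustrated numerically. The sketch below assumes that the propagator of a node is a Poisson distribution whose mean equals the node's hidden variable, and it replaces the integral over the hidden-variable pdf by an average over a finite sample. The function names are hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """Poisson probability of observing k links given a positive expected degree."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def degree_distribution(k, hidden_variables):
    """Degree distribution as the sample average of Poisson propagators."""
    return sum(poisson_pmf(k, h) for h in hidden_variables) / len(hidden_variables)


# With identical hidden variables the mixture reduces to a single Poisson law.
print(degree_distribution(3, [4.0, 4.0, 4.0]), poisson_pmf(3, 4.0))
```

Because each propagator has mean equal to the hidden variable, the mean of the resulting degree distribution equals the mean of the hidden variables, whatever their distribution.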
Degree-degree correlations can be quantified using the average nearest neighbor degree (ANND), defined as the average degree of all neighbors of nodes with given degree . It is straightforward to verify that the model is characterized by random degree-degree correlations due to the uniform placement of nodes on :
Indeed, the connection probability between two nodes with fixed hidden variables and is proportional to the product of these hidden variables:
Then, since hidden variables are equal to expected node degrees, this result can be regarded as the soft equivalent of in uncorrelated bipartite networks, where is the probability that randomly chosen nodes with degrees and are connected. The rigorous proof of Eqs. (Equation 15) and ( ?) can be obtained following the hidden variable formalism for bipartite networks .
3.5 Categories of bipartite networks
Based on the degree distributions of the top and bottom domains, real bipartite networks often fall into two categories. The first category corresponds to networks with scale-free degree distribution in both top and bottom domains (sf/sf). The second category corresponds to networks with scale-free degree distribution in one domain and Poisson degree distribution in the other domain (sf/ps). Among the real networks that we consider, IMDb, Wikipedia, and the HEP collaboration network fall into the first category. The Condmat collaboration network and the network of metabolic reactions fall into the second category (see Appendix ?).
Since and are expected to be of similar functional form, one can see that a scale-free degree distribution can be obtained by using the continuous power-law distribution of on
Parameter is the smallest value, i.e., the expected minimum node degree, which also controls the mean value of
and the expected averaged degree ( ?).
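Hidden variables with a continuous power-law distribution can be drawn by inverse transform sampling. The sketch below assumes the pdf is proportional to kappa^(-gamma) for kappa >= kappa_min with gamma > 2, so that its mean is kappa_min * (gamma - 1) / (gamma - 2); the parameter and function names are hypothetical.

```python
import random


def sample_power_law(n, gamma, kappa_min, rng):
    """Draw n values with pdf proportional to kappa**(-gamma) on [kappa_min, inf)."""
    exponent = -1.0 / (gamma - 1.0)
    # rng.random() lies in [0, 1), so 1 - u lies in (0, 1] and is never zero.
    return [kappa_min * (1.0 - rng.random()) ** exponent for _ in range(n)]


def power_law_mean(gamma, kappa_min):
    """Mean of the assumed power-law pdf; finite only for gamma > 2."""
    return kappa_min * (gamma - 1.0) / (gamma - 2.0)


kappas = sample_power_law(5, 2.5, 2.0, random.Random(0))
print(kappas, power_law_mean(2.5, 2.0))
```

For exponents close to 2 the sample mean converges slowly to its expected value, so in finite networks the realized average degree can deviate noticeably from the target.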
The Poisson degree distribution can be obtained by choosing , where is the Dirac delta function and is the expected degree of the domain. The degree distribution in this case is given by
Due to the symmetry of the model, the degree distribution of the bottom domain can be obtained similarly through the proper choice of . The independence of and allows one to construct bipartite networks with an arbitrary combination of degree distributions for the top and bottom domains.
For illustration, we visualize in Figure 2 a toy sf/ps network consisting of and nodes. The hidden variables of the top nodes are drawn from the pdf , , , , while the bottom node hidden variables are chosen as for all nodes , where satisfies . The connections are drawn with probability prescribed by Eq. (Equation 4) with .
3.6 Number of common neighbors
The number of common neighbors is the most basic non-binary network-based measure of similarity between two nodes in a bipartite network: the smaller the similarity distance between two nodes, the more similar the nodes are, and the larger the number of neighbors they are expected to share. This makes the number of common neighbors a crucial measure, allowing one to estimate the similarity distance between two nodes in a bipartite network, Section 5. Below, we analyze this measure in the model.
Consider two top nodes characterized by hidden variables and and angular coordinates and . The probability that these two nodes are simultaneously connected to a bottom node with hidden variable and angular coordinate is
where is the connection probability in Eq. (Equation 4). The expected number of common neighbors between these two top nodes, , can be calculated by averaging over all possible positions and hidden variables of bottom nodes,
Due to the uniform distribution of angular coordinates, depends on the angular (similarity) distance between the two top nodes, , and not on their individual coordinates and . That is, . It is straightforward to verify (see Appendix ?) that is independent of the network size . It depends only on the node hidden variables and the distance between the nodes . It follows from Eq. (Equation 19) that for any values of , decreases as the angular distance increases, for both domains of sf/sf and of sf/ps bipartite networks, following a power law,
where the exponent is the parameter in the connection probability function in Eq. (Equation 4), see Fig. and Eq. ( ?) in Appendix ?. The conditional probability for two nodes with hidden variables separated by angular distance to have common neighbors, is narrowly distributed around its ensemble average , and in the case of sparse bipartite networks can be approximated by the Poisson distribution
where is the shorthand notation for , see Fig. and Appendix ?.
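The dependence of the expected number of common neighbors on the angular distance can be explored numerically. The sketch below is an illustration under assumptions: it uses the assumed connection probability 1 / (1 + x^beta) with x = R * dtheta / (mu * kappa * lambda), R = N / (2 pi), and mu = beta * sin(pi / beta) / (2 pi <kappa>); all bottom nodes carry the same hidden variable (the Poisson-like domain); and the integral over bottom-node positions is evaluated with a midpoint rule. All names are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi


def angular_distance(a, b):
    d = abs(a - b) % TWO_PI
    return min(d, TWO_PI - d)


def connection_probability(dtheta, kappa, lam, radius, mu, beta):
    """ASSUMED form of the connection probability: 1 / (1 + x**beta)."""
    x = radius * dtheta / (mu * kappa * lam)
    return 1.0 / (1.0 + x ** beta)


def expected_common_neighbors(dtheta, kappa1, kappa2, lam,
                              n_top, n_bottom, beta, steps=20000):
    """Expected common neighbors of two top nodes separated by angle dtheta."""
    radius = n_top / TWO_PI
    mean_kappa = n_bottom * lam / n_top   # both domains prescribe equal edge counts
    mu = beta * math.sin(math.pi / beta) / (TWO_PI * mean_kappa)
    dphi = TWO_PI / steps
    total = 0.0
    for s in range(steps):
        phi = (s + 0.5) * dphi            # midpoint of the s-th angular cell
        p1 = connection_probability(angular_distance(phi, 0.0),
                                    kappa1, lam, radius, mu, beta)
        p2 = connection_probability(angular_distance(phi, dtheta),
                                    kappa2, lam, radius, mu, beta)
        total += p1 * p2
    # n_bottom bottom nodes, each uniformly distributed over the ring.
    return n_bottom * total * dphi / TWO_PI


for d in (0.0, 0.01, 0.05, 0.2):
    print(d, expected_common_neighbors(d, 4.0, 4.0, 4.0, 1000, 1000, 2.5))
```

Under these assumptions, two top nodes with equal hidden variable kappa placed at the same position share about kappa * (1 - 1 / beta) bottom neighbors, and the value decays as the angular separation grows.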
Finally, the unconditional distribution of the number of common neighbors, , is obtained by averaging over all possible hidden variables and angular distances ,
As before, the corresponding expressions for the bottom domain nodes can be obtained by swapping the variables () with () and following the same analysis. The solution of the integral in Eq. (Equation 22) depends on the functional form of , which in turn depends on the pdfs of the hidden variables and on the value of parameter . While in general there is no closed-form solution to Eq. (Equation 22), different closed-form solutions can be obtained for integer values of . For instance, when and (sf/ps networks), we can show that for the top domain scales as
with (see Appendix ?). Our numerical experiments indicate that a similar power-law scaling of also holds for a range of values and for both domains of sf/sf networks, cf. Fig. and Fig. in Appendix ?.
The power-law scaling of means that a large number of node pairs have many common neighbors, and therefore many 4-loops passing through them, which, as explained in Section 1, implies strong bipartite clustering. We focus on bipartite clustering below.
3.7 Bipartite clustering
To quantify bipartite clustering, we consider the bipartite clustering coefficient introduced by Zhang et al., which aims at quantifying the density of 4-loops adjacent to a node ,
The summation goes over all pairs of neighbors of node , is the number of common neighbors between and , and and are the degrees of and . As seen from Eq. (Equation 24), is essentially a normalized measure of the density of common neighbors in the vicinity of node .
The bipartite clustering coefficient also has the following simple and intuitive similarity-based interpretation. Let and be the sets of neighbors of nodes and excluding node . Then is the size of the intersection of and , , while is their union. Therefore, Eq. (Equation 24) can be written as
The ratio of the sizes of the intersection and union of two sets is known as the Jaccard similarity coefficient. The bipartite clustering coefficient of a node is given by the ratio of the sums of intersections and unions over all pairs of the node's neighbors (Eq. (Equation 25)). Therefore, it can be interpreted as a combined or effective Jaccard similarity of the node's neighbors.
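The set-based form of the coefficient translates directly into code. The sketch below follows the description above: for every pair of neighbors of a node, it takes the neighbor sets of the pair with the node itself excluded, and divides the summed intersection sizes by the summed union sizes. Returning zero when the denominator vanishes is a convention assumed here, and the function names are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (top, bottom) edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def bipartite_clustering(adjacency, node):
    """Sum of intersections over sum of unions for all pairs of the node's neighbors."""
    numerator = 0
    denominator = 0
    for j, k in combinations(adjacency[node], 2):
        a = adjacency[j] - {node}   # neighbors of j, excluding the node itself
        b = adjacency[k] - {node}   # neighbors of k, excluding the node itself
        numerator += len(a & b)
        denominator += len(a | b)
    # Assumed convention: nodes with fewer than two neighbors, or whose
    # neighbors have no other links, get coefficient 0.
    return numerator / denominator if denominator else 0.0


toy = build_adjacency([("a", "x"), ("a", "y"), ("b", "x"),
                       ("b", "y"), ("c", "y")])
print(bipartite_clustering(toy, "a"))
```

In the toy network, the two films of actor a share actor b, while film y additionally has actor c, so the intersection has size one and the union has size two.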
The average bipartite clustering coefficients for the top and bottom node domains can be written as
where and are the sets of all top and bottom nodes. Expressions for the expected bipartite clustering coefficients in the model are derived in Appendix ?. Qualitatively, are large in the model and independent of the number of top and bottom nodes , see Fig. . This result follows from the fact that the expected number of common neighbors between two nodes is independent of the network size (Appendix ?). In contrast, in the degree-preserving randomized counterparts of the modeled networks, the bipartite clustering coefficient is orders of magnitude smaller, and vanishes with the network size as , with (see Fig. ), which is the expected behavior for uncorrelated bipartite networks . A similar behavior holds for the real bipartite networks we consider (Fig. ).
Another important property of bipartite clustering coefficient in sf/sf networks is its self-similarity with respect to a degree-thresholding renormalization procedure . Non-iterative removal of top and bottom nodes with degrees smaller than certain thresholds does not affect the functional form of degree-dependent bipartite clustering coefficients, which follow the same master-curve when plotted as a function of the node degree normalized by the average degree of the corresponding domain (see Fig. ? and Section ?).
Taken together, our results in this section indicate that the model can generate a variety of bipartite network topologies, whose main characteristics are consistent with those of real bipartite systems. A natural question then is whether it is possible to reverse the synthesis, and given a bipartite network, to infer the geometric coordinates and hidden variables of its nodes, in a way congruent with the model. A tempting approach would be to first project the bipartite network onto one of its node domains, apply existing maximum-likelihood estimation techniques to map the resulting one-mode projection, and then use the obtained unipartite map to infer the node coordinates of the other domain. A necessary condition for this approach to work is that the geometry of the bipartite network is properly preserved in its one-mode projections. We next examine to what extent this is the case.
4 One-mode projections
In one-mode projections we project a bipartite network onto one of its node domains, such that nodes of the domain are connected if they have at least one common neighbor in the bipartite network. Even though one-mode projections allow one to study bipartite networks using tools developed for unipartite networks, projections can lead to significant loss of information and artificial inflation of the projected network with fully connected subgraphs. Historically, different approaches have been proposed to deal with the loss of information. One approach, for instance, is to weight projected links using common neighbor statistics in the original network. Another approach to quantify the extent to which information is lost in one-mode projections and to identify circumstances under which one-mode projections are still acceptable is to reduce noise by identifying and removing insignificant links in the projected network.
Here, we analyze the effects of one-mode projections in the context of the model. Specifically, we ask if the latent geometry of a bipartite network is preserved in its one-mode projections. Answering this question is important, as it can shed light on how well latent geometry beneath bipartite networks can be inferred using algorithms developed for unipartite networks. In the following, we analyze the projections onto the top node domain. As before, the results for the bottom domain can be obtained by swapping the corresponding top and bottom domain variables.
The probability that two top domain nodes and are connected in the one-mode projection, is the probability that the nodes have at least one common neighbor in the bottom domain,
where enumerates the nodes of the bottom domain, and is the connection probability in Eq. (Equation 4).
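The probability of having at least one common neighbor can be written out explicitly when links to different bottom nodes are drawn independently, as in the model. The sketch below assumes this independence, so that two top nodes share no common neighbor with probability equal to the product over bottom nodes of (1 - p1 * p2), where p1 and p2 are the connection probabilities of the two top nodes to that bottom node. The function names and the numerical values are hypothetical.

```python
import math


def projection_link_probability(p_first, p_second):
    """Probability that two top nodes share at least one bottom neighbor.

    p_first[j] and p_second[j] are the connection probabilities of the two
    top nodes to bottom node j; links are assumed independent.
    """
    prob_no_common = 1.0
    for p1, p2 in zip(p_first, p_second):
        prob_no_common *= 1.0 - p1 * p2
    return 1.0 - prob_no_common


def mean_common_neighbors(p_first, p_second):
    """Expected number of common neighbors for the same pair of top nodes."""
    return sum(p1 * p2 for p1, p2 in zip(p_first, p_second))


p_a = [0.5, 0.2, 0.9]
p_b = [0.5, 0.1, 0.0]
link = projection_link_probability(p_a, p_b)
mean = mean_common_neighbors(p_a, p_b)
# The link probability lies between 1 - exp(-mean) and mean.
print(1.0 - math.exp(-mean), link, mean)
```

The two bounds become tight when the expected number of common neighbors is small, which is the regime where the projected connection probability is dominated by the leading term of the logarithm expansion.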
We say that the latent geometry of the bipartite network is preserved in its one-mode projection if preserves the functional form prescribed by Eq. (Equation 2):
where is a monotonically decreasing function of , which may or may not coincide with our choice for in Eq. (Equation 3), and is the characteristic distance scale for a pair of nodes with and in the projected network. In case takes the form of Eq. (Equation 28), one could map projections of real bipartite networks to latent spaces using methods developed for unipartite networks. If, on the other hand, is not in the form of Eq. (Equation 28), these techniques may not map bipartite networks correctly, and they either need to be adjusted, or different techniques need to be developed.
To test if latent geometry is preserved in one-mode projections we compute below. We first note that since and depend on and , also depends on . Due to the uniform distribution of the , does not depend on the individual values of and per se, but on the angular distance between the nodes, . Thus, we can set, without loss of generality, and . Assuming a sufficiently large number of bottom nodes we can rewrite Eq. (Equation 27) as
Then we replace the logarithm on the right hand side of Eq. (Equation 29) with its Taylor series expansion,
We note that the first term of the sum in the above relation, i.e., the term corresponding to , is the expected number of common neighbors between nodes and , . Second, we perform the change of integration variable , to obtain
The leading contributions to the inner integrals in Eq. (Equation 31) come from the two maxima of each integrand at and . Specifically, for large , connection probability can be approximated, to the leading order, as
where (Appendix ?). The functional form of in Eq. (Equation 32) is clearly different from that in Eq. (Equation 28) since the characteristic scale is different, indicating that latent geometry is not preserved in one-mode projections, as one could intuitively expect.
At the same time, it is important to note a certain similarity between the one-mode projection connection probability and the original bipartite connection probability . Both are decreasing functions of the angular distance normalized by characteristic scales , albeit these scales are different in the two cases. Yet since in both cases is larger for pairs of nodes with larger hidden variables, nodes characterized by larger values are more likely to connect over large distances and, therefore, are expected to have larger degrees not only in the bipartite network but also in its one-mode projection, consistent with our findings in Ref. . Furthermore, as seen from Eq. (Equation 32), for sufficiently large values as a function of has the same asymptotic behavior as ,
Our observation that latent geometry is not exactly preserved in one-mode projections is not specific to our choice of as the connection probability function in the model. We show below that the latent geometry cannot be fully preserved in one-mode projections, regardless of the functional form of in Eq. (Equation 3). Indeed, assuming that is given by Eq. (Equation 28), we can write
where . Next, we observe that the right hand side of Eq. (Equation 31) is a sum of convolutions and can be transformed into products of Fourier transforms, yielding
where , and . Since the left hand side of Eq. (Equation 34) does not depend on and while the right hand side does, the only admissible solution is . This solution corresponds to , where is the Dirac delta function, and cannot be interpreted as a connection probability function.
We thus find that latent geometry cannot be fully preserved for any functional form of the connection probability function . At the same time, our results indicate that connection probability in one–mode projections behaves similar to that in the original network for large angular distances. This result implies that it may be possible to infer approximately the latent geometry of real bipartite networks from their one-mode projections. Yet it remains unclear how accurate such inferences can be, especially in small bipartite networks whose one–mode projections are overinflated with cliques of sizes comparable to the network size. Such problems can render geometry inference using one-mode projections highly inaccurate, especially in sf/sf networks with power law exponents close to , as discussed at the end of Appendix ?.
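As a numerical illustration of the expansion used in this section, the following minimal Python sketch compares the exact probability that two top nodes share at least one bottom neighbor with the approximation that keeps only the first term of the Taylor series, i.e., the expected number of common neighbors. It assumes that bipartite links are independent once the connection probabilities are fixed. The function names and the numerical values are hypothetical and are not taken from the model equations.

```python
import math


def projection_link_probability(p_i, p_j):
    """Probability that two top nodes are linked in the one-mode projection.

    p_i[l] and p_j[l] are the probabilities that top nodes i and j connect
    to bottom node l. Links are assumed independent (an assumption of this
    sketch), and every product p_i[l] * p_j[l] must be strictly below 1.
    """
    # log of the probability that no bottom node is a common neighbor
    log_no_common = sum(math.log1p(-a * b) for a, b in zip(p_i, p_j))
    return -math.expm1(log_no_common)  # equals 1 - exp(log_no_common)


def leading_order_probability(p_i, p_j):
    """Keep only the first Taylor term of log(1 - x), which is -x."""
    expected_common = sum(a * b for a, b in zip(p_i, p_j))
    return -math.expm1(-expected_common)  # equals 1 - exp(-expected_common)


# Hypothetical example: 50 bottom nodes with small connection probabilities.
p_i = [0.1] * 50
p_j = [0.2] * 50
exact = projection_link_probability(p_i, p_j)
approx = leading_order_probability(p_i, p_j)
print(f"exact = {exact:.5f}, leading order = {approx:.5f}")
```

Because every higher-order term of the series has the same sign as the first, the leading-order value never exceeds the exact one, and the two agree closely whenever the individual products of connection probabilities are small.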
5 Inferring latent geometry
We have seen that the model can construct synthetic bipartite networks that resemble real networks across a range of non-trivial structural characteristics, which include: (i) heterogeneity in distributions of node degrees for at least one of the two domains of nodes; (ii) power law distribution of the number of common neighbors shared between pairs of nodes; and (iii) strong bipartite clustering. These results imply that we should be able to reverse the synthesis and, given a bipartite network, infer the hidden variables of its nodes as well as their latent distances. Below, we show that hidden variables and latent distances can be estimated from the observed node degrees and the common neighbors shared by nodes, respectively.
Recall from Section 3.2 that the resulting node degree corresponding to a particular hidden variable is Poisson distributed with expected value . Thus, the node hidden variables can be estimated by the observed node degrees as
Since the variance of the Poisson distribution is equal to its mean, this estimation works better for higher degree nodes, and can be used in the case of a scale-free degree distribution in the domain, see Fig. . In the case of a Poisson degree distribution in the domain, all nodes have identical hidden variables that can be estimated as
where is the observed average degree in the domain.
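The degree-based estimates described above can be sketched in a few lines of Python. The sketch assumes that the expected degree of a node equals its hidden variable, so that the observed degree itself serves as the estimate; this is an assumption made for illustration and should be checked against the model equations. The function names and the edge-list format are hypothetical.

```python
import math
from collections import Counter


def degree_based_estimates(edges, side=0):
    """Estimate hidden variables of one domain from observed degrees.

    edges is a list of (top, bottom) pairs; side=0 selects the top domain,
    side=1 the bottom domain. Assumes (for this sketch) that the expected
    degree equals the hidden variable. Returns {node: (estimate, rel_std)},
    where rel_std = 1/sqrt(k) is the relative standard deviation of a
    Poisson variable with mean k.
    """
    degrees = Counter(edge[side] for edge in edges)
    return {node: (float(k), 1.0 / math.sqrt(k)) for node, k in degrees.items()}


def homogeneous_estimate(edges, n_nodes):
    """Shared estimate for a domain with Poisson degrees: its mean degree.

    n_nodes is the number of nodes in that domain, including isolated ones.
    """
    return len(edges) / n_nodes


# Hypothetical toy network: top nodes "a", "b"; bottom nodes 1..4.
edges = [("a", 1), ("a", 2), ("a", 3), ("a", 4), ("b", 1)]
print(degree_based_estimates(edges, side=0))
print(homogeneous_estimate(edges, n_nodes=4))
```

The relative uncertainty 1/sqrt(k) makes explicit why the estimate is more reliable for higher-degree nodes.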
The angular distance separating two nodes can be estimated using the observed number of common neighbors between the nodes. As shown in Section 3.6, the number of common neighbors shared by two nodes is Poisson distributed with an expected value given by Eq. (19), which depends on the nodes’ hidden variables and their angular distance . If the observed number of common neighbors is sufficiently large, we can approximate as
The angular distance can be estimated using Eq. (36), which can be solved analytically for integer values of the model parameter or numerically otherwise.
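When no closed-form solution is available, the numerical route can be as simple as bisection, provided the expected number of common neighbors decreases monotonically with the angular distance. The Python sketch below illustrates this under that assumption. The function `toy_expected_m` is a hypothetical stand-in with an arbitrary functional form; it is not the expression of Eq. (19) and would have to be replaced by the model's expected number of common neighbors for the pair of nodes in question.

```python
import math


def invert_decreasing(expected_m, m_observed, lo=0.0, hi=math.pi, tol=1e-10):
    """Solve expected_m(dtheta) = m_observed for dtheta in [lo, hi].

    expected_m must be continuous and strictly decreasing on [lo, hi].
    Observed values outside the attainable range are clipped to the nearest
    end point, which is one possible convention chosen for this sketch.
    """
    if m_observed >= expected_m(lo):
        return lo
    if m_observed <= expected_m(hi):
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_m(mid) > m_observed:
            lo = mid  # too many expected common neighbors: move farther apart
        else:
            hi = mid
    return 0.5 * (lo + hi)


def toy_expected_m(dtheta, scale=50.0, width=0.1):
    """Hypothetical decreasing function, used only to exercise the solver."""
    return scale / (1.0 + dtheta / width)


# An observed count of 12.5 corresponds to dtheta = 0.3 for the toy function.
print(invert_decreasing(toy_expected_m, 12.5))
```

Clipping keeps the estimate inside the interval from 0 to pi when the observed count is larger or smaller than any value the expected-count function can produce.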
To test the accuracy of the proposed estimation we consider an sf/ps modeled network with . In this case, for the top domain is given by
(see Appendix ?) allowing us to estimate as
We note that the above relation is an approximation and may yield angular distances outside the expected range if is too large or too small.
This estimation procedure works well for pairs of nodes with large values, see Fig. , and it can be used for a fast estimation of the pairwise latent similarity distances between such nodes, e.g., in recommender systems.
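The input to this estimation is the number of common neighbors for each pair of nodes. A minimal Python sketch of how these counts might be collected from a bipartite edge list is given below. The function name, the edge-list format, and the `min_common` threshold are hypothetical choices for illustration; node labels are assumed to be mutually comparable so that each pair can be stored in a canonical order.

```python
from collections import defaultdict
from itertools import combinations


def common_neighbor_counts(edges, min_common=1):
    """Count common bottom neighbors for every pair of top nodes.

    edges is an iterable of (top, bottom) pairs. Only pairs with at least
    min_common shared neighbors are returned, as {(u, v): count} with u < v.
    """
    tops_of = defaultdict(set)
    for top, bottom in edges:
        tops_of[bottom].add(top)

    counts = defaultdict(int)
    for tops in tops_of.values():
        # every pair of top nodes attached to this bottom node shares it
        for u, v in combinations(sorted(tops), 2):
            counts[(u, v)] += 1
    return {pair: m for pair, m in counts.items() if m >= min_common}


# Hypothetical toy network.
edges = [("a", 1), ("b", 1), ("a", 2), ("b", 2), ("c", 2)]
print(common_neighbor_counts(edges, min_common=2))
```

The cost grows with the square of the degree of each bottom node, so very high-degree bottom nodes dominate the running time. Raising `min_common` restricts the output to the pairs for which the estimate is expected to be most reliable.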
Understanding the organizing principles determining the structure and evolution of real bipartite networks can lead to significant advances in many challenging problems including community detection , understanding signaling pathways in gene regulatory networks , multicast search , and construction of efficient recommender systems .
We have shown that three common properties of many real bipartite networks—heterogeneous degree distributions, power-law distributions of the number of common neighbors, and strong bipartite clustering—appear as natural reflections of latent geometric spaces underlying these networks, where nodes are points in these spaces, while connections preferentially occur at smaller distances. The distances between nodes in the latent space can be regarded as generalized similarity measures, arising from projections of properly weighted combinations of node attributes, controlling the appearance of links between node pairs.
For our analysis, we have used the simplest possible bipartite network model with latent geometry ( model) . Within the model, both the power law distribution of the number of common neighbors and the strong bipartite clustering emerge naturally as reflections of the metric property, i.e., the triangle inequality, of the latent space. To achieve heterogeneous degree distributions we have assigned hidden variables to both top and bottom node domains, so that nodes with larger hidden variables connect over larger distances with higher probability and, as a result, establish more connections than nodes with smaller hidden variables.
Although not fully geometric, the model is equivalent to the model in Appendix ?, which is fully geometric, not using any hidden variables (other than node coordinates). In the model, heterogeneous degree distributions are consequences of the exponential expansion of space in , coupled with proper boundary conditions.
As with unipartite networks, a particularly pertinent question is whether one can infer the latent geometries underlying real bipartite systems. Through the analysis of one-mode projections we have shown that latent geometry cannot be fully preserved but can be approximately preserved in one-mode projections of both sf/ps and sf/sf bipartite networks in the model. This result supports the possibility of inferring latent geometries underlying real bipartite systems by inferring the geometries of their one-mode projections using existing techniques , as in . However, since geometry is preserved only approximately, not exactly, using one-mode projections can render geometry inference inaccurate, especially in smaller networks with weaker bipartite clustering. Such inaccuracies are particularly high in sf/sf networks with power law exponents close to , calling for the development of proper methods to infer latent coordinates that do not use one-mode projections. We have shown that if, instead of coordinates, only pairwise latent distances between nodes with large numbers of common neighbors are to be inferred, e.g., in recommender systems  or in soft community detection , then such inferences can be made quickly and reliably based on the common neighbor statistics.
This work was supported by NSF grants CNS-0964236 and CNS-1442999. F.P. also acknowledges support by the EU H2020 NOTRE project (grant 692058). We thank M. Á. Serrano, M. Boguñá, and P. Krapivsky for many discussions of the manuscript.
- Networks: An Introduction.
Mark Newman. Oxford University Press, 2010.
- Basic notions for the analysis of large two-mode networks.
Matthieu Latapy, Clémence Magnien, and Nathalie Del Vecchio. Soc. Networks, 30(1):31–48, 2008.
- A general evolving model for growing bipartite networks.
Lixin Tian, Yinghuan He, Haijun Liu, and Ruijin Du. Phys. Lett. A, 376(23):1827–1832, 2012.
- An evolving model of online bipartite networks.
Chu Xu Zhang, Zi Ke Zhang, and Chuang Liu. Phys. A Stat. Mech. its Appl., 392(23):6100–6106, 2013.
- Competition for popularity in bipartite networks.
Mariano Beguerisse Díaz, Mason A. Porter, and Jukka Pekka Onnela. Chaos An Interdiscip. J. Nonlinear Sci., 20(4):43101, 2010.
- A mathematical model for generating bipartite graphs and its application to protein networks.
J C Nacher, T Ochiai, M Hayashida, and T Akutsu. J. Phys. A Math. Theor., 42(48):485005, 2009.
- The Internet Movie Database, http://www.imdb.com.
- ArXiv e-Print archive, http://www.arxiv.org.
- Wikipedia, the free encyclopedia that anyone can edit, http://www.wikipedia.org.
- Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms.
Hongwu Ma and An-Ping Zeng. Bioinformatics, 19(2):270–277, 2003.
- Two classes of bipartite networks: nested biological and social systems.
Enrique Burgos, Horacio Ceva, Laura Hernández, R P J Perazzo, Mariano Devoto, and Diego Medan. Phys. Rev. E, 78(4):046113, 2008.
- Small-world file-sharing communities.
Adriana Iamnitchi, Matei Ripeanu, and Ian Foster. In INFOCOM 2004. Twenty-third Annu. Jt. Conf. IEEE Comput. Commun. Soc., volume 2, pages 952–963. IEEE, 2004.
- Networks with arbitrary edge multiplicities.
Vinko Zlatić, Diego Garlaschelli, and Guido Caldarelli. EPL (Europhysics Lett.), 97(2):28005, 2012.
- The structure of a social science collaboration network: Disciplinary cohesion from 1963 to 1999.
James Moody. Am. Sociol. Rev., 69(2):213–238, 2004.
- Empirical analysis of the evolution of a scientific collaboration network.
Marco Tomassini and Leslie Luthi. Phys. A Stat. Mech. its Appl., 385(2):750–764, 2007.
- Systems biology.
Bernhard Palsson. Cambridge University Press, 2015.
- Cascading failure and robustness in metabolic networks.
Ashley G Smart, Luis A N Amaral, and Julio M Ottino. Proc. Natl. Acad. Sci., 105(36):13223–13228, 2008.
- Building a scalable bipartite P2P overlay network.
Yunhao Liu, Li Xiao, and Lionel Ni. IEEE Trans. Parallel Distrib. Syst., 18(9):1296–1306, 2007.
- Bipartite network projection and personal recommendation.
Tao Zhou, Jie Ren, Matúš Medo, and Yi-Cheng Zhang. Phys. Rev. E, 76(4):046115, 2007.
- Recommender systems.
Linyuan Lü, Matúš Medo, Chi Ho Yeung, Yi-Cheng Zhang, Zi-Ke Zhang, and Tao Zhou. Phys. Rep., 519(1):1–49, 2012.
- Recommender systems survey.
Jesús Bobadilla, Fernando Ortega, Antonio Hernando, and Abraham Gutiérrez. Knowledge-Based Syst., 46:109–132, 2013.
- Recommender systems in e-commerce.
J Ben Schafer, Joseph Konstan, and John Riedl. In Proc. 1st ACM Conf. Electron. Commer., pages 158–166. ACM, 1999.
- Graphs and hypergraphs, volume 7.
Claude Berge and Edward Minieka. North-Holland publishing company Amsterdam, 1973.
- Random hypergraphs and their applications.
Gourab Ghoshal, Vinko Zlatić, Guido Caldarelli, and M E J Newman. Phys. Rev. E, 79(6):066118, 2009.
- Self-organization in social tagging systems.
Chuang Liu, Chi Ho Yeung, and Zi-Ke Zhang. Phys. Rev. E, 83(6):066104, 2011.
- Extracting tag hierarchies.
Gergely Tibély, Péter Pollner, Tamás Vicsek, and Gergely Palla. PLoS One, 8(12):e84133, 2013.
- Diffusion-based recommendation in collaborative tagging systems.
Shang Ming-Sheng and Zhang Zi-Ke. Chinese Phys. Lett., 26(11):118903, 2009.
- A hypergraph model of social tagging networks.
Zi-Ke Zhang and Chuang Liu. J. Stat. Mech. Theory Exp., 2010(10):P10005, 2010.
- Birds of a feather: Homophily in social networks.
Miller McPherson, Lynn Smith-Lovin, and James M Cook. Annu. Rev. Sociol., 27(1):415–444, 2001.
- Social distance as a metric: a systematic introduction to smallest space analysis.
David D McFarland and Daniel J Brown. Bond. Plur. Form Subst. Urban Soc. Networks, 6:213–252, 1973.
- Dynamic social network analysis using latent space models.
Purnamrita Sarkar and Andrew W Moore. ACM SIGKDD Explor. Newsl., 7(2):31–40, 2005.
- Models of social networks based on social distance attachment.
Marián Boguñá, Romualdo Pastor-Satorras, Albert Díaz-Guilera, and Alex Arenas. Phys. Rev. E, 70(5):056122, 2004.
- Navigating networks by using homophily and degree.
Özgür Şimşek and David Jensen. Proc. Natl. Acad. Sci., 105(35):12758–12762, 2008.
- Preferential attachment in growing spatial networks.
Luca Ferretti and Michele Cortelezzi. Phys. Rev. E, 84(1):016103, 2011.
- Spatial networks.
Marc Barthélemy. Phys. Rep., 499(1-3):1–101, 2011.
- Theoretical Justification of Popular Link Prediction Heuristics.
Purnamrita Sarkar, Deepayan Chakrabarti, and Andrew W Moore. In IJCAI Proceedings-International Jt. Conf. Artif. Intell., volume 22, page 2722, 2011.
- Random Geometric Graphs.
Mathew Penrose. Oxford University Press, Oxford, 2003.
- Percolation, Connectivity, Coverage and Colouring of Random Geometric Graphs.
Paul Balister, Amites Sarkar, and Béla Bollobás. In Handb. Large-Scale Random Networks, pages 117–142. Springer, Berlin, 2008.
- Random Networks for Communication: From Statistical Physics to Information Systems.
M Franceschetti and R Meester. Cambridge University Press, Cambridge, 2008.
- Percolation and connectivity in AB random geometric graphs.
Srikanth K Iyer and Dahandapani Yogeshwaran. Adv. Appl. Probab., 44(01):21–41, 2012.
- Fundamentals of wireless communication.
David Tse and Pramod Viswanath. Cambridge university press, 2005.
- Continuum AB percolation and AB random geometric graphs.
Mathew D Penrose. J. Appl. Probab., 51:333–344, 2014.
- Scale-Free Networks from Varying Vertex Intrinsic Fitness.
G Caldarelli, A Capocci, P De Los Rios, and M Muñoz. Phys. Rev. Lett., 89(25):258702, 2002.
- Class of correlated random networks with hidden variables.
Marián Boguñá and Romualdo Pastor-Satorras. Phys. Rev. E, 68(3):036112, 2003.
- Hyperbolic geometry of complex networks.
Dmitri Krioukov, Fragkiskos Papadopoulos, Maksim Kitsak, Amin Vahdat, and Marián Boguñá. Phys. Rev. E, 82(3):036106, 2010.
- Popularity versus similarity in growing networks.
F Papadopoulos, M Kitsak, M A Serrano, M Boguñá, and D Krioukov. Nature, 489(7417):537–540, 2012.
- Emergence of Soft Communities from Geometric Preferential Attachment.
Konstantin Zuev, Marián Boguñá, Ginestra Bianconi, and Dmitri Krioukov. Sci. Rep., 5:9421, 2015.
- Uncovering the hidden geometry behind metabolic networks.
M Ángeles Serrano, Marián Boguñá, and Francesc Sagués. Mol. Biosyst., 8:843–850, 2012.
- Self-Similarity of Complex Networks and Hidden Metric Spaces.
M Serrano, Dmitri Krioukov, and Marián Boguñá. Phys. Rev. Lett., 100(7):078701, 2008.
- Hidden variables in bipartite networks.
Maksim Kitsak and Dmitri Krioukov. Phys. Rev. E, 84(2):026114, 2011.
- Dynamical and correlation properties of the internet.
Romualdo Pastor-Satorras, Alexei Vázquez, and Alessandro Vespignani. Phys. Rev. Lett., 87(25):258701, 2001.
- Clustering coefficient and community structure of bipartite networks.
Peng Zhang, Jinliang Wang, Xiaojia Li, Menghui Li, Zengru Di, and Ying Fan. Phys. A Stat. Mech. its Appl., 387(27):6869–6875, 2008.
- Etude comparative de la distribution florale dans une portion des Alpes et du Jura.
Paul Jaccard. Impr. Corbaz, 37:547, 1901.
- Sustaining the Internet with hyperbolic mapping.
Marián Boguñá, Fragkiskos Papadopoulos, and Dmitri Krioukov. Nat. Commun., 1:62, 2010.
- Network Mapping by Replaying Hyperbolic Growth.
Fragkiskos Papadopoulos, Constantinos Psomas, and Dmitri Krioukov. IEEE/ACM Trans. Netw., 23(1):198–211, 2015.
- Network geometry inference using common neighbors.
Fragkiskos Papadopoulos, Rodrigo Aldecoa, and Dmitri Krioukov. Phys. Rev. E, 92(2):022807, 2015.
- Statistical properties of corporate board and director networks.
S. Battiston and M. Catanzaro. In Eur. Phys. J. B, volume 38, pages 345–352, 2004.
- Who Is the Best Connected Scientist? A Study of Scientific Coauthorship Networks.
Mark E.J. Newman. Complex Networks, 650:337–370, 2004.
- Clustering in P2P exchanges and consequences on performances.
Stevens Le Blond, JL Guillaume, and M Latapy. Lect. Notes Comput. Sci., 3640:193–204, 2005.
- Construction of bipartite and unipartite weighted networks from collections of journa