Stochastic blockmodel approximation of a graphon: Theory and consistent estimation

This paper appears in the proceedings of NIPS 2013. In this version we include an appendix with proofs.

Edoardo M. Airoldi, Dept. of Statistics, Harvard University
Thiago B. Costa, SEAS and Dept. of Statistics, Harvard University
Stanley H. Chan, SEAS and Dept. of Statistics, Harvard University
Abstract

Non-parametric approaches for analyzing network data based on exchangeable graph models (ExGM) have recently gained interest. The key object that defines an ExGM is often referred to as a graphon. This non-parametric perspective on network modeling poses challenging questions on how to make inference on the graphon underlying observed network data. In this paper, we propose a computationally efficient procedure to estimate a graphon from a set of observed networks generated from it. This procedure is based on a stochastic blockmodel approximation (SBA) of the graphon. We show that, by approximating the graphon with a stochastic block model, the graphon can be consistently estimated, that is, the estimation error vanishes as the size of the graph approaches infinity.


1 Introduction

Revealing the hidden structure of a graph is at the heart of many data analysis problems. From the well-known small-world network to the recent large-scale data collected from online service providers such as Wikipedia, Twitter and Facebook, there is a constant drive to find better and more informative representations of graphs (Fienberg et al. 1985; Nowicki and Snijders 2001a; Hoff et al. 2002; Handcock et al. 2007; Airoldi et al. 2008; Xu et al. 2012; Azari and Airoldi 2012; Tang et al. 2013; Goldenberg et al. 2009; Kolaczyk 2009). In this paper, we develop a new computational tool to study one type of non-parametric representation that has recently drawn significant attention from the community (Bickel and Chen 2009; Lloyd et al. 2012; Bickel et al. 2011; Zhao et al. 2011; Orbanz and Roy 2013).

The root of the non-parametric model discussed in this paper is in the theory of exchangeable random arrays (Aldous 1981; Hoover 1979; Kallenberg 1989), and it is presented in (Diaconis and Janson 2008) as a link connecting de Finetti’s work on partial exchangeability and graph limits (Lovász and Szegedy 2006; Borgs et al. 2006). In a nutshell, the theory predicts that every convergent sequence of graphs has a limit object that preserves many local and global properties of the graphs in the sequence. This limit object, which is called a graphon, can be represented by a measurable function $w : [0,1]^2 \to [0,1]$, in such a way that any $w'$ obtained from a measure-preserving transformation of $w$ describes the same graphon.

Graphons are usually seen as kernel functions for random network models (Lawrence 2005). To construct an $n$-vertex random graph $G$ for a given $w$, we first assign a random label $u_i \sim \mathrm{Uniform}[0,1]$ to each vertex $i$, and connect any two vertices $i$ and $j$ with probability $w(u_i, u_j)$, i.e.,

$$\Pr\left[ G_{ij} = 1 \mid u_i, u_j \right] = w(u_i, u_j), \qquad (1)$$

where $G_{ij}$ denotes the $(i,j)$th entry of the adjacency matrix representing a particular realization of $G$ (see Figure 1). As an example, we note that the stochastic blockmodel is the case where $w$ is a piecewise constant function.
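The generative process in (1) is easy to simulate. Below is a minimal sketch in Python with NumPy; the function name `sample_graph` and the example blockmodel graphon are our own illustrative choices, not from the paper.

```python
import numpy as np

def sample_graph(w, n, rng=None):
    """Sample a directed n-vertex graph from a graphon w: [0,1]^2 -> [0,1].

    Each vertex i receives a latent label u_i ~ Uniform[0,1]; the edge
    (i, j) is then present independently with probability w(u_i, u_j).
    """
    rng = np.random.default_rng(rng)
    u = rng.uniform(size=n)                      # latent vertex labels
    P = w(u[:, None], u[None, :])                # P[i, j] = w(u_i, u_j)
    np.fill_diagonal(P, 0.0)                     # no self-loops
    return u, (rng.uniform(size=(n, n)) < P).astype(int)

# Example: a two-block stochastic blockmodel, i.e., a piecewise constant w.
w = lambda x, y: np.where((x < 0.5) == (y < 0.5), 0.7, 0.2)
u, G = sample_graph(w, n=200, rng=0)
```

Sorting the rows and columns of `G` by increasing `u` reproduces the kind of visualization shown in Figure 1.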

Figure 1: [Left] Given a graphon $w$, we draw i.i.d. samples $u_1, \ldots, u_n$ from Uniform[0,1] and set $G_{ij} = 1$ with probability $w(u_i, u_j)$, for $i, j = 1, \ldots, n$. [Middle] Heat map of a graphon $w$. [Right] A random graph generated by the graphon shown in the middle. Rows and columns of the graph are ordered by increasing $u_i$, instead of $i$, for better visualization.

The problem of interest is defined as follows: Given a sequence of observed directed graphs $G^{(1)}, \ldots, G^{(2T)}$ generated from $w$, can we construct an estimate $\widehat{w}$ of $w$ such that $\widehat{w} \to w$ with high probability as the graph size $n$ goes to infinity? This question has been only partially addressed in the literature. For example, Lloyd et al. (Lloyd et al. 2012) proposed a Bayesian estimator without a consistency proof; Choi and Wolfe (Choi and Wolfe 2012) studied the consistency properties, but did not provide algorithms to estimate the graphon. To the best of our knowledge, the only method that estimates graphons consistently, besides ours, is USVT (Chatterjee 2012). However, our algorithm has better complexity and outperforms USVT in our simulations. More recently, other groups have begun exploring approaches related to ours (Wolfe and Olhede 2013; Latouche and Robin 2013).

The proposed approximation procedure requires $w$ to be piecewise Lipschitz. The basic idea is to approximate $w$ by a two-dimensional step function with diminishing intervals as $n$ increases. The proposed method is called the stochastic blockmodel approximation (SBA) algorithm, as using a two-dimensional step function for approximation is equivalent to using a stochastic blockmodel (Choi et al. 2012; Nowicki and Snijders 2001a; Hoff 2008; Channarond et al. 2012; Rohe et al. 2011). The SBA algorithm is defined up to permutations of the nodes, so the estimated graphon is not canonical. However, this does not affect the consistency properties of the SBA algorithm, as consistency is measured w.r.t. the graphon that generates the graphs.

2 Stochastic blockmodel approximation: Procedure

In this section we present the proposed SBA algorithm and discuss its basic properties.

2.1 Assumptions on graphons

We assume that $w$ is piecewise Lipschitz, i.e., there exists a sequence of non-overlapping intervals $I_k = [\alpha_{k-1}, \alpha_k]$ defined by $0 = \alpha_0 < \alpha_1 < \cdots < \alpha_K = 1$, and a constant $L > 0$ such that, for any $(x, y)$ and $(x', y')$ in $I_i \times I_j$,

$$|w(x, y) - w(x', y')| \le L \left( |x - x'| + |y - y'| \right).$$
For generality we assume $w$ to be asymmetric, i.e., $w(u, v) \neq w(v, u)$ in general, so that symmetric graphons can be considered as a special case. Consequently, a random graph generated by $w$ is directed, i.e., $G_{ij} \neq G_{ji}$ in general.

2.2 Similarity of graphon slices

The intuition of the proposed SBA algorithm is that if the graphon is smooth, neighboring cross-sections of the graphon should be similar. In other words, if two labels $u_i$ and $u_j$ are close, i.e., $|u_i - u_j| \approx 0$, then the difference between the row slices $w(u_i, \cdot)$ and $w(u_j, \cdot)$ and between the column slices $w(\cdot, u_i)$ and $w(\cdot, u_j)$ should also be small. To measure the similarity between two labels using the graphon slices, we define the following distance

$$d_{ij}^2 = \int_0^1 \left[ w(x, u_i) - w(x, u_j) \right]^2 dx + \int_0^1 \left[ w(u_i, y) - w(u_j, y) \right]^2 dy. \qquad (2)$$

Thus, $d_{ij}$ is small only if both the row and column slices of the graphon are similar.

The usage of $d_{ij}$ for graphon estimation will be discussed in the next subsection. But before we proceed, it should be noted that in practice $d_{ij}$ has to be estimated from the observed graphs $G^{(1)}, \ldots, G^{(2T)}$. To derive an estimator of $d_{ij}$, it is helpful to express $d_{ij}^2$ in a form for which estimators can be easily obtained. To this end, we let

$$r_{ij} = \int_0^1 w(u_i, y)\, w(u_j, y)\, dy \quad \text{and} \quad c_{ij} = \int_0^1 w(x, u_i)\, w(x, u_j)\, dx,$$

and express $d_{ij}^2$ as $d_{ij}^2 = (r_{ii} - 2 r_{ij} + r_{jj}) + (c_{ii} - 2 c_{ij} + c_{jj})$. Inspecting this expression, we consider the following estimators for $r_{ij}$ and $c_{ij}$:

$$\widehat{r}_{ij}^{(k)} = \left[ \frac{1}{T} \sum_{t=1}^{T} G_{ik}^{(t)} \right] \left[ \frac{1}{T} \sum_{t=T+1}^{2T} G_{jk}^{(t)} \right], \qquad (3)$$
$$\widehat{c}_{ij}^{(k)} = \left[ \frac{1}{T} \sum_{t=1}^{T} G_{ki}^{(t)} \right] \left[ \frac{1}{T} \sum_{t=T+1}^{2T} G_{kj}^{(t)} \right]. \qquad (4)$$

Here, the superscript $k$ plays the role of the dummy variables $y$ and $x$ in the definitions of $r_{ij}$ and $c_{ij}$, respectively. Averaging over the possible $k$'s yields an estimator $\widehat{d}_{ij}^{\,2}$ that parallels $d_{ij}^2$:

$$\widehat{d}_{ij}^{\,2} = \frac{1}{|S|} \sum_{k \in S} \left[ \left( \widehat{r}_{ii}^{(k)} - 2 \widehat{r}_{ij}^{(k)} + \widehat{r}_{jj}^{(k)} \right) + \left( \widehat{c}_{ii}^{(k)} - 2 \widehat{c}_{ij}^{(k)} + \widehat{c}_{jj}^{(k)} \right) \right], \qquad (5)$$

where $S \subseteq \{1, \ldots, n\} \setminus \{i, j\}$ is the set of summation indices.

The motivation for defining the estimators in (3) and (4) is that a row $G_{i \cdot}$ of the adjacency matrix is fully characterized by the corresponding slice $w(u_i, \cdot)$ of the graphon. Thus the expected value of $\widehat{r}_{ij}^{(k)}$ is $w(u_i, u_k)\, w(u_j, u_k)$, and hence $\widehat{d}_{ij}^{\,2}$ is an estimator for $d_{ij}^2$. To theoretically justify this intuition, we will show in Section 3 that $\widehat{d}_{ij}^{\,2}$ is indeed a good estimator: it is not only unbiased, but is also concentrated around $d_{ij}^2$ for large $n$. Furthermore, we will show that it is possible to use a random subset $S$ of $\{1, \ldots, n\} \setminus \{i, j\}$ instead of the full set to achieve the same asymptotic behavior. As a result, the estimation of $d_{ij}$ can be performed locally in a neighborhood of $i$ and $j$, instead of over all vertices.
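As a concrete illustration, the distance estimator can be computed directly from the observed adjacency matrices. The sketch below assumes a split-sample product form for the two factors (the first $T$ graphs in one factor, the last $T$ in the other, so that the factors are conditionally independent); the function name `dhat_sq` is ours.

```python
import numpy as np

def dhat_sq(graphs, i, j, S):
    """Estimate the squared slice distance d_ij^2 from observed graphs.

    graphs: array of shape (2T, n, n) of adjacency matrices.
    S:      iterable of neighborhood indices (excluding i and j).
    The first T and last T graphs are averaged separately so that the
    two factors in each product term are conditionally independent.
    """
    T = len(graphs) // 2
    A = graphs[:T].mean(axis=0)          # edge frequencies, first half
    B = graphs[T:].mean(axis=0)          # edge frequencies, second half
    S = np.asarray(list(S))
    r = lambda a, b: np.mean(A[a, S] * B[b, S])   # row-slice products
    c = lambda a, b: np.mean(A[S, a] * B[S, b])   # column-slice products
    return (r(i, i) - 2 * r(i, j) + r(j, j)) + (c(i, i) - 2 * c(i, j) + c(j, j))
```

By construction, the estimate is exactly zero when `i == j`.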

2.3 Blocking the vertices

The similarity metric discussed above suggests a simple method to approximate $w$ by a piecewise constant function (i.e., a stochastic blockmodel). Given $\widehat{d}_{ij}$, we can cluster the (unknown) labels $u_1, \ldots, u_n$ into $K$ blocks $\widehat{B}_1, \ldots, \widehat{B}_K$ using the procedure described below. Once the blocks are defined, we can then determine the value of the estimate on each pair of blocks by computing the empirical frequency of edges that are present across blocks $\widehat{B}_i$ and $\widehat{B}_j$:

$$\widehat{w}(\widehat{B}_i, \widehat{B}_j) = \frac{1}{2T\, |\widehat{B}_i|\, |\widehat{B}_j|} \sum_{t=1}^{2T} \sum_{i' \in \widehat{B}_i} \sum_{j' \in \widehat{B}_j} G_{i'j'}^{(t)}, \qquad (6)$$

where $\widehat{B}_i$ is the block containing vertex $i$, so that summing $G_{i'j'}^{(t)}$ over $i' \in \widehat{B}_i$ and $j' \in \widehat{B}_j$ yields an estimate of the expected number of edges linking blocks $\widehat{B}_i$ and $\widehat{B}_j$.
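Once the blocks are fixed, the histogram step is a plain average. A minimal sketch (the function name `block_estimate` is ours; for simplicity it averages over every pair in the two blocks, including self-pairs on the diagonal blocks):

```python
import numpy as np

def block_estimate(graphs, blocks):
    """Histogram estimate of the graphon on each pair of blocks:
    the empirical frequency of edges across the two blocks, averaged
    over all observed graphs (cf. the role of eq. (6))."""
    Gbar = np.mean(graphs, axis=0)       # average adjacency over all graphs
    K = len(blocks)
    what = np.empty((K, K))
    for a, Ba in enumerate(blocks):
        for b, Bb in enumerate(blocks):
            # Mean of all entries with row in block a and column in block b.
            what[a, b] = Gbar[np.ix_(Ba, Bb)].mean()
    return what
```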

To cluster the unknown labels we propose the greedy approach shown in Algorithm 1. Starting with the full vertex set $\Omega_1 = \{1, \ldots, n\}$, we randomly pick a vertex $i_1 \in \Omega_1$ and call it the pivot. Then for every other vertex $j \in \Omega_1$, we compute the distance estimate $\widehat{d}_{i_1 j}$ and check whether $\widehat{d}_{i_1 j} \le \Delta$ for some precision parameter $\Delta > 0$. If so, we assign $j$ to the same block $\widehat{B}_1$ as $i_1$. Therefore, after scanning through $\Omega_1$ once, a block $\widehat{B}_1$ will be defined. By updating $\Omega_2 = \Omega_1 \setminus \widehat{B}_1$, the process repeats until no unclustered vertex remains.

The proposed greedy algorithm is only a local solution in the sense that it does not return the globally optimal clusters. However, as will be shown in Section 3, although the clustering algorithm is not globally optimal, the estimated graphon is still guaranteed to be a consistent estimate of the true graphon as $n \to \infty$. Since the greedy algorithm is numerically efficient, it serves as a practical computational tool to estimate $w$.

2.4 Main algorithm

  Input: A set of observed graphs $G^{(1)}, \ldots, G^{(2T)}$ and the precision parameter $\Delta$.
  Output: Estimated stochastic blocks $\widehat{B}_1, \ldots, \widehat{B}_K$.
  Initialize: $k = 1$ and $\Omega_1 = \{1, \ldots, n\}$.
  while $\Omega_k \neq \emptyset$ do
     Randomly choose a vertex $i_k$ from $\Omega_k$ and assign it as the pivot for $\widehat{B}_k$: $\widehat{B}_k = \{i_k\}$.
     for every other vertex $j \in \Omega_k \setminus \{i_k\}$ do
        Compute the distance estimate $\widehat{d}_{i_k j}$.
        If $\widehat{d}_{i_k j} \le \Delta$, then assign $j$ as a member of $\widehat{B}_k$: $\widehat{B}_k \leftarrow \widehat{B}_k \cup \{j\}$.
     end for
     Update $\Omega_{k+1} = \Omega_k \setminus \widehat{B}_k$.
     Update counter: $k \leftarrow k + 1$.
  end while
Algorithm 1 Stochastic blockmodel approximation

Algorithm 1 gives the pseudo-code for the proposed stochastic blockmodel approximation. The complexity of this algorithm is $O(T S K n)$, where $T$ is half the number of observations, $S$ is the size of the neighborhood used to estimate the distances, $K$ is the number of blocks, and $n$ is the number of vertices of the graph.
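Putting the pieces together, a compact version of Algorithm 1 might look as follows. This is a sketch under our reading of the distance estimator (split-sample products, with the full vertex set as the neighborhood $S$); the function name `sba_cluster` is ours.

```python
import numpy as np

def sba_cluster(graphs, delta, rng=None):
    """Greedy pivot clustering of vertices into blocks (sketch of Algorithm 1).

    graphs: array of shape (2T, n, n) of observed adjacency matrices.
    delta:  precision parameter; a vertex joins the current pivot's block
            when its estimated slice distance to the pivot is <= delta.
    """
    rng = np.random.default_rng(rng)
    T, n = len(graphs) // 2, graphs.shape[1]
    A = graphs[:T].mean(axis=0)          # averages over the first half
    B = graphs[T:].mean(axis=0)          # averages over the second half

    def dist2(i, j):
        # Estimated squared slice distance, using all other vertices as S.
        S = np.array([k for k in range(n) if k not in (i, j)])
        r = lambda a, b: np.mean(A[a, S] * B[b, S])
        c = lambda a, b: np.mean(A[S, a] * B[S, b])
        return (r(i, i) - 2 * r(i, j) + r(j, j)) \
             + (c(i, i) - 2 * c(i, j) + c(j, j))

    blocks, remaining = [], set(range(n))
    while remaining:
        pivot = int(rng.choice(sorted(remaining)))           # random pivot
        block = {j for j in remaining
                 if j == pivot or dist2(pivot, j) <= delta ** 2}
        blocks.append(sorted(block))
        remaining -= block
    return blocks
```

Each iteration removes at least the pivot, so the loop always terminates; a large `delta` yields a single block, a tiny one yields many.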

3 Stochastic blockmodel approximation: Theory of estimation

In this section we present the theoretical aspects of the proposed SBA algorithm. We will first discuss the properties of the estimator $\widehat{d}_{ij}^{\,2}$, and then show the consistency of the estimated graphon $\widehat{w}$. Details of the proofs can be found in the supplementary material.

3.1 Concentration analysis of $\widehat{d}_{ij}^{\,2}$

Our first theorem shows that the proposed estimator $\widehat{d}_{ij}^{\,2}$ is unbiased and is concentrated around its expected value $d_{ij}^2$.

Theorem 1.

The estimator $\widehat{d}_{ij}^{\,2}$ for $d_{ij}^2$ is unbiased, i.e., $\mathbb{E}[\widehat{d}_{ij}^{\,2}] = d_{ij}^2$. Further, for any $\epsilon > 0$,

(7)

where $|S|$ is the size of the neighborhood $S$, and $2T$ is the number of observations.

Proof.

Here we only highlight the important steps to present the intuition. The basic idea of the proof is to zoom in on a microscopic term $\widehat{r}_{ij}^{(k)}$ of $\widehat{d}_{ij}^{\,2}$ and show that it is unbiased. To this end, we use the fact that the averages over the first $T$ graphs and over the last $T$ graphs are conditionally independent given the labels $(u_i, u_j, u_k)$ to show

$$\mathbb{E}\left[ \widehat{r}_{ij}^{(k)} \mid u_i, u_j, u_k \right] = w(u_i, u_k)\, w(u_j, u_k),$$

which then implies $\mathbb{E}[\widehat{r}_{ij}^{(k)}] = \mathbb{E}[w(u_i, u_k)\, w(u_j, u_k)]$, and by iterated expectation we have $\mathbb{E}[\widehat{d}_{ij}^{\,2}] = d_{ij}^2$. The concentration inequality follows from a similar idea: bound the variance of $\widehat{d}_{ij}^{\,2}$ and apply Bernstein’s inequality. ∎

That the two sample averages are conditionally independent given the labels is a critical fact for the success of the proposed algorithm. It also explains why at least 2 independently observed graphs are necessary, for otherwise the expectation of the product above cannot be separated into a product of expectations.

3.2 Choosing the number of blocks

The performance of Algorithm 1 is sensitive to the number of blocks it defines. On the one hand, it is desirable to have more blocks so that the graphon can be finely approximated. On the other hand, if the number of blocks is too large, then each block will contain only a few vertices, which is problematic because estimating the value on each block requires a sufficient number of vertices per block. The trade-off between these two cases is controlled by the precision parameter $\Delta$: a large $\Delta$ generates few large clusters, while a small $\Delta$ generates many small clusters. A precise relationship between $\Delta$ and $K$, the number of blocks generated by the algorithm, is given in Theorem 2.

Theorem 2.

Let $\Delta$ be the accuracy parameter and $K$ be the number of blocks estimated by Algorithm 1. Then

(8)

where $L$ is the Lipschitz constant and $Q$ is the number of Lipschitz blocks in $w$.

In practice, we estimate $\Delta$ using a cross-validation scheme for finding the optimal bin width of a 2D histogram (Wasserman 2005). The idea is to test a sequence of candidate values of $\Delta$ and pick the one that minimizes the cross-validation risk, defined as

(9)

where the quantities in (9) are computed from the blocks returned by Algorithm 1 for each candidate $\Delta$. Algorithm 2 details the proposed cross-validation scheme.

  Input: Graphs $G^{(1)}, \ldots, G^{(2T)}$.
  Output: Blocks $\widehat{B}_1, \ldots, \widehat{B}_K$, and the optimal $\Delta$.
  for a sequence of candidate $\Delta$’s do
     Estimate the blocks $\widehat{B}_1, \ldots, \widehat{B}_K$ from $G^{(1)}, \ldots, G^{(2T)}$. [Algorithm 1]
     Compute the empirical block frequencies $\widehat{p}_k$, for $k = 1, \ldots, K$.
     Compute the cross-validation risk $\widehat{J}(\Delta)$ in (9).
  end for
  Pick the $\Delta$ with minimum $\widehat{J}(\Delta)$, and the corresponding blocks.
Algorithm 2 Cross Validation
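The selection loop can be sketched in code. Since the exact histogram risk in (9) is specific to the paper, the stand-in below scores each candidate $\Delta$ by held-out squared error instead: fit blocks on one half of the observed graphs, then compare the predicted edge probabilities against the empirical frequencies of the other half. The helper names (`select_delta`, `fit`, `predict`) are ours, and this held-out risk is a substitute for, not a reproduction of, (9).

```python
import numpy as np

def select_delta(graphs, deltas, fit, predict):
    """Pick the precision parameter by a held-out scheme.

    fit(train, delta)        -> (blocks, what), e.g., Algorithm 1 plus the
                                histogram step.
    predict(blocks, what, n) -> n x n matrix of estimated edge probabilities.
    """
    half = len(graphs) // 2
    train, test = graphs[:half], graphs[half:]
    n = graphs.shape[1]
    target = test.mean(axis=0)               # held-out edge frequencies
    risks = []
    for delta in deltas:
        blocks, what = fit(train, delta)
        P = predict(blocks, what, n)
        risks.append(np.mean((target - P) ** 2))
    return deltas[int(np.argmin(risks))], risks
```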

3.3 Consistency of $\widehat{w}$

The goal of our next theorem is to show that $\widehat{w}$ is a consistent estimate of $w$, i.e., the estimation error vanishes as $n \to \infty$. To begin, let us first recall two commonly used metrics:

Definition 1.

The mean squared error (MSE) and mean absolute error (MAE) are defined as

$$\mathrm{MSE} = \mathbb{E}\left[ \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( w(u_i, u_j) - \widehat{w}(u_i, u_j) \right)^2 \right], \qquad \mathrm{MAE} = \mathbb{E}\left[ \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left| w(u_i, u_j) - \widehat{w}(u_i, u_j) \right| \right].$$
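In code, given the matrices of true and estimated edge probabilities, these empirical metrics are one-liners (the function name `mse_mae` is ours):

```python
import numpy as np

def mse_mae(W, What):
    """Empirical MSE and MAE between the matrices of true and estimated
    edge probabilities, W[i, j] = w(u_i, u_j) and its estimate What."""
    diff = np.asarray(W) - np.asarray(What)
    return np.mean(diff ** 2), np.mean(np.abs(diff))
```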

Theorem 3.

If and , then

Proof.

The details of the proof can be found in the supplementary material. Here we only outline the key steps to convey the intuition. The goal of Theorem 3 is to show convergence of the estimation error. The idea is to consider the following two quantities:

so that if we can bound both quantities, then consequently the overall error can also be bounded.

The bound for the first term is shown in Lemma 1: by Algorithm 1, any vertex is guaranteed to be within a distance $\Delta$ from the pivot of its block. Since $\widehat{w}$ is an average over vertices and observations, a probability bound involving $\Delta$ can be obtained via Theorem 1.

The bound for the second term is shown in Lemma 2. Different from Lemma 1, here we need to consider two possible situations: either the intermediate estimate is close to the ground truth, or it is far from the ground truth. This accounts for the sum of two terms in Lemma 2. The individual bounds are derived based on Lemma 1 and Theorem 1.

Combining Lemma 1 and Lemma 2, we can then bound the error and show convergence. ∎

Lemma 1.

For any and ,

(10)
Lemma 2.

For any and ,

(11)

The condition on $\Delta$ is necessary for Theorem 3 to hold, because if $\Delta$ is independent of $n$, it is not possible to drive (10) and (11) to zero even as $n \to \infty$. The other condition, on $|S|$, is also important, as it forces the numerators and denominators in the exponentials of (10) and (11) to be well behaved.

4 Experiments

In this section we evaluate the proposed SBA algorithm empirically. For comparison, we consider (i) universal singular value thresholding (USVT) (Chatterjee 2012); (ii) the largest-gap algorithm (LG) (Channarond et al. 2012); (iii) matrix completion from few entries (OptSpace) (Keshavan et al. 2010).

4.1 Estimating stochastic blockmodels

Accuracy as a function of growing graph size.

Our first experiment evaluates the proposed SBA algorithm on estimating stochastic blockmodels. For this purpose, we generate (arbitrarily) a graphon

(12)

which represents a piecewise constant function with equally spaced blocks.

(a) Growing graph size. (b) Growing number of observations.
Figure 2: (a) MAE decreases as the graph size grows. To keep the amount of data comparable, we use two observations for SBA, and one observation for USVT (Chatterjee 2012) and LG (Channarond et al. 2012). (b) MAE of the proposed SBA algorithm decreases when more observations are available. Both plots are averaged over 100 independent trials.

Since USVT and LG use only one observed graph whereas the proposed SBA requires at least two observations, to make the comparison fair we use half of the nodes for SBA and generate two independent observed graphs. For USVT and LG, we use one observed graph.

Figure 2(a) shows the asymptotic behavior of the algorithms as the graph size $n$ grows. Figure 2(b) shows the estimation error of the SBA algorithm as the number of observations grows, for graphs of 200 vertices.

Accuracy as a function of growing number of blocks.

Our second experiment evaluates the performance of the algorithms as $K$, the number of blocks, increases. To this end, we consider a sequence of values of $K$, and for each $K$ we generate a graphon with $K \times K$ blocks, each block taking an independently drawn random value. As in the previous experiment, we fix the graph size and the number of observations. The experiment is repeated over 100 trials so that in every trial a different graphon is generated. The results in Figure 3(a) indicate that while the estimation error increases as $K$ grows, the proposed SBA algorithm still attains the lowest MAE for all $K$.

(a) Growing number of blocks. (b) Missing links.
Figure 3: (a) As $K$ increases, the MAE of all three algorithms increases, but SBA still attains the lowest MAE. Here, we use two observations for SBA, and one observation for USVT (Chatterjee 2012) and LG (Channarond et al. 2012). (b) Estimation of the graphon in the presence of missing links: as the fraction of missing links increases, the estimation error also increases.

4.2 Estimation with missing edges

Our next experiment evaluates the performance of the proposed SBA algorithm when there are missing edges in the observed graphs. To model missing edges, we construct an $n \times n$ binary mask matrix $M$ whose entries are independently 1 with probability $1 - \xi$, where $\xi$ defines the fraction of missing edges. Given $\xi$, one mask matrix $M^{(t)}$ is generated per observation, and the observed graphs are defined as $G^{(t)} \circ M^{(t)}$, where $\circ$ denotes element-wise multiplication. The goal is to study how well SBA can reconstruct the graphon in the presence of missing links.

The modification of the proposed SBA algorithm for the case of missing links is minimal: when computing (6), instead of averaging over all vertex pairs and observations, we average only over the entries that are not masked out. Figure 3(b) shows the result averaged over 100 independent trials. Here, we consider the graphon given in (12), with fixed graph size and number of observations. It is evident that SBA outperforms its counterparts when the rate of missing links is low.
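The masked version of the histogram step can be sketched as follows: each block average uses only the entries whose mask is 1, excluding unobserved pairs from both numerator and denominator. The function name `masked_block_estimate` is ours.

```python
import numpy as np

def masked_block_estimate(graphs, masks, blocks):
    """Block-histogram estimate when some edges are unobserved.

    graphs, masks: arrays of shape (2T, n, n); masks[t, i, j] = 1 when the
    pair (i, j) is observed in graph t. Unobserved entries are excluded
    from both numerator and denominator of the block average.
    """
    obs = np.sum(graphs * masks, axis=0)     # observed edge counts
    cnt = np.sum(masks, axis=0)              # observation counts per pair
    K = len(blocks)
    what = np.zeros((K, K))
    for a, Ba in enumerate(blocks):
        for b, Bb in enumerate(blocks):
            den = cnt[np.ix_(Ba, Bb)].sum()
            what[a, b] = obs[np.ix_(Ba, Bb)].sum() / den if den > 0 else 0.0
    return what
```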

4.3 Estimating continuous graphons

Our final experiment evaluates the proposed SBA algorithm on estimating continuous graphons. Here, we consider two of the graphons reported in (Chatterjee 2012):

where $u, v \in [0, 1]$. Here, one of these graphons can be considered as a special case of the eigenmodel (Hoff 2008) or the latent feature relational model (Miller et al. 2009).

The results in Figure 4 show that while both algorithms improve as $n$ grows, the relative performance depends on which graphon is being estimated. This suggests that in practice the choice of algorithm should depend on the expected structure of the graphon: if the graph generated by the graphon exhibits low-rank structure, then USVT is likely the better option, whereas for more structured or complex graphons the proposed procedure is recommended.

(a) First graphon. (b) Second graphon.
Figure 4: Comparison between SBA and USVT in estimating the two continuous graphons. Evidently, SBA performs better for the high-rank graphon and worse for the low-rank one.

5 Concluding remarks

We presented a new computational tool for estimating graphons. The proposed algorithm approximates the continuous graphon by a stochastic block-model, in which the first step is to cluster the unknown vertex labels into blocks by using an empirical estimate of the distance between two graphon slices, and the second step is to build an empirical histogram to estimate the graphon. Complete consistency analysis of the algorithm is derived. The algorithm was evaluated experimentally, and we found that the algorithm is effective in estimating block structured graphons.

Code. An implementation of the stochastic blockmodel approximation (SBA) algorithm proposed in this paper is available online at: https://github.com/airoldilab/SBA

Acknowledgments. EMA is partially supported by NSF CAREER award IIS-1149662, ARO MURI award W911NF-11-1-0036, and an Alfred P. Sloan Research Fellowship. SHC is partially supported by a Croucher Foundation Post-Doctoral Research Fellowship.

References

  • Airoldi et al. (2008) E.M. Airoldi, D.M. Blei, S.E. Fienberg, and E.P. Xing. Mixed-membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014, 2008.
  • Aldous (1981) D.J. Aldous. Representations for partially exchangeable arrays of random variables. Journal of Multivariate Analysis, 11:581–598, 1981.
  • Azari and Airoldi (2012) H. Azari and E. M. Airoldi. Graphlet decomposition of a weighted network. Journal of Machine Learning Research, W&CP, 22:54–63, 2012.
  • Bickel and Chen (2009) P.J. Bickel and A. Chen. A nonparametric view of network models and Newman-Girvan and other modularities. Proc. Natl. Acad. Sci. USA, 106:21068–21073, 2009.
  • Bickel et al. (2011) P.J. Bickel, A. Chen, and E. Levina. The method of moments and degree distributions for network models. Annals of Statistics, 39(5):2280–2301, 2011.
  • Borgs et al. (2006) C. Borgs, J. Chayes, L. Lovász, V. T. Sós, B. Szegedy, and K. Vesztergombi. Graph limits and parameter testing. In Proc. ACM Symposium on Theory of Computing, pages 261–270, 2006.
  • Channarond et al. (2012) A. Channarond, J. Daudin, and S. Robin. Classification and estimation in the Stochastic Blockmodel based on the empirical degrees. Electronic Journal of Statistics, 6:2574–2601, 2012.
  • Chatterjee (2012) S. Chatterjee. Matrix estimation by universal singular value thresholding. ArXiv:1212.1247, 2012.
  • Choi and Wolfe (2012) D.S. Choi and P.J. Wolfe. Co-clustering separately exchangeable network data. ArXiv:1212.4093, 2012.
  • Choi et al. (2012) D.S. Choi, P.J. Wolfe, and E.M. Airoldi. Stochastic blockmodels with a growing number of classes. Biometrika, 99:273–284, 2012.
  • Diaconis and Janson (2008) P. Diaconis and S. Janson. Graph limits and exchangeable random graphs. Rendiconti di Matematica e delle sue Applicazioni, Series VII, pages 33–61, 2008.
  • Fienberg et al. (1985) S. E. Fienberg, M. M. Meyer, and S. Wasserman. Statistical analysis of multiple sociometric relations. Journal of the American Statistical Association, 80:51–67, 1985.
  • Goldenberg et al. (2009) A. Goldenberg, A.X. Zheng, S.E. Fienberg, and E.M. Airoldi. A survey of statistical network models. Foundations and Trends in Machine Learning, 2:129–233, 2009.
  • Handcock et al. (2007) M. Handcock, A. E. Raftery, and J. Tantrum. Model-based clustering for social networks (with discussion). Journal of the Royal Statistical Society, Series A, 170:301–354, 2007.
  • Hoff (2008) P.D. Hoff. Modeling homophily and stochastic equivalence in symmetric relational data. In Neural Information Processing Systems (NIPS), volume 20, pages 657–664, 2008.
  • Hoff et al. (2002) P.D. Hoff, A.E. Raftery, and M.S. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090–1098, 2002.
  • Hoover (1979) D.N. Hoover. Relations on probability spaces and arrays of random variables. Preprint, Institute for Advanced Study, Princeton, NJ, 1979.
  • Kallenberg (1989) O. Kallenberg. On the representation theorem for exchangeable arrays. Journal of Multivariate Analysis, 30(1):137–154, 1989.
  • Keshavan et al. (2010) R.H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Trans. Information Theory, 56:2980–2998, Jun. 2010.
  • Kolaczyk (2009) E. D. Kolaczyk. Statistical Analysis of Network Data: Methods and Models. Springer, 2009.
  • Lawrence (2005) N.D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783–1816, 2005.
  • Lloyd et al. (2012) J.R. Lloyd, P. Orbanz, Z. Ghahramani, and D.M. Roy. Random function priors for exchangeable arrays with applications to graphs and relational data. In Neural Information Processing Systems (NIPS), 2012.
  • Lovász and Szegedy (2006) L. Lovász and B. Szegedy. Limits of dense graph sequences. Journal of Combinatorial Theory, Series B, 96:933–957, 2006.
  • Miller et al. (2009) K.T. Miller, T.L. Griffiths, and M.I. Jordan. Nonparametric latent feature models for link prediction. In Neural Information Processing Systems (NIPS), 2009.
  • Nowicki and Snijders (2001a) K. Nowicki and T. A. B. Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96:1077–1087, 2001.
  • Orbanz and Roy (2013) P. Orbanz and D.M. Roy. Bayesian models of graphs, arrays and other exchangeable random structures, 2013. Unpublished manuscript.
  • Latouche and Robin (2013) P. Latouche and S. Robin. Bayesian model averaging of stochastic block models to estimate the graphon function and motif frequencies in a W-graph model. ArXiv:1310.6150, October 2013. Unpublished manuscript.
  • Rohe et al. (2011) K. Rohe, S. Chatterjee, and B. Yu. Spectral clustering and the high-dimensional stochastic blockmodel. Annals of Statistics, 39(4):1878–1915, 2011.
  • Tang et al. (2013) M. Tang, D.L. Sussman, and C.E. Priebe. Universally consistent vertex classification for latent positions graphs. Annals of Statistics, 2013. In press.
  • Wasserman (2005) L. Wasserman. All of Nonparametric Statistics. Springer, 2005.
  • Wolfe and Olhede (2013) P.J. Wolfe and S.C. Olhede. Nonparametric graphon estimation. ArXiv:1309.5936, September 2013. Unpublished manuscript.
  • Xu et al. (2012) Z. Xu, F. Yan, and Y. Qi. Infinite Tucker decomposition: nonparametric Bayesian models for multiway data analysis. In Proc. Intl. Conf. Machine Learning (ICML), 2012.
  • Zhao et al. (2011) Y. Zhao, E. Levina, and J. Zhu. Community extraction for social networks. In Proc. Natl. Acad. Sci. USA, volume 108, pages 7321–7326, 2011.

Appendix A Proofs for Section 3.1

Theorem 1.

The estimator $\widehat{d}_{ij}^{\,2}$ for $d_{ij}^2$ is unbiased. Further, for any $\epsilon > 0$, if the graph is directed, then

(13)

and if the graph is un-directed, then

(14)

where $|S|$ is the size of the sampling neighborhood $S$, and $2T$ is the number of observations.

Proof.

First, for given $i$ and $j$, let us define the following two quantities

Consequently, we can express $d_{ij}^2$ in terms of these quantities.

In order to study $\widehat{d}_{ij}^{\,2}$ (the estimator of $d_{ij}^2$), it is desirable to express $\widehat{d}_{ij}^{\,2}$ in the same form as $d_{ij}^2$:

(15)

where $S$ is the sampling neighborhood. In (15), the individual components are defined as

Thus, if we can show that $\widehat{r}_{ij}^{(k)}$ and $\widehat{c}_{ij}^{(k)}$ are unbiased estimators of the corresponding products $w(u_i, u_k)\, w(u_j, u_k)$ and $w(u_k, u_i)\, w(u_k, u_j)$, then by linearity of expectation, $\widehat{d}_{ij}^{\,2}$ will be an unbiased estimator of $d_{ij}^2$.

To this end, we consider the conditional expectation of $\widehat{r}_{ij}^{(k)}$ given the labels $(u_i, u_j, u_k)$:

(16)

Therefore,

(17)

Then, by the law of iterated expectations, we have

(18)

Therefore, $\widehat{r}_{ij}^{(k)}$ is an unbiased estimator. The corresponding claim for $\widehat{c}_{ij}^{(k)}$ can be proved similarly by switching the roles of rows and columns. Since both are unbiased, $\widehat{d}_{ij}^{\,2}$ must be unbiased.

Now we proceed to prove the second part of the theorem. We first claim that

(19)

To prove this, we note that

We consider three cases:

Case 1. First assume and . (Occurs times.)