# From which world is your graph?

Cheng Li
College of William & Mary
Felix M. F. Wong
Zhenming Liu
College of William & Mary
University of Oxford and The Alan Turing Institute
###### Abstract

Discovering statistical structure from links is a fundamental problem in the analysis of social networks. Choosing a misspecified model, or equivalently, an incorrect inference algorithm will result in an invalid analysis or even falsely uncover patterns that are in fact artifacts of the model. This work focuses on unifying two of the most widely used link-formation models: the stochastic blockmodel (SBM) and the small world (or latent space) model (SWM). Integrating techniques from kernel learning, spectral graph theory, and nonlinear dimensionality reduction, we develop the first statistically sound polynomial-time algorithm to discover latent patterns in sparse graphs for both models. When the network comes from an SBM, the algorithm outputs a block structure. When it is from an SWM, the algorithm outputs estimates of each node’s latent position.

## 1 Introduction

Discovering statistical structures from links is a fundamental problem in the analysis of social networks. Connections between entities are typically formed based on underlying feature-based similarities; however, these features themselves are partially or entirely hidden. A question of great interest is to what extent these latent features can be inferred from the observable links in the network. This work focuses on the so-called assortative setting, the principle that similar individuals are more likely to interact with each other. Most stochastic models of social networks rely on this assumption, including the two most famous ones, the stochastic blockmodel [HLL83] and the small-world model [WS98, Kle00], described below.

Stochastic Blockmodel (SBM). In a stochastic blockmodel [YP16, MNS15, AS15b, AS15a, Mas14, MNS13, BC09, LLDM08, NG04, NWS02], nodes are grouped into disjoint “communities” and links are added randomly between nodes, with a higher probability if nodes are in the same community. In its simplest incarnation, an edge is added between nodes within the same community with probability $p$, and between nodes in different communities with probability $q$, for $p > q$. Despite arguably naïve modelling choices, such as the independence of edges, algorithms designed with SBM work well in practice [McS01, LLM10].

Small-World Model (SWM). In a small-world model, each node is associated with a latent variable , e.g., the geographic location of an individual. The probability that there is a link between two nodes is proportional to an inverse polynomial of some notion of distance, , between them. The presence of a small number of “long-range” connections is essential to some of the most intriguing properties of these networks, such as small diameter and fast decentralized routing algorithms [Kle00]. In general, the latent position may reflect geographic location as well as more abstract concepts, e.g., position on a political ideology spectrum.

The Inference Problem. Without observing the latent positions, or knowing which model generates the underlying graph, the adjacency matrix of a social graph typically looks like the one shown in Fig. 5(a) (App. A.1). However, if the model generating the graph is known, it is then possible to run a suitable “clustering algorithm” [McS01, ACKS13] that reveals the hidden structure. When the vertices are ordered suitably, the SBM’s adjacency matrix looks like the one shown in Fig. 5(b) (App. A.1) and that of the SWM looks like the one shown in Fig. 5(c) (App. A.1). Existing algorithms typically depend on knowing the “true” model and are tailored to graphs generated according to one of these models, e.g., [McS01, ACKS13, Bar12, BJN15].

Our Contributions. We consider a latent space model that is general enough to include both these models as special cases. In our model, an edge is added between two nodes with a probability that is a decreasing function of the distance between their latent positions. This model is a fairly natural one, and it is quite likely that a variant has already been studied; however, to the best of our knowledge there is no known statistically sound and computationally efficient algorithm for latent-position inference on a model as general as the one we consider.

1. A unified model. We propose a model that is a natural generalization of both the stochastic blockmodel and the small-world model that captures some of the key properties of real-world social networks, such as small out-degrees for ordinary users and large in-degrees for celebrities. We focus on a simplified model where we have a modest degree graph only on “celebrities”; the supplementary material contains an analysis of the more realistic model using somewhat technical machinery.

2. A provable algorithm. We present statistically sound and polynomial-time algorithms for inferring latent positions in our model(s). Our algorithm approximately infers the latent positions of almost all “celebrities” (-fraction), and approximately infers a constant fraction of the latent positions of ordinary users. We show that it is statistically impossible to err on at most fraction of ordinary users by using standard lower bound arguments.

3. Proof-of-concept experiments. We report several experiments on synthetic data and on real-world data collected on Twitter between Oct 1 and Nov 30, 2016. Our experiments demonstrate that our model and inference algorithms perform well on real-world data and reveal interesting structures in networks.

Additional Related Work. We briefly review the relevant published literature. 1. Graphon-based techniques. Studies using graphons to model networks have focused on the statistical properties of the estimators [HRH01, ABFX08, RCY11, ACC13, PJW13, TSP13, WC14, KMS16, RQY16], with limited attention paid to computational efficiency. The “USVT” technique developed recently [Cha15] estimates the kernel well when the graph is dense. Xu et al. [XML14] consider a polynomial time algorithm for a sparse model similar to ours, but focus on edge classification rather than latent position estimation. 2. Correspondence analysis in political science. Estimating the ideology scores of politicians is an important research topic in political science [PR85, LBG03, CJR04, GB12, GB11, GS13, Bar12, BJN15]. High accuracy heuristics developed to analyze dense graphs include [Bar12, BJN15].

Organization. Section 2 describes background, our model and results. Section 3 describes our algorithm and gives an overview of its analysis. Section 4 contains the experiments.

## 2 Preliminaries and Summary of Results

Basic Notation. We use , etc. to denote constants which may be different in each case. We use whp to denote with high probability, by which we mean with probability larger for any . All notation is summarized in Appendix H for quick reference.

Stochastic Blockmodel. Let be the number of nodes in the graph with each node assigned a label from the set uniformly at random. An edge is added between two nodes with the same label with probability and between the nodes with different labels with probability , with (assortative case). In this work, we focus on the case, where and the community sizes are exactly the same. (Many studies of the regimes where recovery is possible have been published [Abb16, MNS13, MNS15, Mas14].)

Let be the adjacency matrix of the realized graph and let , where with every entry equal to and , respectively. We next explain the inference algorithm, which uses two key observations. 1. Spectral Properties of . has rank and the non-trivial eigenvectors are and corresponding to eigenvalues and , respectively. If one has access to , the hidden structure in the graph is revealed merely by reading off the second eigenvector. 2. Low Discrepancy between and . Provided the average degree and the gap are large enough, the spectrum and eigenspaces of the matrices and can be shown to be close using matrix concentration inequalities and the Davis-Kahan theorem [Tro12, DK70]. Thus, it is sufficient to look at the projection of the columns of onto the top two eigenvectors of to identify the hidden latent structure.
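The two observations above can be illustrated with a short simulation. This is a minimal sketch, not the authors' implementation: it assumes a two-community SBM with equal block sizes, within-community probability `p` and cross-community probability `q` (values chosen only for illustration), and the helper names `sample_sbm`, `spectral_partition`, and `agreement` are hypothetical.

```python
import numpy as np


def sample_sbm(n, p, q, rng):
    """Sample a symmetric adjacency matrix from a 2-block SBM (n must be even)."""
    labels = np.repeat([0, 1], n // 2)            # first half block 0, second half block 1
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p, q)                  # expected adjacency (off the diagonal)
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    A = (upper | upper.T).astype(float)           # symmetrize, no self-loops
    return A, labels


def spectral_partition(A):
    """Split nodes by the sign of the eigenvector of the second-largest eigenvalue."""
    _, vecs = np.linalg.eigh(A)                   # eigenvalues returned in ascending order
    second = vecs[:, -2]
    return (second > 0).astype(int)


def agreement(pred, labels):
    """Fraction of correctly labeled nodes, up to swapping the two block names."""
    acc = float(np.mean(pred == labels))
    return max(acc, 1.0 - acc)


rng = np.random.default_rng(0)
A, labels = sample_sbm(200, 0.5, 0.1, rng)
print(agreement(spectral_partition(A), labels))
```

The sketch reads the partition directly off the realized matrix; it works here because the illustrative graph is dense and the gap between `p` and `q` is large, which is the regime the concentration argument above requires.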

Small-World Model (SWM). In a 1-dim. SWM, each node is associated with an independent latent variable that is drawn from the uniform distribution on . The probability of a link between two nodes is , where is a hyper-parameter.

The inference algorithm for small-world models uses different ideas. Each edge in the graph is considered as either “short-range” or “long-range.” Short-range edges are those between nodes that are nearby in latent space, while long-range edges have end-points that are far away in latent space. After removing the long-range edges, the shortest path distance between two nodes scales proportionally to the corresponding latent space distance (see Fig. 6 in App. A.2). After obtaining estimates for pairwise distances, standard building blocks are used to find the latent positions [IM04a]. The key observation used to remove the long-range edges is: an edge is a short-range edge if and only if its two endpoints share many neighbors.
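The common-neighbor observation can be sketched in a few lines. The following is an illustration under stated assumptions rather than the procedure from the cited work: the function name `short_range_edges` and the threshold `min_common` are hypothetical, and the text here does not specify how large "many" must be.

```python
import numpy as np


def short_range_edges(A, min_common):
    """Keep only edges whose endpoints share at least `min_common` neighbors.

    A is a symmetric 0/1 adjacency matrix with a zero diagonal.
    """
    common = A @ A                       # common[i, j] = number of shared neighbors of i and j
    return (A > 0) & (common >= min_common)


# Two 4-cliques joined by a single "long-range" edge between nodes 0 and 4.
A = np.zeros((8, 8), dtype=int)
for block in (range(0, 4), range(4, 8)):
    for i in block:
        for j in block:
            if i != j:
                A[i, j] = 1
A[0, 4] = A[4, 0] = 1

kept = short_range_edges(A, min_common=2)
print(kept[0, 1], kept[0, 4])
```

In this toy graph the clique edges survive because each pair shares two neighbors, while the bridging edge is dropped because its endpoints share none.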

A Unified Model. Both SBM and SWM are special cases of our unified latent space model. We begin by describing the full-fledged bipartite (heterogeneous) model that is a better approximation of real-world networks, but requires sophisticated algorithmic techniques (see Appendix C for a detailed analysis). Next, we present a simplified (homogeneous) model to explain the key ideas.

Bipartite Model. We use a graphon model to characterize the stochastic interactions between users. Each individual is associated with a latent variable in . The bipartite graph model consists of two types of users: the left side of the graph are the followers (ordinary users) and the right side are the influencers (celebrities). Both and are i.i.d. random variables from a distribution . This assumption follows the convention of existing heterogeneous models [ZLZ12, QR13a]. The probability that two individuals and interact is , where is a kernel function (these are sometimes referred to as graphon-based models [Lov12, PJW13, ACC13]). Throughout this paper we assume that is a small-world kernel, i.e., for some and suitable constants , and that . Let be a binary matrix that if and only if there is an edge between and . Our goal is to estimate based on for suitably large .

Simplified Model. The graph has only the node set of celebrity users. Each is again an i.i.d. random variable from . The probability that two users and interact is . The denominator is a normalization term that controls the edge density of the graph. We assume , i.e., the average degree is . Unlike the SWM where the are drawn uniformly from , in the unified model can be flexible. When is the uniform distribution, the model is the standard SWM. When has discrete support (e.g., with prob. 1/2 and otherwise), then the unified model reduces to the SBM. Our distribution-agnostic algorithm can automatically select the most suitable model from SBM and SWM, and infer the latent positions of (almost) all the nodes.
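To make the reduction to SBM and SWM concrete, here is a minimal sampling sketch. Everything specific in it is an assumption made for illustration: the kernel form `c0 / (|x - y|**delta + c1)`, its constants, the normalization `C`, and the names `small_world_kernel` and `sample_graph` are not taken from the text above, which describes the kernel only as a small-world (inverse-polynomial) kernel.

```python
import numpy as np


def small_world_kernel(x, y, delta=2.0, c0=0.1, c1=0.1):
    """Assumed inverse-polynomial kernel; takes values in (0, 1] when c0 <= c1."""
    return c0 / (np.abs(x - y) ** delta + c1)


def sample_graph(x, C, rng, kernel=small_world_kernel):
    """Sample a symmetric graph where i ~ j with probability kernel(x_i, x_j) / C."""
    n = len(x)
    probs = kernel(x[:, None], x[None, :]) / C
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    return (upper | upper.T).astype(int)


rng = np.random.default_rng(0)
n = 300

# Uniform latent positions: a small-world-style graph.
A_swm = sample_graph(rng.random(n), C=2.0, rng=rng)

# Two-point latent distribution: every pair is either "same" or "different",
# so edge probabilities take only two values, as in a two-block SBM.
A_sbm = sample_graph(rng.choice([0.0, 1.0], size=n), C=2.0, rng=rng)
print(A_swm.sum() // 2, A_sbm.sum() // 2)
```

Only the latent distribution changes between the two calls; the kernel and the sampling code are shared, which is the sense in which one model covers both cases.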

Bipartite vs. Simplified Model. The simplified model suffers from the following problem: If the average degree is , then we err on estimating every individual’s latent position with a constant probability (e.g., whp the graph is disconnected), but in practice we usually want a high prediction accuracy on the subset of nodes corresponding to high-profile users. Assuming that the average degree is mismatches empirical social network data. Therefore, we use a bipartite model that introduces heterogeneity among nodes: By splitting the nodes into two classes, we achieve high estimation accuracy on the influencers and the degree distribution more closely matches real-world data. For example, in most online social networks, nodes have average degree, and a small fraction of users (influencers) account for the production of almost all “trendy” content while most users (followers) simply consume the content.

Additional Remarks on the Bipartite Model. 1. Algorithmic contribution. Our algorithm computes and then regularizes the product by shrinking the diagonal entries before carrying out spectral analysis. Previous studies of the bipartite graph in similar settings [Dhi01, ZRMZ07, WTSC16] attempt to construct a regularized product using different heuristics. Our work presents the first theoretically sound regularization technique for spectral algorithms. In addition, some studies have suggested running SVD on directly (e.g., [RQY16]). We show that the (right) singular vectors of do not converge to the eigenvectors of (the matrix with entries ). Thus, it is necessary to take the product and use regularization. 2. Comparison to degree-corrected models (DCM). In DCM, each node is associated with a degree parameter . Then we have . The DCM model implies the subgraph induced by the highest degree nodes is dense, which is inconsistent with real-world networks. There is a need for better tools to analyze the asymptotic behavior of such models and we leave this for future work (see, e.g., [ZLZ12, QR13a]).

Theoretical Results. Let be the cdf of . We say and are well-conditioned if:
(1) has finitely many points of discontinuity, i.e., the closure of the support of can be expressed as the union of non-overlapping closed intervals , , …, for a finite number .
(2) is near-uniform, i.e., for any interval that has non-empty overlap with ’s support, , for some constant .
(3) Decay Condition: The eigenvalues of the integral operator based on and decay sufficiently fast. We define the and let denote the eigenvalues of . Then, it holds that .

If we use the small-world kernel and choose that give rise to SBM or SWM, in each case the pair and are well-conditioned, as described below. As the decay condition is slightly more involved, we comment upon it. The condition is a mild one. When is uniformly distributed on , it is equivalent to requiring to be twice differentiable, which is true for the small-world kernel (Theorem F.2). When has a finite discrete support, there are only finitely many non-zero eigenvalues, i.e., this condition also holds. The decay condition holds in more general settings, e.g., when is piecewise linear [Kön86] (see App. F). Without the decay condition, we would require much stronger assumptions: Either the graph is very dense or . Neither of these assumptions is realistic, so effectively our algorithm fails to work. In practice, whether the decay condition is satisfied can be checked by making a log-log plot, and it has been observed that for several real-world networks the eigenvalues follow a power-law distribution [MP02].
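The log-log check mentioned above can be approximated numerically without plotting. The sketch below is an assumption-laden illustration: it fits a straight line to log-eigenvalue against log-rank for the `k` largest eigenvalues of a symmetric matrix, and the function name `loglog_decay_slope` is hypothetical. A clearly negative and roughly constant slope is consistent with power-law decay; the fit alone does not certify the decay condition.

```python
import numpy as np


def loglog_decay_slope(M, k):
    """Least-squares slope of log(|eigenvalue|) versus log(rank) for the top-k eigenvalues.

    M must be symmetric, and its k largest eigenvalues (in magnitude) must be nonzero.
    """
    vals = np.sort(np.abs(np.linalg.eigvalsh(M)))[::-1][:k]
    ranks = np.arange(1, k + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(vals), 1)
    return float(slope)


# Sanity check on a matrix whose eigenvalues are exactly i^(-2).
M = np.diag(1.0 / np.arange(1, 51) ** 2)
print(loglog_decay_slope(M, 20))
```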

Next, we define the notion of latent position recovery for our algorithms.

###### Definition 2.1 ((α,β,γ)-Approximation Algorithm).

Let , , and be defined as above, and let . An algorithm is called an -approximation algorithm if
1. It outputs a collection of disjoint points such that , which correspond to subsets of reconstructed latent variables.
2. For each , it produces a distance matrix . Let be such that for any

$$D^{(i)}_{i_j,i_k} \le |x_{i_j}-x_{i_k}| \le (1+\beta)\,D^{(i)}_{i_j,i_k}+\gamma. \tag{1}$$

3. .
In bipartite graphs, Eq.(1) is required for only influencers.

We do not attempt to optimize constants in this paper. We set , a small constant, and . Definition 2.1 allows two types of errors: s are not required to form a partition, i.e., some nodes can be left out, and a small fraction of estimation errors is allowed in , e.g., if but , then the -th “row” in is incorrect. To interpret the definition, consider the blockmodel with communities. Condition 1 means that our algorithm will output two disjoint groups of points. Each group corresponds to one block. Condition 2 means that there are pairwise distance estimates within each group. Since the true distances for nodes within the same block are zero, our estimates must also be zero to satisfy Eq. (1). Condition 3 says that the portion of misclassified nodes is . We can also interpret the definition when we consider a small-world graph, in which case . The algorithm outputs pairwise distances for a subset . We know that there is a sufficiently large such that the pairwise distances are all correct in .
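The two-sided bound in Eq. (1) is easy to test on synthetic data. The helper below is a hypothetical checker written for illustration: given true latent positions `x` for one output group and a candidate distance matrix `D`, it reports whether every pair satisfies the inequality for given `beta` and `gamma`. The small tolerance `tol` is an added assumption to absorb floating-point rounding.

```python
import numpy as np


def satisfies_eq1(x, D, beta, gamma, tol=1e-12):
    """Check D_ij <= |x_i - x_j| <= (1 + beta) * D_ij + gamma for all pairs."""
    true_dist = np.abs(x[:, None] - x[None, :])
    lower_ok = D <= true_dist + tol
    upper_ok = true_dist <= (1.0 + beta) * D + gamma + tol
    return bool(np.all(lower_ok & upper_ok))


x = np.array([0.0, 0.2, 0.5])
true_dist = np.abs(x[:, None] - x[None, :])

# Underestimating by a factor (1 + beta) is allowed; overestimating is not.
print(satisfies_eq1(x, true_dist / 1.1, beta=0.1, gamma=0.0))
print(satisfies_eq1(x, 2.0 * true_dist, beta=0.1, gamma=0.0))
```

For a block of an SBM all entries of `true_dist` are zero, so the lower bound forces `D` to be zero as well, matching the interpretation given above.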

Our algorithm does not attempt to estimate the distance between and for . When the support contains multiple disjoint intervals, e.g., in the SBM case, it first pulls apart the nodes in different communities. Estimating the distance between intervals, given the output of our algorithm is straightforward. Our main result is the following.

###### Theorem 2.2.

Using the notation above, assume and are well-conditioned, and and are for some suitably large . The algorithm for the simplified model shown in Figure 1 and that for the bipartite model shown in Figure 8 give us an -approximation algorithm w.h.p. for any constant . Furthermore, the distance estimates for each are constructed using the shortest path distance of an unweighted graph.

We focus only on the simplified model and the analysis for the bipartite graph algorithm is in Appendix C.

Pairwise Estimation to Line-embedding and High-dimensional Generalization. Our algorithm builds estimates on pairwise latent distance and uses well-studied metric-embedding methods [BCIS05, BG05] as blackboxes to infer latent positions. Our inference algorithm can be generalized to -dimensional space with being a constant. But the metric-embedding on becomes increasingly difficult, e.g., when , the approximation ratio for embedding a graph is  [IM04b].
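As a rough illustration of the last step, pairwise distances can be turned into positions on a line with classical multidimensional scaling, in the spirit of [BG05]. This is a swapped-in textbook technique and not necessarily the embedding routine used by the authors; the function name `line_embedding` is hypothetical, and the sketch assumes a complete distance matrix for a single group.

```python
import numpy as np


def line_embedding(D):
    """Classical MDS into one dimension from a symmetric distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    G = -0.5 * J @ (D ** 2) @ J                  # Gram matrix of centered positions
    vals, vecs = np.linalg.eigh(G)               # ascending eigenvalues
    return vecs[:, -1] * np.sqrt(max(vals[-1], 0.0))


rng = np.random.default_rng(0)
x = rng.random(30)
D = np.abs(x[:, None] - x[None, :])
y = line_embedding(D)
print(np.max(np.abs(np.abs(y[:, None] - y[None, :]) - D)))
```

The recovered positions are determined only up to a shift and a reflection, so the check compares pairwise distances rather than the coordinates themselves.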

## 3 Our algorithms

As previously noted, SBM and SWM are special cases of our unified model, and each requires different algorithmic techniques. Given that, it is not surprising that our algorithm blends ingredients from both sets of techniques. Before proceeding, we review the basics of kernel learning.

Notations. Let be the adjacency matrix of the observed graph (simplified model) and let . Let be the matrix with entries . Let () be the SVD of (). Let be a parameter to be chosen later. Let () be a diagonal matrix comprising the -largest eigenvalues of (). Let () and () be the corresponding singular vectors of (). Finally, let () be the low-rank approximation of (). Note that when a matrix is positive definite and symmetric SVD coincides with eigen-decomposition; as a consequence and .

Kernel Learning. Define an integral operator as . Let be the eigenfunctions of and be the corresponding eigenvalues such that and for each . Also let be the number of eigenfunctions/eigenvalues of , which is either finite or countably infinite. We recall some important properties of [SS01, TSP13]. For , define the feature map , so that . We also consider a truncated feature . Intuitively, if is too small for sufficiently large , then the first coordinates (i.e., ) already approximate the feature map well. Finally, let such that its -th entry is . Further, let denote the -th column of . Let . When the context is clear, shorten and to and , respectively.

There are two main steps in our algorithm which we explain in the following two subsections.

### 3.1 Estimation of Φ through K and A

The mapping is bijective so a (reasonably) accurate estimate of can be used to recover . Our main result is the design of a data-driven procedure to choose a suitable number of eigenvectors and eigenvalues of to approximate (see in Fig. 1).

###### Proposition 3.1.

Let be a tunable parameter such that and . Let be chosen by . Let be such that its first -coordinates are equal to , and its remaining entries are 0. If and ( and ) is well-conditioned, then with high probability:

$$\|\hat{\Phi}-\Phi\|_F = O\!\left(\sqrt{n}\,\big(t/\rho(n)\big)^{2/29}\right) \tag{2}$$

Specifically, by letting , we have

Remark on the Eigengap. In our analysis, there are three groups of eigenvalues: the eigenvalues of , those of , and those of . They are in different scales: (resulting from the fact that for all and ), and if and are sufficiently large. Thus, are independent of for a fixed and should be treated as . Also as . Since the procedure of choosing depends on (and thus also on ), depends on and can be bounded by a function in . This is the reason why Proposition 3.1 does not explicitly depend on the eigengap. We also note that we cannot directly find based on the input matrix . But standard interlacing results can give (see Lemma B.6 in Appendix.)

Intuition of the algorithm. Using Mercer’s theorem, we have . Thus, . On the other hand, we have . Thus, and are approximately the same, up to a unitary transformation. We need to identify sources of errors to understand the approximation quality.
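The intuition above can be checked numerically on a kernel matrix. The sketch below is illustrative only: it assumes a small-world-style kernel of the form `c0 / (|x - y|**2 + c1)` with arbitrary constants, evaluated on evenly spaced points, and the function name `truncated_embedding` is hypothetical. It forms the top-`d` eigenvectors scaled by the square roots of their eigenvalues and verifies that the resulting Gram matrix nearly reproduces the input, and that rotating the embedding leaves the Gram matrix unchanged.

```python
import numpy as np


def truncated_embedding(M, d):
    """Rows are d-dimensional features: top-d eigenvectors scaled by sqrt(eigenvalue)."""
    vals, vecs = np.linalg.eigh(M)
    idx = np.argsort(vals)[::-1][:d]             # indices of the d largest eigenvalues
    top_vals = np.clip(vals[idx], 0.0, None)     # guard against tiny negative round-off
    return vecs[:, idx] * np.sqrt(top_vals)


x = np.linspace(0.0, 1.0, 100)
K = 0.1 / ((x[:, None] - x[None, :]) ** 2 + 0.1)   # assumed kernel, for illustration

Z = truncated_embedding(K, d=20)
rel_err = np.linalg.norm(Z @ Z.T - K) / np.linalg.norm(K)
print(rel_err)
```

Because only inner products between rows matter, any orthogonal transformation of `Z` is an equally valid embedding, which is why the comparison in the text holds only up to a unitary transformation.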

Error source 1. Finite samples to learn the kernel. We want to infer “continuous objects” and (specifically the eigenfunctions of ) but gives only the kernel values of a finite set of pairs. From standard results in Kernel PCA [RBV10, TSP13], we have with probability ,

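$$\left\|U_K S_K^{1/2} W-\Phi_d(X)\right\|_F \le \frac{2\sqrt{2}\,\sqrt{\log \epsilon^{-1}}}{\lambda_d(K)-\lambda_{d+1}(K)} = \frac{2\sqrt{2}\,\sqrt{\log \epsilon^{-1}}}{\delta_d}.$$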

Error source 2. Only observe . We observe only the realized graph and not , though it holds that . Thus, we can only use singular vectors of to approximate . We have: . When is dense (i.e., ), the problem is analyzed in [TSP13]. We generalize the results in [TSP13] for the sparse graph case. See Appendix B for a complete analysis.

Error source 3. Truncation error. When is large, the noise in “outweighs” the signal. Thus, we need to decide a such that only the first eigenvectors/eigenvalues of are used to approximate . Here, we need to address the truncation error: the tail is thrown away.

Next we analyze the magnitude of the tail. We abuse notation so that refers to both a -dimensional vector and a -dimensional vector in which all the entries after -th one are . We have (A Chernoff bound is used to obtain that ). Using the decay condition, we show that a can be identified so that the tail can be bounded by a polynomial in . The details are technical and are provided in the supplementary material (cf. Proof of Prop. 3.1 in Appendix B).

### 3.2 Estimating Pairwise Distances from ˆΦ(xi) through Isomap

See in Fig. 1 for the pseudocode. After we construct our estimate , we estimate by letting . Recalling , a plausible approach is to estimate . However, is a convex function in . Thus, when is small, a small estimation error here will result in an amplified estimation error in (see also Fig. 7 in App. A.3). But when is small, is reliable (see the “reliable” region in Fig. 7).

Thus, our algorithm only uses large values of to construct estimates. The isomap technique introduced in topological learning [TdSL00, ST03] is designed to handle this setting. Specifically, the set forms a curve in (Fig. 2(a)). Our estimate will be a noisy approximation of the curve (Fig. 2(b)). Thus, we build up a graph on so that and are connected if and only if and are close (Fig. 2(c) and/or (d)). Then the shortest path distance on approximates the geodesic distance on . By using the fact that is a radial basis kernel, the geodesic distance will also be proportional to the latent distance.

Corrupted nodes. Excessively corrupted nodes may help build up “undesirable bridges” and interfere with the shortest-path based estimation (cf. Fig. 2(c)). Here, the shortest path between two green nodes “jumps through” the excessively corrupted nodes (labeled in red) so the shortest path distance is very different from the geodesic distance.

Below, we describe a procedure to remove excessively corrupted nodes and then explain how to analyze the isomap technique’s performance after their removal. Note that in this section mostly refers to the shortest path distance (rather than the number of eigenvectors we keep as used in the previous section).

Step 1. Eliminate corrupted nodes. Recall that are the latent variables. Let and . For any and , we let . Define projection , where is the curve formed by . Finally, for any point , define such that (i.e., ’s original latent position). For the points that fall outside of , define .

Let us re-parametrize the error term in Proposition 3.1. Let be that , where for sufficiently large . By a Markov inequality, we have .

Intuitively, when , becomes a candidate that can serve to build up undesirable shortcuts. Thus, we want to eliminate these nodes.

Looking at a ball of radius centered at a point , consider two cases.
Case 1. If is close to , i.e., corresponding to the blue nodes in Figure 2(c). For exposition purpose, let us assume . Now for any point , if , then by Lemma E.1, we have , which means is in . The total number of such nodes will be in the order of , by using the near-uniform density assumption.
Case 2. If is far away from any point in , i.e., corresponding to the red ball in Figure 2(c), any points in will also be far from . Then the total number of such nodes will be .

As for , there is a phase-transition phenomenon: When is far from , then a neighborhood of contains nodes. When is close to , then a neighborhood of contains nodes.

We can leverage this intuition to design a counting-based algorithm to eliminate nodes that are far from :

$\textsc{Denoise}(\hat{z}_i)$: If $|\mathrm{Ball}(\hat{z}_i, 3/\sqrt{f(n)})|$
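A counting rule of this kind can be sketched as follows. This is a hedged illustration rather than the exact procedure: the radius and the count threshold are left as free parameters (`radius`, `min_count`) because the threshold is not legible in the rule above, and the function name `denoise` is hypothetical.

```python
import numpy as np


def denoise(Z, radius, min_count):
    """Return a boolean mask of points to keep.

    A point is kept when the ball of the given radius around it contains
    at least `min_count` points (the point itself is counted).
    """
    dists = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    counts = np.sum(dists <= radius, axis=1)
    return counts >= min_count


# 51 points on a segment (a stand-in for the curve) plus one far-away outlier.
t = np.linspace(0.0, 1.0, 51)
curve = np.column_stack([t, np.zeros_like(t)])
Z = np.vstack([curve, [[0.5, 5.0]]])

keep = denoise(Z, radius=0.1, min_count=4)
print(keep[:51].all(), keep[51])
```

Points near the curve have many neighbors because of the near-uniform density assumption, while an isolated estimate far from the curve has few, which is the phase transition the rule exploits.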

Theoretical result. We classify a point into three groups:
1. Good: Satisfying . We further partition the set of good points into two parts. Good-I are points such that , while Good-II are points that are good but not in Good-I.
3. Unclear: otherwise.

We prove the following result (see Appendix E for a proof).

###### Lemma 3.2.

After running that uses the counting-based decision rule, all good points are kept, all bad points are eliminated, and all unclear points have no performance guarantee. The total number of eliminated nodes is .

Step 2. An isomap-based algorithm. Wlog assume there is only one closed interval for . We build a graph on so that two nodes and are connected if and only if , where is a sufficiently large constant (say 10). Consider the shortest path distance between arbitrary pairs of nodes and (that are not eliminated). Because the corrupted nodes are removed, the whole path is around . Also, by the uniform density assumption, walking on the shortest path in is equivalent to walking on with “uniform speed”, i.e., each edge on the path will map to an approximately fixed distance on . Thus, the shortest path distance scales with the latent distance, i.e., , which implies Theorem 2.2. See Appendix E.3 for a detailed analysis.
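The "uniform speed" argument can be seen in a toy example. The sketch below is an illustration under assumptions, not the paper's algorithm: points are placed on a half circle as a stand-in for the curve, the connection radius is picked by hand, and the names `neighborhood_graph` and `hop_distances` are hypothetical. Distances are hop counts in an unweighted graph, computed by breadth-first search.

```python
from collections import deque

import numpy as np


def neighborhood_graph(Z, radius):
    """Boolean adjacency: connect two distinct points when they are within `radius`."""
    dists = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    adj = dists <= radius
    np.fill_diagonal(adj, False)
    return adj


def hop_distances(adj, source):
    """Unweighted shortest-path (hop) distances from `source`; -1 if unreachable."""
    n = adj.shape[0]
    dist = np.full(n, -1)
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in np.flatnonzero(adj[u]):
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


x = np.linspace(0.0, 1.0, 101)                       # latent positions
Z = np.column_stack([np.cos(np.pi * x), np.sin(np.pi * x)])
hops = hop_distances(neighborhood_graph(Z, radius=0.1), source=0)
print(np.corrcoef(hops, x)[0, 1])
```

Each hop covers roughly the same stretch of the curve, so hop counts from the first node grow almost linearly in latent distance even though straight-line distances in the plane do not.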

Bipartite Model. Although we have focused our discussion on the simplified model, we make a few remarks about inference in the more realistic bipartite model. A more detailed discussion and the inference algorithm are available at the beginning of Appendix C and full details follow in that appendix. In the bipartite case, we no longer have access to the kernel matrix for pairs of celebrity nodes; however, any non-diagonal entry of , say the one, can be written as where and are independent Bernoulli random variables with parameters and . This gives rise to a square kernel (of ) which can be used to identify the eigenvalues and eigenvectors of the kernel operator used in the analysis of the simplified model. The diagonal entries have to be regularized as there is no independence in the corresponding terms.
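A minimal sketch of the product-and-regularize step follows. Several things here are assumptions made for illustration: the follower-by-influencer incidence matrix is called `B`, the product is taken as `B.T @ B` so that rows and columns index influencers, and the diagonal is shrunk by a multiplicative factor `diag_scale` (set to zero by default) because the exact shrinkage is specified only in the appendix. The function name `regularized_product` is hypothetical.

```python
import numpy as np


def regularized_product(B, diag_scale=0.0):
    """Co-follower counts between influencers with the diagonal shrunk.

    B is a binary (followers x influencers) matrix. Entry (j, k) of B.T @ B
    counts followers linked to both influencer j and influencer k.
    """
    G = (B.T @ B).astype(float)
    G[np.diag_indices_from(G)] *= diag_scale     # diagonal terms are not sums of independent products
    return G


B = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])
print(regularized_product(B))
```

Off-diagonal entries are sums over followers of products of two independent indicators, which is what makes them usable as noisy kernel values; the diagonal entries are sums of squared indicators and therefore behave differently.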

Discussion: “Gluing together” two algorithms? The unified model is much more flexible than SBM and SWM. We were intrigued that the generalized algorithm needs only to “glue together” important techniques used in both models: Step 1 uses the spectral technique inspired by SBM inference methods, while Step 2 resembles techniques used in SWM: the isomap only connects between two nodes that are close, which is the same as throwing away the long-range edges.

## 4 Experiments

We apply our algorithm to a social interaction graph from Twitter to construct users’ ideology scores. We assembled a dataset by tracking keywords related to the 2016 US presidential election for 10 million users. First, we note that as of 2016 the Twitter interaction graph behaves “in between” the small-world and stochastic blockmodels (see Figure 4), i.e., the latent distributions are bimodal but not as extreme as the SBM.

Ground-truth data. Ideology scores of the US Congress (estimated by third parties [Tau12]) are usually considered a “ground-truth” dataset (see, e.g., [BJN15]). We apply our algorithm and other baselines on Twitter data to estimate the ideology score of politicians (members of the 114th Congress), and observe that our algorithm has the highest correlation with ground-truth. See Fig. 3. Beyond correlation, we also need to estimate the statistical significance of our estimates. We set up a linear model , in which ’s are our estimates and ’s are ground-truth. We then use bootstrapping to compute the standard error of our estimator, and then use the standard error to estimate the p-value of our estimator. The details of this experiment and additional empirical evaluation are available in Appendix G.
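The significance test can be sketched on synthetic data. This is an illustration under assumptions and does not reproduce the experiment: it assumes a simple linear regression with one slope, a nonparametric bootstrap over observation pairs, and a normal approximation for the p-value. The function names and the synthetic data are hypothetical.

```python
import math

import numpy as np


def ols_slope(x, y):
    """Least-squares slope of y on x (with an intercept)."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


def bootstrap_slope_se(x, y, n_boot, rng):
    """Standard error of the slope from resampling (x, y) pairs with replacement."""
    n = len(x)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        slopes[b] = ols_slope(x[idx], y[idx])
    return float(slopes.std(ddof=1))


def two_sided_p_value(estimate, se):
    """Two-sided p-value for H0: slope = 0 under a normal approximation."""
    z = abs(estimate) / se
    return math.erfc(z / math.sqrt(2.0))


rng = np.random.default_rng(0)
x = rng.normal(size=200)                          # synthetic "estimates"
y = 2.0 * x + rng.normal(scale=0.5, size=200)     # synthetic "ground truth"

slope = ols_slope(x, y)
se = bootstrap_slope_se(x, y, n_boot=200, rng=rng)
print(slope, se, two_sided_p_value(slope, se))
```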

## References

• [Abb16] Emmanuel Abbe. Community detection and the stochastic block model. 2016.
• [ABFX08] Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. Mixed membership stochastic blockmodels. J. Mach. Learn. Res., 9:1981–2014, 2008.
• [ACC13] Edo M Airoldi, Thiago B Costa, and Stanley H Chan. Stochastic blockmodel approximation of a graphon: Theory and consistent estimation. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 692–700. Curran Associates, Inc., 2013.
• [ACKS13] Ittai Abraham, Shiri Chechik, David Kempe, and Aleksandrs Slivkins. Low-distortion inference of latent similarities from a multiplex social network. In SODA, pages 1853–1872. SIAM, 2013.
• [AS15a] Emmanuel Abbe and Colin Sandon. Community detection in the general stochastic block model: Fundamental limits and efficient algorithms for recovery. In Proceedings of 56th Annual IEEE Symposium on Foundations of Computer Science, Berkeley, CA, USA, pages 18–20, 2015.
• [AS15b] Emmanuel Abbe and Colin Sandon. Detection in the stochastic block model with multiple clusters: proof of the achievability conjectures, acyclic BP, and the information-computation gap. arXiv preprint arXiv:1512.09080, 2015.
• [Bar12] Pablo Barberá. Birds of the Same Feather Tweet Together. Bayesian Ideal Point Estimation Using Twitter Data. 2012.
• [BC09] Peter J. Bickel and Aiyou Chen. A nonparametric view of network models and Newman–Girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068–21073, 2009.
• [BCIS05] Mihai Badoiu, Julia Chuzhoy, Piotr Indyk, and Anastasios Sidiropoulos. Low-distortion embeddings of general metrics into the line. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing, Baltimore, MD, USA, May 22-24, 2005, pages 225–233, 2005.
• [BG05] I. Borg and P.J.F. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, 2005.
• [BJN15] Pablo Barberá, John T. Jost, Jonathan Nagler, Joshua A. Tucker, and Richard Bonneau. Tweeting from left to right. Psychological Science, 26(10):1531–1542, 2015.
• [Cha15] Sourav Chatterjee. Matrix estimation by universal singular value thresholding. Ann. Statist., 43(1):177–214, 02 2015.
• [CJR04] J. Clinton, S. Jackman, and D. Rivers. The statistical analysis of roll call data. American Political Science Review, 98(2):355–370, 2004.
• [CR13] Raviv Cohen and Derek Ruths. Classifying political orientation on Twitter: It’s not easy! In International AAAI Conference on Weblogs and Social Media, 2013.
• [Dhi01] Inderjit S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’01, pages 269–274, New York, NY, USA, 2001. ACM.
• [DK70] C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. SIAM J. Numer. Anal., 7:1–46, 1970.
• [GB11] S. Gerrish and D. Blei. Predicting legislative roll calls from text. In Proc. ICML, 2011.
• [GB12] S. Gerrish and D. Blei. How they vote: Issue-adjusted models of legislative behavior. In Proc. NIPS, 2012.
• [GS13] J. Grimmer and B. M. Stewart. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 2013.
• [HLL83] Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social networks, 5(2):109–137, 1983.
• [HRH01] Peter D. Hoff, Adrian E. Raftery, and Mark S. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97:1090–1098, 2001.
• [IM04a] Piotr Indyk and Jiri Matoušek. Low-distortion embeddings of finite metric spaces. Handbook of discrete and computational geometry, page 177, 2004.
• [IM04b] Piotr Indyk and Jiri Matoušek. Low-distortion embeddings of finite metric spaces. In Handbook of Discrete and Computational Geometry, pages 177–196. CRC Press, 2004.
• [Kat87] Tosio Kato. Variation of discrete spectra. Communications in Mathematical Physics, 111(3):501–504, 1987.
• [Kle00] Jon Kleinberg. The small-world phenomenon: An algorithmic perspective. In Proceedings of the thirty-second annual ACM symposium on Theory of computing, pages 163–170. ACM, 2000.
• [KMS16] Varun Kanade, Elchanan Mossel, and Tselil Schramm. Global and local information in clustering labeled block models. IEEE Trans. Information Theory, 62(10):5906–5917, 2016.
• [Kön86] H. König. Eigenvalue Distribution of Compact Operators. Operator Theory: Advances and Applications. Birkhäuser, 1986.
• [LBG03] M. Laver, K. Benoit, and J. Garry. Extracting policy positions from political texts using words as data. American Political Science Review, 97(2), 2003.
• [LLDM08] Jure Leskovec, Kevin J Lang, Anirban Dasgupta, and Michael W Mahoney. Statistical properties of community structure in large social and information networks. In Proceedings of the 17th international conference on World Wide Web, pages 695–704. ACM, 2008.
• [LLM10] Jure Leskovec, Kevin J Lang, and Michael Mahoney. Empirical comparison of algorithms for network community detection. In Proceedings of the 19th international conference on World wide web, pages 631–640. ACM, 2010.
• [Lov12] L. Lovász. Large Networks and Graph Limits. American Mathematical Society colloquium publications. American Mathematical Society, 2012.
• [LP12] Jeffrey R Lax and Justin H Phillips. The democratic deficit in the states. American Journal of Political Science, 56(1):148–166, 2012.
• [Mas14] Laurent Massoulié. Community detection thresholds and the weak Ramanujan property. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 694–703. ACM, 2014.
• [McS01] Frank McSherry. Spectral partitioning of random graphs. In Foundations of Computer Science, 2001. Proceedings. 42nd IEEE Symposium on, pages 529–537. IEEE, 2001.
• [MNS13] Elchanan Mossel, Joe Neeman, and Allan Sly. A proof of the block model threshold conjecture. arXiv preprint arXiv:1311.4115, 2013.
• [MNS15] Elchanan Mossel, Joe Neeman, and Allan Sly. Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields, 162(3-4):431–461, 2015.
• [MP02] Milena Mihail and Christos Papadimitriou. On the eigenvalue power law. In International Workshop on Randomization and Approximation Techniques in Computer Science, pages 254–262. Springer, 2002.
• [New06] Mark EJ Newman. Finding community structure in networks using the eigenvectors of matrices. Physical review E, 74, 2006.
• [NG04] Mark EJ Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical review E, 69(2):026113, 2004.
• [NWS02] Mark EJ Newman, Duncan J Watts, and Steven H Strogatz. Random graph models of social networks. Proceedings of the National Academy of Sciences, 99(suppl 1):2566–2572, 2002.
• [O10] Roberto Imbuzeiro Oliveira et al. Sums of random hermitian matrices and an inequality by rudelson. Electron. Commun. Probab, 15(203-212):26, 2010.
• [Oli10] Roberto Oliveira. Sums of random hermitian matrices and an inequality by rudelson. Electron. Commun. Probab., 15:203–212, 2010.
• [PJW13] Patrick J. Wolfe and Sofia C. Olhede. Nonparametric graphon estimation. 2013.
• [PR85] K. T. Poole and H. Rosenthal. A spatial model for legislative roll call analysis. American Journal of Political Science, 29(2):357–384, 1985.
• [QR13a] Tai Qin and Karl Rohe. Regularized spectral clustering under the degree-corrected stochastic blockmodel. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3120–3128. 2013.
• [QR13b] Tai Qin and Karl Rohe. Regularized spectral clustering under the degree-corrected stochastic blockmodel. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS’13, pages 3120–3128, USA, 2013. Curran Associates Inc.
• [RAK07] U. N. Raghavan, R. Albert, and S. Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3), 2007.
• [RBV10] Lorenzo Rosasco, Mikhail Belkin, and Ernesto De Vito. On learning with integral operators. J. Mach. Learn. Res., 11:905–934, March 2010.
• [RCY11] Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878–1915, 2011.
• [RQY16] Karl Rohe, Tai Qin, and Bin Yu. Co-clustering directed graphs to discover asymmetries and directional communities. Proceedings of the National Academy of Sciences, 113(45):12679–12684, 2016.
• [SS01] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.
• [ST03] Vin De Silva and Joshua B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In Advances in Neural Information Processing Systems 15, pages 705–712. MIT Press, 2003.
• [Tau12] Joshua Tauberer. Observing the unobservables in the US Congress. Law Via the Internet, 2012.
• [TdSL00] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319, 2000.
• [Tro12] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
• [TSP13] Minh Tang, Daniel L. Sussman, and Carey E. Priebe. Universally consistent vertex classification for latent positions graphs. Ann. Statist., 41(3):1406–1430, 06 2013.
• [WC14] Patrick J. Wolfe and David Choi. Co-clustering separately exchangeable network data. The Annals of Statistics, 42(1):29–63, 2014.
• [Wey12] H. Weyl. Das asymptotische verteilungsgesetz der eigenwerte linearer partieller differentialgleichungen (mit einer anwendung auf die theorie der hohlraumstrahlung). Mathematische Annalen, 71:441–479, 1912.
• [WS98] Duncan J Watts and Steven H Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998.
• [WTSC16] Felix Ming Fai Wong, Chee-Wei Tan, Soumya Sen, and Mung Chiang. Quantifying political leaning from tweets, retweets, and retweeters. IEEE Trans. Knowl. Data Eng., 28(8):2158–2172, 2016.
• [WZ15] Andrew J. Wathen and Shengxin Zhu. On spectral distribution of kernel matrices related to radial basis functions. Numerical Algorithms, 70(4):709–726, 2015.
• [XML14] Jiaming Xu, Laurent Massoulié, and Marc Lelarge. Edge label inference in generalized stochastic block models: from spectral theory to impossibility results. In Maria Florina Balcan, Vitaly Feldman, and Csaba Szepesvári, editors, Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 903–920, Barcelona, Spain, 13–15 Jun 2014. PMLR.
• [YP16] Se-Young Yun and Alexandre Proutière. Optimal cluster recovery in the labeled stochastic block model. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 965–973, 2016.
• [ZB05] Laurent Zwald and Gilles Blanchard. On the convergence of eigenspaces in kernel principal component analysis. In Advances in Neural Information Processing Systems 18 [Neural Information Processing Systems, NIPS 2005, December 5-8, 2005, Vancouver, British Columbia, Canada], pages 1649–1656, 2005.
• [ZLZ12] Yunpeng Zhao, Elizaveta Levina, and Ji Zhu. Consistency of community detection in networks under degree-corrected stochastic block models. Ann. Statist., 40(4):2266–2292, 08 2012.
• [ZRMZ07] T. Zhou, J. Ren, M. Medo, and Y.-C. Zhang. Bipartite network projection and personal recommendation. 76(4):046115, October 2007.

This section provides additional illustrations related to our work.

### a.1 Model Selection Problem (presented in Section 1)

Without observing the latent positions or knowing which model generated the underlying graph, the adjacency matrix of a social graph typically looks like the one shown in Fig. 5(a). However, if the model generating the graph is known, it is then possible to run a suitable “clustering algorithm” [McS01, ACKS13] that reveals the hidden structure. When the vertices are ordered suitably, the SBM’s adjacency matrix looks like the one shown in Fig. 5(b) and that of the SWM looks like the one shown in Fig. 5(c).

### a.2 Algorithm for the Small-world Model (presented in Section 2)

The inference algorithm for small-world networks uses different ideas. Each edge in the graph can be thought of as either a “short-range” or a “long-range” one. Short-range edges are those between nodes that are nearby in latent space, while long-range ones have endpoints that are far apart in latent space. After the removal of all the long-range edges, the shortest-path distance between two nodes scales proportionally to the corresponding latent-space distance (see Fig. 6). Once estimates for pairwise distances are obtained, standard building blocks may be used to find the latent positions [IM04a].
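As a rough illustration of the last two steps, the sketch below computes shortest-path distances by breadth-first search and then places the nodes on a line with classical multidimensional scaling, using power iteration for the leading eigenvector. It assumes that the long-range edges have already been removed and that the remaining graph is connected. The function names `bfs_distances` and `embed_on_line` and the toy path graph are hypothetical, and classical MDS is only one possible choice of building block, not necessarily the one used in this work.

```python
from collections import deque
import math

def bfs_distances(adj, source):
    """Hop distances from `source` in an unweighted graph given as adjacency lists."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def embed_on_line(adj, iters=500):
    """One-dimensional classical MDS via power iteration (assumes a connected graph)."""
    nodes = sorted(adj)
    n = len(nodes)
    # Squared shortest-path distances between all pairs of nodes.
    d2 = []
    for u in nodes:
        du = bfs_distances(adj, u)
        d2.append([float(du[v] ** 2) for v in nodes])
    # Double centering: B = -1/2 * J * D2 * J with J = I - (1/n) * ones.
    row = [sum(r) / n for r in d2]
    total = sum(row) / n
    b = [[-0.5 * (d2[i][j] - row[i] - row[j] + total) for j in range(n)]
         for i in range(n)]
    # Power iteration for the leading eigenvector of B.
    x = [float(i + 1) for i in range(n)]
    for _ in range(iters):
        y = [sum(b[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    # Rayleigh quotient gives the leading eigenvalue; scale the eigenvector by its root.
    lam = sum(x[i] * sum(b[i][j] * x[j] for j in range(n)) for i in range(n))
    scale = math.sqrt(max(lam, 0.0))
    return {u: scale * x[i] for i, u in enumerate(nodes)}

# Toy input: the path 0-1-2-3-4, standing in for a graph with only short-range edges.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
positions = embed_on_line(path)
```

On this toy path the recovered coordinates are evenly spaced, up to a shift and a reflection, which is all that pairwise distances can determine.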

### a.3 Sensitivity of the Gram matrix K (presented in Section 3.2)

After we construct our estimate , we may estimate by letting . Recalling , one plausible approach would be estimating . A main issue with this approach is that is a convex function in . Thus, when is small, a small estimation error here will result in an amplified estimation error in (cf. Fig. 7). But when is small, is reliable (see the “reliable” region in Fig. 7).
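The amplification effect can be reproduced with any convex, decreasing kernel. The sketch below uses a hypothetical kernel $1/(1+r)$ of the latent distance $r$, chosen only for illustration and not taken from the model in this work, and compares the error in the recovered distance when the same additive error `eps` is applied to a large and to a small kernel value.

```python
def kernel(r):
    """Hypothetical convex, decreasing kernel of the latent distance r >= 0."""
    return 1.0 / (1.0 + r)

def invert_kernel(k):
    """Distance implied by a kernel value 0 < k <= 1 under the kernel above."""
    return 1.0 / k - 1.0

def distance_error(r, eps):
    """Error in the recovered distance when the kernel value is off by eps."""
    return abs(invert_kernel(kernel(r) + eps) - r)

eps = 0.01
near = distance_error(1.0, eps)    # kernel value 0.5: the reliable regime
far = distance_error(19.0, eps)    # kernel value 0.05: the sensitive regime
```

Under these assumptions the same perturbation of the kernel value distorts the far pair by well over an order of magnitude more than the near pair, which is why only the estimates for nearby pairs are treated as reliable.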

## Appendix B Simplified model case: Using A to approximate Φ(X)

This section proves the following proposition.

###### Proposition B.1 (Repeat of Proposition 3.1).

Let be a tunable parameter such that and . Let be chosen by . Let be such that its first -coordinates are equal to . If and is well-conditioned, then with high probability:

$$\|\hat{\Phi}-\Phi\|_F = O\left(\sqrt{n}\left(\frac{t}{\rho(n)}\right)^{2/29}\right) \tag{4}$$

We will break down our analysis into three components, each of which corresponds to an approximation error source presented in Section 3.1. Some statements that appeared earlier are repeated in this section to make it self-contained.

Before proceeding, we need more notation.

Additional Notation. Let denote the reproducing kernel Hilbert space of , so that each element can be uniquely expressed as . The inner product of elements in is given by .

### b.1 Error Source 1: Finite Samples to Learn the Kernel

Recall that we want to make inferences about the “continuous objects” and (more specifically, the eigenfunctions of the integral operator derived using and ), but gives only the kernel values for a finite set of pairs, so estimates constructed from are only approximations. Here, we need only an existing result from kernel PCA [RBV10, TSP13].

###### Lemma B.2.

Using the notation above, we have

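$$\left\|U_K S_K^{1/2} W-\Phi_d(X)\right\|_F \le \frac{2\sqrt{2}\sqrt{\log \epsilon^{-1}}}{\lambda_d(K)-\lambda_{d+1}(K)} = \frac{2\sqrt{2}\sqrt{\log \epsilon^{-1}}}{\delta_d} \tag{5}$$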

We remark on the (implicit) dependence on the sample size in (5). Here, the right-hand side is the total error on all the samples, which is independent of , and hence the average square error shrinks as .

### b.2 Error Source 2: Only Observe A

We observe only the realized graph rather than the Gram matrix , such that . Thus, we can use only the singular vectors of to approximate . Our main goal is to prove the following lemma.

###### Lemma B.3.

Using the notation above, we have

$$\left\|\sqrt{C(n)}\,U_A S_A^{1/2}-U_K S_K^{1/2}\right\|_F = O\left(\frac{t\sqrt{dn}}{\delta_d^2\,\rho(n)}\right) \tag{6}$$

The outline of the proof is as follows.

Step 1. Show that is small. This can be done by observing that are independent for different pairs of and applying a tail inequality for sums of independent random matrices.

Step 2. Apply a Davis-Kahan theorem to show that and are close. Let and be the projection operators onto the linear subspaces spanned by the eigenvectors corresponding to the largest eigenvalues of and respectively. The Davis-Kahan theorem gives a sufficient condition for and to be close (up to a unitary operation): needs to be small (from Step 1) and needs to be large (from is a suitable constant). Thus and are close up to a unitary operation, which implies and are close. We will specifically show that is small. refers to the Hilbert-Schmidt norm.
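To make the role of the eigengap concrete, here is a minimal sketch for symmetric $2\times 2$ matrices, where the leading eigenvector has a closed form. It compares the distance between the two rank-one projections with the ratio of the perturbation's Frobenius norm to the eigengap, the quantity that Davis-Kahan-type bounds control up to a constant. The matrices and the helper names `top_eigvec_angle` and `projection_distance` are illustrative assumptions, not objects from the proof.

```python
import math

def top_eigvec_angle(m):
    """Angle of the leading eigenvector of a symmetric 2x2 matrix [[a, b], [b, c]]."""
    (a, b), (_, c) = m
    return 0.5 * math.atan2(2.0 * b, a - c)

def projection_distance(m1, m2):
    """Frobenius distance between the projections onto the two leading eigenvectors."""
    angle = top_eigvec_angle(m1) - top_eigvec_angle(m2)
    return math.sqrt(2.0) * abs(math.sin(angle))

k = [[3.0, 0.0], [0.0, 1.0]]          # eigenvalues 3 and 1, so the eigengap is 2
gap = 2.0
for eps in (0.2, 0.1, 0.05):
    a = [[3.0, eps], [eps, 1.0]]      # off-diagonal perturbation, Frobenius norm sqrt(2) * eps
    ratio = math.sqrt(2.0) * eps / gap
    print(eps, projection_distance(k, a), ratio)
```

Shrinking the perturbation or widening the gap both shrink the projection distance, which is the mechanism Step 2 relies on.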

###### Definition B.4.

The Hilbert-Schmidt norm of a bounded operator on a Hilbert space is

$$\|A\|_{HS}^2=\sum_{i\in I}\|Ae_i\|^2, \tag{7}$$

where is an orthonormal basis of .
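In finite dimensions the Hilbert-Schmidt norm coincides with the Frobenius norm, and its value does not depend on the orthonormal basis used in the sum. The following sketch checks both facts on an arbitrary $2\times 2$ matrix; the matrix, the rotation angle, and the helper `hs_norm_sq` are illustrative choices.

```python
import math

def hs_norm_sq(a, basis):
    """Sum of ||A e||^2 over the vectors e of an orthonormal basis."""
    total = 0.0
    for e in basis:
        ae = [sum(a[i][j] * e[j] for j in range(len(e))) for i in range(len(a))]
        total += sum(v * v for v in ae)
    return total

a = [[1.0, 2.0], [3.0, 4.0]]
standard = [[1.0, 0.0], [0.0, 1.0]]
theta = 0.7  # arbitrary rotation angle
rotated = [[math.cos(theta), math.sin(theta)],
           [-math.sin(theta), math.cos(theta)]]
frobenius_sq = sum(v * v for r in a for v in r)  # sum of squared entries

print(hs_norm_sq(a, standard), hs_norm_sq(a, rotated), frobenius_sq)
```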

Step 3. Show that and are close (up to a unitary operation). We first argue that is small. Then, by observing that and are “square roots” of and , we can show that and are close.

We now follow the workflow to prove the proposition.

#### b.2.1 Step 1. ∥A−K∥ is small

We use the following matrix concentration bound [Tro12].

###### Theorem B.5.

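Consider a finite sequence $\{X_k\}$ of independent, random, self-adjoint matrices with dimension $d$. Assume that each random matrix satisfies

$$\mathbb{E}[X_k]=0 \quad\text{and}\quad \lambda_{\max}(X_k)\le R \quad\text{a.s.} \tag{8}$$

Then for all $t \ge 0$,

$$\Pr\left[\lambda_{\max}\left(\sum_k X_k\right)\ge t\right]\le d\,\exp\left(\frac{-t^2/2}{\sigma^2+Rt/3}\right), \tag{9}$$

where $\sigma^2 := \left\|\sum_k \mathbb{E}[X_k^2]\right\|$.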
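To get a feel for the scale of the tail bound (9), the sketch below evaluates its right-hand side for a hypothetical setting: a graph on $n = 200$ nodes in which every edge appears independently with the same probability $p = 0.1$, with one centered summand per node pair. In that setting each summand has largest eigenvalue at most 1, and the variance parameter equals $(n-1)p(1-p)$. These numbers and the function name `matrix_bernstein_tail` are assumptions made for illustration and do not come from the analysis in this section.

```python
import math

def matrix_bernstein_tail(t, d, sigma_sq, r):
    """Right-hand side of (9): d * exp(-(t^2 / 2) / (sigma^2 + R * t / 3))."""
    return d * math.exp(-(t * t / 2.0) / (sigma_sq + r * t / 3.0))

# Hypothetical setting: n nodes, every edge present independently with probability p.
n, p = 200, 0.1
sigma_sq = (n - 1) * p * (1 - p)   # each diagonal entry of the variance sum has n - 1 terms
bound = matrix_bernstein_tail(20.0, n, sigma_sq, 1.0)

# The bound is vacuous (above 1) for small t and decays quickly once t^2 exceeds sigma^2.
for t in (5.0, 10.0, 20.0, 40.0):
    print(t, matrix_bernstein_tail(t, n, sigma_sq, 1.0))
```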

We apply the above theorem to bound . Let represent the probability that there is a link between and . Let random matrix be that the -th entry and -th entry are 1 with probability , and otherwise. The remaining entries in are all . Let . Note that and are all independent. We also have .

Note that:

1. a.s.

2. is a matrix such that only -th and -th entries can be non-zero. Furthermore,

 (F2i,j)i,i=(F2i,j)j,j={p