Statistical guarantees for local graph clustering
Local graph clustering methods aim to find small clusters in very large graphs. These methods take as input a graph and a seed node, and they return as output a good cluster in a running time that depends on the size of the output cluster but that is independent of the size of the input graph. In this paper, we adopt a statistical perspective on local graph clustering, and we analyze the performance of the $\ell_1$-regularized PageRank method (a popular local graph clustering method) for the recovery of a single target cluster, given a seed node inside the cluster. Assuming the target cluster has been generated by a random model, we present two results. In the first, we show that the optimal support of $\ell_1$-regularized PageRank recovers the full target cluster, with bounded false positives. In the second, we show that if the seed node is connected solely to the target cluster then the optimal support of $\ell_1$-regularized PageRank recovers exactly the target cluster. We also show that the solution path of $\ell_1$-regularized PageRank is monotonic. From a computational perspective, this permits the application of the forward stagewise algorithm, which in turn permits us to approximate the entire solution path of the local cluster in a running time that does not depend on the size of the entire graph.
In many data applications, one is interested in finding small-scale structure in a very large data set. As an example, consider the following version of the so-called local graph clustering problem: given a large graph and a seed node in that graph, quickly find a good small cluster that includes that seed node. From an algorithmic perspective, one typically considers worst-case input graphs, and one may be interested in running time guarantees, e.g., to find a good cluster in a time that depends linearly or sub-linearly on the size of the entire graph. From a statistical perspective, such a local graph clustering problem can be understood as a recovery problem. One assumes that there exists a target cluster in a given large graph, where the graph is assumed to have been generated by a random model, and the objective is to recover the target cluster from one node inside the cluster.
In this paper, we consider the so-called $\ell_1$-regularized PageRank algorithm [FKSCM2017], a popular algorithm for the local graph clustering problem, and we establish statistical recoverability guarantees for it. Previous theoretical analysis on local graph clustering, e.g., [ACL06, ZLM13], is based on the notion of conductance (a cluster quality metric that considers the internal versus external connectivity of a cluster) and considers running time performance for worst-case input graphs. In contrast, our goal will be to study the average-case performance of the $\ell_1$-regularized PageRank algorithm, under a certain type of local random graph model. This model concerns the target cluster and its adjacent nodes, and it encompasses the stochastic block model [holland1983stochastic, abbe2017community] and the planted clustering model [alon1998finding, arias2014community] as special cases.
Within this random graph model, we provide theoretical guarantees for the unique optimal solution of the $\ell_1$-regularized PageRank optimization problem (see Theorem 1 and Theorem 2). Observe that our statistical perspective is more aligned with statistical guarantees for the sparse regression problem (and the lasso problem [tibshirani1996regression]), where the objective is to recover the true parameter and/or support from noisy data. Given this connection, we also establish a result for the exact support recovery of $\ell_1$-regularized PageRank (see Theorem 3).
We are also interested in the computational performance of local graph clustering algorithms, i.e., that the running time depends on the size of the output but not on the size of the entire graph. To that end, we will show that the solution path of our algorithm for the $\ell_1$-regularized PageRank problem is monotonic (see Theorem 4). This means that, as the $\ell_1$-norm regularization parameter becomes smaller, the individual coordinates of the solution are monotonically increasing, and thus the number of nonzero nodes in the optimal solution increases. This is a crucial property that allows us to use popular statistical tools, in particular, the forward stagewise algorithm [lars, friedman2001elements, tibshirani2015general], to obtain an approximate solution of the entire $\ell_1$-regularized PageRank solution path. This, in combination with known results about the locality of gradient-based algorithms [FKSCM2017], means that, by terminating our algorithm early, we can find the approximate path in a running time that does not depend on the size of the entire graph. This makes the forward stagewise algorithm a good candidate for recovering an approximate and partial solution path of the $\ell_1$-regularized problem for very large graphs.
There is a large body of related work, the most relevant of which is: work in theoretical computer science on local graph algorithms; work in statistics on stochastic graph models; and work in statistics on solution path algorithms. We discuss each in turn.
The origins of local graph clustering are with the work of Spielman and Teng [ST13]. Subsequent to their original results, there has been a great deal of follow-up work on local graph clustering procedures, including with random walks [ACL06], local Lanczos spectral approximations [2017-ecml-pkdd], evolving sets [AGPT2016], seed expansion methods [KK2014], optimization-based approaches [FKSCM2017, FDM2017], and local flow methods [WFHM2017]. There also exist local higher-order clustering [YBLG2017], linear algebra approaches [2017-ecml-pkdd], spectral methods based on Heat Kernel PageRank [KG14], newer seed set selection techniques for local flow methods [VKG18], and parallel local spectral approaches [SKFM2016]. In all of these cases, given a seed node, or a seed set of nodes, the goal of existing local graph clustering approaches is to compute a cluster “nearby” the seed that is related to the “best” cluster nearby the seed. Here, “best” and “nearby” are intentionally left under-specified, as they can be formalized in one of a few different but related ways. For example, “best” is usually related to a clustering score such as conductance. In fact, many existing methods for local graph clustering with theoretical guarantees are motivated through the problem of finding a cluster that is near the seed node and that also has small conductance value [ST13, ACL06, FKSCM2017, VKG18, WFHM2017].
There are also numerous papers in statistics on partitioning random graphs. Arguably, the stochastic block model (SBM) is the most commonly employed random model for graph partitioning, and it has been extensively studied [abbe2015community, abbe2015exact, zhang2016minimax, massoulie2014community, mossel2018proof, newman2002random, mossel2015reconstruction, rohe2011spectral, amini2018semidefinite, abbe2017community]. Recent work has also generalized the SBM to a degree-corrected block model, to capture degree heterogeneity of the network [chen2018convexified, gulikers2017spectral, zhao2012consistency, gao2018community]. The literature in this area is too extensive to cover in this paper, but we refer the reader to the survey [abbe2017community] on the graph partitioning problem. We should emphasize that the traditional graph partitioning problem is quite different from the local graph clustering problem. Among other things, while the former partitions all the vertices of a graph into different clusters, for the latter problem our objective is to find a single cluster given a seed node in the cluster.
Solution path algorithms are designed to solve the optimization problem over a full range of regularization parameter values, or to find a subset of the full solution path when the algorithm is terminated early. Since the seminal work of [lars], the idea of designing path algorithms has gained much attention in the sparse regression literature [zou2007degrees, hastie2004entire, arnold2016efficient], and these algorithms have made it very efficient to explore full regression coefficient paths and to characterize the bias-variance tradeoff. Unlike in the regression setting, however, this type of path algorithm has been less studied in local graph clustering. In [gleich2016seeded], the authors propose a method to generate an approximate solution path associated with a PageRank diffusion to reveal cluster structures at multiple scales. In this work, we exploit the fact that the solution path of the $\ell_1$-regularized PageRank problem is monotonic, which allows us to apply the forward stagewise algorithm to provably approximate the entire regularization solution path.
We write for any . Throughout the paper we assume we have a connected, undirected graph , where denotes the set of nodes, with , and denotes the set of edges. We denote by the adjacency matrix of , i.e.,
For an unweighted graph, is set to for all . We denote by the diagonal degree matrix of , i.e., , where is the weighted degree of node . In this case, denotes the degree vector, and the volume of a subset of nodes is defined as for . We denote by the graph Laplacian; and .
For given sets of indexes , we write to denote the submatrix of indexed by and . If is a singleton, we use to indicate the -th row of whose columns are indexed by . Analogously, we use to indicate the -th column of whose rows are indexed by . We denote by a set difference between and , and denote by the complement of .
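The notation above can be made concrete with a short NumPy sketch. It assumes the usual conventions: a symmetric weight matrix `A`, degrees taken as row sums, the combinatorial Laplacian `D - A`, and the volume of a node subset as the sum of its degrees. The names `graph_matrices` and `volume` are hypothetical and are introduced only for illustration.

```python
import numpy as np

def graph_matrices(A):
    """Return the degree vector d, degree matrix D and Laplacian L = D - A."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)      # weighted degree of each node (row sums)
    D = np.diag(d)         # diagonal degree matrix
    L = D - A              # combinatorial graph Laplacian (assumed convention)
    return d, D, L

def volume(d, nodes):
    """Volume of a node subset: the sum of the degrees of its members."""
    return float(d[list(nodes)].sum())

# Path graph 0 - 1 - 2 with unit edge weights.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
d, D, L = graph_matrices(A)
vol_01 = volume(d, [0, 1])   # degrees 1 and 2
```

Submatrices indexed by two node sets, as in the last paragraph, correspond to `A[np.ix_(rows, cols)]` in this representation.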
2 $\ell_1$-regularized PageRank
PageRank [page1999pagerank, brin1998anatomy] is a popular approach for ranking the nodes of a graph. It is defined as the stationary distribution of a Markov chain, which is encoded by a convex combination of the input distribution and the (lazy) random walk operator :
where and where is the teleportation parameter. To measure the ranking or importance of the nodes of the “whole” graph, PageRank is often computed by setting the input vector to be a uniform distribution over .
For local graph clustering, where the aim is to identify a target cluster, given a seed node in the cluster, the input distribution is set to be equal to one for the seed node and zero everywhere else. This “personalized” PageRank [haveliwala2002topic] measures the closeness or similarity of the nodes to the given seed node, and it outputs a ranking of the nodes that is “personalized” with respect to the seed node (as opposed to the original PageRank, which considers the entire graph). From an operational point of view, the underlying dynamic process in (1) defining personalized PageRank “teleports” a random walker back to the original seed node with probability .
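The diffusion described above can be sketched as a fixed-point iteration. This is a minimal illustration rather than the paper's implementation: it assumes the lazy random walk operator is $W = \frac{1}{2}(I + AD^{-1})$, that the input distribution `s` is the indicator of the seed node, and that `alpha` plays the role of the teleportation parameter. The function name `personalized_pagerank` is hypothetical.

```python
import numpy as np

def personalized_pagerank(A, seed, alpha=0.15, tol=1e-12, max_iter=10_000):
    """Iterate p <- alpha * s + (1 - alpha) * W p until the update is tiny."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    d = A.sum(axis=0)
    W = 0.5 * (np.eye(n) + A / d)   # A / d scales column j by 1/d_j, i.e. A D^{-1}
    s = np.zeros(n)
    s[seed] = 1.0                   # all input mass on the seed node
    p = s.copy()
    for _ in range(max_iter):
        p_next = alpha * s + (1.0 - alpha) * (W @ p)
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Two triangles {0,1,2} and {3,4,5} joined by the edge 2-3; the seed is node 0.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
p = personalized_pagerank(A, seed=0)
```

This dense version touches every node at each iteration, which is exactly the cost that local methods are designed to avoid.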
The personalized PageRank vector can be obtained by solving the linear system (1), but this can be prohibitively expensive, especially when there is a single seed node or a small set of seed nodes, and when one is interested in very small clusters in a very large graph. In the seminal work of [ACL06], the authors propose an iterative algorithm, called Approximate Personalized PageRank (APPR), to solve this running time problem. They do so by approximating the personalized PageRank vector, while running in time independent of the size of the entire graph. APPR was developed from an algorithmic (or “theoretical computer science”) perspective, but it is equivalent to applying a coordinate descent type algorithm to the linear system (1) with a particular scheme of early stopping. Motivated by this, [FKSCM2017] recently proposed the $\ell_1$-regularized PageRank optimization problem. Unlike APPR, the solution method for the $\ell_1$-regularized PageRank optimization problem is purely optimization-based. It uses $\ell_1$-norm regularization to automatically set to zero the nodes that are dissimilar to the seed node, thereby resulting in a highly sparse output. In this manner, $\ell_1$-regularized PageRank can estimate the personalized ranking, while maintaining the most relevant nodes at the same time. Prior work [FKSCM2017] also showed that proximal gradient descent (ISTA) can solve the $\ell_1$-regularized PageRank problem, with access to only a small portion of the entire graph, i.e., without even touching the entire graph, thereby allowing the method to easily scale to very large-scale graphs.
In this paper, we investigate the statistical performance of $\ell_1$-regularized PageRank by reformulating local graph clustering as a sparse recovery problem. Here is a more precise definition of the $\ell_1$-regularized PageRank optimization problem from [FKSCM2017] that we consider.
Definition 1 ($\ell_1$-regularized PageRank).
Given a graph , with , and a seed vector , the $\ell_1$-regularized PageRank [FKSCM2017] on the graph is defined as
where recall , and where is a user-specified parameter that controls the amount of the regularization.
To see the intuition behind (2), observe that if we set and , then it recovers the original PageRank solution of (1). In other words, the optimization problem in (2) adds an $\ell_1$-norm regularization term to the quadratic objective of the linear system (1) (and set ) in order to keep the nodes most relevant to the seed node and to set the rest to zero.
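For concreteness, the following display gives one form of the objective that is consistent with the description above. It is a reconstruction under stated assumptions, not a quotation of (2): it assumes a quadratic term with $Q = \alpha D + \frac{1-\alpha}{2}(D - A)$, a degree-weighted $\ell_1$ penalty with parameter $\rho$, and the lazy walk operator $W = \frac{1}{2}(I + AD^{-1})$ in (1). Under these assumptions, setting $\rho = 0$ and substituting $p = Dx$ recovers the PageRank fixed point.

```latex
\hat{x} \in \arg\min_{x \in \mathbb{R}^n}
  \Big\{ \tfrac{1}{2}\, x^\top Q x - \alpha\, x^\top s + \rho\alpha\, \| D x \|_1 \Big\},
\qquad Q = \alpha D + \tfrac{1-\alpha}{2}\,(D - A).

% Case \rho = 0: the optimality condition is Q x = \alpha s. With p = D x,
Q x = \tfrac{1+\alpha}{2}\, p - \tfrac{1-\alpha}{2}\, A D^{-1} p = \alpha s
\;\Longleftrightarrow\;
p = \alpha s + (1-\alpha)\, W p, \qquad W = \tfrac{1}{2}\big(I + A D^{-1}\big).
```

The equivalence follows because $p - (1-\alpha) W p = \frac{1+\alpha}{2} p - \frac{1-\alpha}{2} A D^{-1} p$.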
Proximal gradient descent (ISTA) and the locality property.
Since the $\ell_1$-regularized PageRank problem (2) is convex, there are numerous ways to solve it using convex optimization techniques. Prior work [FKSCM2017] applies proximal gradient descent (ISTA) to the $\ell_1$-regularized PageRank optimization problem. This proceeds by alternating between locally updating the PageRank vector (a gradient descent step) and applying a soft-thresholding step. That is, for ,
where is the soft-thresholding operator, given by
Importantly, it has been shown in [FKSCM2017] that proximal gradient descent, when initialized at zero, solves the $\ell_1$-regularized PageRank optimization problem with a running time that depends only on the cardinality of the support of the optimal solution and its neighbors. Therefore, when the support set of the solution is small and has small external connectivity, proximal gradient descent finds the local cluster with access to only a small portion of the graph. This property, also called the strong locality property, is a key feature of any local graph clustering method, enabling such methods to handle very large graphs [SKFM2016].
If we set step size (so each component has step size ) in the updating equation (3), then using and , the proximal gradient descent can be rewritten as
This shows that $\ell_1$-regularized PageRank is the stationary distribution of the process consisting of the original diffusion (1) followed by soft-thresholding. The soft-thresholding then promotes sparsity structure on the unique stationary distribution of the $\ell_1$-regularized PageRank diffusion.
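The alternation between a gradient step and soft-thresholding can be sketched in a few lines. The sketch below is not the authors' code. It assumes the objective $\frac{1}{2} x^\top Q x - \alpha x^\top s + \rho\alpha \|Dx\|_1$ with $Q = \alpha D + \frac{1-\alpha}{2}(D - A)$, and a per-coordinate step size $1/d_i$, under which the soft-threshold level on each coordinate is `rho * alpha`. The function name `l1_pagerank_ista` and the default parameter values are hypothetical.

```python
import numpy as np

def l1_pagerank_ista(A, seed, alpha=0.15, rho=0.05, n_iter=2000):
    """ISTA sketch for the assumed l1-regularized PageRank objective."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)
    Q = alpha * np.diag(d) + 0.5 * (1.0 - alpha) * (np.diag(d) - A)
    s = np.zeros(len(d))
    s[seed] = 1.0
    x = np.zeros(len(d))                   # initialised at zero, as in the text
    for _ in range(n_iter):
        z = x - (Q @ x - alpha * s) / d    # gradient step, step size 1/d_i per coordinate
        x = np.sign(z) * np.maximum(np.abs(z) - rho * alpha, 0.0)  # soft-thresholding
    return x

# Two triangles {0,1,2} and {3,4,5} joined by the edge 2-3; the seed is node 0.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
x_hat = l1_pagerank_ista(A, seed=0)
support = np.flatnonzero(x_hat)            # nodes kept by the regularization
```

For readability this version uses dense matrices and therefore does not exploit locality; an implementation aimed at large graphs would update only the current support and its neighbors.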
Here, we state additional properties of $\ell_1$-regularized PageRank that will be useful for our analysis. The proofs of these lemmas can be found in Appendix B.1.
The following lemma guarantees that the $\ell_1$-regularized PageRank vector is non-negative.
Let be the vector given in (2). Then is non-negative, i.e., for all .
The following lemma guarantees that the gradient of at the optimal solution cannot be positive.
Let be the support set of the optimal solution. Then
3 Statistical guarantees under random model
In this section, we introduce a random model that we consider for generating a target cluster, and then we provide recovery guarantees for $\ell_1$-regularized PageRank. Our results show that the optimal support of $\ell_1$-regularized PageRank recovers the target cluster with bounded false positives. Under additional assumptions, we show that exact recovery is also possible.
3.1 Random graph model
We will assume the graph is generated according to the following model.
Definition 2 (Local random model).
Given a graph that has vertices, let be a target cluster inside the graph, and let denote the complement of . If two vertices and belong to , then we draw an edge between and with probability , independently of all other edges; if and , then we draw an edge with probability , independently of all other edges; and otherwise, we allow any (deterministic or random) model to generate edges among vertices in .
Definition 2 says that the adjacency matrix is symmetric, and for any , we have that is an independent draw from a Bernoulli distribution with probability if , and from a Bernoulli distribution with probability if and . For the rest of the graph, i.e., when both and belong to , can be generated from an arbitrary fixed model. Under this definition, we can also naturally define the average graph, which is the graph induced by the average adjacency matrix , where the expectation is taken with respect to the distribution defined by Definition 2. That is, the average graph is an undirected graph whose adjacency matrix is , where
The average degree matrix is similarly denoted by and the average graph Laplacian is defined as . The model in Definition 2 allows us to formulate the problem of local graph clustering as the recovery of a target cluster. Since we are interested in recovering a single target cluster, it is natural to make assumptions only for nodes in the target cluster and nodes adjacent to the target cluster, and to leave the interactions between other nodes unspecified.
This model is also fairly general and covers several popular random graph models appearing in the literature, including the stochastic block model (SBM) [holland1983stochastic, abbe2017community] and the planted clustering model [alon1998finding, arias2014community, chen2016statistical]. For instance, if the subgraph with the vertices within is generated from the SBM, then the entire graph follows the SBM. On the other hand, if the subgraph of is generated from the classical Erdős-Rényi model with probability , the entire graph follows the Planted Densest Subgraph (in this case nodes in do not belong to any clusters). Hence, the results we obtain here for our model hold more broadly across these different random graph models.
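A graph from the local random model can be sampled as follows. This is an illustrative sketch with hypothetical names: `p` and `q` stand for the within-cluster and cross-cluster edge probabilities, the target cluster is taken to be the first `k` of `n` nodes, and edges inside the complement are drawn with probability `r`. The last choice is just one admissible option, since the model leaves that part of the graph unspecified.

```python
import numpy as np

def sample_local_model(n, k, p, q, r=0.05, seed=0):
    """Sample a symmetric 0/1 adjacency matrix; nodes 0..k-1 form the target cluster."""
    rng = np.random.default_rng(seed)
    probs = np.full((n, n), r)     # complement-complement pairs (arbitrary choice)
    probs[:k, :k] = p              # pairs inside the target cluster
    probs[:k, k:] = q              # pairs between the cluster and its complement
    probs[k:, :k] = q
    draws = rng.random((n, n)) < probs
    upper = np.triu(draws, 1)      # one independent draw per unordered pair, no self-loops
    return (upper | upper.T).astype(int)

A = sample_local_model(n=200, k=20, p=0.5, q=0.01)
```

Setting `r` equal to `q` in this sketch gives a planted dense subgraph, while replacing the complement block with its own block structure gives an SBM.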
Before we move on to our results, we need additional notation. We write to denote a singleton of the given seed node. Let denote the cardinality of the target cluster. According to our local model, any node in the target cluster has the same average degree, , which we denote by . For the nodes outside , we write to denote its average degree, where the expectation is taken with respect to any distribution. For graphs generated according to Definition 2, the following parameter plays a crucial role in determining the behavior of $\ell_1$-regularized PageRank for local graph clustering:
Intuitively, one can think of this ratio as the ratio of the random walker staying inside under the average graph. (Note is the average degree of the target cluster and is the degree of the target cluster when restricted to the subgraph .) Thus, we can expect that the performance of any random walk-based methods will depend strongly on the number . In the extreme scenario where , we have , while for , we have . With this definition, we can also write and .
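One reading of this parameter that matches the limiting cases described above is sketched below. The symbols are assumptions made for illustration, since the display is not reproduced in this text: $p$ and $q$ denote the within-cluster and cross-cluster edge probabilities, $k$ the size of the target cluster, $n$ the number of vertices, $\bar{d}$ the average degree of a node in the target cluster, and $\gamma$ the parameter itself.

```latex
\bar{d} = p\,(k-1) + q\,(n-k),
\qquad
\gamma = \frac{p\,(k-1)}{\bar{d}} = \frac{p\,(k-1)}{p\,(k-1) + q\,(n-k)}.

% No edges leaving the cluster:
q = 0 \;\Longrightarrow\; \gamma = 1.
% No difference between within- and cross-cluster connectivity:
p = q \;\Longrightarrow\; \gamma = \frac{k-1}{(k-1) + (n-k)} = \frac{k-1}{n-1}.
```

Under this reading, $p(k-1) = \gamma \bar{d}$ is the expected number of neighbors inside the cluster and $q(n-k) = (1-\gamma)\bar{d}$ is the expected number outside it.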
3.2 Recovery of target cluster with bounded false positives
Here, we investigate the performance of $\ell_1$-regularized PageRank on the graph generated by the local random model in Definition 2, and we state two of our main theorems.
Our first main result guarantees full recovery of the target cluster for an appropriate choice of the regularization parameter. In particular, if we set to be less than , then the optimal solution (2) fully recovers the target cluster , as long as the seed node is initialized inside . The proof of Theorem 1 is given in Section A.4.
Theorem 1 (Full recovery).
Suppose that . If we set
then with probability at least , [Footnote 1: The precise statement is as follows: assume for a fixed constant , then with probability at least , the statement in the theorem holds.] the solution to Problem (2) fully recovers the cluster , i.e.,
Our next main result provides an upper bound on the false positives present in the support set of the $\ell_1$-regularized PageRank vector. By “false positives,” we mean the nonzero nodes that belong to . We measure the size of false positives using a notion of volume, where we recall the volume of a subset of vertices is given by . The proof of Theorem 2 is given in Section A.5.
Theorem 2 (Bounds on false positives).
The above results, Theorem 1 and Theorem 2, show several regimes where $\ell_1$-regularized PageRank can fully recover the target cluster with nonvanishing probability. In particular, when , the size of the target cluster is required to be larger than , which includes the constant size . This is often the regime of interest for local graph clustering, where the goal is to find small- and meso-scale clusters in massive graphs [LLDM2009, LLM10_communities_CONF]. In addition, Theorem 1 indicates that if is small, then we need to set to be small to recover the entire cluster. Intuitively, more mass will leak out to for small , so we need to run more steps of random walk ( smaller in our optimization framework) to find the right cluster. However, this means that the $\ell_1$-regularized PageRank vector will also pick up many nonzero nodes in , resulting in many false positives in the support set. Indeed, Theorem 2 shows that the volume of false positives grows quadratically as , so we need to be bounded to get a meaningful recovery from local clustering. In the case of , this amounts to requiring that in order for the target cluster to keep high mass inside .
Several other comments are worth making regarding these results. First, the current bound we obtain in (9) may not be tight with respect to and other constants, and the factor may be an artifact of our proof (see also the proof of Theorem 2 and Lemma 10). Based on our empirical results, $\ell_1$-regularized PageRank performs well across a broad range of values, and we have not seen much difference in terms of performance among different ’s. The role of in $\ell_1$-regularized PageRank is closely tied to the regularization parameter , and we leave the question of selecting optimal for future work. On the other hand, we think the rate is still tight, and we demonstrate this through the simulation study (see Section 5).
3.3 Exact recovery of target cluster with no false positives
Next, we study the scenarios under which $\ell_1$-regularized PageRank can exhibit a stronger recovery guarantee. Specifically, under some additional conditions, we show that the support set of the optimal solution (2) identifies the target cluster exactly, without making any false positives. For this stronger exact recovery result, we require the following assumption about the parameters of the model.
We assume and , i.e., the within-cluster connectivity and the size of the target cluster do not scale with the size of the graph . Also, we assume for a fixed numerical constant .
As we noted above, the setting is often the case of interest for local graph clustering, where we would like to identify small- and medium-scale structure in large graphs [LLDM2009, LLM10_communities_CONF]. In this case, Assumption 1 requires , so that the underlying “signal” of the problem does not vanish as the size of the graph grows, . As discussed earlier, this means must also scale as for the local clustering algorithm to find the target without making many false positives.
Now we turn to the statements of exact recovery guarantees for $\ell_1$-regularized PageRank when applied to the noisy graph generated from Definition 2. In particular, the fact that from Assumption 1, allows that with nonvanishing probability there is a node in the target cluster that is solely connected to . This node will serve as a “good” seed node input in the $\ell_1$-regularized PageRank. With this choice of seed node, we now give conditions under which the optimal solution has no false positives with nonvanishing probability. The proof of Theorem 3 is given in Section A.6.
Theorem 3 (No false positives).
then for sufficiently large, with probability at least , [Footnote 3: The precise statement is as follows: assume for a fixed constant , then with probability at least , the statement in the theorem holds.] there is a good starting node in such that $\ell_1$-regularized PageRank parameterized with that node as a seed node satisfies
as long as
where is a universal constant.
While Theorem 3 guarantees no false positives in the solution of $\ell_1$-regularized PageRank, when combined with Theorem 1, it establishes that $\ell_1$-regularized PageRank recovers the target cluster exactly, even when the target cluster is constant-sized. We require and in the condition (10) to avoid overly complicated constants; while this simplifies the statements of the theorem, it is not difficult to show that a similar result holds more generally.
Some sort of condition like (11) about the realized degree seems necessary in order that the $\ell_1$-regularized PageRank has no false positives. The optimization program (2) assigns less weight to low-degree nodes in the penalty, so any nodes adjacent to will become active unless the $\ell_1$-regularized PageRank penalizes them with nontrivial weights. Unlike Theorem 1 and Theorem 2, condition (11) rules out some specific models to which Theorem 3 can be applied. For example, the planted clustering model with and does not satisfy this condition because the degrees in do not concentrate. For the stochastic block model, this condition is still satisfied if nodes adjacent to the target cluster belong to the clusters with degree larger than . In practice, condition (11) may not always be applicable for every node adjacent to , in which case the nodes that violate this condition may enter the model as false positives. We require the condition here though, since our model is essentially local and we do not have control outside beyond its neighbors.
4 Stagewise PageRank and solution paths
In this section, we show that the stagewise algorithm can be used for our $\ell_1$-regularized PageRank problem to approximate the whole regularization path. This is possible because we show that the regularization path is monotonic. Furthermore, we show that the stagewise algorithm requires only local operations per iteration. This means that the computational complexity per iteration depends only on the current nonzero nodes and their neighbors. The local operations, in combination with the monotonicity of the stagewise algorithm, allow us to implement the path algorithm without touching the whole graph, thus making the algorithm strongly local and scalable to large-scale graph analysis.
The forward stagewise algorithm is a popular path algorithm used in sparse regression, and it has been widely studied by many authors, including [lars, hastie2007forward, rosset2004boosting, rosset2007piecewise, zhao2007stagewise, tibshirani2015general]. For the case of $\ell_1$-norm regularization, the algorithm produces a sequence of iterates by updating the current iterate in a direction that maximizes the inner product with the negative gradient of the objective function, while constraining the direction to have a small $\ell_1$ norm. When applied to the $\ell_1$-regularized PageRank optimization problem, we then obtain the following coordinate-wise scheme (recall is the objective function of $\ell_1$-regularized PageRank (2)): for ,
The two main features of this algorithm are: 1) we greedily select the coordinate at each iteration that maximizes the magnitude of gradient, and 2) we update the current iterate by adding a small step size to the th coordinate. This conservative update of the variable counterbalances the greedy selection step, thus making the algorithm more stable.
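The two features above, greedy coordinate selection and a small additive update, can be sketched as follows. This is an illustration under assumptions rather than a transcription of (12): it takes the smooth part of the objective to be $\frac{1}{2} x^\top Q x - \alpha x^\top s$ with $Q = \alpha D + \frac{1-\alpha}{2}(D - A)$, it selects the coordinate with the most negative degree-scaled gradient (the scaling mirrors a degree-weighted penalty and is a choice made for this sketch), and it adds a fixed step `eps`. The name `stagewise_pagerank` is hypothetical. The gradient is stored sparsely, so each step touches only the chosen node and its neighbors.

```python
def stagewise_pagerank(neighbors, seed, alpha=0.15, eps=1e-3, n_steps=500):
    """Forward stagewise sketch on an unweighted graph given as adjacency lists."""
    x = {}                   # sparse iterate; nodes that are absent have value zero
    grad = {seed: -alpha}    # gradient of the smooth part at x = 0 is -alpha * s
    order = []               # nodes in the order in which they enter the support
    for _ in range(n_steps):
        # Candidates are only nodes with a stored gradient: the seed, the support, and neighbors.
        i = min(grad, key=lambda v: grad[v] / len(neighbors[v]))
        if grad[i] >= 0:     # no descent coordinate left
            break
        if i not in x:
            order.append(i)
        x[i] = x.get(i, 0.0) + eps
        d_i = len(neighbors[i])
        grad[i] += 0.5 * (1 + alpha) * d_i * eps             # diagonal entry of Q times eps
        for j in neighbors[i]:
            grad[j] = grad.get(j, 0.0) - 0.5 * (1 - alpha) * eps   # off-diagonal entries of Q
    return x, order

# Two triangles {0,1,2} and {3,4,5} joined by the edge 2-3; the seed is node 0.
nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
x, order = stagewise_pagerank(nbrs, seed=0)
```

Because values are only ever increased, every coordinate path produced by this sketch is nondecreasing, and a node can enter `order` only after one of its neighbors has entered.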
The stagewise algorithm is known to have implicit regularization effects closely related to $\ell_1$-norm regularization [tibshirani2015general], and if each component of the $\ell_1$-regularized solution has a monotone path, then the sequence of outputs generated by the stagewise algorithm exactly coincides with the regularization path, as the step size vanishes [lars, rosset2004boosting]. Interestingly, we show that this is indeed the case for $\ell_1$-regularized PageRank. The stagewise algorithm was recently advocated by [tibshirani2015general], even when there is no correspondence to the $\ell_1$-regularization path, due to its computational efficiency and implicit regularization effect.
Theorem 4 (Monotonicity of solution path).
Let denote the solution for (2) indexed by . Then, is monotone as a function of , i.e., whenever , where is applied component-wise. Moreover, if is positive on node , the inequality becomes strict.
Theorem 4 shows that once a node is picked up by $\ell_1$-regularized PageRank at some , then it will never leave the model thereafter. Furthermore, Theorem 4, combined with known results from [lars, rosset2004boosting], guarantees that the sequence of PageRank vectors produced by (12) gives a provable approximation to the trajectories of $\ell_1$-regularized PageRank as varies, and in the limit as it exactly coincides. The following corollary is thus immediate and we omit the proof.
The stagewise algorithm described in (12) converges to the $\ell_1$-regularized PageRank solution path as the step size goes to , i.e., .
Therefore, the stagewise algorithm allows us to explore the entire $\ell_1$-regularization path via a single run of simple iterative steps. Another advantage of the stagewise algorithm is that it enjoys the locality property, in that the algorithm only touches the chosen nodes and their neighbors as it progresses. This is obvious from the update step of (12) and the expression of the gradient ; if the current iterate has support set , then the gradient has nonzero components only at and its neighbors. This implies that one can compute an approximation to the regularization path, where nodes around the seed node are part of the path, without touching the whole graph. Thus, when terminated early, the algorithm produces an approximate and partial solution path, with access to only a small number of nodes; and the running time of the entire algorithm depends on the size of the output and not on the size of the full graph.
For the $\ell_1$-regularized PageRank, the parameter controls the extent to which the random walk has moved farther from the seed, and so different values of can be used to reveal various scales of local clustering structure around the seed node. Therefore, in the setting of local graph clustering, the stagewise algorithm allows us to provably and efficiently track the evolution of an $\ell_1$-PageRank diffusion and better understand the local cluster properties of the graph. This is well-suited for the purpose of exploratory graph analysis, and the idea of using path algorithms for exploring the graph has also been studied in [gleich2016seeded]. In addition to the exploratory analysis, the stagewise algorithm can be a competitive algorithm to find the target cluster if the size of the target cluster is small or medium and one needs a fine-scale resolution of the solution path. However, when the size of the target cluster is quite large, using optimization algorithms with a coarse grid of regularization parameters may lead to better computational savings without exploring the entire solution path from scratch. Overall, the stagewise algorithm should be used in a way complementary to the optimization algorithms that directly solve (2). We also refer the reader to [tibshirani2015general] for a comprehensive study of the stagewise algorithm for general sparse modeling problems.
In practice, the performance of the stagewise algorithm relies on the selection of step size : when is large, the stagewise PageRank can fail to approximate the solution path and miss details of the local structure; and when is small, the algorithm can accurately approximate the solution path, but at the cost of many more iterations. The issue of tuning step size is important for implementing the stagewise algorithm in general, and we refer the reader to [tibshirani2015general] for more details. In Section 5.1, we investigate the empirical performance of stagewise PageRank for different values of step size.
5 Empirical evaluation
In this section, we provide an empirical evaluation of our main theoretical results. In particular, we use simulated data to illustrate the behavior of the -regularized PageRank method and the stagewise algorithm, for various parameter settings of the stochastic block model; and we apply our method to an image data set, in order to seek certain segments in the image, given pixels in that segment. To solve Problem (2), we use proximal gradient descent, which is known to enjoy both the locality property and a linear convergence rate [FKSCM2017].
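As an illustration of how a proximal gradient method can preserve locality, the following Python sketch may help. It is not the implementation used for the experiments. It assumes the objective (1/2) x'Qx - alpha * x[seed] + rho * alpha * sum_i d_i |x_i| with Q = alpha*D + ((1 - alpha)/2)(D - A), which is a guess at the form of Problem (2); the function names, the argument `d_max` (an upper bound on the maximum degree, supplied by the caller), and the default values are hypothetical.

```python
def grad_entry(adj, x, seed, alpha, i):
    """i-th entry of Qx - alpha*s for the assumed Q = alpha*D + 0.5*(1-alpha)*(D - A)."""
    d_i = len(adj[i])
    g = 0.5 * (1 + alpha) * d_i * x.get(i, 0.0)
    g -= 0.5 * (1 - alpha) * sum(x.get(j, 0.0) for j in adj[i])
    return g - (alpha if i == seed else 0.0)


def prox_grad_pagerank(adj, seed, d_max, alpha=0.15, rho=0.01, n_iters=1000):
    """Proximal gradient sketch that only visits the support and its neighbors."""
    step = 1.0 / d_max  # valid since the largest eigenvalue of Q is at most d_max
    x = {}              # sparse iterate, starting from zero
    for _ in range(n_iters):
        # Outside the support, its neighbors, and the seed, the gradient is zero,
        # so those coordinates stay at zero and need not be visited.
        candidates = {seed} | set(x)
        for i in x:
            candidates.update(adj[i])
        new_x = {}
        for i in candidates:
            v = x.get(i, 0.0) - step * grad_entry(adj, x, seed, alpha, i)
            thresh = step * rho * alpha * len(adj[i])
            # Soft-thresholding: proximal operator of the degree-weighted l1 penalty.
            v = max(v - thresh, 0.0) - max(-v - thresh, 0.0)
            if v != 0.0:
                new_x[i] = v
        x = new_x
    return x
```

At a fixed point of this iteration, the gradient entry equals -rho * alpha * d_i on the support and is bounded by rho * alpha * d_i in magnitude elsewhere, which is the optimality condition for the assumed objective.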
5.1 Simulated data
Here, we run a series of simulations on graphs generated from the stochastic block model. In all cases, the model consists of clusters, each of which has the same number of nodes and only one of which is the target cluster . We use the same parameters and across different clusters to generate edges within and between clusters. We fix .
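For concreteness, a graph of this kind could be sampled as in the short Python sketch below. The function name and the argument names `p` (within-cluster edge probability) and `q` (between-cluster edge probability) are hypothetical, and no particular parameter values from the experiments are implied.

```python
import random


def sample_sbm(n_clusters, cluster_size, p, q, rng):
    """Sample an undirected stochastic block model with equal-sized clusters.

    Nodes 0..cluster_size-1 form the first cluster, and so on. Each pair of
    nodes is joined independently with probability p (same cluster) or q
    (different clusters). Returns a dict mapping each node to its neighbor set.
    """
    n = n_clusters * cluster_size
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            same_cluster = (i // cluster_size) == (j // cluster_size)
            if rng.random() < (p if same_cluster else q):
                adj[i].add(j)
                adj[j].add(i)
    return adj


# Example draw with illustrative values only.
graph = sample_sbm(n_clusters=3, cluster_size=20, p=0.5, q=0.02, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps individual draws reproducible, which is convenient when results are averaged over many trials.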
Simulation 1. -regularized solution path and stagewise algorithm.
We have seen in Section 4 that the stagewise algorithm generates a provable approximation to the -regularization path. Here, we visually compare the actual solution path to the stagewise algorithm paths for different step sizes. We generate data from the stochastic block model with and . Each cluster has nodes. Figure 1 shows the -regularization path and stagewise component paths for one particular draw from the stochastic block model. We only show the solution paths for nodes in the target cluster without seed node, among nodes. Note that when the step size is small, , the stagewise path appears to closely match the -regularization path; for moderate step size, , the stagewise path exhibits a somewhat jagged pattern but nevertheless accurately approximates the optimal path; and for relatively large step size, , while the jagged pattern becomes more visually evident, the overall trend still coincides well with that of the solution path.
Simulation 2. Bounds on the false positives.
Next, we study the role of in determining the quality of the cluster recovered from a noisy graph. In particular, we examine the bound predicted by Theorem 2, , using the simulated data. The graph is generated from the stochastic block model where we fix and . We vary , and from to with evenly spaced on the -scale, to get the total values. The probability is then determined accordingly. To generate the plots, we solve problem (2) and select the largest that recovers the entire target cluster. Figure 2(a) shows plotted against ; we see that the error scales approximately linearly, which is consistent with our finding. Figure 2(b) shows the same plot as Figure 2(a), except that each curve is now scaled by . We see that all the curves nearly coincide, which suggests that the result we obtain in Theorem 2 may not be tight. Improving the bound would be an interesting future direction. Finally, Figure 2(c) zooms in on Figure 2(a) for large values of where is strictly less than . In this case, the recovered cluster contains only a small portion of false positives, showing good recovery results.
Simulation 3. Selection of the tuning parameter.
Finally, we discuss the practical issue of finding the regularization parameter in (2), or equivalently the number of iterations in the stagewise algorithm. We define the true positive rate and the false rate as and , respectively. We will also make use of conductance, which is defined as the ratio , where . Conductance measures the weight of the edges that are being removed (numerator) divided by the volume of the cluster (denominator). Lower conductance values correspond to better quality clusters. We generate the target cluster using a stochastic block model with clusters, each of which has nodes. We set and vary in order to generate various , as shown in Figure 3. For each experiment we run the stagewise algorithm with and the results are averaged over trials.
Figure 3 illustrates how the false discovery rate, the true positive rate, and the conductance change as the stagewise algorithm explores the -regularization path. For large , Figure 3(a), conductance is a good metric for finding the target cluster. This means that we will find the target cluster with low FR and large TR if, out of all solutions on the path, we choose the one with minimum conductance. As gets smaller, the minimum conductance no longer relates to the target cluster. However, it is clear from Figures 3(b) and 3(c) that the stagewise algorithm with minimum conductance still finds the target cluster with good accuracy if the algorithm is terminated early. Finally, in Figure 3(d) we demonstrate a case where is small and conductance fails completely to relate to the target cluster. Developing a general strategy for selecting the number of iterations, or the regularization parameter, is challenging, and we leave this for future work.
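The quantities above can be computed as in the following Python sketch. Since the defining formulas are not reproduced here, the sketch makes two assumptions: conductance is taken to be the number of edges leaving the cluster divided by the volume (sum of degrees) of the cluster, as the verbal description suggests, and the false rate is taken to be the fraction of recovered nodes that lie outside the target cluster. Other conventions exist (for example, dividing by the smaller of the two volumes), and the function names are hypothetical.

```python
def conductance(adj, cluster):
    """Edges leaving `cluster` divided by the volume of `cluster` (unweighted graph)."""
    cluster = set(cluster)
    cut = sum(1 for i in cluster for j in adj[i] if j not in cluster)
    volume = sum(len(adj[i]) for i in cluster)
    return cut / volume


def true_and_false_rates(recovered, target):
    """Return (TR, FR) under the assumed definitions described above."""
    recovered, target = set(recovered), set(target)
    true_rate = len(recovered & target) / len(target)
    false_rate = len(recovered - target) / len(recovered)
    return true_rate, false_rate


# Two triangles joined by the single edge (2, 3).
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
phi = conductance(two_triangles, {0, 1, 2})  # one cut edge, volume 2 + 2 + 3 = 7
```

Evaluating `conductance` on the support of each iterate along a path and keeping the minimizer is one way to realize the selection rule discussed in this simulation.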
5.2 Image data
In this section, we present the performance of -regularized PageRank on real graphs generated from image data. In particular, given an image, we generate an adjacency matrix using the color and position features of pixels, as described in [SM2000]. Then we run local graph clustering by using certain pixels as seed nodes. For each seed node, we solve a different -regularized problem. In Figure 4 we demonstrate that there exists such that -regularized PageRank finds a good approximation of certain segments in the image. The seed nodes are depicted in Figure 4(a), and the target clusters are evident from Figure 4(b) and Figure 4(c). We provide details about the seed nodes and the regularization parameter in Figure 4(a). For all clusters we use .
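To give a sense of how such an adjacency matrix can be built, the Python sketch below computes a Gaussian-type affinity in the spirit of [SM2000] for a small grayscale image: two pixels are connected if they lie within a given radius, with a weight that decays in both their intensity difference and their spatial distance. This is only an illustration under stated assumptions; the exact feature choice, the scaling of the exponents, and the parameter names `sigma_i`, `sigma_x`, and `radius` are hypothetical and are not taken from the experiments.

```python
import math


def pixel_affinity_graph(image, sigma_i=0.1, sigma_x=4.0, radius=2):
    """Return a dict mapping pixel-index pairs (a, b) to a positive edge weight.

    `image` is a list of rows of grayscale intensities; pixel (r, c) has index
    r * n_cols + c. Only pairs within Euclidean distance `radius` are connected.
    """
    rows, cols = len(image), len(image[0])
    weights = {}
    for r in range(rows):
        for c in range(cols):
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    r2, c2 = r + dr, c + dc
                    dist2 = dr * dr + dc * dc
                    if dist2 == 0 or dist2 > radius * radius:
                        continue  # skip self-pairs and pixels that are too far apart
                    if not (0 <= r2 < rows and 0 <= c2 < cols):
                        continue  # neighbor falls outside the image
                    diff = image[r][c] - image[r2][c2]
                    w = math.exp(-diff * diff / sigma_i ** 2) * math.exp(-dist2 / sigma_x ** 2)
                    weights[(r * cols + c, r2 * cols + c2)] = w
    return weights


# A tiny image with a dark left half and a bright right half.
toy_image = [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]]
toy_weights = pixel_affinity_graph(toy_image)
```

In this toy example, edges within each half receive weights close to one, while edges crossing the intensity boundary are nearly zero, so a seed pixel in one half is only weakly connected to the other half.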
We have examined from a statistical perspective the -regularized PageRank algorithm for the local graph clustering problem. To do so, we reformulated the local graph clustering problem as a statistical recovery problem, where we impose a certain local random model on the graph, and then our task is to recover the target cluster generated from the model. Our results show that the optimal support of the -regularized PageRank vector identifies the target cluster with bounded false positives, and in certain settings exact recovery is also possible. Additionally, we have brought the idea of solution path algorithms from the sparse regression literature to the local graph clustering literature, and we showed that the forward stagewise algorithm gives a provable approximation to the entire regularization path of this algorithm. This extends the domain of existing local graph clustering algorithms, providing an efficient tool for exploring very large graphs without the need to touch the entire graph. This is a primitive that can be highly desirable in modern data analysis, where computational challenges are of great concern.
We would like to acknowledge ARO, DARPA, NSF, ONR, and Intel for providing partial support for this work.
Appendix A Proofs of theorems
In this section, we prove all of our theorems. To establish the theorems, we need a few concentration inequalities and intermediate results that will be used in the proofs. In Section A.1, we give degree concentration inequalities for random graphs, and in Section A.2, we state recovery guarantees of -regularized PageRank on the average graph. Section A.3 gives a few important results on the -regularized PageRank when restricted to the target cluster. Based on these results, we give proofs of our four theorems, respectively, in Section A.4, Section A.5, Section A.6, and Section A.7. All of the proofs of intermediate lemmas are deferred to Section B.
A.1 Concentration lemmas
Here, we present several concentration lemmas for degrees of random graphs. The first lemma is a consequence of Chernoff’s inequality for the sum of independent Bernoulli random variables, combined with a union bound (cf. [vershynin2018high]).
Lemma 3 (Adapted from [vershynin2018high]).
Let be the sum of independent Bernoulli random variables, with for . Then, there exists a universal constant such that if , then with probability at least ,
According to our random model (Definition 2), the degree vector for the target cluster consists of random variables, each of which is a sum of independent Bernoulli random variables with common mean . Therefore, by applying Lemma 3, it is straightforward to see the following result.
Lemma 4 (Adapted from [vershynin2018high]).
There exists a universal constant such that if , then with probability at least ,
where is the average degree of the vertices in the cluster .
The next lemma bounds the number of edges between node and the target cluster when .
Suppose that and for a positive constant . Then for sufficiently large, with probability at least ,
A.2 Exact recovery for expected graph
Here we define the “ground truth” which we obtain by applying -regularized PageRank to the average graph of our random model. Recall the adjacency matrix defined in (5), the associated diagonal degree matrix , and the Laplacian matrix . Writing , we denote the average -regularized PageRank as
Compared to (2), we see that the matrices associated with the graph are now replaced by its expected counterpart.
The following lemma shows that there exists a range of values for which the average -regularized PageRank gives an exact recovery for the target cluster, namely .
Lemma 6 shows that has high probability mass on the seed node with a long tail further away from the seed node. The mass outside the target cluster is thresholded exactly to zero via -norm regularization, thus identifying the target cluster without any false positives. Of course, in practice, the average graph is not available, and we are instead given an instance of the graph from the random model. In this case, the average -regularized PageRank, , can serve as a ground truth which will allow us to estimate the magnitude of the -regularized PageRank vector by analyzing the error between and . Lemma 6 is used in Lemma 9 below.
A.3 Reduced -regularized PageRank
Here, we introduce the following regularized optimization problem:
Note that the above reduced problem is defined as the -regularized PageRank problem restricted to the target cluster . Since our aim is to recover the target cluster based on the local model (Definition 2), we need to analyze the properties of the solution to this reduced problem. Here, abusing notation, we use to denote the vector either in the original space or in the reduced space .
We present several lemmas about that will be critical for the proofs of our theorems. First, we give a guarantee on the recovery of the target cluster for the optimal solution (A.3).
For the above result, since by construction is zero outside , all we need to show is that for . Next, we compare the support set of to .
For any regularization parameter , we have
Finally, we have the following estimate on the entries of on (recall and are the components of on and respectively, see Lemma 6).
A.4 Proof of Theorem 1
A.5 Proof of Theorem 2
To prove Theorem 2, we first need the following lemma for bounding the volume of . This lemma is a stronger version of [fountoulakis15, Theorem 2].
For any regularization parameter , it holds that
Now, by our choice of (8), then
where the second step follows from (6). Furthermore, by Theorem 1, contains the target cluster , so the errors in are solely attributed to the presence of false positives. Denoting by FP the set of false positives in , we can write
Since by Lemma 4, it follows that
with probability at least . This proves the theorem.
A.6 Proof of Theorem 3
To prove Theorem 3, we first show that with nonvanishing probability, the target cluster has a good seed node that is connected only within .
Then we select this node as a seed node in the -regularized PageRank.
Notice that the above condition is the same as the optimality condition (4) for the full-dimensional problem, for nodes in . Also, by definition, satisfies (4) for nodes in . Hence, if satisfies (17), by uniqueness of the solution it follows that . Since by construction, this proves that there are no false positives in .
Write as the sum of two terms,
The first term can be ignored, since by Lemma 11, we choose the seed node to be the one that is solely connected to . For the second term, we use Hölder’s inequality to bound
where is a fixed positive constant. Since by assumption, we can further bound it as
We need the above bound to be strictly less than , or
Plugging expression (10) into the above inequality, and using and , we obtain
for some constant . This proves the theorem.
A.7 Proof of Theorem 4
First, by [rosset2007piecewise], we know that the optimal solution path for the -regularized problem (2) is piecewise linear as a function of , and in particular this implies that the path is continuous.
Next, we prove that if the support set of remains constant in the interval (), then is strictly decreasing on , i.e.,