How Well Do Local Algorithms Solve Semidefinite Programs?
Several probabilistic models from high-dimensional statistics and machine learning reveal an intriguing, and yet poorly understood, dichotomy. Either simple local algorithms succeed in estimating the object of interest, or even sophisticated semi-definite programming (SDP) relaxations fail. In order to explore this phenomenon, we study a classical SDP relaxation of the minimum graph bisection problem, when applied to Erdős-Rényi random graphs with bounded average degree $d > 1$, and obtain several types of results. First, we use a dual witness construction (using the so-called non-backtracking matrix of the graph) to upper bound the SDP value. Second, we prove that a simple local algorithm approximately solves the SDP to within a factor $2d^2/(2d^2 + d - 1)$ of the upper bound. In particular, the local algorithm is at most $8/9$ suboptimal, and $1 + O(1/d)$ suboptimal for large degree.
We then analyze a more sophisticated local algorithm, which aggregates information according to the harmonic measure on the limiting Galton-Watson (GW) tree. The resulting lower bound is expressed in terms of the conductance of the GW tree and matches surprisingly well the empirically determined SDP values on large-scale Erdős-Rényi graphs.
We finally consider the planted partition model. In this case, purely local algorithms are known to fail, but they do succeed if a small amount of side information is available. Our results imply quantitative bounds on the threshold for partial recovery using SDP in this model.
1 Introduction
Semi-definite programming (SDP) relaxations are among the most powerful tools available to the algorithm designer. However, while efficient specialized solvers exist for several important applications [BM03, WS08, NN13], generic SDP algorithms are not well suited for large-scale problems. At the other end of the spectrum, local algorithms attempt to solve graph-structured problems by taking, at each vertex of the graph, a decision that is only based on a bounded-radius neighborhood of that vertex [Suo13]. As such, they can be implemented in linear time, or constant time on a distributed platform. On the flip side, their power is obviously limited.
Given these fundamental differences, it is surprising that these two classes of algorithms behave similarly on a number of probabilistic models arising from statistics and machine learning. Let us briefly review two well-studied examples of this phenomenon.
In the (generalized) hidden clique problem, a random graph $G$ over $n$ vertices is generated as follows: A subset $S$ of $k$ vertices is chosen uniformly at random among all sets of that size. Conditional on $S$, any two vertices $i$, $j$ are connected by an edge independently with probability $p$ if $\{i, j\} \subseteq S$ and probability $q$ otherwise. Given a single realization of this random graph $G$, we are requested to find the set $S$. (The original formulation [Jer92] of the problem uses $p = 1$, $q = 1/2$, but it is useful to consider the case of general $p$, $q$.)
SDP relaxations for the hidden clique problem were studied in a number of papers, beginning with the seminal work of Feige and Krauthgamer [FK00, AV11, MPW15, DM15b, BHK16]. Remarkably, even the most powerful among these relaxations, which are constructed through the sum-of-squares (SOS) hierarchy, fail unless $k \gtrsim \sqrt{n}$ [BHK16], while exhaustive search succeeds with high probability as soon as $k \ge C \log n$ for $C$ a constant. Local algorithms can be formally defined only for a sparse version of this model, whereby $p, q = \Theta(1/n)$ and hence each node has bounded average degree [Mon15]. In this regime, there exists an optimal local algorithm for this problem that is related to the ‘belief propagation’ heuristic in graphical models. The very same algorithm can be applied to dense graphs (i.e. $p, q = \Theta(1)$), and was proven to succeed if and only if $k \ge (1 + \varepsilon)\sqrt{n/e}$ [DM15a]. Summarizing, the full power of the SOS hierarchy, despite having a much larger computational burden, does not qualitatively improve upon the performance of simple local heuristics.
As a second example, we consider the two-groups symmetric stochastic block model (also known as the planted partition problem) that has attracted considerable attention in recent years as a toy model for community detection in networks [DKMZ11, KMM13, MNS13, Mas14, BLM15, GV15]. A random graph $G$ over $n$ vertices is generated by partitioning the vertex set into two subsets $S_+$, $S_-$ of size $n/2$ uniformly at random. (To avoid notational nuisances, we assume $n$ even.) Conditional on this partition, any two vertices $i$, $j$ are connected by an edge independently with probability $a/n$ if $\{i, j\} \subseteq S_+$ or $\{i, j\} \subseteq S_-$ (the two vertices are on the same side of the partition), and with probability $b/n$ otherwise (the two vertices are on different sides). Given a single realization of the random graph $G$, we are requested to identify the partition.
While several ‘success’ criteria have been studied for this model, for the sake of simplicity we will focus on weak recovery (also referred to as ‘detection’ or ‘partial recovery’). Namely, we want to attribute $\{+1, -1\}$ labels to the vertices so that, with high probability, at least $(1/2 + \varepsilon) n$ vertices are labeled correctly (up to a global sign flip that cannot be identified). It was conjectured in [DKMZ11] that this is possible if and only if $\lambda > 1$, where $\lambda \equiv (a - b)/\sqrt{2(a + b)}$ is an effective ‘signal-to-noise ratio’ parameter. This conjecture followed from the heuristic analysis of a local algorithm based, once again, on belief propagation. The conjecture was subsequently proven in [MNS12, MNS13, Mas14] through the analysis of carefully constructed spectral algorithms. While these algorithms are, strictly speaking, not local, they are related to the linearization of belief propagation around a ‘non-informative fixed point’.
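The weak recovery criterion can be made concrete with a small helper. The following is a minimal sketch, assuming labels are stored as arrays with entries in {+1, -1}; the function names `overlap_up_to_sign` and `achieves_weak_recovery` are hypothetical and only serve the illustration.

```python
import numpy as np

def overlap_up_to_sign(truth, estimate):
    """Fraction of correctly labeled vertices, maximized over a global sign flip."""
    truth = np.asarray(truth)
    estimate = np.asarray(estimate)
    agree = float(np.mean(truth == estimate))
    # Flipping every estimated label turns agreements into disagreements.
    return max(agree, 1.0 - agree)

def achieves_weak_recovery(truth, estimate, eps):
    """True if at least a (1/2 + eps) fraction of vertices is labeled correctly."""
    return overlap_up_to_sign(truth, estimate) >= 0.5 + eps

truth = [1, 1, -1, -1]
print(overlap_up_to_sign(truth, [-1, -1, 1, 1]))  # globally flipped labels
print(overlap_up_to_sign(truth, [1, -1, 1, -1]))  # uninformative labels
```

An estimate that is correct only up to a global flip counts as perfect, while an estimate uncorrelated with the truth has overlap close to 1/2.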
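Convex optimization approaches for this problem are based on the classical SDP relaxation of the minimum-bisection problem. Denoting by $A$ the adjacency matrix of $G$, the minimum bisection problem is written as
\[ \text{maximize} \quad \sum_{i,j=1}^{n} A_{ij}\sigma_i\sigma_j, \tag{1} \]
\[ \text{subject to} \quad \sigma_i \in \{+1, -1\}, \quad \sum_{i=1}^{n} \sigma_i = 0. \tag{2} \]
The following SDP relaxes the above problem, where $d = (a + b)/2$ is the average degree:
\[ \text{maximize} \quad \Big\langle A - \frac{d}{n}\mathbf{1}\mathbf{1}^{\mathsf T}, X \Big\rangle, \qquad \text{subject to} \quad X \succeq 0, \quad X_{ii} = 1 \ \text{ for all } i \in \{1, \dots, n\}. \tag{3} \]
(Here, the term $-(d/n)\mathbf{1}\mathbf{1}^{\mathsf T}$ can be thought of as a relaxation of the hard constraint $\sum_i \sigma_i = 0$.) This SDP relaxation has a weak recovery threshold $\lambda_c^{\mathrm{SDP}}$ that appears to be very close to the ideal one $\lambda = 1$. Namely, Guédon and Vershynin [GV15] proved $\lambda_c^{\mathrm{SDP}} \le C$ for $C$ a universal constant, while [MS16] established $\lambda_c^{\mathrm{SDP}} = 1 + o_d(1)$ for large average degree $d$.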
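As a sanity check on the relaxation, one can evaluate the SDP objective at rank-one points built from label vectors. The sketch below assumes that (3) maximizes the inner product of $A - (d/n)\mathbf{1}\mathbf{1}^{\mathsf T}$ with $X$ over positive semidefinite matrices with unit diagonal (the standard form of this relaxation); the helper names are hypothetical.

```python
import numpy as np

def sdp_objective(A, X, d):
    """Value <A - (d/n) 11^T, X> of the (assumed) objective of SDP (3)."""
    n = A.shape[0]
    M = A - (d / n) * np.ones((n, n))
    return float(np.sum(M * X))

def is_feasible(X, tol=1e-8):
    """Check symmetry, unit diagonal and positive semidefiniteness."""
    symmetric = np.allclose(X, X.T, atol=tol)
    unit_diag = np.allclose(np.diag(X), 1.0, atol=tol)
    psd = np.linalg.eigvalsh((X + X.T) / 2).min() >= -tol
    return bool(symmetric and unit_diag and psd)

# Toy graph on 4 vertices with edges {0,1} and {2,3}.
A = np.zeros((4, 4))
A[0, 1] = A[1, 0] = A[2, 3] = A[3, 2] = 1.0
d = A.sum() / 4  # empirical average degree, equal to 1 here

balanced = np.array([1.0, 1.0, -1.0, -1.0])  # a bisection cutting no edge
all_ones = np.ones(4)                        # violates the balance constraint

print(sdp_objective(A, np.outer(balanced, balanced), d))
print(sdp_objective(A, np.outer(all_ones, all_ones), d))
```

Both rank-one matrices are feasible for the relaxation, but the unbalanced one is penalized by the $(d/n)\mathbf{1}\mathbf{1}^{\mathsf T}$ term, which is the sense in which this term relaxes the balance constraint.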
Summarizing, also for the planted partition problem local algorithms (belief propagation) and SDP relaxations behave in strikingly similar ways. (An important remark is that strictly local algorithms are ineffective in the planted partition problem. This peculiarity is related to the symmetry of the model, and can be resolved in several ways, for instance by an oracle that reveals an arbitrarily small fraction of the true vertex labels, or by running belief propagation for a logarithmic, rather than constant, number of iterations. We refer to Section 2.3 for further discussion of this point.) In addition to the above rigorous results, numerical evidence suggests that the two thresholds are very close for all degrees $d$, and that the reconstruction accuracy above these thresholds is also very similar [JMRT16].
The conjectural picture emerging from these and similar examples can be described as follows. For statistical inference problems on sparse random graphs, SDP relaxations are no more powerful than local algorithms (possibly supplemented with a small amount of side information to break symmetries). On the other hand, any information that is genuinely non-local is not exploited even by sophisticated SDP hierarchies. Of course, formalizing this picture is of utmost practical interest, since it would entail a dramatic simplification of algorithmic options.
With this general picture in mind, it is natural to ask: Can semidefinite programs be (approximately) solved by local algorithms for a large class of random graph models? A positive answer to this question would clarify the equivalence between local algorithms and SDP relaxations.
Here, we address this problem by considering the semidefinite program (3), for two simple graph models, the Erdős-Rényi random graph with average degree $d$, $G \sim G(n, d/n)$, and the two-groups symmetric block model, $G \sim G(n, a/n, b/n)$. We establish the following results (denoting by $\mathrm{SDP}(A)$ the value of (3)).
- Approximation ratio of local algorithms.
We prove that there exists a simple local algorithm that approximates $\mathrm{SDP}(A)$ (when $G \sim G(n, d/n)$) within a factor $2d^2/(2d^2 + d - 1)$, asymptotically for large $n$. In particular, the local algorithm is at most a factor $8/9$ suboptimal, and $1 + O(1/d)$ suboptimal for large degree.
Note that $\mathrm{SDP}(A)$ concentrates tightly around its expected value. When we write that an algorithm approximates $\mathrm{SDP}(A)$, we mean that it returns a feasible solution whose value satisfies the claimed approximation guarantee.
- Typical SDP value.
Our proof provides upper and lower bounds on $\mathrm{SDP}(A)$ for $G \sim G(n, d/n)$, implying in particular $\mathrm{SDP}(A)/n = 2\sqrt{d}\,(1 - \Theta(1/d))$ where the term $\Theta(1/d)$ has explicit upper and lower bounds. While the lower bound is based on the analysis of a local algorithm, the upper bound follows from a dual witness construction which is of independent interest.
Our upper and lower bounds are plotted in Fig. 1 together with the results of numerical simulations.
- A local algorithm based on harmonic measures.
The simple local algorithm above uses randomness available at each vertex of $G$ and aggregates it uniformly within a neighborhood of each vertex. We analyze a different local algorithm that aggregates information in proportion to the harmonic measure of each vertex. We characterize the value achieved by this algorithm in the large $n$ limit in terms of the conductance of a random Galton-Watson tree. Numerical data (obtained by evaluating this value and also solving the SDP (3) on large random graphs), as well as a large-$d$ asymptotic expansion, suggest that this lower bound is very accurate, cf. Fig. 1.
- SDP detection threshold for the stochastic block model.
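We then turn to the weak recovery problem in the two-group symmetric stochastic block model $G \sim G(n, a/n, b/n)$. As above, it is more convenient to parametrize this model by the average degree $d = (a + b)/2$ and the signal-to-noise ratio $\lambda = (a - b)/\sqrt{2(a + b)}$. It was known from [GV15] that the threshold for SDP to achieve weak recovery is $\lambda_c^{\mathrm{SDP}}(d) \le C$, and in [MS16] that $\lambda_c^{\mathrm{SDP}}(d) = 1 + o_d(1)$ for large degree. Our results provide more precise information, implying in particular $\lambda_c^{\mathrm{SDP}}(d) \le \min(2 - d^{-1}, 1 + C d^{-1/4})$ for $C$ a universal constant.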
2 Main results
In this section we recall the notion of local algorithms, as specialized to solving the problem (3). We then state formally our main results. For general background on local algorithms, we refer to [HLS14, GS14, Lyo14]: this line of work is briefly discussed in Section 3.
Note that the application of local algorithms to solve SDPs is not entirely obvious, since local algorithms are normally defined to return a quantity for each vertex in $V$, instead of a matrix $X$ whose rows and columns are indexed by those vertices. Our point of view will be that a local algorithm can solve the SDP (3) by returning, for each vertex $i \in V$, a random variable $\xi_i$, and the SDP solution $X$ associated to this local algorithm is the covariance matrix of $\xi = (\xi_i)_{i \in V}$ with respect to the randomness of the algorithm execution. An arbitrarily good approximation of this solution can be obtained by repeatedly sampling the vector $\xi$ (i.e. by repeatedly running the algorithm with independent randomness).
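Formally, let $\mathcal{G}_*$ be the space of (finite or) locally finite rooted graphs, i.e. of pairs $(G, o)$ where $G = (V, E)$ is a locally finite graph and $o \in V$ is a distinguished root vertex. We denote by $\mathcal{G}_*(\mathbb{R})$ the space of tuples $(G, o, z)$ where $(G, o) \in \mathcal{G}_*$ and $z: V \to \mathbb{R}$ associates a real-valued mark to each vertex of $G$. Given a graph $G$ and a vertex $i$, we denote by $B_\ell(i; G)$ the subgraph induced by vertices whose graph distance from $i$ is at most $\ell$, rooted at $i$. If $G$ carries marks $z$, it is understood that $B_\ell(i; G)$ inherits the ‘same’ marks. We will write $B_\ell(i; G)$ in this case instead of the cumbersome (but more explicit) notation that also lists the restricted marks.
A radius-$\ell$ local algorithm for the semidefinite program (3) is any measurable function $F: \mathcal{G}_*(\mathbb{R}) \to \mathbb{R}$ such that
- $F(G, o, z) = F(G', o', z')$ if $B_\ell(o; G) \simeq B_\ell(o'; G')$, where $\simeq$ denotes graph isomorphism that preserves the root vertex and vertex marks.
- Letting $z = (z_i)_{i \in V}$ be i.i.d. with $z_i \sim N(0, 1)$, we have $E_z[F(G, o, z)^2] = 1$. (Here and below $E_z$ denotes expectation with respect to the random variables $z$.)
We denote the set of such functions by $\mathcal{F}_*(\ell)$.
A local algorithm is a radius-$\ell$ local algorithm for some fixed $\ell$ (independent of the graph). The set of such functions is denoted by $\mathcal{F}_* = \cup_{\ell \ge 0} \mathcal{F}_*(\ell)$.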
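We apply a local algorithm $F$ to a fixed graph $G$ by generating i.i.d. marks $z = (z_i)_{i \in V}$ as $z_i \sim N(0, 1)$, and producing the random variable $\xi_i = F(G, i, z)$ for each vertex $i \in V$. In other words, we use the radius-$\ell$ local algorithm to compute, for each vertex of $G$, a function of the ball of radius $\ell$ around that vertex that depends on the additional randomness provided by the $z_i$’s in this ball. The covariance matrix $X = E_z[\xi\xi^{\mathsf T}]$ is a feasible point for the SDP (3), achieving the value $n\,\mathcal{E}(F; G)$ where
\[ \mathcal{E}(F; G) \equiv \frac{1}{n} \sum_{i,j=1}^{n} \Big(A_{ij} - \frac{d}{n}\Big)\, E_z\big[F(G, i, z)\, F(G, j, z)\big]. \tag{4} \]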
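The passage from a local algorithm to a feasible SDP point can be illustrated numerically. The following is a minimal sketch under stated assumptions: marks are standard Gaussian, and the local rule is a hypothetical radius-1 linear rule (each vertex averages its own mark and its neighbors' marks, normalized to unit variance). It compares the exact covariance with a Monte Carlo estimate obtained by repeatedly running the rule.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 3.0

# Erdős-Rényi graph G(n, d/n) as a symmetric 0/1 adjacency matrix.
upper = np.triu(rng.random((n, n)) < d / n, k=1)
A = (upper | upper.T).astype(float)

# Hypothetical radius-1 linear rule: xi_i = (z_i + sum_{j ~ i} z_j) / norm_i.
W = A + np.eye(n)
W /= np.linalg.norm(W, axis=1, keepdims=True)  # enforces E[xi_i^2] = 1

# Exact covariance of xi = W z for z ~ N(0, I).
X_exact = W @ W.T

# Monte Carlo estimate from T independent runs of the algorithm.
T = 20000
Z = rng.standard_normal((T, n))
Xi = Z @ W.T                 # row t holds the output vector of run t
X_mc = Xi.T @ Xi / T

value = float(np.sum((A - d / n) * X_exact))
print("unit diagonal:", np.allclose(np.diag(X_exact), 1.0))
print("max sampling error:", np.abs(X_mc - X_exact).max())
print("SDP objective per vertex:", value / n)
```

Any covariance matrix is positive semidefinite, so the unit-variance normalization is the only constraint that has to be enforced by hand.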
We are now in position to state our main results.
2.1 Erdős-Rényi random graphs
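We first prove an optimality guarantee, in the large $n$ limit, for the value achieved by a simple local algorithm (or more precisely, a sequence of simple local algorithms) solving (3) on the Erdős-Rényi graph.
Theorem 2.2. Fix $d > 0$ and let $A$ be the adjacency matrix of the Erdős-Rényi random graph $G \sim G(n, d/n)$, with $\mathrm{SDP}(A)$ the value of (3). Then for $d \in [0, 1]$, almost surely, $\lim_{n \to \infty} \mathrm{SDP}(A)/n = d$. For $d > 1$, almost surely,
\[ 2\sqrt{d}\,\Big(1 - \frac{1}{d + 1}\Big) \le \liminf_{n \to \infty} \frac{1}{n}\mathrm{SDP}(A) \le \limsup_{n \to \infty} \frac{1}{n}\mathrm{SDP}(A) \le 2\sqrt{d}\,\Big(1 - \frac{1}{2d}\Big). \tag{5} \]
Furthermore, there exists a sequence of local algorithms that achieves the lower bound. Namely, for each $\varepsilon > 0$, there exist $\ell$ and a radius-$\ell$ local algorithm $F$ such that, almost surely, the value per vertex achieved by $F$ on $G$ is at least $2\sqrt{d}\,(1 - 1/(d + 1)) - \varepsilon$ in the limit $n \to \infty$.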
As anticipated in the introduction, the upper and lower bounds of (5) approach each other for large $d$, implying in particular $\mathrm{SDP}(A)/n = 2\sqrt{d}\,(1 + O(1/d))$. This should be compared with the result of [MS16] yielding $\mathrm{SDP}(A)/n = 2\sqrt{d}\,(1 + o_d(1))$. Also, by simple calculus, the upper and lower bounds stay within a ratio bounded by $9/8$ for all $d > 1$, with the worst-case ratio being achieved at $d = 2$. Finally, they again converge as $d \to 1$, implying in particular $\mathrm{SDP}(A)/n \to 1$ for $d \to 1$.
The result for $d < 1$ is elementary and only stated for completeness. Indeed, for $d < 1$, the graph $G$ decomposes with high probability into disconnected components of size $O(\log n)$, which are all trees or unicyclic [JLR11]. As a consequence, the vertex set can be partitioned into two subsets of size $n/2$ so that at most one connected component of $G$ has vertices on both sides of the partition, and hence at most two edges cross the partition. By using the feasible point $X = \sigma\sigma^{\mathsf T}$ with $\sigma \in \{+1, -1\}^n$ the indicator vector of the partition, we get $\mathrm{SDP}(A) \ge 2|E| - O(1)$, whence the claim follows immediately.
In the proof of Theorem 2.2, we will assume . Note that the case follows as well, since is a Lipschitz continuous function of , with Lipschitz constant equal to one. This implies that , are continuous functions of .
The local algorithm achieving the lower bound of Theorem 2.2 is extremely simple. At each vertex $i$, it outputs a weighted sum of the random variables $z_j$ with $j \in B_\ell(i; G)$, with weight proportional to $d^{-\mathrm{dist}(i,j)/2}$ (here $\mathrm{dist}(i, j)$ is the graph distance between vertices $i$ and $j$). When applied to random $d$-regular graphs, this approach is related to the Gaussian wave function of [CGHV15] and is known to achieve the SDP value in the large $n$ limit [MS16].
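The weighted-sum rule just described can be sketched in a few lines. This is an illustration only, under the assumption that the weight of $z_j$ in the output at vertex $i$ is proportional to $d^{-\mathrm{dist}(i,j)/2}$ within the ball of a given radius, with a per-vertex normalization to unit variance; the function names and the parameter values are hypothetical.

```python
import numpy as np
from collections import deque

def ball_distances(adj, root, radius):
    """Breadth-first search: map each vertex within `radius` of `root` to its distance."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # do not explore beyond the ball
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def local_weights(A, d, radius):
    """Row i holds the normalized weights d^(-dist(i,j)/2) over the ball around i."""
    n = A.shape[0]
    adj = [np.flatnonzero(A[i]) for i in range(n)]
    W = np.zeros((n, n))
    for i in range(n):
        for j, k in ball_distances(adj, i, radius).items():
            W[i, j] = d ** (-k / 2)
    return W / np.linalg.norm(W, axis=1, keepdims=True)

rng = np.random.default_rng(1)
n, d, radius = 300, 3.0, 3
upper = np.triu(rng.random((n, n)) < d / n, k=1)
A = (upper | upper.T).astype(float)

W = local_weights(A, d, radius)
X = W @ W.T  # covariance of xi = W z for i.i.d. standard Gaussian marks z
value_per_vertex = float(np.sum((A - d / n) * X)) / n
print("value per vertex:", value_per_vertex)
```

At finite $n$ and finite radius the printed value need not match any asymptotic bound; it is only guaranteed to be the value of a feasible point, hence a lower bound on the SDP value of this particular graph.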
2.2 A local algorithm based on harmonic measures
A natural question arising from the previous section is whether a better local algorithm can be constructed by summing the random variables $z_j$ with different weights, to account for the graph geometry. It turns out that indeed this is possible by using a certain harmonic weighting scheme that we next describe, deferring some technical details to Section 6. Throughout we assume $d > 1$.
Recall that the random graph $G \sim G(n, d/n)$ converges locally to a Galton-Watson tree (see Section 6 for background on local weak convergence). This can be shown to imply that it is sufficient to define the function $F$ for trees. Let $(T, o)$ be an infinite rooted tree and consider the simple random walk on $T$ started at $o$, which we assume to be transient. The harmonic measure assigns to vertex $v$, with $\mathrm{dist}(o, v) = k$, a weight $h_v$ which is the probability that the walk exits the ball of radius $k$ around $o$ for the last time at $v$ [LPP95]. (For each distance $k$, the weights $h_v$ form a probability distribution over vertices at distance $k$ from the root. These distributions can be derived from a unique probability measure over the boundary of $T$ at infinity, as is done in [LPP95], but this is not necessary here.) We then define
Technically speaking, this is not a local function because the weights $h_v$ depend on the whole tree $T$. However a local approximation to these weights can be constructed by truncating $T$ at a depth $\ell$: details are provided in Section 6.
Given the well-understood relationship between random walks and electrical networks, it is not surprising that the value achieved by this local algorithm can be expressed in terms of conductance. The conductance $c(T, o)$ of a rooted tree $(T, o)$ is the intensity of current flowing out of the root when a unit potential difference is imposed between the root and the boundary (‘at infinity’), each edge having unit resistance. It is understood that $c(T, o) = 0$ if $T$ is finite.
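Both the conductance and the harmonic weights can be approximated on a tree truncated at finite depth, using series and parallel reductions of unit resistors. The sketch below is one natural truncation (grounding all vertices at a fixed depth), not necessarily the construction of Section 6; the tree is encoded as a hypothetical dictionary mapping each vertex to its list of children.

```python
def conductance(children, v, depth):
    """Conductance from v to the vertices `depth` levels below it (unit edge resistances)."""
    if depth == 0:
        return float("inf")  # v itself sits on the grounded boundary
    total = 0.0
    for u in children.get(v, []):
        c = conductance(children, u, depth - 1)
        # Edge (v, u) in series with the subtree below u; branches add in parallel.
        total += 1.0 if c == float("inf") else c / (1.0 + c)
    return total  # 0.0 for a dead end that never reaches the boundary

def harmonic_weights(children, root, depth):
    """Per-level weights: unit current injected at the root, split according to conductances."""
    levels = [{root: 1.0}]
    for k in range(depth):
        remaining = depth - k - 1
        nxt = {}
        for v, w in levels[-1].items():
            branch = {}
            for u in children.get(v, []):
                c = conductance(children, u, remaining)
                branch[u] = 1.0 if c == float("inf") else c / (1.0 + c)
            total = sum(branch.values())
            for u, g in branch.items():
                if total > 0:
                    nxt[u] = w * g / total
        levels.append(nxt)
    return levels

# Binary tree of depth 2: root 0, children 1 and 2, grandchildren 3..6.
binary = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
print(conductance(binary, 0, 2))          # two branches of conductance 2/3 in parallel
print(harmonic_weights(binary, 0, 2)[2])  # uniform weights 1/4 by symmetry
```

The current through a vertex equals the probability that the random walk started at the root, and stopped at the grounded level, ends in the subtree below that vertex, so each level of weights sums to one whenever the tree reaches the truncation depth.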
For a Galton-Watson tree $(T, o)$ with offspring distribution Poisson($d$), let $c_1, c_2$ be two independent and identically distributed copies of the conductance of $(T, o)$. Let $A$ be the adjacency matrix of the Erdős-Rényi random graph $G \sim G(n, d/n)$. Then for $d > 1$, almost surely,
Furthermore, for each , there exist and such that, almost surely,
Finally, for large $d$, this lower bound behaves as
The lower bound is not explicit but can be efficiently evaluated numerically, by sampling the distributional recursion satisfied by the conductance. This numerical technique was used in [JMRT16], to which we refer for further details. The result of such a numerical evaluation is plotted as the lower solid line in Figure 1. This harmonic lower bound seems to capture extremely well our numerical data for $\mathrm{SDP}(A)/n$ (red circles).
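Sampling a distributional recursion of this type is commonly done by a population-dynamics iteration. The sketch below assumes the standard series-parallel recursion for a Galton-Watson tree with Poisson($d$) offspring, namely that the conductance $c$ has the same law as $\sum_{i=1}^{L} c_i/(1 + c_i)$ with $L \sim$ Poisson($d$) and $c_i$ i.i.d. copies of $c$; the function name and all parameter values are hypothetical, and this is not claimed to be the exact procedure of [JMRT16].

```python
import numpy as np

def sample_conductance(d, pop_size=2000, iters=30, seed=0):
    """Population-dynamics samples of the conductance of a Poisson(d) Galton-Watson tree."""
    rng = np.random.default_rng(seed)
    # Depth-1 truncation: the conductance equals the number of children.
    pop = rng.poisson(d, size=pop_size).astype(float)
    for _ in range(iters):
        offspring = rng.poisson(d, size=pop_size)
        new = np.zeros(pop_size)
        for i, k in enumerate(offspring):
            kids = rng.choice(pop, size=k)          # i.i.d. draws from the current population
            new[i] = np.sum(kids / (1.0 + kids))    # series with the edge, parallel across children
        pop = new
    return pop

pop = sample_conductance(3.0, pop_size=500, iters=15)
print("mean conductance:", pop.mean())
print("fraction with zero conductance:", np.mean(pop == 0.0))
```

Each iteration adds one level to the truncated tree, so a moderate number of iterations already gives samples close to the fixed point of the recursion; functionals of two independent copies can then be estimated by pairing random members of the population.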
2.3 Stochastic block model
As discussed in the previous sections, local algorithms can approximately solve the SDP (3) for the adjacency matrix of $G \sim G(n, d/n)$. The stochastic block model provides a simple example in which they are bound to fail, although they can succeed with a small amount of additional side information.
As stated in the introduction, a random graph $G \sim G(n, a/n, b/n)$ over $n$ vertices is generated as follows. Let $\sigma \in \{+1, -1\}^n$ be distributed uniformly at random, conditional on $\sum_i \sigma_i = 0$. Conditional on $\sigma$, any two vertices $i$, $j$ are connected by an edge independently with probability $a/n$ if $\sigma_i = \sigma_j$ and with probability $b/n$ otherwise. We will assume $a > b$: in the social sciences parlance, the graph is assortative.
The average vertex degree of such a graph is $d = (a + b)/2$. We assume $d > 1$ to ensure that $G$ has a giant component with high probability. The signal-to-noise ratio parameter $\lambda = (a - b)/\sqrt{2(a + b)}$ plays a special role in the model’s behavior. If $\lambda < 1$, then the total variation distance between $G(n, a/n, b/n)$ and the Erdős-Rényi graph $G(n, d/n)$ is bounded away from $1$. On the other hand, if $\lambda > 1$, then we can test whether $G \sim G(n, a/n, b/n)$ or $G \sim G(n, d/n)$ with probability of error converging to $0$ as $n \to \infty$ [MNS12].
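For experiments, the model can be sampled directly from its definition. The sketch below assumes the usual parametrization with within-group edge probability $a/n$ and across-group probability $b/n$, and an even number of vertices; the function names are hypothetical.

```python
import numpy as np

def sample_sbm(n, a, b, rng):
    """Sample (A, sigma): adjacency matrix and balanced +1/-1 labels of a two-group block model."""
    sigma = np.ones(n, dtype=int)
    sigma[rng.permutation(n)[: n // 2]] = -1           # uniformly random balanced partition
    same_side = np.equal.outer(sigma, sigma)
    prob = np.where(same_side, a / n, b / n)
    upper = np.triu(rng.random((n, n)) < prob, k=1)    # independent edges for i < j
    A = (upper | upper.T).astype(float)
    return A, sigma

def signal_to_noise(a, b):
    """The parameter lambda = (a - b) / sqrt(2 (a + b))."""
    return (a - b) / np.sqrt(2.0 * (a + b))

rng = np.random.default_rng(3)
A, sigma = sample_sbm(200, a=7.0, b=1.0, rng=rng)
print("average degree:", A.sum() / 200, "expected about", (7.0 + 1.0) / 2)
print("lambda:", signal_to_noise(7.0, 1.0))
```

With $a = 7$ and $b = 1$ the average degree is $4$ and $\lambda = 1.5$, so this instance lies above the detection threshold $\lambda = 1$.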
The next theorem lower-bounds the SDP value for the stochastic block model.
Let be the adjacency matrix of the random graph . If and , then for a universal constant (independent of and ), almost surely,
(The first bound in (12) dominates for large , whereas the second dominates near the information-theoretic threshold for large .)
On one hand, this theorem implies that local algorithms fail to approximately solve the SDP (3) for the stochastic block model, for the following reason: The local structures of and are the same asymptotically, in the sense that they both converge locally to the Galton-Watson tree with offspring distribution. This and the upper bound of Theorem 2.2 immediately imply that for any ,
In particular, the gap between this upper bound and the lower bound (12) for the SDP value is unbounded for large .
This problem is related to the symmetry between $+1$ and $-1$ labels in this model. It can be resolved if we allow the local algorithm to depend on a neighborhood of radius $\ell$, where $\ell$ grows logarithmically in $n$, or alternatively if we provide a small amount of side information about the hidden partition. Here we explore the latter scenario (see also [MX16] for related work).
Suppose that for each vertex , the label is revealed independently with probability for some fixed , and that the radius- local algorithm has access to the revealed labels in . More formally, let be the set of possible vertex labels, where codes for ‘unrevealed’, let be any assignment of labels to vertices, and let be the space of tuples (where as before).
A radius- local algorithm using partially revealed labels for the semidefinite program (3) is any measurable function such that
if , where denotes isomorphism that preserves the root vertex, vertex marks, and vertex labels in .
Letting be i.i.d. with , we have , where denotes expectation only over .
We denote the set of such functions by , and we denote . For any , we denote
so that yields a solution to the SDP (3) achieving value . Then we have the following result:
Let be the adjacency matrix of the random graph . For any fixed , let be random and such that, independently for each , with probability we have , and with probability we have that identifies the component of the hidden partition containing . If and , then for any , there exist and for which, almost surely,
2.4 Testing in the stochastic block model
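Semidefinite programming can be used as follows to test whether $G \sim G(n, a/n, b/n)$ or $G \sim G(n, d/n)$:
1. Given the graph $G$, compute the value $\mathrm{SDP}(A)$ of the program (3).
2. If $\mathrm{SDP}(A) \ge n\,[2\sqrt{d}\,(1 - (2d)^{-1}) + \delta]$, reject the null hypothesis $G \sim G(n, d/n)$.
(Here $\delta$ is a small constant independent of $n$.) The rationale for this procedure is provided by Theorem 2.2, implying that, if $\delta > 0$, then the probability of false discovery (i.e. rejecting the null when $G \sim G(n, d/n)$) converges to $0$ as $n \to \infty$.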
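The decision rule can be written down directly once an SDP value is available from any solver. The sketch below assumes that the rejection threshold is $n[2\sqrt{d}(1 - 1/(2d)) + \delta]$, i.e. the Erdős-Rényi upper bound of Theorem 2.2 per vertex plus a margin $\delta$; this form, the function names and the default margin are assumptions made for illustration.

```python
import math

def null_threshold(n, d, delta):
    """Assumed rejection threshold n * [2 sqrt(d) (1 - 1/(2d)) + delta]."""
    return n * (2.0 * math.sqrt(d) * (1.0 - 1.0 / (2.0 * d)) + delta)

def reject_null(sdp_value, n, d, delta=0.05):
    """Reject the Erdős-Rényi null hypothesis when the SDP value exceeds the threshold."""
    return sdp_value >= null_threshold(n, d, delta)

# With n = 100 and d = 4 the threshold without margin is 100 * 3.5 = 350.
print(null_threshold(100, 4.0, 0.0))
print(reject_null(400.0, 100, 4.0), reject_null(300.0, 100, 4.0))
```

The margin $\delta$ trades off the two error types: a larger value makes false discoveries less likely at finite $n$ but requires a larger signal-to-noise ratio for detection.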
We have the following immediate consequence of Theorem 2.2 and Theorem 2.5 (here error probability refers to the probability of false discovery plus the probability of missed detection, i.e. not rejecting the null when $G \sim G(n, a/n, b/n)$):
The SDP-based test has error probability converging to 0 provided , where
3 Further related work
The SDP relaxation (3) has attracted a significant amount of work since Goemans and Williamson’s seminal work on the MAXCUT problem [GW95]. In the last few years, several authors used this approach for clustering or community detection and derived optimality or near-optimality guarantees. An incomplete list includes [BCSZ14, ABH16, HWX16, HWX15, ABC15]. Under the assumption that $G$ is generated according to the stochastic block model (whose two-groups version was introduced in Section 2.3), these papers provide conditions under which the SDP approach recovers exactly the vertex labels. This can be regarded as a ‘high signal-to-noise ratio’ regime, in which (with high probability) the SDP solution has rank one and is deterministic (i.e. independent of the graph realization). In contrast, we focus on the ‘pure noise’ scenario in which $G$ is an Erdős-Rényi random graph, or on the two-groups stochastic block model close to the detection threshold. In this regime, the SDP optimum has rank larger than one and is non-deterministic. The only papers that have addressed this regime using SDP are [GV15, MS16], discussed previously.
Several papers applied sophisticated spectral methods to the stochastic block model near the detection threshold [Mas14, MNS13, BLM15]. Our upper bound in Theorem 2.2 is based on a duality argument, where we establish feasibility of a certain dual witness construction using an argument similar to [BLM15].
Several papers studied the use of local algorithms to solve combinatorial optimization problems on graphs. Hatami, Lovász and Szegedy [HLS14] investigated several notions of graph convergence, and put forward a conjecture implying, in particular, that local algorithms are able to find (nearly) optimal solutions of a broad class of combinatorial problems on random $d$-regular graphs. This conjecture was disproved by Gamarnik and Sudan [GS14] by considering maximum size independent sets on random $d$-regular graphs. In particular, they proved that the size of an independent set produced by a local algorithm is at most $1/2 + 1/(2\sqrt{2})$ times the maximum independent set, for large enough $d$. Rahman and Virág [RV14] improved this result by establishing that no local algorithm can produce independent sets of size larger than $1/2$ times the maximum independent set, for large enough $d$. This approximation ratio is essentially optimal, since known local algorithms can achieve $1/2$ of the maximum independent set. It is unknown whether a similar gap is present for small degree $d$. In particular, Csóka et al. [CGHV15] establish a lower bound on the max-size independent set on random $3$-regular graphs. A similar technique is used by Lyons [Lyo14] to lower bound the max-cut on random $3$-regular graphs. In summary, the question of which graph-structured optimization problems can be approximated by local algorithms is broadly open.
By construction, local algorithms can be applied to infinite random graphs, and have a well defined value provided the graph distribution is unimodular (see below). Asymptotic results for graph sequences can be ‘read-off’ these infinite-graph settings (our proofs will use this device multiple times). In this context, the (random) solutions generated by local algorithms, together with their limits in the weak topology, are referred to as ‘factors of i.i.d. processes’ [Lyo14].
4 Notation
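We use upper case boldface for matrices (e.g. $\mathbf{A}$, $\mathbf{B}$, …), lower case boldface for vectors (e.g. $\mathbf{u}$, $\mathbf{v}$, etc.) and lower case plain for scalars (e.g. $x$). The scalar product of vectors $\mathbf{u}$, $\mathbf{v}$ is denoted by $\langle \mathbf{u}, \mathbf{v} \rangle = \sum_i u_i v_i$, and the scalar product between matrices is indicated in the same way, $\langle \mathbf{A}, \mathbf{B} \rangle = \mathrm{Tr}(\mathbf{A}^{\mathsf T}\mathbf{B})$.
Given a matrix $\mathbf{A}$, $\mathrm{diag}(\mathbf{A})$ is the vector that contains its diagonal entries. Conversely, given a vector $\mathbf{v}$, $\mathrm{diag}(\mathbf{v})$ is the diagonal matrix with entries $v_i$.
We denote by $\mathbf{1}$ the all-ones vector and by $\mathbf{I}$ the identity matrix.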
We follow the standard big-Oh notation.
5 Upper bound: Theorem 2.2
In this section, we prove the upper bound in Theorem 2.2. Denote . Introducing dual variables and and invoking strong duality, we have
The minimum over occurs at , hence is equivalently given by the value of the dual minimization problem over :
We prove the upper bound in Theorem 2.2 by constructing a dual-feasible solution , parametrized by a small positive constant . Denote the diagonal degree matrix of as and set
The following is the main result of this section, which ensures that the first case in the definition of in (20) holds with high probability.
For fixed , let be the adjacency matrix of the Erdős-Rényi random graph , and let be the diagonal degree matrix. Then for any and for , with probability approaching 1 as ,
Let us first show that this implies the desired upper bound:
Proof of Theorem 2.2 (upper bound).
As , this implies and . Then
By Theorem 5.1, as . Taking and then ,
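To obtain the bound almost surely rather than in expectation, note that if $G$ and $G'$ are two fixed graphs that differ in one edge, with adjacency matrices $A$ and $A'$, then
\[ |\langle A - A', X \rangle| \le 2 \]
for any feasible point $X$ of (3) (since $|X_{ij}| \le 1$ for a positive semidefinite matrix with unit diagonal), so $|\mathrm{SDP}(A) - \mathrm{SDP}(A')| \le 2$. Let $e_1, \ldots, e_N$, with $N = \binom{n}{2}$, be any ordering of the edges $\{(i, j): 1 \le i < j \le n\}$, and denote by $(\mathcal{F}_k)_{k=0}^{N}$ the filtration where $\mathcal{F}_k$ is generated by the indicators of whether $e_1, \ldots, e_k$ are present in $G$. Then by coupling, this implies for each $k$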
Hence for each , and
Bernstein’s inequality yields for a constant . Then, applying the martingale tail bound of [dlP99, Theorem 1.2A], for any ,
for a constant . Then the Borel-Cantelli lemma implies almost surely for all large , and the result follows by taking . ∎
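The duality step behind the upper bound can be checked numerically in its weakest form: if $\mathrm{diag}(\nu) - M \succeq 0$ for $M = A - (d/n)\mathbf{1}\mathbf{1}^{\mathsf T}$, then $\sum_i \nu_i$ upper bounds $\langle M, X \rangle$ for every feasible $X$. The sketch below uses the simplest possible witness, $\nu_i = \lambda_{\max}(M)$ for all $i$, which is an assumption made for illustration and is weaker than the degree-dependent witness constructed in this section.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 40, 3.0
upper = np.triu(rng.random((n, n)) < d / n, k=1)
A = (upper | upper.T).astype(float)
M = A - (d / n) * np.ones((n, n))

# Simplest dual-feasible point: a constant vector equal to the top eigenvalue of M.
nu = np.full(n, np.linalg.eigvalsh(M).max())
slack = np.diag(nu) - M            # positive semidefinite by construction
dual_value = float(nu.sum())

# A feasible primal point: Gram matrix of random unit vectors.
V = rng.standard_normal((n, 5))
V /= np.linalg.norm(V, axis=1, keepdims=True)
X = V @ V.T
primal_value = float(np.sum(M * X))

print("min eigenvalue of slack:", np.linalg.eigvalsh(slack).min())
print("primal value:", primal_value, "<= dual value:", dual_value)
```

The inequality follows from $\langle M, X \rangle = \sum_i \nu_i X_{ii} - \langle \mathrm{diag}(\nu) - M, X \rangle$ and the fact that the inner product of two positive semidefinite matrices is nonnegative; a sharper witness only changes the choice of $\nu$.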
In the remainder of this section, we prove Theorem 5.1. Heuristically, we might expect that Theorem 5.1 is true by the following reasoning: The matrix $I - uA + u^2(D - I)$ is the deformed Laplacian, or Bethe Hessian, of the graph. By a relation in graph theory known as the Ihara-Bass formula [Bas92, KS00], the values of $u$ for which this matrix is singular are the inverses of the non-trivial eigenvalues of a certain “non-backtracking matrix” [KMM13, SKZ14]. Theorem 3 of [BLM15] shows that this non-backtracking matrix has, with high probability, the bulk of its spectrum supported on the complex disk of radius approximately $\sqrt{d}$, with a single outlier eigenvalue at $d$. From this, the observation that the matrix equals $I \succ 0$ at $u = 0$, and a continuity argument in $u$, one deduces that it has, with high probability for large $n$, only a single negative eigenvalue when $u$ lies between $1/d$ and $1/\sqrt{d}$. If the eigenvector corresponding to this eigenvalue has positive alignment with $\mathbf{1}$, then adding a certain multiple of the rank-one matrix $\mathbf{1}\mathbf{1}^{\mathsf T}$ should eliminate this negative eigenvalue.
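The classical (unweighted) Ihara-Bass identity invoked in this heuristic can be verified numerically on a small graph. The sketch below assumes the standard convention in which the non-backtracking matrix $B$ is indexed by directed edges, with $B_{(i \to j),(j \to l)} = 1$ when $l \neq i$, and checks $\det(I - uB) = (1 - u^2)^{|E| - n}\det(I - uA + u^2(D - I))$; it illustrates the classical formula only, not the generalized weighted version proved below.

```python
import numpy as np

def non_backtracking_matrix(A):
    """Non-backtracking matrix indexed by the directed edges (i, j) with A[i, j] != 0."""
    n = A.shape[0]
    edges = [(i, j) for i in range(n) for j in range(n) if A[i, j]]
    index = {e: k for k, e in enumerate(edges)}
    B = np.zeros((len(edges), len(edges)))
    for (i, j) in edges:
        for l in range(n):
            if A[j, l] and l != i:       # continue along j -> l without returning to i
                B[index[(i, j)], index[(j, l)]] = 1.0
    return B

# Complete graph on 4 vertices: n = 4, |E| = 6.
n = 4
A = np.ones((n, n)) - np.eye(n)
m = int(A.sum()) // 2
D = np.diag(A.sum(axis=1))
B = non_backtracking_matrix(A)

u = 0.3
lhs = np.linalg.det(np.eye(2 * m) - u * B)
rhs = (1 - u**2) ** (m - n) * np.linalg.det(np.eye(n) - u * A + u**2 * (D - np.eye(n)))
print(lhs, rhs)
```

The two determinants agree up to floating-point error, so zeros of the right-hand side other than $u = \pm 1$ correspond to inverses of eigenvalues of $B$, which is the link between the Bethe Hessian and the non-backtracking spectrum used above.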
Direct analysis of the rank-one perturbation of is hindered by the fact that the spectrum and eigenvectors of are difficult to characterize. Instead, we will study a certain perturbation of the non-backtracking matrix. We prove Theorem 5.1 via the following two steps: First, we prove a generalization of the Ihara-Bass relation to edge-weighted graphs. For any graph , let be a set of possibly negative edge weights. For each such that , define the symmetric matrix and diagonal matrix by
Let denote the set of directed edges , and define the weighted non-backtracking matrix , with rows and columns indexed by , as
The following result relates with a generalized deformed Laplacian defined by and :
Lemma 5.2 (Generalized Ihara-Bass formula).
For any graph , edge weights , with , and the matrices , , and as defined above,
Second, we consider a weighted non-backtracking matrix of the above form for the complete graph with $n$ vertices, with rows and columns indexed by all ordered pairs $(i, j)$ of distinct indices $i, j \in \{1, \ldots, n\}$, and defined as
We prove in Section 5.2 that this matrix no longer has an outlier eigenvalue at $d$, but instead has all of its eigenvalues contained within the complex disk of radius approximately $\sqrt{d}$:
Fix , let be the adjacency matrix of the Erdős-Rényi random graph , and define by (22). Let denote the spectral radius of . Then for any , with probability approaching 1 as ,
Using these results, we prove Theorem 5.1:
Proof of Theorem 5.1.
Denote by the diagonal degree matrix of . Let be the event on which and . Each diagonal entry of is distributed as , hence for a constant by Bernstein’s inequality and a union bound. This and Lemma 5.3 imply holds with probability approaching 1.
On , for all . Applying Lemma 5.2 for the complete graph with edge weights , and noting for any and any when is sufficiently large, we have for