Correlation Clustering in Data Streams
Clustering is a fundamental tool for analyzing large data sets. A rich body of work has been devoted to designing data-stream algorithms for the relevant optimization problems such as -center, -median, and -means. Such algorithms need to be both time and and space efficient. In this paper, we address the problem of correlation clustering in the dynamic data stream model. The stream consists of updates to the edge weights of a graph on nodes and the goal is to find a node-partition such that the end-points of negative-weight edges are typically in different clusters whereas the end-points of positive-weight edges are typically in the same cluster. We present polynomial-time, -space approximation algorithms for natural problems that arise.
We first develop data structures based on linear sketches that allow the “quality” of a given node-partition to be measured. We then combine these data structures with convex programming and sampling techniques to solve the relevant approximation problem. Unfortunately, the standard LP and SDP formulations are not obviously solvable in -space. Our work presents space-efficient algorithms for the convex programming required, as well as approaches to reduce the adaptivity of the sampling.
The correlation clustering problem was first formulated as an optimization problem by Bansal, Blum, and Chawla (2004). The input is a complete weighted graph on nodes, where each pair of nodes has weight . A positive-weight edge indicates that and should be in the same cluster, whereas a negative-weight edge indicates that and should be in different clusters. Given a node-partition , we say edge agrees with , denoted by , if the relevant soft constraint is observed. The goal is to find the partition that maximizes
or, equivalently, that minimizes . Solving this problem exactly is known to be NP-hard. A large body of work has been devoted to approximating and , along with variants and , where we consider partitions with at most clusters. In this paper, we focus on multiplicative approximation results. If all weights are , there is a polynomial time approximation scheme (PTAS) for (Bansal et al., 2004; Giotis and Guruswami, 2006) and a -approximation (Chawla et al., 2015), for . When there is an upper bound, , on the number of clusters in , and all weights are , Giotis and Guruswami (2006) introduced a PTAS for both problems. Even is interesting, with an efficient local-search approximation introduced by Coleman, Saunderson, and Wirth (2008).
If the weights are arbitrary, there is a -approximation for (Swamy, 2004; Charikar, Guruswami, and Wirth, 2005) and an -approximation for (Charikar et al., 2005; Demaine et al., 2006). These methods use convex programming: as originally described, this cannot be implemented in space, even when the input graph is sparse. This aspect is well known in practice, and Elsner and Schudy (2009); Bagon and Galun (2011); Bonchi, Garcia-Soriano, and Liberty (2014) discuss the difficulty of scaling the convex programming approach.
Clustering and Graph Analysis in Data Streams.
Given the importance of clustering as a basic tool for analyzing massive data sets, it is unsurprising that a considerable effort has gone into designing clustering algorithms in the relevant computational models. In particular, in the data-stream model we are permitted a limited number of passes (ideally just one) over the data while using only limited memory. This model abstracts the challenges in traditional applications of stream processing such as network monitoring, and also leads to I/O-efficient external-memory algorithms. Naturally, in either context, an algorithm should also be fast, both in terms of the time to process each stream element and in returning the final answer.
Classical clustering problems including -median (Guha et al., 2000; Charikar et al., 2003), -means (Ailon, Jaiswal, and Monteleoni, 2009), and -center (Guha, 2009; McCutchen and Khuller, 2008; Charikar et al., 2004) have all been studied in the data stream model, as surveyed by Silva et al. (2013). Non-adaptive sampling algorithms for correlation clustering can be implemented in the data stream model, as applied by Ailon and Karnin (2012), to construct additive approximations. *ChierichettiDK14 presented the first multiplicative approximation data stream algorithm: a polynomial-time -approximation for on -weighted graphs using passes and semi-streaming space — that is, a streaming algorithm using memory (Feigenbaum et al., 2005). Pan et al. (2015) and Bonchi et al. (2014) discuss faster non-streaming implementations of related ideas but *ChierichettiDK14 remained the state of the art data stream algorithm until our work. Using space roughly proportional to the number of nodes can be shown to be necessary for solving many natural graph problems including, it will turn out, correlation clustering. For a recent survey of the semi-streaming algorithms and graph sketching see McGregor (2014).
In the basic graph stream model, the input is a sequence of edges and their weights. The available space to process the stream and perform any necessary post-processing is bits. Our results also extend to the dynamic graph stream model where the stream consists of both insertions and deletions of edges; the weight of an edge is specified when the edge is inserted and deleted (if it is subsequently deleted). For simplicity, we assume that all weights are integral. We will consider three types of weighted graphs: (a) unit weights, where all ; (b) bounded weights, where all weights are in the range for some constant ; and (c) arbitrary weights, where all weights are in the range where . We denote the sets of positive-weight and negative-weight edges by and , respectively, and define and .
We note that many of our algorithms, such as those based on sparsification (Ahn and Guha, 2018), can also be implemented in MapReduce.
1.1 Our Results
We summarize our results in Table 1.
For , we provide the following single-pass streaming algorithms, each needing space: (i) a polynomial-time -approximation for bounded weights (Theorem 7), and (ii) a approximation for arbitrary weights in time (Theorem 23).
We show that any constant pass algorithm that can test whether in a single pass, for unit weights, must store bits (Theorems 31). For arbitrary weights, the lower bound increases to (Theorem 30) and to in the case the graph of negative edges may be dense. We provide a single-pass algorithm that uses space and time and provides an approximation (Theorem 19). Since Demaine et al. (2006) and Charikar et al. (2005) provide approximation-preserving reductions from the “minimum multicut” problem to with arbitrary weights, it is expected to be difficult to approximate the latter to better than a factor in polynomial time. For unit weights when , we provide a single-pass polynomial time algorithm that uses space (Theorem 4). We provide a -space PTAS for for bounded weights (Theorem 10).
We also consider multiple-pass streaming algorithms. For unit weights, we present a -pass algorithm that mimics the algorithm of Ailon et al. (2008), and provides a -approximation in expectation (Theorem 28), improving on the result of Chierichetti et al. (2014). For , on unit-weight graphs with , we give a -pass polynomial-time algorithm using space (Theorem 29). This result is based on emulating an algorithm by Giotis and Guruswami (2006) in the data stream model.
1.2 Techniques and Roadmap
In Section 2, we present three basic data structures for the and query problems where a partition is specified at the end of the stream, and the goal is to return an approximation of or . They are based on linear sketches and incorporate ideas from work on constructing graph sparsifiers via linear sketches. These data structures can be constructed in the semi-streaming model and can be queried in time. As algorithms rely on relatively simple matrix-vector operations, they can be implemented fairly easily in MapReduce.
In Section 3 and 4, we introduce several new ideas for solving the LP and SDP for and . In each case, the convex formulation must allow each candidate solution to be represented, verified, and updated in small space. But the key point made here is that the formulation plays an outsized role in terms of space efficiency, both from the perspective of the state required to compute and the operational perspective of efficiently updating that state. In future, we expect the space efficiency of solving convex optimization to be increasingly important.
We discuss multipass for algorithms for in Section 5. Our results are based on adapting existing algorithms that, if implemented in the data stream model, may appear to take passes. However, with a more careful analysis we show that passes are sufficient. Finally, we present space lower bounds in Section 6. These are proved using reductions from communication complexity and establish that many of our algorithms are space-optimal.
2 Basic Data Structures and Applications
We introduce three basic data structures that can be constructed with a single-pass over the input stream that defines the weighted graph . Given a query partition , these data structures return estimates of or . Solving the correlation clustering optimization problem with these structures directly would require exponential time or space. Instead, we will exploit them carefully to design more efficient solutions. However, in this section, we will present a short application of each data structure that illustrates their utility.
2.1 First Data Structure: Bilinear Sketch
Consider a graph with unit weights () and a clustering . Our first data structure allows us to solve the query problem, which is, given and , to report (an approximation of) . Define the matrices and where and
Note that if , then
whereas, if then
Hence, the (squared) matrix distance, induced by the Frobenius norm, gives exactly
To efficiently estimate when is not known a priori, we can repurpose the bilinear sketch approach of Indyk and McGregor (2008). The basic sketch is as follows:
Let and be independent random vectors whose entries are 4-wise independent; in a single pass over the input, compute
Specifically, we maintain a counter that is initialized to 0 and for each in the stream we add to the counter and if is deleted we subtract from the counter; the final value of the counter equals . Note that and can be determined by a hash function that can be stored in space such that each entry can be constructed in time.
Given query partition , return .
For each , and .
The following theorem will be proved by considering an algorithm that computes multiple independent copies of the above sketch and combines the estimates from each.
For unit weights, there exists an -space algorithm for the query problem. Each positive edge is processed in time, while the query time is .
We first observe that, given , the time to compute is . This follows because for a cluster , on nodes, we can compute and in time. Hence the total query timeis as claimed.
We next argue that repeating the above scheme a small number of times in parallel yields a good estimate of . To do this, note that
We then apply Lemma 1 to and deduce that
Hence, running parallel repetitions of the scheme and averaging the results appropriately yields a -approximation for with probability at least . Specifically, we partition the estimates into groups, each of size . We can ensure that with probability at least , the mean of each group is within a factor by an application of the Chebyshev bound; we then argue using the Chernoff bound that the median of the resulting group estimates is a approximation with probability at least . ∎
We note that by setting in the above theorem, it follows that we may estimate for all partitions using space. Hence, given exponential time, we can also -approximate While this is near-optimal in terms of space, in this paper we focus on polynomial-time algorithms.
Application to Cluster Repair.
Consider the Cluster Repair problem (Gramm et al., 2005), in which, for some constant , we are promised and want to find the clustering .
We first argue that, given spanning forest of we can limit our attention to checking a polynomial number of possible clusterings. The spanning forest can be constructed using a -space algorithm in the dynamic graph stream model (Ahn, Guha, and McGregor, 2012a). Let be the clustering corresponding to the connected components of . Let be the forests that can be generated by adding and then removing edges from where . Let be the node-partition corresponding to the connected components of .
The optimal partition of is for some . Furthermore, .
Let be the set of edges in the optimal clustering that are between nodes in the same cluster and suppose that let , i.e., is the set of positive edges that need to be added and is the set of edges that need to be deleted to transform into a collection of node-disjoint clusters. Since , we know . It is possible to transform into a spanning forest of by adding at most edges. It is then possible to generate a spanning forest of with the same connected components as by deleting at most edges from . Hence, one of the forests considered has the same connected components at .
To bound , we proceed as follows. There are less than different forests that can result from adding at most edges to . For each, there are at most forests that can be generated by deleting at most edges from the, at most , edges in . Hence, . ∎
The procedure is then to take advantage of this bounded number of partitions by computing each in turn, and estimating . We report the that minimizes the (estimated) repair cost. Consequently, setting in Theorem 2 yields the following theorem.
For a unit-weight graph with where , there exists a polynomial-time data-stream algorithm using space that with high probability approximates .
2.2 Second Data Structure: Sparsification
The next data structure is based on graph sparsification and works for arbitrarily weighted graphs. A sparsification of graph is a weighted graph such that the weight of every cut in is within a factor of the weight of the corresponding cut in . A celebrated result of Benczúr and Karger (1996) shows that it is always possible to ensure the the number of edges in is . A subsequent result shows that this can be constructed in the dynamic graph stream model.
There is a single-pass algorithm that returns a sparsification using space and time .
The next lemma establishes that a graph sparsifier can be used to approximate and of a clustering.
Let and be sparsifications of and such that all cuts are preserved within factor , and let . For every clustering ,
The proofs for and are symmetric, so we restrict our attention to . Let . The weight of edges in that are cut is within a factor in the sparsifier. Consider an arbitrary cluster , then letting represent the weight in the sparsifier,
where the third line follows because, for each , the weights of cuts and are approximately preserved. Summing over all cluster s, the total additive error is
(assuming ), as required.
The last part of theorem follows because by considering the trivial all-in-one-cluster partition. ∎
Application to with Bounded Weights.
In Section 3, based on the sparsification construction, we develop a -time streaming algorithm that returns a -approximation for when has arbitrary weights. However, in the case of unit weights, a RAM-model PTAS for is known (Giotis and Guruswami, 2006; Bansal et al., 2004). It would be unfortunate if, by approximating the unit-weight graph by a weighted sparsification, we lost the ability to return a approximation in polynomial time.
Let be an arbitrary node-partition, where and .
For each , let be a random sample of nodes in .
For all possible -partition of each of :
For each , let be the partition of
Compute and record the cost of the clustering in which is assigned to the th cluster, where
For all the clusterings generated, return the clustering that maximizes .
Giotis and Guruswami (2006) prove that the above algorithm achieves a approximation factor with high probability if all weights are . We explain in Section A that their analysis actually extends to the case of bounded weights. The more important observation is that we can simulate this algorithm in conjunction with a graph sparsifier. Specifically, the sets and can be determined before the stream is observed. To emulate step 3, we just need to collect the edges incident to each during the stream. If we simultaneously construct a sparsifier during the stream we can evaluate all of the possible clusterings that arise. This leads to the following theorem.
For bounded-weight inputs, there exists a polynomial-time semi-streaming algorithm that, with high probability, -approximates .
2.3 Third Data Structure: Node-Based Sketch
In this section, we develop a data structure that supports queries to for arbitrarily weighted graphs when is restricted to be a 2-partition. For each node , define the vector, , indexed over the edges, where the only non-zero entries are:
For a two-partition , .
The result follows immediately from consideration of the different possible for values for the th coordinate of the vector . The sum can be expanded as
Hence if and only if the edge is a disagreement. ∎
We apply the -sketching result of Kane, Nelson, and Woodruff (2010) to compute a random linear sketch of each .
For arbitrary weights, and for query partitions that contain two clusters, to solve the query problem, there exists an -space algorithm. The query time is .
Unfortunately, for queries where , space is necessary, as shown in Section 6.
Application to with Bounded Weights.
We apply the above node-based sketch in conjunction with another algorithm by Giotis and Guruswami (2006), this time for . Their algorithm is as follows:
Sample nodes and for every possible -partition of :
Consider the clustering where is assigned to the th cluster where
For all the clusterings generated, return the clustering that minimizes .
As with the max-agreement case, Giotis and Guruswami (2006) prove that the above algorithm achieves a approximation factor with high probability if all weights are . We explain in Section A that their analysis actually extends to the case of bounded weights. Again note we can easily emulate this algorithm for in the data stream model in conjunction with the third data structure. The sampling of and its incident edges can be performed using one pass and space. We then find the best of these possible partitions in post-processing using the above node-based sketches.
For bounded-weight inputs, there exists a polynomial-time semi-streaming algorithm that with high probability -approximates .
3 Convex Programming in Small Space:
In this section, we present a linear programming-based algorithm for . At a high level, progress arises from new ideas and modifications needed to implement convex programs in small space. While the time required to solve convex programs has always been an issue, a relatively recent consideration is the restriction to small space (Ahn and Guha, 2013). In this presentation, we pursue the Multiplicative Weight Update technique and its derivatives. This method has a rich history across many different communities (Arora et al., 2012), and has been extended to semi-definite programs (Arora and Kale, 2007). In this section, we focus on linear programs in the context of ; we postpone the discussion of SDPs to Section 4.
In all multiplicative weight approaches, the optimization problem is first reduced to a decision variant, involving a guess, , of the objective value; we show later how to instantiate this guess. The LP system is
where , , and . To solve the MWM-LP approximately, the multiplicative-weight update algorithm proceeds iteratively. In each iteration, given the current solution, , the procedure maintains a set of multipliers (one for each constraint) and computes a new candidate solution which (approximately) satisfies the linear combination of the inequalities, as defined in Theorem 11.
Theorem 11 (Arora et al. (2012)).
Suppose that, and in each iteration , given a vector of non-negative multipliers , a procedure (termed Oracle) provides a candidate satisfying three admissibility conditions,
, for all .
We set to . Assuming we start with , after iterations the average vector, , satisfies , for all .
The computation of the new candidate depends on the specific LP being solved. The parameter is called the width, and controls the speed of convergence. A small-width Oracle is typically a key component of an efficient solution, for example, to minimize running times, number of rounds, and so forth. However, the width parameter is inherently tied to the specific formulation chosen. Consider the standard LP relaxation for , where variable indicates edge being cut.
The triangle constraints state that if we cut one side of a triangle, we must also cut at least one of the other two sides. The size of formulation is in , where is the size of the vertex set, irrespective of the number of nonzero entries in . Although we will rely on the sparsification of , that does not in any way change the size of the above linear program. To achieve space, we need new formulations, and new algorithms to solve them.
The first hurdle is the storage requirement. We cannot store all the edges/variables which can be . This is avoided by using a sparsifier and invoking (the last part of) Lemma 6. Let be the sparsification of with . For edge let denote its weight after sparsification. For each pair , let denote the set of all paths involving edges only in the set . Consider the following LP for , similar to that of Wirth (2004), but in this sparsified setting:
The intuition of an integral (/) solution is that for all edges that are not cut, and for all that are cut. That is, the relevant variable is whenever the edge disagrees with the input advice. By Lemma 6, the objective value of LP0 is at most times the optimum value of . However, LP0 now has exponential size, and it is unclear how we can maintain the multipliers and update them in small space. To overcome this major hurdle, we follow the approach below.
3.1 A Dual Primal Approach
Consider a primal minimization problem, for example, , in the canonical form:
The dual of the above problem for a guess, of the optimum solution (to the Primal) becomes
which is the same as the decision version of MWM-LP as described earlier. We apply Theorem 11 to the Dual LP, however we still want a solution to the Primal LP. Note that despite approximately solving the Dual LP, we do not have a Primal solution. Even if we had some optimal solution to the Dual LP, we might still require a lot of space or time to find a Primal solution, though we could at least rely on complementary slackness conditions. Unfortunately, similar general conditions do not exist for approximately optimum (or feasible) solutions. To circumvent this issue:
We apply the multiplicative-weight framework to the Dual LP and try to find an approximately feasible solution such that and .
The Oracle is modified to provide a , subject to conditions (i)– (iii) of Theorem 11, or an that, for some , satisfies
Intuitively, the Oracle is asked to either make progress towards finding a feasible dual solution or provide an -approximate primal solution in a single step.
If the Oracle returns an then we know that is not satisfiable. We can then consider smaller values of , say . We eventually find a sufficiently small that the Dual LP is (approximately feasible) and we have a satisfying
Note that computations for larger continue to remain valid for smaller .
This idea, of applying the multiplicative-weight update method to a formulation with exponentially many variables (the Dual), and modifying the Oracle to provide a solution to the Primal (that has exponentially many constraints) in a single step, has also benefited solving Maximum Matching in small space (Ahn and Guha, 2018). However in Ahn and Guha (2018), the constraint matrix was unchanging across iterations (objective function value did vary) – here we will have the constraint matrix vary across iterations (along with value of the objective function). Clearly, such a result will not apply for arbitary constraint matrices and the correct choice of a formulation is key.
One key insight is that the dual, in this case (and as a parallel with matching) has exponentially many variables, but fewer constraints. Such a constraint matrix is easier to satisfy approximately in a few iterations because there are many more degrees of freedom. This reduces the adaptive nature of the solution, and therefore we can make a lot of progress in satisfying many of the primal constraints in parallel. Other examples of this same phenomenon are the numerous dynamic connectivity/sparsification results in Guha et al. (2015), where the algorithm repeatedly finds edges in cuts (dual of connectivity) to demonstrate connectivity. In that example, the seemingly adaptive iterations collapse into a single iteration.
Parts of the three steps, that is, (a)–(c) outlined above, have been used to speed up running times of SDP-based approximation algorithms (Arora and Kale, 2007). In such cases, there was no increase to the number of constraints nor consideration of non-standard formulations, It is often thought, and as explicitly discussed by Arora and Kale (2007), that primal-dual approximation algorithms use a different set of techniques from the primal-dual approach of multiplicative-weight update methods. By switching the dual and the primal, in this paper, we align both sets of techniques and use them interchangeably.
The remainder of Section 3 is organized as follows. We first provide a generic Oracle construction algorithm for MWM-LP, in Section 3.2. As a warm up example, we then apply this algorithm on the multicut problem in Section 3.3 – the multicut problem is inherently related to for arbitrary weights (Charikar et al., 2005; Demaine et al., 2006). We then show how to combine all the ideas together to solve in Section 3.4.
3.2 From Rounding Algorithms to Oracles
The following lemma shows how to satisfy the first two conditions of Theorem 11; the width parameter has to be bounded separately for a particular problem.
By construction, , addressing condition (i). So we prove that . Since is a scaled version of ,
The inequality in the second line follows from only being positive if the corresponding Primal LP constraint is violated. Finally, by construction, and ; since we also assumed that , the lemma follows. ∎
3.3 Warmup: Streaming Multicut Problem
The Minimum Multicut problem is defined as follows. Given a weighted undirected graph and pairs of vertices , for , the goal is to remove the lowest weight subset of edges such that every , is disconnected from .
In the streaming context, suppose that the weights of the edges are in the range and the edges are ordered in an arbitrary order defining a dynamic data stream (with both insertions and deletions). We present a -approximation algorithm for the multicut problem that uses space and time excluding the time to construct a sparsifier. The term dominates the time required for sparsifier construction, for more details regarding streaming sparsifiers, see Kapralov et al. (2014); Guha et al. (2015). The algorithm comprises the following, the parameter will eventually be set to .
Sparsify the graph defined by the dynamic data stream, preserving all cuts, and thus the optimum multicut, within factor. Let be the edges in the sparsification and , where , from the results of Ahn et al. (2012b). Let refer to weights after the sparsification.
Given an edge set , let be the set of all – paths in the edge set . The LP that captures Multicut is best viewed as relaxation of a assignment. Variable is an indicator of whether edge is in the multicut. If we interpret as assignment of lengths, then for all , all have length at least . The relaxation is therefore:
Compute an initial upper bound . (Lemma 13)
Following the dual-primal approach above, as decreases (note the initial being high, we cannot hope to even approximately satisfy the dual), we consider the (slightly modified) dual