Faster Sublinear Algorithms using Conditional Sampling

A conditional sampling oracle for a probability distribution returns samples from the conditional distribution of the underlying distribution restricted to a specified subset of the domain. A recent line of work [CFGM13, CRS14] has shown that access to such a conditional sampling oracle requires only a polylogarithmic or even constant number of samples to solve distribution testing problems such as identity and uniformity testing. This significantly improves over the standard sampling model, where polynomially many samples are necessary.

Inspired by these results, we introduce a computational model based on conditional sampling to develop sublinear algorithms with exponentially faster runtimes compared to standard sublinear algorithms. We focus on geometric optimization problems over points in high-dimensional Euclidean space. Access to these points is provided via a conditional sampling oracle that takes as input a succinct representation of a subset of the domain and outputs a uniformly random point in that subset. We study two well-studied problems: k-means clustering and estimating the weight of the minimum spanning tree. In contrast to prior algorithms for the classic model, our algorithms have time, space and sample complexity that is polynomial in the dimension and polylogarithmic in the number of points.

Finally, we comment on the applicability of the model and compare it with existing models such as the streaming, parallel and distributed computational models.

1 Introduction

Consider a scenario where you are given a dataset of input points from some domain, stored in random access memory, and you want to estimate the number of distinct elements of this (multi-)set. One obvious way to do so is to iterate over all elements and use a hash table to find duplicates. Although simple, this solution becomes unattractive if the input is huge and it is too expensive even to parse it. In such cases, one natural goal is to get a good estimate of this number instead of computing it exactly. One way to do that is to pick some random points from the dataset and estimate, based on those, the total number of distinct elements in the set. This is equivalent to getting samples from a probability distribution where the probability of each element is proportional to the number of times it appears in the dataset. In the context of probability distributions, this is a well understood problem, called support estimation, and tight bounds are known for its sample complexity. More specifically, in [VV11], it is shown that the number of samples needed is nearly linear in the domain size: although sublinear, it still has a huge dependence on the input size.

In several situations, more flexible access to the dataset might be possible, e.g. when the data are stored in a database, which can significantly reduce the number of queries needed to perform support estimation or other tasks. One recent model, called conditional sampling, introduced by [CFGM13, CRS14] for distribution testing, describes such a possibility. In that model, there is an underlying distribution, and a conditional sampling oracle takes as input a subset of the domain and produces a sample from the distribution conditioned on that subset. [CFGM13] and [CRS14] study several problems in distribution testing, obtaining surprising results: using conditional queries it is possible to avoid polynomial lower bounds that exist for the sample complexity in the standard distribution testing framework and get testers with only polylogarithmic or even constant query requirements. In follow-up work, Acharya, Canonne and Kamath [ACK15b] consider the support estimation problem we described above and prove that it can be solved with a number of conditional samples that is doubly exponentially smaller than the number of samples required by the optimal classical algorithm [VV11].

Inspired by the power of these results, we introduce a computational model based on conditional sampling where the dataset is provided as a distribution and algorithms have access to a conditional sampling oracle that returns datapoints at random from a specified set. More precisely, an algorithm is given access to an oracle that takes as input a function, represented as a binary circuit, and returns a point chosen uniformly at random from the subset of input points that satisfy the function. If no such point exists, the oracle returns a special failure symbol. We assume that queries to the conditional sampling oracle Cond take time linear in the circuit size. Equivalently, we could assume constant time, as we already pay linear cost in the size of the circuit to construct it.

Most works on conditional sampling measure the performance of algorithms only by their query complexity. The work of [CRS14] considers the description complexity of the query set by examining restricted conditional queries that specify either pairs of points or intervals. However, in many cases, such simple queries might not be sufficient to obtain efficient algorithms. We use the circuit size of a set's description as a measure of simplicity, which allows for richer queries and is naturally counted towards the runtime of our algorithms.

Apart from its theoretical interest, it is practically useful to consider algorithms that perform well in the conditional sampling model. This is because efficient algorithms for the conditional sampling model can be directly implemented in a number of different computational models that arise when we have to deal with huge amounts of data. In particular, let us assume that we have an algorithm that solves a problem using a certain number of conditional queries, where each conditioning set has a bounded description size and some additional computation time is needed.

Parallel Computing:

We notice that the computation of one conditional sample can very easily be parallelized, because it suffices to assign to each processor a part of the input and send to each of them the description of the circuit. Each processor can compute which of its points satisfy the circuit and pick one at random among them. Then, we can select as output the sample of one processor chosen at random, where the probability of choosing a processor is proportional to the number of points assigned to it that belong to the conditioning set. This way we can implement a conditional sampling query in just a few steps. If the input is divided evenly among the processors, the load on each of them is proportional to its share of the input. Combining the answers can be done in a logarithmic number of steps, and therefore the conditional sampling model yields a non-trivial parallelization of the problem. Apart from the running time, one important issue that can decrease the performance of a parallel algorithm is the communication needed among the processors, as described in the work of Afrati et al. [ABS12]. This communication cost can be bounded by the size of the circuit at each round plus the communication for the partition of the input, which happens only once in the beginning.
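As an illustration, here is a minimal Python sketch of this combination step, assuming the input has already been partitioned into chunks and using an ordinary Python predicate in place of the circuit; the names worker_answer and parallel_cond_sample are ours and not part of the paper.

```python
import random

def worker_answer(local_points, predicate):
    # Each worker scans its local chunk: count the points satisfying the
    # conditioning predicate and pick one of them uniformly at random.
    hits = [p for p in local_points if predicate(p)]
    if not hits:
        return 0, None
    return len(hits), random.choice(hits)

def parallel_cond_sample(chunks, predicate):
    # Coordinator: pick a worker with probability proportional to its count
    # of satisfying points, and output that worker's local sample.
    answers = [worker_answer(chunk, predicate) for chunk in chunks]  # would run in parallel
    total = sum(count for count, _ in answers)
    if total == 0:
        return None  # the conditioning set is empty
    r = random.randrange(total)
    for count, point in answers:
        if r < count:
            return point
        r -= count

# toy usage: sample a uniformly random even number from a partitioned input
chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(parallel_cond_sample(chunks, lambda p: p % 2 == 0))
```

Since a worker holding c satisfying points is chosen with probability proportional to c and then reports one of them uniformly, every satisfying point is returned with the same probability.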

Streaming Algorithms:

The implementation of a conditional query in the streaming model, where we want to minimize the number of passes over the input, is fairly straightforward. With one pass over the input we can select one point uniformly at random from the points that belong to the conditioning set using standard streaming techniques. The space that we need for each pass is essentially the description size of the conditioning set, and the number of passes equals the number of conditional queries.
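A minimal sketch of one such pass, assuming the conditioning circuit is given as a Python predicate; this is just reservoir sampling restricted to the conditioning set.

```python
import random

def streaming_cond_sample(stream, predicate):
    # One-pass reservoir sampling restricted to the conditioning set:
    # the i-th satisfying point replaces the current candidate with
    # probability 1/i, so the final candidate is uniform over the set.
    chosen, seen = None, 0
    for point in stream:
        if predicate(point):
            seen += 1
            if random.randrange(seen) == 0:
                chosen = point
    return chosen  # None if the conditioning set is empty
```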

Distributed Computing:

The implementation of a conditional query in the distributed computational model can follow the same ideas as in the parallel computational model.

The common theme in all the above cases is that, once the conditional sampling model is appropriately translated to the desired computational model, we obtain high-performance algorithms whenever the number of conditional queries, the description size of the conditioning sets, and the additional computation time are all small. In this work we design algorithms for which all of these quantities are only polylogarithmic in the size of the input, which leads to very efficient algorithms in all the above models.

1.1 Previous Work on Sublinear Algorithms

We consider two very well studied combinatorial problems: k-means clustering and minimum spanning tree. For these problems, the following is known about sublinear algorithms in the classical setting.

1.1.1 k-means Clustering

Sublinear algorithms for k-median and k-means clustering were first studied by Indyk [Ind99]. In that work, given a set of points from a metric space, an algorithm is given that computes a set of centers approximating the cost of the optimal clustering within a constant factor and runs in time sublinear in the input size. Notice that the algorithm is sublinear, as the input contains all the pairwise distances between the points, which has size quadratic in the number of points.

In follow-up work, Mettu and Plaxton [MP04] gave a randomized constant-factor approximation algorithm for the k-median problem with sublinear running time, subject to a constraint relating the number of clusters to the number of points and to the ratio between the maximum and the minimum distance between any pair of distinct points in the metric space. Also, Meyerson et al. [MOP04] presented a sublinear algorithm for the k-median problem under the assumption that each cluster is sufficiently large.

In a different line of work, Mishra, Oblinger and Pitt [MOP01] and later Czumaj and Sohler [CS07] assume that the diameter of the set of points is bounded and known. The running time of the algorithm by Mishra et al. [MOP01] is only logarithmic in the input size but is polynomial in the diameter. Their algorithm is very simple: it just picks a uniformly random subset of the points and solves the clustering problem on that subset. Following similar ideas, Czumaj and Sohler [CS07] gave a tighter analysis of the same algorithm, proving that the running time depends only on the diameter and is independent of the number of points. The dependence on the diameter is still polynomial in this work. The guarantee in both works is a constant multiplicative approximation with an additional additive error term.

1.1.2 Minimum Spanning Tree in Euclidean metric space

There is a large body of work on sublinear algorithms for the minimum spanning tree. In [Ind99], given points in a metric space, an algorithm is provided that outputs a spanning tree in sublinear time, approximating the optimum within a constant factor. When considering only the task of estimating the weight of the optimal spanning tree, Czumaj and Sohler [CS04] provided a sublinear-time algorithm that estimates the weight within a small multiplicative error.

To achieve better guarantees, several assumptions can be made. A first assumption is that we are given a graph with bounded average degree whose edge weights are also bounded. For this case, the work of Chazelle et al. [CRT05] provides a sublinear algorithm that returns the weight of the minimum spanning tree with small relative error. Although the algorithm completely removes the dependence on the number of points, its running time depends polynomially on the maximum edge weight. Moreover, in very dense graphs the average degree is polynomial in the number of points, and therefore we again get a polynomial dependence on the number of points.

Finally, another assumption we can make is that the points belong to d-dimensional Euclidean space. For this case, the work of Czumaj et al. [CEF05] provides an approximation algorithm whose running time is sublinear in the number of points. Note that in this case the size of the input is proportional to the number of points times the dimension, and not quadratic in the number of points, since given the coordinates of the points we can calculate any distance. Therefore, the algorithms described before, whose running time is comparable to the number of points, are no longer sublinear. Although Czumaj et al. [CEF05] manage to achieve a sublinear algorithm in this case, they cannot escape a polynomial dependence on the number of points. Additionally, their algorithm has an exponential dependence on the dimension of the Euclidean space.

1.2 Our Contribution

The main result of our work is that in the conditional sampling framework we can get exponentially faster sublinear algorithms compared to sublinear algorithms in the classical framework.

We first provide some basic building blocks – useful primitives for the design of algorithms. These building blocks are:

  1. Compute the size of a set given its description, Section 3.1.

  2. Compute the maximum of the weights of the points of a set given the description of the set and the description of the weights, Section 3.2.

  3. Compute the sum of the weights of the points of a set given the description of the set and the description of the weights, Section 3.3.

  4. Get a weighted conditional sample from the input set of points given the description of the weights, Section 3.4.

  5. Get an ℓ0-sample given the description of labels of the points, Section 3.5.

For all these primitives, we give algorithms that run in time polylogarithmic in the domain size and the value range of the weights. We achieve this by querying the conditional sampling oracle with random subsets produced by an appropriately chosen distribution on the domain. Intuitively, this helps to estimate the density of the input points in different parts of the domain. One important issue with conditioning on random sets is that the description complexity of the set can be almost linear in the domain size. To overcome this difficulty, we replace the random sets with pseudorandom ones based on Nisan's pseudorandom generator [Nis90]. The implementation of these primitives is of independent interest, especially the fourth one, since it shows that weighted conditional sampling, which is similar to sampling models that have been used in the literature [ACK15a], can be simulated by the conditional sampling model with only a polylogarithmic overhead in the query complexity and the running time.

After describing and analyzing these basic primitives, we use them to design fast sublinear algorithms for k-means clustering and the minimum spanning tree.

1.2.1 k-means Clustering

Departing from the works of Mishra, Oblinger and Pitt [MOP01] and Czumaj and Sohler [CS07], where the algorithms start by choosing a uniformly random subset, we start by choosing a random subset based on weighted sampling instead. In the classical computational model we need at least linear time to get one such weighted sample, and thus it is not possible to use the power of weighted sampling to get sublinear time algorithms for the k-means problem. But when we are working in the conditional sampling model, this adaptive sampling can be implemented in polylogarithmic time and queries. This enables us to use the known literature on obtaining efficient algorithms via adaptive sampling [AV07]. Quantitatively, the advantage of weighted sampling is that we can get sublinear algorithms whose running time is polylogarithmic in both the diameter of the metric space and the number of input points. This is exponentially better than Indyk [Ind99] in terms of the number of points and exponentially better than Czumaj and Sohler [CS07] in terms of the diameter. This shows the huge advantage that one can get from the ability to use or implement conditional sampling queries. We develop and analyze these ideas in detail in Section 4.

1.2.2 Minimum Spanning Tree in Euclidean metric space

Based on the series of works on sublinear algorithms for minimum spanning trees, we develop algorithms that exploit the power of conditional sampling and achieve polylogarithmic time with respect to the number of input points and only polynomial time with respect to the dimension of the Euclidean space. This is very surprising, since in the classical model there seems to be a polynomial barrier that cannot be avoided. Compared to the algorithm by Czumaj et al. [CEF05], our running time is an exponential improvement with respect to both the number of points and the dimension.

We present our algorithm in Section 5. From a technical point of view, we use a gridding technique similar to [CEF05], but prove that using a random grid can significantly reduce the runtime of the algorithm, as we avoid tricky configurations that can happen in the worst case.

2 Model and Preliminaries

Notation

For a positive integer n, we denote the set {1, …, n} by [n]. We use Alg to denote algorithms.

Given a function that takes values over the rationals, we use its circuit representation to denote the binary circuit that takes as input the binary representation of the function's input and outputs the binary representation of the function's output. If the input or the output is a rational number, then the representation is the pair of numerator and denominator.

Suppose we are given an input consisting of a sequence of points, each belonging to some domain. In this work, we fix the domain to be the discretized d-dimensional Euclidean space. Our goal is to compute the value of a symmetric function on this input. We assume that all input points are distinct and define the input set to be the set of these points. Since we consider symmetric functions, it is convenient to extend the definition of the function to sets.

A randomized algorithm that estimates the value of the function is called sublinear if and only if its running time is sublinear in the size of the input. We are interested in additive or multiplicative approximation. A sublinear algorithm Alg for computing the function has ε-additive approximation if and only if, with constant probability, its output is within an additive error ε of the true value, and has ε-multiplicative approximation if and only if, with constant probability, its output is within a multiplicative factor (1 ± ε) of the true value. We usually refer simply to an ε-approximation, and it is clear from the context whether we mean the additive or the multiplicative one.

2.1 Conditional Sampling as Computational Model

The standard sublinear model assumes that the input is stored in a random access memory that has no further structure. Since the function is symmetric in the input points, the only reasonable operation is to uniformly sample points from the input. Equivalently, the input can be provided by an oracle Sub that returns a point chosen uniformly at random from the input set.

When the input has additional structure (e.g. points stored in a database), more complex queries can be performed. The conditional sampling model allows such queries of small description complexity. In particular, the algorithm is given access to an oracle that takes as input a function and returns a point chosen uniformly at random from the subset of input points that satisfy the function. If no such point exists, the oracle returns a special failure symbol. The function is represented as a sublinear-sized binary circuit. All the results presented in this paper use polylogarithmic circuit sizes.

We assume that queries to the conditional sampling oracle Cond take time linear in the circuit size. Equivalently, we could assume constant time, as we are already paying linear cost in the size of the circuit to construct it.
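For concreteness, the following sketch fixes the intended semantics of the Cond oracle with a naive in-memory implementation (it is of course not sublinear); an ordinary Python predicate stands in for the binary circuit.

```python
import random

def cond(points, circuit):
    # Naive reference semantics of the Cond oracle: return (i, x_i) for a
    # uniformly random index i whose point satisfies the circuit, or None
    # (the failure symbol) if no input point does.
    satisfying = [(i, x) for i, x in enumerate(points) if circuit(x)]
    return random.choice(satisfying) if satisfying else None

# toy usage: a uniformly random input point with an even first coordinate
print(cond([(1, 5), (2, 7), (4, 0)], lambda p: p[0] % 2 == 0))
```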

2.2 k-means Clustering

Consider a distance metric function that gives the distance between any two points of the space. Given a set of centers, we define the distance of a point from the centers to be its distance to the closest center. Now, given a set of input points and a set of centers, we define the cost of the centers to be the sum, over all input points, of their distances to the centers. The k-means problem is the problem of minimizing the squared cost over the choice of at most k centers. We assume that the diameter of the metric space is bounded.

2.3 Minimum spanning tree in Euclidean space

Given a set of points in d dimensions, the minimum spanning tree problem in Euclidean space asks to compute a spanning tree on the points minimizing the sum of the weights of its edges. The weight of an edge between two points is equal to their Euclidean distance.

We will focus on a simpler variant of the problem, which is to compute the weight of the best possible spanning tree, i.e. to estimate the weight of the minimum spanning tree rather than the tree itself.

3 Basic Primitives

In this section, we describe some primitive operations that can be efficiently implemented in this model. We will use these primitives as black boxes in the algorithms for the combinatorial problems we consider. We make this separation as these primitives are commonly used building blocks and will make the presentation of our algorithms cleaner.

Many of the algorithmic primitives are based on constructing random subsets of the domain and querying the oracle Cond with a description of such a set. A barrier is that such subsets have description complexity that is linear in the domain size. For this reason, we will use pseudorandom sets whose description is polylogarithmic in the domain size. The main tool for doing this is Nisan's pseudorandom generator [Nis90], which produces pseudorandom bits that appear perfectly random to algorithms that use only a small amount of space.

Theorem 1 (Nisan’s Pseudorandom Generator [Nis90]).

Let and denote uniformly random binary sequences of length and respectively. There exists a map such that for any algorithm , with , where , it holds that

for .

Nisan’s pseudorandom generator is a simple recursive process that starts with random bits and generates a sequence of bits. The sequence is generated in blocks of size and every block can be computed given the seed of size using multiplications on bit numbers. The overall time and space complexity to compute the -th block of bits is and there exists a circuit of size that performs this computation.

Using Nisan's theorem, we can easily obtain pseudorandom sets for conditional sampling. We are interested in random sets where every element appears with probability given by some specified function.

Corollary 1 (Pseudorandom Subset Construction).

Let R be a random set, described by a circuit, that is generated by independently adding each element with probability given by a function that is itself described by a circuit. For any desired accuracy, there exists a random set described by a polylogarithmic-sized circuit such that

for all conditioning circuits and elements.

Proof.

The corollary is an application of Nisan's pseudorandom generator to conditional sampling. A simple linear time algorithm performs conditional sampling based on a random set as follows. We keep two variables: the number of elements that pass the selection criteria, initialized to 0, and the currently selected element. For every element of the domain, in order, we perform the following:

  1. Draw random bits and interpret them as a number; check whether this number exceeds the (truncated) inclusion probability of the element.

  2. If it does, skip the element and continue to the next one.

  3. Otherwise, if the element also satisfies the conditioning circuit, increment the counter and, with probability equal to the inverse of the counter, change the selected element to the current one.

Note that here we have truncated the probabilities to finite accuracy, so the random set used is slightly different from the ideal one. However, picking the truncation accuracy small enough, we have that

for all conditioning circuits and elements.

To prove the statement, we will use Nisan's pseudorandom generator to generate the sequence of bits for the algorithm. The algorithm requires only enough memory to store the two variables. Moreover, the total number of random bits used is bounded, and thus by Nisan's pseudorandom generator we can create the sequence of random bits from a short seed and feed it to the algorithm. This sequence can be computed in blocks by a small circuit. We align blocks of bits with points, and thus the circuit gives, for each point, the bits needed in the first step of the above algorithm. This implies that the circuit that takes the output of the generator and compares it with the truncated probabilities satisfies:

for all conditioning circuits and elements. By the triangle inequality, we get the desired error probability with respect to the ideal random set.

The total size of the circuit is polylogarithmic, which completes the proof. ∎
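The following Python sketch mirrors the small-space procedure described in this proof, with fresh random bits drawn per element; in the actual construction these bits would be produced block by block by Nisan's generator. The function and parameter names are ours.

```python
import random

def sample_from_random_set(domain, prob, circuit, bits=16):
    # Stream over the domain, include each element in the random set with
    # (truncated) probability prob(x), and reservoir-sample one included
    # element that also satisfies the conditioning circuit.
    count, selected = 0, None
    for x in domain:
        r = random.getrandbits(bits)             # fresh random bits for x
        threshold = int(prob(x) * (1 << bits))   # probability truncated to 'bits' bits
        if r >= threshold:
            continue                             # x is not in the random set
        if circuit(x):                           # x is in the set and satisfies the circuit
            count += 1
            if random.randrange(count) == 0:
                selected = x
    return selected

# toy usage: sample among multiples of 3, each included with probability 1/4
print(sample_from_random_set(range(100), lambda x: 0.25, lambda x: x % 3 == 0))
```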

3.1 Point in Set and Support Estimation

3.1.1 Point in Set

The point-in-set function takes a set given as a circuit and returns one input point lying in that set, or reports failure if no input point lies in the set. The function takes as input the description of the set. Implementing this function in the conditional sampling model is trivial: since point-in-set may return any point of the described set, a uniformly random point suffices. Therefore we just call the Cond oracle once with the given description and return its answer.

We can test whether there is a unique point in a set by finding one point, excluding it from the set, and querying again. Similarly, if the set contains a small number of points, we can recover all of them by querying repeatedly, each time excluding the points that have already been found.

3.1.2 Support Estimation

The support estimation function takes as input the description of a set and outputs an estimate of the number of input points that lie in it. We call this function SE.

The first step is to define a random subset of the domain by independently adding every element with probability inversely proportional to an integer parameter that serves as a guess of the support size. We will later use Corollary 1 to show that an approximate version of this set can be efficiently constructed. We then use the Point-In-Set primitive to query whether some input point of the given set falls in the random subset, which happens with probability

Using this query, we can distinguish whether the support size is significantly larger or significantly smaller than the guess, since the two cases lead to noticeably different hit probabilities. The total variation distance between the two cases is

where for the second-to-last inequality we assumed that the support is not too small. (The case of a very small support can be trivially handled by listing a few points from the set.)

We can therefore conclude that, with a bounded number of queries of this form, we can distinguish with good probability between a support size that is larger and one that is smaller than the current guess. Binary searching over the possible guesses, we can compute an approximation of the support size. A more efficient estimator, since we care about a multiplicative approximation, only considers guesses that form a geometric sequence; there are only logarithmically many such values, so a binary search over them takes a doubly logarithmic number of iterations, and the total number of queries remains small.

To efficiently implement each query, we produce a circuit using Corollary 1 with a small error parameter and a constant inclusion probability function. The only change is that at every comparison the relevant probabilities are accurate only to within the truncation error. Choosing the error parameter small enough, the total variation distance between the two cases remains essentially unchanged, and thus the same analysis goes through. The circuit has polylogarithmic size, which implies that the total runtime over all queries is also polylogarithmic.
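The following sketch illustrates the estimator, with the oracle interaction simulated directly: the helper hits plays the role of one Point-In-Set query on the random set, and the guesses form a geometric sequence that is binary searched. The threshold, the bound n_max, and the repetition count are illustrative choices.

```python
import math
import random

def hits(S, g):
    # Simulate one Point-In-Set query on a random set that contains each
    # domain element independently with probability 1/g: we only need to
    # know whether the random set intersects S.
    return any(random.random() < 1.0 / g for _ in S)

def estimate_support(S, eps=0.5, reps=200):
    # Binary search over guesses of the form (1+eps)^j, using the empirical
    # hit frequency to decide whether |S| is larger or smaller than the guess.
    n_max = 2 ** 20                      # assumed upper bound on the support size
    guesses = [(1 + eps) ** j for j in range(int(math.log(n_max, 1 + eps)) + 1)]
    lo, hi = 0, len(guesses) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        freq = sum(hits(S, guesses[mid]) for _ in range(reps)) / reps
        if freq > 1 - math.exp(-1):      # hits are still frequent: |S| exceeds the guess
            lo = mid + 1
        else:
            hi = mid
    return guesses[lo]

print(estimate_support(set(range(1000))))  # prints a value of the order of 10^3
```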

Using our conditional sampling oracle, we are able to obtain the following lemma:

Lemma 1.

There exists a procedure that takes as input the description of a set and computes an ε-multiplicative approximation of the number of input points it contains, using a number of conditional samples and a running time that are polylogarithmic in the domain size.

3.1.3 Distinct Values

One function that is highly related to support estimation is the distinct values function, which we denote by DV. The input of this function is a description of a set together with a function described by a circuit . The output of DV is the total number of distinct values taken by on the subset , i.e.

To implement the distinct values function, we perform support estimation on the range of the function. This is done as before by computing random sets and performing Point-In-Set queries, where the conditioning circuit now accepts an input point if the value of the function on it lies in the random subset of the range.

Lemma 2.

There exists a procedure that takes as input the description of a set and a function given as a circuit and computes an ε-multiplicative approximation to the number of distinct values of the function on the set, using a number of conditional samples and a running time that are polylogarithmic in the domain size.

3.2 Point of Maximum Weight

The point of maximum weight function takes as input the description of a set together with a weight function given by a circuit. The output of the function is the maximum weight of an input point in the set. We call this function Max. Sometimes we are also interested in finding a point where this maximum is achieved, which we call ArgMax.

This is simple to implement by binary search over the value of the maximum. At every step, we make a guess for the answer and test whether there exists an input point of the set whose weight exceeds the guess. The number of queries and the runtime are logarithmic in the range of the weights.

Lemma 3.

There exists a procedure that takes as input the description of a set and a function given as a circuit and computes the maximum weight using a number of conditional samples and a running time that are logarithmic in the range of the weights.

An alternative algorithm solves this task with a logarithmic number of queries with high probability. The algorithm starts with the lowest possible value for the answer. At every step, it asks the Cond oracle for a random point whose weight exceeds the current estimate. If such a point exists, the algorithm updates the estimate to the weight of that point. Otherwise, if no such point exists, the current estimate is the maximum. It is easy to see that at every step, with constant probability, at least half of the remaining candidate points are discarded. Repeating for logarithmically many steps, the candidates are halved enough times so that the maximum is identified with high probability. Thus, the total number of queries is logarithmic, and we obtain the following lemma.

Lemma 4.

There exists a procedure that takes as input the description of a set and a function given as a circuit and computes the maximum weight using a logarithmic number of conditional samples and a correspondingly small running time, with a small probability of error.
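A small sketch of this alternative procedure, with the Cond query simulated by filtering the point set directly; the weight function here is an arbitrary Python callable and the toy weights are illustrative.

```python
import random

def find_max_weight(points, w):
    # Repeatedly ask for a random point whose weight strictly exceeds the
    # current estimate and raise the estimate to that point's weight; each
    # round, half of the remaining candidates are discarded in expectation.
    current = min(w(p) for p in points)              # lowest possible value of the answer
    while True:
        better = [p for p in points if w(p) > current]   # stands in for one Cond query
        if not better:
            return current                            # no heavier point exists
        current = w(random.choice(better))

print(find_max_weight(list(range(100)), lambda x: (x * 37) % 101))
```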

3.3 Sum of Weights of Points

The sum of weights function takes as input the description of a set together with a weight function. The output of the function is an approximation of the sum of the weights of all input points in the set. We call this function Sum.

To implement this function in the conditional sampling model, we first compute the maximum weight (Lemma 4). We then create buckets by grouping together points whose weights are close. Let a circuit describe each such bucket. We can get an estimate for the overall sum as

To see why this is an accurate estimate we rewrite the summation in the following form:

(1)

To bound the error of the second term of (1), notice that within each bucket all weights are within a small factor of each other. Thus, the weight level of a bucket times the number of points in it is a good approximation to the sum of the weights in that bucket. Since the SE primitive returns a multiplicative approximation of the bucket's size, we get that the second term of (1) is approximated multiplicatively within the desired accuracy.

The first term introduces an additive error that is only a small fraction of the total, which implies that the estimate gives a multiplicative approximation to the sum of weights. Rescaling the accuracy parameter by a constant, we get the desired guarantee. Thus, we can compute the estimate using one query to the Max primitive and one SE query per bucket. For the process to succeed with good probability, we require that all of the SE queries succeed with correspondingly higher probability. Plugging in the corresponding guarantees of Lemmas 1 and 4, we obtain the following:

Lemma 5.

There exists a procedure that takes as input the description of a set and a function given by a circuit and computes an ε-multiplicative approximation of the sum of the weights of the input points in the set, using a number of conditional samples and a running time that are polylogarithmic in the domain size and the range of the weights.
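A sketch of the bucketing idea behind Sum, where the bucket sizes are computed exactly in place of the SE estimates; the parameter names and the cut-off for negligible weights are illustrative.

```python
def approx_sum(points, w, eps=0.1):
    # Bucket points whose weights are within a (1+eps) factor of each other,
    # count each bucket (SE would estimate this), and add size * bucket weight.
    # Points with weight below eps/n times the maximum are dropped, which only
    # introduces a small additive error relative to the total.
    n = len(points)
    w_max = max(w(p) for p in points)            # provided by the Max primitive
    num_buckets = 1
    while w_max / (1 + eps) ** num_buckets > eps * w_max / n:
        num_buckets += 1
    total = 0.0
    for j in range(num_buckets):
        hi, lo = w_max / (1 + eps) ** j, w_max / (1 + eps) ** (j + 1)
        size = sum(1 for p in points if lo < w(p) <= hi)   # SE would estimate this
        total += size * hi
    return total

pts = list(range(1, 101))
print(approx_sum(pts, float), sum(pts))  # estimate vs. exact sum
```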

3.4 Weighted Sampling

The weighted sampling function takes as input the description of a set together with a weight function given as a circuit. The output of the function is a point of the set chosen with probability proportional to its weight. Therefore, we are interested in creating an oracle that outputs each element with probability proportional to its weight.

To implement weighted sampling in the conditional sampling model, we use a similar idea as in support estimation. First we compute an estimate of the total weight, and then we define a random set that contains every element independently with probability

(2)

We will later use Corollary 1 in order to build a pseudorandom set with a small circuit description that approximately achieves the guarantees of the truly random set.

Based on this random set, we describe Algorithm 1, which performs weighted sampling according to the given weight function.

1: Initialize the output to "no sample yet"
2: while no sample has been selected and the number of iterations is below the limit do
3:     Construct the random set as described by equation (2)
4:     Check if there exists a unique input point in the intersection of the random set with the given set
5:     if such a unique point exists then
6:         With the appropriate acceptance probability, set the output to that point
7: return the output
Algorithm 1 Sampling elements according to their weight.
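Below is a runnable sketch of this rejection-sampling loop. Since equation (2) is stated abstractly above, the sketch fixes one concrete choice: each point is kept with probability equal to its weight divided by an upper bound on the total weight, and a unique survivor is accepted with the complementary probability, which makes the output distribution exactly proportional to the weights; the probabilities used in the paper's construction may differ.

```python
import random

def weighted_sample(points, w, w_upper, max_iters=10000):
    # Each round, keep point x independently with probability w(x)/w_upper,
    # where w_upper upper-bounds the total weight (as produced by Sum).
    # A unique survivor x is accepted with probability 1 - w(x)/w_upper, so
    # the per-round success probabilities are proportional to the weights.
    for _ in range(max_iters):
        survivors = [x for x in points if random.random() < w(x) / w_upper]
        if len(survivors) == 1:
            x = survivors[0]
            if random.random() < 1 - w(x) / w_upper:
                return x
    return None  # exceeded the iteration limit

pts = [1, 2, 3, 4]
print(weighted_sample(pts, lambda x: x, w_upper=sum(pts)))  # returns 4 with prob. 0.4
```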

We now argue the correctness of this algorithm. Given a purely random set, we first show that at every iteration the probability of selecting each point is proportional to its weight. This implies that the same is true for the final distribution, as we perform rejection sampling on the outcomes.

The probability that in one iteration the algorithm returns a particular point is the probability that this point has been chosen in the random set, that it is the unique point of the input set lying in the random set, and that it was not filtered out in the acceptance step. For every point, this probability is equal to

and it is easy to see that this probability is proportional to the weight of the point, as all other factors do not depend on it.

We now bound the probability of selecting a point in one iteration. This is equal to

which is at least a constant for a small enough parameter chosen in our estimate Sum of the total weight. Thus at every iteration, there is a constant probability of outputting a point. Repeating for logarithmically many iterations, the algorithm outputs a point with high probability.

Summarizing, if we assume a purely random set, the probability that the above procedure fails to output a point is small, plus the probability that the computation of the sum fails, which we can also make small, for a small total probability of failure. Since we need only a constant multiplicative approximation to the sum, using Lemma 5 the total number of queries needed to make the probability of failure small is polylogarithmic.

Since the truly random set can have very large description complexity, we use Corollary 1 to generate a pseudorandom set instead. Applying the corollary with a small enough error parameter, the total variation distance between the output distribution of one step when using the pseudorandom set and the distribution when using the truly random set is at most:

Since we make a bounded number of queries to the oracle Cond, the total variation distance between the two overall output distributions is at most the per-query distance times the number of queries. Setting the error parameter appropriately, this distance becomes as small as desired. Computing the total runtime and number of samples, we obtain the following lemma.

Lemma 6.

There exists a procedure that takes as input the description of a set and a function given by a circuit and returns a point from a probability distribution that is close in total variation distance to the distribution that selects each element with probability proportional to its weight. The procedure fails with small probability, and its number of conditional samples and running time are polylogarithmic in the domain size and the range of the weights.

3.5 Distinct Elements Sampling – ℓ0-Sampling

The distinct elements sampling function takes as input the description of a set together with a function described by a circuit. It outputs samples from a distribution on the points of the set such that the distribution of their values under the function is uniform over the image of the set. We thus want every value in the image to be equally likely to be the value of the returned point.

We first explain the implementation of the algorithm assuming access to true randomness. Assume therefore that we have a circuit that describes a purely random hash function over the values. Then the point whose value attains the maximum hash is uniformly distributed over the value classes, as long as the maximum is unique. This means that if we call the procedure ArgMax to find such a point and check that no other point attains the same maximum hash value, then the result is a point distributed according to the correct distribution. Repeating a few times guarantees that we get a valid point with good probability.

Therefore the only remaining question is how to replace the truly random hash function with a pseudorandom one. We can apply Nisan's pseudorandom generator. Consider an algorithm that, for every point in order, draws a hash value uniformly at random, checks whether the point satisfies the conditioning circuit, and checks whether its hash value is the largest seen so far. This algorithm computes the maximizer while only keeping track of the largest hash value and the corresponding point. It therefore uses little memory and a bounded number of random bits. We can thus apply Nisan's theorem (Theorem 1) and replace the truly random hash function with a pseudorandom one that is generated from a short seed and has a small circuit representation.
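A sketch of the ideal (truly random) version of this sampler: each distinct value receives an independent uniform hash, and the returned point is one attaining the maximum hash, so its value is uniform over the image. The dictionary below stands in for the random hash function that the construction replaces by Nisan's generator; which point within the winning value class is returned does not affect the value distribution.

```python
import random

def l0_sample(points, f):
    # Assign each distinct value f(x) an independent uniform hash and return
    # a point whose value attains the maximum hash; since the hashes of
    # distinct values are i.i.d., the value of the result is uniform over
    # the image f(points).
    h = {}
    for x in points:
        v = f(x)
        if v not in h:
            h[v] = random.random()    # lazily drawn truly-random hash of the value
    return max(points, key=lambda x: h[f(x)])

pts = list(range(20))
print(l0_sample(pts, lambda x: x % 5))  # the value of the result is uniform over {0,...,4}
```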

This means that we can use Lemma 4 and Theorem 1 to get the following lemma about ℓ0-sampling.

Lemma 7.

There exists a procedure that takes as input the description of a set and a function given by a circuit and returns a point from a probability distribution that is close in total variation distance to a distribution under which every value in the image of the set is equally likely. This procedure fails with small probability, and its number of conditional samples and running time are polylogarithmic.

4 k-means clustering

In this section we describe how known algorithms for k-means clustering can be transformed into sublinear time algorithms when we have access to conditional samples. The basic tool of these algorithms was introduced by Arthur and Vassilvitskii [AV07].

D²-sampling:

This technique provides some very simple algorithms that easily obtain a constant factor approximation to the optimal k-means clustering, as in the work of Aggarwal et al. [ADK09]. Also, if we allow the running time to depend exponentially on the number of clusters, we can get a PTAS as in the work of Jaiswal et al. [JKS14]. The drawback of this PTAS is that it works only for points in d-dimensional Euclidean space.

When working in arbitrary metric spaces, inspired by Aggarwal et al. [ADK09], we use adaptive sampling to get a constant factor approximation to the k-means problem. We now describe how all the steps can be implemented in sublinear time using the primitives described in the previous section. The steps of the algorithm are:

  1. Pick a set of points according to D²-sampling.
    For each of the k steps, consider the set of samples chosen so far. We pick the next point according to the following distribution

    Implementation: To implement this step we simply use the WCond primitive, where the conditioning circuit is the constant true circuit and the weight of a point is its squared distance from the already chosen centers. The circuit implementing this weight function has size polylogarithmic in the diameter of the space, and since the minimum over the chosen centers can be implemented by a tournament with one comparison per center, the total circuit size remains small. Therefore we can use Lemma 6 and get that a small number of queries and a small running time suffice to obtain the needed samples from a distribution that is close in total variation distance to the correct one, with small probability of error for each sample. (A sketch of this sampling step appears after this list.)

  2. Weight the points of the chosen set according to the number of input points that each one represents.
    For any chosen point, we set its weight to

    Implementation: To implement this step, given the previous one, we iterate over all the chosen points and for each one compute its weight using the procedure Sum as described in Lemma 5. Similarly to the previous step, the conditioning circuit is the constant true circuit, and the weight circuit evaluates to 1 on an input point if the closest chosen point to it is the point under consideration and to 0 otherwise. To describe this function we need, as before, a circuit of small size. Therefore for this step we need a small number of conditional samples and a small running time in order to get a multiplicative approximation of every weight.

  3. Solve the weighted k-means problem with the weighted points of P.
    This can be done using an arbitrary constant factor approximation algorithm for k-means, since the number of chosen points is small and therefore the running time of this step is already sublinear in the input size.
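The sketch referred to in step 1, for one-dimensional points and with the weighted draw performed directly instead of through WCond; it shows the D²-sampling rule that the WCond primitive implements in our model. The function names and toy data are ours.

```python
import random

def squared_distance_to_centers(x, centers):
    # Weight used for D^2-sampling: squared distance of x to its closest
    # chosen center (or 1 if no center has been chosen yet).
    if not centers:
        return 1.0
    return min((x - c) ** 2 for c in centers)

def d2_sample_centers(points, k):
    # Pick k centers, each drawn with probability proportional to its
    # squared distance from the centers already chosen; in the conditional
    # sampling model this draw is performed by WCond.
    centers = []
    for _ in range(k):
        weights = [squared_distance_to_centers(x, centers) for x in points]
        centers.append(random.choices(points, weights=weights, k=1)[0])
    return centers

print(d2_sample_centers([0.0, 0.1, 0.2, 5.0, 5.1, 9.9, 10.0], k=3))
```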

To prove that this algorithm achieves a constant factor approximation, we use Theorem 1 and Theorem 4 of [ADK09]. From Theorem 1 of [ADK09] and the fact that we sample from a distribution that is close in total variation distance to the correct one, we conclude that the set of centers we choose satisfies the guarantee of Theorem 1 of [ADK09] with small probability of error. Then it is easy to see from Theorem 4 of [ADK09] that when we have a constant factor approximation of the weights, we lose only a constant factor in the approximation ratio; therefore a constant approximation accuracy suffices for the weights. Finally, for the total probability of error to be constant, the failure probability of each step must also be a sufficiently small constant. Combining all of the above with the description of the steps, we get the following result.

Theorem 2.

There exists an algorithm that computes a constant factor approximation to the k-means clustering objective using a number of conditional queries and a running time that are polylogarithmic in the number of input points.

Remark 1. The above algorithm can be extended to an arbitrary metric space where we are given a circuit that describes the distance metric function. In this case the running time will also depend on the size of this circuit.

Remark 2. In the case that the points belong to d-dimensional Euclidean space, we can also use the Find-k-means algorithm of [JKS14] to get a (1+ε)-approximation instead of a constant one. This algorithm iterates over a number of subsets of the input of a specific size that have been selected using D²-sampling; from all these different solutions it then selects the one with minimum cost. We can implement this algorithm using our WCond and Sum primitives to get a sublinear (1+ε)-multiplicative approximation algorithm in the conditional sampling model.

5 Euclidean Minimum Spanning Tree

In this section we discuss how to use the primitives described earlier in order to estimate the weight of the minimum spanning tree of points in Euclidean space.

More specifically, suppose that we have a d-dimensional discrete Euclidean space and a set of input points in it. We assume that the side length of the space is bounded, which is a reasonable assumption to make when bounded-precision arithmetic is used. This means that each coordinate of our points can be specified using a bounded number of bits.

In what follows, we use a formula that relates the weight of an MST to the number of connected components of certain graphs. Let the maximum distance between any pair of input points be the largest relevant scale. For a given threshold, consider the graph whose vertices correspond to the input points and which has an edge between two vertices if and only if the distance of the corresponding points is at most the threshold; we are interested in its number of connected components. In [CS04], it is shown that the following quantity, which combines the component counts at geometrically increasing thresholds, leads to an ε-multiplicative approximation of the weight of the minimum spanning tree:

(3)

The quantity would be exactly equal to the weight of the minimum spanning tree if every pairwise distance coincided with one of the thresholds used in the formula.

In order to estimate the weight of the MST, we need to estimate the number of connected components of each threshold graph. As shown in [FIS08], for every threshold, we can equivalently perform the estimation task after rounding the coordinates of all points to an arbitrary grid whose cell size is a small fraction of the threshold. This introduces a small multiplicative error, which we can ignore by rescaling the accuracy by a constant.

We thus assume that every point lies at the center of a grid cell when performing our estimation. We perform a sampling process which samples uniformly from the occupied grid cells (regardless of the number of points in each of them) and estimates the number of cells covered by the connected component that the sampled cell belongs to. Comparing that estimate to an estimate of the total number of occupied grid cells, we obtain an estimate of the total number of connected components. In more detail, if the sampled component covers a certain fraction of the occupied cells, the guess for the number of components is the inverse of that fraction. For that estimator to be accurate, we need to make sure that the total expected number of occupied grid cells is comparable to the total number of components, without blowing up exponentially with the dimension of the Euclidean space. We achieve this by choosing a uniformly random shift of the grid. This random shift helps us avoid corner cases where a very small component spans a very large number of cells even when all its contained points are very close together; with a random shift, such cases have negligible probability.
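A sketch of this estimator in isolation, assuming direct access to a map from occupied cells to their connected components (which the actual algorithm only explores through conditional queries); it illustrates why the inverse-fraction guess is unbiased.

```python
import random

def estimate_components(cell_to_component, samples=200):
    # Sample occupied cells uniformly; if the sampled cell's component covers
    # a fraction p of all occupied cells, the single-sample guess is 1/p.
    # A component covering a q-fraction is sampled with probability q, so the
    # expectation of the guess equals the true number of components.
    cells = list(cell_to_component)
    total = len(cells)
    comp_size = {}
    for c in cells:
        comp_size[cell_to_component[c]] = comp_size.get(cell_to_component[c], 0) + 1
    guesses = []
    for _ in range(samples):
        c = random.choice(cells)                      # uniform over occupied cells
        fraction = comp_size[cell_to_component[c]] / total
        guesses.append(1.0 / fraction)
    return sum(guesses) / len(guesses)

# toy usage: 10 occupied cells split into components of sizes 5, 3 and 2
mapping = {i: (0 if i < 5 else 1 if i < 8 else 2) for i in range(10)}
print(estimate_components(mapping))  # concentrates around 3
```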

We will first use the following lemma to get an upper bound on the number of occupied cells which holds with high probability:

Lemma 8.

Let a curve in d dimensions be given, consider a grid of a given cell side length, and let the grid be shifted by a random vector distributed uniformly over a single cell. Then, the expected number of cells of the shifted grid that contain some point of the curve is equal to the volume of the Minkowski sum of the curve with a single grid cell, divided by the volume of a cell.

Proof.

Consider the grid shifted by the random vector. We associate every point of space with the cell of the shifted grid that contains it. Observe that a cell intersects the curve if and only if the corresponding grid point lies in the Minkowski sum of the curve with a single cell. The expected number of occupied grid cells is thus equal to the expected number of grid points of the shifted grid that lie in this Minkowski sum.

Note that each of the original grid points can move inside a