Optimal Algorithms and Lower Bounds for
Testing Closeness of Structured Distributions
We give a general unified method that can be used for closeness testing of a wide range of univariate structured distribution families. More specifically, we design a sample-optimal and computationally efficient algorithm for testing the equivalence of two unknown (potentially arbitrary) univariate distributions under the A_k-distance metric: Given sample access to distributions with density functions p, q, we want to distinguish between the cases that p = q and ||p − q||_{A_k} ≥ ε with probability at least 2/3. We show that for any k ≥ 2 and ε > 0, the optimal sample complexity of the A_k-closeness testing problem is Θ(max(k^{4/5}/ε^{6/5}, k^{1/2}/ε^2)). This is the first o(k)-sample algorithm for this problem, and yields new, simple L_1 closeness testers, in most cases with optimal sample complexity, for broad classes of structured distributions.
We study the problem of closeness testing (equivalence testing) between two unknown probability distributions. Given independent samples from a pair of distributions p, q, we want to determine whether the two distributions are the same versus significantly different. This is a classical problem in statistical hypothesis testing [NP33, LR05] that has received considerable attention from the TCS community in the framework of property testing [RS96, GGR98]: given sample access to distributions p, q and a parameter ε > 0, we want to distinguish between the cases that p and q are identical versus ε-far from each other in L_1 norm (statistical distance). Previous work on this problem focused on characterizing the sample size needed to test the identity of two arbitrary distributions of a given support size [BFR00, CDVV14]. It is now known that the optimal sample complexity (and running time) of this problem for distributions with support of size n is Θ(max(n^{2/3}/ε^{4/3}, n^{1/2}/ε^2)).
The aforementioned sample complexity characterizes worst-case instances, and one might hope that drastically better results can be obtained for most natural settings, in particular when the underlying distributions are known a priori to have some “nice structure”. In this work, we focus on the problem of testing closeness for structured distributions. Let C be a family of univariate distributions. The problem of closeness testing for C is the following: Given sample access to two unknown distributions p, q ∈ C, we want to distinguish between the case that p = q versus ||p − q||_1 ≥ ε. Note that the sample complexity of this testing problem depends on the underlying class C, and we are interested in obtaining efficient algorithms that are sample-optimal for C.
We give a general algorithm that can be used for closeness testing of a wide range of structured distribution families. More specifically, we give a sample-optimal and computationally efficient algorithm for testing the identity of two unknown (potentially arbitrary) distributions under a different metric between distributions, the so-called A_k-distance (see Section 2 for a formal definition). Here, k is a positive integer that intuitively captures the number of “crossings” between the probability density functions p and q.
Our main result (see Theorem 1) says the following: For any k ≥ 2, ε > 0, and sample access to arbitrary univariate distributions p, q, there exists a closeness testing algorithm under the A_k-distance using O(max(k^{4/5}/ε^{6/5}, k^{1/2}/ε^2)) samples. Moreover, this bound is information-theoretically optimal. We remark that our A_k-testing algorithm applies to any pair of univariate distributions (over both continuous and discrete domains). The main idea in using this general algorithm for testing closeness of structured distributions in L_1 distance is this: if the underlying distributions belong to a structured distribution family C, we can use the A_k-distance as a proxy for the L_1 distance (for an appropriate value of the parameter k), and thus obtain an L_1 closeness tester for C.
We note that the A_k-distance between distributions has recently been used to obtain sample-optimal efficient algorithms for learning structured distributions [CDSS14, ADLS15], and for testing the identity of a structured distribution against an explicitly known distribution [DKN15] (e.g., uniformity testing). In both these settings, the sample complexity of the corresponding problem (learning/identity testing) with respect to the A_k-distance is identified with the sample complexity of the problem under the L_1 distance for distributions of support size k. More specifically, the sample complexity of learning an unknown univariate distribution (over a continuous or discrete domain) up to A_k-distance ε is Θ(k/ε^2) [CDSS14] (independent of the domain size), which is exactly the sample complexity of learning a discrete distribution with support size k up to L_1 error ε. Similarly, the sample complexity of uniformity testing of a univariate distribution (over a continuous or discrete domain) up to A_k-distance ε is Θ(k^{1/2}/ε^2) [DKN15] (again, independent of the domain size), which is identical to the sample complexity of uniformity testing of a discrete distribution with support size k up to L_1 error ε [Pan08].
Rather surprisingly, this analogy is provably false for the closeness testing problem: we prove that the sample complexity of the A_k-closeness testing problem is Θ(max(k^{4/5}/ε^{6/5}, k^{1/2}/ε^2)), while L_1 closeness testing between distributions of support size k can be achieved with O(max(k^{2/3}/ε^{4/3}, k^{1/2}/ε^2)) samples [CDVV14]. More specifically, our upper bound for the A_k-closeness testing problem applies to all univariate probability distributions (both continuous and discrete). Our matching information-theoretic lower bound holds for continuous distributions, or discrete distributions of support size sufficiently large as a function of k and ε, which is the most interesting regime for our applications.
1.1 Related and Prior Work
In this subsection we review the related literature and compare our results with previous work.
Distribution Property Testing Testing properties of distributions [BFR00, BFR13] has developed into a mature research area within theoretical computer science. The paradigmatic problem in this field is the following: given sample access to one or more unknown probability distributions, determine whether they satisfy some global property or are “far” from satisfying the property. The goal is to obtain an algorithm for this task that is both statistically and computationally efficient, i.e., an algorithm with (information–theoretically) optimal sample size and polynomial runtime. See [GR00, BFR00, BFF01, Bat01, BDKR02, BKR04, Pan08, Val11, VV11, DDS13, ADJ11, LRR11, ILR12, CDVV14, VV14, DKN15] for a sample of works, and [Rub12] for a survey.
Shape Restricted Estimation Statistical estimation under shape restrictions (i.e., inference about a probability distribution under the constraint that its probability density function satisfies certain qualitative properties) is a classical topic in statistics [BBBB72]. Various structural restrictions have been studied in the literature, starting from monotonicity, unimodality, convexity, and concavity [Gre56, Bru58, Rao69, Weg70, HP76, Gro85, Bir87a, Bir87b, Fou97, CT04, JW09], and more recently focusing on structural restrictions such as log-concavity and k-monotonicity [BW07, DR09, BRW09, GW09, BW10, KM10, Wal09, DW13, CS13, KS14, BD14, HW15]. The reader is referred to [GJ14] for a recent book on the topic.
Comparison with Prior Work Chan, Diakonikolas, Servedio, and Sun [CDSS14] proposed a general approach to learn univariate probability distributions whose densities are well approximated by piecewise polynomials. They designed an efficient agnostic learning algorithm for piecewise polynomial distributions, and as a corollary obtained efficient learners for various families of structured distributions. The approach of [CDSS14] uses the A_k distance metric between distributions, but is otherwise orthogonal to ours. Batu et al. [BKR04] gave algorithms for closeness testing between two monotone distributions with sample complexity polylogarithmic in the domain size n. Subsequently, Daskalakis et al. [DDS13] improved and generalized this result to k-modal distributions, obtaining a closeness tester whose sample complexity also scales logarithmically with n. We remark that the approach of [DDS13] inherently yields an algorithm whose sample complexity grows with the domain size, which is sub-optimal.
The main ideas underlying this work are very different from those of [DDS13] and [DKN15]. The approach of [DDS13] involves constructing an adaptive interval decomposition of the domain, followed by an application of a (known) L_1 closeness tester to the “reduced” distributions over those intervals. This approach incurs an extraneous, domain-size-dependent term in the sample complexity, which is needed to construct the appropriate decomposition. The approach of [DKN15] considers several oblivious interval decompositions of the domain (i.e., without drawing any samples) and applies a “reduced” identity tester for each such decomposition. This idea yields sample-optimal bounds for identity testing against a known distribution. However, it crucially exploits the knowledge of the explicit distribution, and unfortunately fails in the setting where both distributions are unknown. We elaborate on these points in Section 2.3.
2 Our Results and Techniques
2.1 Basic Definitions
We will use p and q to denote the probability density functions (or probability mass functions) of our distributions. If p is discrete over support [n] := {1, ..., n}, we denote by p_i the probability of element i under p. For two discrete distributions p, q, their L_1 and L_2 distances are ||p − q||_1 = Σ_{i=1}^n |p_i − q_i| and ||p − q||_2 = (Σ_{i=1}^n (p_i − q_i)^2)^{1/2}. For density functions p, q over an interval I, we have ||p − q||_1 = ∫_I |p(x) − q(x)| dx.
Fix a partition of the domain I into disjoint intervals I_1, ..., I_ℓ. For such a partition P = (I_1, ..., I_ℓ), the reduced distribution p_P corresponding to p and P is the discrete distribution over [ℓ] that assigns the i-th “point” the mass that p assigns to the interval I_i; i.e., for i ∈ [ℓ], p_P(i) = p(I_i). Let J_k be the collection of all partitions of the domain I into k intervals. For k ≥ 1 and density functions p, q, we define the A_k-distance between p and q by ||p − q||_{A_k} := max_{P ∈ J_k} ||p_P − q_P||_1 = max_{(I_1, ..., I_k) ∈ J_k} Σ_{i=1}^k |p(I_i) − q(I_i)|.
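To make the definition concrete, the following sketch computes the A_k distance exactly for two discrete distributions by dynamic programming over interval partitions. This is purely illustrative (the testers in this paper never need to compute the distance, and the quadratic-time recurrence is not meant to be efficient); the function name is ours.

```python
import numpy as np

def ak_distance(p, q, k):
    """A_k distance of discrete distributions p, q on {0, ..., n-1}:
    the maximum, over partitions of the domain into at most k consecutive
    intervals, of the sum of |p(I) - q(I)| over the intervals I."""
    d = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    n = len(d)
    S = np.concatenate([[0.0], np.cumsum(d)])  # S[i] = sum of d over [0, i)
    # best[j][i]: optimum for the first i points using at most j intervals
    # (-1.0 marks infeasible states; feasible values are always >= 0)
    best = [[-1.0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        best[j][0] = 0.0
        for i in range(1, n + 1):
            best[j][i] = max(best[j - 1][t] + abs(S[i] - S[t])
                             for t in range(i) if best[j - 1][t] >= 0.0)
    return best[k][n]
```

For k = n this recovers the full L_1 distance, while for k = 1 the single interval is the whole domain and the distance degenerates to |p(I) − q(I)| = 0.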
2.2 Our Results
Our main result is an optimal algorithm and a matching information-theoretic lower bound for the problem of testing the equivalence between two unknown univariate distributions under the A_k distance metric:
Theorem 1 (Main).
Given ε > 0, an integer k ≥ 2, and sample access to two distributions with probability density functions p, q, there is a computationally efficient algorithm which uses O(max(k^{4/5}/ε^{6/5}, k^{1/2}/ε^2)) samples from p and q, and with probability at least 2/3 distinguishes whether p = q versus ||p − q||_{A_k} ≥ ε. Additionally, Ω(max(k^{4/5}/ε^{6/5}, k^{1/2}/ε^2)) samples are information-theoretically necessary for this task.
Note that Theorem 1 applies to arbitrary univariate distributions (over both continuous and discrete domains). In particular, the sample complexity of the algorithm does not depend on the support size of the underlying distributions. We believe that the notion of testing under the A_k distance is very natural, and well suited for (arbitrary) continuous distributions, for which L_1 closeness testing is (provably) impossible with finitely many samples.
As a corollary of Theorem 1, we obtain sample-optimal algorithms for the L_1 closeness testing of various structured distribution families in a unified way. The basic idea is to use the A_k distance as a “proxy” for the L_1 distance, for an appropriate value of k that depends on the family C and the accuracy ε. We have the following simple fact:
For a univariate distribution family C and ε > 0, let k = k(C, ε) be the smallest integer such that for any f_1, f_2 ∈ C it holds that ||f_1 − f_2||_1 ≤ ||f_1 − f_2||_{A_k} + ε/2. Then there exists an L_1 closeness testing algorithm for C using O(max(k^{4/5}/ε^{6/5}, k^{1/2}/ε^2)) samples.
Indeed, given sample access to p, q ∈ C, we apply the A_k-closeness testing algorithm of Theorem 1 for the value of k in the statement of the fact, and error ε/2. If p = q, the algorithm will output “YES” with probability at least 2/3. If ||p − q||_1 ≥ ε, then by the condition of Fact 2 we have that ||p − q||_{A_k} ≥ ||p − q||_1 − ε/2 ≥ ε/2, and the algorithm will output “NO” with probability at least 2/3.
We remark that the value of k in Fact 2 is a natural complexity measure for the difference between two probability density functions in the class C. It follows from the definition of the A_k distance that this value corresponds to the number of “essential” crossings between f_1 and f_2, i.e., the number of crossings between the two functions that significantly affect their L_1 distance. Intuitively, the number of essential crossings, as opposed to the domain size, is, in some sense, the “right” parameter to characterize the sample complexity of L_1 closeness testing for C.
The upper bound implied by the above fact is information-theoretically optimal for a wide range of structured distribution classes C. In particular, our bounds apply to all the structured distribution families considered in [CDSS14, DKN15, ADLS15], including (arbitrary mixtures of) k-flat (i.e., piecewise constant with k pieces), k-piecewise degree-d polynomial, k-monotone, monotone hazard rate, and log-concave distributions. For k-flat distributions we obtain an L_1 closeness testing algorithm that uses O(max(k^{4/5}/ε^{6/5}, k^{1/2}/ε^2)) samples, which is the first o(k)-sample algorithm for the problem. For log-concave distributions, we obtain a sample complexity of O(1/ε^{9/4}), matching the information-theoretic lower bound even for the case that one of the distributions is explicitly given [DKN15]. Table 1 summarizes our upper bounds for a selection of natural and well-studied distribution families. These results are obtained from Theorem 1 and Fact 2, via the appropriate structural approximation results [CDSS13, CDSS14].
[Table 1: columns are “Distribution Family”, “Our upper bound”, and “Previous work”; of the table body, only a “k-mixture of log-concave” row label survived extraction.]
We would like to stress that our algorithm and its analysis are very different from previous results in the property testing literature. We elaborate on this point in the following subsection.
2.3 Our Techniques
In this subsection, we provide a high-level overview of our techniques in tandem with a comparison to prior work.
Our upper bound is achieved by an explicit, sample-optimal, near-linear-time algorithm. A good starting point for considering this problem is the A_k identity testing algorithm of [DKN15], which deals with the case where q is an explicitly known distribution. The basic idea of the testing algorithm in this case [DKN15] is to partition the domain into intervals in several different ways, and run a known L_2 tester on the reduced distributions (with respect to the intervals in the partition) as a black box. At a high level, these interval partitions can be constructed by exploiting our knowledge of q, in order to divide our domain into several equal-mass intervals under q. It can be shown that if p and q have large A_k distance from each other, one of these partitions will be able to detect the difference.
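As an illustration of the known-q step, the following sketch forms an equal-mass interval decomposition from an explicit discrete distribution q using its CDF. The helper name and the tie-handling details are ours, not from [DKN15].

```python
import numpy as np

def equal_mass_partition(q, num_intervals):
    """Partition {0, ..., n-1} into consecutive intervals, each holding
    roughly 1/num_intervals of the mass of the known distribution q."""
    cdf = np.cumsum(np.asarray(q, dtype=float))
    targets = np.arange(1, num_intervals + 1) / num_intervals
    # first index whose CDF value reaches each target (with a small slack
    # for floating-point rounding), as an exclusive right endpoint
    ends = np.searchsorted(cdf, targets - 1e-12) + 1
    ends[-1] = len(cdf)
    starts = np.concatenate([[0], ends[:-1]])
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e > s]
```

For instance, on the uniform distribution over 8 points with 4 intervals this returns [(0, 2), (2, 4), (4, 6), (6, 8)], each interval holding mass 1/4.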
Generalizing this algorithm to the case where q is unknown turns out to be challenging, because there seems to be no way to find the appropriate interval partitions with o(k) samples. If we allowed ourselves to take Ω(k) samples from q, we would be able to approximate an appropriate interval partition, and make the aforementioned approach go through. Alas, this would not lead to an o(k)-sample algorithm. If we can only draw m = o(k) samples from our distributions, the best that we could hope to do would be to use our samples in order to partition the domain into O(m) interval regions. This, of course, is not going to be sufficient to allow an analysis along the lines of the above approach to work. In particular, if we partition our domain deterministically into o(k) intervals, it may well be the case that the reduced distributions over those intervals are identical, despite the fact that the original distributions have large A_k distance. In essence, the differences between p and q may well cancel each other out on the chosen intervals.
However, it is important to note that our interval boundaries are not deterministic. This suggests that unless we get unlucky, the discrepancy between p and q will not actually cancel out in our partition. As a slight modification of this idea, instead of partitioning the domain into intervals (which we expect to contain only O(1) samples each) and comparing the number of samples from p versus q in each, we sort our samples and test how many of them came from the same distribution as their neighbors (with respect to the natural ordering on the real line).
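The statistic just described is easy to state in code. The sketch below (names are ours) pools the two samples, sorts them while remembering each sample’s source, and returns the number of adjacent same-source pairs minus the number of adjacent different-source pairs; when the two distributions are equal, the two counts should be statistically indistinguishable.

```python
import numpy as np

def adjacent_pair_statistic(samples_p, samples_q):
    """Z = (# adjacent pairs from the same source) -
           (# adjacent pairs from different sources),
    computed over the pooled, sorted samples."""
    labels = np.array([0] * len(samples_p) + [1] * len(samples_q))
    values = np.concatenate([np.asarray(samples_p, dtype=float),
                             np.asarray(samples_q, dtype=float)])
    order = np.argsort(values, kind="stable")
    sorted_labels = labels[order]
    same = sorted_labels[1:] == sorted_labels[:-1]
    return int(np.sum(same)) - int(np.sum(~same))
```

Perfectly interleaved samples give a strongly negative value, while samples that cluster by source give a positive one.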
We intuitively expect that, if p = q, the number of pairs of ordered samples drawn from the same distribution versus a different one will be the same. Indeed, this can be formalized, and the completeness of this tester is simple to establish. The soundness analysis, however, is somewhat involved. We need to show that the expected value of the statistic that we compute is larger than its standard deviation. While the variance is easy to bound from above, bounding the expectation is quite challenging. To do so, we define a function, f, that encodes how likely it is that the samples near a given point come from one distribution or the other. It turns out that f satisfies a relatively nice differential equation, and relates in a clean way to the expectation of our statistic. From this, we can show that any discrepancy between p and q taking place on a scale too short to be detected by the above partitioning approach will yield a notable contribution to our expectation.
The analysis of our lower bound begins by considering a natural class of testers, namely those that take some number of samples from p and q, sort the samples (while keeping track of which distribution they came from), and return an output that depends only on the ordering of these samples. For such testers we exhibit explicit families of pairs of distributions that are hard to distinguish from being identical. There is a particular pattern that appears many times in these examples, where there is a small interval on which p has an appropriate amount of probability mass, followed by an interval of q, followed by another interval of p. When the parameters are balanced correctly, it can be shown that when at most two samples are drawn from this subinterval, the distribution on their orderings is indistinguishable from the case where p = q. By constructing distributions with many copies of this pattern, we essentially show that a tester of this form will not be able to be confident that p ≠ q, unless there are many of these small intervals from which it draws three or more samples. On the other hand, a simple argument shows that this is unlikely to be the case.
The above lower bound provides explicit distributions that are hard to distinguish from being identical by any tester in this limited class. To prove a lower bound against general testers, we proceed via a reduction: we show that an order-based tester can be derived from any general tester. It should be noted that this makes our lower bound in a sense non-constructive, as we do not know of any explicit families of distributions that are hard to distinguish from being identical for general testers. In order to perform this reduction, we show that for a general tester we can find some large subset S of its domain such that if all samples drawn from p and q by the tester happen to lie in S, then the output of the tester will depend only on the ordering of the samples. This essentially amounts to a standard result from Ramsey theory. Then, by taking any other problem, we can embed it into our new sample space by choosing new p and q that are the same up to an order-preserving rearrangement of the domain (which will also preserve A_k distance), ensuring that they are supported only on S.
3 Algorithm for Closeness Testing
In this section we provide the sample-optimal closeness tester under the A_k distance.
3.1 An O(k^{4/5}/ε^{6/5})-sample tester
In this subsection we give a tester with sample complexity O(k^{4/5}/ε^{6/5}) that applies when ε is not too small as a function of k, i.e., in the regime where the first term of the max in Theorem 1 dominates. For simplicity, we focus on the case that we take samples from two unknown continuous distributions with probability density functions p, q. Our results are easily seen to extend to discrete probability distributions.
The algorithm Simple-Test-Identity-A_k, on input two samples each of size Poi(m), with m = O(k^{4/5}/ε^{6/5}), drawn from two distributions with densities p, q, an integer k, and ε > 0, correctly distinguishes the case that p = q from the case ||p − q||_{A_k} ≥ ε, with probability at least 2/3.
First, it is straightforward to verify the claimed sample complexity, since the algorithm only draws samples in Step 1. To simplify the analysis we make essential use of the following simple claim:
We can assume without loss of generality that the pdf’s p and q are continuous functions bounded from above by 2.
We start by showing that we can assume that p and q are bounded by 2. Let p, q be arbitrary pdf’s. We consider the cumulative distribution function F of the mixture (p + q)/2. Let X ~ p and Y ~ q be random variables. Since F is non-decreasing, replacing X and Y by F(X) and F(Y) does not affect the algorithm (as the ordering on the samples remains the same). We claim that, after making this replacement, F(X) and F(Y) are continuous distributions over [0, 1] with probability density functions bounded by 2. In fact, we will show that the sum of their probability density functions is exactly 2. This is because for any 0 ≤ a ≤ b ≤ 1,

Pr[F(X) ∈ [a, b]] + Pr[F(Y) ∈ [a, b]] = Pr[X ∈ F^{-1}([a, b])] + Pr[Y ∈ F^{-1}([a, b])] = 2(b − a),

where the second equality is by the definition of a CDF. Thus, we can assume that p and q are bounded from above by 2.
To show that we can assume continuity, note that p and q can be approximated by continuous density functions p̃ and q̃ so that the L_1 errors ||p − p̃||_1 and ||q − q̃||_1 are each arbitrarily small. If our algorithm succeeds with the continuous densities p̃ and q̃, it must also succeed for p and q. Indeed, if the L_1 distance between p and p̃ and between q and q̃ is sufficiently small compared to 1/m, a set of samples taken from p or q is statistically indistinguishable from a set of samples taken from p̃ or q̃. This proves that it is no loss of generality to assume that p and q are continuous. ∎
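The flattening step in this proof is easy to visualize numerically. The sketch below (with two exponential densities chosen purely for illustration; none of the names come from the paper) maps every sample x to F(x), where F is the CDF of the mixture (p + q)/2: the transformed samples land in [0, 1], and since F is non-decreasing, any order-based statistic is unaffected.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choice: p = Exp(rate 1), q = Exp(rate 1/2).
xs_p = rng.exponential(scale=1.0, size=1000)
xs_q = rng.exponential(scale=2.0, size=1000)

def mixture_cdf(x):
    """CDF F of the mixture (p + q)/2 for the two densities above."""
    return 0.5 * (1.0 - np.exp(-x)) + 0.5 * (1.0 - np.exp(-x / 2.0))

# The transformed samples lie in [0, 1]; their densities there sum to 2.
fp, fq = mixture_cdf(xs_p), mixture_cdf(xs_q)

# Order preservation: applying F to sorted pooled samples keeps them sorted.
pooled_sorted = np.sort(np.concatenate([xs_p, xs_q]))
assert np.all(np.diff(mixture_cdf(pooled_sorted)) >= 0)
```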
Note that the algorithm makes use of the well-known “Poissonization” approach. Namely, instead of drawing m samples from p and m samples from q, we draw Poi(m) samples from p and Poi(m) samples from q. The crucial properties of the Poisson distribution are that it is sharply concentrated around its mean and that it makes the number of times different elements occur in the sample independent.
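A minimal sketch of Poissonized sampling (helper name ours): instead of a fixed sample size m, the number of draws is itself Poisson(m), which is what makes the counts of samples landing in disjoint regions independent.

```python
import numpy as np

rng = np.random.default_rng(1)

def poissonized_sample(draw_one, m):
    """Draw Poisson(m)-many i.i.d. samples using the sampler draw_one."""
    n = rng.poisson(m)
    return np.array([draw_one() for _ in range(n)])

# e.g. a Poissonized sample of expected size 100 from Uniform[0, 1):
xs = poissonized_sample(lambda: rng.random(), m=100)
```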
We now establish completeness. Note that our algorithm draws Poi(2m) samples in total from p and q. If p = q, then our process equivalently selects Poi(2m) values from p and then randomly and independently, with equal probability, decides whether each sample came from p or from q. Making these decisions one at a time in increasing order of the points, we note that each adjacent pair of elements randomly and independently contributes either a +1 or a −1 to Z. Therefore, the distribution of Z is exactly that of a sum of independent ±1 random variables, one for each adjacent pair. Therefore, Z has mean 0 and variance O(m). By Chebyshev’s inequality, it follows that |Z| = O(√m) with large constant probability. This proves completeness.
We now proceed to prove the soundness of our algorithm. Assuming that ||p − q||_{A_k} ≥ ε, we want to show that the value of Z is below the threshold of our test with probability at most 1/3. To prove this statement, we will again use Chebyshev’s inequality. In this case it suffices to show that E[Z] is a sufficiently large multiple of both √m and the standard deviation of Z for the inequality to be applicable. We begin with an important definition.
For x in our domain, define f(x) := E[r(x)], where r(x) is +1 if the closest sample point to the left of x was drawn from p, −1 if it was drawn from q, and 0 if there is no sample to the left of x. The importance of this function is demonstrated by the following lemma.
We have that E[Z] = m ∫ (p(x) − q(x)) f(x) dx.
Given an interval I, we let Z_I be the contribution to Z coming from pairs of consecutive points the larger of which is drawn from I. We wish to approximate the expectation of Z_I. We let λ_I = m ∫_I (p(x) + q(x)) dx be the expected total number of points drawn from I. We note that the contribution coming from cases where more than one point is drawn from I is O(λ_I^2). We next consider the contribution under the condition that only one sample is drawn from I. For this, we let E_p and E_q be the events that the largest element of the sample preceding I comes from p or q, respectively. We have that the expected contribution to Z_I coming from events where exactly one element of the sample is drawn from I is:

m ∫_I p(x) dx · (Pr[E_p] − Pr[E_q]) + m ∫_I q(x) dx · (Pr[E_q] − Pr[E_p]) + O(λ_I^2).
Letting x_0 be the left endpoint of I, this is m (∫_I (p(x) − q(x)) dx) f(x_0) + O(λ_I^2).
Letting P = {I_1, ..., I_N} be a partition of our domain into intervals, we find that E[Z] = Σ_i m (∫_{I_i} (p(x) − q(x)) dx) f(x_i) + O(Σ_i λ_{I_i}^2), where x_i denotes the left endpoint of I_i.
As the partition becomes iteratively more refined, the error terms vanish and these sums approach Riemann sums for the integral of m (p(x) − q(x)) f(x).
Therefore, taking a limit over partitions P, we have that E[Z] = m ∫ (p(x) − q(x)) f(x) dx. ∎
We will also make essential use of the following technical lemma:
The function f is differentiable with derivative f′(x) = m (p(x) − q(x)) − m (p(x) + q(x)) f(x).
Consider the difference between f(x + h) and f(x) for some small h > 0. We note that f(x) = E[r(x)], where r(x) is +1 if the sample point immediately preceding x came from p, −1 if the sample came from q, and 0 if no sample came before x. Note that r(x + h) equals +1 if the last sample in (x, x + h] came from p, equals −1 if it came from q, equals r(x) if no sample lies in (x, x + h], and the only remaining case is that more than one sample lies in (x, x + h].
Since p and q are continuous at x, these four events happen with probabilities m p(x) h + o(h), m q(x) h + o(h), 1 − m (p(x) + q(x)) h + o(h), and o(h), respectively. Therefore, taking an expectation we find that f(x + h) = f(x) + m h (p(x) − q(x)) − m h (p(x) + q(x)) f(x) + o(h). This, and a similar relation relating f(x) to f(x − h), proves that f is differentiable with the desired derivative. ∎
To analyze the desired expectation, E[Z], we consider the quantity ∫ f(x) f′(x) dx = (f(+∞)^2 − f(−∞)^2)/2 = O(1). Substituting the expression for f′ from Lemma 7 above gives

m ∫ (p(x) − q(x)) f(x) dx − m ∫ (p(x) + q(x)) f(x)^2 dx = O(1).
Combining this with Lemma 6, we get

E[Z] = m ∫ (p(x) + q(x)) f(x)^2 dx + O(1). (1)
The second term in (1) above is O(1), so we focus our attention on bounding the first term from below. To do this, we consider intervals I over which the discrepancy ∫_I |p(x) − q(x)| dx is “large” and show that they must produce some noticeable contribution to the first term. Fix such an interval I. We want to show that f is large somewhere in I. Intuitively, we attempt to prove that at one of the endpoints of the interval the value of f is big. Since f does not vary too rapidly, f will then be large on some large fraction of I. Formally, we have the following lemma:
For δ, λ > 0, let I = [a, b] be an interval with m ∫_I (p(x) − q(x)) dx ≥ δ and m ∫_I (p(x) + q(x)) dx ≤ λ. Then, there exists an x ∈ I such that |f(x)| ≥ δ/(λ + 2).
Suppose for the sake of contradiction that |f(x)| < δ/(λ + 2) for all x ∈ I. Then, we have that

2δ/(λ + 2) > f(b) − f(a) = ∫_I f′(x) dx = m ∫_I (p(x) − q(x)) dx − m ∫_I (p(x) + q(x)) f(x) dx ≥ δ − λ · δ/(λ + 2) = 2δ/(λ + 2),
which yields the desired contradiction. ∎
We are now able to show that the contribution to E[Z] coming from such an interval is large.
Let I be an interval satisfying the hypotheses of Lemma 8, and set c := δ/(λ + 2). Then the contribution to m ∫ (p(x) + q(x)) f(x)^2 dx coming from (a small neighborhood of) I is Ω(c^3).
By Lemma 8, |f| ≥ c at some point x_0 of the interval I. Without loss of generality, we assume that f(x_0) ≥ c. Let J = [x_0, y] be the interval so that m ∫_J (p(x) + q(x)) dx = c/4. Note that c ≤ 1 (since by assumption δ ≤ λ and thus δ/(λ + 2) < 1). Furthermore, note that since with probability at least 3/4 no samples at all lie in J, we have that for all x in J it holds f(x) ≥ c/2, so m ∫_J (p + q) f^2 ≥ (c/4)(c/2)^2 = c^3/16. Therefore, the claimed bound follows. ∎
Since ||p − q||_{A_k} ≥ ε, there is a partition P of the domain into k intervals so that Σ_{I ∈ P} |∫_I (p(x) − q(x)) dx| ≥ ε. By subdividing intervals further if necessary, we can guarantee that P has at most 2k intervals, and for each subinterval I it holds λ_I := m ∫_I (p(x) + q(x)) dx = O(m/k). For each such interval I, let δ_I := m |∫_I (p(x) − q(x)) dx| and c_I := δ_I/(λ_I + 2). Note that Σ_I δ_I ≥ mε.
By (1) we have that

E[Z] ≥ Σ_I Ω(c_I^3) − O(1) = Σ_I Ω(δ_I^3/(λ_I + 2)^3) − O(1) ≥ Ω((Σ_I δ_I)^3 / (Σ_I (λ_I + 2)^{3/2})^2) − O(1) ≥ Ω((mε)^3/k^2) − O(1),

which is a large multiple of √m provided that m is a sufficiently large multiple of k^{4/5}/ε^{6/5}.
We note that the second-to-last inequality above follows by Hölder’s inequality. It remains to bound from above the variance of Z.
We have that Var[Z] = O(m).
We divide the domain into m intervals I_1, ..., I_m, each of total mass 2/m under the sum-distribution p + q. Consider the random variable Z_i denoting the contribution to Z coming from pairs of adjacent samples in which the right sample is drawn from I_i. Clearly, Z = Σ_i Z_i and

Var[Z] = Σ_i Var[Z_i] + Σ_{i ≠ j} Cov(Z_i, Z_j).
To bound the first sum, note that the number of pairs counted in an interval I_i is no more than the number of samples drawn from I_i, and the variance of Z_i is at most the expectation of the square of the number of samples from I_i. Since the number of samples from I_i is a Poisson random variable with parameter 2, its second moment is O(1). This shows that Σ_i Var[Z_i] = O(m).
To bound the sum of covariances, consider Z_i and Z_j conditioned on the samples drawn from intervals other than I_i and I_j. Note that if any sample is drawn from an intermediate interval, Z_i and Z_j are uncorrelated, and otherwise their covariance is at most O(1). Since the probability that no sample is drawn from any intervening interval decreases exponentially with the separation |i − j|, it follows that Σ_{i ≠ j} Cov(Z_i, Z_j) = O(m). This completes the proof. ∎
An application of Chebyshev’s inequality completes the analysis of the soundness and the proof of Proposition 3. ∎
3.2 The General Tester
In this section, we present a tester whose sample complexity is optimal (up to constant factors) for all values of k and ε, thereby establishing the upper bound part of Theorem 1. Our general tester (Algorithm Test-Identity-A_k) builds on the tester presented in the previous subsection (Algorithm Simple-Test-Identity-A_k). It is not difficult to see that the latter algorithm can fail once ε becomes sufficiently small, if the discrepancy between p and q is concentrated on intervals of relatively large mass. In this scenario, the tester Simple-Test-Identity-A_k will not take sufficient advantage of these intervals. To obtain our enhanced tester Test-Identity-A_k, we will need to combine Simple-Test-Identity-A_k with an alternative tester for this case. Note that we can easily bin the distributions p and q into intervals of total mass approximately 1/k by taking O(k) random samples. Once we do this, we can use an identity tester similar to that in our previous work [DKN15] to detect the discrepancy in these intervals. In particular, we show the following:
Let p, q be discrete distributions over [n] satisfying ||p||_2, ||q||_2 = O(1/√n). There exists a testing algorithm with the following properties: On input k, ε, and sample access to p and q, the algorithm draws O(√k/ε^2) samples from p and q, and with probability at least 2/3 distinguishes between the cases p = q and ||p − q||_{A_k} ≥ ε.
The above proposition says that the identity testing problem under the A_k distance can be solved with O(√k/ε^2) samples when both distributions p and q are promised to be “nearly” uniform (in the sense that their L_2 norm is at most a constant multiple of that of the uniform distribution). To prove Proposition 11 we follow a similar approach as in [DKN15]: Starting from the L_2 identity tester of [CDVV14], we consider several oblivious interval decompositions of the domain into intervals of approximately the same mass, and apply a “reduced” identity tester for each such decomposition. The details of the analysis establishing Proposition 11 are postponed to Appendix A.
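The oblivious decompositions used here can be sketched as follows: for each scale, merge consecutive domain elements into blocks of equal length and form the corresponding reduced distributions, to each pair of which a closeness tester would then be applied. The helper names are ours; the actual choice of scales and the tester are as in [DKN15, CDVV14].

```python
import numpy as np

def reduced_distribution(p, block_size):
    """Reduced distribution for the oblivious decomposition merging
    consecutive domain elements into blocks of length block_size
    (the tail is zero-padded to a full block)."""
    p = np.asarray(p, dtype=float)
    pad = (-len(p)) % block_size
    p = np.concatenate([p, np.zeros(pad)])
    return p.reshape(-1, block_size).sum(axis=1)

def all_scale_reductions(p):
    """Reduced distributions at every power-of-two scale."""
    out = {}
    size = 1
    while size <= len(p):
        out[size] = reduced_distribution(p, size)
        size *= 2
    return out
```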
We are now ready to present our general testing algorithm:
Our main result for this section is the following:
Algorithm Test-Identity-A_k draws O(max(k^{4/5}/ε^{6/5}, k^{1/2}/ε^2)) samples from p, q and with probability at least 2/3 returns “YES” if p = q and “NO” if ||p − q||_{A_k} ≥ ε.
First, it is easy to see that the sample complexity of the algorithm is O(max(k^{4/5}/ε^{6/5}, k^{1/2}/ε^2)). Recall that we can assume that p, q are continuous pdf’s bounded from above by 2.
We start by establishing completeness. If p = q, it is once again the case that E[Z] = 0 and Var[Z] = O(m), so by Chebyshev’s inequality, Step 4 will fail only with small constant probability. Next, when taking our samples in Step 5(a), note that the expected sample size is O(k) and that the expected squared L_2 norms of the reduced distributions p′ and q′ are O(1/k). Therefore, with large constant probability, p′ and q′ satisfy the hypothesis of Proposition 11. Hence, this holds for all iterations with large constant probability.
Conditioning on this event, since p = q, the tester in Step 5(c) will return “YES” with large constant probability on each iteration, and therefore on all iterations with large constant probability. By a union bound, it follows that if p = q, our algorithm returns “YES” with probability at least 2/3.
We now proceed to establish soundness. Suppose that ||p − q||_{A_k} ≥ ε. Then there exists a partition of the domain into k intervals I_1, ..., I_k such that Σ_i |∫_{I_i} (p(x) − q(x)) dx| ≥ ε. For an interval I, let δ(I) := |∫_I (p(x) − q(x)) dx|. We will call an interval I small if there is a subinterval J of I of mass O(1/m) under p + q that captures a constant fraction of the discrepancy δ(I). We will call I large otherwise. Note that Σ_{I small} δ(I) + Σ_{I large} δ(I) ≥ ε. Therefore, either Σ_{I small} δ(I) ≥ ε/2 or Σ_{I large} δ(I) ≥ ε/2. We analyze soundness separately in each of these cases.
Consider first the case that Σ_{I small} δ(I) ≥ ε/2. The analysis in this case is very similar to the soundness proof of Proposition 3, and we describe it for the sake of completeness.
By definition, for each small interval I, there exists a subinterval J of small mass capturing a constant fraction of δ(I). By Lemma 9, for such a J the contribution to m ∫ (p + q) f^2 is polynomial in m δ(I), and therefore, summing over the small intervals and applying Hölder’s inequality as in the proof of Proposition 3, we have that E[Z] is a large multiple of √m.
On the other hand, Lemma 10 gives that Var[Z] = O(m), so for m sufficiently large, Chebyshev’s inequality implies that with probability at least 2/3 the statistic Z exceeds the threshold of Step 4. That is, our algorithm outputs “NO” with probability at least 2/3.
Now consider the case that Σ_{I large} δ(I) ≥ ε/2. We claim that the second part of our tester will detect the discrepancy between p and q with high constant probability. Once again, we can say that with high constant probability the squared L_2 norms of the reduced distributions p′ and q′ are both O(1/k) and the size of the reduced domain is O(k). Thus, the conditions of Proposition 11 are satisfied on all iterations with high constant probability. To complete the proof, we will show that with constant probability we have ||p′ − q′||_{A_k} = Ω(ε). To do this, we construct an explicit partition of our reduced domain into at most O(k) intervals witnessing, with constant probability, a discrepancy of Ω(ε). This will imply that with constant probability, on at least one of our trials, ||p′ − q′||_{A_k} = Ω(ε).
More specifically, for each interval I we place interval boundaries at the smallest and largest sample points taken from I in Step 5(a) (ignoring I if fewer than two samples landed in it). Since we have selected at most O(k) points, this process defines a partition of the domain into at most O(k) intervals. We will show that the reduced distributions