Generalized Uniformity Testing
In this work, we revisit the problem of uniformity testing of discrete probability distributions. A fundamental problem in distribution testing, testing uniformity over a known domain has been addressed in a significant line of work, and is by now fully understood.
The complexity of deciding whether an unknown distribution is uniform over its unknown (and arbitrary) support, however, is much less clear. Yet, this task arises as soon as no prior knowledge on the domain is available, or whenever the samples originate from an unknown and unstructured universe. In this work, we introduce and study this generalized uniformity testing question, and establish nearly tight upper and lower bounds showing that – quite surprisingly – its sample complexity significantly differs from the known-domain case. Moreover, our algorithm is intrinsically adaptive, in contrast to the overwhelming majority of known distribution testing algorithms.
Property testing, as introduced in the seminal works of [RS96, GGR98], is the analysis and study of ultra-efficient and randomized decision algorithms, which must answer a promise problem yet cannot afford to query their whole input. A very successful and prolific area of theoretical computer science, property testing also gave rise to several subfields, notably that of distribution testing, where the input consists of independent samples from a probability distribution, and one must now verify if the underlying unknown distribution satisfies a given property of interest (cf. [Ron08, Ron09, Rub12, Can15, Gol17] for surveys on property and distribution testing).
One of the earliest and most studied questions in distribution testing is that of uniformity testing, where, given independent samples from an arbitrary probability distribution $p$ on a discrete domain $\Omega$, one has to decide whether (i) $p$ is uniform on $\Omega$, or (ii) $p$ is “far” (i.e., at total variation distance at least $\varepsilon$) from the uniform distribution on $\Omega$. Arguably the most natural distribution testing problem, testing uniformity is also one of the most fundamental; algorithms for uniformity testing end up being crucial building blocks in many other distribution testing algorithms [BFF01, DK16, Gol16]. Fully understanding the sample complexity of the problem, as well as the possible trade-offs it entails, thus prompted a significant line of research.
Starting with the work of Goldreich and Ron [GR00] (which considered it in the context of testing expansion of graphs), uniformity testing was studied and analyzed in a series of works [BFF01, Pan08, VV17, DKN15, ADK15, DGPP16], which culminated with the tight sample complexity bound of $\Theta(\sqrt{n}/\varepsilon^2)$ for testing uniformity on a discrete domain of size $n$. (Moreover, the corresponding algorithms are also efficient, running in time linear in the number of samples they take.)
Given this state of affairs, testing uniformity of discrete distributions appears to be fully settled; however, as often is the case, the devil is in the detail. Specifically, all the aforementioned results address the case where the domain is explicitly known, and the task is to find out whether $p$ is the uniform distribution on this domain. Yet, in many cases, samples (or data points) are drawn from the underlying distribution without such prior knowledge, and the relevant question is whether $p$ is uniform on its support – which is unknown, of arbitrary size, and can be completely unstructured. (In particular, one cannot assume without loss of generality that the support is the set of consecutive integers $\{1,\dots,n\}$.)
In this work, we focus on this latter question: in particular, we do not assume any a priori knowledge on the domain $\Omega$, besides its being discrete. Our goal is then the following: given independent samples from an arbitrary probability distribution $p$ on $\Omega$, we must distinguish between the cases (i) $p$ is uniform on some subset of $\Omega$, and (ii) $p$ is far from every such uniform distribution. As we shall see, this is not merely a technicality: this new task is provably harder than the case where $\Omega$ is known. Indeed, this difference intuitively stems from the uncertainty on where the support of $p$ lies, which prevents any reduction to the simple, known-domain case.
Furthermore, one crucial feature of the problem is that it intrinsically calls for adaptive algorithms. This is in sharp contrast to the overwhelming majority of distribution testing algorithms, which (essentially) draw a prespecified number of samples all at once, before processing them and outputting a verdict. This is because, in our case, an algorithm is provided only with the proximity parameter $\varepsilon$, and has no upper bound on the domain size nor on any other parameter of the problem. Therefore, it must keep on taking samples until it has “extracted” enough information – and is confident enough that it can stop and output an answer. (In this sense, our setting is closer in spirit to the line of work pioneered in Statistics by Ingster [Ing00, FL06] than to the “instance-optimal” setting of Valiant and Valiant [VV17, BCG17], as in the latter the algorithm is still provided with a massive parameter in the form of the full description of a reference probability distribution.)
1.1 Our Results
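Given a discrete, possibly unbounded domain $\Omega$, we let $\mathcal{C}_U$ denote the set of all probability distributions that are supported and uniform on some subset of $\Omega$, that is
$$\mathcal{C}_U = \{\, u_S : S \subseteq \Omega,\ 0 < |S| < \infty \,\},$$
where, for a given set $S$, $u_S$ denotes the uniform distribution on $S$. In what follows, we write $d_{\mathrm{TV}}(p,q)$ for the total variation distance between two distributions $p, q$ on $\Omega$.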
There exists an algorithm which, given sample access to an arbitrary distribution over some unknown discrete domain , as well as parameter , satisfies the following.
If , then the algorithm outputs accept with probability at least ; while
if , then the algorithm outputs reject with probability at least .
Moreover, the algorithm takes samples in expectation, and is efficient (in the number of samples taken).
We note that if indeed is uniform, i.e., for some , then, for constant , the above complexity becomes – to be compared to the sample complexity of testing whether for a fixed . Our next result shows that this is not an artifact of our algorithm; namely, such a dependence is necessary, and testing the class of uniform distributions is strictly harder than testing any specific uniform distribution.
Fix any (non-uniform) distribution over , and let be its distance to . Then, given sample access to a distribution on , distinguishing with high constant probability between (i) is equal to up to a permutation of the domain and (ii) , requires samples. In particular, an algorithm that tests membership in with high probability and for any proximity parameter requires this many samples.
It is worth discussing the above statement in detail, as its interpretation can be slightly confusing. Specifically, it does not state that testing identity to any fixed, known distribution requires this many samples (indeed, by the results of [VV17, BCG17], such a statement would be false). What is stated is essentially that, even given the full description of the reference distribution, it is hard to distinguish between it and a uniform distribution, after relabeling of the elements of the domain. Since the class of uniform distributions is invariant under such permutations, the last part of the theorem follows.
1.2 Overview and Techniques
The key intuition and driving idea of both our upper and lower bounds is the observation that, by very definition of the problem, there is no structure nor ordering of the domain to leverage. That is, the class of uniform distributions over $\Omega$ is a “symmetric property” (broadly speaking, the actual labeling of the elements of the domain is irrelevant), and the domain itself can and should be thought of as a set of arbitrary points with no algebraic structure. Given this state of affairs, an algorithm should not be able to do much more than counting collisions, that is, the number of pairs, or triples, or more generally $k$-tuples of samples which happen to “hit” the same domain element.
Equivalently, these collision counts correspond to the moments (that is, $\ell_k$-norms) of the distribution; following a line of works on symmetric properties of distributions ([GR00, RRSS09, Val11, VV11], to cite a few), we thus need to, and can only, focus on estimating these moments. To relate this to our property, we first need a simple connection between $\ell_k$ norms and uniformity of a distribution. However, while getting an exact characterization is not difficult (Section 2.3), we are interested in a robust characterization, in order to derive a correspondence between approximate equality between norms and distance to uniformity. This is what we obtain in Section 3.2: roughly speaking, if the $\ell_2$ and $\ell_3$ norms of $p$ are close to those of the uniform distribution on $n$ elements, then $p$ must be close to a uniform distribution on $n$ elements.
This in turn allows us to design and analyze a simple and clean testing algorithm, which works in two stages: (i) estimate the $\ell_2$ norm of $p$ to sufficient accuracy; (ii) using this estimate, take enough samples to estimate its $\ell_3$ norm as well; and accept if and only if the two estimates are (approximately) those of a common uniform distribution.
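The collision statistics discussed in this overview can be illustrated with a minimal Python sketch. The function names `collision_count` and `moment_estimate` are ours, introduced only for illustration; the sketch merely relies on the standard fact that the fraction of $t$-subsets of samples that all coincide is an unbiased estimate of $\sum_x p(x)^t$.

```python
from collections import Counter
from math import comb


def collision_count(samples, t):
    """Number of t-element subsets of sample positions that all hold the same value."""
    return sum(comb(c, t) for c in Counter(samples).values())


def moment_estimate(samples, t):
    """Unbiased estimate of sum_x p(x)**t obtained from the t-way collision count."""
    m = len(samples)
    if m < t:
        raise ValueError("need at least t samples")
    return collision_count(samples, t) / comb(m, t)


# Toy sample: "a" appears three times, "b" twice, "c" once.
samples = ["a", "a", "b", "a", "c", "b"]
pairs = collision_count(samples, 2)    # 3 pairs of "a" plus 1 pair of "b"
triples = collision_count(samples, 3)  # the single triple of "a"
```

Note that only the multiset of multiplicities enters the computation, which reflects the fact that the labels of the domain elements play no role.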
Turning to the lower bound, the idea is once again to only use the available information: namely, if all that should matter are the $\ell_k$-norms of the distribution, then two distributions with similar low-order norms should be hard to distinguish; so it would suffice to come up with a pair of uniform and far-from-uniform distributions with similar moments to establish our lower bound. Fortunately, this intuition – already present in [RRSS09] – was formalized and developed in an earlier work of Paul Valiant [Val11], which we thus can leverage for our purpose. Given this “Wishful Thinking Theorem” (see Theorem 2.1), what remains is to upper bound the discrepancy of the moments of our two candidate distributions to show that some specific quantity is very small. Luckily, this last step can also be derived from the aforementioned robust characterization, Section 3.2.
2.1 Definitions and notation
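All throughout this paper, we write $\Delta(\Omega)$ for the set of discrete probability distributions over domain $\Omega$, i.e. the set of all real-valued functions $p\colon \Omega \to [0,1]$ such that $\sum_{x\in\Omega} p(x) = 1$. Considering a probability distribution as the vector of its probability mass function (pmf), we write $\|p\|_r$ for its $\ell_r$-norm, for any $r \geq 1$. A property of distributions over $\Omega$ is then a subset $\mathcal{P} \subseteq \Delta(\Omega)$, comprising all distributions that have the property.
As standard in distribution testing, we will measure the distance between two distributions $p, q$ on $\Omega$ by their total variation distance
$$d_{\mathrm{TV}}(p,q) = \sup_{S \subseteq \Omega} \big( p(S) - q(S) \big) = \frac{1}{2} \sum_{x \in \Omega} |p(x) - q(x)|,$$
which takes value in $[0,1]$. (This metric is sometimes referred to as statistical distance.) Given a property $\mathcal{P}$ and a distribution $p$, we then write $d_{\mathrm{TV}}(p, \mathcal{P}) = \inf_{q \in \mathcal{P}} d_{\mathrm{TV}}(p,q)$ for the distance of $p$ to $\mathcal{P}$.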
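To make these two notions of distance concrete, here is a small Python sketch, under the assumption that distributions have finite support and are represented as dictionaries mapping elements to probabilities (a representation chosen here purely for illustration). The helper `distance_to_uniform_class` is a hypothetical name; it computes the distance to the class of distributions uniform on some subset by noting that, for a fixed support size $k$, the closest uniform distribution sits on the $k$ heaviest elements, and that support sizes larger than the support of the input never help.

```python
def tv_distance(p, q):
    """Total variation distance between two pmfs given as dicts element -> mass."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def distance_to_uniform_class(p):
    """Distance from p to the nearest distribution uniform on some finite subset.

    Assumes p has finite support. For each candidate support size k, the best
    uniform distribution is the one on the k heaviest elements of p.
    """
    masses = sorted((v for v in p.values() if v > 0), reverse=True)
    best = 1.0
    for k in range(1, len(masses) + 1):
        # For two distributions, TV distance equals 1 - sum_x min(p(x), u(x)).
        overlap = sum(min(v, 1.0 / k) for v in masses[:k])
        best = min(best, 1.0 - overlap)
    return best


half_half = {"a": 0.5, "b": 0.5}
point_mass = {"a": 1.0}
skewed = {"a": 0.5, "b": 0.25, "c": 0.25}
```

For instance, `skewed` is closest to the uniform distribution on all three of its elements, at distance 1/6.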
Finally, recall that a testing algorithm for a fixed property is a randomized algorithm which takes as input a proximity parameter , and is granted access to independent samples from an unknown distribution :
if , the algorithm outputs accept with probability at least ;
if for every , it outputs reject with probability at least .
That is, the algorithm must accept if the unknown distribution has the property, and reject if it is $\varepsilon$-far from having it. The sample complexity of the algorithm is the number of samples it draws from the distribution in the worst case.
2.2 Useful results from previous work
We will heavily rely, for our lower bound, on the “Wishful Thinking Theorem” due to Paul Valiant [Val11], which applies to testing symmetric properties of distributions (that is, properties that are invariant under relabeling of the domain, as the class of uniform distributions happens to be). Intuitively, this theorem ensures that “if the low-degree moments ($\ell_k$ norms) of two distributions match, then these distributions (up to relabeling) are hard to distinguish.”
Theorem 2.1 (Wishful Thinking Theorem [Val11, Theorem 4.10], restated).
Given a positive integer and two distributions , it is impossible to test in samples any symmetric property that holds for and does not hold for , provided that the following conditions hold:
letting , be the -based moments of (defined below),
where , for .
(We observe that we only reproduced here one of the three sufficient conditions given in the original, more general theorem; as this will be the only one we need.)
2.3 Some structural results
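We here state and establish some simple yet useful results. The first relates uniformity of a distribution to the $\ell_k$-norms of its probability mass function, while the second provides inequalities between these norms.
Let $p$ be a probability distribution over $\Omega$. Then, $p$ is uniform on some subset of $\Omega$ if and only if $\|p\|_2^4 = \|p\|_3^3$.
If $p$ is uniform on a subset of $n$ elements, it is immediate to see that $\|p\|_2^4 = 1/n^2 = \|p\|_3^3$. We thus consider the converse implication. By the Cauchy–Schwarz inequality,
$$\|p\|_2^4 = \Big( \sum_{x \in \Omega} p(x)^{1/2}\, p(x)^{3/2} \Big)^2 \leq \Big( \sum_{x\in\Omega} p(x) \Big)\Big( \sum_{x\in\Omega} p(x)^3 \Big) = \|p\|_3^3,$$
with equality if, and only if, $(p(x)^{1/2})_{x\in\Omega}$ and $(p(x)^{3/2})_{x\in\Omega}$ are linearly dependent. Thus, $\|p\|_2^4 = \|p\|_3^3$ implies that there exist non-zero $a, b$ such that $a\,p(x)^{1/2} = b\,p(x)^{3/2}$ for all $x$, or equivalently that $p(x) \in \{0, a/b\}$ for all $x$. This, in turn, implies that $p$ is uniform on a subset of $b/a$ elements. ∎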
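As a quick numerical sanity check of this characterization, the following Python sketch evaluates the quantity $\|p\|_3^3 - \|p\|_2^4$, which by the Cauchy–Schwarz inequality is always nonnegative and vanishes exactly when $p$ is uniform on its support. The names `norm_power` and `uniformity_gap` are ours, and representing a pmf as a list of probabilities is an assumption made for the example.

```python
def norm_power(p, t):
    """Return sum_x p(x)**t, i.e. the t-th power of the l_t norm of the pmf p."""
    return sum(v ** t for v in p)


def uniformity_gap(p):
    """||p||_3^3 - ||p||_2^4: nonnegative, and zero iff p is uniform on its support."""
    return norm_power(p, 3) - norm_power(p, 2) ** 2


# Uniform on 4 of the 6 listed elements: the gap vanishes.
uniform_on_4 = [0.25, 0.25, 0.25, 0.25, 0.0, 0.0]
# Not uniform on its support: the gap is strictly positive.
skewed = [0.5, 0.25, 0.25]
```

For `skewed`, the two quantities are $0.15625$ and $0.140625$, so the gap is $0.015625$.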
For any vector such that , we have
for all . In particular, for any distribution , we have for all (and, thus, for instance, ).
The inequality is trivially true for , and, so, we henceforth assume . Let be such a vector: we wish to show that , or equivalently . Set , and so that with . Observing that , we then apply Hölder’s inequality:
concluding the proof. ∎
3 The Upper Bound
Our algorithm for testing uniformity first estimates the $\ell_2$ norm of the input distribution and uses this estimate to obtain a surrogate value for the size of the support set of the distribution. In the case the input distribution is a uniform distribution, the $\ell_2$ norm estimate indeed provides a good approximation to the size of the support set. Our algorithm for the $\ell_2$ norm estimation is presented in the following section, followed by our algorithm for testing uniformity.
3.1 Estimating the $\ell_2$ norm of a distribution
In this section, we present an algorithm that, given independent samples from a distribution $p$ over $\Omega$, estimates the $\ell_2$ norm of $p$. Note that a similar result was presented in Batu et al. [BFR13] in the case when the size of the domain is bounded and known to the algorithm. Furthermore, an algorithm based on the same ideas has been presented by Batu et al. [BDKR05] to estimate the entropy of a distribution that is uniform on a subset of its domain. The algorithm is presented below in Algorithm 1.
Algorithm Estimate-$\ell_2$-norm, given independent samples from a distribution over and , outputs a value such that
with probability at least . Whenever the algorithm produces an estimate satisfying (1) above, the number of samples taken by the algorithm is . Moreover, the algorithm takes samples in expectation.
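Algorithm 1 is not reproduced in this text. The following Python sketch illustrates the adaptive stopping rule that the analysis refers to, namely sampling until a prescribed number of pairwise collisions has been observed. It is a sketch under stated assumptions, not the algorithm as specified: the collision budget `k` is left as a free parameter (how it should depend on the accuracy is not given here), the output formula `k / binom(m, 2)` is our assumption, and `draw` and `max_samples` are hypothetical names.

```python
import random
from collections import Counter
from math import comb


def estimate_l2_squared(draw, k, max_samples=10**6):
    """Draw samples until at least k pairwise collisions have occurred.

    `draw` is a zero-argument callable returning one sample; `k` >= 1 is the
    collision budget (hypothetical parameter). Returns a pair
    (estimate of sum_x p(x)**2, number of samples used).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    counts = Counter()
    collisions = 0
    m = 0
    while collisions < k:
        if m >= max_samples:
            raise RuntimeError("sample cap reached before k collisions")
        x = draw()
        collisions += counts[x]  # the new sample collides with each earlier copy of x
        counts[x] += 1
        m += 1
    # Assumed output rule: k collisions among binom(m, 2) pairs of samples.
    return k / comb(m, 2), m


# A point mass collides on every pair: three samples already yield 3 collisions.
point_estimate, point_samples = estimate_l2_squared(lambda: "x", k=3)
```

The number of samples used is itself a random variable, which is what makes the procedure adaptive: heavier distributions trigger the stopping rule sooner.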
Let be the random variable that denotes the number of samples that were taken by the algorithm until pairwise collisions are observed. We will show that, with constant probability, is close to its expected value nearly .
Consider a set of $m$ samples from $p$. For $1 \leq i < j \leq m$, let $X_{ij}$ be an indicator random variable denoting a collision between the $i$th and $j$th samples. Let $S_m = \sum_{i<j} X_{ij}$ be the total number of collisions among the $m$ samples.
For any $i < j$, $\mathbb{E}[X_{ij}] = \|p\|_2^2$. Therefore, $\mathbb{E}[S_m] = \binom{m}{2} \|p\|_2^2$. We will also need an upper bound on the variance to show that the collisions are not observed too early or too late.
The terms of the last summation above can be grouped according to the cardinality of the set .
If , then . There are such terms.
If , then . There are such terms.
If , then . There are such terms.
Hence, we can bound the variance of as follows.
where the inequality arises from .
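The expectation of the number of pairwise collisions used at the start of this argument, namely $\binom{m}{2}$ times $\sum_x p(x)^2$ for $m$ independent samples, can be verified by brute force on a small instance. The Python sketch below enumerates every possible outcome of $m$ draws; the function name and the toy distribution are ours and serve only as an illustration.

```python
from itertools import combinations, product
from math import comb, prod


def expected_collisions_exact(p, m):
    """E[number of colliding pairs] among m i.i.d. draws from p, by enumeration."""
    total = 0.0
    for outcome in product(range(len(p)), repeat=m):
        weight = prod(p[x] for x in outcome)  # probability of this exact sequence
        pairs = sum(1 for i, j in combinations(range(m), 2)
                    if outcome[i] == outcome[j])
        total += weight * pairs
    return total


p = [0.5, 0.25, 0.25]
m = 4
closed_form = comb(m, 2) * sum(v * v for v in p)  # binom(m, 2) * sum_x p(x)^2
brute_force = expected_collisions_exact(p, m)
```

Both quantities equal $6 \times 0.375 = 2.25$ on this instance. The enumeration is exponential in $m$ and is only meant as a check for tiny parameters.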
The probability that the output of the algorithm is less than (that is, an underestimation) is bounded from above by the probability of the random variable taking a value such that . Analogously, the probability of an overestimation is bounded above by the probability of the random variable taking a value such that .
Let be the smallest integer such that , so that . Then,
|(, or .)|
for , where follows from the choice of .
To upper bound the probability of underestimation, take to be the largest integer such that (so that , i.e. ). (In particular, this implies , from which .) Then,
|(Note that ε)|
for , where follows from the choice of .
By the union bound, overestimation or underestimation happens with probability at most 1/4. Finally, in the event that we have a good estimation, we have that the number of samples satisfies
Therefore, we have that .
To bound the expected number of samples, we consider two cases (recall that the asymptotics here are taken, unless specified otherwise, while viewing as a sequence of distributions and letting ):
if (i.e., ), then we denote by the element such that . It follows from properties of the negative binomial distribution that the expected number of draws necessary to see different draws of (and thus collisions) is , so that .
Note that the sample complexity of Algorithm Estimate-$\ell_2$-norm is tight for near-uniform distributions (at least, in terms of dependency on ). Consider a distribution on elements with probability values in for some small . Even though can have sufficiently high and should be distinguished from the uniform distribution on elements, there will be no repetition in the sample until samples are taken. The following lemma generalizes this argument.
For any distribution and , estimation of within a multiplicative factor of requires samples from .
Take any distribution . We first consider the case . Fix any element such that (we can assume for simplicity one exists; otherwise, since we can find, for any , such that , we can repeat the argument below for an arbitrarily small ), and let . Then, we define the distribution on as the mixture
which satisfies , and
the last equality from our choice of . Since , any algorithm that estimates the squared norm of an unknown distribution can be used to distinguish between and . However, from the very definition of total variation distance, distinguishing between and requires samples. Since
(as ) we get a lower bound of .
We now turn to the case . The construction will be similar, but setting , and spreading the probability uniformly on elements outside the support of , instead of just one. It is straightforward to check that in this case, the distribution we defined is such that
so again, by the same argument, any algorithm which can approximate to can be used to distinguish between and , and thus requires samples. ∎
We emphasize that the above theorem is on an instance-by-instance basis, and applies to every probability distribution . In contrast, it is not hard to see that for some distributions , a lower bound of holds: this follows for instance from [AOST17, Theorem 15]. This latter bound, however, cannot hold for every probability distribution, as one can see e.g. from a (trivial) distribution supported on a single element, for which $\ell_2$-norm estimation can be done with samples.
3.2 Testing Uniformity
In this section, we present our algorithm for testing uniformity of a distribution. We first give a brief overview of the algorithm. The algorithm first estimates the $\ell_2$ norm of the input distribution and uses this value to obtain an estimate of the support size of the distribution. Then, the algorithm tries to distinguish a uniform distribution from a distribution that is far from any uniform distribution by using the number of 3-way collisions in a freshly taken sample set. For two distributions with the same $\ell_2$ norm, where one is a uniform distribution and the other is far from being uniform, the latter is expected to produce more 3-way collisions in a large enough sample set. The algorithm keeps taking samples up to a number based on the support-size estimate and keeps track of the 3-way collisions in the sample set to decide whether to accept or reject the input distribution.
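The two-stage structure outlined above can be sketched in Python as follows. This is only a plug-in illustration of the idea and not Algorithm 2: the sample sizes are whatever the caller provides, and the slack parameter `tol` in the acceptance rule is a hypothetical placeholder for the thresholds that the actual analysis would dictate. The function names are ours.

```python
from collections import Counter
from math import comb


def collisions(samples, t):
    """Number of t-subsets of the samples that all hold the same value."""
    return sum(comb(c, t) for c in Counter(samples).values())


def two_stage_test(first_batch, fresh_batch, tol):
    """Return True (accept) or False (reject); illustrative sketch only.

    Stage 1: pairwise collisions in `first_batch` estimate sum_x p(x)**2,
    whose inverse serves as a surrogate support size.
    Stage 2: the 3-way collisions in `fresh_batch` are compared with the number
    expected under a uniform distribution on that many elements.
    `tol` is a hypothetical slack, not a value taken from the analysis.
    """
    if len(first_batch) < 2:
        raise ValueError("need at least two samples in the first batch")
    l2_squared = collisions(first_batch, 2) / comb(len(first_batch), 2)
    if l2_squared == 0:
        raise ValueError("no pairwise collision observed: draw more samples")
    support_estimate = 1.0 / l2_squared
    # Under the uniform distribution on n elements, each triple collides w.p. 1/n^2.
    expected_triples = comb(len(fresh_batch), 3) / support_estimate ** 2
    return collisions(fresh_batch, 3) <= (1.0 + tol) * expected_triples


balanced = list("abcd") * 3               # each of 4 elements seen 3 times
skewed = ["a"] * 9 + ["b", "c", "d"]      # one element dominates
accept_balanced = two_stage_test(balanced, balanced, tol=0.1)
accept_skewed = two_stage_test(skewed, skewed, tol=0.1)
```

On these toy inputs the balanced sample is accepted and the skewed one rejected, since the latter shows 84 colliding triples against roughly 65 expected from its surrogate support size. In an actual run the second batch would consist of fresh samples, independent of the first.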
The following lemma formalizes the intuition that if the $\ell_2$ and the $\ell_3$ norms of a distribution are close to those of the uniform distribution on $n$ elements, then the distribution is close to being uniform.
Let be a distribution over and such that
for some . Then, the distance of to can be upper bounded as
Note that the condition on the implies that “ought to be” distributed roughly uniformly over elements, or otherwise would deviate significantly enough from uniformity to impact its norm. The condition on further strengthens how evenly is distributed, ensuring that this latter case cannot happen. Below we formalize this intuition and, in particular, use the conditions on the norms to upper bound the total mass on the items that have probability significantly larger than .
Let be a random variable such that takes value with probability , for each element in the support set of . Then, , which implies
We now derive an upper bound on the distance . We first obtain an upper bound on the total weight of elements with probability significantly above or below . Then, we can proceed to compare the distribution to a uniform distribution with support size close to .
First, we can bound the total probability mass of items such that or by looking at the probability of a large deviation of from its expectation. In particular,
Note that the second inequality above follows from that when , by the concavity of the function and for .
We now have established that a probability mass of at least of is placed on elements with individual probabilities in the interval . Call this set . Thus, we have that
Now consider the uniform distribution on the set . Since , it suffices to upper bound the latter. Given that
for any , we have that
Finally, we can conclude that
establishing the lemma. ∎
The algorithm for testing uniformity is presented below in Algorithm 2.
Note that, for a uniform distribution, $\ell_2$ norm estimation will give a reliable estimate for the support size. Then, we will show that samples will be unlikely to produce more than 3-way collision. On the other hand, for a distribution that is far from a uniform distribution, the support size estimation in the algorithm will be an underestimation. In addition, the $\ell_3$ norm of such a distribution will be higher than that of the uniform distribution with that estimated support size. As a result, the algorithm will observe more than 3-way collisions in the subsequent samples with high probability, as evidence that the input distribution is not uniform.
Algorithm Test-Uniformity, given independent samples from a distribution over and , accepts if and rejects such that , with probability at least 3/4. The sample complexity of the algorithm is .
In the proof, we will need simple distributional properties of the number of 3-way collisions, analogous to the arguments in the proof of Section 3.1. Let be the total number of 3-way collisions in samples from a distribution . Then, we have that