
# Testing Identity of Structured Distributions

Ilias Diakonikolas
University of Edinburgh
ilias.d@ed.ac.uk.
Supported by EPSRC grant EP/L021749/1, a Marie Curie Career Integration Grant, and a SICSA grant.
Daniel M. Kane
University of California, San Diego
dakane@cs.ucsd.edu.
Supported in part by an NSF Postdoctoral Fellowship.
Vladimir Nikishkin
University of Edinburgh
v.nikishkin@sms.ed.ac.uk.
Supported by a University of Edinburgh PCD Scholarship.
###### Abstract

We study the question of identity testing for structured distributions. More precisely, given samples from a structured distribution $q$ over $[n]$ and an explicit distribution $p$ over $[n]$, we wish to distinguish whether $q = p$ versus $q$ is at least $\epsilon$-far from $p$, in $L_1$ distance. In this work, we present a unified approach that yields new, simple testers, with sample complexity that is information-theoretically optimal, for broad classes of structured distributions, including $t$-flat distributions, $t$-modal distributions, log-concave distributions, monotone hazard rate (MHR) distributions, and mixtures thereof.

## 1 Introduction

How many samples do we need to verify the identity of a distribution? This is arguably the single most fundamental question in statistical hypothesis testing [NP33], with Pearson’s chi-squared test [Pea00] (and variants thereof) still being the method of choice used in practice. This question has also been extensively studied by the TCS community in the framework of property testing [RS96, GGR98]: Given sample access to an unknown distribution $q$ over a finite domain $[n] := \{1, \ldots, n\}$, an explicit distribution $p$ over $[n]$, and a parameter $\epsilon > 0$, we want to distinguish between the cases that $q$ and $p$ are identical versus $\epsilon$-far from each other in $L_1$ norm (statistical distance). Previous work on this problem focused on characterizing the sample size needed to test the identity of an arbitrary distribution of a given support size. After more than a decade of study, this “worst-case” regime is well-understood: there exists a computationally efficient estimator with sample complexity $O(\sqrt{n}/\epsilon^2)$ [VV14] and a matching information-theoretic lower bound [Pan08].

While it is certainly a significant improvement over naive approaches and is tight in general, the bound of is still impractical, if the support size is very large. We emphasize that the aforementioned sample complexity characterizes worst-case instances, and one might hope that drastically better results can be obtained for most natural settings. In contrast to this setting, in which we assume nothing about the structure of the unknown distribution , in many cases we know a priori that the distribution in question has some “nice structure”. For example, we may have some qualitative information about the density , e.g., it may be a mixture of a small number of log-concave distributions, or a multi-modal distribution with a bounded number of modes. The following question naturally arises: Can we exploit the underlying structure in order to perform the desired statistical estimation task more efficiently?

One would optimistically hope for the answer to the above question to be “YES.” While this has been confirmed in several cases for the problem of learning (see e.g.,  [DDS12a, DDS12b, DDO13, CDSS14]), relatively little work has been done for testing properties of structured distributions. In this paper, we show that this is indeed the case for the aforementioned problem of identity testing for a broad spectrum of natural and well-studied distribution classes. To describe our results in more detail, we will need some terminology.

Let $\mathcal{C}$ be a class of distributions over $[n]$. The problem of identity testing for $\mathcal{C}$ is the following: Given sample access to an unknown distribution $q \in \mathcal{C}$, and an explicit distribution $p \in \mathcal{C}$ (it is no loss of generality to assume that $p \in \mathcal{C}$; otherwise the tester can output “NO” without drawing samples), we want to distinguish between the case that $q = p$ versus $\|q - p\|_1 \ge \epsilon$. We emphasize that the sample complexity of this testing problem depends on the underlying class $\mathcal{C}$, and we believe it is of fundamental interest to obtain efficient algorithms that are sample optimal for $\mathcal{C}$. One approach to solve this problem is to learn $q$ up to $L_1$ distance $\epsilon/2$ and check that the hypothesis is $\epsilon/2$-close to $p$. Thus, the sample complexity of identity testing for $\mathcal{C}$ is bounded from above by the sample complexity of learning (an arbitrary distribution in) $\mathcal{C}$. It is natural to ask whether a better sample size bound could be achieved for the identity testing problem, since this task is, in some sense, less demanding than the task of learning.

In this work, we provide a comprehensive picture of the sample and computational complexities of identity testing for a broad class of structured distributions. More specifically, we propose a unified framework that yields new, simple, and provably optimal identity testers for various structured classes ; see Table 1 for an indicative list of distribution classes to which our framework applies. Our approach relies on a single unified algorithm that we design, which yields highly efficient identity testers for many shape restricted classes of distributions.

As an interesting byproduct, we establish that, for various structured classes , identity testing for is provably easier than learning. In particular, the sample bounds in the third column of Table 1 from [CDSS14] also apply for learning the corresponding class , and are known to be information-theoretically optimal for the learning problem.

Our main result (see Theorem 1 and Proposition 2 in Section 2) can be phrased, roughly, as follows: Let $\mathcal{C}$ be a class of univariate distributions such that any pair of distributions $p, q \in \mathcal{C}$ have “essentially” at most $k$ crossings, that is, points of the domain where $q - p$ changes its sign. Then, the identity problem for $\mathcal{C}$ can be solved with $O(\sqrt{k}/\epsilon^2)$ samples. Moreover, this bound is information-theoretically optimal.

By the term “essentially” we mean that a constant fraction of the contribution to is due to a set of crossings – the actual number of crossings can be arbitrary. For example, if is the class of -piecewise constant distributions, it is clear that any two distributions in have crossings, which gives us the first line of Table 1. As a more interesting example, consider the class of log-concave distributions over . While the number of crossings between can be , it can be shown (see Lemma 17 in [CDSS14]) that the essential number of crossings is , which gives us the third line of the table. More generally, we obtain asymptotic improvements over the standard bound for any class such that the essential number of crossings is . This condition applies for any class that can be well-approximated in distance by piecewise low-degree polynomials (see Corollary 3 for a precise statement).

### 1.1 Related and Prior Work

In this subsection we review the related literature and compare our results with previous work.

Distribution Property Testing The area of distribution property testing, initiated in the TCS community by the work of Batu et al. [BFR00, BFR13], has developed into a very active research area with intimate connections to information theory, learning and statistics. The paradigmatic algorithmic problem in this area is the following: given sample access to an unknown distribution over an -element set, we want to determine whether has some property or is “far” (in statistical distance or, equivalently, norm) from any distribution having the property. The overarching goal is to obtain a computationally efficient algorithm that uses as few samples as possible – certainly asymptotically fewer than the support size , and ideally much less than that. See [GR00, BFR00, BFF01, Bat01, BDKR02, BKR04, Pan08, Val11, VV11, DDS13, ADJ11, LRR11, ILR12] for a sample of works and [Rub12] for a survey.

One of the first problems studied in this line of work is that of “identity testing against a known distribution”: Given samples from an unknown distribution and an explicitly given distribution distinguish between the case that versus the case that is -far from in norm. The problem of uniformity testing – the special case of identity testing when is the uniform distribution – was first considered by Goldreich and Ron [GR00] who, motivated by a connection to testing expansion in graphs, obtained a uniformity tester using samples. Subsequently, Paninski gave the tight bound of  [Pan08] for this problem. Batu et al. [BFF01] obtained an identity testing algorithm against an arbitrary explicit distribution with sample complexity . The tight bound of for the general identity testing problem was given only recently in [VV14].

Shape Restricted Statistical Estimation The area of inference under shape constraints – that is, inference about a probability distribution under the constraint that its probability density function (pdf) satisfies certain qualitative properties – is a classical topic in statistics starting with the pioneering work of Grenander [Gre56] on monotone distributions (see [BBBB72] for an early book on the topic). Various structural restrictions have been studied in the statistics literature, starting from monotonicity, unimodality, and concavity [Gre56, Bru58, Rao69, Weg70, HP76, Gro85, Bir87a, Bir87b, Fou97, CT04, JW09], and more recently focusing on structural restrictions such as log-concavity and -monotonicity [BW07, DR09, BRW09, GW09, BW10, KM10].

Shape restricted inference is well-motivated in its own right, and has seen a recent surge of research activity in the statistics community, in part due to the ubiquity of structured distributions in the natural sciences. Such structural constraints on the underlying distributions are sometimes direct consequences of the studied application problem (see e.g., Hampel [Ham87], or Wang et al. [WWW05]), or they are a plausible explanation of the model under investigation (see e.g.,  [Reb05] and references therein for applications to economics and reliability theory). We also point the reader to the recent survey  [Wal09] highlighting the importance of log-concavity in statistical inference. The hope is that, under such structural constraints, the quality of the resulting estimators may dramatically improve, both in terms of sample size and in terms of computational efficiency.

We remark that the statistics literature on the topic has focused primarily on the problem of density estimation or learning an unknown structured distribution. That is, given samples from a distribution promised to belong to some distribution class , we would like to output a hypothesis distribution that is a good approximation to . In recent years, there has been a flurry of results in the TCS community on learning structured distributions, with a focus on both sample complexity and computational complexity, see [KMR94, FOS05, BS10, KMV10, MV10, DDS12a, DDS12b, CDSS13, DDO13, CDSS14] for some representative works.

Comparison with Prior Work In recent work, Chan, Diakonikolas, Servedio, and Sun [CDSS14] proposed a general approach to learn univariate probability distributions that are well approximated by piecewise polynomials. [CDSS14] obtained a computationally efficient and sample near-optimal algorithm to agnostically learn piecewise polynomial distributions, thus obtaining efficient estimators for various classes of structured distributions. For many of the classes considered in Table 1 the best previously known sample complexity for the identity testing problem for is identified with the sample complexity of the corresponding learning problem from [CDSS14]. We remark that the results of this paper apply to all classes considered in [CDSS14], and are in fact more general as our condition (any have a bounded number of “essential” crossings) subsumes the piecewise polynomial condition (see discussion before Corollary 3 in Section 2). At the technical level, in contrast to the learning algorithm of [CDSS14], which relies on a combination of linear programming and dynamic programming, our identity tester is simple and combinatorial.

In the context of property testing, Batu, Kumar, and Rubinfeld [BKR04] gave algorithms for the problem of identity testing of unimodal distributions with sample complexity More recently, Daskalakis, Diakonikolas, Servedio, Valiant, and Valiant [DDS13] generalized this result to -modal distributions obtaining an identity tester with sample complexity . We remark that for the class of -modal distributions our approach yields an identity tester with sample complexity , matching the lower bound of [DDS13]. Moreover, our work yields sample optimal identity testing algorithms not only for -modal distributions, but for a broad spectrum of structured distributions via a unified approach.

It should be emphasized that the main ideas underlying this paper are very different from those of  [DDS13]. The algorithm of [DDS13] is based on the fact from [Bir87a] that any -modal distribution is -close in norm to a piecewise constant distribution with intervals. Hence, if the location and the width of these “flat” intervals were known in advance, the problem would be easy: The algorithm could just test identity between the “reduced” distributions supported on these intervals, thus obtaining the optimal sample complexity of . To circumvent the problem that this decomposition is not known a priori, [DDS13] start by drawing samples from the unknown distribution to construct such a decomposition. There are two caveats with this strategy: First, the number of samples used to achieve this is and the number of intervals of the constructed decomposition is significantly larger than , namely . As a consequence, the sample complexity of identity testing for the reduced distributions on support is

In conclusion, the approach of [DDS13] involves constructing an adaptive interval decomposition of the domain followed by a single application of an identity tester to the reduced distributions over those intervals. At a high-level our novel approach works as follows: We consider several oblivious interval decompositions of the domain (i.e., without drawing any samples from ) and apply a “reduced” identity tester for each such decomposition. While it may seem surprising that such an approach can be optimal, our algorithm and its analysis exploit a certain strong property of uniformity testers, namely their performance guarantee with respect to the norm. See Section 2 for a detailed explanation of our techniques.

Finally, we comment on the relation of this work to the recent paper [VV14]. In [VV14], Valiant and Valiant study the sample complexity of the identity testing problem as a function of the explicit distribution. In particular, [VV14] makes no assumptions about the structure of the unknown distribution , and characterizes the sample complexity of the identity testing problem as a function of the known distribution The current work provides a unified framework to exploit structural properties of the unknown distribution , and yields sample optimal identity testers for various shape restrictions. Hence, the results of this paper are orthogonal to the results of [VV14].

## 2 Our Results and Techniques

### 2.1 Basic Definitions

We start with some notation that will be used throughout this paper. We consider discrete probability distributions over $[n] := \{1, \ldots, n\}$, which are given by probability density functions $p: [n] \to [0,1]$ such that $\sum_{i=1}^{n} p_i = 1$, where $p_i$ is the probability of element $i$ in distribution $p$. By abuse of notation, we will sometimes use $p$ to denote the distribution with density function $p_i$. We emphasize that we view the domain $[n]$ as an ordered set. Throughout this paper we will be interested in structured distribution families that respect this ordering.

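The $L_1$ (resp. $L_2$) norm of a distribution is identified with the $L_1$ (resp. $L_2$) norm of the corresponding density function, i.e., $\|p\|_1 = \sum_{i=1}^{n} |p_i|$ and $\|p\|_2 = \sqrt{\sum_{i=1}^{n} p_i^2}$. The $L_1$ (resp. $L_2$) distance between distributions $p$ and $q$ is defined as the $L_1$ (resp. $L_2$) norm of the vector of their difference, i.e., $\|p - q\|_1 = \sum_{i=1}^{n} |p_i - q_i|$ and $\|p - q\|_2 = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$. We will denote by $U_n$ the uniform distribution over $[n]$.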

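Interval partitions and $\mathcal{A}_k$-distance. Fix a partition of $[n]$ into disjoint intervals $\mathcal{I} := (I_i)_{i=1}^{\ell}$. For such a collection $\mathcal{I}$ we will denote its cardinality by $|\mathcal{I}|$, i.e., $|\mathcal{I}| = \ell$. For an interval $J \subseteq [n]$, we denote by $|J|$ its cardinality or length, i.e., if $J = [a, b]$, with $a \le b \in [n]$, then $|J| = b - a + 1$. The reduced distribution $p_r^{\mathcal{I}}$ corresponding to $p$ and $\mathcal{I}$ is the distribution over $[\ell]$ that assigns the $i$th “point” the mass that $p$ assigns to the interval $I_i$; i.e., for $i \in [\ell]$, $p_r^{\mathcal{I}}(i) = p(I_i)$.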
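As a concrete illustration of the reduced distribution defined above, the following minimal Python sketch collapses a distribution onto an interval partition. The function name `reduced_distribution`, the 0-indexed inclusive `(a, b)` interval encoding, and the use of exact `Fraction` arithmetic are assumptions made for this example and do not come from the paper.

```python
from fractions import Fraction


def reduced_distribution(p, intervals):
    """Collapse p onto a partition of its domain into consecutive intervals.

    p         : list of probabilities; p[i] is the mass of point i (0-indexed).
    intervals : list of (a, b) pairs, inclusive and 0-indexed, that must
                tile range(len(p)) from left to right (illustrative encoding).
    Returns a list whose i-th entry is the mass p assigns to the i-th interval.
    """
    next_start = 0
    for a, b in intervals:
        # Each interval must begin where the previous one ended and be non-empty.
        if a != next_start or b < a:
            raise ValueError("intervals must form a left-to-right partition")
        next_start = b + 1
    if next_start != len(p):
        raise ValueError("intervals must cover the whole domain")
    return [sum(p[a:b + 1]) for a, b in intervals]


# Example: a distribution over 4 points reduced to 2 intervals.
p = [Fraction(1, 8), Fraction(1, 8), Fraction(1, 4), Fraction(1, 2)]
print(reduced_distribution(p, [(0, 1), (2, 3)]))
```

The reduced distribution lives on a domain whose size is the number of intervals, which is why testers applied to it can use far fewer samples than testers applied to the original domain.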

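We now define a distance metric between distributions that will be crucial for this paper. Let $\mathfrak{J}_k$ be the collection of all partitions of $[n]$ into $k$ intervals, i.e., $\mathcal{I} \in \mathfrak{J}_k$ if and only if $\mathcal{I} = (I_i)_{i=1}^{k}$ is a partition of $[n]$ into intervals $I_1, \ldots, I_k$. For $p, q: [n] \to [0,1]$ and $k \in \mathbb{Z}_+$, $2 \le k \le n$, we define the $\mathcal{A}_k$-distance between $p$ and $q$ by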

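$$\|p-q\|_{\mathcal{A}_k} \stackrel{\text{def}}{=} \max_{\mathcal{I}=(I_i)_{i=1}^{k} \in \mathfrak{J}_k} \sum_{i=1}^{k} |p(I_i)-q(I_i)| = \max_{\mathcal{I} \in \mathfrak{J}_k} \|p_r^{\mathcal{I}}-q_r^{\mathcal{I}}\|_1.$$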

We remark that the $\mathcal{A}_k$-distance between distributions is well-studied in probability theory and statistics. Note that for any pair of distributions $p, q: [n] \to [0,1]$, and any $k \in \mathbb{Z}_+$ with $2 \le k \le n$, we have that $\|p-q\|_{\mathcal{A}_k} \le \|p-q\|_1$, and the two metrics are identical for $k = n$. Also note that $\|p-q\|_{\mathcal{A}_2} = 2 \cdot d_K(p, q)$, where $d_K$ is the Kolmogorov metric (i.e., the $L_\infty$ distance between the CDFs).

Footnote 2: We note that the definition of $\mathcal{A}_k$-distance in this work is slightly different than [DL01, CDSS14], but is easily seen to be essentially equivalent. In particular, [CDSS14] considers the corresponding maximum over the collection of all unions of at most $k$ intervals in $[n]$. It is a simple exercise to verify that the two definitions are equivalent up to constant factors for the purpose of both upper and lower bounds.
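To make the definition of this distance concrete, here is a small Python sketch that evaluates the maximum over all partitions into exactly $k$ intervals by dynamic programming over prefix sums of the difference $p - q$. The function name `ak_distance` and the $O(kn^2)$ dynamic program are choices made for this illustration; the paper does not need to compute this quantity explicitly, so this is not part of its algorithm.

```python
from fractions import Fraction
from itertools import accumulate


def ak_distance(p, q, k):
    """Max over partitions of the domain into exactly k non-empty intervals
    of the sum over intervals I of |p(I) - q(I)| (illustrative O(k n^2) DP)."""
    n = len(p)
    if len(q) != n or not 1 <= k <= n:
        raise ValueError("need len(p) == len(q) and 1 <= k <= n")
    # prefix[i] = sum of (p - q) over the first i points.
    prefix = [0] + list(accumulate(a - b for a, b in zip(p, q)))
    # best[i] = best value for splitting the first i points into j intervals;
    # None marks an infeasible state. Start with j = 0 intervals.
    best = [0] + [None] * n
    for j in range(1, k + 1):
        new = [None] * (n + 1)
        for i in range(j, n + 1):
            # The last interval covers points s+1..i; the first s points
            # must already be split into j-1 intervals.
            new[i] = max(
                best[s] + abs(prefix[i] - prefix[s])
                for s in range(j - 1, i)
                if best[s] is not None
            )
        best = new
    return best[n]


# Example: alternating deviations from uniform over 4 points.
p = [Fraction(1, 2), Fraction(0), Fraction(1, 2), Fraction(0)]
u = [Fraction(1, 4)] * 4
print(ak_distance(p, u, 2), ak_distance(p, u, 4))
```

In this example the value for $k = 4 = n$ coincides with the full $L_1$ distance, while $k = 2$ captures only half of it, which matches the remark that the distance is at most the $L_1$ distance and equals it when $k = n$.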

Discussion The well-known Vapnik-Chervonenkis (VC) inequality (see e.g., [DL01, p.31]) provides the information-theoretically optimal sample size to learn an arbitrary distribution over in this metric. In particular, it implies that iid draws from suffice in order to learn within -distance (with probability at least ). This fact has recently proved useful in the context of learning structured distributions: By exploiting this fact, Chan, Diakonikolas, Servedio, and Sun [CDSS14] recently obtained computationally efficient and near-sample optimal algorithms for learning various classes of structured distributions with respect to the distance.

It is thus natural to ask the following question: What is the sample complexity of testing properties of distributions with respect to the -distance? Can we use property testing algorithms in this metric to obtain sample-optimal testing algorithms for interesting classes of structured distributions with respect to the distance? In this work we answer both questions in the affirmative for the problem of identity testing.

### 2.2 Our Results

Our main result is an optimal algorithm for the identity testing problem under the -distance metric:

###### Theorem 1 (Main).

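Given $\epsilon > 0$, an integer $k$ with $2 \le k \le n$, sample access to a distribution $q$ over $[n]$, and an explicit distribution $p$ over $[n]$, there is a computationally efficient algorithm which uses $O(\sqrt{k}/\epsilon^2)$ samples from $q$, and with probability at least $2/3$ distinguishes whether $q = p$ versus $\|q - p\|_{\mathcal{A}_k} \ge \epsilon$. Additionally, $\Omega(\sqrt{k}/\epsilon^2)$ samples are information-theoretically necessary.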

The information-theoretic sample lower bound of can be easily deduced from the known lower bound of for uniformity testing over under the norm [Pan08]. Indeed, if the underlying distribution over is piecewise constant with pieces, and is the uniform distribution over , we have Hence, our -uniformity testing problem in this case is at least as hard as -uniformity testing over support of size .

The proof of Theorem 1 proceeds in two stages: In the first stage, we reduce the identity testing problem to uniformity testing without incurring any loss in the sample complexity. In the second stage, we use an optimal uniformity tester as a black-box to obtain an sample algorithm for uniformity testing. We remark that the uniformity tester is not applied to the distribution directly, but to a sequence of reduced distributions , for an appropriate collection of interval partitions . See Section 2.3 for a detailed intuitive explanation of the proof.

We remark that an application of Theorem 1 for $k = n$ yields a sample optimal $L_1$ identity tester (for an arbitrary distribution $q$), giving a new algorithm matching the recent tight upper bound in [VV14]. Our new $L_1$ identity tester is arguably simpler and more intuitive, as it only uses a uniformity tester in a black-box manner.

We show that Theorem 1 has a wide range of applications to the problem of identity testing for various classes of natural and well-studied structured distributions. At a high level, the main message of this work is that the distance can be used to characterize the sample complexity of identity testing for broad classes of structured distributions. The following simple proposition underlies our approach:

###### Proposition 2.

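For a distribution class $\mathcal{C}$ over $[n]$ and $\epsilon > 0$, let $k = k(\mathcal{C}, \epsilon)$ be the smallest integer such that for any $p, q \in \mathcal{C}$ it holds that $\|p - q\|_1 \le \|p - q\|_{\mathcal{A}_k} + \epsilon/2$. Then there exists an identity testing algorithm for $\mathcal{C}$ using $O(\sqrt{k}/\epsilon^2)$ samples.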

The proof of the proposition is straightforward: Given sample access to $q \in \mathcal{C}$ and an explicit description of $p \in \mathcal{C}$, we apply the $\mathcal{A}_k$-identity testing algorithm of Theorem 1 for the value of $k$ in the statement of the proposition, and error $\epsilon' = \epsilon/2$. If $q = p$, the algorithm will output “YES” with probability at least $2/3$. If $\|q - p\|_1 \ge \epsilon$, then by the condition of Proposition 2 we have that $\|q - p\|_{\mathcal{A}_k} \ge \epsilon'$, and the algorithm will output “NO” with probability at least $2/3$. Hence, as long as the underlying distribution satisfies the condition of Proposition 2 for a value of $k = o(n)$, Theorem 1 yields an asymptotic improvement over the sample complexity of $\Theta(\sqrt{n}/\epsilon^2)$.

We remark that the value of in the proposition is a natural complexity measure for the difference between two probability density functions in the class . It follows from the definition of the distance that this value corresponds to the number of “essential” crossings between and – i.e., the number of crossings between the functions and that significantly affect their distance. Intuitively, the number of essential crossings – as opposed to the domain size – is, in some sense, the “right” parameter to characterize the sample complexity of identity testing for . As we explain below, the upper bound implied by the above proposition is information-theoretically optimal for a wide range of structured distribution classes .

More specifically, our framework can be applied to all structured distribution classes that can be well-approximated in distance by piecewise low-degree polynomials. We say that a distribution over is -piecewise degree- if there exists a partition of into intervals such that is a (discrete) degree- polynomial within each interval. Let denote the class of all -piecewise degree- distributions over . We say that a distribution class is -close in to if for any there exists such that It is easy to see that any pair of distributions have at most crossings, which implies that , for (see e.g., Proposition 6 in [CDSS14]). We therefore obtain the following:

###### Corollary 3.

Let be a distribution class over and . Consider parameters and such that is -close in to . Then there exists an identity testing algorithm for using samples.

Note that any pair of values satisfying the condition above suffices for the conclusion of the corollary. Since our goal is to minimize the sample complexity, for a given class , we would like to apply the corollary for values and satisfying the above condition and are such that the product is minimized. The appropriate choice of these values is crucial, and is based on properties of the underlying distribution family. Observe that the sample bound of is tight in general, as follows by selecting . This can be deduced from the general lower bound of for uniformity testing, and the fact that for , any distribution over support can be expressed as a -piecewise degree- distribution.

The concrete testing results of Table 1 are obtained from Corollary 3 by using known existential approximation theorems [Bir87a, CDSS13, CDSS14] for the corresponding structured distribution classes. In particular, we obtain efficient identity testers, in most cases with provably optimal sample complexity, for all the structured distribution classes studied in [CDSS13, CDSS14] in the context of learning. Perhaps surprisingly, our upper bounds are tight not only for the class of piecewise polynomials, but also for the specific shape restricted classes of Table 1. The corresponding lower bounds for specific classes are either known from previous work (as e.g., in the case of -modal distributions [DDS13]) or can be obtained using standard constructions.

Finally, we remark that the results of this paper can be appropriately generalized to the setting of testing the identity of continuous distributions over the real line. More specifically, Theorem 1 also holds for probability distributions over . (The only additional assumption required is that the explicitly given continuous pdf can be efficiently integrated up to any additive accuracy.) In fact, the proof for the discrete setting extends almost verbatim to the continuous setting with minor modifications. It is easy to see that both Proposition 2 and Corollary 3 hold for the continuous setting as well.

### 2.3 Our Techniques

We now provide a detailed intuitive explanation of the ideas that lead to our main result, Theorem 1. Given sample access to a distribution and an explicit distribution , we want to test whether versus . By definition we have that . So, if the “optimal” partition maximizing this expression was known a priori, the problem would be easy: Our algorithm could then consider the reduced distributions and , which are supported on sets of size , and call a standard -identity tester to decide whether versus . (Note that for any given partition of into intervals and any distribution , given sample access to one can simulate sample access to the reduced distribution .) The difficulty, of course, is that the optimal -partition is not fixed, as it depends on the unknown distribution , thus it is not available to the algorithm. Hence, a more refined approach is necessary.
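The parenthetical remark about simulating sample access to a reduced distribution amounts to replacing each sample by the index of the interval that contains it. A minimal Python sketch follows; the name `reduce_samples` and the encoding of a partition by its sorted inclusive right endpoints are assumptions of this example.

```python
from bisect import bisect_left


def reduce_samples(samples, right_endpoints):
    """Map samples from {1, ..., n} to 0-based interval indices.

    right_endpoints : sorted inclusive right endpoints of the intervals,
                      with the last one equal to n (illustrative encoding).
    A draw x from the original distribution becomes a draw from the reduced
    distribution by reporting which interval contains x.
    """
    n = right_endpoints[-1]
    reduced = []
    for x in samples:
        if not 1 <= x <= n:
            raise ValueError(f"sample {x} outside the domain 1..{n}")
        # First interval whose right endpoint is >= x.
        reduced.append(bisect_left(right_endpoints, x))
    return reduced


# Intervals [1,2], [3,5], [6,8] over the domain {1, ..., 8}.
print(reduce_samples([1, 2, 3, 5, 6, 8], [2, 5, 8]))
```

No fresh samples are needed for this step, so the same sample set can be reused for any number of different partitions.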

Our starting point is a new, simple reduction of the general problem of identity testing to its special case of uniformity testing. The main idea of the reduction is to appropriately “stretch” the domain size, using the explicit distribution , in order to transform the identity testing problem between and into a uniformity testing problem for a (different) distribution (that depends on and ). To show correctness of this reduction we need to show that it preserves the distance, and that we can sample from given samples from .

We now proceed with the details. Since is given explicitly in the input, we assume for simplicity that each is a rational number, hence there exists some (potentially large) such that , where and 333We remark that this assumption is not necessary: For the case of irrational ’s we can approximate them by rational numbers up to sufficient accuracy and proceed with the approximate distribution . This approximation step does not preserve perfect completeness; however, we point out that our testers have some mild robustness in the completeness case, which suffices for all the arguments to go through. Given sample access to and an explicit over , we construct an instance of the uniformity testing problem as follows: Let be the uniform distribution over and let be the distribution over obtained from by subdividing the probability mass of , , equally among new consecutive points. It is clear that this reduction preserves the distance, i.e., The only remaining task is to show how to simulate sample access to , given samples from . Given a sample from , our sample for is selected uniformly at random from the corresponding set of many new points. Hence, we have reduced the problem of identity testing between and in distance, to the problem of uniformity testing of in distance. Note that this reduction is also computationally efficient, as it only requires pre-computation to specify the new intervals.
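The stretching reduction described above can be sketched in Python as follows, under the stated simplifying assumption that the explicit distribution has rational entries, and with the further assumption (made only for this example) that all of its entries are strictly positive. The names `stretch_blocks`, `stretch_sample`, and `stretch_distribution` are hypothetical, and the new domain is 0-indexed.

```python
import random
from fractions import Fraction
from math import lcm  # multi-argument lcm requires Python 3.9+


def stretch_blocks(p):
    """Given explicit p with positive rational entries, return (M, blocks).

    Point i of the old domain is replaced by the half-open block
    blocks[i] = (start, end) of M * p[i] consecutive new points, so that p
    becomes the uniform distribution over range(M).
    """
    if any(x <= 0 for x in p) or sum(p) != 1:
        raise ValueError("p must have positive entries summing to 1")
    M = lcm(*(x.denominator for x in p))
    blocks, start = [], 0
    for x in p:
        width = int(x * M)  # an integer, since M is a common denominator
        blocks.append((start, start + width))
        start += width
    return M, blocks


def stretch_sample(i, blocks, rng=random):
    """Turn a sample i from the unknown distribution into a sample of the
    stretched distribution: a uniform point of the block of i."""
    start, end = blocks[i]
    return rng.randrange(start, end)


def stretch_distribution(q, blocks):
    """Density of the stretched distribution (for checking only; the tester
    never sees q explicitly)."""
    out = []
    for qi, (start, end) in zip(q, blocks):
        out.extend([qi / (end - start)] * (end - start))
    return out


p = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
q = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]
M, blocks = stretch_blocks(p)
print(M, blocks, stretch_distribution(q, blocks))
```

In this small example the $L_1$ distance between the stretched distribution and the uniform distribution over the new domain equals the $L_1$ distance between the original pair, a sanity check consistent with the claim that the reduction preserves distances.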

For the rest of this section, we focus on the problem of uniformity testing. For notational convenience, we will use to denote the unknown distribution and to denote the uniform distribution over . The rough idea is to consider an appropriate collection of interval partitions of and call a standard -uniformity tester for each of these partitions. To make such an approach work and give us a sample optimal algorithm for our -uniformity testing problem we need to use a subtle and strong property of uniformity testing, namely its performance guarantee under the norm. We elaborate on this point below.

For any partition of into intervals by definition we have that Therefore, if , we will also have . The issue is that can be much smaller than ; in fact, it is not difficult to construct examples where and In particular, it is possible for the points where is larger than , and where it is smaller than to cancel each other out within each interval in the partition, thus making the partition useless for distinguishing from . In other words, if the partition is not “good”, we may not be able to detect any existing discrepancy. A simple, but suboptimal, way to circumvent this issue is to consider a partition of into intervals of the same length. Note that each such interval will have probability mass under the uniform distribution . If the constant in the big- is appropriately selected, say , it is not hard to show that ; hence, we will necessarily detect a large discrepancy for the reduced distribution. By applying the optimal uniformity tester this approach will require samples.
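The cancellation phenomenon described above is easy to reproduce numerically. In the following Python sketch (all names are illustrative), a distribution over 8 points alternates between mass above and below uniform; a partition into 4 equal-length intervals hides the discrepancy completely, while the partition into singletons reveals all of it.

```python
from fractions import Fraction


def equal_length_reduction(p, num_intervals):
    """Reduce p onto num_intervals consecutive intervals of equal length."""
    n = len(p)
    if n % num_intervals:
        raise ValueError("num_intervals must divide the domain size")
    width = n // num_intervals
    return [sum(p[i:i + width]) for i in range(0, n, width)]


def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


n = 8
uniform = [Fraction(1, n)] * n
# Alternating "highs" (3/16) and "lows" (1/16) around the uniform mass 2/16.
q = [Fraction(3, 16), Fraction(1, 16)] * (n // 2)

for ell in (4, 8):
    d = l1_distance(equal_length_reduction(q, ell),
                    equal_length_reduction(uniform, ell))
    print(ell, d)
```

With 4 intervals each high cancels against its neighbouring low, so the reduced distributions coincide; with 8 intervals the reduced distance equals the full $L_1$ distance of $1/2$.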

A key tool that is essential in our analysis is a strong property of uniformity testing. An optimal uniformity tester for can distinguish between the uniform distribution and the case that using samples. However, a stronger guarantee is possible: With the same sample size, we can distinguish the uniform distribution from the case that . We emphasize that such a strong guarantee is specific to uniformity testing, and is provably not possible for the general problem of identity testing. In previous work, Goldreich and Ron [GR00] gave such an guarantee for uniformity testing, but their algorithm uses samples. Paninski’s uniformity tester works for the norm, and it is not known whether it achieves the desired property. As one of our main tools we show the following guarantee, which is optimal as a function of and :

###### Theorem 4.

Given and sample access to a distribution over , there is an algorithm Test-Uniformity- which uses samples from , runs in time linear in its sample size, and with probability at least distinguishes whether versus .

To prove Theorem 4 we show that a variant of Pearson’s chi-squared test [Pea00] – which can be viewed as a special case of the recent “chi-square type” testers in [CDVV14, VV14] – has the desired guarantee. While this tester has been (implicitly) studied in [CDVV14, VV14], and it is known to be sample optimal with respect to the norm, it has not been previously analyzed for the norm. The novelty of Theorem 4 lies in the tight analysis of the algorithm under the distance, and is presented in Appendix A.
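For intuition, here is a hedged Python sketch of one common chi-square-type statistic for uniformity: the sum over domain elements of $(X_i - m/n)^2 - X_i$, where $X_i$ is the number of occurrences of element $i$ among $m$ samples. If the number of samples is itself Poisson with mean $m$, each term has expectation $m^2 (q_i - 1/n)^2$, so the statistic is an unbiased estimate of $m^2$ times the squared Euclidean distance from uniform. This is offered as an illustration only: it is not claimed to be the exact statistic analyzed in Appendix A, the sketch uses a fixed sample size for simplicity, and the function names and the `threshold` parameter are assumptions.

```python
from collections import Counter
from fractions import Fraction


def chi_square_type_statistic(samples, n):
    """Sum over i in range(n) of (X_i - m/n)^2 - X_i, with X_i the count of i.

    Runs in time linear in the number of samples: unseen elements all
    contribute the same amount, (m/n)^2, and are handled in bulk.
    """
    m = len(samples)
    if any(not 0 <= x < n for x in samples):
        raise ValueError("samples must lie in range(n)")
    expected = Fraction(m, n)
    counts = Counter(samples)
    unseen = n - len(counts)
    z = unseen * expected ** 2
    z += sum((x - expected) ** 2 - x for x in counts.values())
    return z


def accepts_uniform(samples, n, threshold):
    """Hypothetical decision rule: accept when the statistic is small.
    The right threshold depends on the analysis and is left as a parameter."""
    return chi_square_type_statistic(samples, n) <= threshold


print(chi_square_type_statistic([0, 0, 2], 3))
```

Subtracting $X_i$ inside the sum removes the variance contribution of each count, which is what makes the statistic centered at zero under the uniform distribution in the Poissonized setting.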

Armed with Theorem 4 we proceed as follows: We consider a set of different partitions of the domain into intervals. For the partition consists of many intervals , , i.e., . For a fixed value of , all intervals in have the same length, or equivalently, the same probability mass under the uniform distribution. Then, for any fixed , we have for all (Observe that, by our aforementioned reduction to the uniform case, we may assume that the domain size is a multiple of , and thus that it is possible to evenly divide into such intervals of the same length).

Note that if , then for all , it holds . Recalling that all intervals in have the same probability mass under , it follows that , i.e., is the uniform distribution over its support. So, if , for any partition we have . Our main structural result (Lemma 6) is a robust inverse lemma: If is far from uniform in distance then, for at least one of the partitions , the reduced distribution will be far from uniform in distance. The quantitative version of this statement is quite subtle. In particular, we start from the assumption of being -far in distance and can only deduce “far” in distance. This is absolutely critical for us to be able to obtain the optimal sample complexity.

The key insight for the analysis comes from noting that the optimal partition separating from in distance cannot have too many parts. Thus, if the “highs” and “lows” cancel out over some small intervals, they must be very large in order to compensate for the fact that they are relatively narrow. Therefore, when and differ on a smaller scale, their discrepancy will be greater, and this compensates for the fact that the partition detecting this discrepancy will need to have more intervals in it.

In Section 3 we present our sample optimal uniformity tester under the distance, thereby establishing Theorem 1.

## 3 Testing Uniformity under the $\mathcal{A}_k$-norm

Algorithm Test-Uniformity-

Input: sample access to a distribution over , with , and .

Output: “YES” if ; “NO” if

1. Draw a sample of size from .
2. Fix such that Consider the collection of partitions of into intervals; the partition consists of many intervals with , where .
3. For :
   (a) Consider the reduced distributions and . Use the sample to simulate samples from .
   (b) Run Test-Uniformity- for for a sufficiently small constant and , i.e., test whether versus
4. If all the testers in Step 3(b) output “YES”, then output “YES”; otherwise output “NO”.
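The sample-reuse step of the algorithm can be sketched as follows: a single sample over the original domain is mapped, for each partition, to the index of the interval containing each point, which yields a sample from the corresponding reduced distribution. This is a minimal sketch under stated assumptions: the domain is taken to be 0-indexed, and the number of intervals at level $j$ is assumed to be $k \cdot 2^j$ (consistent with the factor $k 2^j$ appearing in the analysis, but not spelled out in the text above). The function names are hypothetical.

```python
def simulate_reduced_samples(samples, n, num_intervals):
    """Map each sample in {0, ..., n-1} to the index of its equal-length interval."""
    if n % num_intervals != 0:
        raise ValueError("domain size must be a multiple of the number of intervals")
    length = n // num_intervals
    return [s // length for s in samples]


def multiscale_reduced_samples(samples, n, k, num_levels):
    """Reuse one sample at every scale.

    ASSUMPTION: level j uses k * 2**j intervals; the exact parameters of the
    algorithm are not reproduced here.
    """
    per_level = {}
    for j in range(num_levels):
        ell = k * 2 ** j
        per_level[j] = simulate_reduced_samples(samples, n, ell)
    return per_level


levels = multiscale_reduced_samples([0, 3, 4, 7], n=8, k=2, num_levels=2)
```

Each per-level sample could then be handed to an $\ell_2$ uniformity tester with the appropriate accuracy parameter for that level.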

###### Proposition 5.

The algorithm Test-Uniformity-, on input a sample of size drawn from a distribution over , and an integer with , correctly distinguishes the case that from the case that , with probability at least .

###### Proof.

First, it is straightforward to verify the claimed sample complexity, as the algorithm only draws samples in Step 1. Note that the algorithm uses the same set of samples for all testers in Step 3(b). By Theorem 4, the tester Test-Uniformity-, on input a set of samples from distinguishes the case that from the case that with probability at least . From our choice of parameters it can be verified that , hence we can use the same sample as input to these testers for all . In fact, it is easy to see that , which implies that the overall algorithm runs in sample-linear time. Since each tester in Step 3(b) has error probability , by a union bound over all , the total error probability is at most Therefore, with probability at least all the testers in Step 3(b) succeed. We will henceforth condition on this “good” event, and establish the completeness and soundness properties of the overall algorithm under this conditioning.

We start by establishing completeness. If , then for any partition , , we have that . By our aforementioned conditioning, all testers in Step 3(b) will output “YES”, hence the overall algorithm will also output “YES”, as desired.

We now proceed to establish the soundness of our algorithm. Assuming that , we want to show that the algorithm Test-Uniformity- outputs “NO” with probability at least . Towards this end, we prove the following structural lemma:

###### Lemma 6.

There exists a constant such that the following holds: If , there exists with such that

Given the lemma, the soundness property of our algorithm follows easily. Indeed, since all testers Test-Uniformity- of Step 3(b) are successful by our conditioning, Lemma 6 implies that at least one of them outputs “NO”, hence the overall algorithm will output “NO”. ∎

The proof of Lemma 6 in its full generality is quite technical. For the sake of the intuition, in the following subsection (Section 3.1) we provide a proof of the lemma for the important special case that the unknown distribution is promised to be -flat, i.e., piecewise constant with pieces. This setting captures many of the core ideas and, at the same time, avoids some of the necessary technical difficulties of the general case. Finally, in Section 3.2 we present our proof for the general case.

### 3.1 Proof of Structural Lemma: $k$-flat Case

For this special case we will prove the lemma for . Since is -flat there exists a partition of into intervals so that is constant within each such interval. This in particular implies that , where . For let us denote by the value of within interval , that is, for all and we have . For notational convenience, we sometimes use to denote the value of within interval . By assumption we have that

Throughout the proof, we work with intervals such that We will henceforth refer to such intervals as troughs and will denote by the corresponding set of indices, i.e., . For each trough we define its depth as and its width as Note that the width of is identified with the probability mass that the uniform distribution assigns to it. The discrepancy of a trough is defined by and corresponds to the contribution of to the distance between and .

It follows from Scheffé’s identity that half of the contribution to comes from troughs, namely An important observation is that we may assume that all troughs have width at most , at the cost of potentially doubling the total number of intervals. Indeed, it is easy to see that we can artificially subdivide “wider” troughs so that each new trough has width at most . This process comes at the expense of at most doubling the number of troughs. Let us denote by this set of (new) troughs, where and each is a subset of some . We will henceforth deal with the set of troughs each of width at most . By construction, it is clear that

$$\|q-p\|_1^{T'} \stackrel{\mathrm{def}}{=} \sum_{j \in T'} \mathrm{Discr}(\tilde{I}_j) = \|q-p\|_1^{T} \ge \varepsilon/2. \tag{1}$$
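The subdivision step can be illustrated with a short sketch that splits every trough wider than $1/k$ (the upper bound appearing in (3)) into equal pieces of width at most $1/k$. Since the total width of all troughs is at most $1$, fewer than $k$ extra pieces are created, so starting from at most $k$ troughs the count at most doubles. The function name and the use of exact `Fraction` widths are assumptions made for this example.

```python
from fractions import Fraction
from math import ceil


def subdivide_troughs(widths, k):
    """Split each trough width into equal pieces of width at most 1/k."""
    max_width = Fraction(1, k)
    pieces = []
    for w in widths:
        parts = max(1, ceil(w / max_width))  # number of pieces needed for this trough
        pieces.extend([w / parts] * parts)
    return pieces


widths = [Fraction(1, 2), Fraction(1, 8), Fraction(1, 4)]
pieces = subdivide_troughs(widths, k=4)
```

Because the distribution is constant on each trough in the flat case, splitting a trough splits its discrepancy proportionally, so the total discrepancy is unchanged.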

At this point we note that we can essentially ignore troughs with small discrepancy. Indeed, the total contribution of intervals with to the LHS of (1) is at most . Let be the subset of corresponding to troughs with discrepancy at least , i.e., if and only if and Then, we have that

$$\|q-p\|_1^{T^*} \stackrel{\mathrm{def}}{=} \sum_{j \in T^*} \mathrm{Discr}(\tilde{I}_j) \ge 2\varepsilon/5. \tag{2}$$

Observe that for any interval it holds . Note that this part of the argument depends critically on considering only troughs. Hence, for we have that

$$\varepsilon/(20k) \le \mathrm{Width}(\tilde{I}_j) \le 1/k. \tag{3}$$

Thus far we have argued that a constant fraction of the contribution to comes from troughs whose width satisfies (3). Our next crucial claim is that each such trough must have a “large” overlap with one of the intervals considered by our algorithm Test-Uniformity-. In particular, consider a trough . We claim that there exists and such that and so that . To see this we first pick a so that Since the have width less than half that of , must intersect at least three of these intervals. Thus, any but the two outermost such intervals will be entirely contained within , and furthermore has width
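The overlap claim can be checked on concrete intervals. The sketch below models the domain as $[0, 1)$ with exact fractions, assumes that level $j$ consists of $k \cdot 2^j$ equal-width intervals (an assumption for this example), and searches for the coarsest level whose intervals are narrower than half the trough. At that level, the first grid interval starting at or after the left endpoint of the trough lies entirely inside it. The function name and the `max_level` cutoff are hypothetical.

```python
from fractions import Fraction
from math import ceil


def contained_interval(left, right, k, max_level=60):
    """Return (j, i) such that [i/ell, (i+1)/ell], ell = k * 2**j, lies in [left, right]."""
    width = right - left
    for j in range(max_level):
        ell = k * 2 ** j
        if Fraction(1, ell) < width / 2:
            i = ceil(left * ell)  # index of the first grid point at or after `left`
            return j, i
    raise ValueError("trough too narrow for the given number of levels")


left, right = Fraction(3, 16), Fraction(7, 16)
j, i = contained_interval(left, right, k=2)
ell = 2 * 2 ** j
```

When the trough has width at most $1/k$, the level found is at least $1$, and because the previous level was too coarse, the contained interval has width at least a quarter of the trough width.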

Since the interval is a “domain point” for the reduced distribution , the error between and incurred by this element is at least , and the corresponding error is at least where the inequality follows from the fact that . Hence, we have that

$$\left\| q_r^{\mathcal{I}^{(j+1)}} - U_{\ell_{j+1}} \right\|_2^2 \ge \frac{\varepsilon}{320k} \cdot \mathrm{Discr}(J). \tag{4}$$

As shown above, for every trough there exists a level such that (4) holds. Hence, summing (4) over all levels we obtain

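$$\sum_{j=0}^{j_0-1} \left\| q_r^{\mathcal{I}^{(j+1)}} - U_{\ell_{j+1}} \right\|_2^2 \ge \frac{\varepsilon}{320k} \cdot \sum_{j \in T^*} \mathrm{Discr}(\tilde{I}_j) \ge \frac{\varepsilon^2}{800k}, \tag{5}$$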

where the second inequality follows from (2). Note that

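$$\sum_{j=0}^{j_0-1} \gamma_j^2 \le \sum_{j=0}^{j_0-1} \frac{\varepsilon^2 \cdot 2^{3j/4}}{80^2 \cdot k 2^j} = \frac{\varepsilon^2}{6400k} \sum_{j=0}^{j_0-1} 2^{-j/4} < \frac{\varepsilon^2}{800k}.$$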

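Therefore, by the above, we must have that $\left\| q_r^{\mathcal{I}^{(j+1)}} - U_{\ell_{j+1}} \right\|_2^2 > \gamma_j^2$ for some $0 \le j \le j_0 - 1$. This completes the proof of Lemma 6 for the special case of $q$ being $k$-flat.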
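The last inequality in the bound on $\sum_j \gamma_j^2$ relies on the geometric sum $\sum_{j \ge 0} 2^{-j/4}$ being smaller than $8$, so that $\varepsilon^2/(6400k)$ times the sum stays below $\varepsilon^2/(800k)$. A quick numerical check of this constant, with hypothetical helper names, is:

```python
def geometric_partial_sum(j0):
    """Return sum_{j=0}^{j0-1} 2^(-j/4)."""
    return sum(2 ** (-j / 4) for j in range(j0))


# Closed form of the infinite sum: 1 / (1 - 2^(-1/4)).
limit = 1 / (1 - 2 ** (-0.25))

# The bound needs the sum to be below 6400 / 800 = 8.
required = 6400 / 800
partial = geometric_partial_sum(200)
```

Every partial sum is bounded by the closed-form limit, which is roughly $6.29$ and therefore comfortably below $8$.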

### 3.2 Proof of Structural Lemma: General Case

To prove the general version of our structural result for the distance, we will need to choose an appropriate value for the universal constant . We show that it is sufficient to take