# Testing k-Modal Distributions: Optimal Algorithms via Reductions

Constantinos Daskalakis
MIT
costis@csail.mit.edu. Research supported by NSF CAREER award CCF-0953960 and by a Sloan Foundation Fellowship.
Ilias Diakonikolas
UC Berkeley
ilias@cs.berkeley.edu. Research supported by a Simons Foundation Postdoctoral Fellowship. Some of this work was done while at Columbia University, supported by NSF grant CCF-0728736, and by an Alexander S. Onassis Foundation Fellowship.
Rocco A. Servedio
Columbia University
rocco@cs.columbia.edu. Supported by NSF grants CCF-0347282 and CCF-0523664.
Gregory Valiant
UC Berkeley
gregory.valiant@gmail.com. Supported by an NSF graduate research fellowship.
Paul Valiant
UC Berkeley
pvaliant@gmail.com. Supported by an NSF postdoctoral research fellowship.
###### Abstract

We give highly efficient algorithms, and almost matching lower bounds, for a range of basic statistical problems that involve testing and estimating the (total variation) distance between two $k$-modal distributions $p$ and $q$ over the discrete domain $\{1,\dots,n\}$. More precisely, we consider the following four problems: given sample access to an unknown $k$-modal distribution $p$,

Testing identity to a known or unknown distribution:

1. Determine whether $p = q$ (for an explicitly given $k$-modal distribution $q$) versus $p$ is $\epsilon$-far from $q$;

2. Determine whether $p = q$ (where $q$ is available via sample access) versus $p$ is $\epsilon$-far from $q$;

Estimating distance (“tolerant testing”) against a known or unknown distribution:

3. Approximate $d_{TV}(p,q)$ to within additive $\epsilon$, where $q$ is an explicitly given $k$-modal distribution;

4. Approximate $d_{TV}(p,q)$ to within additive $\epsilon$, where $q$ is available via sample access.

For each of these four problems we give sub-logarithmic sample algorithms, which we show are tight up to additive $\mathrm{poly}(k)$ and multiplicative $\mathrm{polylog}\log n + \mathrm{polylog}\, k$ factors. Thus our bounds significantly improve the previous results of [BKR04], which were for testing identity of distributions (items (1) and (2) above) in the special cases $k=0$ (monotone distributions) and $k=1$ (unimodal distributions) and required $O((\log n)^3)$ samples.

As our main conceptual contribution, we introduce a new reduction-based approach for distribution-testing problems that lets us obtain all the above results in a unified way. Roughly speaking, this approach enables us to transform various distribution testing problems for $k$-modal distributions over $\{1,\dots,n\}$ to the corresponding distribution testing problems for unrestricted distributions over a much smaller domain $\{1,\dots,\ell\}$ where $\ell = O(k \log n)$.

## 1 Introduction

Given samples from a pair of unknown distributions, the problem of “identity testing” (that is, distinguishing whether the two distributions are the same versus significantly different) and, more generally, the problem of estimating the $L_1$ distance between the distributions, is perhaps the most fundamental statistical task. Despite a long history of study, by both the statistics and computer science communities, the sample complexities of these basic tasks were only recently established. Identity testing, given samples from a pair of distributions of support $[n]$, can be done using $\tilde{O}(n^{2/3})$ samples [BFR00], and this upper bound is optimal up to $\mathrm{polylog}(n)$ factors [Val08a]. Estimating the $L_1$ distance (“tolerant testing”) between distributions of support $[n]$ requires $\Theta(n/\log n)$ samples, and this is tight up to constant factors [VV11a, VV11b]. The variants of these problems when one of the two distributions is explicitly given require $\tilde{\Theta}(\sqrt{n})$ samples for identity testing [BFF01] and $\Theta(n/\log n)$ samples for distance estimation [VV11a, VV11b] respectively.

While it is surprising that these tasks can be performed using a sublinear number of samples, for many real-world applications using $\sqrt{n}$, $n^{2/3}$, or $n/\log n$ samples is still impractical. As these bounds characterize worst-case instances, one might hope that drastically better performance may be possible for many settings typically encountered in practice. Thus, a natural research direction, which we pursue in this paper, is to understand how structural properties of the distributions in question may be leveraged to yield improved sample complexities.

In this work we consider monotone, unimodal, and more generally $k$-modal distributions. Monotone, unimodal, and bimodal distributions abound in the natural world. The distributions of many measurements (heights or weights of members of a population, concentrations of various chemicals in cells, parameters of many atmospheric phenomena) often belong to this class. Because of their ubiquity, much work in the natural sciences rests on the analysis of such distributions (for example, on November 1, 2011 a Google Scholar search for the exact phrase “bimodal distribution” in the bodies of papers returned more than 90,000 hits). Though perhaps not as pervasive, $k$-modal distributions for larger values of $k$ commonly arise as mixtures of unimodal distributions and are natural objects of study. On the theoretical side, motivated by the many applications, monotone, unimodal, and $k$-modal distributions have been intensively studied in the probability and statistics literatures for decades; see e.g. [Gre56, Rao69, BBBB72, CKC83, Gro85, Bir87a, Bir87b, Kem91, Fou97, CT04, JW09].

### 1.1 Our results.

Our main results are algorithms, and nearly matching lower bounds, that give a complete picture of the sample complexities of identity testing and estimating distance for monotone and $k$-modal distributions. We obtain such results in both the setting where the two distributions are given via samples, and the setting where one of the distributions is given via samples and the other is described explicitly.

All our results have the nature of a reduction: performing these tasks on -modal distributions over turns out to have almost exactly the same sample complexities as performing the corresponding tasks on arbitrary distributions over . For any small constant (or even ) and arbitrarily small constant , all our results are tight to within either or factors. See Table 1 for the new sample complexity upper and lower bounds for the monotone and -modal tasks; see Section 2 for the (exponentially higher) sample complexities of the general-distribution tasks on which our results rely. While our main focus is on sample complexity rather than running time, we note that all of our algorithms run in bit operations (note that even reading a single sample from a distribution over takes bit operations).

We view the equivalence between the sample complexity of each of the above tasks on a monotone or unimodal distribution of domain and the sample complexity of the same task on an unrestricted distribution of domain as a surprising result, because such an equivalence fails to hold for related estimation tasks. For example, consider the task of distinguishing whether a distribution on is uniform versus far from uniform. For general distributions this takes samples, so one might expect the corresponding problem for monotone distributions to need samples; in fact, however, one can test this with a constant number of samples, by simply comparing the empirically observed probability masses of the left and right halves of the domain. An example in the other direction is the problem of finding a constant additive estimate for the entropy of a distribution. On domains of size this can be done in samples, and thus one might expect to be able to estimate entropy for monotone distributions on using samples. Nevertheless, it is not hard to see that samples are required.

The reduction-like techniques which we use to establish both our algorithmic results and our lower bounds (discussed in more detail in Section 1.2 below) reveal an unexpected relationship between the class of -modal distributions of support and the class of general distributions of support . We hope that this reduction-based approach may provide a framework for the discovery of other relationships that will be useful in future work in the extreme sublinear regime of statistical property estimation and property testing.

Comparison with prior work. Our results significantly extend and improve upon the previous algorithmic results of Batu et al [BKR04] for identity testing of monotone or unimodal () distributions, which required samples. More recently, [DDS11] established the sample complexity of learning -modal distributions to be essentially . Such a learning algorithm easily yields a testing algorithm with the same sample complexity for all four variants of the testing problem (one can simply run the learner twice to obtain hypotheses and that are sufficiently close to and respectively, and output accordingly).

While the [DDS11] result can be applied to our testing problems (though giving suboptimal results), we stress that the ideas underlying [DDS11] and this paper are quite different. The [DDS11] paper learns a -modal distribution by using a known algorithm for learning monotone distributions [Bir87b] times in a black-box manner; the notion of reducing the domain size—which we view as central to the results and contributions of this paper—is nowhere present in [DDS11]. By contrast, the focus in this paper is on introducing the use of reductions as a powerful (but surprisingly, seemingly previously unused) tool in the development of algorithms for basic statistical tasks on distributions, which, at least in this case, is capable of giving essentially optimal upper and lower bounds for natural restricted classes of distributions.

### 1.2 Techniques.

Our main conceptual contribution is a new reduction-based approach that lets us obtain all our upper and lower bounds in a clean and unified way. The approach works by reducing the monotone and -modal distribution testing problems to general distribution testing and estimation problems over a much smaller domain, and vice versa. For the monotone case this smaller domain is essentially of size , and for the -modal case the smaller domain is essentially of size By solving the general distribution problems over the smaller domain using known results we get a valid answer for the original (monotone or -modal) problems over domain . More details on our algorithmic reduction are given in Section A.

Conversely, our lower bound reduction lets us reexpress arbitrary distributions over a small domain by monotone (or unimodal, or -modal, as required) distributions over an exponentially larger domain, while preserving many of their features with respect to the distance. Crucially, this reduction allows one to simulate drawing samples from the larger monotone distribution given access to samples from the smaller distribution, so that a known impossibility result for unrestricted distributions on may be leveraged to yield a corresponding impossibility result for monotone (or unimodal, or -modal) distributions on the exponentially larger domain.

The inspiration for our results is an observation of Birgé [Bir87b] that given a monotone-decreasing probability distribution over $[n]$, if one subdivides $[n]$ into an exponentially increasing series of consecutive sub-intervals, the $i$th having size $(1+\epsilon)^i$, then if one replaces the probability mass on each interval with a uniform distribution on that interval, the distribution changes by only $O(\epsilon)$ in total variation distance. Further, given such a subdivision of the support into $\log_{1+\epsilon}(n)$ intervals, one may treat the original monotone distribution as essentially a distribution over these intervals, namely a distribution of support $\log_{1+\epsilon}(n)$. In this way, one may hope to reduce monotone distribution testing or estimation on $[n]$ to general distribution testing or estimation on a domain of size $\log_{1+\epsilon}(n)$, and vice versa. See Section B for details.

For the monotone testing problems the partition into subintervals is constructed obliviously (without drawing any samples or making any reference to $p$ or $q$ of any sort): for a given value of $\epsilon$ the partition is the same for all non-increasing distributions. For the $k$-modal testing problems, constructing the desired partition is significantly more involved. This is done via a careful procedure which uses samples from $p$ and $q$ and uses the oblivious decomposition for monotone distributions in a delicate way. (Intuitively, the partition must be finer in regions of higher probability density; for non-increasing distributions, for example, this region is at the left side of the domain, but for general $k$-modal distributions, one must draw samples to discover the high-probability regions.) This construction is given in Section C.

## 2 Notation and Preliminaries

### 2.1 Notation.

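We write $[n]$ to denote the set $\{1,\dots,n\}$, and for integers $i \le j$ we write $[i,j]$ to denote the set $\{i, i+1, \dots, j\}$. We consider discrete probability distributions over $[n]$, which are functions $p : [n] \to [0,1]$ such that $\sum_{i=1}^{n} p(i) = 1$. For $S \subseteq [n]$ we write $p(S)$ to denote $\sum_{i \in S} p(i)$. We use the notation $P$ for the cumulative distribution function (cdf) corresponding to $p$, i.e. $P : [n] \to [0,1]$ is defined by $P(j) = \sum_{i=1}^{j} p(i)$.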

A distribution $p$ over $[n]$ is non-increasing (resp. non-decreasing) if $p(i+1) \le p(i)$ (resp. $p(i+1) \ge p(i)$), for all $i \in [n-1]$; $p$ is monotone if it is either non-increasing or non-decreasing. Thus the “orientation” of a monotone distribution is either non-decreasing (denoted $\uparrow$) or non-increasing (denoted $\downarrow$).

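We call a nonempty interval $I = [a,b] \subseteq [2, n-1]$ a max-interval of $p$ if $p(i) = c$ for all $i \in I$ and $\max\{p(a-1), p(b+1)\} < c$. Analogously, a min-interval of $p$ is an interval $I = [a,b] \subseteq [2, n-1]$ with $p(i) = c$ for all $i \in I$ and $\min\{p(a-1), p(b+1)\} > c$. We say that $p$ is $k$-modal if it has at most $k$ max-intervals and min-intervals. We note that according to our definition, what is usually referred to as a bimodal distribution is a $3$-modal distribution.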

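Let $p, q$ be distributions over $[n]$ with corresponding cdfs $P, Q$. The total variation distance between $p$ and $q$ is $d_{TV}(p,q) := \max_{S \subseteq [n]} |p(S) - q(S)| = \frac{1}{2} \sum_{i \in [n]} |p(i) - q(i)|$. The Kolmogorov distance between $p$ and $q$ is defined as $d_K(p,q) := \max_{j \in [n]} |P(j) - Q(j)|$. Note that $d_K(p,q) \le d_{TV}(p,q)$.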
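To make the two distances concrete, here is a minimal Python sketch that computes both for distributions given as explicit probability lists. It assumes the standard definitions (total variation as half the $L_1$ distance, Kolmogorov distance as the largest gap between the cdfs); the function names are illustrative and do not come from the paper.

```python
from itertools import accumulate

def tv_distance(p, q):
    """Total variation distance: half the L1 distance between the pmfs."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kolmogorov_distance(p, q):
    """Kolmogorov distance: largest absolute gap between the two cdfs."""
    return max(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))

# Two illustrative distributions over a 4-element domain.
p = [0.4, 0.1, 0.4, 0.1]
q = [0.1, 0.4, 0.1, 0.4]

# The Kolmogorov distance never exceeds the total variation distance.
print(tv_distance(p, q), kolmogorov_distance(p, q))
```

In this example the pointwise differences partly cancel in the cdfs, so the Kolmogorov distance is strictly smaller than the total variation distance.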

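Finally, a sub-distribution is a function $q : [n] \to [0,1]$ which satisfies $\sum_{i=1}^{n} q(i) \le 1$. For a distribution $p$ over $[n]$ and $I \subseteq [n]$, the restriction of $p$ to $I$ is the sub-distribution $p^{I}$ defined by $p^{I}(i) = p(i)$ if $i \in I$ and $p^{I}(i) = 0$ otherwise. Likewise, we denote by $p_I$ the conditional distribution of $p$ on $I$, i.e. $p_I(i) = p(i)/p(I)$ if $i \in I$ and $p_I(i) = 0$ otherwise.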

### 2.2 Basic tools from probability.

We will require the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality ([DKW56]) from probability theory. This basic fact says that $O(1/\epsilon^2)$ samples suffice to learn any distribution within error $\epsilon$ with respect to the Kolmogorov distance. More precisely, let $p$ be any distribution over $[n]$. Given $m$ independent samples $s_1, \dots, s_m$ drawn from $p$, the empirical distribution $\hat{p}_m$ is defined as follows: for all $i \in [n]$, $\hat{p}_m(i) = |\{ j \in [m] : s_j = i \}| / m$. The DKW inequality states that for $m = \Omega((1/\epsilon^2) \cdot \ln(1/\delta))$, with probability $1 - \delta$ the empirical distribution $\hat{p}_m$ will be $\epsilon$-close to $p$ in Kolmogorov distance. This sample bound is asymptotically optimal and independent of the support size.

###### Theorem 1 ([DKW56, Mas90]).

For all $\epsilon > 0$, it holds: $\Pr[d_K(p, \hat{p}_m) > \epsilon] \le 2e^{-2m\epsilon^2}$.
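As a rough illustration of how the DKW inequality is used, the sketch below solves the standard form of the bound, $2e^{-2m\epsilon^2} \le \delta$, for the sample count $m$, and builds an empirical distribution from samples. The helper names are hypothetical, and the constant follows the standard (Massart) form of the inequality rather than any constant fixed by the paper.

```python
import math
from collections import Counter

def dkw_sample_size(eps, delta):
    """Smallest integer m with 2 * exp(-2 * m * eps**2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

def empirical_distribution(samples, n):
    """Empirical pmf over {1, ..., n}: fraction of samples equal to each point."""
    counts = Counter(samples)
    m = len(samples)
    return [counts.get(i, 0) / m for i in range(1, n + 1)]

# Samples needed for Kolmogorov error 0.1 with failure probability 0.05.
m = dkw_sample_size(0.1, 0.05)
print(m)
print(empirical_distribution([1, 1, 2, 4], 4))
```

Note that the sample count depends only on $\epsilon$ and $\delta$, not on the domain size $n$.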

Another simple result that we will need is the following, which is easily verified from first principles:

###### Observation 1.

Let $I = [a,b] \subseteq [n]$ be an interval and let $u_I$ denote the uniform distribution over $I$. Let $p_I$ denote a non-increasing distribution over $I$. Then for every initial interval $I' = [a, b']$ of $I$, we have $u_I(I') \le p_I(I')$.

### 2.3 Testing and estimation for arbitrary distributions

Our testing algorithms work by reducing to known algorithms for testing arbitrary distributions over an -element domain. We will use the following well known results:

###### Theorem 2 (testing identity, known distribution [BFF+01]).

Let be an explicitly given distribution over . Let be an unknown distribution over that is accessible via samples. There is a testing algorithm Test-Identity-Known that uses samples from and has the following properties:

• If then with probability at least the algorithm outputs “accept;” and

• If then with probability at least the algorithm outputs “reject.”

###### Theorem 3 (testing identity, unknown distribution [BFR+10]).

Let and both be unknown distributions over that are accessible via samples. There is a testing algorithm Test-Identity-Unknown that uses samples from and and has the following properties:

• If then with probability at least the algorithm outputs “accept;” and

• If then with probability at least the algorithm outputs “reject.”

###### Theorem 4 (L1 estimation [VV11b]).

Let be an unknown distribution over that is accessible via samples, and let be a distribution over that is either explicitly given, or accessible via samples. There is an estimator -Estimate that, with probability at least , outputs a value in the interval The algorithm uses samples.

## 3 Testing and Estimating Monotone Distributions

### 3.1 Oblivious decomposition of monotone distributions

Our main tool for testing monotone distributions is an oblivious decomposition of monotone distributions that is a variant of a construction of Birgé [Bir87b]. As we will see it enables us to reduce the problem of testing a monotone distribution to the problem of testing an arbitrary distribution over a much smaller domain.

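Before stating the decomposition, some notation will be helpful. Fix a distribution $p$ over $[n]$ and a partition $\mathcal{I}$ of $[n]$ into disjoint intervals $\mathcal{I} := \{I_i\}_{i=1}^{\ell}$. The flattened distribution $(p_f)^{\mathcal{I}}$ corresponding to $p$ and $\mathcal{I}$ is the distribution over $[n]$ defined as follows: for $j \in [\ell]$ and $i \in I_j$, $(p_f)^{\mathcal{I}}(i) = \sum_{t \in I_j} p(t) / |I_j|$. That is, $(p_f)^{\mathcal{I}}$ is obtained from $p$ by averaging the weight that $p$ assigns to each interval over the entire interval. The reduced distribution $(p_r)^{\mathcal{I}}$ corresponding to $p$ and $\mathcal{I}$ is the distribution over $[\ell]$ that assigns the $i$th point the weight $p$ assigns to the interval $I_i$; i.e., for $i \in [\ell]$, we have $(p_r)^{\mathcal{I}}(i) = p(I_i)$. Note that if $p$ is non-increasing then so is $(p_f)^{\mathcal{I}}$, but this is not necessarily the case for $(p_r)^{\mathcal{I}}$.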
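The flattened and reduced distributions can be sketched in a few lines of Python. This is only an illustration: the representation of intervals as 0-indexed `(start, end)` pairs with `end` exclusive, and the function names, are assumptions made here for convenience.

```python
def flattened(p, intervals):
    """Spread the mass of each interval uniformly over that interval."""
    out = [0.0] * len(p)
    for start, end in intervals:
        mass = sum(p[start:end])
        for i in range(start, end):
            out[i] = mass / (end - start)
    return out

def reduced(p, intervals):
    """One point per interval, carrying the total mass of that interval."""
    return [sum(p[start:end]) for start, end in intervals]

# A non-increasing distribution over 4 points, split into 2 intervals.
p = [0.4, 0.3, 0.2, 0.1]
parts = [(0, 1), (1, 4)]
print(flattened(p, parts))  # lives on the original domain
print(reduced(p, parts))    # lives on a domain with one point per interval
```

Here the flattened distribution stays non-increasing while the reduced one does not, matching the remark above.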

The following definition and simple lemma (the lemma is proved in Section A) show why reduced distributions are useful for us:

###### Definition 1.

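Let $p$ be a distribution over $[n]$ and let $\mathcal{I} = \{I_i\}_{i=1}^{\ell}$ be a partition of $[n]$ into disjoint intervals. We say that $\mathcal{I}$ is a $(p, \epsilon, \ell)$-flat decomposition of $[n]$ if $d_{TV}(p, (p_f)^{\mathcal{I}}) \le \epsilon$.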

###### Lemma 2.

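Let $\mathcal{I} = \{I_i\}_{i=1}^{\ell}$ be a partition of $[n]$ into disjoint intervals. Suppose that $p$ and $q$ are distributions over $[n]$ such that $\mathcal{I}$ is both a $(p, \epsilon, \ell)$-flat decomposition of $[n]$ and is also a $(q, \epsilon, \ell)$-flat decomposition of $[n]$. Then $\left| d_{TV}(p, q) - d_{TV}((p_r)^{\mathcal{I}}, (q_r)^{\mathcal{I}}) \right| \le 2\epsilon$. Moreover, if $p = q$ then $(p_r)^{\mathcal{I}} = (q_r)^{\mathcal{I}}$.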
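One way to see why such a statement should hold is sketched below. It is not the proof from Section A. It assumes that a flat decomposition with parameter $\epsilon$ means $d_{TV}(p, (p_f)^{\mathcal{I}}) \le \epsilon$, and it writes $(p_f)^{\mathcal{I}}$ and $(p_r)^{\mathcal{I}}$ for the flattened and reduced distributions. The first line uses the fact that flattened distributions are constant on each interval; the second is the triangle inequality.

```latex
\begin{align*}
d_{TV}\big((p_f)^{\mathcal{I}}, (q_f)^{\mathcal{I}}\big)
  &= \tfrac{1}{2} \sum_{j=1}^{\ell} \sum_{i \in I_j} \frac{|p(I_j) - q(I_j)|}{|I_j|}
   = \tfrac{1}{2} \sum_{j=1}^{\ell} |p(I_j) - q(I_j)|
   = d_{TV}\big((p_r)^{\mathcal{I}}, (q_r)^{\mathcal{I}}\big), \\
\Big| d_{TV}(p, q) - d_{TV}\big((p_f)^{\mathcal{I}}, (q_f)^{\mathcal{I}}\big) \Big|
  &\le d_{TV}\big(p, (p_f)^{\mathcal{I}}\big) + d_{TV}\big(q, (q_f)^{\mathcal{I}}\big)
   \le 2\epsilon .
\end{align*}
```

Combining the two lines shows that the distance between the reduced distributions differs from $d_{TV}(p,q)$ by at most $2\epsilon$ under these assumptions.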

We now state our oblivious decomposition result for monotone distributions:

###### Theorem 5 ([Bir87b]).

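(oblivious decomposition) Fix any $n \in \mathbb{Z}^{+}$ and $\epsilon > 0$. The partition $\mathcal{I} := \{I_i\}_{i=1}^{\ell}$ of $[n]$ in which the $j$th interval has size $\lfloor (1+\epsilon)^j \rfloor$ has the following properties: $\ell = O((1/\epsilon) \cdot \log(\epsilon \cdot n + 1))$, and $\mathcal{I}$ is a $(p, O(\epsilon), \ell)$-flat decomposition of $[n]$ for any non-increasing distribution $p$ over $[n]$.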
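A minimal sketch of how such an oblivious partition might be built is given below. It assumes the $j$th interval has size $\lfloor (1+\epsilon)^j \rfloor$ (consistent with the exponentially growing sub-intervals described in Section 1.2) and that the final interval is simply truncated at the end of the domain. The truncation rule and the 0-indexed, end-exclusive interval representation are choices made here, not taken from the paper.

```python
import math

def oblivious_partition(n, eps):
    """Partition range(n) into consecutive intervals of geometrically
    growing size floor((1 + eps)**j), truncating the last one at n."""
    intervals, start, j = [], 0, 1
    while start < n:
        size = max(1, math.floor((1 + eps) ** j))
        end = min(n, start + size)
        intervals.append((start, end))
        start, j = end, j + 1
    return intervals

# The partition depends only on n and eps, never on the distribution.
parts = oblivious_partition(1000, 0.5)
print(len(parts), parts[:4])
```

Because no samples are consulted, the same partition can be reused for every non-increasing distribution over the domain, which is what "oblivious" refers to.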

There is an analogous version of Theorem 5, asserting the existence of an “oblivious” partition for non-decreasing distributions (which is of course different from the “oblivious” partition for non-increasing distributions of Theorem 5); this will be useful later.

While our construction is essentially that of Birgé, we note that the version given in [Bir87b] is for non-increasing distributions over a continuous domain, and it is phrased rather differently. Adapting the arguments of [Bir87b] to our discrete setting of distributions over $[n]$ is not conceptually difficult but requires some care. To keep the paper self-contained, we prove the discrete version stated above in Appendix E.

### 3.2 Efficiently testing monotone distributions

Now we are ready to establish our upper bounds on testing monotone distributions (given in the first four rows of Table 1). All of the algorithms are essentially the same: each works by reducing the given monotone distribution testing problem to the same testing problem for arbitrary distributions over support of size using the oblivious decomposition from the previous subsection. For concreteness we explicitly describe the tester for the “testing identity, is known” case below, and then indicate the small changes that are necessary to get the testers for the other three cases.

Test-Identity-Known-Monotone

Inputs: $\epsilon > 0$; sample access to non-increasing distribution $p$ over $[n]$; explicit description of non-increasing distribution $q$ over $[n]$

1. Let $\mathcal{I} := \{I_i\}_{i=1}^{\ell}$ be the partition of $[n]$ given by Theorem 5 (with $\ell$ as in that theorem), which is a flat decomposition of $[n]$ for any non-increasing distribution $p$.

2. Let $(q_r)^{\mathcal{I}}$ denote the reduced distribution over $[\ell]$ obtained from $q$ using $\mathcal{I}$ as defined in Section A.

3. Draw samples from $(p_r)^{\mathcal{I}}$, where $(p_r)^{\mathcal{I}}$ is the reduced distribution over $[\ell]$ obtained from $p$ using $\mathcal{I}$ as defined in Section A.

4. Output the result of Test-Identity-Known on the samples from Step 3.

We now establish our claimed upper bound for the “testing identity, is known” case. We first observe that in Step 3, the desired samples from can easily be obtained by drawing samples from and converting each one to the corresponding draw from in the obvious way. If then , and Test-Identity-Known-Monotone outputs “accept” with probability at least by Theorem 2. If , then by Lemma 2, Theorem 5 and the triangle inequality, we have that , so Test-Identity-Known-Monotone outputs “reject” with probability at least by Theorem 2. For the “testing identity, is unknown” case, the algorithm Test-Identity-Unknown-Monotone is very similar to Test-Identity-Known-Monotone. The differences are as follows: instead of Step 2, in Step 3 we draw samples from and the same number of samples from ; and in Step 4, we run Test-Identity-Unknown using the samples from Step 3. The analysis is exactly the same as above (using Theorem 3 in place of Theorem 2).
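The conversion of a sample from the original distribution into a sample from the reduced distribution amounts to looking up which interval of the partition contains the sample. The sketch below shows one possible way to do this with a binary search over the right endpoints of the intervals; the function name and the endpoint representation are illustrative assumptions.

```python
import bisect

def to_reduced_sample(x, right_endpoints):
    """Map a sample x in {1, ..., n} to the 1-based index of the interval
    containing it. right_endpoints[j] is the largest element of interval j+1,
    and the list must be sorted in increasing order."""
    return bisect.bisect_left(right_endpoints, x) + 1

# Partition of {1, ..., 10} into [1,1], [2,3], [4,6], [7,10].
ends = [1, 3, 6, 10]
samples = [1, 2, 3, 4, 6, 7, 10]
print([to_reduced_sample(x, ends) for x in samples])
```

Each lookup costs time logarithmic in the number of intervals, so simulating sample access to the reduced distribution adds very little overhead.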

We now describe the algorithm -Estimate-Known-Monotone for the “tolerant testing, is known” case. This algorithm takes values and as input, so the partition defined in Step 1 is a -flat decomposition of for any non-increasing In Step 3 the algorithm draws samples and runs -Estimate in Step 4. If then by the triangle inequality we have that and -Estimate-Known-Monotone outputs a value within the prescribed range with probability at least by Theorem 4. The algorithm -Estimate-Unknown-Monotone and its analysis are entirely similar.

## 4 From Monotone to k-modal

In this section we establish our main positive testing results for $k$-modal distributions, the upper bounds stated in the final four rows of Table 1. In the previous section, we were able to use the oblivious decomposition to yield a partition of $[n]$ into relatively few intervals, with the guarantee that the corresponding flattened distribution is close to the true distribution. The main challenge in extending these results to unimodal or $k$-modal distributions is that in order to make the analogous decomposition, one must first determine, by taking samples from the distribution, which regions are monotonically increasing vs. decreasing. Our algorithm Construct-Flat-Decomposition performs this task with the following guarantee:

###### Lemma 3.

Let be a -modal distribution over . Algorithm Construct-Flat-Decomposition draws samples from and outputs a -flat decomposition of with probability at least , where .

The bulk of our work in Section C is to describe Construct-Flat-Decomposition and prove Lemma 3, but first we show how Lemma 3 yields our claimed testing results for $k$-modal distributions. As in the monotone case all four algorithms are essentially the same: each works by reducing the given $k$-modal distribution testing problem to the same testing problem for arbitrary distributions over $[\ell]$. One slight complication is that the partition obtained for distribution $p$ will generally differ from that for $q$. In the monotone distribution setting, the partition was oblivious to the distributions, and thus this concern did not arise. Naively, one might hope that the flattened distribution corresponding to any refinement of a partition will be at least as good as the flattened distribution corresponding to the actual partition. This hope is easily seen to be strictly false, but we show that it is true up to a factor of 2, which suffices for our purposes.

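The following terminology will be useful: Let $\mathcal{I} = \{I_i\}_{i=1}^{r}$ and $\mathcal{I}' = \{I'_i\}_{i=1}^{s}$ be two partitions of $[n]$ into $r$ and $s$ intervals respectively. The common refinement of $\mathcal{I}$ and $\mathcal{I}'$ is the partition $\mathcal{J}$ of $[n]$ into intervals obtained from $\mathcal{I}$ and $\mathcal{I}'$ in the obvious way, by taking all possible nonempty intervals of the form $I_i \cap I'_j$. It is clear that $\mathcal{J}$ is both a refinement of $\mathcal{I}$ and of $\mathcal{I}'$ and that the number of intervals in $\mathcal{J}$ is at most $r + s$. We prove the following lemma in Section A: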

###### Lemma 4.

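Let $p$ be any distribution over $[n]$, let $\mathcal{I} = \{I_i\}_{i=1}^{a}$ be a $(p, \epsilon, a)$-flat decomposition of $[n]$, and let $\mathcal{J} = \{J_i\}_{i=1}^{b}$ be a refinement of $\mathcal{I}$. Then $\mathcal{J}$ is a $(p, 2\epsilon, b)$-flat decomposition of $[n]$.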
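The common refinement of two interval partitions can be computed by merging their breakpoints. The following sketch assumes each partition is a sorted list of 0-indexed `(start, end)` pairs with `end` exclusive that together cover the same domain; this representation and the function name are illustrative choices.

```python
def common_refinement(part_a, part_b):
    """All nonempty intersections of an interval of part_a with an interval
    of part_b, obtained by merging the two sets of breakpoints."""
    cuts = sorted(
        {start for start, _ in part_a}
        | {start for start, _ in part_b}
        | {part_a[-1][1]}  # shared right end of the domain
    )
    return list(zip(cuts, cuts[1:]))

part_a = [(0, 4), (4, 10)]
part_b = [(0, 2), (2, 7), (7, 10)]
refined = common_refinement(part_a, part_b)
print(refined)
```

Since every interval of the refinement begins at a start point of one of the two input partitions, its size is at most the sum of their sizes.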

We describe the Test-Identity-Known-kmodal algorithm below.

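Test-Identity-Known-kmodal

Inputs: $\epsilon > 0$; sample access to $k$-modal distributions $p, q$ over $[n]$

1. Run Construct-Flat-Decomposition on $p$ and let $\mathcal{I}$ be the partition that it outputs. Run Construct-Flat-Decomposition on $q$ and let $\mathcal{I}'$ be the partition that it outputs. Let $\mathcal{J}$ be the common refinement of $\mathcal{I}$ and $\mathcal{I}'$ and let $\ell_{\mathcal{J}}$ be the number of intervals in $\mathcal{J}$.

2. Let $(q_r)^{\mathcal{J}}$ denote the reduced distribution over $[\ell_{\mathcal{J}}]$ obtained from $q$ using $\mathcal{J}$ as defined in Section A.

3. Draw samples from $(p_r)^{\mathcal{J}}$, where $(p_r)^{\mathcal{J}}$ is the reduced distribution over $[\ell_{\mathcal{J}}]$ obtained from $p$ using $\mathcal{J}$ as defined in Section A.

4. Run Test-Identity-Known using the samples from Step 3 and output what it outputs.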

We note that Steps 2, 3 and 4 of Test-Identity-Known-kmodal are the same as the corresponding steps of Test-Identity-Known-Monotone. For the analysis of Test-Identity-Known-kmodal, Lemmas 3 and 4 give us that with probability , the partition obtained in Step 1 is both a -flat and -flat decomposition of ; we condition on this going forward. From this point on the analysis is essentially identical to the analysis for Test-Identity-Known-Monotone and is omitted.

The modifications required to obtain algorithms Test-Identity-Unknown-kmodal, $L_1$-Estimate-Known-kmodal and $L_1$-Estimate-Unknown-kmodal, and the analysis of these algorithms, are completely analogous to the modifications and analyses of Section 3.2 and are omitted.

### 4.1 The Construct-Flat-Decomposition algorithm.

We present Construct-Flat-Decomposition followed by an intuitive explanation. Note that it employs a procedure Orientation, which uses no samples and is presented and analyzed in Section 4.2.

Construct-Flat-Decomposition Inputs: ; sample access to -modal distribution over Initialize Fix . Draw samples from and let denote the resulting empirical distribution (which by Theorem 1 has with probability at least ). Greedily partition the domain into atomic intervals as follows: , where . For , if , then , where is defined as follows: If , then , otherwise, . Construct a set of moderate intervals, a set of heavy points, and a set of negligible intervals as follows: For each atomic interval , if then is declared to be a moderate interval; otherwise we have and we declare to be a heavy point. If then we declare to be a negligible interval. For each interval which is a heavy point, add to Add each negligible interval to For each moderate interval , run procedure Orientation; let be its output. If then add to If then let be the partition of given by Theorem 5 which is a -flat decomposition of for any non-increasing distribution over Add all the elements of to If then let be the partition of given by the dual version of Theorem 5, which is a -flat decomposition of for any non-decreasing distribution over Add all the elements of to Output the partition of .

Roughly speaking, when Construct-Flat-Decomposition constructs a partition $\mathcal{I}$, it initially breaks $[n]$ up into two types of intervals. The first type are intervals that are “okay” to include in a flat decomposition, either because they have very little mass, or because they consist of a single point, or because they are close to uniform. The second type are intervals that are “not okay” to include in a flat decomposition (they have significant mass and are far from uniform), but the algorithm is able to ensure that almost all of these are monotone distributions with a known orientation. It then uses the oblivious decomposition of Theorem 5 to construct a flat decomposition of each such interval. (Note that it is crucial that the orientation is known in order to be able to use Theorem 5.)

In more detail, Construct-Flat-Decomposition works as follows. The algorithm first draws a batch of samples from $p$ and uses them to construct an estimate $\hat{p}$ of the CDF of $p$ (this is straightforward using the DKW inequality). Using $\hat{p}$ the algorithm partitions $[n]$ into a collection of disjoint intervals in the following way:

• A small collection of the intervals are “negligible”; they collectively have total mass less than under . Each negligible interval will be an element of the partition

• Some of the intervals are “heavy points”; these are intervals consisting of a single point that has mass under . Each heavy point will also be an element of the partition

• The remaining intervals are “moderate” intervals, each of which has mass under .

It remains to incorporate the moderate intervals into the partition that is being constructed. This is done as follows: using , the algorithm comes up with a “guess” of the correct orientation (non-increasing, non-decreasing, or close to uniform) for each moderate interval. Each moderate interval where the “guessed” orientation is “close to uniform” is included in the partition Finally, for each moderate interval where the guessed orientation is “non-increasing” or “non-decreasing”, the algorithm invokes Theorem 5 on to perform the oblivious decomposition for monotone distributions, and the resulting sub-intervals are included in . The analysis will show that the guesses are almost always correct, and intuitively this should imply that the that is constructed is indeed a -flat decomposition of .

### 4.2 The Orientation algorithm.

The Orientation algorithm takes as input an explicit description of a distribution $\hat{p}$ over $[n]$ and an interval $I \subseteq [n]$. Intuitively, it assumes that $\hat{p}_I$ is close (in Kolmogorov distance) to a monotone distribution $p_I$, and its goal is to determine the orientation of $p_I$: it outputs either $\uparrow$, $\downarrow$, or $\approx$ (the last of which means “close to uniform”). The algorithm is quite simple; it checks whether there exists an initial interval $I'$ of $I$ on which $\hat{p}_I$’s weight is significantly different from $u_I(I')$ (the weight that the uniform distribution over $I$ assigns to $I'$) and bases its output on this in the obvious way. A precise description of the algorithm (which uses no samples) is given below.

Orientation Inputs: explicit description of distribution over ; interval If (i.e. for some ) then return “”, otherwise continue. If there is an initial interval of that satisfies then halt and output “”. Otherwise, If there is an initial interval of that satisfies then halt and output “”. Otherwise, Output “”.

We proceed to analyze Orientation. We show that if is far from uniform then Orientation outputs the correct orientation for it. We also show that whenever Orientation does not output “”, whatever it outputs is the correct orientation of . The proof is given in Section C.3.
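The idea behind Orientation can be sketched as follows. This is an illustration only and not the paper's exact procedure: the threshold `tau` is a hypothetical parameter standing in for the paper's threshold, the comparison uses the conditional distribution on the interval, the single-point case is assumed to return “≈”, and the order in which the two directions are checked is an assumption. The logic relies on Observation 1: a non-increasing distribution puts at least as much mass on every initial interval as the uniform distribution does.

```python
def orientation(p_hat, a, b, tau):
    """Guess the orientation of p_hat restricted to indices a..b (inclusive).
    tau is a HYPOTHETICAL threshold chosen for illustration."""
    size = b - a + 1
    mass = sum(p_hat[a:b + 1])
    if size == 1 or mass == 0:
        return "≈"
    prefix, gaps = 0.0, []
    for t, i in enumerate(range(a, b + 1), start=1):
        prefix += p_hat[i]
        # Conditional mass of the initial interval minus its uniform mass.
        gaps.append(prefix / mass - t / size)
    if max(gaps) >= tau:
        return "↓"  # excess mass up front: looks non-increasing
    if min(gaps) <= -tau:
        return "↑"  # deficit up front: looks non-decreasing
    return "≈"      # no initial interval deviates much from uniform

print(orientation([0.4, 0.3, 0.2, 0.1], 0, 3, 0.1))
print(orientation([0.1, 0.2, 0.3, 0.4], 0, 3, 0.1))
print(orientation([0.25, 0.25, 0.25, 0.25], 0, 3, 0.1))
```

The procedure only inspects the explicit estimate it is given, so, like the algorithm described above, it draws no samples of its own.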

###### Lemma 5.

Let be a distribution over and let interval be such that is monotone. Suppose , and suppose that for every interval we have that Then

1. If is non-decreasing and is -far from the uniform distribution over , then Orientation outputs “”;

2. if Orientation outputs “$\uparrow$” then $p_I$ is non-decreasing;

3. if is non-increasing and is -far from the uniform distribution over , then Orientation outputs “”;

4. if Orientation outputs “$\downarrow$” then $p_I$ is non-increasing.

## 5 Lower Bounds

Our algorithmic results follow from a reduction which shows how one can reduce the problem of testing properties of monotone or $k$-modal distributions to the task of testing properties of general distributions over a much smaller support. Our approach to proving lower bounds is complementary; we give a canonical scheme for transforming “lower bound instances” of general distributions to related lower bound instances of monotone distributions with much larger supports.

A generic lower bound instance for distance estimation has the following form: there is a distribution over pairs of distributions, , with the information theoretic guarantee that, given independent samples from distributions and , with it is impossible to distinguish the case that versus with any probability greater than , where the probability is taken over both the selection of and the choice of samples. In general, such information theoretic lower bounds are difficult to prove. Fortunately, as mentioned above, we will be able to prove lower bounds for monotone and -modal distributions by leveraging the known lower bound constructions in a black-box fashion.

Definitions 2 and 3, given below, define a two-stage transformation of a generic distribution into a related -modal distribution over a much larger support. This transformation preserves total variation distance: for any pair of distributions, the variation distance between their transformations is identical to the variation distance between the original distributions. Additionally, we ensure that given access to independent samples from an original input distribution, one can simulate drawing samples from the related -modal distribution yielded by the transformation. Given any lower-bound construction for general distributions, the above transformation will yield a lower-bound instance for -modal distributions (so monotone distributions correspond to ) defined by selecting a pair of distributions then outputting the pair of transformed distributions. This transformed ensemble of distributions is a lower-bound instance, for if some algorithm could successfully test pairs of -modal distributions from then that algorithm could be used to test pairs from , by simulating samples drawn from the transformed versions of the distributions. The following proposition, proved in Section D, summarizes the above discussion:

###### Proposition 6.

Let be a distribution over pairs of distributions supported on such that given samples from distributions with no algorithm can distinguish whether versus with probability greater than (over both the draw of from and the draw of samples from ). Let be the respective maximum and minimum probabilities with which any element arises in distributions that are supported in . Then there exists a distribution over pairs of $k$-modal distributions supported on such that no algorithm, when given samples from distributions with , can distinguish whether versus with success probability greater than .
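The two properties the transformation must have, preservation of total variation distance and the ability to simulate transformed samples from original ones, can be illustrated with a toy stretching map. The following is a minimal sketch and not the construction of Definitions 2 and 3: the helper names `block_widths`, `stretch`, and `simulate_sample`, and the geometric width rule, are assumptions made for illustration. It assumes every element has strictly positive probability and that `ratio` is at least the largest max-to-min probability ratio over all distributions being transformed, in which case the stretched distributions are non-increasing.

```python
from fractions import Fraction
import random


def tv_distance(p, q):
    """Total variation distance between two distributions given as equal-length lists."""
    return sum(abs(a - b) for a, b in zip(p, q)) / 2


def block_widths(n, ratio):
    """Hypothetical width rule: element i is stretched over ratio**i points."""
    return [ratio ** i for i in range(n)]


def stretch(p, widths):
    """Spread the mass of element i uniformly over its own block of widths[i] points.

    The same widths must be used for every distribution so that blocks line up,
    which is what keeps total variation distance unchanged.
    """
    out = []
    for mass, w in zip(p, widths):
        out.extend([mass / w] * w)
    return out


def simulate_sample(i, widths, rng=random):
    """Turn one sample i from p into one sample from stretch(p, widths)."""
    offset = sum(widths[:i])  # first index of block i
    return offset + rng.randrange(widths[i])
```

Because a sample from the stretched distribution is produced from a single sample of the original, any tester for the stretched pair can be run on the original pair with the same number of samples, which is the black-box argument described above.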

Before proving this proposition, we state various corollaries which result from applying the Proposition to known lower-bound constructions for general distributions. The first is for the “testing identity, is unknown” problem:

###### Corollary 7.

There exists a constant such that for sufficiently large and , there is a distribution over pairs of $k$-modal distributions over , such that no algorithm, when given samples from a pair of distributions can distinguish the case that from the case with probability at least .

This Corollary gives the lower bounds stated in lines 2 and 6 of Table 1. It follows from applying Proposition 6 to a (trivially modified) version of the lower bound construction given in [BFR00, Val08b], summarized by the following theorem:

###### Theorem 6 ([BFR00, Val08b]).

There exists a constant such that for sufficiently large , there is a distribution over pairs of distributions over , such that for any the maximum probability with which any element occurs in or is and the minimum probability is . Additionally, no algorithm, when given samples from can distinguish whether from with probability at least .

Our second corollary is for estimation, in the case that one of the distributions is explicitly given. This trivially also yields an equivalent lower bound for the setting in which both distributions are given via samples.

###### Corollary 8.

For any with there exists a constant , such that for any sufficiently large and , there exists a $k$-modal distribution of support , and a distribution over $k$-modal distributions over , such that no algorithm, when given samples from a distribution can distinguish the case that versus with probability at least .

This Corollary gives the lower bounds claimed in lines 3, 4, 7 and 8 of Table 1. It follows from applying Proposition 6 to the lower bound construction given in [VV11a], summarized by the following theorem:

###### Theorem 7 ([VV11a]).

For any with there exists a constant , such that for any sufficiently large , there is a distribution over distributions with support , such that for any the maximum probability with which any element occurs in is and the minimum probability is . Additionally, no algorithm, when given samples from can distinguish whether versus with probability at least , where denotes the uniform distribution over .

Note that the above theorem can be expressed in the language of Proposition 6 by defining the distribution over pairs of distributions which chooses a distribution according to for the first distribution of each pair, and always selects for the second distribution of each pair.

Our third corollary, which gives the lower bounds claimed in lines 1 and 5 of Table 1, is for the “testing identity, is known” problem:

###### Corollary 9.

For any , there is a constant such that for sufficiently large and , there is a $k$-modal distribution with support , and a distribution over $k$-modal distributions of support such that no algorithm, when given samples from a distribution can distinguish the case that from the case with probability at least .

The above corollary follows from applying Proposition 6 to the following trivially verified lower bound construction:

###### Fact 10.

Let be the ensemble of distributions of support defined as follows: with probability is the uniform distribution on support , and with probability , assigns probability to a random half of the domain elements, and probability to the other half of the domain elements. No algorithm, when given fewer than samples from a distribution can distinguish between versus with probability greater than .

As noted previously (after Theorem 7), this fact can also be expressed in the language of Proposition 6.
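An ensemble of the kind described in Fact 10, together with its expression as an ensemble over pairs, might be sketched as follows. This is a minimal sketch under stated assumptions: the mixing probability of 1/2, the bias parameter `eps`, and the weights `(1 + 2*eps)/n` and `(1 - 2*eps)/n` are hypothetical choices for illustration (they make the perturbed distribution exactly `eps`-far from uniform), and the function names are not from the paper. The domain size `n` is assumed to be even.

```python
from fractions import Fraction
import random


def uniform(n):
    """Uniform distribution over n elements, as a list of exact fractions."""
    return [Fraction(1, n)] * n


def perturbed(n, eps, rng=random):
    """Raise a random half of the elements and lower the other half (assumed weights)."""
    heavy = set(rng.sample(range(n), n // 2))
    high, low = (1 + 2 * eps) / n, (1 - 2 * eps) / n
    return [high if i in heavy else low for i in range(n)]


def draw_pair(n, eps, rng=random):
    """Draw from the ensemble in pair form: the second distribution is always uniform."""
    first = uniform(n) if rng.random() < 0.5 else perturbed(n, eps, rng)
    return first, uniform(n)


def tv_distance(p, q):
    """Total variation distance between two distributions given as equal-length lists."""
    return sum(abs(a - b) for a, b in zip(p, q)) / 2
```

With these choices every pair drawn by `draw_pair` is either identical or at distance exactly `eps`, which is the dichotomy a tester would be asked to distinguish.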

## 6 Conclusions

We have introduced a simple new approach for tackling distribution testing problems for restricted classes of distributions, by reducing them to general-distribution testing problems over a smaller domain. We applied this approach to get new testing results for a range of distribution testing problems involving monotone and $k$-modal distributions, and established lower bounds showing that all our new algorithms are essentially optimal.

A general direction for future work is to apply our reduction method to obtain near-optimal testing algorithms for other interesting classes of distributions. This will involve constructing flat decompositions of various types of distributions using few samples, which seems to be a natural and interesting algorithmic problem. A specific goal is to develop a more efficient version of our Construct-Flat-Decomposition algorithm for $k$-modal distributions; is it possible to obtain an improved version of this algorithm that uses samples?

## References

• [BBBB72] R.E. Barlow, D.J. Bartholomew, J.M. Bremner, and H.D. Brunk. Statistical Inference under Order Restrictions. Wiley, New York, 1972.
• [BFF01] T. Batu, E. Fischer, L. Fortnow, R. Kumar, R. Rubinfeld, and P. White. Testing random variables for independence and identity. In Proc. 42nd IEEE Conference on Foundations of Computer Science, pages 442–451, 2001.
• [BFR00] T. Batu, L. Fortnow, R. Rubinfeld, W.D. Smith, and P. White. Testing that distributions are close. In IEEE Symposium on Foundations of Computer Science, pages 259–269, 2000.
• [BFR10] T. Batu, L. Fortnow, R. Rubinfeld, W.D. Smith, and P. White. Testing closeness of discrete distributions, 2010.
• [Bir87a] L. Birgé. Estimating a density under order restrictions: Nonasymptotic minimax risk. Annals of Statistics, 15(3):995–1012, 1987.
• [Bir87b] L. Birgé. On the risk of histograms for estimating decreasing densities. Annals of Statistics, 15(3):1013–1022, 1987.
• [BKR04] T. Batu, R. Kumar, and R. Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In ACM Symposium on Theory of Computing, pages 381–390, 2004.
• [CKC83] L. Cobb, P. Koppstein, and N.H. Chen. Estimation and moment recursion relations for multimodal distributions of the exponential family. J. American Statistical Association, 78(381):124–130, 1983.
• [CT04] K.S. Chan and H. Tong. Testing for multimodality with dependent data. Biometrika, 91(1):113–123, 2004.
• [DDS11] C. Daskalakis, I. Diakonikolas, and R.A. Servedio. Learning $k$-modal distributions via testing. Available at http://arxiv.org/abs/1107.2700, 2011.
• [DKW56] A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Mathematical Statistics, 27(3):642–669, 1956.
• [Fou97] A.-L. Fougères. Estimation de densités unimodales. Canadian Journal of Statistics, 25:375–387, 1997.
• [Gre56] U. Grenander. On the theory of mortality measurement. Skand. Aktuarietidskr., 39:125–153, 1956.
• [Gro85] P. Groeneboom. Estimating a monotone density. In Proc. of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, pages 539–555, 1985.
• [JW09] H. K. Jankowski and J. A. Wellner. Estimation of a discrete monotone density. Electronic Journal of Statistics, 3:1567–1605, 2009.
• [Kem91] J.H.B. Kemperman. Mixtures with a limited number of modal intervals. Annals of Statistics, 19(4):2120–2144, 1991.
• [Mas90] P. Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Annals of Probability, 18(3):1269–1283, 1990.
• [Rao69] B.L.S. Prakasa Rao. Estimation of a unimodal density. Sankhya Ser. A, 31:23–36, 1969.
• [Val08a] P. Valiant. Testing Symmetric Properties of Distributions. PhD thesis, M.I.T., 2008.
• [Val08b] P. Valiant. Testing symmetric properties of distributions. In STOC, pages 383–392, 2008.
• [VV11a] G. Valiant and P. Valiant. Estimating the unseen: an $n/\log(n)$-sample estimator for entropy and support size, shown optimal via new CLTs. In STOC, pages 685–694, 2011.
• [VV11b] G. Valiant and P. Valiant. The power of linear estimators. In FOCS, 2011.

For simplicity, the appendix consists of a slightly expanded and self-contained version of the exposition in the body of the paper, following the “Notation and Preliminaries” section.

## Appendix A Shrinking the domain size: Reductions for distribution-testing problems

In this section we present the general framework of our reduction-based approach and sketch how we instantiate this approach for monotone and $k$-modal distributions.

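We denote by $|I|$ the cardinality of an interval $I \subseteq [n]$, i.e., for $I = [a, b]$ we have $|I| = b - a + 1$. Fix a distribution $p$ over $[n]$ and a partition of $[n]$ into disjoint intervals $\mathcal{I} := \{I_i\}_{i=1}^{\ell}$. The flattened distribution $(p_f)^{\mathcal{I}}$ corresponding to $p$ and $\mathcal{I}$ is the distribution over $[n]$ defined as follows: for $j \in [\ell]$ and $i \in I_j$, $(p_f)^{\mathcal{I}}(i) = \sum_{t \in I_j} p(t) / |I_j|$. That is, $(p_f)^{\mathcal{I}}$ is obtained from $p$ by averaging the weight that $p$ assigns to each interval over the entire interval. The reduced distribution $(p_r)^{\mathcal{I}}$ corresponding to $p$ and $\mathcal{I}$ is the distribution over $[\ell]$ that assigns the $i$th point the weight $p$ assigns to the interval $I_i$; i.e., for $i \in [\ell]$, we have $(p_r)^{\mathcal{I}}(i) = p(I_i)$. Note that if $p$ is non-increasing then so is $(p_f)^{\mathcal{I}}$, but this is not necessarily the case for $(p_r)^{\mathcal{I}}$.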
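The flattened and reduced distributions can be computed directly from their descriptions above. The following is a minimal sketch: the function names `flatten` and `reduce_dist` are illustrative, and intervals are represented (as an assumption of this sketch) by inclusive 0-based endpoint pairs rather than the 1-based indexing of the text.

```python
from fractions import Fraction


def flatten(p, intervals):
    """Average the weight p assigns to each interval over that whole interval.

    p is a list of probabilities; intervals is a list of (a, b) pairs with
    inclusive 0-based endpoints that together partition range(len(p)).
    """
    out = [None] * len(p)
    for a, b in intervals:
        mass = sum(p[a:b + 1])   # weight of the interval under p
        width = b - a + 1        # cardinality of the interval
        for i in range(a, b + 1):
            out[i] = mass / width
    return out


def reduce_dist(p, intervals):
    """Collapse each interval to a single point carrying the interval's weight."""
    return [sum(p[a:b + 1]) for a, b in intervals]


# A non-increasing distribution over four points, split into two intervals.
p = [Fraction(4, 10), Fraction(3, 10), Fraction(2, 10), Fraction(1, 10)]
intervals = [(0, 0), (1, 3)]
flat = flatten(p, intervals)
reduced = reduce_dist(p, intervals)
```

In this example the flattened distribution stays non-increasing, while the reduced distribution places more weight on its second point than on its first, illustrating the remark that monotonicity need not survive the reduction.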

Definition 1. Let be a distribution over and let be a partition of into disjoint intervals. We say that is a -flat decomposition of if

The following useful lemma relates closeness of