Testing Shape Restrictions of Discrete Distributions

Clément L. Canonne Columbia University. Email: ccanonne@cs.columbia.edu. Research supported by NSF CCF-1115703 and NSF CCF-1319788.    Ilias Diakonikolas University of Edinburgh. Email: ilias.d@ed.ac.uk. Research supported by EPSRC grant EP/L021749/1, a Marie Curie Career Integration Grant, and a SICSA grant. This work was performed in part while visiting CSAIL, MIT.    Themis Gouleakis CSAIL, MIT. Email: tgoule@mit.edu.    Ronitt Rubinfeld CSAIL, MIT and the Blavatnik School of Computer Science, Tel Aviv University. Email: ronitt@csail.mit.edu.
Abstract

We study the question of testing structured properties (classes) of discrete distributions. Specifically, given sample access to an arbitrary distribution $D$ over $[n]$ and a property $\mathcal{P}$, the goal is to distinguish between $D \in \mathcal{P}$ and $\ell_1(D, \mathcal{P}) > \varepsilon$. We develop a general algorithm for this question, which applies to a large range of "shape-constrained" properties, including monotone, log-concave, $t$-modal, piecewise-polynomial, and Poisson Binomial distributions. Moreover, for all cases considered, our algorithm has near-optimal sample complexity with regard to the domain size and is computationally efficient. For most of these classes, we provide the first non-trivial tester in the literature. In addition, we also describe a generic method to prove lower bounds for this problem, and use it to show that our upper bounds are nearly tight. Finally, we extend some of our techniques to tolerant testing, deriving nearly-tight upper and lower bounds for the corresponding questions.


1 Introduction

Inferring information about the probability distribution that underlies a data sample is an essential question in Statistics, and one that has ramifications in every field of the natural sciences and quantitative research. In many situations, it is natural to assume that this data exhibits some simple structure because of known properties of the origin of the data, and such assumptions are in fact crucial in making the problem tractable. These assumptions translate into constraints on the probability distribution – e.g., that it is Gaussian, or that it meets a smoothness or "fat tail" condition (see e.g., [Man63, Hou86, TLSM95]).

As a result, the problem of deciding whether a distribution possesses such a structural property has been widely investigated both in theory and practice, in the context of shape restricted inference [BDBB72, SS01] and model selection [MP07]. Here, it is guaranteed or thought that the unknown distribution satisfies a shape constraint, such as having a monotone or log-concave probability density function [SN99, BB05, Wal09, Dia16]. From a different perspective, a recent line of work in Theoretical Computer Science, originating from the papers of Batu et al. [BFR00, BFF01, GR00] has also been tackling similar questions in the setting of property testing (see [Ron08, Ron10, Rub12, Can15] for surveys on this field). This very active area has seen a spate of results and breakthroughs over the past decade, culminating in very efficient (both sample and time-wise) algorithms for a wide range of distribution testing problems [BDKR05, GMV06, AAK07, DDS13, CDVV14, AD15, DKN15b]. In many cases, this led to a tight characterization of the number of samples required for these tasks as well as the development of new tools and techniques, drawing connections to learning and information theory [VV10, VV11a, VV14].

In this paper, we focus on the following general property testing problem: given a class (property) $\mathcal{C}$ of distributions and sample access to an arbitrary distribution $D$, one must distinguish between the case that (a) $D \in \mathcal{C}$, versus (b) $\|D - D'\|_1 > \varepsilon$ for all $D' \in \mathcal{C}$ (i.e., $D$ is either in the class, or far from it). While many of the previous works have focused on the testing of specific properties of distributions, or obtained algorithms and lower bounds on a case-by-case basis, an emerging trend in distribution testing is to design general frameworks that can be applied to several property testing problems [Val11, VV11a, DKN15b, DKN15a]. This direction, the testing analog of a similar movement in distribution learning [CDSS13, CDSS14b, CDSS14a, ADLS15], aims at abstracting the minimal assumptions that are shared by a large variety of problems, and at giving algorithms that can be used for any of these problems. In this work, we make significant progress in this direction by providing a unified framework for the question of testing various properties of probability distributions. More specifically, we describe a generic technique to obtain upper bounds on the sample complexity of this question, which applies to a broad range of structured classes. Our technique yields sample near-optimal and computationally efficient testers for a wide range of distribution families. Conversely, we also develop a general approach to prove lower bounds on these sample complexities, and use it to derive tight or nearly tight bounds for many of these classes.

Related work.

Batu et al. [BKR04] initiated the study of efficient property testers for monotonicity and obtained (nearly) matching upper and lower bounds for this problem; [AD15] later considered testing the class of Poisson Binomial Distributions, and settled the sample complexity of this problem (up to the precise dependence on $\varepsilon$). Indyk, Levi, and Rubinfeld [ILR12], focusing on distributions that are piecewise constant on $L$ intervals ("$L$-histograms"), described a sublinear-sample algorithm for testing membership to this class. Another body of work, by [BDKR05], [BKR04], and [DDS13], shows how assumptions on the shape of the distributions can lead to significantly more efficient algorithms. These papers describe such improvements in the case of identity and closeness testing as well as for entropy estimation, under monotonicity or $k$-modality constraints. Specifically, Batu et al. show in [BKR04] how to obtain a closeness tester in this setting whose sample complexity is polylogarithmic in $n$, in stark contrast to the $\Omega(n^{2/3})$ lower bound that holds for arbitrary distributions. Daskalakis et al. [DDS13] later gave sample-efficient testing algorithms for, respectively, identity and closeness of monotone distributions, and obtained similar results for $k$-modal distributions. Finally, we briefly mention two related results, due respectively to [BDKR05] and [DDS12a]. The first states that for the task of getting a multiplicative estimate of the entropy of a distribution, assuming monotonicity enables exponential savings in sample complexity – polylogarithmic in $n$, instead of polynomial for the general case. The second describes how to test whether an unknown $k$-modal distribution is in fact monotone, using a number of samples independent of $n$. Note that the latter line of work differs from ours in that it presupposes that the distributions satisfy some structural property, and uses this knowledge to test something else about the distribution; whereas we are given a priori arbitrary distributions, and must check whether the structural property holds. Except for the properties of monotonicity and being a PBD, nothing was previously known on testing the shape-restricted properties that we study. Independently and concurrently to this work, Acharya, Daskalakis, and Kamath obtained a sample near-optimal efficient algorithm for testing log-concavity. (Following the communication of a preliminary version of this paper (February 2015), we were informed that [ADK15] subsequently obtained near-optimal testers for some of the classes we consider. To the best of our knowledge, their work builds on ideas from [AD15], and their techniques are orthogonal to ours.)

Moreover, for the specific problems of identity and closeness testing (recall that the identity testing problem asks, given the explicit description of a distribution $D^*$ and sample access to an unknown distribution $D$, to decide whether $D$ is equal to $D^*$ or far from it; while in closeness testing both distributions to compare are unknown), recent results of [DKN15b, DKN15a] describe a general algorithm which applies to a large range of shape or structural constraints, and yields optimal identity testers for classes of distributions that satisfy them. We observe that while the question they answer can be cast as a specialized instance of membership testing, our results are incomparable to theirs, both because of the distinction above (testing with versus testing for structure) and because the structural assumptions they rely on are fundamentally different from ours.

1.1 Results and Techniques

Upper Bounds. A natural way to tackle our membership testing problem would be to first learn the unknown distribution $D$ as if it satisfied the property, before checking if the hypothesis obtained is indeed both close to the original distribution and to the property. Taking advantage of the purported structure, the first step could presumably be conducted with a small number of samples; things break down, however, in the second step. Indeed, most approximation results leading to the improved learning algorithms one would apply in the first stage only provide very weak guarantees, in the $\ell_1$ sense. For this reason, they lack the robustness that would be required for the second part, where it becomes necessary to perform tolerant testing between the hypothesis and $D$ – a task that would then entail a number of samples almost linear in the domain size. To overcome this difficulty, we need to move away from these global $\ell_1$ closeness results and instead work with stronger, more local requirements: multiplicative closeness on each interval of a suitable partition of the domain.

At the core of our approach is an idea of Batu et al. [BKR04], who show that monotone distributions can be well-approximated (in a certain technical sense) by piecewise constant densities on a suitable interval partition of the domain, and leverage this fact to reduce monotonicity testing to uniformity testing on each interval of this partition. While the argument of [BKR04] is tailored specifically to the setting of monotonicity testing, we are able to abstract the key ingredients, and obtain a generic membership tester that applies to a wide range of distribution families. In more detail, we provide a testing algorithm which applies to any class of distributions which admits succinct approximate decompositions – that is, each distribution in the class can be well-approximated (in a strong sense) by piecewise constant densities on a small number of intervals. (We hereafter refer to this approximation property, formally defined in Section 3, as (Succinctness), and extend the notation to apply to any class $\mathcal{C}$ of distributions for which all $D \in \mathcal{C}$ satisfy (Succinctness).) Crucially, the algorithm does not care about how these decompositions can be obtained: for the purpose of testing these structural properties we only need to establish their existence. Specific examples are given in the corollaries below. Our main algorithmic result, informally stated (see Theorem 3.1 for a detailed formal statement), is as follows:

Theorem 1.1 (Main Theorem).

There exists an algorithm TestSplittable which, given sampling access to an unknown distribution $D$ over $[n]$ and parameter $\varepsilon \in (0, 1]$, can distinguish with probability $2/3$ between (a) $D \in \mathcal{P}$ versus (b) $\ell_1(D, \mathcal{P}) > \varepsilon$, for any property $\mathcal{P}$ that satisfies the above natural structural criterion (Succinctness). Moreover, for many such properties this algorithm is computationally efficient, and its sample complexity is optimal (up to logarithmic factors and the exact dependence on $\varepsilon$).

We then instantiate this result to obtain “out-of-the-box” computationally efficient testers for several classes of distributions, by showing that they satisfy the premise of our theorem (the definition of these classes is given in Section 2.1):

Corollary \thecoro.

The algorithm TestSplittable can test the classes of monotone, unimodal, log-concave, concave, convex, and monotone hazard rate (MHR) distributions, with $\tilde{O}(\sqrt{n}) \cdot \mathrm{poly}(1/\varepsilon)$ samples.

Corollary \thecoro.

The algorithm TestSplittable can test the class of $t$-modal distributions, with $\tilde{O}(\sqrt{tn}) \cdot \mathrm{poly}(1/\varepsilon)$ samples.

Corollary \thecoro.

The algorithm TestSplittable can test the classes of $L$-histograms and $L$-piecewise degree-$d$ distributions, with $\tilde{O}(\sqrt{Ln}) \cdot \mathrm{poly}(1/\varepsilon)$ and $\tilde{O}(\sqrt{L(d+1)n}) \cdot \mathrm{poly}(1/\varepsilon)$ samples, respectively.

Corollary \thecoro.

The algorithm TestSplittable can test the classes of Binomial and Poisson Binomial Distributions, with $\tilde{O}(n^{1/4}) \cdot \mathrm{poly}(1/\varepsilon)$ samples.

Class | Upper bound | Lower bound
Monotone | $\tilde{O}(\sqrt{n}) \cdot \mathrm{poly}(1/\varepsilon)$ ([BKR04], Section 1.1) | $\Omega(\sqrt{n}/\varepsilon^2)$ ([BKR04], Section 1.1)
Unimodal | $\tilde{O}(\sqrt{n}) \cdot \mathrm{poly}(1/\varepsilon)$ (Section 1.1) | $\Omega(\sqrt{n}/\varepsilon^2)$ (Section 1.1)
$t$-modal | $\tilde{O}(\sqrt{tn}) \cdot \mathrm{poly}(1/\varepsilon)$ (Section 1.1) | $\Omega(\sqrt{n}/\varepsilon^2)$ (Section 1.1)
Log-concave, concave, convex | $\tilde{O}(\sqrt{n}) \cdot \mathrm{poly}(1/\varepsilon)$ (Section 1.1) | $\Omega(\sqrt{n}/\varepsilon^2)$ (Section 1.1)
Monotone Hazard Rate (MHR) | $\tilde{O}(\sqrt{n}) \cdot \mathrm{poly}(1/\varepsilon)$ (Section 1.1) | $\Omega(\sqrt{n}/\varepsilon^2)$ (Section 1.1)
Binomial, Poisson Binomial (PBD) | $\tilde{O}(n^{1/4}) \cdot \mathrm{poly}(1/\varepsilon)$ ([AD15], Section 1.1) | $\Omega(n^{1/4}/\varepsilon^2)$ ([AD15], Section 1.1)
$L$-histograms | $\tilde{O}(\sqrt{Ln}) \cdot \mathrm{poly}(1/\varepsilon)$ ([ILR12], Section 1.1) | $\Omega(\sqrt{n}/\varepsilon^2)$ ([ILR12], Section 1.1)
$L$-piecewise degree-$d$ | $\tilde{O}(\sqrt{L(d+1)n}) \cdot \mathrm{poly}(1/\varepsilon)$ (Section 1.1) | $\Omega(\sqrt{n}/\varepsilon^2)$ (Section 1.1)
$k$-SIIRV | – | $\Omega(k^{1/2}n^{1/4}/\varepsilon^2)$ (Section 1.1)

Table 1: Summary of results.

We remark that the aforementioned sample upper bounds are information-theoretically near-optimal in the domain size $n$ (up to logarithmic factors). See Table 1 and the following subsection for the corresponding lower bounds. We did not attempt to optimize the dependence on the parameter $\varepsilon$, though a more careful analysis can lead to such improvements.

We stress that prior to our work, no non-trivial testing bound was known for most of these classes – specifically, our nearly-tight bounds for $t$-modal, log-concave, concave, convex, MHR, and piecewise polynomial distributions are new. Moreover, although a few of our applications were known in the literature (the upper and lower bounds on testing monotonicity can be found in [BKR04], while the sample complexity of testing PBDs was recently settled in [AD15], and the task of testing $L$-histograms is considered in [ILR12]), the crux here is that we are able to derive them in a unified way, by applying the same generic algorithm to all these different distribution families. (For the sample complexity of testing monotonicity, [BKR04] originally states a stronger upper bound, but their proof seems to only yield a weaker dependence on $\varepsilon$. Regarding the class of PBDs, [AD15] obtain a near-optimal sample complexity, comparable to our upper bound, as well as a matching lower bound.) We note that our upper bound for $L$-histograms (Section 1.1) also improves on the sample complexity of the previous tester of [ILR12] for a range of parameters. In addition to its generality, our framework yields much cleaner and conceptually simpler proofs of the upper and lower bounds from [AD15].

Lower Bounds.

To complement our upper bounds, we give a generic framework for proving lower bounds against testing classes of distributions. In more detail, we describe how to reduce – under a mild assumption on the property – the problem of testing membership in $\mathcal{C}$ ("does $D \in \mathcal{C}$?") to testing identity to $D^*$ ("does $D = D^*$?"), for any explicit distribution $D^*$ in $\mathcal{C}$. While these two problems need not in general be related (as a simple example, consider the class of all distributions, for which testing membership is trivial), we show that our reduction-based approach applies to a large number of natural properties, and obtain lower bounds that nearly match our upper bounds for all of them. Moreover, this lets us derive a simple proof of the lower bound of [AD15] on testing the class of PBDs. The reader is referred to Theorem 6.1 for the formal statement of our reduction-based lower bound theorem. In this section, we state the concrete corollaries we obtain for specific structured distribution families:

Corollary \thecoro.

Testing log-concavity, convexity, concavity, MHR, unimodality, $t$-modality, $L$-histograms, and $L$-piecewise degree-$d$ distributions each requires $\Omega(\sqrt{n}/\varepsilon^2)$ samples (the last three provided $t$, $L$, and $L(d+1)$, respectively, are not too large), for any $\varepsilon$ smaller than some absolute constant.

Corollary \thecoro.

Testing the classes of Binomial and Poisson Binomial Distributions each requires $\Omega(n^{1/4}/\varepsilon^2)$ samples, for any $\varepsilon$ smaller than some absolute constant.

Corollary \thecoro.

There exist absolute constants $c > 0$ and $\varepsilon_0 > 0$ such that testing the class of $k$-SIIRV distributions requires $\Omega(k^{1/2}n^{1/4}/\varepsilon^2)$ samples, for any $k \leq cn$ and $\varepsilon \leq \varepsilon_0$.

Tolerant Testing.

Using our techniques, we also establish nearly-tight upper and lower bounds on tolerant testing for shape restrictions. (Tolerant testing of a property $\mathcal{P}$ is defined as follows: given $0 \leq \varepsilon_1 < \varepsilon_2 \leq 1$, one must distinguish between (a) $\ell_1(D, \mathcal{P}) \leq \varepsilon_1$ and (b) $\ell_1(D, \mathcal{P}) \geq \varepsilon_2$. This turns out to be, in general, a much harder task than "regular" testing, which corresponds to taking $\varepsilon_1 = 0$.) Similarly, our upper and lower bounds are matching as a function of the domain size. More specifically, we give a simple generic upper bound approach (namely, a learning-followed-by-tolerant-testing algorithm). Our tolerant testing lower bounds follow the same reduction-based approach as in the non-tolerant case. In more detail, our results are as follows (see Section 6 and Section 7):

Corollary \thecoro.

Tolerant testing of log-concavity, convexity, concavity, MHR, unimodality, and $t$-modality can be performed with $O\!\left(\frac{1}{(\varepsilon_2-\varepsilon_1)^2}\cdot\frac{n}{\log n}\right)$ samples, for $\varepsilon_2 \geq C\varepsilon_1$ (where $C > 1$ is an absolute constant).

Corollary \thecoro.

Tolerant testing of the classes of Binomial and Poisson Binomial Distributions can be performed with $O\!\left(\frac{1}{(\varepsilon_2-\varepsilon_1)^2}\cdot\frac{\sqrt{n\log n}}{\log n}\right)$ samples, for $\varepsilon_2 \geq C\varepsilon_1$ (where $C > 1$ is an absolute constant).

Corollary \thecoro.

Tolerant testing of log-concavity, convexity, concavity, MHR, unimodality, and $t$-modality each requires $\Omega\!\left(\frac{1}{\varepsilon_2-\varepsilon_1}\cdot\frac{n}{\log n}\right)$ samples (the latter for $t = o(n)$).

Corollary \thecoro.

Tolerant testing of the classes of Binomial and Poisson Binomial Distributions each requires $\Omega\!\left(\frac{1}{\varepsilon_2-\varepsilon_1}\cdot\frac{\sqrt{n}}{\log n}\right)$ samples.

On the scope of our results.

We point out that our main theorem is likely to apply to many other classes of structured distributions, owing to the mild structural assumptions it requires. However, we did not attempt here to be comprehensive; rather, our goal is to illustrate the generality of our approach. Moreover, for all properties considered in this paper, the generic upper and lower bounds we derive through our methods turn out to be optimal up to at most polylogarithmic factors (with regard to the support size). The reader is referred to Table 1 for a summary of our results and related work.

1.2 Organization of the Paper

We start by giving the necessary background and definitions in Section 2, before turning to our main result, the proof of Theorem 1.1 (our general testing algorithm), in Section 3. In Section 4, we establish the necessary structural theorems for each class of distributions considered, enabling us to derive the upper bounds of Table 1. Section 5 introduces a slight modification of our algorithm which yields stronger testing results for classes of distributions with small effective support, and uses it to derive Section 1.1, our upper bound for Poisson Binomial distributions. Section 6 then contains the details of our lower bound methodology and of its applications to the classes of Table 1. Finally, Section 6.2 is concerned with the extension of this methodology to tolerant testing, of which Section 7 describes a generic upper bound counterpart.

2 Notation and Preliminaries

2.1 Definitions

We give here the formal descriptions of the classes of distributions involved in this work. Recall that a distribution $D$ over $[n]$ is monotone (non-increasing) if its probability mass function (pmf) satisfies $D(1) \geq D(2) \geq \cdots \geq D(n)$. A natural generalization of the class $\mathcal{M}$ of monotone distributions is the set of $t$-modal distributions, i.e. distributions whose pmf can go "up and down" or "down and up" up to $t$ times. (Note that this slightly deviates from the Statistics literature, where only the "peaks" are counted as modes – so that what is usually referred to as a bimodal distribution is, according to our definition, $3$-modal.)

Definition \thedefn ($t$-modal).

Fix any distribution $D$ over $[n]$, and integer $t \geq 0$. $D$ is said to have $t$ modes if there exists a sequence $i_0 < \cdots < i_{t+1}$ in $[n]$ such that either $(-1)^j D(i_j) < (-1)^j D(i_{j+1})$ for all $0 \leq j \leq t$, or $(-1)^j D(i_j) > (-1)^j D(i_{j+1})$ for all $0 \leq j \leq t$. We call $D$ $t$-modal if it has at most $t$ modes, and write $\mathcal{M}_t$ for the class of all $t$-modal distributions (omitting the dependence on $n$). The particular case $t = 1$ corresponds to the set of unimodal distributions.
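To make this counting convention concrete, here is a small sanity-check utility (ours, not from the paper; the names num_modes and is_t_modal are hypothetical) that computes the number of modes of an explicit pmf, given as a list of probabilities:

```python
def num_modes(pmf):
    """Number of modes of a pmf in the sense of the definition above: the
    length of the longest strictly alternating subsequence of pmf values,
    minus two (so that monotone -> 0, single-peaked -> 1, bimodal -> 3)."""
    best = 1       # length of the longest alternating subsequence so far
    direction = 0  # +1 if the last comparison was '<', -1 if '>', 0 if none
    for prev, cur in zip(pmf, pmf[1:]):
        if cur > prev and direction != 1:
            best, direction = best + 1, 1
        elif cur < prev and direction != -1:
            best, direction = best + 1, -1
    return max(best - 2, 0)

def is_t_modal(pmf, t):
    return num_modes(pmf) <= t

assert num_modes([0.1, 0.3, 0.2, 0.3, 0.1]) == 3   # bimodal, hence 3-modal
assert is_t_modal([0.4, 0.3, 0.2, 0.1], 0)         # monotone non-increasing
```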

Definition \thedefn (Log-Concave).

A distribution $D$ over $[n]$ is said to be log-concave if it satisfies the following conditions: (i) for any $1 \leq i < j < k \leq n$ such that $D(i)D(k) > 0$, $D(j) > 0$; and (ii) for all $1 < k < n$, $D(k)^2 \geq D(k-1)D(k+1)$. We write $\mathcal{L}$ for the class of all log-concave distributions (omitting the dependence on $n$).

Definition \thedefn (Concave and Convex).

A distribution $D$ over $[n]$ is said to be concave if it satisfies the following conditions: (i) for any $1 \leq i < j < k \leq n$ such that $D(i)D(k) > 0$, $D(j) > 0$; and (ii) for all $1 < k < n$ such that $D(k-1)D(k+1) > 0$, $2D(k) \geq D(k-1) + D(k+1)$; it is convex if the reverse inequality holds in (ii). We write $\mathcal{K}^-$ (resp. $\mathcal{K}^+$) for the class of all concave (resp. convex) distributions (omitting the dependence on $n$).

It is not hard to see that convex and concave distributions are unimodal; moreover, every concave distribution is also log-concave, i.e. $\mathcal{K}^- \subseteq \mathcal{L}$. Note that in both Section 2.1 and Section 2.1, condition (i) is equivalent to enforcing that the distribution be supported on an interval.
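As an illustration, the following sketch (function names ours) checks conditions (i) and (ii) of the two definitions above for an explicit pmf; the final assertion exercises the inclusion $\mathcal{K}^- \subseteq \mathcal{L}$ on a small example:

```python
def support_is_interval(pmf):
    """Condition (i): the support of the pmf is a contiguous interval."""
    idx = [i for i, p in enumerate(pmf) if p > 0]
    return bool(idx) and all(pmf[i] > 0 for i in range(idx[0], idx[-1] + 1))

def is_log_concave(pmf):
    """Condition (ii): D(k)^2 >= D(k-1) * D(k+1) at all interior points."""
    return support_is_interval(pmf) and all(
        pmf[k] ** 2 >= pmf[k - 1] * pmf[k + 1] for k in range(1, len(pmf) - 1))

def is_concave(pmf):
    """Condition (ii): 2 D(k) >= D(k-1) + D(k+1) inside the support."""
    return support_is_interval(pmf) and all(
        2 * pmf[k] >= pmf[k - 1] + pmf[k + 1]
        for k in range(1, len(pmf) - 1)
        if pmf[k - 1] > 0 and pmf[k + 1] > 0)

def is_convex(pmf):
    """Same as is_concave, with the reverse inequality in (ii)."""
    return support_is_interval(pmf) and all(
        2 * pmf[k] <= pmf[k - 1] + pmf[k + 1]
        for k in range(1, len(pmf) - 1)
        if pmf[k - 1] > 0 and pmf[k + 1] > 0)

pmf = [0.1, 0.2, 0.25, 0.25, 0.2]  # concave on its support...
assert is_concave(pmf) and is_log_concave(pmf)  # ...hence also log-concave
```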

Definition \thedefn (Monotone Hazard Rate).

A distribution $D$ over $[n]$ is said to have monotone hazard rate (MHR) if its hazard rate $H_D(i) \stackrel{\text{def}}{=} \frac{D(i)}{\sum_{j \geq i} D(j)}$ is a non-decreasing function of $i$. We write $\mathcal{MHR}$ for the class of all MHR distributions (omitting the dependence on $n$).

It is known that every log-concave distribution is both unimodal and MHR (see e.g. [An96, Proposition 10]), and that monotone distributions are MHR. Two other classes of distributions have elicited significant interest in the context of density estimation: those of histogram (piecewise constant) and piecewise polynomial densities:

Definition \thedefn (Piecewise Polynomials [CDSS14a]).

A distribution $D$ over $[n]$ is said to be an $L$-piecewise degree-$d$ distribution if there is a partition of $[n]$ into $L$ disjoint intervals $I_1, \ldots, I_L$ such that $D(i) = p_j(i)$ for all $i \in I_j$, where each $p_j$ is a univariate polynomial of degree at most $d$. We write $\mathcal{P}_{L,d}$ for the class of all $L$-piecewise degree-$d$ distributions (omitting the dependence on $n$). (We note that $L$-piecewise degree-$0$ distributions are also commonly referred to as $L$-histograms, and write $\mathcal{H}_L$ for $\mathcal{P}_{L,0}$.)

Finally, we recall the definition of the two following classes, which both extend the family of Binomial distributions $\mathrm{Bin}(n, p)$: the first, by removing the need for each of the independent Bernoulli summands to share the same bias parameter.

Definition \thedefn.

A random variable $X$ is said to follow a Poisson Binomial Distribution (with parameter $n \in \mathbb{N}$) if it can be written as $X = \sum_{j=1}^{n} X_j$, where $X_1, \ldots, X_n$ are independent, not necessarily identically distributed Bernoulli random variables. We denote by $\mathcal{PBD}_n$ the class of all such Poisson Binomial Distributions.

It is not hard to show that Poisson Binomial Distributions are in particular log-concave. One can generalize even further, by allowing each random variable of the summation to be integer-valued:

Definition \thedefn.

Fix any $k \in \mathbb{N}$. We say a random variable $X$ is a $k$-Sum of Independent Integer Random Variables ($k$-SIIRV) with parameter $n \in \mathbb{N}$ if it can be written as $X = \sum_{j=1}^{n} X_j$, where $X_1, \ldots, X_n$ are independent, not necessarily identically distributed random variables taking values in $\{0, 1, \ldots, k-1\}$. We denote by $k\text{-}\mathcal{SIIRV}_n$ the class of all such $k$-SIIRVs.
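Since a $k$-SIIRV is a sum of independent integer-valued random variables, its pmf is simply the convolution of the summands' pmfs. The following sketch (function names ours) makes this concrete; PBDs are recovered as the special case $k = 2$:

```python
import numpy as np
from math import comb

def siirv_pmf(component_pmfs):
    """pmf of X = X_1 + ... + X_n, each X_j given by its own pmf over
    {0, ..., k-1}, computed by iterated convolution."""
    dist = np.array([1.0])  # pmf of the empty sum, i.e. the constant 0
    for p in component_pmfs:
        dist = np.convolve(dist, p)
    return dist

def pbd_pmf(biases):
    """Poisson Binomial pmf: the case k = 2, with Bernoulli(q) summands."""
    return siirv_pmf([np.array([1 - q, q]) for q in biases])

# A PBD whose biases are all equal is just a Binomial distribution:
p = pbd_pmf([0.3] * 4)
assert np.allclose(p, [comb(4, j) * 0.3**j * 0.7**(4 - j) for j in range(5)])
```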

2.2 Tools from previous work

We first restate a result of Batu et al. relating closeness to uniformity in $\ell_2$ and $\ell_1$ norms to the "overall flatness" of the probability mass function; it will be one of the ingredients of the proof of Theorem 1.1:

Lemma \thelem ([BFR00, BFF01]).

Let $D$ be a distribution on a domain $S$. (a) If $\max_{i \in S} D(i) \leq (1+\varepsilon) \cdot \min_{i \in S} D(i)$, then $\|D\|_2^2 \leq (1+\varepsilon^2)/|S|$. (b) If $\|D\|_2^2 \leq (1+\varepsilon^2)/|S|$, then $\|D - \mathcal{U}_S\|_1 \leq \varepsilon$.
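For intuition, part (b) follows from a one-line computation, combining Cauchy–Schwarz with the identity $\|D - \mathcal{U}_S\|_2^2 = \|D\|_2^2 - \frac{1}{|S|}$ (valid for any distribution $D$ supported on $S$):

```latex
\|D - \mathcal{U}_S\|_1
  \le \sqrt{|S|}\,\|D - \mathcal{U}_S\|_2                 % Cauchy--Schwarz
  = \sqrt{|S|}\sqrt{\|D\|_2^2 - \tfrac{1}{|S|}}           % expand the square
  \le \sqrt{|S|}\sqrt{\tfrac{1+\varepsilon^2}{|S|} - \tfrac{1}{|S|}}
  = \varepsilon .
```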

To check condition (b) above we shall rely on the following, which one can derive from the techniques in [DKN15b] and whose proof we defer to Appendix A:

Lemma \thelem (Adapted from [DKN15b, Theorem 11]).

There exists an algorithm Check-Small-$\ell_2$ which, given parameters $\varepsilon, \delta \in (0, 1)$ and $C \cdot \frac{\sqrt{|S|}}{\varepsilon^2} \log\frac{1}{\delta}$ independent samples from a distribution $D$ over $S$ (for some absolute constant $C > 0$), outputs either yes or no, and satisfies the following.

  • If $\|D - \mathcal{U}_S\|_2 > \varepsilon/\sqrt{|S|}$, then the algorithm outputs no with probability at least $1 - \delta$;

  • If $\|D - \mathcal{U}_S\|_2 \leq \varepsilon/(2\sqrt{|S|})$, then the algorithm outputs yes with probability at least $1 - \delta$.
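For intuition, one standard way to implement such a check is via collision statistics, which yield an unbiased estimator of $\|D\|_2^2$. The sketch below conveys the idea only (constants and the sample size are untuned, and no amplification to confidence $1 - \delta$ is performed); it is not the exact test of [DKN15b]:

```python
import random
from itertools import combinations

def check_small_l2(samples, support_size, eps):
    """Accept when D looks close to uniform in ell_2 distance, by
    thresholding the collision-based estimate of ||D||_2^2."""
    m = len(samples)
    collisions = sum(1 for x, y in combinations(samples, 2) if x == y)
    est_l2_sq = collisions / (m * (m - 1) / 2)  # unbiased for ||D||_2^2
    # ||D - U_S||_2^2 = ||D||_2^2 - 1/|S|: accept iff the excess is small.
    return est_l2_sq - 1.0 / support_size <= eps ** 2 / (2 * support_size)

rng = random.Random(0)
uniform_samples = [rng.randrange(100) for _ in range(2000)]
print(check_small_l2(uniform_samples, 100, 0.5))  # typically True
```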

Finally, we will also rely on a classical result from Probability, the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality, restated below:

Theorem 2.1 ([DKW56, Mas90]).

Let $D$ be a distribution over $[n]$. Given $m$ independent samples $x_1, \ldots, x_m$ from $D$, define the empirical distribution $\hat{D}_m$ as follows:

$\hat{D}_m(i) \stackrel{\text{def}}{=} \frac{|\{ j \in [m] : x_j = i \}|}{m}, \qquad i \in [n].$

Then, for all $\varepsilon > 0$, $\Pr[ d_K(D, \hat{D}_m) > \varepsilon ] \leq 2e^{-2m\varepsilon^2}$, where $d_K$ denotes the Kolmogorov distance (i.e., the $\ell_\infty$ distance between cumulative distribution functions).

In particular, this implies that $O(1/\varepsilon^2)$ samples suffice to learn a distribution up to error $\varepsilon$ in Kolmogorov distance.
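In code, the DKW guarantee translates into the following small routines (a sketch, with hypothetical names and a 0-indexed domain):

```python
import math
import numpy as np

def dkw_sample_size(eps, delta):
    """Smallest m with 2 * exp(-2 m eps^2) <= delta, per the DKW bound."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def empirical_pmf(samples, n):
    """The empirical distribution over {0, ..., n-1}."""
    return np.bincount(samples, minlength=n) / len(samples)

def kolmogorov_distance(p, q):
    """d_K(p, q): sup-distance between cumulative distribution functions."""
    return float(np.max(np.abs(np.cumsum(p) - np.cumsum(q))))

# With m = dkw_sample_size(0.05, 0.01) samples, d_K(D, D_hat) <= 0.05
# except with probability at most 0.01.
```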

3 The General Algorithm

In this section, we obtain our main result (Theorem 1.1).

Intuition.

Before diving into the proof of this theorem, we first provide a high-level description of the argument. The algorithm proceeds in three stages. The first, the decomposition step, attempts to recursively construct a partition of the domain into a small number of intervals, with a very strong guarantee: if the decomposition succeeds, then the unknown distribution $D$ will be close (in $\ell_1$ distance) to its "flattening" on the partition; while if it fails (too many intervals have to be created), this serves as evidence that $D$ does not belong to the class, and we can reject. The second stage, the approximation step, then learns this flattening of the distribution – which can be done with few samples since, by construction, there are not many intervals. The last stage is purely computational, the projection step, where we verify that the flattening we have learned is indeed close to the class $\mathcal{C}$. If all three stages succeed, then by the triangle inequality it must be the case that $D$ is close to $\mathcal{C}$; and by the structural assumption on the class, if $D \in \mathcal{C}$ then it admits succinct enough partitions, and all three stages go through.

Turning to the proof, we start by formally defining the "structural criterion" we shall rely on, before describing the algorithm at the heart of our result in Section 3.1. (We note that a modification of this algorithm will be described in Section 5, and will allow us to derive Section 1.1.)

Definition \thedefn (Decompositions).

Let $\gamma > 0$ and $L = L(\gamma, n) \geq 1$. A class $\mathcal{C}$ of distributions on $[n]$ is said to be $(\gamma, L)$-decomposable if for every $D \in \mathcal{C}$ there exist $\ell \leq L$ and a partition $\mathcal{I}(\gamma, D) = (I_1, \ldots, I_\ell)$ of the interval $[1, n]$ such that, for all $j \in [\ell]$, one of the following holds:

  (i) $D(I_j) \leq \frac{\gamma}{L}$; or

  (ii) $\max_{i \in I_j} D(i) \leq (1+\gamma) \cdot \min_{i \in I_j} D(i)$.

Further, if $\mathcal{I}(\gamma, D)$ is dyadic (i.e., each $I_k$ is of the form $[j \cdot 2^i + 1, (j+1) \cdot 2^i]$ for some integers $i, j$, corresponding to the leaves of a recursive bisection of $[n]$), then $\mathcal{C}$ is said to be $(\gamma, L)$-splittable.
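To make the definition concrete, the following sketch (with our reconstructed thresholds for conditions (i) and (ii)) checks whether a candidate partition witnesses $(\gamma, L)$-decomposability for an explicitly given pmf:

```python
def is_good_decomposition(pmf, partition, gamma, L):
    """Check that `partition` -- a list of (start, end) index pairs,
    inclusive and 0-indexed, covering the domain in order -- witnesses the
    (gamma, L)-decomposition conditions: at most L parts, each part either
    light (mass <= gamma / L) or flat (max <= (1 + gamma) * min)."""
    if len(partition) > L:
        return False
    for start, end in partition:
        block = pmf[start:end + 1]
        light = sum(block) <= gamma / L
        flat = max(block) <= (1 + gamma) * min(block)
        if not (light or flat):
            return False
    return True

# A two-piece histogram is trivially decomposable with L = 2:
pmf = [0.15, 0.15, 0.15, 0.15, 0.1, 0.1, 0.1, 0.1]
assert is_good_decomposition(pmf, [(0, 3), (4, 7)], gamma=0.1, L=2)
```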

Lemma \thelem.

If $\mathcal{C}$ is $(\gamma, L)$-decomposable, then it is $(\gamma, O(L \log n))$-splittable.

Proof.

We begin by proving the following claim: every partition of the interval $[1, n]$ into $\ell$ intervals admits a refinement consisting of at most $O(\ell \log n)$ dyadic intervals. For this, it suffices to prove that every interval $[a, b] \subseteq [1, n]$ can be partitioned into at most $O(\log n)$ dyadic intervals (assuming $b - a + 1 \geq 2$; singletons are themselves dyadic). Indeed, let $i$ be the largest integer such that $2^{i+1} \leq b - a + 1$, and let $j$ be the smallest integer such that $j \cdot 2^i + 1 \geq a$. It follows that $j \cdot 2^i + 1 \leq a + 2^i$ and $(j+1) \cdot 2^i \leq a + 2^{i+1} - 1 \leq b$. So the dyadic interval $I_0 = [j \cdot 2^i + 1, (j+1) \cdot 2^i]$ is fully contained in $[a, b]$ and has size $2^i \geq \frac{b-a+1}{4}$.

We will also use the fact that, for every $i' \leq i$, each dyadic interval of size $2^i$ is the disjoint union of exactly $2^{i-i'}$ dyadic intervals of size $2^{i'}$. (1)

Now consider the following procedure: starting from the right (resp. left) endpoint of $I_0$, we add the largest dyadic interval which is adjacent to the current cover and fully contained in $[a, b]$, and recurse until the whole portion of $[a, b]$ to the right (resp. left) of $I_0$ is covered. Clearly, at the end of this procedure, the whole interval $[a, b]$ is covered by disjoint dyadic intervals. It remains to show that the procedure takes $O(\log n)$ steps. Indeed, using Equation 1, we can see that at least half of the remaining portion on the relevant side is covered at each step (except maybe for the first two steps, where it is at least a quarter). Thus, the procedure takes at most $O(\log n)$ steps in total, so that each of the $\ell$ intervals of the original partition can be covered with $O(\log n)$ dyadic intervals; this completes the proof of the claim.

To conclude the proof of the lemma, notice that the two conditions in Section 3 are closed under taking subsets, so every interval of the dyadic refinement still satisfies one of them. ∎
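The covering argument above is constructive. The following left-to-right variant (a sketch, 0-indexed) partitions an arbitrary interval $[a, b]$ into $O(\log n)$ dyadic blocks by always taking the largest aligned block that fits:

```python
def dyadic_cover(a, b):
    """Partition the integer interval [a, b] (inclusive) into O(log(b - a))
    disjoint dyadic intervals [j * 2**i, (j + 1) * 2**i - 1]."""
    out = []
    while a <= b:
        # Largest power of two that both divides a and fits in what remains:
        size = a & -a if a > 0 else 1 << ((b - a + 1).bit_length() - 1)
        while size > b - a + 1:
            size //= 2
        out.append((a, a + size - 1))
        a += size
    return out

# [3, 12] splits into four dyadic blocks:
assert dyadic_cover(3, 12) == [(3, 3), (4, 7), (8, 11), (12, 12)]
```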

3.1 The algorithm

Theorem 1.1 – and with it the corollaries of Section 1.1 – will follow from the theorem below, combined with the structural theorems from Section 4:

Theorem 3.1.

Let $\mathcal{C}$ be a class of distributions over $[n]$ for which the following holds.

  1. $\mathcal{C}$ is $(\gamma, L(\gamma, n))$-splittable;

  2. there exists a procedure ProjectionDist$_\mathcal{C}$ which, given as input a parameter $\alpha \in (0, 1)$ and the explicit description of a distribution $D'$ over $[n]$, returns yes if the distance $\ell_1(D', \mathcal{C})$ is at most $\alpha$, and no if $\ell_1(D', \mathcal{C}) \geq 2\alpha$ (and either yes or no otherwise).

Then, the algorithm TestSplittable (Algorithm 1) is a $\tilde{O}(\sqrt{nL}) \cdot \mathrm{poly}(1/\varepsilon)$-sample tester for $\mathcal{C}$, for $L = L(\Theta(\varepsilon), n)$. (Moreover, if ProjectionDist$_\mathcal{C}$ is computationally efficient, then so is TestSplittable.)

1: Input: domain $I$ (interval), sample access to $D$ over $I$; subroutine ProjectionDist$_\mathcal{C}$
2: Parameters: $\varepsilon \in (0, 1]$ and function $L(\cdot, \cdot)$.
3: Setting Up
4:     Define $\gamma \gets \Theta(\varepsilon)$, $L \gets L(\gamma, |I|)$, $\kappa \gets \Theta(\varepsilon/L)$, $\delta \gets \Theta(1/L)$; let Check-Small-$\ell_2$ be as in Section 2.2.
5:     Set $m \gets C \cdot \frac{\sqrt{L|I|}}{\varepsilon^3} \log|I|$, where $C$ is an absolute constant.
6:     Obtain a sequence s of $m$ independent samples from $D$.    For any interval $J \subseteq I$, let $m_J$ be the number of samples falling in $J$.
7:
8: Decomposition
9:     while  the current interval $J$ satisfies $\frac{m_J}{m} \geq \kappa$ and at most $L$ splits have been performed  do
10:         Run Check-Small-$\ell_2$ (from Section 2.2) with parameters $\Theta(\varepsilon)$ and $\delta$, using the samples of s belonging to $J$.
11:         if  Check-Small-$\ell_2$ outputs no then
12:             Bisect $J$, and recurse on both halves (using the same samples).
13:         end if
14:     end while
15:     if more than $L$ splits have been performed then
16:         return REJECT
17:     else
18:         Let $\mathcal{I} = (I_1, \ldots, I_\ell)$ be the partition of $I$ from the leaves of the recursion. ($\ell \leq L$.)
19:     end if
20:
21: Approximation
22:     Learn the flattening $\Phi(D, \mathcal{I})$ of $D$ to $\ell_1$ error $\Theta(\varepsilon)$ (with probability $9/10$), using $O(\ell/\varepsilon^2)$ new samples. Let $\tilde{D}$ be the resulting hypothesis. ($\tilde{D}$ is an $\ell$-histogram.)
23:
24: Offline Check
25:     return ACCEPT if and only if ProjectionDist$_\mathcal{C}$($\Theta(\varepsilon)$, $\tilde{D}$) returns yes. (No sample needed.)
26:
Algorithm 1 TestSplittable

3.2 Proof of Theorem 3.1

We now give the proof of our main result (Theorem 3.1), first analyzing the sample complexity of Algorithm 1 before arguing its correctness. For the latter, we will need the following simple fact from [ILR12], restated below:

Fact 3.2 ([ILR12, Fact 1]).

Let $D$ be a distribution over $[n]$, and $\delta \in (0, 1]$. Given $m \geq \frac{C'}{\kappa} \log\frac{n}{\delta}$ independent samples from $D$ (for some absolute constant $C' > 0$), with probability at least $1 - \delta$ we have that, for every interval $I \subseteq [n]$:

  (i) if $D(I) \geq \kappa$, then $\frac{m_I}{m} \geq \frac{D(I)}{2}$;

  (ii) if $D(I) \leq \frac{\kappa}{4}$, then $\frac{m_I}{m} < \frac{\kappa}{2}$;

  (iii) if $\frac{m_I}{m} \geq \frac{\kappa}{2}$, then $D(I) \geq \frac{\kappa}{4}$;

where $m_I$ is the number of the samples falling into $I$.

3.3 Sample complexity.

The sample complexity is immediate, and comes from Steps 6 and 22. The total number of samples is $m + O(\ell/\varepsilon^2) = \tilde{O}(\sqrt{nL}) \cdot \mathrm{poly}(1/\varepsilon)$.

3.4 Correctness.

Say an interval $J$ considered during the execution of the "Decomposition" step is heavy if its empirical mass $\frac{m_J}{m}$ is big enough on Step 9 (that is, at least $\kappa$), and light otherwise; and let $\mathcal{H}$ and $\mathcal{L}$ denote the sets of heavy and light intervals, respectively. By the choice of $m$ and a union bound over all possible intervals, we can assume on one hand that with probability at least $9/10$ the guarantees of 3.2 hold simultaneously for all intervals considered. We hereafter condition on this event.

We first argue that if the algorithm does not reject in Step 15, then with probability at least $9/10$ we have $\|D - \Phi(D, \mathcal{I})\|_1 = O(\varepsilon)$. Indeed, we can write this $\ell_1$ distance as the sum of two terms: the contribution of the heavy intervals, and that of the light intervals. Let us bound the two terms separately.

  • If $J \in \mathcal{H}$, then by our choice of threshold $\kappa$ there are enough samples in $J$ to apply Section 2.2 with parameters $\Theta(\varepsilon)$ and $\delta$; conditioning on all of the (at most $L$) corresponding events happening, which overall fails with probability at most $1/10$ by a union bound, we get $\|D_J - \mathcal{U}_J\|_2 \leq O(\varepsilon)/\sqrt{|J|}$, as Check-Small-$\ell_2$ returned yes; and by Section 2.2 this implies $\|D_J - \mathcal{U}_J\|_1 = O(\varepsilon)$. The total contribution of the heavy intervals is thus $\sum_{J \in \mathcal{H}} D(J) \cdot \|D_J - \mathcal{U}_J\|_1 = O(\varepsilon)$.

  • If $J \in \mathcal{L}$, then we claim that $D(J) < 2\kappa$. Clearly, this is true if $D(J) \leq \kappa$, so it only remains to consider the case $D(J) > \kappa$. But then the claim follows from 3.2(i): had we $D(J) \geq 2\kappa$, then $\frac{m_J}{m} \geq \frac{D(J)}{2} \geq \kappa$ would have been big enough for $J$ to be heavy. Overall, the total contribution of the light intervals is at most $\sum_{J \in \mathcal{L}} 2 D(J) \leq O(L\kappa) = O(\varepsilon)$, for a suitable choice of the constant in the definition of $\kappa$; where we used that the number of light intervals is $O(L)$.

Putting it together, this yields $\|D - \Phi(D, \mathcal{I})\|_1 = O(\varepsilon)$, as claimed.

Soundness.

By contrapositive, we argue that if the test returns ACCEPT, then (with probability at least $2/3$) $D$ is $\varepsilon$-close to $\mathcal{C}$. Indeed, conditioning on $\tilde{D}$ being $O(\varepsilon)$-close to $\Phi(D, \mathcal{I})$ (and hence to $D$), the fact that ProjectionDist$_\mathcal{C}$ returned yes implies that $\tilde{D}$ is $O(\varepsilon)$-close to $\mathcal{C}$; we then get, by the triangle inequality and for a suitable choice of constants, that $D$ is $\varepsilon$-close to $\mathcal{C}$.

Overall, this happens except with probability at most $1/3$.

Completeness.

Assume $D \in \mathcal{C}$. Then the choice of $\gamma$ and $L$ ensures the existence of a good dyadic partition $\mathcal{I}(\gamma, D) = (I_1, \ldots, I_{\ell'})$, $\ell' \leq L$, in the sense of Section 3. For any $I_j$ in this partition for which condition (i) holds ($D(I_j) \leq \frac{\gamma}{L}$), the empirical mass will be small (by 3.2(ii) and our choice of constants, which ensure $\frac{\gamma}{L} \leq \frac{\kappa}{4}$), and $I_j$ will be kept as a "light leaf." For the other ones, condition (ii) holds: let $I_j$ be one of these (at most $L$) intervals.

  • If the empirical mass of $I_j$ is too small on Step 9, then $I_j$ is kept as a "light leaf."

  • Otherwise, by our choice of constants we can use Section 2.2 and apply Section 2.2 with parameters $\Theta(\varepsilon)$ and $\delta$; conditioning on all of the (at most $L$) corresponding events happening, which overall fails with probability at most $1/10$ by a union bound, Check-Small-$\ell_2$ will output yes, as condition (ii) guarantees that $D_{I_j}$ is close enough to uniform in $\ell_2$ distance; and $I_j$ is kept as a "flat leaf."

Therefore, as $\mathcal{I}(\gamma, D)$ is dyadic, the Decomposition stage is guaranteed to stop within at most $L$ splits (in the worst case, it goes on until intervals of size one – trivially flat – are considered, at which point it succeeds). In more detail, we want to argue that if $D$ is in the class, then a decomposition with at most $L$ pieces is found by the algorithm. Since there is a dyadic decomposition with at most $L$ pieces (namely, $\mathcal{I}(\gamma, D)$), it suffices to argue that the algorithm will never split one of the $I_j$'s (as every single $I_j$ will eventually be considered by the recursive binary splitting, unless the algorithm stopped recursing on this "path" before even considering $I_j$, which is even better). But this is the case by the above argument, which ensures each such $I_j$ will be recognized as satisfying one of the two conditions for a "good decomposition": being either close to uniform in $\ell_2$ distance, or having very little mass. Thus Step 15 passes, and the algorithm reaches the Approximation stage. By the foregoing discussion, this implies $\Phi(D, \mathcal{I})$ is $O(\varepsilon)$-close to $D$ (and hence to $\mathcal{C}$); $\tilde{D}$ is then (except with probability at most $1/10$) $O(\varepsilon)$-close to $\mathcal{C}$ as well, so that ProjectionDist$_\mathcal{C}$ returns yes and the algorithm returns ACCEPT.

4 Structural Theorems

In this section, we show that a wide range of natural distribution families are succinctly decomposable, and provide efficient projection algorithms for each class.

4.1 Existence of Structural Decompositions

Theorem 4.1 (Monotonicity).

For all $\gamma > 0$, the class $\mathcal{M}$ of monotone distributions on $[n]$ is $(\gamma, L)$-splittable for $L = O\!\left(\frac{\log^2 n}{\gamma}\right)$.

Note that this proof can already be found in [BKR04, Theorem 10], interwoven with the analysis of their algorithm. For the sake of being self-contained, we reproduce the structural part of their argument, removing its algorithmic aspects:

Proof of Theorem 4.1.

We define the $\mathcal{I}_k$'s recursively as follows: $\mathcal{I}_0 = ([1, n])$, and for all $k \geq 0$ the partition $\mathcal{I}_{k+1}$ is obtained from $\mathcal{I}_k = (I_1^{(k)}, \ldots, I_{\ell_k}^{(k)})$ by going over the $I_j^{(k)}$'s in order, and:

  (a) if $D(I_j^{(k)}) \leq \frac{\gamma}{L}$, then $I_j^{(k)}$ is added as an element of $\mathcal{I}_{k+1}$ ("marked as leaf");

  (b) else, if $\max_{i \in I_j^{(k)}} D(i) \leq (1+\gamma) \cdot \min_{i \in I_j^{(k)}} D(i)$, then $I_j^{(k)}$ is added as an element of $\mathcal{I}_{k+1}$ ("marked as leaf");

  (c) otherwise, bisect $I_j^{(k)}$ into $I_{j,1}^{(k)}$, $I_{j,2}^{(k)}$ (with $|I_{j,1}^{(k)}| = |I_{j,2}^{(k)}|$) and add both as elements of $\mathcal{I}_{k+1}$;

and repeat until convergence (that is, whenever case (c) is not applied for any of the intervals). Clearly, this process is well-defined, and will eventually terminate (as $(\ell_k)_k$ is a non-decreasing sequence of natural numbers, upper bounded by $n$). Let $\mathcal{I} = (I_1, \ldots, I_\ell)$ (with $\ell \leq L$) be its outcome, so that the $I_j$'s are consecutive intervals all satisfying either (a) or (b). As (a) and (b) respectively imply conditions (i) and (ii) of Section 3, we only need to show that $\ell = O\!\left(\frac{\log^2 n}{\gamma}\right)$; for this purpose, we shall leverage as in [BKR04] the fact that $D$ is monotone to bound the number of recursion steps.

The recursion above defines a binary tree (with the leaves being the intervals satisfying (a) or (b), and the internal nodes the other ones). Let $H$ be the number of recursion steps the process goes through before converging (the height of the tree); as mentioned above, we have $H \leq \log n$ (as we start with an interval of size $n$, and the length is halved at each step). Observe further that if at any point an interval $I_j^{(k)}$ has $D(I_j^{(k)}) \leq \frac{\gamma}{L}$, then it immediately (as well as all the $I_{j'}^{(k)}$'s for $j' > j$, by monotonicity) satisfies (a) and is no longer split ("becomes a leaf"). So at any depth $k$, the number $n_k$ of intervals for which neither (a) nor (b) holds must satisfy

$(1+\gamma)^{n_k - 1} < \prod_{j=1}^{n_k - 1} \frac{D(a_j^{(k)})}{D(a_{j+1}^{(k)})} = \frac{D(a_1^{(k)})}{D(a_{n_k}^{(k)})} \leq \frac{Ln}{\gamma},$

where $a_j^{(k)}$ denotes the beginning of the $j$-th such interval (again, we use monotonicity to argue that the extrema are reached at the endpoints of each interval; the last inequality holds since the interval starting at $a_{n_k}^{(k)}$ has mass greater than $\frac{\gamma}{L}$, so that $D(a_{n_k}^{(k)}) \geq \frac{\gamma}{Ln}$). This yields $n_k = O\!\left(\frac{\log(Ln/\gamma)}{\gamma}\right) = O\!\left(\frac{\log n}{\gamma}\right)$ (for $\gamma \geq 1/n$, say). In particular, the total number of internal nodes is then at most

$H \cdot \max_k n_k \leq \log n \cdot O\!\left(\frac{\log n}{\gamma}\right) = O\!\left(\frac{\log^2 n}{\gamma}\right),$

which implies the same bound on the number of leaves $\ell$. ∎
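The recursion in this proof is straightforward to render as code. The following sketch (with our reconstructed thresholds for conditions (a) and (b), and $n$ a power of two) builds the dyadic partition for an explicitly given pmf:

```python
def split_monotone(pmf, gamma, L):
    """Recursively bisect [0, n-1], keeping an interval as a leaf once it
    is light (mass <= gamma / L) or flat (max <= (1 + gamma) * min);
    returns the resulting dyadic partition as a list of index pairs."""
    leaves = []

    def recurse(lo, hi):
        block = pmf[lo:hi + 1]
        if sum(block) <= gamma / L or max(block) <= (1 + gamma) * min(block):
            leaves.append((lo, hi))
        else:
            mid = (lo + hi) // 2
            recurse(lo, mid)
            recurse(mid + 1, hi)

    recurse(0, len(pmf) - 1)
    return leaves

# A geometric (hence monotone) pmf over [16]:
raw = [2.0 ** (-i) for i in range(16)]
pmf = [x / sum(raw) for x in raw]
print(split_monotone(pmf, gamma=0.5, L=64))  # few leaves: each flat or light
```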

Corollary \thecoro (Unimodality).

For all $\gamma > 0$, the class $\mathcal{M}_1$ of unimodal distributions on $[n]$ is $(\gamma, L)$-decomposable for $L = O\!\left(\frac{\log^2 n}{\gamma}\right)$.

Proof.

For any $D \in \mathcal{M}_1$, the domain $[n]$ can be partitioned into two intervals $I$ and $J$ such that $D_I$ and $D_J$ are respectively monotone non-decreasing and non-increasing. Applying Theorem 4.1 to $D_I$ and $D_J$ and taking the union of both partitions yields a (no longer necessarily dyadic) partition of $[n]$. ∎

The same argument yields an analogous statement for $t$-modal distributions:

Corollary \thecoro (-modality).

For any $t \geq 1$ and all $\gamma > 0$, the class $\mathcal{M}_t$ of $t$-modal distributions on $[n]$ is $(\gamma, L)$-decomposable for $L = O\!\left(\frac{t \log^2 n}{\gamma}\right)$.

Corollary \thecoro (Log-concavity, concavity and convexity).

For all $\gamma > 0$, the classes $\mathcal{L}$, $\mathcal{K}^-$ and $\mathcal{K}^+$ of log-concave, concave and convex distributions on $[n]$ are $(\gamma, L)$-decomposable for $L = O\!\left(\frac{\log^2 n}{\gamma}\right)$.

Proof.

This is directly implied by Section 4.1, recalling that log-concave, concave and convex distributions are unimodal. ∎

Theorem 4.2 (Monotone Hazard Rate).

For all $\gamma > 0$, the class $\mathcal{MHR}$ of MHR distributions on $[n]$ is $(\gamma, L)$-decomposable for $L = O\!\left(\frac{\log(n/\gamma)}{\gamma}\right)$.

Proof.

This follows from adapting the proof of [CDSS13], which establishes that every MHR distribution can be approximated in $\ell_1$ distance by a histogram on few intervals. For completeness, we reproduce their argument, suitably modified to our purposes, in Appendix B. ∎

Theorem 4.3 (Piecewise Polynomials).

For all $\gamma > 0$ and $L \geq 1$, $d \geq 0$, the class $\mathcal{P}_{L,d}$ of $L$-piecewise degree-$d$ distributions on $[n]$ is $(\gamma, L')$-decomposable for $L' = O\!\left(\frac{L(d+1)\log^2 n}{\gamma}\right)$. (Moreover, for the class of $L$-histograms ($d = 0$) one can take $L' = L$.)

Proof.

The last part of the statement is obvious, so we focus on the first claim. Observing that each of the $L$ pieces of a distribution $D \in \mathcal{P}_{L,d}$ can be subdivided into at most $d+1$ intervals on which $D$ is monotone (being a degree-$d$ polynomial on each such piece), we obtain a partition of $[n]$ into at most $L(d+1)$ intervals. $D$ being monotone on each of them, we can apply an argument almost identical to that of Theorem 4.1 to show that each interval can be further split into $O\!\left(\frac{\log^2 n}{\gamma}\right)$ subintervals, yielding a good decomposition with $O\!\left(\frac{L(d+1)\log^2 n}{\gamma}\right)$ pieces. ∎

4.2 Projection Step: computing the distances

This section contains details of the distance estimation procedures for these classes, required in the last stage of Algorithm 1. (Note that some of these results are phrased in terms of distance approximation, as estimating the distance to sufficient accuracy in particular yields an algorithm for this stage.)

We focus in this section on achieving the sample complexities stated in the corollaries of Section 1.1. While almost all the distance estimation procedures we give in this section are efficient – running in time polynomial in all the parameters, or even with only a polylogarithmic dependence on $n$ – there are two exceptions, namely the procedures for monotone hazard rate (Section 4.2) and log-concave (Section 4.2) distributions. We do describe computationally efficient procedures for these two cases as well in Section 4.2.1, at a modest additive cost in the sample complexity.

Lemma \thelem (Monotonicity [BKR04, Lemma 8]).

There exists a procedure that, on input $n$ as well as the full (succinct) specification of an $\ell$-histogram $D$ on $[n]$, computes the (exact) distance $\ell_1(D, \mathcal{M})$ in time polynomial in $\ell$ and $\log n$.

A straightforward modification of the algorithm above (e.g., by adapting the underlying linear program to take as input the location of the mode of the distribution; then trying all $n$ possibilities, running the subroutine $n$ times and picking the minimum value) results in a similar claim for unimodal distributions:

Lemma \thelem (Unimodality).

There exists a procedure that, on input $n$ as well as the full (succinct) specification of an $\ell$-histogram $D$ on $[n]$, computes the (exact) distance $\ell_1(D, \mathcal{M}_1)$ in time polynomial in $n$ and $\ell$.

A similar result can easily be obtained for the class $\mathcal{M}_t$ of $t$-modal distributions as well, with a $\mathrm{poly}(n, \ell, t)$-time algorithm based on a combination of dynamic and linear programming. Analogous statements hold for the classes $\mathcal{K}^-$ and $\mathcal{K}^+$ of concave and convex distributions, also based on linear programming (specifically, on running $O(n^2)$ different linear programs – one for each possible support – and taking the minimum over them).
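To illustrate the linear-programming approach underlying these procedures, here is a minimal sketch for the monotone case, written over the full domain rather than the succinct $\ell$-histogram representation (so its running time is polynomial in $n$ rather than in $\ell$); it illustrates the technique and is not the algorithm of [BKR04] itself:

```python
import numpy as np
from scipy.optimize import linprog

def l1_distance_to_monotone(D):
    """Exact ell_1 distance from the pmf D to the class of monotone
    non-increasing distributions, as an LP over variables (M, t):
    minimize sum(t) s.t. |D - M| <= t, M non-increasing, M >= 0, sum(M) = 1."""
    n = len(D)
    c = np.concatenate([np.zeros(n), np.ones(n)])   # objective: sum of t
    rows, rhs = [], []
    for i in range(n - 1):                          # M[i+1] - M[i] <= 0
        r = np.zeros(2 * n); r[i + 1], r[i] = 1.0, -1.0
        rows.append(r); rhs.append(0.0)
    for i in range(n):                              # D[i] - M[i] <= t[i]
        r = np.zeros(2 * n); r[i], r[n + i] = -1.0, -1.0
        rows.append(r); rhs.append(-D[i])
    for i in range(n):                              # M[i] - D[i] <= t[i]
        r = np.zeros(2 * n); r[i], r[n + i] = 1.0, -1.0
        rows.append(r); rhs.append(D[i])
    A_eq = np.concatenate([np.ones(n), np.zeros(n)]).reshape(1, -1)
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
                  A_eq=A_eq, b_eq=[1.0], bounds=(0, None))
    return res.fun

print(l1_distance_to_monotone([0.1, 0.4, 0.3, 0.2]))  # > 0: not monotone
```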

Lemma \thelem (MHR).

There exists a (non-efficient) procedure that, on input $n$, $\varepsilon$, as well as the full specification of a distribution $D$ on $[n]$, distinguishes between $\ell_1(D, \mathcal{MHR}) \leq \varepsilon$ and $\ell_1(D, \mathcal{MHR}) \geq 2\varepsilon$, in time $(n/\varepsilon)^{O(n)}$.

Lemma \thelem (Log-concavity).

There exists a (non-efficient) procedure that, on input $n$, $\varepsilon$, as well as the full specification of a distribution $D$ on $[n]$, distinguishes between $\ell_1(D, \mathcal{L}) \leq \varepsilon$ and $\ell_1(D, \mathcal{L}) \geq 2\varepsilon$, in time $(n/\varepsilon)^{O(n)}$.

Proof of Section 4.2 and Section 4.2.

We here give a naive algorithm for these two problems, based on an exhaustive search over a (huge) $\varepsilon$-cover $\mathcal{K}$ of the set of distributions over $[n]$. Essentially, $\mathcal{K}$ contains all possible distributions whose probabilities are integer multiples of $\frac{\varepsilon}{n}$ (so that $|\mathcal{K}| = O\!\left((n/\varepsilon)^n\right)$). It is not hard to see that this indeed defines an $\varepsilon$-cover of the set of all distributions, and moreover that it can be computed in time $\mathrm{poly}(|\mathcal{K}|)$. To approximate the distance from an explicit distribution $D$ to the class $\mathcal{C}$ (either $\mathcal{MHR}$ or $\mathcal{L}$), it is enough to go over every element $D'$ of $\mathcal{K}$, checking (this time, efficiently) if $\ell_1(D, D') \leq \varepsilon$ and if there is a distribution $D'' \in \mathcal{C}$ close to $D'$ (this time, pointwise, that is within a multiplicative $(1 \pm \varepsilon)$ factor for all $i$) – which also implies $\ell_1$ closeness of $D'$, and thus of $D$, to $\mathcal{C}$. The test for pointwise closeness can be done by checking feasibility of a linear program with variables corresponding to the logarithms of the probabilities, i.e. $y_i = \log D''(i)$. Indeed, this formulation allows us to rephrase the log-concave and MHR constraints as linear constraints, and pointwise approximation is simply enforcing that $|y_i - \log D'(i)| \leq \log(1+\varepsilon)$ for all $i$. At the end of this enumeration, the procedure accepts if and only if, for some $D' \in \mathcal{K}$, both $\ell_1(D, D') \leq \varepsilon$ held and the corresponding linear program was feasible. ∎

Lemma \thelem (Piecewise Polynomials).

There exists a procedure that, on input $n$ as well as the full specification of an $\ell$-histogram $D$ on $[n]$, computes an approximation of the distance $\ell_1(D, \mathcal{P}_{L,d})$