Learning Poisson Binomial Distributions
We consider a basic problem in unsupervised learning: learning an unknown Poisson Binomial Distribution. A Poisson Binomial Distribution (PBD) over is the distribution of a sum of independent Bernoulli random variables which may have arbitrary, potentially non-equal, expectations. These distributions were first studied by S. Poisson in 1837 [Poi37] and are a natural -parameter generalization of the familiar Binomial Distribution. Surprisingly, prior to our work this basic learning problem was poorly understood, and known results for it were far from optimal.
We essentially settle the complexity of the learning problem for this basic class of distributions. As our first main result we give a highly efficient algorithm which learns to -accuracy (with respect to the total variation distance) using samples independent of . The running time of the algorithm is quasilinear in the size of its input data, i.e., bit-operations.111We write to hide factors which are polylogarithmic in the argument to ; thus, for example, denotes a quantity which is for some absolute constant . (Observe that each draw from the distribution is a -bit string.) Our second main result is a proper learning algorithm that learns to -accuracy using samples, and runs in time . This sample complexity is nearly optimal, since any algorithm for this problem must use samples. We also give positive and negative results for some extensions of this learning problem to weighted sums of independent Bernoulli random variables.
We begin by considering a somewhat fanciful scenario: You are the manager of an independent weekly newspaper in a city of people. Each week the -th inhabitant of the city independently picks up a copy of your paper with probability . Of course you do not know the values ; each week you only see the total number of papers that have been picked up. For many reasons (advertising, production, revenue analysis, etc.) you would like to have a detailed “snapshot” of the probability distribution (pdf) describing how many readers you have each week. Is there an efficient algorithm to construct a high-accuracy approximation of the pdf from a number of observations that is independent of the population ? We show that the answer is “yes.”
A Poisson Binomial Distribution of order is the distribution of a sum
where are independent Bernoulli (0/1) random variables. The expectations need not all be the same, and thus these distributions generalize the Binomial distribution and, indeed, comprise a much richer class of distributions. (See Section 1.2 below.) It is believed that Poisson [Poi37] was the first to consider this extension of the Binomial distribution222We thank Yuval Peres and Sam Watson for this information [PW11]. and the distribution is sometimes referred to as “Poisson’s Binomial Distribution” in his honor; we shall simply call these distributions PBDs.
PBDs are one of the most basic classes of discrete distributions; indeed, they are arguably the simplest -parameter probability distribution that has some nontrivial structure. As such they have been intensely studied in probability and statistics (see Section 1.2) and arise in many settings; for example, we note here that tail bounds on PBDs form an important special case of Chernoff/Hoeffding bounds [Che52, Hoe63, DP09]. In application domains, PBDs have many uses in research areas such as survey sampling, case-control studies, and survival analysis, see e.g., [CL97] for a survey of the many uses of these distributions in applications. Given the simplicity and ubiquity of these distributions, it is quite surprising that the problem of density estimation for PBDs (i.e., learning an unknown PBD from independent samples) is not well understood in the statistics or learning theory literature. This is the problem we consider, and essentially settle, in this paper.
We work in a natural PAC-style model of learning an unknown discrete probability distribution which is essentially the model of [KMR94]. In this learning framework for our problem, the learner is provided with the value of and with independent samples drawn from an unknown PBD . Using these samples, the learner must with probability at least output a hypothesis distribution such that the total variation distance is at most , where are accuracy and confidence parameters that are provided to the learner.333[KMR94] used the Kullback-Leibler divergence as their distance measure but we find it more natural to use variation distance. A proper learning algorithm in this framework outputs a distribution that is itself a Poisson Binomial Distribution, i.e., a vector which describes the hypothesis PBD where .
1.1 Our results.
Our main result is an efficient algorithm for learning PBDs from many samples independent of Since PBDs are an -parameter family of distributions over the domain , we view such a tight bound as a surprising result. We prove:
Theorem 1 (Main Theorem).
Let be an unknown PBD.
[Learning PBDs from constantly many samples] There is an algorithm with the following properties: given and access to independent draws from , the algorithm uses
samples from , performs
bit operations, and with probability at least outputs a (succinct description of a) distribution over which is such that
[Properly learning PBDs from constantly many samples] There is an algorithm with the following properties: given and access to independent draws from , the algorithm uses
samples from , performs
bit operations, and with probability at least outputs a (succinct description of a) vector defining a PBD such that
We note that, since every sample drawn from is a -bit string, for constant the number of bit-operations performed by our first algorithm is quasilinear in the length of its input. Moreover, the sample complexity of both algorithms is close to optimal, since samples are required even to distinguish the (simpler) Binomial distributions and , which have total variation distance Indeed, in view of this observation, our second algorithm is essentially sample-optimal.
Motivated by these strong learning results for PBDs, we also consider learning a more general class of distributions, namely distributions of the form which are weighted sums of independent Bernoulli random variables. We give an algorithm which uses samples and runs in time if there are only constantly many different weights in the sum:
Theorem 2 (Learning sums of weighted independent Bernoulli random variables).
Let be a weighted sum of unknown independent Bernoullis such that there are at most different values among Then there is an algorithm with the following properties: given and access to independent draws from , it uses
samples from , runs in time
and with probability at least outputs a hypothesis vector defining independent Bernoulli random variables with such that where .
To complement Theorem 2, we also show that if there are many distinct weights in the sum, then even for weights with a very simple structure any learning algorithm must use many samples:
Theorem 3 (Sample complexity lower bound for learning sums of weighted independent Bernoullis).
Let be a weighted sum of unknown independent Bernoullis (where the -th weight is simply ). Let be any learning algorithm which, given and access to independent draws from , outputs a hypothesis distribution such that with probability at least Then must use samples.
1.2 Related work.
At a high level, there has been a recent surge of interest in the theoretical computer science community on fundamental algorithmic problems involving basic types of probability distributions, see e.g., [KMV10, MV10, BS10, VV11] and other recent papers; our work may be considered as an extension of this theme. More specifically, there is a broad literature in probability theory studying various properties of PBDs; see [Wan93] for an accessible introduction to some of this work. In particular, many results study approximations to the Poisson Binomial distribution via simpler distributions. In a well-known result, Le Cam [Cam60] shows that for any PBD with , it holds that
where is the Poisson distribution with parameter . Subsequently many other proofs of this result and similar ones were given using a range of different techniques; [HC60, Che74, DP86, BHJ92] is a sampling of work along these lines, and Steele [Ste94] gives an extensive list of relevant references. Much work has also been done on approximating PBDs by normal distributions (see e.g., [Ber41, Ess42, Mik93, Vol95]) and by Binomial distributions (see e.g., [Ehm91, Soo96, Roo00]). These results provide structural information about PBDs that can be well-approximated via simpler distributions, but fall short of our goal of obtaining approximations of an unknown PBD up to arbitrary accuracy. Indeed, the approximations obtained in the probability literature (such as the Poisson, Normal and Binomial approximations) typically depend only on the first few moments of the target PBD. This is attractive from a learning perspective because it is possible to efficiently estimate such moments from random samples, but higher moments are crucial for arbitrary approximation [Roo00].
Taking a different perspective, it is easy to show (see Section 2 of [KG71]) that every PBD is a unimodal distribution over . (Recall that a distribution over is unimodal if there is a value such that for and for ) The learnability of general unimodal distributions over is well understood: Birgé [Bir87a, Bir97] has given a computationally efficient algorithm that can learn any unimodal distribution over to variation distance from samples, and has shown that any algorithm must use samples. (The [Bir87a, Bir97] upper and lower bounds are stated for continuous unimodal distributions, but the arguments are easily adapted to the discrete case.) Our main result, Theorem 1, shows that the additional PBD assumption can be leveraged to obtain sample complexity independent of with a computationally highly efficient algorithm.
So, how might one leverage the structure of PBDs to remove from the sample complexity? A first observation is that a PBD assigns of its mass to points. So one could draw samples to (approximately) identify these points and then try to estimate the probability assigned to each such point, but clearly such an approach, if followed naïvely, would give sample complexity. Alternatively, one could run Birgé’s algorithm on the restricted support of size , but that will not improve the asymptotic sample complexity. A different approach would be to construct a small -cover (under the total variation distance) of the space of all PBDs on variables. Indeed, if such a cover has size , it can be shown (see Lemma 10 in Section 3.1, or Chapter 7 of [DL01])) that a target PBD can be learned from samples. Still it is easy to argue that any cover needs to have size , so this approach too gives a dependence in the sample complexity.
Our approach, which removes completely from the sample complexity, requires a refined understanding of the structure of the set of all PBDs on variables, in fact one that is more refined than the understanding provided by the aforementioned results (approximating a PBD by a Poisson, Normal, or Binomial distribution). We give an outline of the approach in the next section.
1.3 Our approach.
The starting point of our algorithm for learning PBDs is a theorem of [DP11, Das08] that gives detailed information about the structure of a small -cover (under the total variation distance) of the space of all PBDs on variables (see Theorem 4). Roughly speaking, this result says that every PBD is either close to a PBD whose support is sparse, or is close to a translated “heavy” Binomial distribution. Our learning algorithm exploits this structure of the cover; it has two subroutines corresponding to these two different types of distributions that the cover contains. First, assuming that the target PBD is close to a sparsely supported distribution, it runs Birgé’s unimodal distribution learner over a carefully selected subinterval of to construct a hypothesis ; the (purported) sparsity of the distribution makes it possible for this algorithm to use samples independent of . Then, assuming that the target PBD is close to a translated “heavy” Binomial distribution, the algorithm constructs a hypothesis Translated Poisson Distribution [R0̈7] whose mean and variance match the estimated mean and variance of the target PBD; we show that is close to the target PBD if the target PBD is not close to any sparse distribution in the cover. At this point the algorithm has two hypothesis distributions, and , one of which should be good; it remains to select one as the final output hypothesis. This is achieved using a form of “hypothesis testing” for probability distributions.
The above sketch captures the main ingredients of Part (1) of Theorem 1, but additional work needs to be done to get the proper learning algorithm of Part (2). For the non-sparse case, first note that the Translated Poisson hypothesis is not a PBD. Via a sequence of transformations we are able to show that the Translated Poisson hypothesis can be converted to a Binomial distribution for some To handle the sparse case, we use an alternate learning approach: instead of using Birgé’s unimodal algorithm (which would incur a sample complexity of ), we first show that, in this case, there exists an efficiently constructible -cover of size and then apply a general learning result that we now describe.
The general learning result that we use (Lemma 10) is the following: We show that for any class of target distributions, if has an -cover of size then there is a generic algorithm for learning an unknown distribution from to accuracy that uses samples. Our approach is rather similar to the algorithm of [DL01] for choosing a density estimate (but different in some details); it works by carrying out a tournament that matches every pair of distributions in the cover against each other. Our analysis shows that with high probability some -accurate distribution in the cover will survive the tournament undefeated, and that any undefeated distribution will with high probability be -accurate.
Applying this general result to the -cover of size described above, we obtain a PBD that is -close to the target (this accounts for the increased running time in Part (2) versus Part (1)). We stress that for both the non-proper and proper learning algorithms sketched above, many technical subtleties and challenges arise in implementing the high-level plan given above, requiring a careful and detailed analysis.
We prove Theorem 2 using the general approach of Lemma 10 specialized to weighted sums of independent Bernoullis with constantly many distinct weights. We show how the tournament can be implemented efficiently for the class of weighted sums of independent Bernoullis with constantly many distinct weights, and thus obtain Theorem 2. Finally, the lower bound of Theorem 3 is proved by a direct information-theoretic argument.
For a distribution supported on we write to denote the value of the probability density function (pdf) at point , and to denote the value of the cumulative density function (cdf) at point . For , we write to denote and to denote the conditional distribution of restricted to Sometimes we write and for a subset , meaning and respectively.
Total Variation Distance.
Recall that the total variation distance between two distributions and over a finite domain is
Similarly, if and are two random variables ranging over a finite set, their total variation distance is defined as the total variation distance between their distributions. For convenience, we will often blur the distinction between a random variable and its distribution.
Fix a finite domain , and let denote some set of distributions over Given , a subset is said to be a -cover of (w.r.t. the total variation distance) if for every distribution in there exists some distribution in such that We sometimes say that distributions are -neighbors if . If this holds, we also say that is -close to and vice versa.
Poisson Binomial Distribution.
A Poisson binomial distribution of order is a sum of mutually independent Bernoulli random variables . We denote the set of all Poisson binomial distributions of order by and, if is clear from context, just .
A Poisson binomial distribution can be represented uniquely as a vector satisfying . To go from to its corresponding vector, we find a collection of mutually independent Bernoullis such that is distributed according to and . (Such a collection exists by the definition of a Poisson binomial distribution.) Then we set for all . Lemma 1 of [DP13] shows that the resulting vector is unique.
We denote by the distribution of the sum of mutually independent indicators with expectations , for all . Given the above discussion is unique up to permutation of the ’s. We also sometimes write to denote the distribution of Note the difference between , which refers to the distribution of , and , which refers to the underlying collection of mutually independent Bernoulli random variables.
Translated Poisson Distribution.
We will make use of the translated Poisson distribution for approximating the Poisson Binomial distribution. We define the translated Poisson distribution, and state a known result on how well it approximates the Poisson Binomial distribution.
Definition 1 ([R0̈7]).
We say that an integer random variable is distributed according to the translated Poisson distribution with parameters and , denoted , iff can be written as
where is a random variable distributed according to , where represents the fractional part of .
The following lemma gives a useful bound on the variation distance between a Poisson Binomial Distribution and a suitable translated Poisson distribution. Note that if the variance of the Poisson Binomial Distribution is large, then the lemma gives a strong bound.
Lemma 1 (see (3.4) of [R0̈7]).
Let be independent random indicators with . Then
where and .
The following bound on the total variation distance between translated Poisson distributions will be useful.
Lemma 2 (Lemma 2.1 of [Bl06]).
For and with , we have
Running Times, and Bit Complexity.
Throughout this paper, we measure the running times of our algorithms in numbers of bit operations. For a positive integer , we denote by its description complexity in binary, namely . Moreover, we represent a positive rational number as , where and are relatively prime positive integers. The description complexity of is defined to be . We will assume that all ’s and ’s input to our algorithms are rational numbers.
2 Learning a sum of Bernoulli random variables from samples
In this section, we prove Theorem 1 by providing a sample- and time-efficient algorithm for learning an unknown PBD . We start with an important ingredient in our analysis.
Theorem 4 (Cover for PBDs).
For all , there exists an -cover of such that
can be constructed in time linear in its representation size, i.e., .
Moreover, if , then the collection of Bernoulli random variables has one of the following forms, where is a positive integer, for some absolute constant :
(-Sparse Form) There is some such that, for all , and, for all , .
(-heavy Binomial Form) There is some and such that, for all , and, for all , ; moreover, satisfy and
Finally, for every for which there is no -neighbor in that is in sparse form, there exists some in -heavy Binomial form such that
if , , and , then and
The Basic Learning Algorithm. The high-level structure of our learning algorithms which give Theorem 1 is provided in Algorithm Learn-PBD of Figure 1. We instantiate this high-level structure, with appropriate technical modifications, in Section 2.4, where we give more detailed descriptions of the non-proper and proper algorithms that give parts (1) and (2) of Theorem 1.
At a high level, the subroutine Learn-Sparse is given sample access to and is designed to find an -accurate hypothesis with probability at least , if the unknown PBD is -close to some sparse form PBD inside the cover . Similarly, Learn-Poisson is designed to find an -accurate hypothesis , if is not -close to a sparse form PBD (in this case, Theorem 4 implies that must be -close to some -heavy Binomial form PBD). Finally, Choose-Hypothesis is designed to choose one of the two hypotheses as being -close to The following subsections specify these subroutines, as well as how the algorithm can be used to establish Theorem 1. We note that Learn-Sparse and Learn-Poisson do not return the distributions and as a list of probabilities for every point in . They return instead a succinct description of these distributions in order to keep the running time of the algorithm logarithmic in . Similarly, Choose-Hypothesis operates with succinct descriptions of these distributions.
2.1 Learning when is close to a sparse form PBD.
Our starting point here is the simple observation that any PBD is a unimodal distribution over the domain . (There is a simple inductive proof of this, or see Section 2 of [KG71].) This enables us to use the algorithm of Birgé [Bir97] for learning unimodal distributions. We recall Birgé’s result, and refer the reader to Appendix B for an explanation of how Theorem 5 as stated below follows from [Bir97].
Theorem 5 ([Bir97]).
For all , there is an algorithm that draws
samples from an unknown unimodal distribution over , does
bit-operations, and outputs a (succinct description of a) hypothesis distribution over that has the following form: is uniform over subintervals , whose union , where In particular, the algorithm outputs the lists through and through , as well as the total probability mass that assigns to each subinterval , . Finally, with probability at least , .
The main result of this subsection is the following:
For all , there is an algorithm Learn-Sparse that draws
samples from a target PBD over , does
bit operations, and outputs a (succinct description of a) hypothesis distribution over that has the following form: its support is contained in an explicitly specified interval , where , and for every point in the algorithm explicitly specifies the probability assigned to that point by . 444In particular, our algorithm will output a list of pointers, mapping every point in to some memory location where the probability assigned to that point by is written. The algorithm has the following guarantee: if is -close to some sparse form PBD in the cover of Theorem 4, then with probability at least , , for some absolute constant , and the support of lies in the support of .
The high-level idea of Lemma 3 is quite simple. We truncate of the probability mass from each end of to obtain a conditional distribution ; since is unimodal so is . If is larger than then the algorithm outputs “fail” (and could not have been close to a sparse-form distribution in the cover). Otherwise, we use Birgé’s algorithm to learn the unimodal distribution . A detailed description of the algorithm is given in Figure 2 below.
Proof of Lemma 3: As described in Figure 2, algorithm Learn-Sparse first draws samples from and sorts them to obtain a list of values We claim the following about the values and defined in Step 2 of the algorithm:
With probability at least , we have and .
We only show that with probability at least , since the arguments for , and are identical. Given that each of these conditions is met with probability at least , the union bound establishes our claim.
To show that is satisfied with probability at least we argue as follows: Let . Clearly, while . Given this, if samples are drawn from then the expected number of them that are is at most . It follows then from the Chernoff bound that the probability that more than samples are is at most . Hence except with this failure probability, we have , which implies that . ∎
As specified in Steps 3 and 4, if , where is the constant in the statement of Theorem 4, the algorithm outputs “fail”, returning the trivial hypothesis which puts probability mass on the point . Otherwise, the algorithm runs Birgé’s unimodal distribution learner (Theorem 5) on the conditional distribution , and outputs the result of Birgé’s algorithm. Since is unimodal, it follows that is also unimodal, hence Birgé’s algorithm is appropriate for learning it. The way we apply Birgé’s algorithm to learn given samples from the original distribution is the obvious one: we draw samples from , ignoring all samples that fall outside of , until the right number of samples fall inside , as required by Birgé’s algorithm for learning a distribution of support of size with probability at least . Once we have the right number of samples in , we run Birgé’s algorithm to learn the conditional distribution . Note that the number of samples we need to draw from until the right number of samples fall inside is still , with probability at least . Indeed, since , it follows from the Chernoff bound that with probability at least , if samples are drawn from , at least fall inside .
Analysis: It is easy to see that the sample complexity of our algorithm is as promised. For the running time, notice that, if Birgé’s algorithm is invoked, it will return two lists of numbers through and through , as well as a list of probability masses assigned to each subinterval , , by the hypothesis distribution , where . In linear time, we can compute a list of probabilities , representing the probability assigned by to every point of subinterval , for . So we can represent our output hypothesis via a data structure that maintains pointers, having one pointer per point inside . The pointers map points to probabilities assigned by to these points. Thus turning the output of Birgé’s algorithm into an explicit distribution over incurs linear overhead in our running time, and hence the running time of our algorithm is also as promised. (See Appendix B for an explanation of the running time of Birgé’s algorithm.) Moreover, we also note that the output distribution has the promised structure, since in one case it has a single atom at and in the other case it is the output of Birgé’s algorithm on a distribution of support of size .
It only remains to justify the last part of the lemma. Let be the sparse-form PBD that is close to; say that is supported on where Since is -close to in total variation distance it must be the case that . Since by Claim 4, it must be the case that . Similar arguments give that . So the interval is contained in and has length at most . This means that Birgé’s algorithm is indeed used correctly by our algorithm to learn , with probability at least (that is, unless Claim 4 fails). Now it follows from the correctness of Birgé’s algorithm (Theorem 5) and the discussion above, that the hypothesis output when Birgé’s algorithm is invoked satisfies with probability at least , i.e., unless either Birgé’s algorithm fails, or we fail to get the right number of samples landing inside . To conclude the proof of the lemma we note that:
So the triangle inequality gives: , and Lemma 3 is proved.
2.2 Learning when is close to a -heavy Binomial Form PBD.
For all , there is an algorithm Learn-Poisson that draws
samples from a target PBD over , does
bit operations, and returns two parameters and . The algorithm has the following guarantee: Suppose is not -close to any sparse form PBD in the cover of Theorem 4. Let be the translated Poisson distribution with parameters and . Then with probability at least we have for some absolute constant .
Our proof plan is to exploit the structure of the cover of Theorem 4. In particular, if is not -close to any sparse form PBD in the cover, it must be -close to a PBD in heavy Binomial form with approximately the same mean and variance as , as specified by the final part of the cover theorem. Hence, a natural strategy is to obtain estimates and of the mean and variance of the unknown PBD , and output as a hypothesis a translated Poisson distribution with parameters and . We show that this strategy is a successful one. Before providing the details, we highlight two facts that we will establish in the subsequent analysis and that will be used later. The first is that, assuming is not -close to any sparse form PBD in the cover , its variance satisfies
The second is that under the same assumption, the estimates and of the mean and variance of that we obtain satisfy the following bounds with probability at least :
Proof of Lemma 5: We start by showing that we can estimate the mean and variance of the target PBD .
For all , there exists an algorithm with the following properties: given access to a PBD of order , it produces estimates and for and respectively such that with probability at least :
The algorithm uses
samples and runs in time
We treat the estimation of and separately. For both estimation problems we show how to use samples to obtain estimates and achieving the required guarantees with probability at least (we refer to these as “weak estimators”). Then a routine procedure allows us to boost the success probability to at the expense of a multiplicative factor on the number of samples. While we omit the details of the routine boosting argument, we remind the reader that it involves running the weak estimator times to obtain estimates and outputting the median of these estimates, and similarly for estimating .
We proceed to specify and analyze the weak estimators for and separately:
Weak estimator for : Let be independent samples from , and let . Then
So Chebyshev’s inequality implies that