Learning mixtures of structured distributions over discrete domains
Abstract
Let be a class of probability distributions over the discrete domain . We show that if satisfies a rather general condition – essentially, that each distribution in can be well-approximated by a variable-width histogram with few bins – then there is a highly efficient (both in terms of running time and sample complexity) algorithm that can learn any mixture of unknown distributions from .
We analyze several natural types of distributions over , including logconcave, monotone hazard rate and unimodal distributions, and show that they have the required structural property of being well-approximated by a histogram with few bins. Applying our general algorithm, we obtain near-optimally efficient algorithms for all these mixture learning problems, as described below. More precisely:

Logconcave distributions: We learn any mixture of logconcave distributions over using samples (independent of ) and running in time bit operations (note that reading a single sample from takes bit operations). For the special case we give an efficient algorithm using samples; this generalizes the main result of [DDS12b] from the class of Poisson Binomial distributions to the much broader class of all logconcave distributions. Our upper bounds are not far from optimal, since any algorithm for this learning problem requires samples.

Monotone hazard rate (MHR) distributions: We learn any mixture of MHR distributions over using samples and running in time bit operations. Any algorithm for this learning problem must use samples.

Unimodal distributions: We give an algorithm that learns any mixture of unimodal distributions over using samples and running in time bit operations. Any algorithm for this problem must use samples.
1 Introduction
1.1 Background and motivation.
Learning an unknown probability distribution given access to independent samples is a classical topic with a long history in statistics and probability theory. Theoretical computer science researchers have also been interested in these problems at least since the 1990s [KMR94, Das99], with an explicit focus on the computational efficiency of algorithms for learning distributions. Many works in theoretical computer science over the past decade have focused on learning and testing various kinds of probability distributions over highdimensional spaces, see e.g. [Das99, FM99, DS00, AK01, VW02, FOS05, RS05, BS10, KMV10, MV10, ACS10] and references therein. There has also been significant recent interest in learning and testing various types of probability distributions over the discrete domain , see e.g. [BKR04, VV11b, VV11a, DDS12a, DDS12b].
A natural type of distribution learning problem, which is the focus of this work, is that of learning an unknown mixture of “simple” distributions. Mixtures of distributions have received much attention in statistics [Lin95, RW84, TSM85] and in recent years have been intensively studied in computer science as well (see many of the papers referenced above). Given distributions and nonnegative values that sum to 1, we say that is a mixture of components with mixing weights . A draw from is obtained by choosing with probability and then making a draw from .
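For concreteness, the sampling process just described can be sketched in a few lines of Python; the component samplers and mixing weights below are hypothetical stand-ins for the "simple" distributions discussed above.

```python
import random

def sample_mixture(weights, components, rng=random):
    """Draw from a mixture: choose component j with probability
    weights[j], then draw a sample from that component."""
    j = rng.choices(range(len(components)), weights=weights)[0]
    return components[j](rng)

# Hypothetical components: two uniform distributions over disjoint ranges.
comp_a = lambda rng: rng.randint(1, 10)    # uniform over {1,...,10}
comp_b = lambda rng: rng.randint(50, 60)   # uniform over {50,...,60}

rng = random.Random(0)
draws = [sample_mixture([0.3, 0.7], [comp_a, comp_b], rng)
         for _ in range(1000)]
```

Every draw lands in one of the two component supports; with mixing weights (0.3, 0.7), roughly 70% of the draws should come from the second component.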
In this paper we work in essentially the classical “density estimation” framework [Sil86, Sco92, DL01], which is very similar to the model considered in [KMR94] in a theoretical computer science context. In this framework the learning algorithm is given access to independent samples drawn from an unknown target distribution over , and it must output a hypothesis distribution over such that with high probability the total variation distance between and is at most . Thus, for learning mixture distributions, our goal is simply to construct a high-accuracy hypothesis distribution which is very close to the mixture distribution that generated the data. In keeping with the spirit of [KMR94], we shall be centrally concerned with the running time as well as the number of samples required by our algorithms that learn mixtures of various types of discrete distributions over .
We focus on density estimation rather than, say, clustering or parameter estimation, for several reasons. First, clustering samples according to which component in the mixture each sample came from is often an impossible task unless restrictive separation assumptions are made on the components; we prefer not to make such assumptions. Second, the classes of distributions that we are chiefly interested in (such as logconcave, MHR and unimodal distributions) are all nonparametric classes, so it is unclear what “parameter estimation” would even mean for these classes. Finally, even in highly restricted special cases, parameter estimation provably requires sample complexity exponential in , the number of components in the mixture. Moitra and Valiant [MV10] have shown that parameter estimation for a mixture of Gaussians inherently requires samples. Their argument can be translated to the discrete setting, with translated Binomial distributions in place of Gaussians, to provide a similar lower bound for parameter estimation of translated Binomial mixtures. Thus, parameter estimation even for a mixture of translated Binomial distributions over (a highly restricted special case of all the mixture classes we consider, since translated Binomial distributions are logconcave, MHR and unimodal) requires samples. This rather discouraging lower bound motivates the study of other variants of the problem of learning mixture distributions.
Returning to our density estimation framework, it is not hard to show that from an informationtheoretic perspective, learning a mixture of distributions from a class of distributions is never much harder than learning a single distribution from . In Appendix A we give a simple argument which establishes the following:
Proposition 1.1.
[Sample Complexity of Learning Mixtures] Let be a class of distributions over . Let be an algorithm which learns any unknown distribution in using samples, i.e., with probability outputs a hypothesis distribution such that , where is the unknown target distribution. Then there is an algorithm which uses samples and learns any unknown mixture of distributions in to variation distance with confidence probability .
While the generic algorithm uses relatively few samples, it is computationally highly inefficient, with running time exponentially higher than the runtime of algorithm (since tries all possible partitions of its input sample into separate subsamples). Indeed, naive approaches to learning mixture distributions run into a “credit assignment” problem of determining which component distribution each sample point belongs to.
As the main contributions of this paper, we (i) give a general algorithm which efficiently learns mixture distributions over provided that the component distributions satisfy a mild condition; and (ii) show that this algorithm can be used to obtain highly efficient algorithms for natural mixture learning problems.
1.2 A general algorithm.
The mild condition which we require of the component distributions in our mixtures is essentially that each component distribution must be close to a (variable-width) histogram with few bins. More precisely, let us say that a distribution over is flat (see Section 2) if there is a partition of into disjoint intervals such that is close (in total variation distance) to the distribution obtained by “flattening” within each interval (i.e., by replacing , for , with ). Our general result for learning mixture distributions is a highly efficient algorithm that learns any mixture of flat distributions:
Theorem 1.1 (informal statement).
There is an algorithm that learns any mixture of flat distributions over to accuracy , using samples and running in bit operations.
As we show in Section 1.3 below, Theorem 1.1 yields near-optimal sample complexity for a range of interesting mixture learning problems, with a running time that is nearly linear in the sample size. Another attractive feature of Theorem 1.1 is that it always outputs hypothesis distributions with a very simple structure (enabling a succinct representation), namely histograms with at most bins.
1.3 Applications of the general approach.
We apply our general approach to obtain a wide range of learning results for mixtures of various natural and well-studied types of discrete distributions. These include mixtures of logconcave distributions, mixtures of monotone hazard rate (MHR) distributions, and mixtures of unimodal distributions. To do this, in each case we need a structural result stating that any distribution of the relevant type can be well-approximated by a histogram with few bins. In some cases (unimodal distributions) the necessary structural results were previously known, but in others (logconcave and MHR distributions) we establish novel structural results that, combined with our general approach, yield nearly optimal algorithms.
Logconcave distributions. Discrete logconcave distributions are essentially those distributions that satisfy (see Section 4 for a precise definition). They are closely analogous to logconcave distributions over continuous domains, and encompass a range of interesting and well-studied types of discrete distributions, including binomial, negative binomial, geometric, hypergeometric, Poisson, Poisson Binomial, hyper-Poisson, Pólya-Eggenberger, and Skellam distributions (see Section 1 of [FBR11]). In the continuous setting, logconcave distributions include uniform, normal, exponential, logistic, extreme value, Laplace, Weibull, Gamma, Chi, Chi-squared, and Beta distributions, see [BB05]. Logconcave distributions over have been studied in a range of different contexts including economics, statistics and probability theory, and algebra, combinatorics and geometry, see [An95, FBR11, Sta89] and references therein.
Our main learning result for mixtures of discrete logconcave distributions is:
Theorem 1.2.
There is an algorithm that learns any mixture of logconcave distributions over to variation distance using samples and running in bit operations.
We stress that the sample complexity above is completely independent of the domain size. In the special case of learning a single discrete logconcave distribution we achieve an improved sample complexity of samples, with running time . This matches the sample complexity and running time of the main result of [DDS12b], which was a specialized algorithm for learning Poisson Binomial distributions over . Our new algorithm is simpler, applies to the broader class of all logconcave distributions, has a much simpler and more self-contained analysis, and generalizes to mixtures of distributions (at the cost of an additional factor in runtime and sample complexity). We note that these algorithmic results are not far from the best possible for mixtures of logconcave distributions. We show in Section 4 that for and , any algorithm for learning a mixture of logconcave distributions to accuracy must use samples.
Monotone Hazard Rate (MHR) distributions. A discrete distribution over is said to have a monotone (increasing) hazard rate if the hazard rate is a nondecreasing function. It is well known that every discrete logconcave distribution is MHR (see e.g. part (ii) of Proposition 10 of [An95]), but MHR is a more general condition than logconcavity (for example, it is easy to check that every nondecreasing distribution over is MHR, but such distributions need not be logconcave). The MHR property is a standard assumption in economics, in particular in auction theory and mechanism design [Mye81, FT91, MCWG95]. Such distributions also arise frequently in reliability theory; [BMP63] is a good reference for basic properties of these distributions.
Our main learning result for mixtures of MHR distributions is:
Theorem 1.3.
There is an algorithm that learns any mixture of MHR distributions over to variation distance using samples and running in bit operations.
This theorem is also nearly optimal. We show that for and , any algorithm for learning a mixture of MHR distributions to accuracy must use samples.
Unimodal distributions. A distribution over is said to be unimodal if its probability mass function is monotone nondecreasing over for some and then monotone nonincreasing on . Every logconcave distribution is unimodal, but the MHR and unimodal conditions are easily seen to be incomparable. Many natural types of distributions are unimodal and there has been extensive work on density estimation for unimodal distributions and related questions [Rao69, Weg70, BKR04, Bir97, Fou97].
Our main learning result for mixtures of unimodal distributions is:
Theorem 1.4.
There is an algorithm that learns any mixture of unimodal distributions over to variation distance using samples and running in bit operations.
Our approach in fact extends to learning a mixture of modal distributions (see Section 6). The same lower bound argument that we use for mixtures of MHR distributions also gives us that for and , any algorithm for learning a mixture of unimodal distributions to accuracy must use samples.
1.4 Related work.
Logconcave distributions: Maximum likelihood estimators for both continuous [DR09, Wal09] and discrete [FBR11] logconcave distributions have been recently studied by various authors. For special cases of logconcave densities over (that satisfy various restrictions on the shape of the pdf) upper bounds on the minimax risk of estimators are known, see e.g. Exercise 15.21 of [DL01]. (We remark that these results do not imply the case of our logconcave mixture learning result.) Perhaps the most relevant prior work is the recent algorithm of [DDS12b] which gives a sample, time algorithm for learning any Poisson Binomial Distribution over . (As noted above, we match the performance of the [DDS12b] algorithm for the broader class of all logconcave distributions, as the case of our logconcave mixture learning result.)
Achlioptas and McSherry [AM05] and Kannan et al. [KSV08] gave algorithms for clustering points drawn from a mixture of high-dimensional logconcave distributions, under various separation assumptions on the distance between the means of the components. We are not aware of prior work on density estimation of mixtures of arbitrary logconcave distributions in either the continuous or the discrete setting.
MHR distributions: As noted above, MHR distributions appear frequently and play an important role in reliability theory and in economics (to the extent that the MHR condition is considered a standard assumption in these settings). Surprisingly, the problem of learning an unknown MHR distribution or mixture of such distributions has not been explicitly considered in the statistics literature. We note that several authors have considered the problem of estimating the hazard rate of an MHR distribution in different contexts, see e.g. [Wan86, HW93, GJ11, Ban08].
Unimodal distributions: The problem of learning a single unimodal distribution is well-understood: Birgé [Bir97] gave an efficient algorithm for learning continuous unimodal distributions (whose density is absolutely bounded); his algorithm, when translated to the discrete domain , requires samples. This sample size is also known to be optimal (up to constant factors) [Bir87a]. In recent work, Daskalakis et al. [DDS12a] gave an efficient algorithm to learn modal distributions over . We remark that their result does not imply ours, as even a mixture of two unimodal distributions over may have modes. We are not aware of prior work on efficiently learning mixtures of unimodal distributions.
2 Preliminaries and notation
We write to denote the discrete domain and to denote the set for . For we write to denote its norm.
For a probability distribution over we write to denote the probability of element under , so for all and For we write to denote . We write to denote the subdistribution over induced by , i.e., if and otherwise.
A distribution over is nonincreasing (resp. nondecreasing) if (resp. ), for all ; is monotone if it is either nonincreasing or nondecreasing.
Let and be distributions over . The total variation distance between and is . The Kolmogorov distance between and is defined as . Note that the Kolmogorov distance is always at most the total variation distance.
Finally, the following notation and terminology will be useful: given independent samples , drawn from distribution the empirical distribution is defined as follows: for all , .
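As a concrete reference for these definitions, the following Python sketch computes the empirical distribution and the two distances, with a distribution over {1, …, n} represented as a list satisfying p[i-1] = p(i); the numeric example in the test is purely illustrative.

```python
from collections import Counter

def empirical(samples, n):
    """Empirical distribution over {1,...,n}: the fraction of the
    samples equal to each point i."""
    c = Counter(samples)
    m = len(samples)
    return [c[i] / m for i in range(1, n + 1)]

def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kolmogorov(p, q):
    """Kolmogorov distance: max over prefixes [1, j] of the difference
    in cumulative mass."""
    cp = cq = best = 0.0
    for a, b in zip(p, q):
        cp += a
        cq += b
        best = max(best, abs(cp - cq))
    return best
```

Since every prefix [1, j] is in particular a subset of the domain, the Kolmogorov distance computed this way never exceeds the total variation distance.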
Partitions, flat decompositions and refinements. Given a partition of into disjoint intervals and a distribution over , we write to denote the flattened distribution. This is the distribution over defined as follows: for and , . That is, is obtained from by averaging the weight that assigns to each interval in over the entire interval.
Definition 2.1 (Flat decomposition).
Let be a distribution over and be a partition of into disjoint intervals. We say that is a flat decomposition of if . If there exists a flat decomposition of then we say that is flat.
Let be a partition of into disjoint intervals, and be a partition of into disjoint intervals. We say that is a refinement of if each interval in is a union of intervals in , i.e., for every there is a subset such that .
For and two partitions of into and intervals respectively, we say that the common refinement of and is the partition of into intervals obtained from and in the obvious way, by taking all possible nonempty intervals of the form . It is clear that is both a refinement of and of , and that contains at most intervals.
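The common refinement has a simple description in code: its cut points are exactly the right endpoints of the intervals in either partition. A minimal sketch, assuming the two partitions are sorted lists of inclusive (a, b) pairs covering the same domain:

```python
def common_refinement(P, Q):
    """Common refinement of two interval partitions of {1,...,n}.
    The refinement's intervals are delimited by every right endpoint
    appearing in either partition."""
    cuts = sorted({b for _, b in P} | {b for _, b in Q})
    result, a = [], P[0][0]
    for b in cuts:
        result.append((a, b))
        a = b + 1
    return result
```

For example, refining [(1, 4), (5, 10)] with [(1, 6), (7, 10)] yields [(1, 4), (5, 6), (7, 10)], and in general the refinement has at most one fewer interval than the two partition sizes combined.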
2.1 Basic Tools.
We recall some basic tools from probability.
The VC inequality. Given a family of subsets over , define . The VC dimension of is the maximum size of a subset that is shattered by (a set is shattered by if for every some satisfies ).
Theorem 2.1 (VC inequality, [DL01, p. 31]).
Let be an empirical distribution of samples from . Let be a family of subsets of VC dimension . Then
Uniform convergence. We will also use the following uniform convergence bound:
Theorem 2.2 ([DL01, p. 17]).
Let be a family of subsets over , and be an empirical distribution of samples from . Let be the random variable . Then we have
3 Learning mixtures of flat distributions
In this section we present and analyze our general algorithm for learning mixtures of flat distributions. We proceed in stages by considering three increasingly demanding learning scenarios, each of which builds on the previous one.
3.1 First scenario: known flat decomposition.
We start with the simplest scenario, in which the learning algorithm is given a partition which is a flat decomposition of for the target distribution being learned.
Theorem 3.1.
Let be any unknown target distribution over and be any flat decomposition of Algorithm draws samples from and with probability at least , outputs such that . Its running time is bit operations.
Proof.
An application of the triangle inequality yields
The first term on the right-hand side is at most by the definition of a flat decomposition. The second term is also at most , as follows from Proposition 3.1, stated and proved below. ∎
Proposition 3.1.
Let be any distribution over and let be an empirical distribution of samples from . Let be any partition of into at most intervals. Then with probability at least ,
Proof.
By definition we have
where . Since contains at most intervals, is a union of at most intervals. Consequently the above right-hand side is at most , where is the family of all unions of at most intervals over . (Formally, define as the collection of all intervals over , including the empty interval. Then .) Since the VC dimension of is , Theorem 2.1 implies that the quantity in question has expected value at most . The claimed result now follows by applying Theorem 2.2 with . ∎
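The whole of this first scenario fits in a few lines of code: form the empirical distribution and flatten it over the known partition. The names below are illustrative, not the paper's pseudocode; intervals are inclusive (a, b) pairs over {1, …, n}.

```python
from collections import Counter

def flatten(p, partition):
    """Flattened distribution: spread the mass that p assigns to each
    interval uniformly over that interval."""
    q = [0.0] * len(p)
    for a, b in partition:
        mass = sum(p[i - 1] for i in range(a, b + 1))
        width = b - a + 1
        for i in range(a, b + 1):
            q[i - 1] = mass / width
    return q

def learn_known_decomposition(samples, n, partition):
    """Sketch of the first-scenario learner: empirical distribution,
    then flatten over the given decomposition."""
    c = Counter(samples)
    m = len(samples)
    p_hat = [c[i] / m for i in range(1, n + 1)]
    return flatten(p_hat, partition)
```

Flattening never changes the mass of any interval in the partition, so the hypothesis is itself a histogram with one bin per interval.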
3.2 Second scenario: unknown flat distribution.
The second algorithm deals with the scenario in which the target distribution is flat but no flat decomposition is provided to the learner. We show that in such a setting we can construct a flat decomposition of , and then we can simply use this to run LearnKnownDecomposition.
The basic subroutine RightInterval will be useful here (and later). It takes as input an explicit description of a distribution over , an interval , and a threshold . It returns the longest interval in that ends at and has mass at most under . If no such interval exists then must exceed , and the subroutine simply returns the singleton interval .
Subroutine RightInterval:
Input: explicit description of distribution ; interval ; threshold

If then set , otherwise set .

Return .
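In code, RightInterval just grows an interval leftward from the right endpoint while the accumulated mass stays under the threshold; a sketch, with the distribution given explicitly as a list (p[i-1] = p(i)):

```python
def right_interval(p, a, b, tau):
    """Longest interval [a', b] inside [a, b] with mass at most tau
    under p.  If even the singleton {b} has mass above tau, the
    singleton (b, b) is returned as a fallback."""
    mass = 0.0
    best = (b, b)
    for i in range(b, a - 1, -1):   # grow the interval leftward from b
        mass += p[i - 1]
        if mass <= tau:
            best = (i, b)
        else:
            break
    return best
```

For a uniform distribution over ten points with threshold 0.35, for instance, the subroutine returns the three rightmost points; with a threshold below a single point's mass, it falls back to the singleton.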
The algorithm to construct a decomposition is given below:
Algorithm ConstructDecomposition:
Input: sample access to unknown distribution over ; parameter ;
accuracy parameter ; confidence parameter

Draw samples to obtain an empirical distribution .

Set .

While :

Let be the interval returned by RightInterval.

Add to and set .


Return
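Putting the pieces together, the greedy construction sweeps from right to left, carving off the longest interval of empirical mass at most the threshold each time (or a singleton when even one point exceeds it). A self-contained sketch with illustrative parameter names, inlining the RightInterval step:

```python
from collections import Counter

def construct_decomposition(samples, n, tau):
    """Greedy right-to-left decomposition of {1,...,n} into intervals,
    each the longest suffix of the remaining domain whose empirical
    mass is at most tau (a singleton if no such suffix exists)."""
    c = Counter(samples)
    m = len(samples)
    p_hat = [c[i] / m for i in range(1, n + 1)]

    partition = []
    j = n
    while j >= 1:
        mass, a = 0.0, j
        for i in range(j, 0, -1):        # inline RightInterval on [1, j]
            mass += p_hat[i - 1]
            if mass <= tau:
                a = i
            else:
                break
        partition.append((a, j))
        j = a - 1
    partition.reverse()
    return partition
```

On a uniform empirical distribution over ten points with threshold 0.35, the sweep produces intervals of mass 0.3 plus a leftover singleton, and the intervals always tile the domain exactly.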
Theorem 3.2.
Let be a class of flat distributions over . Then for any , Algorithm ConstructDecomposition draws samples from , and with probability at least outputs a flat decomposition of . Its running time is bit operations.
To prove the above theorem we will need the following elementary fact about refinements:
Lemma 3.1 ([DDS13, Lemma 4]).
Let be any distribution over and let be a flat decomposition of . If is a refinement of , then is a flat decomposition of .
We will also use the following simple observation about the RightInterval subroutine:
Observation 3.1.
Suppose RightInterval returns an interval and RightInterval returns . Then .
Proof of Theorem 3.2.
Let . By Observation 3.1, the partition that the algorithm constructs must contain at most intervals. Let be the common refinement of and a flat decomposition of (the existence of such a decomposition is guaranteed because every distribution in is flat). Now note that
Since is a refinement of the flat decomposition of , Lemma 3.1 implies that the first term on the RHS is at most . It remains to bound . Fix any interval and let us consider the contribution
of to . If then the contribution to is zero; on the other hand, if then the contribution to is at most . Thus the total contribution summed across all is at most . Now we observe that with probability at least we have
(1) 
where the inequality follows from the fact that by Proposition 3.1. If then cannot be a singleton, and hence by definition of RightInterval. Finally, it is easy to see that at most intervals in do not belong to (because is the common refinement of and a partition of into at most intervals). Thus the second term on the RHS of Eq. (1) is at most . Hence , and the theorem is proved. ∎
Our algorithm to learn an unknown flat distribution is now very simple:
Algorithm LearnUnknownDecomposition:
Input: sample access to unknown distribution over ; parameter ;
accuracy parameter ; confidence parameter

Run ConstructDecomposition to obtain a flat decomposition of .

Run LearnKnownDecomposition and return the hypothesis that it outputs.
The following is now immediate:
Theorem 3.3.
Let be a class of flat distributions over . Then for any , Algorithm LearnUnknownDecomposition draws samples from , and with probability at least outputs a hypothesis distribution satisfying Its running time is bit operations.
3.3 Main result (third scenario): learning a mixture of flat distributions.
We have arrived at the scenario of real interest to us, namely learning an unknown mixture of distributions each of which has an (unknown) flat decomposition. The key to learning such distributions is the following structural result, which says that any such mixture must itself have a flat decomposition:
Lemma 3.2.
Let be a class of flat distributions over , and let be any mixture of distributions in Then is a flat distribution.
Proof.
Let be a mixture of components Let denote the flat decomposition of corresponding to , and let be the common refinement of . It is clear that contains at most intervals. By Lemma 3.1, is a flat decomposition for every . Hence we have
(2)  
(3) 
where (2) is the triangle inequality and (3) follows from the fact that the expression in (2) is a nonnegative convex combination of terms bounded from above by . ∎
Given Lemma 3.2, the desired mixture learning algorithm follows immediately from the results of the previous subsection:
Corollary 3.1 (see Theorem 1.1).
Let be a class of flat distributions over , and let be any mixture of distributions in . Then Algorithm LearnUnknownDecomposition draws samples from , and with probability at least outputs a hypothesis distribution satisfying Its running time is bit operations.
4 Learning mixtures of logconcave distributions
In this section we apply our general method from Section 3 to learn logconcave distributions over and mixtures of such distributions. We start with a formal definition:
Definition 4.1.
A probability distribution over is said to be logconcave if it satisfies the following conditions: (i) if are such that then ; and (ii) for all
We note that while some of the literature on discrete logconcave distributions states that the definition consists solely of item (ii) above, item (i) is in fact necessary as well, since without it logconcave distributions need not even be unimodal (see the discussion following Definition 2.3 of [FBR11]).
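Both conditions of Definition 4.1 are easy to state in code. A hedged sketch of a membership test, with the pmf given as a list (p[i-1] = p(i)): condition (i) is checked as contiguity of the support, and condition (ii) as the inequality p(i)² ≥ p(i−1)·p(i+1) at every interior point.

```python
def is_log_concave(p):
    """Check discrete log-concavity of a pmf over {1,...,n}:
    (i) the support is a contiguous interval, and
    (ii) p(i)^2 >= p(i-1) * p(i+1) at every interior point."""
    support = [i for i, v in enumerate(p) if v > 0]
    if not support:
        return False
    # (i) no zero strictly between two support points
    if any(p[i] == 0 for i in range(support[0], support[-1] + 1)):
        return False
    # (ii) the log-concavity inequality
    return all(p[i] * p[i] >= p[i - 1] * p[i + 1]
               for i in range(1, len(p) - 1))
```

The binomial(3, 1/2) pmf passes both tests; a bimodal pmf fails condition (ii), and a pmf with a gap in its support fails condition (i) even though it may satisfy condition (ii) vacuously.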
In Section 4.1 we give an efficient algorithm which constructs an flat decomposition of any target logconcave distribution. Combining this with Algorithm LearnKnownDecomposition we obtain an sample algorithm for learning a single discrete logconcave distribution, and combining it with Corollary 3.1 we obtain a sample algorithm for learning a mixture of logconcave distributions.
4.1 Constructing a flat decomposition given samples from a logconcave distribution.
We recall the well-known fact that logconcavity implies unimodality (see e.g. [KG71]). Thus, it is useful to analyze logconcave distributions which additionally are monotone (since a general logconcave distribution can be viewed as consisting of two such pieces). With this motivation we give the following lemma:
Lemma 4.1.
Let be a distribution over that is nondecreasing and logconcave on . Let be an interval of mass , and suppose that the interval has mass . Then
Proof.
Let be the length of . We decompose into intervals of length , starting from the right. More precisely,
for The leftmost interval may contain nonpositive integers; for this reason define for nonpositive (note that the new distribution is still logconcave). Also define . Let . We claim that
(4) 
for . Eq. (4) holds for , since by the nondecreasing property. The general case follows by induction and using the fact that the ratio is nondecreasing in for any logconcave distribution (an immediate consequence of the definition of logconcavity).
It is easy to see that Eq. (4) implies
for . Since the intervals have geometrically decreasing mass, this implies that
Rearranging yields the desired inequality. ∎
We will also use the following elementary fact:
Fact 4.1.
Let be a distribution over and be an interval such that (i.e., is multiplicatively close to uniform over the interval ). Then the flattened subdistribution satisfies
We are now ready to present and analyze our algorithm DecomposeLogConcave that draws samples from an unknown logconcave distribution and outputs a flat decomposition. The algorithm simply runs the general algorithm ConstructDecomposition with an appropriate choice of parameters. However, the analysis will not go via the “generic” Theorem 3.2 (which would yield a weaker bound) but instead uses Lemma 4.1, which is specific to logconcave distributions.
Algorithm DecomposeLogConcave:
Input: sample access to unknown logconcave distribution over ;
accuracy parameter ; confidence parameter

Set .

Run ConstructDecomposition and return the decomposition that it yields.
Our main theorem in this section is the following:
Theorem 4.2.
For any logconcave distribution over , Algorithm DecomposeLogConcave draws samples from and with probability at least constructs a decomposition that is flat.
Proof.
We first note that the number of intervals in is at most by Observation 3.1; this will be useful below. We may also assume that , where is the empirical distribution obtained in Step 1 of ConstructDecomposition; this inequality holds with probability at least , as follows by a combined application of Theorems 2.1 and 2.2. Since is logconcave, it is unimodal. Let be a mode of .
Let be the collection of intervals to the left of . We now bound the contribution of intervals in to . Let be the intervals in listed from left to right. Let be the union of intervals to the left of . If is a singleton, its contribution to is zero. Otherwise,
by the closeness of and in Kolmogorov distance and the definition of RightInterval. Also, by Observation 3.1, , and hence
again by closeness in Kolmogorov distance.
Since is nondecreasing on , we have
for , by Lemma 4.1 and Fact 4.1, using the upper and lower bounds on and respectively. Consequently, for all . Summing this inequality, we get
The right-hand side above is at most by our choice of (with an appropriate constant in the big-oh).
Similarly, let be the collection of intervals to the right of . An identical analysis (using the obvious analogue of Lemma 4.1 for nonincreasing logconcave distributions on ) shows that the contribution of intervals in to is at most .
Finally, let be the interval containing . If is a singleton, it does not contribute to . Otherwise, and , hence the contribution of to is at most .
Combining all three cases,
Hence as was to be shown. ∎
Our claimed upper bounds follow from the above theorem by using our framework of Section 3. Indeed, it is clear that we can learn any unknown logconcave distribution by running Algorithm DecomposeLogConcave to obtain a decomposition and then Algorithm LearnKnownDecomposition to obtain a hypothesis distribution :
Corollary 4.1.
Given sample access to a logconcave distribution over , there is an algorithm LearnLogConcave that uses samples from and with probability at least outputs a distribution such that Its running time is bit operations.
Theorem 4.2 of course implies that every logconcave distribution is flat. We may thus apply Corollary 3.1 and obtain our main learning result for mixtures of logconcave distributions:
Corollary 4.2 (see Theorem 1.2).
Let be any mixture of logconcave distributions over . There is an algorithm LearnLogConcaveMixture that draws samples from and with probability at least outputs a distribution such that Its running time is bit operations.
Lower bounds. It is shown in [DL01, Lemma 15.1] that learning a continuous distribution whose density is bounded and convex over to accuracy requires samples. An easy adaptation of this argument implies the same result for a bounded concave density over . By an appropriate discretization procedure, one can show that learning a discrete concave density over requires samples for all . Since a discrete concave distribution is also logconcave, the same lower bound holds for this case too. For the case of mixtures, we may consider a uniform mixture of component distributions where the th distribution in the mixture is supported on and is logconcave on its support. It is clear that each component distribution is logconcave over , and it is not difficult to see that in order to learn such a mixture to accuracy , at least of the component distributions must be learned to total variation distance at most . We thus get that for and , any algorithm for learning a mixture of logconcave distributions to accuracy must use samples.
5 Learning mixtures of MHR distributions
In this section we apply our general method from Section 3 to learn monotone hazard rate (MHR) distributions over and mixtures of such distributions.
Definition 5.1.
Let be a distribution supported in . The hazard rate of is the function ; if then we say . We say that has monotone hazard rate (MHR) if is a nondecreasing function over .
It is known that every logconcave distribution over is MHR but the converse is not true, as can easily be seen from the fact that every monotone nondecreasing distribution over is MHR.
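The hazard rate and the MHR condition of Definition 5.1 can be sketched directly in code (pmf as a list, with the convention that the hazard rate is +∞ once the tail mass is zero):

```python
def hazard_rate(p):
    """Hazard rate h(i) = p(i) / p([i, n]), taken to be +inf when the
    tail mass p([i, n]) is zero."""
    n = len(p)
    tail = 0.0
    h = [0.0] * n
    for i in range(n - 1, -1, -1):   # accumulate tail mass right-to-left
        tail += p[i]
        h[i] = p[i] / tail if tail > 0 else float('inf')
    return h

def is_mhr(p, tol=1e-12):
    """p is MHR if its hazard rate is nondecreasing wherever the tail
    mass is positive (the +inf entries past the support are ignored)."""
    h = [x for x in hazard_rate(p) if x != float('inf')]
    return all(h[i] <= h[i + 1] + tol for i in range(len(h) - 1))
```

Consistent with the remark above, a monotone nondecreasing pmf such as (0.1, 0.2, 0.3, 0.4) passes the MHR test, while a pmf whose hazard rate dips, such as (0.5, 0.1, 0.4), fails it.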
In Section 5.1 we prove that every MHR distribution over has an flat decomposition. We combine this with our general results from Section 3 to get learning results for mixtures of MHR distributions.
5.1 Learning a single MHR distribution.
Our algorithm to construct a flat decomposition of an MHR distribution is DecomposeMHR, given below. Note that this algorithm takes an explicit description of as input and does not draw any samples from . Roughly speaking, the algorithm works by partitioning into intervals such that within each interval the value of never deviates from its value at the leftmost point of the interval by a multiplicative factor of more than
Algorithm DecomposeMHR:
Input: explicit description of MHR distribution over ; accuracy parameter

Set and initialize to be the empty set.

Let be the interval returned by RightInterval, and be the interval returned by RightInterval. Set .

Set to be the smallest integer such that . If no such exists, let and go to Step 5. Otherwise, let and .

While :

Let be the smallest integer such that either or holds. If no such exists let , otherwise, let .

Add to , and set .

Let .


Return .
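The core bucketing idea described above – intervals within which the pmf never moves by more than a (1 + ε) multiplicative factor from its value at the left endpoint – can be sketched as follows. This illustrates only the main loop, not the algorithm's separate handling of the low-mass head and tail, and the names are illustrative:

```python
def multiplicative_partition(p, eps):
    """Greedy left-to-right partition of {1,...,n}: extend the current
    interval while p stays within a (1 + eps) factor (in either
    direction) of its value at the interval's left endpoint.  A sketch
    of the bucketing step behind DecomposeMHR, not the full algorithm."""
    n = len(p)
    partition = []
    a = 1
    while a <= n:
        base = p[a - 1]
        b = a
        while (b + 1 <= n and base > 0
               and base / (1 + eps) <= p[b] <= base * (1 + eps)):
            b += 1
        partition.append((a, b))
        a = b + 1
    return partition
```

Within each resulting interval the pmf is multiplicatively close to uniform, which is exactly the situation where Fact 4.1-style flattening arguments give a small total variation error per interval.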
Our first lemma for the analysis of DecomposeMHR states that MHR distributions satisfy a condition that is similar to being monotone nondecreasing:
Lemma 5.1.
Let be an MHR distribution over . Let