The Fourier Transform of Poisson Multinomial Distributions and its Algorithmic Applications
We study Poisson Multinomial Distributions, a fundamental family of discrete distributions that generalize the binomial and multinomial distributions, and are commonly encountered in computer science. Formally, an $(n,k)$-Poisson Multinomial Distribution (PMD) is a random variable of the form $X = \sum_{i=1}^{n} X_i$, where the $X_i$'s are independent random vectors supported on the set $\{e_1, \ldots, e_k\}$ of standard basis vectors in $\mathbb{R}^k$. In this paper, we obtain a refined structural understanding of PMDs by analyzing their Fourier transform. As our core structural result, we prove that the Fourier transform of PMDs is approximately sparse, i.e., roughly speaking, its $\ell_1$-norm is small outside a small set. By building on this result, we obtain the following applications:
Learning Theory. We design the first computationally efficient learning algorithm for PMDs with respect to the total variation distance. Our algorithm learns an arbitrary -PMD within variation distance using a near-optimal sample size of and runs in time Previously, no algorithm with a runtime was known, even for
Game Theory. We give the first efficient polynomial-time approximation scheme (EPTAS) for computing Nash equilibria in anonymous games. For normalized anonymous games with players and strategies, our algorithm computes a well-supported -Nash equilibrium in time The best previous algorithm for this problem  had running time where , for any
Statistics. We prove a multivariate central limit theorem (CLT) that relates an arbitrary PMD to a discretized multivariate Gaussian with the same mean and covariance, in total variation distance. Our new CLT strengthens the CLT of Valiant and Valiant by completely removing the dependence on $n$ in the error bound.
Along the way we prove several new structural results of independent interest about PMDs. These include: (i) a robust moment-matching lemma, roughly stating that two PMDs that approximately agree on their low-degree parameter moments are close in variation distance; (ii) near-optimal size proper $\epsilon$-covers for PMDs in total variation distance (constructive upper bound and nearly-matching lower bound). In addition to Fourier analysis, we employ a number of analytic tools, including the saddlepoint method from complex analysis, that may find other applications.
The Poisson Multinomial Distribution (PMD) is the discrete probability distribution of a sum of $n$ mutually independent categorical random variables over the same sample space. A $k$-categorical random variable ($k$-CRV) describes the result of a random event that takes on one of $k \geq 2$ possible outcomes. Formally, an $(n,k)$-PMD is any random variable of the form $X = \sum_{i=1}^{n} X_i$, where the $X_i$'s are independent random vectors supported on the set $\{e_1, \ldots, e_k\}$ of standard basis vectors in $\mathbb{R}^k$.
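To make the definition concrete, the following is a minimal sketch of drawing one sample from a PMD by summing independent one-hot vectors. The function name `sample_pmd`, the parameter matrix `probs`, and the specific values of the number of summands and outcomes are illustrative assumptions, not notation from this paper.

```python
import numpy as np

def sample_pmd(probs, rng):
    """Draw one sample from the PMD whose i-th summand is a categorical
    random variable with probability vector probs[i] (each row sums to 1)."""
    n, k = probs.shape
    counts = np.zeros(k, dtype=int)
    for i in range(n):
        # Summand i equals the basis vector e_j with probability probs[i, j].
        j = rng.choice(k, p=probs[i])
        counts[j] += 1
    return counts

rng = np.random.default_rng(0)
# Hypothetical parameters: 5 summands, 3 outcomes per summand.
probs = rng.dirichlet(np.ones(3), size=5)
x = sample_pmd(probs, rng)
```

Every sample is a vector of nonnegative integer counts whose coordinates sum to the number of summands, which is why the distribution lives on a slice of the integer lattice.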
PMDs comprise a broad class of discrete distributions of fundamental importance in computer science, probability, and statistics. A large body of work in the probability and statistics literature has been devoted to the study of the behavior of PMDs under various structural conditions. PMDs generalize the familiar binomial and multinomial distributions, and describe many distributions commonly encountered in computer science. The case $k = 2$ corresponds to the Poisson binomial distribution (PBD), introduced by Poisson as a non-trivial generalization of the binomial distribution.
Recent years have witnessed a flurry of research activity on PMDs and related distributions, from several perspectives of theoretical computer science, including learning , property testing , computational game theory , and derandomization . More specifically, the following questions have been of interest to the TCS community:
Is there a statistically and computationally efficient algorithm for learning PMDs from independent samples in total variation distance?
How fast can we compute approximate Nash equilibria in anonymous games with many players and a small number of strategies per player?
How well can a PMD be approximated, in total variation distance, by a discretized Gaussian with the same mean and covariance matrix?
The first question is a fundamental problem in unsupervised learning that has received considerable recent attention in TCS . The aforementioned works have studied the learnability of PMDs, and related distribution families, in particular PBDs (i.e., -PMDs) and sums of independent integer random variables. Prior to this work, no computationally efficient learning algorithm for PMDs was known, even for the case of
The second question concerns an important class of succinct games previously studied in the economics literature , whose (exact) Nash equilibrium computation was recently shown to be intractable . The formal connection between computing Nash equilibria in these games and PMDs was established in a sequence of papers by Daskalakis and Papadimitriou , who leveraged it to give the first PTAS for the problem. Prior to this work, no efficient PTAS was known, even for anonymous games with strategies per player.
The third question refers to the design of Central Limit Theorems (CLTs) for PMDs with respect to the total variation distance. Despite a substantial amount of work in probability theory, the first strong CLT of this form appears to have been shown by Valiant and Valiant, motivated by applications in distribution property testing. Valiant and Valiant leveraged their CLT to obtain tight lower bounds for several fundamental problems in property testing. We remark that the error bound of their CLT has a logarithmic dependence on the size of the PMD (number of summands), and they conjectured that this dependence is unnecessary.
The main technical contribution of this work is the use of Fourier analytic techniques to obtain a refined understanding of the structure of PMDs. As our core structural result, we prove that the Fourier transform of PMDs is approximately sparse, i.e., roughly speaking, its $\ell_1$-norm is small outside a small set. By building on this property, we are able to obtain various new structural results about PMDs, and make progress on the three questions stated in the previous subsection. In this subsection, we describe our algorithmic and structural contributions in detail.
We start by stating our algorithmic results in learning and computational game theory, followed by an informal description of our structural results and the connections between them.
Distribution Learning. As our main learning result, we obtain the first statistically and computationally efficient learning algorithm for PMDs with respect to the total variation distance. In particular, we show:
We remark that our learning algorithm outputs a succinct description of its hypothesis via its Discrete Fourier Transform (DFT), which is supported on a small size set. We show that the DFT gives both an efficient -sampler and an efficient -evaluation oracle for
Our algorithm learns an unknown -PMD within variation distance with sample complexity and computational complexity The sample complexity of our algorithm is near-optimal for any fixed , as samples are necessary, even for We note that recent work by Daskalakis et al.  established a similar sample upper bound, however their algorithm is not computationally efficient. More specifically, it runs in time which is quasi-polynomial in even for For the case, in recent work  the authors of this paper gave an algorithm with sample complexity and runtime Prior to this work, no algorithm with a sample size and runtime was known, even for
Our learning algorithm and its analysis are described in Section ?.
Computational Game Theory. As our second algorithmic contribution, we give the first efficient polynomial-time approximation scheme (EPTAS) for computing Nash equilibria in anonymous games with many players and a small number of strategies. In anonymous games, all players have the same set of strategies, and the payoff of a player depends on the strategy played by the player and the number of other players who play each of the strategies. In particular, we show:
The previous PTAS for this problem  has running time where Our algorithm decouples the dependence on and and, importantly, its running time dependence on is quasi-polynomial. For an algorithm with runtime was given in , which was sharpened to in the recent work of the authors . Hence, we obtain, for any value of the same qualitative runtime dependence on as in the case
Similarly to , our algorithm proceeds by constructing a proper -cover, in total variation distance, for the space of PMDs. A proper -cover for the set of all -PMDs, is a subset of such that any distribution in is within total variation distance from some distribution in Our main technical contribution is the efficient construction of a proper -cover of near-minimum size (see Theorem ?). We note that, as follows from Theorem ?, the quasi-polynomial dependence on and the doubly exponential dependence on in the runtime are unavoidable for any cover-based algorithm. Our cover upper and lower bounds and our Nash approximation algorithm are given in Section ?.
Statistics. Using our Fourier-based machinery, we prove a strong “size-free” CLT relating the total variation distance between a PMD and an appropriately discretized Gaussian with the same mean and covariance matrix. In particular, we show:
As mentioned above, Valiant and Valiant  proved a CLT of this form and used it as their main technical tool to obtain tight information-theoretic lower bounds for fundamental statistical estimation tasks. This and related CLTs have since been used in proving lower bounds for other problems (see, e.g., ). The error bound in the CLT of  is of the form i.e., it has a dependence on the size of the underlying PMD. Our Theorem ? provides a qualitative improvement over the aforementioned bound, by establishing that no dependence on is necessary. We note that  conjectured that such a qualitative improvement may be possible.
We remark that our techniques for proving Theorem ? are orthogonal to those of . While Valiant and Valiant use Stein’s method, we prove our strengthened CLT using the Fourier techniques that underlie this paper. We view Fourier analysis as the right technical tool to analyze sums of independent random variables. An additional ingredient that we require is the saddlepoint method from complex analysis. We hope that our new CLT will be of broader use as an analytic tool to the TCS community. Our CLT is proved in Section ?.
Structure of PMDs. We now provide a brief intuitive overview of our new structural results for PMDs, the relation between them, and their connection to our algorithmic results mentioned above. The unifying theme of our work is a refined analysis of the structure of PMDs, based on their Fourier transform. The Fourier transform is one of the most natural technical tools to consider for analyzing sums of independent random variables, and indeed one of the classical proofs of the (asymptotic) central limit theorem is based on Fourier methods. The basis of our results, both algorithmic and structural, is the following statement:
(Sparsity of the Fourier Transform of PMDs.) For any $(n,k)$-PMD $X$, and any $\epsilon > 0$, there exists a “small” set $T = T(X, \epsilon)$ such that the $\ell_1$-norm of its Fourier transform, $\widehat{X}$, outside the set $T$ is at most $\epsilon$.
We will need two different versions of the above statement for our applications, and therefore we do not provide a formal statement at this stage. The precise meaning of the term “small” depends on the setting: For the continuous Fourier transform, we essentially prove that the product of the volume of the effective support of the Fourier transform times the number of points in the effective support of our distribution is small. In particular, the set is a scaled version of the dual ellipsoid to the ellipsoid defined by the covariance matrix of Hence, roughly speaking, has an effective support that is the dual of the effective support of (See Lemma ? in Section ? for the precise statement.)
In the case of the Discrete Fourier Transform (DFT), we show that there exists a discrete set with small cardinality, such that the $\ell_1$-norm of the DFT outside this set is small. At a high level, to prove this statement, we need the appropriate definition of the (multidimensional) DFT, which turns out to be non-trivial, and is crucial for the computational efficiency of our learning algorithm. More specifically, we choose the period of the DFT to reflect the shape of the effective support of our PMD. (See Proposition ? in Section ? for the statement.)
With Fourier sparsity as our starting point, we obtain new structural results of independent interest for PMDs. The first is a “robust” moment-matching lemma, which we now informally state:
(Parameter Moment Closeness Implies Closeness in Distribution.) For any pair of $(n,k)$-PMDs $X, Y$, if the “low-degree” parameter moment profiles of $X$ and $Y$ are close, then $X$ and $Y$ are close in total variation distance.
See Definition ? for the definition of parameter moments of a PMD. The formal statement of the aforementioned lemma appears as Lemma ? in Section ?. Our robust moment-matching lemma is the basis for our proper cover algorithm and our EPTAS for Nash equilibria in anonymous games. Our constructive cover upper bound is the following:
A sparse proper cover quantifies the “size” of the space of PMDs and provides useful structural information that can be exploited in a variety of applications. In addition to Nash equilibria in anonymous games, our efficient proper cover construction provides a smaller search space for approximately solving essentially any optimization problem over PMDs. As another corollary of our cover construction, we obtain the first EPTAS for computing threat points in anonymous games.
Perhaps surprisingly, we also prove that our above upper bound is essentially tight:
We remark that, in previous work , the authors proved a tight cover size bound of for -SIIRVs, i.e., sums of independent scalar random variables each supported on While a cover size lower bound for -SIIRVs directly implies the same lower bound for -PMDs, the opposite is not true. Indeed, Theorems ? and ? show that covers for -PMDs are inherently larger, requiring a doubly exponential dependence on
At a high-level, the Fourier techniques of this paper can be viewed as a highly non-trivial generalization of the techniques in our recent paper  on sums of independent scalar random variables. We would like to emphasize that a number of new conceptual and technical ideas are required to overcome the various obstacles arising in the multi-dimensional setting.
We start with an intuitive explanation of two key ideas that form the basis of our approach.
Sparsity of the Fourier Transform of PMDs. Since the Fourier Transform (FT) of a PMD is the product of the FTs of its component CRVs, its magnitude is the product of terms each bounded from above by Note that each term in the product is strictly less than except in a small region, unless the component CRV is trivial (i.e., essentially deterministic). Roughly speaking, to establish the sparsity of the FT of PMDs, we proceed as follows: We bound from above the magnitude of the FT by the FT of a Gaussian with the same covariance matrix as our PMD. (See, for example, Lemma ?.) This gives us tail bounds for the FT of the PMD in terms of the FT of this Gaussian, and when combined with the concentration of the PMD itself, yields the desired property.
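The product structure described above can be checked numerically. The sketch below evaluates the Fourier transform of a small PMD at a single frequency as a product of one factor per component CRV. The function name `pmd_fourier`, the example parameters, and the sign convention in the exponent are assumptions made for illustration; the magnitudes do not depend on the sign.

```python
import numpy as np

def pmd_fourier(probs, xi, sign=-1.0):
    """Fourier transform of a PMD at frequency xi.

    The factor of summand i is sum_j probs[i, j] * exp(sign * 2*pi*1j * xi[j]);
    the transform of the sum is the product of these factors.
    Returns the product and the magnitude of each individual factor.
    """
    phases = np.exp(sign * 2j * np.pi * np.asarray(xi, dtype=float))  # shape (k,)
    factors = probs @ phases                                          # shape (n,)
    return np.prod(factors), np.abs(factors)

# Hypothetical parameters; the last CRV is deterministic (trivial).
probs = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [1.0, 0.0, 0.0]])
value, mags = pmd_fourier(probs, xi=[0.1, 0.35, 0.8])
value_at_zero, _ = pmd_fourier(probs, xi=[0.0, 0.0, 0.0])
```

Each factor is a convex combination of unit-modulus numbers, so its magnitude is at most 1, with equality for the deterministic CRV. The non-trivial factors are strictly smaller than 1 at this frequency, which is the mechanism behind the decay of the transform.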
Approximation of the logarithm of the Fourier Transform. A key ingredient in our proofs is the approximation of the logarithm of the Fourier Transform (log FT) of PMDs by low-degree polynomials. Observe that the log FT is a sum of terms, which is convenient for the analysis. We focus on approximating the log FT by a low-degree Taylor polynomial within the effective support of the FT. (Note that outside the effective support the log FT can be infinity.) Morally speaking, the log FT is smooth, i.e., it is approximated by the first several terms of its Taylor series. Formally however, this statement is in general not true and requires various technical conditions, depending on the setting. One important point to note is that the sparsity of the FT controls the domain in which this approximation will need to hold, and thus helps us bound the Taylor error. We will need to ensure that the sizes of the Taylor coefficients are not too large given the location of the effective support, which turns out to be a non-trivial technical hurdle. To ensure this, we need to be very careful about how we perform this Taylor expansion. In particular, the correct choice of the point that we Taylor expand around will be critical for our applications. We elaborate on these difficulties in the relevant technical sections. Finally, we remark that the degree of polynomial approximation we will require depends on the setting: In our cover upper bounds, we will require (nearly) logarithmic degree, while for our CLT degree- approximation suffices.
We are now ready to give an overview of the ideas in the proofs of each of our results.
Efficient Learning Algorithm. The high-level structure of our learning algorithm relies on the sparsity of the Fourier transform, and is similar to the algorithm in our previous work  for learning sums of independent integer random variables. More specifically, our learning algorithm estimates the effective support of the DFT, and then computes the empirical DFT in this effective support. This high-level description would perhaps suffice, if we were only interested in bounding the sample complexity. In order to obtain a computationally efficient algorithm, it is crucial to use the appropriate definition of the DFT and its inverse.
In more detail, our algorithm works as follows: It starts by drawing samples to estimate the mean vector and covariance matrix of our PMD to good accuracy. Using these estimates, we can bound the effective support of our distribution in an appropriate ellipsoid. In particular, we show that our PMD lies (whp) in a fundamental domain of an appropriate integer lattice where is an integer matrix whose columns are appropriate functions of the eigenvalues and eigenvectors of the (sample) covariance matrix. This property allows us to learn our unknown PMD by learning the random variable To do this, we learn its Discrete Fourier transform. Let be the dual lattice to (i.e., the set of points so that for all ). Importantly, we define the DFT, of our PMD on the dual lattice that is, with A useful property of this definition is the following: the probability that attains a given value is given by the inverse DFT, defined on the lattice namely
The main structural property needed for the analysis of our algorithm is that there exists an explicit set with integer coordinates and cardinality that contains all but of the mass of Given this property, our algorithm draws an additional set of samples of size from the PMD, and computes the empirical DFT (modulo ) on its effective support Using these ingredients, we are able to show that the inverse of the empirical DFT defines a pseudo-distribution that is -close to our unknown PMD in total variation distance.
Observe that the support of the inverse DFT can be large, namely Our algorithm does not explicitly evaluate the inverse DFT at all these points, but outputs a succinct description of its hypothesis , via its DFT We emphasize that this succinct description suffices to efficiently obtain both an approximate evaluation oracle and an approximate sampler for our target PMD Indeed, it is clear that computing the inverse DFT at a single point can be done in time and gives an approximate oracle for the probability mass function of By using additional algorithmic ingredients, we show how to use an oracle for the DFT, , as a black-box to obtain a computationally efficient approximate sampler for
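As a rough one-dimensional illustration of the pipeline just described (empirical DFT on a small frequency set, then inverse DFT at individual points), consider the sketch below for a Poisson binomial distribution. It is a toy under stated assumptions, not the algorithm of this paper: the period `m`, the frequency set `freqs`, the sample size, and the function names are all hypothetical choices, and the frequency set is fixed by hand rather than derived from the estimated covariance.

```python
import numpy as np

def empirical_dft(samples, m, freqs):
    """Empirical DFT modulo m, evaluated at the frequencies j/m for j in freqs."""
    samples = np.asarray(samples)
    return {j: np.mean(np.exp(-2j * np.pi * j * samples / m)) for j in freqs}

def inverse_dft_at(h_hat, m, x):
    """Evaluate the hypothesis pseudo-distribution at the single point x.
    The cost is linear in the number of stored frequencies."""
    total = sum(v * np.exp(2j * np.pi * j * x / m) for j, v in h_hat.items())
    return total.real / m

rng = np.random.default_rng(1)
p = rng.uniform(0.2, 0.8, size=50)                  # hypothetical PBD parameters
samples = (rng.random((2000, 50)) < p).sum(axis=1)  # 2000 draws of the sum
m = 64                 # period; chosen larger than the support {0, ..., 50}
freqs = range(-8, 9)   # hypothetical symmetric low-frequency set
h_hat = empirical_dft(samples, m, freqs)
mass = sum(inverse_dft_at(h_hat, m, x) for x in range(m))
```

The hypothesis is stored only through the values in `h_hat`. Because the frequency set contains 0, where the empirical DFT equals 1, the values of the inverse DFT over one period sum to 1, although individual values may be slightly negative; this is why the output is a pseudo-distribution rather than a distribution.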
Our learning algorithm and its analysis are given in Section ?.
Constructive Proper Cover and Anonymous Games. The correctness of our learning algorithm easily implies (see Section ?) an algorithm to construct a non-proper -cover for PMDs of size While this upper bound is close to being best possible (see Section ?), it does not suffice for our algorithmic applications in anonymous games. For these applications, it is crucial to obtain an efficient algorithm that constructs a proper -cover, and in fact one that works in a certain stylized way.
To construct a proper cover, we rely on the sparsity of the continuous Fourier Transform of PMDs. Namely, we show that for any PMD with effective support there exists an appropriately defined set such that the contribution of to the -norm of is at most By using this property, we show that any two PMDs, with approximately the same variance in each direction, that have continuous Fourier transforms close to each other in the set are close in total variation distance. We build on this lemma to prove our robust moment-matching result. Roughly speaking, we show that two PMDs, with approximately the same variance in each direction, that are “close” to each other in their low-degree parameter moments are also close in total variation distance. We emphasize that the meaning of the term “close” here is quite subtle: we need to appropriately partition the component CRVs into groups, and approximate the parameter moments of the PMDs formed by each group within a different degree and different accuracy for each degree. (See Lemma ? in Section ?.)
Our algorithm to construct a proper cover and our EPTAS for Nash equilibria in anonymous games both proceed by a careful dynamic programming approach based on our aforementioned robust moment-matching result.
Finally, we note that combining our moment-matching lemma with a recent result in algebraic geometry gives us the following structural result of independent interest: Every PMD is -close to another PMD that is a sum of at most distinct -CRVs.
The aforementioned algorithmic and structural results are given in Section ?.
Cover Size Lower Bound. As mentioned above, a crucial ingredient of our cover upper bound is a robust moment-matching lemma, which translates closeness between the low-degree parameter moments of two PMDs to closeness between their Fourier Transforms, and in turn to closeness in total variation distance. To prove our cover lower bound, we follow the opposite direction. We construct an explicit set of PMDs with the property that any pair of distinct PMDs in our set have a non-trivial difference in (at least) one of their low-degree parameter moments. We then show that a difference in one of the parameter moments implies that there exists a point where the probability generating functions have a non-trivial difference. Notably, our proof for this step is non-constructive, making essential use of Cauchy’s integral formula. Finally, we can easily translate a pointwise difference between the probability generating functions to a non-trivial total variation distance error. We present our cover lower bound construction in Section ?.
Central Limit Theorem for PMDs. The basic idea of the proof of our CLT will be to compare the Fourier transform of our PMD to that of the discrete Gaussian with the same mean and covariance. By taking the inverse Fourier transform, we will be able to conclude that these distributions are pointwise close. A careful analysis using a Taylor approximation and the fact that both and have small effective support, gives us a total variation distance error independent of the size Alas, this approach results in an error dependence that is exponential in To obtain an error bound that scales polynomially with we require stronger bounds between and at points away from the mean. Intuitively, we need to take advantage of cancellation in the inverse Fourier transform integrals. To achieve this, we will use the saddlepoint method from complex analysis. The full proof of our CLT is given in Section ?.
There is extensive literature on distribution learning and computation of approximate Nash equilibria in various classes of games. We have already mentioned the most relevant references in the introduction.
Daskalakis et al.  studied the structure and learnability of PMDs. They obtained a non-proper -cover of size and an information-theoretic upper bound on the learning sample complexity of The dependence on in their cover size is also quasi-polynomial, but is suboptimal as follows from our upper and lower bounds. Importantly, the  construction yields a non-proper cover. As previously mentioned, a proper cover construction is necessary for our algorithmic applications. We note that the learning algorithm of  relies on enumeration over a cover, hence runs in time quasi-polynomial in even for The techniques of  are orthogonal to ours. Their cover upper bound is obtained by a clever black-box application of the CLT of , combined with a non-robust moment-matching lemma that they deduce from a result of Roos . We remind the reader that our Fourier techniques strengthen both these technical tools: Theorem ? strengthens the CLT of , and we prove a robust and quantitatively essentially optimal moment-matching lemma.
In recent work , the authors used Fourier analytic techniques to study the structure and learnability of sums of independent integer random variables (SIIRVs). The techniques of this paper can be viewed as a (highly nontrivial) generalization of those in . We also note that the upper bounds we obtain in this paper for learning and covering PMDs do not subsume the ones in . In fact, our cover upper and lower bounds in this work show that optimal covers for PMDs are inherently larger than optimal covers for SIIRVs. Moreover, the sample complexity of our SIIRV learning algorithm  is significantly better than that of our PMD learning algorithm in this paper.
Concurrently with and independently of our work,  obtained qualitatively similar results using different techniques. We now provide a statement of the  results in tandem with a comparison to our work.
 give a learning algorithm for PMDs with sample complexity and runtime The  algorithm uses the continuous Fourier transform, exploiting its sparsity property, plus additional structural and algorithmic ingredients. The aforementioned runtime is not polynomial in the sample size, unless is fixed. In contrast, our learning algorithm runs in sample–polynomial time, and, for fixed , in nearly-linear time. The  learning algorithm outputs an explicit hypothesis, which can be easily sampled. On the other hand, our algorithm outputs a succinct description of its hypothesis (via its DFT), and we show how to efficiently sample from it.
 also prove a size-free CLT, analogous to our Theorem ?, with error polynomial in and Their CLT is obtained by bootstrapping the CLT of  using techniques from . As previously mentioned, our proof is technically orthogonal to , making use of the sparsity of the Fourier transform combined with tools from complex analysis. It is worth noting that our CLT also achieves a near-optimal dependence in the error as a function of (up to log factors).
Finally,  prove analogues of Theorems ?, ?, and ? with qualitatively similar bounds to ours. We note that  improve the dependence on in the cover size to an optimal while the dependence on in their cover upper bound is the same as in . The cover size lower bound of  is qualitatively of the right form, though slightly suboptimal as a function of The algorithms to construct proper covers and the corresponding EPTAS for anonymous games in both works have running time roughly comparable to the PMD cover size.
In Section ?, we describe and analyze our learning algorithm for PMDs. Section ? contains our proper cover upper bound construction, our cover size lower bound, and the related approximation algorithm for Nash equilibria in anonymous games. Finally, Section ? contains the proof of our CLT.
In this section, we record the necessary definitions and terminology that will be used throughout the technical sections of this paper.
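Notation. For $n \in \mathbb{Z}_+$, we will denote $[n] := \{1, \ldots, n\}$. For a vector $v \in \mathbb{R}^n$, and $p \geq 1$, we will denote $\|v\|_p := \left(\sum_{i=1}^{n} |v_i|^p\right)^{1/p}$. We will use the boldface notation $\mathbf{0}$ to denote the zero vector or matrix in the appropriate dimension.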
Poisson Multinomial Distributions. We start by defining our basic object of study:
We will require the following notion of a parameter moment for a PMD:
(Pseudo-)Distributions and Total Variation Distance. A function $P : A \to \mathbb{R}$, over a finite set $A$, is called a distribution if $P(a) \geq 0$ for all $a \in A$, and $\sum_{a \in A} P(a) = 1$. The function $P$ is called a pseudo-distribution if $\sum_{a \in A} P(a) = 1$. For $S \subseteq A$, we sometimes write $P(S)$ to denote $\sum_{a \in S} P(a)$. A distribution $P$ supported on a finite domain $A$ can be viewed as the probability mass function of a random variable $X$, i.e., $P(a) = \Pr_{X \sim P}[X = a]$.
The total variation distance between two pseudo-distributions $P$ and $Q$ supported on a finite domain $A$ is $d_{TV}(P, Q) := \max_{S \subseteq A} |P(S) - Q(S)| = (1/2) \cdot \|P - Q\|_1$. If $X$ and $Y$ are two random variables ranging over a finite set, their total variation distance $d_{TV}(X, Y)$ is defined as the total variation distance between their distributions. For convenience, we will often blur the distinction between a random variable and its distribution.
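For reference, a minimal sketch of computing this distance for explicitly given (pseudo-)distributions follows. It takes the total variation distance to be half the $\ell_1$ distance, which is the usual convention and is assumed here; the dictionary representation and the function name `total_variation` are illustrative choices.

```python
def total_variation(P, Q):
    """Half the l1 distance between two (pseudo-)distributions,
    each given as a dict mapping domain points to (possibly negative) mass."""
    support = set(P) | set(Q)
    return 0.5 * sum(abs(P.get(a, 0.0) - Q.get(a, 0.0)) for a in support)

P = {0: 0.5, 1: 0.5}
Q = {0: 0.25, 1: 0.25, 2: 0.5}
d = total_variation(P, Q)      # 0.5 * (0.25 + 0.25 + 0.5)

# A pseudo-distribution may carry negative mass but still sums to 1.
R = {0: 1.1, 1: -0.1}
d_self = total_variation(R, R)
```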
Covers. Let $(\mathcal{X}, d)$ be a metric space. Given $\epsilon > 0$, a subset $\mathcal{Y} \subseteq \mathcal{X}$ is said to be a proper $\epsilon$-cover of $\mathcal{X}$ with respect to the metric $d$ if for every $x \in \mathcal{X}$ there exists some $y \in \mathcal{Y}$ such that $d(x, y) \leq \epsilon$. (If $\mathcal{Y}$ is not necessarily a subset of $\mathcal{X}$, then we obtain a non-proper $\epsilon$-cover.) There may exist many $\epsilon$-covers of $\mathcal{X}$, but one is typically interested in one with the minimum cardinality. The $\epsilon$-covering number of $(\mathcal{X}, d)$ is the minimum cardinality of any $\epsilon$-cover of $\mathcal{X}$. Intuitively, the covering number of a metric space captures the “size” of the space. In this work, we will be interested in efficiently constructing sparse covers for PMDs under the total variation distance metric.
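The definition of a proper cover can be illustrated on a finite metric space with a simple greedy procedure. This is only a generic sketch: it does not produce a minimum-cardinality cover, it is unrelated to the cover construction of this paper, and the function name `greedy_proper_cover` and the toy point set are assumptions.

```python
def greedy_proper_cover(points, dist, eps):
    """Select a subset of `points` such that every point is within eps of a
    selected point. A point is added only if no earlier center covers it."""
    cover = []
    for p in points:
        if all(dist(p, c) > eps for c in cover):
            cover.append(p)
    return cover

# Toy metric space: points on the real line with the absolute-value metric.
points = [0.0, 0.1, 0.25, 0.5, 0.55, 1.0]
dist = lambda a, b: abs(a - b)
cover = greedy_proper_cover(points, dist, eps=0.2)
```

The output is proper because every center is itself an element of the space. The selected centers are also pairwise more than `eps` apart, so their number is bounded by the corresponding packing number of the space.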
Distribution Learning. We now define the notion of distribution learning we use in this paper. Note that an explicit description of a discrete distribution via its probability mass function scales linearly with the support size. Since we are interested in the computational complexity of distribution learning, our algorithms will need to use a succinct description of their output hypothesis. A simple succinct representation of a discrete distribution is via an evaluation oracle for the probability mass function:
One of the most general ways to succinctly specify a distribution is to give the code of an efficient algorithm that takes “pure” randomness and transforms it into a sample from the distribution. This is the standard notion of a sampler:
We can now give a formal definition of distribution learning:
Anonymous Games and Nash Equilibria. An anonymous game is a triple where , , is the set of players, , , a common set of strategies available to all players, and the payoff function of player when she plays strategy . This function maps the set of partitions to the interval . That is, it is assumed that the payoff of each player depends on her own strategy and only the number of other players choosing each of the strategies.
We denote by the convex hull of the set , i.e., A mixed strategy is an element of A mixed strategy profile is a mapping from to . We denote by the mixed strategy of player in the profile and the collection of all mixed strategies but ’s in . For , a mixed strategy profile is a (well-supported) -Nash equilibrium iff for all and we have: Note that given a mixed strategy profile , we can compute a player’s expected payoff in time by straightforward dynamic programming.
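The dynamic program mentioned above can be sketched as follows: process the other players one at a time while maintaining the probability mass function of the vector counting how many of them chose each strategy. The function names, the list-of-lists representation of a mixed strategy profile, and the example payoff are illustrative assumptions.

```python
from collections import defaultdict

def others_count_distribution(profile, skip):
    """pmf of the strategy-count vector of all players except `skip`.
    profile[i][a] is the probability that player i plays strategy a."""
    k = len(profile[0])
    dist = {(0,) * k: 1.0}
    for i, sigma in enumerate(profile):
        if i == skip:
            continue
        nxt = defaultdict(float)
        for counts, pr in dist.items():
            for a, q in enumerate(sigma):
                if q > 0.0:
                    c = list(counts)
                    c[a] += 1          # player i adds the basis vector e_a
                    nxt[tuple(c)] += pr * q
        dist = dict(nxt)
    return dist

def expected_payoff(profile, player, payoff):
    """Expected value of payoff(counts) over the other players' counts,
    for a fixed pure strategy of `player` that is baked into `payoff`."""
    dist = others_count_distribution(profile, player)
    return sum(pr * payoff(counts) for counts, pr in dist.items())

# Hypothetical game: 3 players, 2 strategies; the payoff is the fraction of
# the other players who choose strategy 0.
profile = [[1.0, 0.0], [0.5, 0.5], [0.2, 0.8]]
value = expected_payoff(profile, player=0, payoff=lambda c: c[0] / 2)
dist = others_count_distribution(profile, skip=0)
```

The table has one entry per reachable count vector, so with a constant number of strategies its size, and hence the running time, is polynomial in the number of players.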
Note that the mixed strategy of player defines the -CRV , i.e., a random vector supported in the set , such that , for all . Hence, if is a mixed strategy profile, the expected payoff of player for using pure strategy is
Multidimensional Fourier Transform. Throughout this paper, we will make essential use of the (continuous and the discrete) multidimensional Fourier transform. For , we will denote . The (continuous) Fourier Transform (FT) of a function is the function defined as For the case that is a probability mass function, we can equivalently write
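For computational purposes, we will also need the Discrete Fourier Transform (DFT) and its inverse, whose definition is somewhat more subtle. Let $M \in \mathbb{Z}^{k \times k}$ be an integer $k \times k$ matrix. We consider the integer lattice $L = L(M) = M\mathbb{Z}^k := \{p \in \mathbb{Z}^k \mid p = Mq,\ q \in \mathbb{Z}^k\}$, and its dual lattice $L^{\ast} = L^{\ast}(M) := \{\xi \in \mathbb{R}^k \mid \xi \cdot x \in \mathbb{Z} \text{ for all } x \in L\}$. Note that $L^{\ast} = (M^T)^{-1}\mathbb{Z}^k$ and that $L^{\ast}$ is not necessarily integral. The quotient $\mathbb{Z}^k / L$ is the set of equivalence classes of points in $\mathbb{Z}^k$ such that two points $x, y \in \mathbb{Z}^k$ are in the same equivalence class iff $x - y \in L$. Similarly, the quotient $L^{\ast} / \mathbb{Z}^k$ is the set of equivalence classes of points in $L^{\ast}$ such that any two points $x, y \in L^{\ast}$ are in the same equivalence class iff $x - y \in \mathbb{Z}^k$.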
The Discrete Fourier Transform (DFT) modulo , , of a function is the function defined as (We will remove the subscript when it is clear from the context.) Similarly, for the case that is a probability mass function, we can equivalently write The inverse DFT of a function is the function defined on a fundamental domain of as follows: Note that these operations are inverse of each other, namely for any function , the inverse DFT of is identified with
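The DFT modulo an integer matrix and its inverse can be made concrete with a small two-dimensional sketch. The code below is illustrative only. It assumes the sign convention e(t) = exp(-2πit), the normalization 1/|det M| in the inverse, and a 2×2 matrix. The function names (`dual_reps`, `dft_mod`, `inverse_dft`) are hypothetical.

```python
import cmath
from fractions import Fraction
from itertools import product


def e(t):
    # Assumed convention for this sketch: e(t) = exp(-2*pi*i*t).
    return cmath.exp(-2j * cmath.pi * float(t % 1))


def dual_reps(M):
    """Representatives in [0,1)^2 of L*/Z^2, where L* = (M^T)^{-1} Z^2."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("M must be nonsingular")
    reps = set()
    # j ranges over a box that covers every class of Z^2 / (M^T Z^2).
    for j0, j1 in product(range(abs(det)), repeat=2):
        reps.add((Fraction(d * j0 - c * j1, det) % 1,
                  Fraction(-b * j0 + a * j1, det) % 1))
    return sorted(reps)


def dft_mod(F, M):
    """DFT modulo M of F, given as a dict {(x0, x1): value}."""
    return {xi: sum(v * e(xi[0] * x[0] + xi[1] * x[1]) for x, v in F.items())
            for xi in dual_reps(M)}


def inverse_dft(F_hat, M, x):
    """Inverse DFT evaluated at the integer point x."""
    (a, b), (c, d) = M
    det = abs(a * d - b * c)
    return sum(v * e(-(xi[0] * x[0] + xi[1] * x[1]))
               for xi, v in F_hat.items()) / det


# A mass function supported on a fundamental domain of L = M Z^2.
M = ((3, 1), (0, 2))
F = {(0, 0): 0.1, (1, 0): 0.2, (2, 0): 0.15,
     (0, 1): 0.25, (1, 1): 0.05, (2, 1): 0.25}
F_hat = dft_mod(F, M)
recovered = {x: inverse_dft(F_hat, M, x) for x in F}
```

Because the six support points lie in distinct classes modulo the lattice, the inverse DFT recovers the mass function exactly (up to floating-point error). Evaluating the inverse at a point shifted by a lattice vector returns the same value, which is the periodicity that makes the choice of fundamental domain matter.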
Let be an -PMD such that for and we denote , where To avoid clutter in the notation, we will sometimes use the symbol to denote the corresponding probability mass function. With this convention, we can write that
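Since the summands of a PMD are independent, its Fourier transform factorizes into a product of one simple factor per summand. The sketch below checks this numerically on a toy instance by comparing the product form with a brute-force expectation over all outcomes. It assumes the convention e(t) = exp(-2πit); the function names are hypothetical.

```python
import cmath
from itertools import product


def e(t):
    # Assumed convention for this sketch: e(t) = exp(-2*pi*i*t).
    return cmath.exp(-2j * cmath.pi * t)


def pmd_ft_product(prob_rows, xi):
    """Fourier transform at xi as a product over the independent summands."""
    result = 1.0 + 0.0j
    for row in prob_rows:
        result *= sum(p * e(xi_j) for p, xi_j in zip(row, xi))
    return result


def pmd_ft_bruteforce(prob_rows, xi):
    """Same quantity, computed as an expectation over all k^n outcomes."""
    k = len(xi)
    total = 0j
    for choice in product(range(k), repeat=len(prob_rows)):
        mass = 1.0
        counts = [0] * k
        for row, j in zip(prob_rows, choice):
            mass *= row[j]
            counts[j] += 1
        total += mass * e(sum(c * x for c, x in zip(counts, xi)))
    return total


rows = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.7, 0.1, 0.2]]
xi = (0.13, 0.41, 0.77)
gap = abs(pmd_ft_product(rows, xi) - pmd_ft_bruteforce(rows, xi))
```

The product form needs only on the order of nk operations per frequency, whereas the brute-force sum is exponential in n.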
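Basics from Linear Algebra. We remind the reader of a few basic definitions from linear algebra that we will repeatedly use throughout this paper. The Frobenius norm of $A \in \mathbb{R}^{m \times n}$ is $\|A\|_F = \sqrt{\sum_{i,j} A_{i,j}^2}$. The spectral norm (or induced $2$-norm) of $A$ is defined as $\|A\|_2 = \max_{x : \|x\|_2 = 1} \|Ax\|_2 = \sqrt{\lambda_{\max}(A^T A)}$. We note that for any $A \in \mathbb{R}^{m \times n}$, it holds $\|A\|_2 \le \|A\|_F$. A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is called positive semidefinite (PSD), denoted by $A \succeq 0$, if $x^T A x \ge 0$ for all $x \in \mathbb{R}^n$, or equivalently all the eigenvalues of $A$ are nonnegative. Similarly, a symmetric matrix $A \in \mathbb{R}^{n \times n}$ is called positive definite (PD), denoted by $A \succ 0$, if $x^T A x > 0$ for all $x \in \mathbb{R}^n$, $x \neq 0$, or equivalently all the eigenvalues of $A$ are strictly positive. For two symmetric matrices $A, B \in \mathbb{R}^{n \times n}$ we write $A \succeq B$ to denote that the difference $A - B$ is PSD, i.e., $A - B \succeq 0$. Similarly, we write $A \succ B$ to denote that the difference $A - B$ is PD, i.e., $A - B \succ 0$.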
In this section, we describe and analyze our sample near-optimal and computationally efficient learning algorithm for PMDs. This section is organized as follows: In Section ?, we give our main algorithm which, given samples from a PMD , efficiently computes a succinct description of a hypothesis pseudo-distribution such that As previously explained, the succinct description of is via its DFT , which is supported on a discrete set of cardinality . Note that provides an -evaluation oracle for with running time In Section ?, we show how to use in a black-box manner, to efficiently obtain an -sampler for , i.e., sample from a distribution such that Finally, in Section ? we show how a nearly-tight cover upper bound can easily be deduced from our learning algorithm.
In this subsection, we give an algorithm Efficient-Learn-PMD establishing the following theorem:
Our learning algorithm is described in the following pseudo-code:
Let be the unknown target -PMD. We will denote by the probability mass function of , i.e., . Throughout this analysis, we will denote by and the mean vector and covariance matrix of
First, note that the algorithm Efficient-Learn-PMD is easily seen to have the desired sample and time complexity. Indeed, the algorithm draws samples in Step 1 and samples in Step 5, for a total sample complexity of The runtime of the algorithm is dominated by computing the DFT in Step ? which takes time Computing an approximate eigendecomposition can be done in time (see, e.g., ). The remaining part of this section is devoted to proving the correctness of our algorithm.
Overview of Analysis. We begin with a brief overview of the analysis. First, we show (Lemma ?) that at least of the probability mass of lies in the ellipsoid with center and covariance matrix Moreover, with high probability over the samples drawn in Step 1 of the algorithm, the estimates and will be good approximations of and (Lemma ?). By combining these two lemmas, we obtain (Corollary ?) that at least of the probability mass of lies in the ellipsoid with center and covariance matrix
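The estimation step in this overview can be sketched as follows. This is a minimal illustration and not the algorithm's exact Step 1: it uses the plain sample mean and the sample covariance normalized by the number of samples, which is an assumption, and the helper names `sample_pmd` and `empirical_mean_cov` are hypothetical.

```python
import random


def sample_pmd(prob_rows, rng):
    """Draw one sample: each summand picks a coordinate independently."""
    k = len(prob_rows[0])
    counts = [0] * k
    for row in prob_rows:
        j = rng.choices(range(k), weights=row)[0]
        counts[j] += 1
    return counts


def empirical_mean_cov(samples):
    """Sample mean vector and (1/N-normalized) sample covariance matrix."""
    n_samples = len(samples)
    k = len(samples[0])
    mean = [sum(s[j] for s in samples) / n_samples for j in range(k)]
    cov = [[sum((s[a] - mean[a]) * (s[b] - mean[b]) for s in samples) / n_samples
            for b in range(k)]
           for a in range(k)]
    return mean, cov


rng = random.Random(0)
rows = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.7, 0.1, 0.2], [0.25, 0.25, 0.5]]
samples = [sample_pmd(rows, rng) for _ in range(2000)]
mean_hat, cov_hat = empirical_mean_cov(samples)
```

Every sample has coordinates summing to the number of summands, so each row of the estimated covariance sums to zero and the matrix is singular. This is one reason a regularized matrix, rather than the raw covariance, is the natural object to invert when defining an ellipsoid.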
By the above, and by our choice of the matrix we use linear-algebraic arguments to prove (Lemma ?) that almost all of the probability mass of lies in the set , a fundamental domain of the lattice This lemma is crucial because it implies that, to learn our PMD it suffices to learn the random variable We do this by learning the Discrete Fourier transform of this distribution. This step can be implemented efficiently due to the sparsity property of the DFT (Proposition ?): except for points in , the magnitude of the DFT will be very small. Establishing the desired sparsity property for the DFT is the main technical contribution of this section.
Given the above, it is fairly easy to complete the analysis of correctness. For every point in we can learn the DFT up to absolute error Since the cardinality of is appropriately small, this implies that the total error over is small. The sparsity property of the DFT (Lemma ?) completes the proof.
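Learning the DFT at a single point amounts to averaging a bounded complex random variable over the samples. The sketch below illustrates this with an empirical average and compares it with the exact value for a small PMD. It assumes the convention e(t) = exp(-2πit) and evaluates at arbitrary frequencies instead of the specific point set used by the algorithm. All function names are hypothetical.

```python
import cmath
import random


def e(t):
    # Assumed convention for this sketch: e(t) = exp(-2*pi*i*t).
    return cmath.exp(-2j * cmath.pi * t)


def sample_pmd(prob_rows, rng):
    """Draw one sample of the sum of independent categorical vectors."""
    k = len(prob_rows[0])
    counts = [0] * k
    for row in prob_rows:
        counts[rng.choices(range(k), weights=row)[0]] += 1
    return counts


def empirical_dft(samples, xi):
    """Average of e(xi . x) over the samples: an estimate of the DFT at xi."""
    return sum(e(sum(x_j * xi_j for x_j, xi_j in zip(x, xi)))
               for x in samples) / len(samples)


def exact_dft(prob_rows, xi):
    """Exact value via the product over independent summands."""
    out = 1.0 + 0.0j
    for row in prob_rows:
        out *= sum(p * e(xi_j) for p, xi_j in zip(row, xi))
    return out


rng = random.Random(1)
rows = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.7, 0.1, 0.2]]
samples = [sample_pmd(rows, rng) for _ in range(4000)]
xi = (0.05, 0.20, 0.35)
error = abs(empirical_dft(samples, xi) - exact_dft(rows, xi))
```

Each term of the average has modulus one, so standard concentration bounds give an absolute error that shrinks like the inverse square root of the number of samples at any fixed frequency. Summing this error over a small set of frequencies is what keeps the total error under control.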
Detailed Analysis. We now proceed with the detailed analysis of our algorithm. We start by showing that PMDs are concentrated with high probability. More specifically, the following lemma shows that an unknown PMD , with mean vector and covariance matrix , is effectively supported in an ellipsoid centered at , whose principal axes are determined by the eigenvectors and eigenvalues of and the desired concentration probability:
Let , where the ’s are independent -CRVs. We can write where Note that for any unit vector , , we have that the scalar random variable is a sum of independent, mean , bounded random variables. Indeed, we have that and that Moreover, we can write
where we used the Cauchy-Schwarz inequality twice, the triangle inequality, and the fact that a -CRV with mean by definition satisfies and
Let be the variance of By Bernstein’s inequality, we obtain that for it holds
Let , the covariance matrix of have an orthonormal eigenbasis , with corresponding eigenvalues , Since is positive-semidefinite, we have that , for all We consider the random variable . In addition to being a sum of independent, mean , bounded random variables, we claim that First, it is clear that Moreover, note that for any vector , we have that For , we thus get
Applying (Equation 1) for with yields that for all we have
By a union bound, it follows that
We condition on this event.
Since and are the eigenvectors and eigenvalues of we have that where has column We can thus write
Therefore, we have:
where the last inequality follows from our conditioning, the definition of and the elementary inequality This completes the proof of Lemma ?.
Lemma ? shows that an arbitrary -PMD puts at least of its probability mass in the ellipsoid where is an appropriate universal constant. This is the ellipsoid centered at , whose principal semiaxes are parallel to the ’s, i.e., the eigenvectors of , or equivalently of The length of the principal semiaxis corresponding to is determined by the corresponding eigenvalue of and is equal to
Note that this ellipsoid depends on the mean vector and covariance matrix that are unknown to the algorithm. To obtain a bounding ellipsoid that is known to the algorithm, we will use the following lemma (see Appendix ? for the simple proof) showing that and are good approximations to and respectively.
We also need to deal with the error introduced in the eigendecomposition of . Concretely, we factorize as for an orthogonal matrix and diagonal matrix This factorization is necessarily inexact. By increasing the precision to which we learn by a constant factor, we can still have We could redefine in terms of our computed orthonormal eigenbasis, i.e., . Thus, we may henceforth assume that the decomposition is exact.
For the rest of this section, we will condition on the event that the statements of Lemma ? are satisfied. By combining Lemmas ? and ?, we show that we can get a known ellipsoid containing the effective support of by replacing and in the definition of by their sample versions. More specifically, we have the following corollary:
By Lemma ?, it holds that . Hence, we have that
In terms of and , this is . By standard results, taking inverses reverses the positive semi-definite ordering (see e.g., Corollary 7.7.4 (a) in ). Hence,
Combining the above with Lemma ?, with probability at least over we have that
Since , and therefore , Lemma ? gives that
We then obtain:
Equations (Equation 2) and (Equation 3) yield that the first two terms are . Since is positive-definite, as a function of vectors is an inner product. So, we may apply the Cauchy-Schwarz inequality to bound each of the last two terms from above by
Corollary ? shows that our unknown PMD puts at least of its probability mass in the ellipsoid for an appropriate universal constant This is the ellipsoid centered at , whose principal semiaxes are parallel to the ’s, the eigenvectors of , and the length of the principal semiaxis parallel to is equal to
Our next step is to relate the ellipsoid to the integer matrix used in our algorithm. Let be the matrix with column where is the constant in the algorithm statement. The matrix is obtained by rounding each entry of to the closest integer point. We note that the ellipsoid can be equivalently expressed as Using the relation between and , we show that is enclosed in the ellipsoid which is in turn enclosed in the parallelepiped with integer corner points This parallelepiped is a fundamental domain of the lattice Formally, we have:
Let be the matrix with columns where is the constant in the algorithm statement. Note that,
For a large enough constant , Corollary ? implies that with probability at least
Note that the above is an equivalent description of the ellipsoid Our lemma will follow from the following claim:
By construction, and differ by at most in each entry, and therefore has Frobenius norm (and, thus, induced -norm) at most For any , we thus have that
Similarly, we get In terms of the PSD ordering, we have:
Since , both and are positive-definite, and so and are invertible. Taking inverses in Equation (Equation 6) reverses the ordering, that is:
The claim now follows.
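The order-reversal fact used in this argument (for positive-definite matrices, taking inverses reverses the PSD ordering) can be checked numerically on a toy pair. The following sketch uses generic 2×2 symmetric matrices that are unrelated to the specific matrices of the claim. The helper names and the numerical tolerance are assumptions of this illustration.

```python
def inv2(A):
    """Inverse of a nonsingular 2x2 matrix given as nested lists."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def sub2(A, B):
    """Entrywise difference A - B of two 2x2 matrices."""
    return [[A[i][j] - B[i][j] for j in range(2)] for i in range(2)]


def is_psd2(S, tol=1e-12):
    """A symmetric 2x2 matrix is PSD iff its trace and determinant are >= 0."""
    trace = S[0][0] + S[1][1]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return trace >= -tol and det >= -tol


A = [[2.0, 0.5], [0.5, 1.0]]   # positive definite
B = [[3.0, 0.7], [0.7, 1.5]]   # B - A is positive definite, so A <= B

forward = is_psd2(sub2(B, A))               # A <= B in the PSD order
reversed_ = is_psd2(sub2(inv2(A), inv2(B)))  # B^{-1} <= A^{-1}
```

Both flags evaluate to true for this pair, while the difference of the inverses taken in the other order is not PSD. This matches the direction in which the ordering flips in the proof.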
Hence, Claim ? implies that with probability at least , we have:
where the last inequality follows from (Equation 5). In other words, with probability at least , lies in , which was to be proved.
Recall that denotes the lattice . The above lemma implies that it is sufficient to learn the random variable . To do this, we will learn its Discrete Fourier transform. Let be the dual lattice to Recall that the DFT of the PMD , with , is the function defined by Moreover, the probability that attains a given value is given by the inverse DFT, namely