The Fourier Transform of Poisson Multinomial Distributions
and its Algorithmic Applications
Abstract
We study Poisson Multinomial Distributions – a fundamental family of discrete distributions that generalize the binomial and multinomial distributions, and are commonly encountered in computer science. Formally, an $(n, k)$-Poisson Multinomial Distribution (PMD) is a random variable of the form $X = \sum_{i=1}^n X_i$, where the $X_i$'s are independent random vectors supported on the set of standard basis vectors in $\mathbb{R}^k$. In this paper, we obtain a refined structural understanding of PMDs by analyzing their Fourier transform. As our core structural result, we prove that the Fourier transform of PMDs is approximately sparse, i.e., roughly speaking, its $L_1$ norm is small outside a small set. By building on this result, we obtain the following applications:
Learning Theory. We design the first computationally efficient learning algorithm for PMDs with respect to the total variation distance. Our algorithm learns an arbitrary $(n, k)$-PMD within variation distance $\epsilon$ using a near-optimal sample size of $\widetilde{O}_k(1/\epsilon^2)$, and runs in time $\widetilde{O}_k(1/\epsilon^2) \cdot \log n$. Previously, no algorithm with a $\mathrm{poly}(1/\epsilon)$ runtime was known, even for $k = 3$.
Game Theory. We give the first efficient polynomial-time approximation scheme (EPTAS) for computing Nash equilibria in anonymous games. For normalized anonymous games with $n$ players and $k$ strategies, our algorithm computes a well-supported $\epsilon$-Nash equilibrium in time $n^{O(k^3)} \cdot (1/\epsilon)^{O(k^3 \log(k/\epsilon)/\log\log(k/\epsilon))^{k-1}}$. The best previous algorithm for this problem [DP08, DP14] had running time $n^{(f(k)/\epsilon)^k}$, where $f(k) = \Omega(k^{k^2})$, for any $k > 2$.
Statistics. We prove a multivariate central limit theorem (CLT) that relates an arbitrary PMD to a discretized multivariate Gaussian with the same mean and covariance, in total variation distance. Our new CLT strengthens the CLT of Valiant and Valiant [VV10, VV11] by completely removing the dependence on $n$ in the error bound.
Along the way we prove several new structural results of independent interest about PMDs. These include: (i) a robust moment-matching lemma, roughly stating that two PMDs that approximately agree on their low-degree parameter moments are close in variation distance; (ii) near-optimal size proper covers for PMDs in total variation distance (constructive upper bound and nearly-matching lower bound). In addition to Fourier analysis, we employ a number of analytic tools, including the saddle-point method from complex analysis, that may find other applications.
1 Introduction
1.1 Background and Motivation
The Poisson Multinomial Distribution (PMD) is the discrete probability distribution of a sum of mutually independent categorical random variables over the same sample space. A categorical random variable (CRV) describes the result of a random event that takes on one of $k$ possible outcomes. Formally, an $(n, k)$-PMD is any random variable of the form $X = \sum_{i=1}^n X_i$, where the $X_i$'s are independent random vectors supported on the set of standard basis vectors in $\mathbb{R}^k$.
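To make the definition concrete, the following minimal Python sketch (an illustration, not part of the paper's formal development; the function name and parameter layout are our own) samples from an $(n, k)$-PMD specified by an $n \times k$ matrix $p$, whose $i$-th row holds the outcome probabilities of the $i$-th CRV:

```python
import random

def sample_pmd(p):
    """Draw one sample from the (n,k)-PMD defined by an n-by-k matrix p,
    where p[i][j] = Pr[the i-th CRV equals the basis vector e_j]."""
    k = len(p[0])
    x = [0] * k                      # running sum of the basis vectors
    for row in p:                    # one independent k-CRV per row
        j = random.choices(range(k), weights=row)[0]
        x[j] += 1                    # adding e_j increments coordinate j
    return tuple(x)

# A (3,2)-PMD: three independent 2-CRVs, i.e., a Poisson binomial distribution.
p = [[0.5, 0.5], [0.9, 0.1], [0.3, 0.7]]
x = sample_pmd(p)
# The coordinates always sum to n = 3, since each CRV contributes one basis vector.
```

Note that for $k = 2$ the second coordinate of the output is a Poisson binomial random variable, matching the special case discussed next.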
PMDs comprise a broad class of discrete distributions of fundamental importance in computer science, probability, and statistics. A large body of work in the probability and statistics literature has been devoted to the study of the behavior of PMDs under various structural conditions [Bar88, Loh92, BHJ92, Ben03, Roo99, Roo10]. PMDs generalize the familiar binomial and multinomial distributions, and describe many distributions commonly encountered in computer science (see, e.g., [DP07, DP08, Val08, VV11]). The case $k = 2$ corresponds to the Poisson binomial distribution (PBD), introduced by Poisson [Poi37] as a nontrivial generalization of the binomial distribution.
Recent years have witnessed a flurry of research activity on PMDs and related distributions, from several perspectives of theoretical computer science, including learning [DDS12, DDO13, DKS15a, DKT15, DKS15b], property testing [Val08, VV10, VV11], computational game theory [DP07, DP08, BCI08, DP09, DP14, GT14], and derandomization [GMRZ11, BDS12, De15, GKM15]. More specifically, the following questions have been of interest to the TCS community:

Is there a statistically and computationally efficient algorithm for learning PMDs from independent samples in total variation distance?

How fast can we compute approximate Nash equilibria in anonymous games with many players and a small number of strategies per player?

How well can a PMD be approximated, in total variation distance, by a discretized Gaussian with the same mean and covariance matrix?
The first question is a fundamental problem in unsupervised learning that has received considerable recent attention in TCS [DDS12, DDO13, DKS15a, DKT15, DKS15b]. The aforementioned works have studied the learnability of PMDs, and related distribution families, in particular PBDs (i.e., $(n, 2)$-PMDs) and sums of independent integer random variables. Prior to this work, no computationally efficient learning algorithm for PMDs was known, even for the case of $k = 3$.
The second question concerns an important class of succinct games previously studied in the economics literature [Mil96, Blo99, Blo05], whose (exact) Nash equilibrium computation was recently shown to be intractable [CDO15]. The formal connection between computing Nash equilibria in these games and PMDs was established in a sequence of papers by Daskalakis and Papadimitriou [DP07, DP08, DP09, DP14], who leveraged it to give the first PTAS for the problem. Prior to this work, no efficient PTAS was known, even for anonymous games with three strategies per player.
The third question refers to the design of Central Limit Theorems (CLTs) for PMDs with respect to the total variation distance. Despite a substantial amount of work in probability theory, the first strong CLT of this form appears to have been shown by Valiant and Valiant [VV10, VV11], motivated by applications in distribution property testing. [VV10, VV11] leveraged their CLT to obtain tight lower bounds for several fundamental problems in property testing. We remark that the error bound of the [VV10] CLT has a logarithmic dependence on the size $n$ of the PMD (number of summands), and it was conjectured in [VV10] that this dependence is unnecessary.
1.2 Our Results
The main technical contribution of this work is the use of Fourier analytic techniques to obtain a refined understanding of the structure of PMDs. As our core structural result, we prove that the Fourier transform of PMDs is approximately sparse, i.e., roughly speaking, its $L_1$ norm is small outside a small set. By building on this property, we are able to obtain various new structural results about PMDs, and make progress on the three questions stated in the previous subsection. In this subsection, we describe our algorithmic and structural contributions in detail.
We start by stating our algorithmic results in learning and computational game theory, followed by an informal description of our structural results and the connections between them.
Distribution Learning.
As our main learning result, we obtain the first statistically and computationally efficient learning algorithm for PMDs with respect to the total variation distance. In particular, we show:
Theorem 1.1 (Efficiently Learning PMDs).
For all $n, k \in \mathbb{Z}_+$ and $\epsilon > 0$, there is an algorithm for learning $(n, k)$-PMDs with the following performance guarantee: Let $P$ be an unknown $(n, k)$-PMD. The algorithm uses $\widetilde{O}_k(1/\epsilon^2)$ samples from $P$, runs in time$^1$ $\widetilde{O}_k(1/\epsilon^2) \cdot \log n$, and with probability at least $9/10$ outputs an $\epsilon$-sampler for $P$. ($^1$We work in the standard “word RAM” model in which basic arithmetic operations on $O(\log n)$-bit integers are assumed to take constant time.)
We remark that our learning algorithm outputs a succinct description of its hypothesis $H$ via its Discrete Fourier Transform (DFT), which is supported on a small size set. We show that the DFT gives both an efficient sampler and an efficient evaluation oracle for $H$.
Our algorithm learns an unknown $(n, k)$-PMD within variation distance $\epsilon$ with sample complexity $\widetilde{O}_k(1/\epsilon^2)$ and computational complexity $\widetilde{O}_k(1/\epsilon^2) \cdot \log n$. The sample complexity of our algorithm is near-optimal for any fixed $k$, as $\Omega(k/\epsilon^2)$ samples are necessary, even for $n = 1$. We note that recent work by Daskalakis et al. [DKT15] established a similar sample complexity upper bound, however their algorithm is not computationally efficient. More specifically, it runs in time quasi-polynomial in $1/\epsilon$, even for $k = 2$. For the case $k = 2$, in recent work [DKS15a] the authors of this paper gave a learning algorithm with near-optimal sample complexity and runtime. Prior to this work, no algorithm with a $\mathrm{poly}(1/\epsilon)$ sample size and runtime was known, even for $k = 3$.
Our learning algorithm and its analysis are described in Section 3.
Computational Game Theory.
As our second algorithmic contribution, we give the first efficient polynomial-time approximation scheme (EPTAS) for computing Nash equilibria in anonymous games with many players and a small number of strategies. In anonymous games, all players have the same set of strategies, and the payoff of a player depends on the strategy played by the player and the number of other players who play each of the strategies. In particular, we show:
Theorem 1.2 (EPTAS for Nash in Anonymous Games).
There is an EPTAS for the mixed Nash equilibrium problem for normalized anonymous games with a constant number of strategies. More precisely, there exists an algorithm with the following performance guarantee: for all $\epsilon > 0$, and any normalized anonymous game $G$ of $n$ players and $k$ strategies, the algorithm runs in time $n^{O(k^3)} \cdot (1/\epsilon)^{O(k^3 \log(k/\epsilon)/\log\log(k/\epsilon))^{k-1}}$ and outputs a (well-supported) $\epsilon$-Nash equilibrium of $G$.
The previous PTAS for this problem [DP08, DP14] has running time $n^{(f(k)/\epsilon)^k}$, where $f(k) = \Omega(k^{k^2})$. Our algorithm decouples the dependence on $n$ and $1/\epsilon$ and, importantly, its running time dependence on $1/\epsilon$ is quasi-polynomial. For $k = 2$, an algorithm with runtime $\mathrm{poly}(n) \cdot (1/\epsilon)^{O(\log^2(1/\epsilon))}$ was given in [DP09], which was sharpened to $\mathrm{poly}(n) \cdot (1/\epsilon)^{O(\log(1/\epsilon)/\log\log(1/\epsilon))}$ in the recent work of the authors [DKS15a]. Hence, we obtain, for any value of $k$, the same qualitative runtime dependence on $1/\epsilon$ as in the case $k = 2$.
Similarly to [DP08, DP14], our algorithm proceeds by constructing a proper $\epsilon$-cover, in total variation distance, for the space of PMDs. A proper $\epsilon$-cover for $\mathcal{M}_{n,k}$, the set of all $(n, k)$-PMDs, is a subset $\mathcal{S}$ of $\mathcal{M}_{n,k}$ such that any distribution in $\mathcal{M}_{n,k}$ is within total variation distance $\epsilon$ from some distribution in $\mathcal{S}$. Our main technical contribution is the efficient construction of a proper $\epsilon$-cover of near-minimum size (see Theorem 1.4). We note that, as follows from Theorem 1.5, the quasi-polynomial dependence on $1/\epsilon$ and the doubly exponential dependence on $k$ in the runtime are unavoidable for any cover-based algorithm. Our cover upper and lower bounds and our Nash approximation algorithm are given in Section 4.
Statistics.
Using our Fourier-based machinery, we prove a strong “size-free” CLT bounding the total variation distance between a PMD and an appropriately discretized Gaussian with the same mean and covariance matrix. In particular, we show:
Theorem 1.3.
Let $X$ be an $(n, k)$-PMD with covariance matrix $\Sigma$. Suppose that $\Sigma$ has no eigenvectors other than $\vec{1} = (1, 1, \ldots, 1)$ with eigenvalue less than $\sigma^2$. Then, there exists a discrete Gaussian $G$ so that $d_{TV}(X, G) \le \mathrm{poly}(k)/\mathrm{poly}(\sigma)$.
As mentioned above, Valiant and Valiant [VV10, VV11] proved a CLT of this form and used it as their main technical tool to obtain tight information-theoretic lower bounds for fundamental statistical estimation tasks. This and related CLTs have since been used in proving lower bounds for other problems (see, e.g., [CST14]). The error bound in the CLT of [VV10, VV11] is of the form $O(k^{4/3} \log^{2/3}(n) / \sigma^{1/3})$, i.e., it has a dependence on the size $n$ of the underlying PMD. Our Theorem 1.3 provides a qualitative improvement over the aforementioned bound, by establishing that no dependence on $n$ is necessary. We note that [VV10] conjectured that such a qualitative improvement may be possible.
We remark that our techniques for proving Theorem 1.3 are orthogonal to those of [VV10, VV11]. While Valiant and Valiant use Stein’s method, we prove our strengthened CLT using the Fourier techniques that underlie this paper. We view Fourier analysis as the right technical tool to analyze sums of independent random variables. An additional ingredient that we require is the saddle-point method from complex analysis. We hope that our new CLT will be of broader use as an analytic tool to the TCS community. Our CLT is proved in Section 5.
Structure of PMDs.
We now provide a brief intuitive overview of our new structural results for PMDs, the relation between them, and their connection to our algorithmic results mentioned above. The unifying theme of our work is a refined analysis of the structure of PMDs, based on their Fourier transform. The Fourier transform is one of the most natural technical tools to consider for analyzing sums of independent random variables, and indeed one of the classical proofs of the (asymptotic) central limit theorem is based on Fourier methods. The basis of our results, both algorithmic and structural, is the following statement:
Informal Lemma (Sparsity of the Fourier Transform of PMDs). For any $(n, k)$-PMD $P$, and any $\epsilon > 0$, there exists a “small” set $T = T(\epsilon)$ such that the $L_1$ norm of its Fourier transform $\widehat{P}$ outside the set $T$ is at most $\epsilon$.
We will need two different versions of the above statement for our applications, and therefore we do not provide a formal statement at this stage. The precise meaning of the term “small” depends on the setting: For the continuous Fourier transform, we essentially prove that the product of the volume of the effective support of the Fourier transform times the number of points in the effective support of our distribution is small. In particular, the set $T$ is a scaled version of the dual ellipsoid to the ellipsoid defined by the covariance matrix of $P$. Hence, roughly speaking, $\widehat{P}$ has an effective support that is the dual of the effective support of $P$. (See Lemma 4.2 in Section 4 for the precise statement.)
In the case of the Discrete Fourier Transform (DFT), we show that there exists a discrete set with small cardinality, such that the $L_1$ norm of the DFT outside this set is small. At a high level, to prove this statement, we need the appropriate definition of the (multidimensional) DFT, which turns out to be nontrivial, and is crucial for the computational efficiency of our learning algorithm. More specifically, we choose the period of the DFT to reflect the shape of the effective support of our PMD. (See Proposition 3.8 in Section 3 for the statement.)
With Fourier sparsity as our starting point, we obtain new structural results of independent interest for PMDs. The first is a “robust” moment-matching lemma, which we now informally state:
Informal Lemma (Parameter Moment Closeness Implies Closeness in Distribution). For any pair of $(n, k)$-PMDs $P, Q$, if the “low-degree” parameter moment profiles of $P$ and $Q$ are close, then $P$ and $Q$ are close in total variation distance.
See Definition 2.2 for the definition of the parameter moments of a PMD. The formal statement of the aforementioned lemma appears as Lemma 4.6 in Section 4.1. Our robust moment-matching lemma is the basis for our proper cover algorithm and our EPTAS for Nash equilibria in anonymous games. Our constructive cover upper bound is the following:
Theorem 1.4 (Optimal Covers for PMDs).
For all $n, k \in \mathbb{Z}_+$ and $\epsilon > 0$, there exists an $\epsilon$-cover $\mathcal{S}_\epsilon \subseteq \mathcal{M}_{n,k}$, under the total variation distance, of the set $\mathcal{M}_{n,k}$ of $(n, k)$-PMDs of size $|\mathcal{S}_\epsilon| \le n^{O(k^2)} \cdot (1/\epsilon)^{O(k \log(k/\epsilon)/\log\log(k/\epsilon))^{k-1}}$. In addition, there exists an algorithm to construct the set $\mathcal{S}_\epsilon$ that runs in time polynomial in the cover size.
A sparse proper cover quantifies the “size” of the space of PMDs and provides useful structural information that can be exploited in a variety of applications. In addition to Nash equilibria in anonymous games, our efficient proper cover construction provides a smaller search space for approximately solving essentially any optimization problem over PMDs. As another corollary of our cover construction, we obtain the first EPTAS for computing threat points in anonymous games.
Perhaps surprisingly, we also prove that our above upper bound is essentially tight:
Theorem 1.5 (Cover Lower Bound for PMDs).
For any $k \ge 2$, $\epsilon > 0$ sufficiently small as a function of $k$, and $n$ sufficiently large as a function of $k$ and $\epsilon$, any $\epsilon$-cover for $\mathcal{M}_{n,k}$ has size at least $n^{\Omega(k)} \cdot (1/\epsilon)^{\Omega(k \log(k/\epsilon)/\log\log(k/\epsilon))^{k-1}}$.
We remark that, in previous work [DKS15a], the authors proved a tight cover size bound of $n \cdot (1/\epsilon)^{\Theta(k \log(1/\epsilon))}$ for $k$-SIIRVs, i.e., sums of $n$ independent scalar random variables each supported on $\{0, 1, \ldots, k-1\}$. While a cover size lower bound for $k$-SIIRVs directly implies the same lower bound for $(n, k)$-PMDs, the opposite is not true. Indeed, Theorems 1.4 and 1.5 show that covers for PMDs are inherently larger, requiring a doubly exponential dependence on $k$.
1.3 Our Approach and Techniques
At a high level, the Fourier techniques of this paper can be viewed as a highly nontrivial generalization of the techniques in our recent paper [DKS15a] on sums of independent scalar random variables. We would like to emphasize that a number of new conceptual and technical ideas are required to overcome the various obstacles arising in the multidimensional setting.
We start with an intuitive explanation of two key ideas that form the basis of our approach.
Sparsity of the Fourier Transform of PMDs.
Since the Fourier Transform (FT) of a PMD is the product of the FTs of its component CRVs, its magnitude is the product of $n$ terms, each bounded from above by $1$. Note that each term in the product is strictly less than $1$ except in a small region, unless the component CRV is trivial (i.e., essentially deterministic). Roughly speaking, to establish the sparsity of the FT of PMDs, we proceed as follows: We bound from above the magnitude of the FT by the FT of a Gaussian with the same covariance matrix as our PMD. (See, for example, Lemma 3.10.) This gives us tail bounds for the FT of the PMD in terms of the FT of this Gaussian, which, when combined with the concentration of the PMD itself, yields the desired property.
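As a quick numerical illustration of this product structure (our own sketch, with hypothetical names, not the paper's lemmas), the magnitude of the characteristic function of a PMD is a product of $n$ factors, each of magnitude at most $1$, and it decays geometrically in $n$ away from a small set of frequencies:

```python
import cmath

def pmd_cf(p, xi):
    """Characteristic function E[e^{2 pi i xi . X}] of the PMD with parameter
    matrix p, evaluated at xi: a product over the independent CRVs."""
    val = 1.0 + 0.0j
    for row in p:
        # FT of one CRV: sum_j p_ij * e^{2 pi i xi_j}, since the CRV equals e_j w.p. p_ij
        val *= sum(pij * cmath.exp(2j * cmath.pi * x) for pij, x in zip(row, xi))
    return val

p = [[0.4, 0.6]] * 50                    # a (50,2)-PMD
near = abs(pmd_cf(p, (0.0, 0.01)))       # frequency near the origin
far = abs(pmd_cf(p, (0.0, 0.25)))        # frequency far from the peaks
# Each factor has magnitude <= 1, so |FT| <= 1 everywhere; away from a small
# set each factor is strictly < 1 and the 50-fold product is tiny.
```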
Approximation of the logarithm of the Fourier Transform.
A key ingredient in our proofs is the approximation of the logarithm of the Fourier Transform (log FT) of PMDs by low-degree polynomials. Observe that the log FT is a sum of $n$ terms, which is convenient for the analysis. We focus on approximating the log FT by a low-degree Taylor polynomial within the effective support of the FT. (Note that outside the effective support the log FT can be infinite.) Morally speaking, the log FT is smooth, i.e., it is approximated by the first several terms of its Taylor series. Formally however, this statement is in general not true, and requires various technical conditions, depending on the setting. One important point to note is that the sparsity of the FT controls the domain in which this approximation will need to hold, and thus helps us bound the Taylor error. We will need to ensure that the sizes of the Taylor coefficients are not too large given the location of the effective support, which turns out to be a nontrivial technical hurdle. To ensure this, we need to be very careful about how we perform this Taylor expansion. In particular, the correct choice of the point that we Taylor expand around will be critical for our applications. We elaborate on these difficulties in the relevant technical sections. Finally, we remark that the degree of polynomial approximation we require depends on the setting: In our cover upper bounds, we will require (nearly) logarithmic degree, while for our CLT a degree-$2$ approximation suffices.
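For intuition, here is a small numerical sketch (ours, not the paper's) of this phenomenon in one dimension: for a Poisson binomial distribution, the log FT is a sum of $n$ terms, and near the origin it is well approximated by its degree-$2$ Taylor polynomial, whose coefficients are the low-order cumulants (mean and variance):

```python
import cmath

def log_cf(p, t):
    """log E[e^{itX}] for a Poisson binomial X = sum of Bernoulli(p_i):
    a sum of n terms, one per component variable."""
    return sum(cmath.log(1 - pi + pi * cmath.exp(1j * t)) for pi in p)

p = [0.3] * 40
mu = sum(p)                               # first cumulant (mean)
var = sum(pi * (1 - pi) for pi in p)      # second cumulant (variance)
t = 0.05
exact = log_cf(p, t)
taylor2 = 1j * mu * t - 0.5 * var * t * t  # degree-2 Taylor polynomial at 0
err = abs(exact - taylor2)
# The error is governed by the omitted third-order term, roughly O(n * t^3).
```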
We are now ready to give an overview of the ideas in the proofs of each of our results.
Efficient Learning Algorithm.
The high-level structure of our learning algorithm relies on the sparsity of the Fourier transform, and is similar to the algorithm in our previous work [DKS15a] for learning sums of independent integer random variables. More specifically, our learning algorithm estimates the effective support of the DFT, and then computes the empirical DFT in this effective support. This high-level description would perhaps suffice, if we were only interested in bounding the sample complexity. In order to obtain a computationally efficient algorithm, it is crucial to use the appropriate definition of the DFT and its inverse.
In more detail, our algorithm works as follows: It starts by drawing samples to estimate the mean vector and covariance matrix of our PMD to good accuracy. Using these estimates, we can bound the effective support of our distribution in an appropriate ellipsoid. In particular, we show that our PMD lies (whp) in a fundamental domain of an appropriate integer lattice $L = M \mathbb{Z}^k$, where $M$ is an integer matrix whose columns are appropriate functions of the eigenvalues and eigenvectors of the (sample) covariance matrix. This property allows us to learn our unknown PMD $X$ by learning the random variable $X \bmod L$. To do this, we learn its Discrete Fourier transform. Let $L^*$ be the dual lattice to $L$ (i.e., the set of points $\xi$ so that $\xi \cdot x \in \mathbb{Z}$ for all $x \in L$). Importantly, we define the DFT $\widehat{P}$ of our PMD $P$ on the dual lattice $L^*$; that is, for $\xi \in L^*$, $\widehat{P}(\xi) = \mathbb{E}[e(\xi \cdot X)]$, with $e(x) := e^{2 \pi i x}$. A useful property of this definition is the following: the probability that $X \bmod L$ attains a given value $x$ is given by the inverse DFT, defined on the lattice $L$, namely $\Pr[X \equiv x \pmod{L}] = \frac{1}{|\det(M)|} \sum_{\xi \in L^*/\mathbb{Z}^k} \widehat{P}(\xi) e(-\xi \cdot x)$.
The main structural property needed for the analysis of our algorithm is that there exists an explicit set $T$ with integer coordinates and small cardinality that contains all but a small fraction of the mass of $\widehat{P}$. Given this property, our algorithm draws an additional set of samples of size $\widetilde{O}_k(1/\epsilon^2)$ from the PMD, and computes the empirical DFT (modulo $L$) on its effective support $T$. Using these ingredients, we are able to show that the inverse of the empirical DFT defines a pseudo-distribution that is close to our unknown PMD in total variation distance.
Observe that the support of the inverse DFT can be large, namely $|\det(M)|$. Our algorithm does not explicitly evaluate the inverse DFT at all these points, but outputs a succinct description of its hypothesis $H$, via its DFT $\widehat{H}$. We emphasize that this succinct description suffices to efficiently obtain both an approximate evaluation oracle and an approximate sampler for our target PMD $P$. Indeed, it is clear that computing the inverse DFT at a single point can be done in time $O(|T|)$, and gives an approximate oracle for the probability mass function of $P$. By using additional algorithmic ingredients, we show how to use an oracle for the DFT $\widehat{H}$ as a black-box to obtain a computationally efficient approximate sampler for $P$.
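The following toy Python sketch (a drastic simplification in one dimension, with hypothetical names; the actual algorithm uses the lattice-based multidimensional DFT described above) conveys the core idea: estimate the empirical DFT of a Poisson binomial distribution at a set of retained frequencies, implicitly set all other Fourier coefficients to zero, and invert:

```python
import cmath
import random

def empirical_dft_learn(samples, period, freqs):
    """Learn a distribution on Z_period from samples by estimating its DFT
    only at the frequencies in `freqs`, then inverting.  Frequencies outside
    `freqs` are implicitly set to zero (the Fourier-sparsity idea)."""
    m = len(samples)
    dft = {xi: sum(cmath.exp(2j * cmath.pi * xi * s / period) for s in samples) / m
           for xi in freqs}
    # Inverse DFT; it may define a pseudo-distribution (tiny negative values).
    return [sum(dft[xi] * cmath.exp(-2j * cmath.pi * xi * x / period)
                for xi in freqs).real / period for x in range(period)]

random.seed(1)
p = [0.2, 0.5, 0.5, 0.8, 0.3]                  # Bernoulli parameters, n = 5
samples = [sum(random.random() < pi for pi in p) for _ in range(20000)]
period = 6                                      # support {0, ..., 5}
hyp = empirical_dft_learn(samples, period, range(period))  # keep all 6 frequencies here

# True pmf by dynamic programming over the Bernoulli summands, for comparison.
pmf = [1.0]
for pi in p:
    pmf = [(pmf[x] if x < len(pmf) else 0.0) * (1 - pi)
           + (pmf[x - 1] * pi if x > 0 else 0.0)
           for x in range(len(pmf) + 1)]
tv = 0.5 * sum(abs(a - b) for a, b in zip(hyp, pmf))
# With all frequencies retained the inverse is the empirical pmf, so tv is small.
```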
Our learning algorithm and its analysis are given in Section 3.
Constructive Proper Cover and Anonymous Games.
The correctness of our learning algorithm easily implies (see Section 3.3) an algorithm to construct a non-proper cover for PMDs of near-minimal size. While this upper bound is close to being best possible (see Section 4.5), it does not suffice for our algorithmic applications in anonymous games. For these applications, it is crucial to obtain an efficient algorithm that constructs a proper cover, and in fact one that works in a certain stylized way.
To construct a proper cover, we rely on the sparsity of the continuous Fourier Transform of PMDs. Namely, we show that for any PMD $P$ with effective support $S$, there exists an appropriately defined set $T$ such that the contribution of the complement of $T$ to the $L_1$ norm of $\widehat{P}$ is at most $\epsilon$. By using this property, we show that any two PMDs, with approximately the same variance in each direction, whose continuous Fourier transforms are close to each other on the set $T$, are close in total variation distance. We build on this lemma to prove our robust moment-matching result. Roughly speaking, we show that two PMDs, with approximately the same variance in each direction, that are “close” to each other in their low-degree parameter moments are also close in total variation distance. We emphasize that the meaning of the term “close” here is quite subtle: we need to appropriately partition the component CRVs into groups, and approximate the parameter moments of the PMDs formed by each group within a different degree and different accuracy for each degree. (See Lemma 4.6 in Section 4.1.)
Our algorithm to construct a proper cover, and our EPTAS for Nash equilibria in anonymous games, proceed by a careful dynamic programming approach that is based on our aforementioned robust moment-matching result.
Finally, we note that combining our moment-matching lemma with a recent result in algebraic geometry gives us the following structural result of independent interest: Every PMD is $\epsilon$-close to another PMD that is a sum of at most $O_{k,\epsilon}(1)$ distinct CRVs, i.e., a number of distinct CRVs independent of $n$.
The aforementioned algorithmic and structural results are given in Section 4.
Cover Size Lower Bound.
As mentioned above, a crucial ingredient of our cover upper bound is a robust moment-matching lemma, which translates closeness between the low-degree parameter moments of two PMDs to closeness between their Fourier Transforms, and in turn to closeness in total variation distance. To prove our cover lower bound, we follow the opposite direction. We construct an explicit set of PMDs with the property that any pair of distinct PMDs in our set have a nontrivial difference in (at least) one of their low-degree parameter moments. We then show that a difference in one of the parameter moments implies that there exists a point where the probability generating functions have a nontrivial difference. Notably, our proof for this step is non-constructive, making essential use of Cauchy’s integral formula. Finally, we can easily translate a pointwise difference between the probability generating functions to a nontrivial total variation distance error. We present our cover lower bound construction in Section 4.5.
Central Limit Theorem for PMDs.
The basic idea of the proof of our CLT will be to compare the Fourier transform of our PMD $X$ to that of the discrete Gaussian $G$ with the same mean and covariance. By taking the inverse Fourier transform, we will be able to conclude that these distributions are pointwise close. A careful analysis using a Taylor approximation, and the fact that both $X$ and $G$ have small effective support, gives us a total variation distance error independent of the size $n$. Alas, this approach results in an error dependence that is exponential in $k$. To obtain an error bound that scales polynomially with $k$, we require stronger bounds on the difference between the Fourier transforms of $X$ and $G$ at points away from the mean. Intuitively, we need to take advantage of cancellation in the inverse Fourier transform integrals. To achieve this, we will use the saddle-point method from complex analysis. The full proof of our CLT is given in Section 5.
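As a sanity check of the flavor of this statement (our own one-dimensional experiment, with a particular choice of discretization that may differ from the paper's formal definition), one can compare a Poisson binomial distribution to a renormalized discretized Gaussian with matching mean and variance, and observe the total variation error shrink as the variance grows:

```python
import math

def pbd_pmf(p):
    """Exact pmf of a Poisson binomial via dynamic programming."""
    pmf = [1.0]
    for pi in p:
        pmf = [(pmf[x] if x < len(pmf) else 0.0) * (1 - pi)
               + (pmf[x - 1] * pi if x > 0 else 0.0)
               for x in range(len(pmf) + 1)]
    return pmf

def disc_gauss_pmf(mu, var, support):
    """Discretized Gaussian: the continuous density evaluated at the
    integers and renormalized (one common way to discretize)."""
    w = [math.exp(-(x - mu) ** 2 / (2 * var)) for x in support]
    z = sum(w)
    return [wi / z for wi in w]

def tv_to_gaussian(n, q):
    pmf = pbd_pmf([q] * n)
    mu, var = n * q, n * q * (1 - q)
    g = disc_gauss_pmf(mu, var, range(n + 1))
    return 0.5 * sum(abs(a - b) for a, b in zip(pmf, g))

small_n, big_n = tv_to_gaussian(20, 0.3), tv_to_gaussian(500, 0.3)
# The TV error decays as the variance n*q*(1-q) grows (here, with n).
```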
1.4 Related and Prior Work
There is extensive literature on distribution learning and computation of approximate Nash equilibria in various classes of games. We have already mentioned the most relevant references in the introduction.
Daskalakis et al. [DKT15] studied the structure and learnability of PMDs. They obtained a non-proper $\epsilon$-cover whose size is quasi-polynomial in $1/\epsilon$ for fixed $k$, and an information-theoretic upper bound of $\widetilde{O}_k(1/\epsilon^2)$ on the learning sample complexity. The dependence on $1/\epsilon$ in their cover size is also quasi-polynomial, but is suboptimal, as follows from our upper and lower bounds. Importantly, the [DKT15] construction yields a non-proper cover. As previously mentioned, a proper cover construction is necessary for our algorithmic applications. We note that the learning algorithm of [DKT15] relies on enumeration over a cover, hence runs in time quasi-polynomial in $1/\epsilon$, even for $k = 2$. The techniques of [DKT15] are orthogonal to ours. Their cover upper bound is obtained by a clever black-box application of the CLT of [VV10], combined with a non-robust moment-matching lemma that they deduce from a result of Roos [Roo02]. We remind the reader that our Fourier techniques strengthen both these technical tools: Theorem 1.3 strengthens the CLT of [VV10], and we prove a robust and quantitatively essentially optimal moment-matching lemma.
In recent work [DKS15a], the authors used Fourier analytic techniques to study the structure and learnability of sums of independent integer random variables (SIIRVs). The techniques of this paper can be viewed as a (highly nontrivial) generalization of those in [DKS15a]. We also note that the upper bounds we obtain in this paper for learning and covering PMDs do not subsume the ones in [DKS15a]. In fact, our cover upper and lower bounds in this work show that optimal covers for PMDs are inherently larger than optimal covers for SIIRVs. Moreover, the sample complexity of our SIIRV learning algorithm [DKS15a] is significantly better than that of our PMD learning algorithm in this paper.
1.5 Concurrent and Independent Work
Concurrently and independently to our work, [DDKT16] obtained qualitatively similar results using different techniques. We now provide a statement of the [DDKT16] results in tandem with a comparison to our work.
[DDKT16] give a learning algorithm for PMDs with sample complexity $\widetilde{O}_k(1/\epsilon^2)$ and runtime $(1/\epsilon)^{O(k)}$. The [DDKT16] algorithm uses the continuous Fourier transform, exploiting its sparsity property, plus additional structural and algorithmic ingredients. The aforementioned runtime is not polynomial in the sample size, unless $k$ is fixed. In contrast, our learning algorithm runs in time polynomial in the sample size, and, for any fixed $k$, in nearly-linear time. The [DDKT16] learning algorithm outputs an explicit hypothesis, which can be easily sampled. On the other hand, our algorithm outputs a succinct description of its hypothesis (via its DFT), and we show how to efficiently sample from it.
[DDKT16] also prove a size-free CLT, analogous to our Theorem 1.3, with error polynomial in $k$ and $1/\sigma$. Their CLT is obtained by bootstrapping the CLT of [VV10, VV11] using techniques from [DKT15]. As previously mentioned, our proof is technically orthogonal to [VV10, VV11, DDKT16], making use of the sparsity of the Fourier transform combined with tools from complex analysis. It is worth noting that our CLT also achieves a near-optimal dependence in the error as a function of $\sigma$ (up to logarithmic factors).
Finally, [DDKT16] prove analogues of Theorems 1.2, 1.4, and 1.5 with qualitatively similar bounds to ours. We note that [DDKT16] improve the dependence on $n$ in the cover size to an optimal $n^{O(k)}$, while the dependence on $1/\epsilon$ in their cover upper bound is the same as in [DKT15]. The cover size lower bound of [DDKT16] is qualitatively of the right form, though slightly suboptimal as a function of $1/\epsilon$. The algorithms to construct proper covers, and the corresponding EPTAS for anonymous games, in both works have running time roughly comparable to the PMD cover size.
1.6 Organization
2 Preliminaries
In this section, we record the necessary definitions and terminology that will be used throughout the technical sections of this paper.
Notation.
For $n \in \mathbb{Z}_+$, we will denote $[n] := \{1, \ldots, n\}$. For a vector $v \in \mathbb{R}^k$ and $p \ge 1$, we will denote $\|v\|_p := \left(\sum_{i=1}^k |v_i|^p\right)^{1/p}$. We will use the boldface notation $\mathbf{0}$ to denote the zero vector or matrix in the appropriate dimension.
Poisson Multinomial Distributions.
We start by defining our basic object of study:
Definition 2.1 (PMD).
For $k \in \mathbb{Z}_+$, let $e_j$, $j \in [k]$, be the standard unit vector along dimension $j$ in $\mathbb{R}^k$. A $k$-Categorical Random Variable ($k$-CRV) is a vector random variable supported on the set $\{e_1, \ldots, e_k\}$. A Poisson Multinomial Distribution of order $(n, k)$, or $(n, k)$-PMD, is any vector random variable of the form $X = \sum_{i=1}^n X_i$, where the $X_i$'s are independent $k$-CRVs. We will denote by $\mathcal{M}_{n,k}$ the set of all $(n, k)$-PMDs.
We will require the following notion of a parameter moment for a PMD:
Definition 2.2 (Parameter moments of a PMD).
Let $X = \sum_{i=1}^n X_i$ be an $(n, k)$-PMD such that, for $i \in [n]$ and $j \in [k]$, we denote $p_{i,j} = \Pr[X_i = e_j]$. For $m = (m_1, \ldots, m_k) \in \mathbb{Z}_+^k$, we define the $m$th parameter moment of $X$ to be $M_m(X) := \sum_{i=1}^n \prod_{j=1}^k p_{i,j}^{m_j}$. We will refer to $\|m\|_1 = \sum_{j=1}^k m_j$ as the degree of the parameter moment $M_m(X)$.
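Assuming the definition above, the parameter moments are straightforward to compute from the parameter matrix; the following short sketch (ours, with hypothetical names) also illustrates that the degree-$1$ moments recover the coordinates of the mean vector:

```python
import math

def parameter_moment(p, m):
    """The m-th parameter moment sum_i prod_j p[i][j]**m[j] of a PMD with
    n-by-k parameter matrix p; the degree of the moment is sum(m).
    (Python evaluates 0**0 as 1, so zero entries of m are ignored, as intended.)"""
    return sum(math.prod(pij ** mj for pij, mj in zip(row, m)) for row in p)

p = [[0.5, 0.5], [0.9, 0.1]]
# Degree-1 moments recover the coordinates of the mean vector of the PMD:
m_e1 = parameter_moment(p, (1, 0))   # 0.5 + 0.9 = 1.4
m_e2 = parameter_moment(p, (0, 1))   # 0.5 + 0.1 = 0.6
```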
(Pseudo)Distributions and Total Variation Distance.
A function $P : A \to \mathbb{R}$, over a finite set $A$, is called a distribution if $P(a) \ge 0$ for all $a \in A$, and $\sum_{a \in A} P(a) = 1$. The function $P$ is called a pseudo-distribution if $\sum_{a \in A} P(a) = 1$ (but $P$ may take negative values). For $S \subseteq A$, we sometimes write $P(S)$ to denote $\sum_{a \in S} P(a)$. A distribution $P$ supported on a finite domain $A$ can be viewed as the probability mass function of a random variable $X$, i.e., $P(a) = \Pr[X = a]$.
The total variation distance between two pseudo-distributions $P$ and $Q$ supported on a finite domain $A$ is $d_{TV}(P, Q) := \max_{S \subseteq A} |P(S) - Q(S)| = (1/2) \cdot \|P - Q\|_1$. If $X$ and $Y$ are two random variables ranging over a finite set, their total variation distance $d_{TV}(X, Y)$ is defined as the total variation distance between their distributions. For convenience, we will often blur the distinction between a random variable and its distribution.
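For concreteness, a direct implementation of this definition for finitely supported (pseudo-)distributions (our own sketch, not from the paper):

```python
def total_variation(P, Q):
    """d_TV(P, Q) = max_S |P(S) - Q(S)| = (1/2) * L1 distance, for two
    (pseudo-)distributions given as dicts over a finite domain."""
    keys = set(P) | set(Q)
    return 0.5 * sum(abs(P.get(x, 0.0) - Q.get(x, 0.0)) for x in keys)

P = {'a': 0.5, 'b': 0.5}
Q = {'a': 0.8, 'c': 0.2}
d = total_variation(P, Q)   # 0.5 * (0.3 + 0.5 + 0.2) = 0.5
```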
Covers.
Let $(\mathcal{X}, d)$ be a metric space. Given $\delta > 0$, a subset $\mathcal{Y} \subseteq \mathcal{X}$ is said to be a proper $\delta$-cover of $\mathcal{X}$ with respect to the metric $d$ if for every $x \in \mathcal{X}$ there exists some $y \in \mathcal{Y}$ such that $d(x, y) \le \delta$. (If $\mathcal{Y}$ is not necessarily a subset of $\mathcal{X}$, then we obtain a non-proper cover.) There may exist many $\delta$-covers of $\mathcal{X}$, but one is typically interested in one with the minimum cardinality. The $\delta$-covering number of $(\mathcal{X}, d)$ is the minimum cardinality of any $\delta$-cover of $\mathcal{X}$. Intuitively, the covering number of a metric space captures the “size” of the space. In this work, we will be interested in efficiently constructing sparse covers for PMDs under the total variation distance metric.
Distribution Learning.
We now define the notion of distribution learning we use in this paper. Note that an explicit description of a discrete distribution via its probability mass function scales linearly with the support size. Since we are interested in the computational complexity of distribution learning, our algorithms will need to use a succinct description of their output hypothesis. A simple succinct representation of a discrete distribution is via an evaluation oracle for the probability mass function:
Definition 2.3 (Evaluation Oracle).
Let $P$ be a distribution over a finite domain $A \subseteq \mathbb{Z}^k$. An evaluation oracle for $P$ is a polynomial size circuit $C$ with $O(k \log |A|)$ input bits such that, for each point $x \in A$, the output of the circuit on input $x$ equals the binary representation of the probability $P(x)$. For $\epsilon > 0$, an $\epsilon$-evaluation oracle for $P$ is an evaluation oracle for some pseudo-distribution $\widetilde{P}$ which has $d_{TV}(\widetilde{P}, P) \le \epsilon$.
One of the most general ways to succinctly specify a distribution is to give the code of an efficient algorithm that takes “pure” randomness and transforms it into a sample from the distribution. This is the standard notion of a sampler:
Definition 2.4 (Sampler).
Let $P$ be a distribution over a finite domain $A \subseteq \mathbb{Z}^k$. An $\epsilon$-sampler for $P$ is a circuit $C$ with $m$ input bits and $O(k \log |A|)$ output bits which is such that, when the input is uniformly random in $\{0,1\}^m$, the output is distributed according to some distribution $\widetilde{P}$ which has $d_{TV}(\widetilde{P}, P) \le \epsilon$.
We can now give a formal definition of distribution learning:
Definition 2.5 (Distribution Learning).
Let $\mathcal{D}$ be a family of distributions. A randomized algorithm $A$ is a distribution learning algorithm for the class $\mathcal{D}$ if, for any $\epsilon > 0$ and any $P \in \mathcal{D}$, on input $\epsilon$ and sample access to $P$, with probability at least $2/3$, algorithm $A$ outputs an $\epsilon$-sampler (or an $\epsilon$-evaluation oracle) for $P$.
Remark 2.6.
We emphasize that our learning algorithm in Section 3 outputs both an $\epsilon$-sampler and an $\epsilon$-evaluation oracle for the target distribution.
Anonymous Games and Nash Equilibria.
An anonymous game is a triple $(n, k, \{u^i_j\}_{i \in [n], j \in [k]})$, where $[n] = \{1, \ldots, n\}$, $n \ge 2$, is the set of players, $[k] = \{1, \ldots, k\}$, $k \ge 2$, a common set of strategies available to all players, and $u^i_j$ the payoff function of player $i$ when she plays strategy $j$. This function maps the set $\Pi^{n-1}_k \stackrel{\mathrm{def}}{=} \{(x_1, \ldots, x_k) : x_\ell \in \mathbb{Z}_{\ge 0}, \sum_{\ell=1}^k x_\ell = n-1\}$ of partitions of the other $n - 1$ players among the $k$ strategies to the interval $[0, 1]$. That is, it is assumed that the payoff of each player depends on her own strategy and only the number of other players choosing each of the $k$ strategies.
We denote by $\Delta_k$ the convex hull of the set $\{e_1, \ldots, e_k\}$, i.e., $\Delta_k = \{x \in \mathbb{R}^k : x_j \ge 0 \text{ for all } j, \ \sum_{j=1}^k x_j = 1\}$. A mixed strategy is an element of $\Delta_k$. A mixed strategy profile is a mapping $\sigma$ from $[n]$ to $\Delta_k$. We denote by $\sigma_i$ the mixed strategy of player $i$ in the profile $\sigma$ and by $\sigma_{-i}$ the collection of all mixed strategies but player $i$'s in $\sigma$. For $\epsilon \ge 0$, a mixed strategy profile $\sigma$ is a (well-supported) $\epsilon$-Nash equilibrium iff for all $i \in [n]$ and $j, j' \in [k]$ we have: if $\mathbb{E}_{x \sim \sigma_{-i}}[u^i_j(x)] > \mathbb{E}_{x \sim \sigma_{-i}}[u^i_{j'}(x)] + \epsilon$, then $\sigma_i(j') = 0$. Note that, given a mixed strategy profile $\sigma$, we can compute a player's expected payoff in time polynomial in the description of the game by straightforward dynamic programming.
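The dynamic program mentioned above builds the distribution over partitions $\Pi^{n-1}_k$ by folding in one opponent's mixed strategy at a time, and then averages the payoff function against it. A minimal sketch under our own naming (`partition_distribution`, `expected_payoff`; the payoff function is given as a table over partitions):

```python
def partition_distribution(strategies):
    """Distribution over partitions (x_1, ..., x_k) of the given players
    among k strategies: a DP that adds one player's k-CRV at a time."""
    k = len(strategies[0])
    dist = {tuple([0] * k): 1.0}
    for sigma in strategies:               # fold in one player
        new = {}
        for x, p in dist.items():
            for j, q in enumerate(sigma):  # this player picks strategy j
                if q > 0:
                    y = list(x); y[j] += 1
                    new[tuple(y)] = new.get(tuple(y), 0.0) + p * q
        dist = new
    return dist

def expected_payoff(u, others):
    """E_{x ~ sigma_{-i}}[u(x)] for a payoff table u over partitions."""
    return sum(p * u[x] for x, p in partition_distribution(others).items())

# Two opponents, each mixing uniformly over k = 2 strategies; the payoff
# is the number of opponents playing strategy 1.
others = [[0.5, 0.5], [0.5, 0.5]]
u = {(2, 0): 2.0, (1, 1): 1.0, (0, 2): 0.0}
assert abs(expected_payoff(u, others) - 1.0) < 1e-12
```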
Note that the mixed strategy $\sigma_i$ of player $i$ defines a $k$-CRV $X_i$, i.e., a random vector supported in the set $\{e_1, \ldots, e_k\}$, such that $\Pr[X_i = e_j] = \sigma_i(j)$, for all $j \in [k]$. Hence, if $\sigma$ is a mixed strategy profile, the expected payoff of player $i$ for using pure strategy $j \in [k]$ is $\mathbb{E}_{X \sim \sum_{i' \neq i} X_{i'}}\left[u^i_j(X)\right]$.
Multidimensional Fourier Transform.
Throughout this paper, we will make essential use of the (continuous and the discrete) multidimensional Fourier transform. For $x \in \mathbb{R}$, we will denote $e(x) \stackrel{\mathrm{def}}{=} \exp(2\pi i x)$. The (continuous) Fourier Transform (FT) of a function $F : \mathbb{Z}^k \to \mathbb{C}$ is the function $\widehat{F} : [0, 1]^k \to \mathbb{C}$ defined as $\widehat{F}(\xi) = \sum_{x \in \mathbb{Z}^k} e(\xi \cdot x) F(x)$. For the case that $F$ is a probability mass function, we can equivalently write $\widehat{F}(\xi) = \mathbb{E}_{X \sim F}[e(\xi \cdot X)]$.
For computational purposes, we will also need the Discrete Fourier Transform (DFT) and its inverse, whose definition is somewhat more subtle. Let $M$ be a $k \times k$ integer matrix. We consider the integer lattice $L = L(M) \stackrel{\mathrm{def}}{=} M \cdot \mathbb{Z}^k$, and its dual lattice $L^* = L^*(M) \stackrel{\mathrm{def}}{=} \{\xi \in \mathbb{R}^k : \xi \cdot x \in \mathbb{Z} \text{ for all } x \in L\}$. Note that $L^* = (M^T)^{-1} \cdot \mathbb{Z}^k$ and that $L^*$ is not necessarily integral. The quotient $\mathbb{Z}^k / L$ is the set of equivalence classes of points in $\mathbb{Z}^k$ such that two points $x, y$ are in the same equivalence class iff $x - y \in L$. Similarly, the quotient $L^* / \mathbb{Z}^k$ is the set of equivalence classes of points in $L^*$ such that any two points $\xi, \zeta$ are in the same equivalence class iff $\xi - \zeta \in \mathbb{Z}^k$.
The Discrete Fourier Transform (DFT) modulo $M$, $\widehat{F}_M$, of a function $F : \mathbb{Z}^k \to \mathbb{C}$ supported on a fundamental domain $A$ of $L(M)$ is the function $\widehat{F}_M : L^* / \mathbb{Z}^k \to \mathbb{C}$ defined as $\widehat{F}_M(\xi) = \sum_{x \in A} e(\xi \cdot x) F(x)$. (We will remove the subscript $M$ when it is clear from the context.) Similarly, for the case that $F$ is a probability mass function, we can equivalently write $\widehat{F}_M(\xi) = \mathbb{E}_{X \sim F}[e(\xi \cdot X)]$. The inverse DFT of a function $G : L^* / \mathbb{Z}^k \to \mathbb{C}$ is the function $G^{-1}$ defined on a fundamental domain $A$ of $L(M)$ as follows: $G^{-1}(x) = \frac{1}{|\det(M)|} \sum_{\xi \in L^*/\mathbb{Z}^k} G(\xi) e(-\xi \cdot x)$. Note that these operations are inverse of each other, namely, for any function $F$ supported on a fundamental domain of $L(M)$, the inverse DFT of $\widehat{F}_M$ is identified with $F$.
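As a sanity check of the inversion claim, consider the simplest case $k = 1$ and $M = (m)$, so that $L = m\mathbb{Z}$, the quotient $\mathbb{Z}/L$ has representatives $\{0, \ldots, m-1\}$, and $L^*/\mathbb{Z}$ has representatives $\{0, 1/m, \ldots, (m-1)/m\}$ (using the convention $e(x) = \exp(2\pi i x)$ as above):

```python
import numpy as np

m = 5
rng = np.random.default_rng(0)
F = rng.random(m)
F /= F.sum()                              # a distribution on {0, ..., m-1}

xis = np.arange(m) / m                    # representatives of L*/Z
F_hat = np.array([np.sum(F * np.exp(2j * np.pi * xi * np.arange(m)))
                  for xi in xis])         # DFT modulo M: sum_x e(xi*x) F(x)
F_rec = np.array([np.sum(F_hat * np.exp(-2j * np.pi * xis * x)) / m
                  for x in range(m)])     # inverse DFT; 1/|det M| = 1/m
assert np.allclose(F_rec.real, F) and np.allclose(F_rec.imag, 0)
```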
Let $X = \sum_{i=1}^n X_i$ be an $(n,k)$-PMD such that, for $1 \le i \le n$ and $1 \le j \le k$, we denote $p_{i,j} = \Pr[X_i = e_j]$, where $\sum_{j=1}^k p_{i,j} = 1$. To avoid clutter in the notation, we will sometimes use the symbol $X$ to denote the corresponding probability mass function. With this convention, we can write that $\widehat{X}(\xi) = \prod_{i=1}^n \widehat{X_i}(\xi) = \prod_{i=1}^n \left(\sum_{j=1}^k p_{i,j} \, e(\xi_j)\right)$.
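Because the FT of a sum of independent random vectors is the product of their FTs, this product formula lets one evaluate $\widehat{X}(\xi)$ in $O(nk)$ time from the $p_{i,j}$'s, with no need to enumerate the support. A quick numerical cross-check against direct enumeration for $n = k = 2$ (names ours; convention $e(x) = \exp(2\pi i x)$):

```python
import numpy as np

def pmd_ft(P, xi):
    """FT of the PMD with parameter matrix P at frequency xi, via the
    product formula prod_i sum_j P[i, j] * e(xi_j)."""
    e = np.exp(2j * np.pi * np.asarray(xi))
    return complex(np.prod(P @ e))

# Brute-force E[e(xi . X)] by enumerating the (j1, j2) choices of the CRVs.
P = np.array([[0.3, 0.7], [0.6, 0.4]])
xi = np.array([0.25, 0.1])
direct = 0.0
for j1 in range(2):
    for j2 in range(2):
        x = np.eye(2)[j1] + np.eye(2)[j2]          # X = e_{j1} + e_{j2}
        direct += P[0, j1] * P[1, j2] * np.exp(2j * np.pi * (xi @ x))
assert abs(pmd_ft(P, xi) - direct) < 1e-12
```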
Basics from Linear Algebra.
We remind the reader of a few basic definitions from linear algebra that we will repeatedly use throughout this paper. The Frobenius norm of $A \in \mathbb{R}^{m \times n}$ is $\|A\|_F \stackrel{\mathrm{def}}{=} \sqrt{\sum_{i,j} A_{i,j}^2}$. The spectral norm (or induced $\ell_2$-norm) of $A$ is $\|A\|_2 \stackrel{\mathrm{def}}{=} \max_{\|x\|_2 = 1} \|Ax\|_2$. We note that for any $A \in \mathbb{R}^{m \times n}$, it holds that $\|A\|_2 \le \|A\|_F$. A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is called positive semidefinite (PSD), denoted by $A \succeq 0$, if $x^T A x \ge 0$ for all $x \in \mathbb{R}^n$, or equivalently, all the eigenvalues of $A$ are nonnegative. Similarly, a symmetric matrix $A$ is called positive definite (PD), denoted by $A \succ 0$, if $x^T A x > 0$ for all $x \in \mathbb{R}^n$, $x \neq 0$, or equivalently, all the eigenvalues of $A$ are strictly positive. For two symmetric matrices $A, B$, we write $A \succeq B$ to denote that the difference $A - B$ is PSD, i.e., $A - B \succeq 0$. Similarly, we write $A \succ B$ to denote that the difference $A - B$ is PD, i.e., $A - B \succ 0$.
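These facts are easy to confirm numerically: the spectral norm never exceeds the Frobenius norm, and the orders $\succeq$, $\succ$ can be tested via the eigenvalues of the difference. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
fro = np.linalg.norm(A, 'fro')   # sqrt of the sum of squared entries
spec = np.linalg.norm(A, 2)      # largest singular value of A
assert spec <= fro + 1e-12

B = A.T @ A                      # PSD by construction: x'Bx = ||Ax||^2 >= 0
C = B + np.eye(3)                # C - B = I is PD, so C is "above" B
assert np.all(np.linalg.eigvalsh(B) >= -1e-12)   # B is PSD
assert np.all(np.linalg.eigvalsh(C - B) > 0)     # C - B is PD
```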
3 Efficiently Learning PMDs
In this section, we describe and analyze our sample near-optimal and computationally efficient learning algorithm for PMDs. This section is organized as follows: In Section 3.1, we give our main algorithm which, given samples from a PMD $P$, efficiently computes a succinct description of a hypothesis pseudo-distribution $H$ such that $d_{TV}(H, P) \le \epsilon$. As previously explained, the succinct description of $H$ is via its DFT $\widehat{H}$, which is supported on a small discrete set $T$. Note that $\widehat{H}$ provides an evaluation oracle for $H$ whose running time per point is proportional to $|T|$. In Section 3.2, we show how to use $\widehat{H}$, in a black-box manner, to efficiently obtain an $\epsilon$-sampler for $P$, i.e., to sample from a distribution $Q$ such that $d_{TV}(Q, P) \le \epsilon$. Finally, in Section 3.3, we show how a nearly-tight cover upper bound can easily be deduced from our learning algorithm.
3.1 Main Learning Algorithm
In this subsection, we give an algorithm EfficientLearnPMD establishing the following theorem:
Theorem 3.1.
For all $n, k \in \mathbb{Z}_+$ and $\epsilon > 0$, the algorithm EfficientLearnPMD has the following performance guarantee: Let $P$ be an unknown $(n,k)$-PMD. The algorithm uses $\widetilde{O}_k(1/\epsilon^2)$ samples from $P$, runs in time $\widetilde{O}_k(1/\epsilon^2)$, and outputs the DFT $\widehat{H}$ of a pseudo-distribution $H$ that, with probability at least $9/10$, satisfies $d_{TV}(H, P) \le \epsilon$.
Our learning algorithm is described in the following pseudocode:
Algorithm EfficientLearnPMD
Input: sample access to an $(n,k)$-PMD $X$ and $\epsilon > 0$.
Output: A set $T \subseteq L^*(M)/\mathbb{Z}^k$ of small cardinality, and the DFT $\widehat{H}$ of a pseudo-distribution $H$ such that $d_{TV}(H, X) \le \epsilon$.
1. Let $C > 0$ be a sufficiently large universal constant. Draw $N_0$ samples from $X$, and let $\widehat{\mu}$ and $\widehat{\Sigma}$ be the sample mean and the sample covariance matrix.
2. Compute an approximate spectral decomposition of $\widehat{\Sigma}$, i.e., an orthonormal eigenbasis $u_1, \ldots, u_k$ with corresponding eigenvalues $\widehat{\lambda}_1, \ldots, \widehat{\lambda}_k$.
3. Let $M$ be the $k \times k$ integer matrix whose $i$th column is the closest integer point to the vector $C\sqrt{(\widehat{\lambda}_i + 1)\log(k/\epsilon)} \cdot u_i$.
4. Define $T$ to be the set of points $\xi \in L^*(M)/\mathbb{Z}^k$ of the form $\xi = (M^T)^{-1} v \pmod{\mathbb{Z}^k}$, for some $v \in \mathbb{Z}^k$ with $\|v\|_2 \le \sqrt{C k \log(k/\epsilon)}$.
5. Draw $N$ samples $s_1, \ldots, s_N$ from $X$, and output the empirical DFT $\widehat{H} : T \to \mathbb{C}$, i.e., $\widehat{H}(\xi) = \frac{1}{N}\sum_{\ell=1}^N e(\xi \cdot s_\ell)$.
/* The DFT $\widehat{H}$ is a succinct description of the pseudo-distribution $H$, the inverse DFT of $\widehat{H}$, defined by: $H(x) = \frac{1}{|\det(M)|}\sum_{\xi \in T} \widehat{H}(\xi) e(-\xi \cdot x)$ for $x$ in a fundamental domain of $L(M)$, and $H(x) = 0$ otherwise. Our algorithm does not output $H$ explicitly, but implicitly via its DFT $\widehat{H}$. */
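The only data-intensive step is Step 5, a plain empirical average of $e(\xi \cdot s)$ over the samples for each retained frequency. A minimal sketch of that step alone (function name ours; the frequency set is passed in explicitly):

```python
import numpy as np

def empirical_dft(samples, xis):
    """For each frequency xi, estimate E[e(xi . X)] by the empirical
    average (1/N) * sum_s exp(2*pi*1j * (xi . s)) over the samples s."""
    S = np.asarray(samples, dtype=float)          # N x k sample matrix
    return np.exp(2j * np.pi * S @ np.asarray(xis).T).mean(axis=0)

# Degenerate sanity check: if X = (4, 0) with probability 1, then the
# DFT at xi = (0.25, 0) is e(0.25 * 4) = e(1) = 1 exactly.
samples = np.tile([4, 0], (100, 1))
H = empirical_dft(samples, [[0.25, 0.0]])
assert abs(H[0] - 1.0) < 1e-9
```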
Let $X$ be the unknown target $(n,k)$-PMD. We will denote by $P$ the probability mass function of $X$, i.e., $P(x) = \Pr[X = x]$. Throughout this analysis, we will denote by $\mu$ and $\Sigma$ the mean vector and covariance matrix of $X$.
First, note that the algorithm EfficientLearnPMD is easily seen to have the desired sample and time complexity. Indeed, samples are drawn only in Steps 1 and 5, and the total sample complexity is the sum of the two corresponding sample sizes. The runtime of the algorithm is dominated by computing the empirical DFT in Step 5, which takes time $O(k \cdot N \cdot |T|)$, where $N$ is the number of samples drawn in Step 5 and $T$ is the frequency set constructed in Step 4. Computing an approximate eigendecomposition in Step 2 can be done in time polynomial in $k$ (see, e.g., [PC99]). The remaining part of this section is devoted to proving the correctness of our algorithm.
Remark 3.2.
We remark that in Step 4 of our algorithm, the notation $\xi = (M^T)^{-1} v \pmod{\mathbb{Z}^k}$ refers to an equivalence class of points in $L^*(M)/\mathbb{Z}^k$. In particular, any pair of distinct vectors $v, v' \in \mathbb{Z}^k$ satisfying $(M^T)^{-1} v \equiv (M^T)^{-1} v' \pmod{\mathbb{Z}^k}$ correspond to the same point $\xi$ and therefore are not counted twice.
Overview of Analysis.
We begin with a brief overview of the analysis. First, we show (Lemma 3.3) that almost all of the probability mass of $X$ lies in an ellipsoid with center $\mu$ and covariance matrix determined by $\Sigma$. Moreover, with high probability over the samples drawn in Step 1 of the algorithm, the estimates $\widehat{\mu}$ and $\widehat{\Sigma}$ will be good approximations of $\mu$ and $\Sigma$ (Lemma 3.4). By combining these two lemmas, we obtain (Corollary 3.5) that almost all of the probability mass of $X$ lies in an ellipsoid with center $\widehat{\mu}$ and covariance matrix determined by $\widehat{\Sigma}$.
By the above, and by our choice of the matrix $M$, we use linear-algebraic arguments to prove (Lemma 3.6) that almost all of the probability mass of $X \pmod{L(M)}$ lies in a fundamental domain of the lattice $L(M)$. This lemma is crucial because it implies that, to learn our PMD $X$, it suffices to learn the random variable $X \pmod{L(M)}$. We do this by learning the Discrete Fourier Transform of this distribution. This step can be implemented efficiently due to the sparsity property of the DFT (Proposition 3.8): except for points in $T$, the magnitude of the DFT will be very small. Establishing the desired sparsity property for the DFT is the main technical contribution of this section.
Given the above, it is fairly easy to complete the analysis of correctness. For every point $\xi \in T$, we can learn the DFT up to a small absolute error. Since the cardinality of $T$ is appropriately small, this implies that the total error over $T$ is small. The sparsity property of the DFT (Lemma 3.14) completes the proof.
Detailed Analysis.
We now proceed with the detailed analysis of our algorithm. We start by showing that PMDs are concentrated with high probability. More specifically, the following lemma shows that an unknown PMD $X$, with mean vector $\mu$ and covariance matrix $\Sigma$, is effectively supported in an ellipsoid centered at $\mu$, whose principal axes are determined by the eigenvectors and eigenvalues of $\Sigma$ and the desired concentration probability:
Lemma 3.3.
Let $X$ be an $(n,k)$-PMD with mean vector $\mu$ and covariance matrix $\Sigma$. For any $0 < \delta < 1$, consider the positive-definite matrix $\Sigma' \stackrel{\mathrm{def}}{=} \Sigma + I$. Then, with probability at least $1 - \delta$ over $X$, we have that $(X - \mu)^T (\Sigma')^{-1} (X - \mu) = O(k \log(k/\delta))$.
Proof.
Let $X = \sum_{i=1}^n X_i$, where the $X_i$'s are independent $k$-CRVs. We can write $X - \mu = \sum_{i=1}^n Y_i$, where $Y_i \stackrel{\mathrm{def}}{=} X_i - \mathbb{E}[X_i]$. Note that for any unit vector $v \in \mathbb{R}^k$, $\|v\|_2 = 1$, we have that the scalar random variable $v \cdot (X - \mu) = \sum_{i=1}^n v \cdot Y_i$ is a sum of independent, mean $0$, bounded random variables. Indeed, we have that $\mathbb{E}[v \cdot Y_i] = v \cdot \mathbb{E}[Y_i] = 0$. Moreover, we can write $|v \cdot Y_i| \le \|v\|_2 \cdot \|Y_i\|_2 \le \|v\|_2 \cdot \left(\|X_i\|_2 + \|\mathbb{E}[X_i]\|_2\right) \le 2,$
where we used the Cauchy–Schwarz inequality twice, the triangle inequality, and the fact that a $k$-CRV $X_i$ with mean $\mathbb{E}[X_i]$ by definition satisfies $\|X_i\|_2 = 1$ and $\|\mathbb{E}[X_i]\|_2 \le \mathbb{E}[\|X_i\|_2] = 1$.
Let $\sigma_v^2$ be the variance of $v \cdot (X - \mu)$. By Bernstein's inequality, we obtain that for all $t > 0$ it holds
$$\Pr\left[\,|v \cdot (X - \mu)| > t\,\right] \le 2 \exp\left(-\frac{t^2}{2\sigma_v^2 + 4t/3}\right). \qquad (1)$$
Let $\Sigma$, the covariance matrix of $X$, have an orthonormal eigenbasis $u_1, \ldots, u_k$, with corresponding eigenvalues $\lambda_1, \ldots, \lambda_k$. Since $\Sigma$ is positive semidefinite, we have that $\lambda_i \ge 0$, for all $i \in [k]$. We consider the random variable $u_i \cdot (X - \mu)$. In addition to being a sum of independent, mean $0$, bounded random variables, we claim that its variance is $\lambda_i$. First, it is clear that $\mathbb{E}[u_i \cdot (X - \mu)] = 0$. Moreover, note that for any vector $v$, we have that $\mathrm{Var}[v \cdot (X - \mu)] = v^T \Sigma v$. For $v = u_i$, we thus get $\mathrm{Var}[u_i \cdot (X - \mu)] = u_i^T \Sigma u_i = \lambda_i$.
Applying (1) for $v = u_i$ with $t = C\sqrt{(\lambda_i + 1)\log(k/\delta)}$, for a sufficiently large universal constant $C > 0$, yields that for all $i \in [k]$ we have $\Pr\left[\,|u_i \cdot (X - \mu)| > C\sqrt{(\lambda_i + 1)\log(k/\delta)}\,\right] \le \delta/k.$
By a union bound, it follows that, with probability at least $1 - \delta$, we have $|u_i \cdot (X - \mu)| \le C\sqrt{(\lambda_i + 1)\log(k/\delta)}$ simultaneously for all $i \in [k]$.
We condition on this event.
Since the $u_i$'s and $\lambda_i$'s are the eigenvectors and eigenvalues of $\Sigma$, we have that $\Sigma = U \Lambda U^T$, where $U$ is the orthogonal matrix whose $i$th column is $u_i$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_k)$. We can thus write $(X - \mu)^T (\Sigma + I)^{-1} (X - \mu) = \sum_{i=1}^k \frac{\left(u_i \cdot (X - \mu)\right)^2}{\lambda_i + 1}.$
Therefore, we have: