Concentration Inequalities for Multinoulli Random Variables
We investigate concentration inequalities for Dirichlet and Multinomial random variables.
1 Problem Formulation
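We analyse the concentration properties of the random variable $Z_n$ defined as:
$$Z_n := \max_{v \in [0,1]^d} (\hat{p}_n - p)^\top v, \tag{1}$$
where $\hat{p}_n \in \Delta^d$ is a random vector, $p \in \Delta^d$ is deterministic and $\Delta^d$ is the $d$-dimensional simplex. It is easy to show that the maximum in Eq. 1 is equivalent to computing the (scaled) $\ell_1$-norm of the vector $\hat{p}_n - p$:
$$Z_n = \sum_{i=1}^{d} \max\{\hat{p}_n(i) - p(i), 0\} = \frac{1}{2}\,\|\hat{p}_n - p\|_1, \tag{2}$$
where we have used the fact that $\sum_{i=1}^{d} (\hat{p}_n(i) - p(i)) = 0$. As a consequence, $Z_n$ is a bounded random variable in $[0,1]$. While the following discussion applies to Dirichlet distributions, we focus on $\hat{p}_n \sim \frac{1}{n}\mathrm{Multinomial}(n, p)$. The results previously available in the literature are summarized in the following.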
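The equivalence between the maximum over the box and the scaled $\ell_1$-norm can be checked numerically. The snippet below is a minimal sketch, assuming the maximization runs over $v \in [0,1]^d$ and that the empirical vector is built from $n$ categorical draws; the function names and the example distribution are illustrative and not part of the original text.

```python
import random


def sample_empirical(p, n, rng):
    """Draw n categorical samples from p and return the empirical frequencies."""
    counts = [0] * len(p)
    for i in rng.choices(range(len(p)), weights=p, k=n):
        counts[i] += 1
    return [c / n for c in counts]


def max_over_box(diff):
    """max over v in [0,1]^d of diff^T v: pick v_i = 1 where diff_i > 0, else 0."""
    return sum(x for x in diff if x > 0)


def half_l1(diff):
    """Half of the l1-norm of diff."""
    return 0.5 * sum(abs(x) for x in diff)


rng = random.Random(0)
p = [0.5, 0.2, 0.2, 0.1]          # illustrative true distribution
p_hat = sample_empirical(p, 200, rng)
diff = [a - b for a, b in zip(p_hat, p)]

# The two quantities agree because the entries of diff sum to zero.
print(max_over_box(diff), half_l1(diff))
```

Both printed values coincide up to floating-point rounding and lie in $[0,1]$, as expected for the difference of two probability vectors.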
The literature has analysed the concentration of the $\ell_1$-discrepancy between the true distribution and the empirical one in this setting.
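Proposition 1 (Weissman et al., 2003). Let $p \in \Delta^d$ and $\hat{p}_n \sim \frac{1}{n}\mathrm{Multinomial}(n, p)$. Then, for any $d \geq 2$ and $\delta \in (0,1)$:
$$\mathbb{P}\left(\|\hat{p}_n - p\|_1 \geq \sqrt{\frac{2 \ln\left((2^d - 2)/\delta\right)}{n}}\right) \leq \delta.$$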
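Proposition 2 (Devroye, 1983). Let $p \in \Delta^d$ and $\hat{p}_n \sim \frac{1}{n}\mathrm{Multinomial}(n, p)$. Then, for any $0 < \delta \leq 3\exp(-4d/5)$:
$$\mathbb{P}\left(\|\hat{p}_n - p\|_1 \geq 5\sqrt{\frac{\ln(3/\delta)}{n}}\right) \leq \delta.$$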
While Prop. 1 shows an explicit dependence on the dimension of the random variable, such dependence is hidden in Prop. 2 by the constraint on $\delta$. Note that for any $\delta \leq 3\exp(-4d/5)$, $5\sqrt{\ln(3/\delta)/n} \geq \sqrt{20\,d/n}$. This shows that the $\ell_1$-deviation always scales with the dimension of the random variable, i.e., as $\sqrt{d/n}$.
A better inequality. The natural question is whether it is possible to derive a concentration inequality independent of the dimension of $\hat{p}_n$ by exploiting the correlation between $\hat{p}_n$ and the maximizer vector $v$. This question has been recently addressed in (Agrawal and Jia, 2017, Lem. C.2):
Lemma 3 (Agrawal and Jia, 2017). Let and . Then, for any :
2 Theoretical Analysis (the asymptotic case)
In this section, we provide a counter-argument to Lem. 3 in the asymptotic regime (i.e., $n \to \infty$). The overall idea is to show that the expected value of the rescaled deviation $\sqrt{n}\,\|\hat{p}_n - p\|_1$ asymptotically grows as $\sqrt{d}$ and that the rescaled deviation itself is well concentrated around its expectation. As a result, we can deduce that all its quantiles grow as $\sqrt{d}$ as well.
We consider the true vector to be uniform, i.e., and .
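The growth with the dimension can be observed empirically before any asymptotic argument. The snippet below is a minimal Monte Carlo sketch, assuming a uniform true distribution on $d$ outcomes and a fixed sample size $n$; the function name, the sample size and the two dimensions are illustrative choices. Under a normal approximation, the mean $\ell_1$-deviation is roughly $\sqrt{2(d-1)/(\pi n)}$, so the ratio between two dimensions should be close to $\sqrt{(d_2 - 1)/(d_1 - 1)}$.

```python
import math
import random


def mean_l1_deviation(d, n, trials, rng):
    """Monte Carlo estimate of E||p_hat - p||_1 for the uniform p on d outcomes."""
    total = 0.0
    for _ in range(trials):
        counts = [0] * d
        for _ in range(n):
            counts[rng.randrange(d)] += 1      # one uniform categorical draw
        total += sum(abs(c / n - 1.0 / d) for c in counts)
    return total / trials


rng = random.Random(1)
n = 1000
small = mean_l1_deviation(4, n, 200, rng)      # d = 4
large = mean_l1_deviation(64, n, 200, rng)     # d = 64

# Compare the observed ratio with the sqrt((d2 - 1) / (d1 - 1)) heuristic.
print(small, large, large / small, math.sqrt(63 / 3))
```

With the sample size held fixed, the estimated deviation increases markedly with $d$, which is the behaviour a dimension-free bound would have to rule out.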
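Lemma 4. Consider $\hat{p}_n \sim \frac{1}{n}\mathrm{Multinomial}(n, p)$, and let $p$ be the uniform distribution on $\{1, \dots, d\}$. Let $\mathbf{1}$ be the vector of ones of dimension $d$. Define $Y \sim \mathcal{N}(0, \Sigma)$ where $\Sigma$ is the matrix with $\frac{1}{d} - \frac{1}{d^2}$ in all the diagonal entries and $-\frac{1}{d^2}$ elsewhere, and $Y^+ = \max\{Y, 0\}$ (componentwise). Then:
$$\frac{\sqrt{n}}{2}\,\|\hat{p}_n - p\|_1 \xrightarrow{d} \mathbf{1}^\top Y^+ \quad \text{as } n \to \infty, \qquad \mathbb{E}\left[\mathbf{1}^\top Y^+\right] = \sqrt{\frac{d-1}{2\pi}}.$$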
While the previous lemma may already suggest that the rescaled $\ell_1$-deviation should grow as $\sqrt{d}$, like its expectation, it is still possible that a large part of the distribution is concentrated around a value independent of $d$, with limited probability assigned to, e.g., values growing as $d$, which could justify the growth of the expectation. Thus, in order to conclude the analysis, we need to show that the limiting variable is concentrated “enough” around its expectation.
Since the random variables $Y_i$ are correlated, it is complicated to directly analyze the deviation of $\mathbf{1}^\top Y^+$ from its mean. Thus we first apply an orthogonal transformation to $Y$ to obtain independent r.v. (recall that jointly normally distributed variables are independent if uncorrelated).
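Lemma 5. Consider the same settings of Lem. 4 and recall that $Y \sim \mathcal{N}(0, \Sigma)$. There exists an orthogonal transformation $U \in \mathbb{R}^{d \times d}$, s.t.
$$W := \sqrt{d}\, U^\top Y \sim \mathcal{N}\left(0, \begin{pmatrix} I_{d-1} & 0 \\ 0 & 0 \end{pmatrix}\right).$$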
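By exploiting the transformation $U$ we can write $\mathbf{1}^\top Y^+ = g(W)$ with $g(w) = \mathbf{1}^\top (U w / \sqrt{d})^+$. Since the first $d-1$ components of $W$ are i.i.d. standard Gaussian random variables and $g$ is $1$-Lipschitz, we can finally characterize the mean and the deviations of $\mathbf{1}^\top Y^+$ and derive the following anticoncentration inequality for it.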
Theorem 6. Let and . Define and . Then, for any :
This result shows that every quantile of the limiting variable depends on the dimension of the random variable, i.e., it grows as $\sqrt{d}$. Similarly to Prop. 2, it is possible to lower bound the quantile by a dimension-free quantity at the price of having an exponential dependence on $d$ in $\delta$.
Appendix A Proof for the asymptotic scenario
In this section we report the proofs of the lemmas and the theorem stated in Sec. 2.
A.1 Proof of Lem. 4
Let and . Then:
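where we used the fact that the maximizing $v$ takes the largest value ($1$) for all positive components and is equal to $0$ otherwise. We recall that the covariance of the normalized multinoulli variable $\sqrt{n}\,(\hat{p}_n - p)$ with probabilities $p$ is $\Sigma = \mathrm{diag}(p) - p p^\top$. As a result, a direct application of the central limit theorem gives $\sqrt{n}\,(\hat{p}_n - p) \xrightarrow{d} \mathcal{N}(0, \Sigma)$. Then we can apply the functional CLT and obtain $\frac{\sqrt{n}}{2}\,\|\hat{p}_n - p\|_1 \xrightarrow{d} \mathbf{1}^\top Y^+$, where $Y^+$ is a random vector obtained by truncating from below at $0$ the multi-variate Gaussian vector $Y$. Since the marginal distribution of each random variable $Y_i$ is $\mathcal{N}\left(0, \frac{d-1}{d^2}\right)$, i.e., the $Y_i$ are identically distributed (see definition in Lem. 4), $Y_i^+$ has a distribution composed of a Dirac distribution in $0$ and a half normal distribution, and its expected value is $\frac{1}{d}\sqrt{\frac{d-1}{2\pi}}$, while linearity of expectation, $\mathbb{E}[\mathbf{1}^\top Y^+] = \sum_{i=1}^{d} \mathbb{E}[Y_i^+]$, leads to the final statement on the expectation.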
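The value of the limiting expectation can be sanity-checked by simulation. The snippet below is a minimal sketch, assuming the limiting Gaussian vector has covariance $\frac{1}{d} I - \frac{1}{d^2}\mathbf{1}\mathbf{1}^\top$ (the uniform case); it samples that vector by centering i.i.d. $\mathcal{N}(0, 1/d)$ draws, which yields exactly this covariance. The function names and the choice $d = 50$ are illustrative.

```python
import math
import random


def sample_limit_vector(d, rng):
    """Draw a Gaussian vector with covariance I/d - 11^T/d^2.

    Centering i.i.d. N(0, 1/d) draws applies the projection I - 11^T/d,
    whose square is itself, so the covariance is (1/d)(I - 11^T/d).
    """
    w = [rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)]
    mean = sum(w) / d
    return [x - mean for x in w]


def mean_sum_positive_parts(d, trials, rng):
    """Monte Carlo estimate of the expected sum of positive parts."""
    total = 0.0
    for _ in range(trials):
        total += sum(max(y, 0.0) for y in sample_limit_vector(d, rng))
    return total / trials


d = 50
estimate = mean_sum_positive_parts(d, 4000, random.Random(2))
target = math.sqrt((d - 1) / (2 * math.pi))   # closed form sqrt((d-1)/(2*pi))
print(estimate, target)
```

The Monte Carlo estimate and the closed form should agree to within sampling error.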
A.2 Proof of Lem. 5
Denote the set of eigenvalues of square matrix . Let such that , where is a matrix full of zeros. Then, we can write the eigenvalues of the covariance matrix of as
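where we use the fact that the eigenvalues of $\mathbf{1}\mathbf{1}^\top$ are $d$ (with multiplicity one) and $0$ (with multiplicity $d-1$). As a result, the covariance of $Y$ has one eigenvalue at $0$ and eigenvalues equal to $\frac{1}{d}$ with multiplicity $d-1$. Consequently, we can diagonalize it with an orthogonal matrix $U$ (obtained using the normalized eigenvectors) and obtain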
Define , then:
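The eigen-structure used in this proof can be verified directly. The snippet below is a minimal sketch, assuming the covariance matrix has $\frac{1}{d} - \frac{1}{d^2}$ on the diagonal and $-\frac{1}{d^2}$ elsewhere; it checks that the vector of ones is an eigenvector with eigenvalue $0$ and that $d-1$ linearly independent vectors orthogonal to it are eigenvectors with eigenvalue $\frac{1}{d}$. The helper names and $d = 6$ are illustrative.

```python
def covariance_uniform(d):
    """Covariance with 1/d - 1/d^2 on the diagonal and -1/d^2 elsewhere."""
    diag = 1.0 / d - 1.0 / d**2
    off = -1.0 / d**2
    return [[diag if i == j else off for j in range(d)] for i in range(d)]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


d = 6
sigma = covariance_uniform(d)

# Eigenvalue 0: the all-ones vector is mapped to (numerically) zero.
image_of_ones = matvec(sigma, [1.0] * d)

# Eigenvalue 1/d: e_0 - e_k for k = 1..d-1 are independent and orthogonal to ones.
eigen_checks = []
for k in range(1, d):
    v = [0.0] * d
    v[0], v[k] = 1.0, -1.0
    image = matvec(sigma, v)
    eigen_checks.append(all(abs(x - y / d) < 1e-12 for x, y in zip(image, v)))

print(image_of_ones, eigen_checks)
```

The trace of the matrix equals $(d-1)/d$, which matches the sum of the eigenvalues and gives one more consistency check.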
A.3 Proof of Thm. 6
Let . Then is -Lipschitz:
where denotes the Lipschitz constant of a function and we exploit the fact that is an orthonormal matrix.
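One way to obtain the Lipschitz constant is sketched below. The notation is an assumption made for illustration and is not taken from the original: $g(w) = \mathbf{1}^\top y^+$ with $y = U w / \sqrt{d}$ and $y' = U w' / \sqrt{d}$, where $U$ is the orthogonal matrix diagonalizing the covariance. The chain uses $|a^+ - b^+| \leq |a - b|$, the Cauchy-Schwarz inequality, and the fact that an orthogonal matrix preserves the Euclidean norm; it is not necessarily the exact sequence of steps intended by the authors.

```latex
\begin{align*}
|g(w) - g(w')|
  &= \Big| \sum_{i=1}^{d} \big( y_i^+ - {y'_i}^+ \big) \Big|
   \leq \sum_{i=1}^{d} |y_i - y'_i|
   = \|y - y'\|_1 \\
  &\leq \sqrt{d}\, \|y - y'\|_2
   = \sqrt{d} \cdot \frac{1}{\sqrt{d}}\, \|U (w - w')\|_2
   = \|w - w'\|_2 .
\end{align*}
```

Under these assumptions the function is $1$-Lipschitz with respect to the Euclidean norm, which is the property the concentration step relies on.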
We can now study the concentration of the variable .
Given that is a vector of i.i.d. standard Gaussian variables
Substituting the value of and inverting the bound gives the desired statement.
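The concentration step can be illustrated numerically. The snippet below is a minimal sketch, assuming the standard Gaussian concentration inequality for $1$-Lipschitz functions, $\mathbb{P}(f(W) \leq \mathbb{E} f(W) - t) \leq \exp(-t^2/2)$, applied to the sum of positive parts of a Gaussian vector with covariance $\frac{1}{d} I - \frac{1}{d^2}\mathbf{1}\mathbf{1}^\top$. The exact constants of the theorem are not reproduced here, and the function names, $d = 50$ and $t = 0.5$ are illustrative.

```python
import math
import random


def sample_sum_positive_parts(d, rng):
    """Sample the sum of positive parts of a vector with covariance I/d - 11^T/d^2."""
    w = [rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)]
    mean = sum(w) / d                       # centering gives the target covariance
    return sum(max(x - mean, 0.0) for x in w)


def lower_tail_frequency(d, t, trials, rng):
    """Empirical frequency of falling at least t below the exact mean."""
    mean_z = math.sqrt((d - 1) / (2 * math.pi))
    hits = sum(1 for _ in range(trials)
               if sample_sum_positive_parts(d, rng) <= mean_z - t)
    return hits / trials


d, t = 50, 0.5
freq = lower_tail_frequency(d, t, 5000, random.Random(3))
bound = math.exp(-t * t / 2)                # Gaussian bound for 1-Lipschitz functions
print(freq, bound)
```

The empirical lower-tail frequency stays below the Gaussian bound, so the mass of the distribution sits close to a mean of order $\sqrt{d}$ rather than near a dimension-free value.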
- The analysis holds also in the Dirichlet case, see (Osband and Van Roy, 2017).
- Note that we can drop the last component of since it is deterministically zero.
References
- Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In NIPS, pages 1184–1194, 2017.
- Luc Devroye. The equivalence of weak, strong and complete convergence in $L_1$ for kernel density estimates. The Annals of Statistics, 11(3):896–904, 1983. doi: 10.1214/aos/1176346255.
- Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
- Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In ICML, volume 70 of Proceedings of Machine Learning Research, pages 2701–2710. PMLR, 2017.
- Martin J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. 2017.
- Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the l1 deviation of the empirical distribution. Technical Report HPL-2003-97R1, Hewlett-Packard Labs, 2003.