Testing Everything
Abstract
Given samples from an unknown distribution p, how can we answer basic questions of the form: Is p monotone? Is it unimodal? Is it logconcave? And so on \ldots. These problems have received tremendous attention in statistics, with emphasis on asymptotic analysis. Over the past decade, a number of researchers have studied these problems in a CS framework, with a focus on designing algorithms whose sample complexity and running time are as small as possible.
Surprisingly, for some of the most basic problems, such as testing monotonicity over a discrete domain such as [n] or the hypergrid [n]^{d}, the known algorithms have highly suboptimal sample complexity. For example, testing monotonicity over the hypergrid [n]^{d} required \tilde{O}(n^{d-1/2}) samples, whereas the best known lower bound is \Omega(n^{d/2}).
We propose a general framework that resolves the problem of testing for a wide range of classes. In particular, our algorithms are provably information-theoretically optimal, and for many of the classes considered they are highly efficient. We resolve, optimally, the problems of testing monotonicity, logconcavity, monotone hazard rate, and distributions with a few modes. As an example, we can test monotonicity over the hypergrid with O(n^{d/2}) samples and time.
At a technical level, we give a two-step approach to testing. The first is a learning step that tries to approximate the underlying distribution with one from the class of interest. The second step is a simple modification of the \chi^{2} test. Our algorithms are simple to implement, and we compare them in the section on experiments.
1 Preliminaries
We use the following probability distances in our paper.
Definition 1.
The total variation distance between distributions p and q is defined as
d_{\mathrm{TV}}(p,q)\stackrel{\mathrm{def}}{=}\sup_{A}|p(A)-q(A)|=\frac{1}{2}\|p-q\|_{1}. 
For a subset of the domain, the total variation distance is defined as half of the \ell_{1} distance restricted to the subset.
Definition 2.
The \chi^{2}-distance between p and q over [n] is defined by
\chi^{2}(p,q)\stackrel{\mathrm{def}}{=}\sum_{i\in[n]}\frac{(p_{i}-q_{i})^{2}}{q_{i}}=\left[\sum_{i\in[n]}\frac{p_{i}^{2}}{q_{i}}\right]-1. 
Definition 3.
The Kolmogorov distance between two probability measures p and q over an ordered set (e.g., {\bf R}) with cumulative distribution functions (CDFs) F_{p} and F_{q} is defined as
d_{\mathrm{K}}(p,q)\stackrel{\mathrm{def}}{=}\sup_{x\in\mathbb{R}}|F_{p}(x)-F_{q}(x)|. 
Our paper is primarily concerned with testing against classes of distributions, defined formally as follows:
Definition 4.
Given \varepsilon\in(0,1] and sample access to a distribution p, an algorithm is said to test a class \mathcal{C} if it has the following guarantees:

If p\in\mathcal{C}, the algorithm outputs Accept with probability at least 2/3;

If d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, the algorithm outputs Reject with probability at least 2/3.
The Dvoretzky–Kiefer–Wolfowitz (DKW) inequality gives a generic algorithm for learning any distribution with respect to the Kolmogorov distance [DvoretzkyKW56].
Lemma 1.
(See [DvoretzkyKW56], [Massart90]) Suppose we have n i.i.d. samples X_{1},\dots,X_{n} from a distribution with CDF F. Let F_{n}(x)\stackrel{\mathrm{def}}{=}\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{\{X_{i}\leq x\}} be the empirical CDF. Then \Pr[d_{\mathrm{K}}(F,F_{n})\geq\varepsilon]\leq 2e^{-2n\varepsilon^{2}}. In particular, if n=\Omega((1/\varepsilon^{2})\cdot\log(1/\delta)), then \Pr[d_{\mathrm{K}}(F,F_{n})\geq\varepsilon]\leq\delta.
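A short sketch of the DKW guarantee in NumPy: we build the empirical CDF from samples of a known discrete distribution and check that the Kolmogorov distance is below the bound predicted by Lemma 1 (the helper `empirical_cdf` is ours, not from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_cdf(samples):
    """Return a function x -> F_n(x), the empirical CDF of the samples."""
    xs = np.sort(samples)
    return lambda x: np.searchsorted(xs, x, side="right") / len(xs)

# Illustration with the uniform distribution over {0,...,9}.
p = np.full(10, 0.1)
F = np.cumsum(p)                       # true CDF at the points 0..9
n = 2000
samples = rng.choice(10, size=n, p=p)
F_n = empirical_cdf(samples)

# Kolmogorov distance between the true and empirical CDFs; the sup is
# attained at the support points for a discrete distribution.
d_K = max(abs(F[i] - F_n(i)) for i in range(10))

# DKW: Pr[d_K >= eps] <= 2 exp(-2 n eps^2); with eps = 0.05 and n = 2000
# the failure probability is about 2e-5, so d_K < 0.05 should hold here.
assert d_K < 0.05
```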
We note the following useful relationships between these distances [GibbsS02]:
Proposition 1.
d_{\mathrm{K}}(p,q)^{2}\leq d_{\mathrm{TV}}(p,q)^{2}\leq\frac{1}{4}\chi^{2}(p,q).
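The chain of inequalities in Proposition 1 can be checked numerically on small distributions; the three helper functions below are straightforward transcriptions of Definitions 1–3, not code from the paper.

```python
import numpy as np

def tv(p, q):
    """Total variation distance: half the ell_1 distance (Definition 1)."""
    return 0.5 * np.abs(p - q).sum()

def kolmogorov(p, q):
    """Kolmogorov distance: the largest gap between the CDFs (Definition 3)."""
    return np.abs(np.cumsum(p) - np.cumsum(q)).max()

def chi2(p, q):
    """chi^2-distance of Definition 2."""
    return ((p - q) ** 2 / q).sum()

p = np.array([0.5, 0.2, 0.2, 0.1])
q = np.full(4, 0.25)

# Proposition 1: d_K^2 <= d_TV^2 <= (1/4) * chi^2.
assert kolmogorov(p, q) ** 2 <= tv(p, q) ** 2 <= 0.25 * chi2(p, q)
```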
In this paper, we will consider the following classes of distributions:

Monotone distributions over [n]^{d} (denoted by \mathcal{M}_{n}^{d}), for which i\preceq j implies f_{i}\geq f_{j} (this definition describes monotone nonincreasing distributions; by symmetry, identical results hold for monotone nondecreasing distributions);

Unimodal distributions over [n] (denoted by \mathcal{U}_{n}), for which there exists an i^{*} such that f_{i} is nondecreasing for i\leq i^{*} and nonincreasing for i\geq i^{*};

Logconcave distributions over [n] (denoted by \mathcal{LCD}_{n}), the subclass of unimodal distributions for which f_{i-1}f_{i+1}\leq f_{i}^{2};

Monotone hazard rate (MHR) distributions over [n] (denoted by \mathcal{MHR}_{n}), for which i<j implies \frac{f_{i}}{1-F_{i}}\leq\frac{f_{j}}{1-F_{j}}.
Definition 5.
An \eta-effective support of a distribution p is any set S such that p(S)\geq 1-\eta.
The flattening of a function f over a subset S is the function \bar{f} such that \bar{f}_{i}=f(S)/|S| for all i\in S.
Definition 6.
Let p be a distribution, and let I_{1},\ldots,I_{k} be a partition of the domain into intervals. The flattening of p with respect to I_{1},\ldots,I_{k} is the distribution \bar{p} obtained by flattening p over each of the intervals I_{1},\ldots,I_{k}.
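A minimal sketch of Definition 6, assuming half-open integer intervals (the function name `flatten` is ours): each interval's mass is spread uniformly across that interval, so the result is still a distribution.

```python
import numpy as np

def flatten(p, intervals):
    """Flattening of p w.r.t. a partition of the domain into half-open
    intervals [lo, hi): spread each interval's mass uniformly over it."""
    p_bar = np.empty_like(p, dtype=float)
    for lo, hi in intervals:
        p_bar[lo:hi] = p[lo:hi].sum() / (hi - lo)
    return p_bar

p = np.array([0.4, 0.1, 0.1, 0.2, 0.1, 0.1])
p_bar = flatten(p, [(0, 2), (2, 6)])

# Each interval keeps its total mass, so p_bar still sums to 1.
assert np.isclose(p_bar.sum(), 1.0)
```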
Poisson Sampling
Throughout this paper, we use the standard Poissonization approach. Instead of drawing exactly m samples from a distribution p, we first draw m^{\prime}\sim\mathrm{Poisson}(m), and then draw m^{\prime} samples from p. As a result, the number of times different elements in the support of p occur in the sample becomes independent, giving much simpler analyses. In particular, the number of times we observe domain element i is distributed as \mathrm{Poisson}(mp_{i}), independently for each i. Since \mathrm{Poisson}(m) is tightly concentrated around m, this additional flexibility comes at only a subconstant cost in the sample complexity, with an additive increase in the error probability that is inversely exponential in m.
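The independence property above means the Poissonized sample can be generated directly as a histogram of independent Poisson counts, which is how one would simulate it (a sketch; `poissonized_histogram` is our name):

```python
import numpy as np

rng = np.random.default_rng(1)

def poissonized_histogram(p, m):
    """Histogram of a Poissonized sample: drawing Poisson(m) samples from p
    is equivalent to drawing independent counts X_i ~ Poisson(m * p_i)."""
    return rng.poisson(m * np.asarray(p))

p = [0.5, 0.3, 0.2]
counts = poissonized_histogram(p, m=1000)

# The total sample size is Poisson(1000), tightly concentrated around 1000.
assert 800 <= counts.sum() <= 1200
```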
2 Overview
Our algorithm for testing a distribution p can be decomposed into three steps.
Nearproper learning in \chi^{2}distance.
Our first step requires a learning algorithm with very specific guarantees. In proper learning, we are given sample access to a distribution p\in\mathcal{C}, where \mathcal{C} is some class of distributions, and we wish to output q\in\mathcal{C} such that p and q are close in total variation distance. In our setting, given sample access to p\in\mathcal{C}, we wish to output q such that q is close to \mathcal{C} in total variation distance, and p and q are close in \chi^{2}-distance on an effective support of p (we also require the algorithm to output a description of an effective support for which this property holds; this requirement can be slightly relaxed, as we show in our results for testing unimodality). From an information-theoretic standpoint, this problem is harder than proper learning, since \chi^{2}-distance is more restrictive than total variation distance. Nonetheless, this problem can be shown to have sample complexity comparable to proper learning for the structured classes we consider in this paper.
Computation of distance to class.
The next step is to check whether the hypothesis q is close to the class \mathcal{C}. Since we have an explicit description of q, this step requires no further samples from p, i.e., it is purely computational. If we find that q is far from the class \mathcal{C}, then it must be that p\not\in\mathcal{C}, as otherwise the guarantees from the previous step would imply that q is close to \mathcal{C}. Thus, if q is far from \mathcal{C}, we can output Reject and terminate the algorithm at this point.
\chi^{2}testing.
At this point, the previous two steps guarantee that our distribution q is such that:

If p\in\mathcal{C}, then p and q are close in \chi^{2} distance on a (known) effective support of p;

If d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, then p and q are far in total variation distance.
We can distinguish between these two cases using O(\sqrt{n}/\varepsilon^{2}) samples with a simple statistical \chi^{2}-test, which we describe in Section 3.
Using the above threestep approach, our tester, as described in the next section, can directly test monotonicity, logconcavity, and monotone hazard rate. With an extra trick, using Kolmogorov’s max inequality, it can also test unimodality.
3 A Robust \chi^{2}-\ell_{1} Identity Test
Our main result in this section is Theorem 2. As an immediate corollary, we obtain the following result on testing whether an unknown distribution is close in \chi^{2}-distance or far in \ell_{1}-distance to a known distribution:
Theorem 1.
For a known distribution q over [n], there exists an algorithm with sample complexity
O(\sqrt{n}/\varepsilon^{2}) 
that, given sample access to an unknown distribution p over [n], distinguishes between the cases

\chi^{2}(p,q)<\varepsilon^{2}/10 versus

\|p-q\|_{1}>\varepsilon

with probability at least 5/6.
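A sketch of the test behind Theorem 1 (the function name `chi2_ell1_test` is ours): it computes the statistic Z analyzed in Appendix A over the set \mathcal{A}=\{i:q_{i}\geq\varepsilon/50n\} and accepts when Z<m\varepsilon^{2}/10, the threshold used in the proof of Theorem 2. This is a simplification, not the paper's Algorithm 1 verbatim.

```python
import numpy as np

rng = np.random.default_rng(2)

def chi2_ell1_test(X, q, m, eps):
    """Given Poissonized counts X (X_i ~ Poisson(m p_i)), compute
    Z = sum_{i in A} ((X_i - m q_i)^2 - X_i) / (m q_i) over the set
    A = {i : q_i >= eps/(50 n)} and accept when Z < m eps^2 / 10."""
    n = len(q)
    A = q >= eps / (50 * n)
    Xa, qa = X[A], q[A]
    Z = (((Xa - m * qa) ** 2 - Xa) / (m * qa)).sum()
    return Z < m * eps ** 2 / 10

n, eps = 100, 0.25
m = int(20000 * np.sqrt(n) / eps ** 2)   # sample budget from the analysis
q = np.full(n, 1.0 / n)

# If p = q (so chi^2(p, q) = 0), E[Z] = 0 and the test accepts w.h.p.
assert chi2_ell1_test(rng.poisson(m * q), q, m, eps)

# If p is at total variation distance eps from q, E[Z] = 4 m eps^2,
# far above the threshold, so the test rejects w.h.p.
p = q * np.tile([1 + 2 * eps, 1 - 2 * eps], n // 2)
assert not chi2_ell1_test(rng.poisson(m * p), q, m, eps)
```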
This theorem follows from our main result of this section, stated next, slightly more generally for classes of distributions.
Theorem 2.
Suppose we are given \varepsilon\in(0,1], a class of probability distributions \mathcal{C}, sample access to a distribution p over [n], and an explicit description of a distribution q with the following properties:

d_{\mathrm{TV}}(q,\mathcal{C})\leq\frac{\varepsilon}{2}.

If p\in\mathcal{C}, then \chi^{2}(p,q)\leq\frac{\varepsilon^{2}}{500}.
Then there exists an algorithm with the following guarantees:

If p\in\mathcal{C}, the algorithm outputs Accept with probability at least 2/3;

If d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, the algorithm outputs Reject with probability at least 2/3.
The time and sample complexity of this algorithm are O\left(\frac{\sqrt{n}}{\varepsilon^{2}}\right).
Remark 1.
As stated, Theorem 2 requires that q is O(\varepsilon^{2})-close in \chi^{2}-distance to p over its entire domain. For the class of monotone distributions, we are able to efficiently obtain such a q, which immediately implies sample-optimal learning algorithms for this class. However, for some classes, we cannot learn a q with such strong guarantees, and we must consider modifications to our base testing algorithm.
For example, for logconcave and monotone hazard rate distributions, we can obtain a distribution q and a set S with the following guarantees:

If p\in\mathcal{C}, then \chi^{2}(p_{S},q_{S})\leq O(\varepsilon^{2}) and p(S)\geq 1-O(\varepsilon);

If d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, then d_{\mathrm{TV}}(p,q)\geq\varepsilon/2.
In this scenario, the tester will simply pretend the support of p and q is S, ignoring any samples and support elements in [n]\setminus S. Analysis of this tester is extremely similar to what we present below. In particular, we can still show that the statistic Z will be separated in the two cases. When p\in\mathcal{C}, excluding [n]\setminus S will only reduce Z. On the other hand, when d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, since p(S)\geq 1-O(\varepsilon), p and q must still be far on the remaining support, and we can show that Z is still sufficiently large. Therefore, a small modification allows us to handle this case with the same sample complexity of O(\sqrt{n}/\varepsilon^{2}).
A further modification can handle even weaker learning guarantees. We could handle the previous case because the tester “knows what we don’t know” – it can explicitly ignore the support over which we do not have a \chi^{2}closeness guarantee. A more difficult case is when there may be a low measure interval hidden in our effective support, over which p and q have a large \chi^{2}distance. While we may have insufficient samples to reliably identify this interval, it may still have a large effect on our statistic. A naive solution would be to consider a tester which tries all possible “guesses” for this “bad” interval, but a union bound would incur an extra logarithmic factor in the sample complexity. We manage to avoid this cost through a careful analysis involving Kolmogorov’s max inequality, maintaining the O(\sqrt{n}/\varepsilon^{2}) sample complexity even in this more difficult case.
Being more precise, we can handle cases where we can obtain a distribution q and a set of intervals S=\{I_{1},\dots,I_{b}\} with the following guarantees:

If p\in\mathcal{C}, then p(S)\geq 1-O(\varepsilon), p(I_{j})=\Theta(p(S)/b) for all j\in[b], and there exists a set T\subseteq[b] such that |T|\geq b-t (for t=O(1)) and \chi^{2}(p_{R},q_{R})\leq O(\varepsilon^{2}), where R=\cup_{j\in T}I_{j};

If d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, then d_{\mathrm{TV}}(p,q)\geq\varepsilon/2.
This allows us to additionally test against the class of unimodal distributions.
The tester requires that an effective support is divided into several intervals of roughly equal measure. It computes our statistic over each of these intervals, and we let our statistic Z be the sum of all but the largest t of these values. In the case when p\in\mathcal{C}, Z will only become smaller by performing this operation. We use Kolmogorov’s maximal inequality to show that Z remains large when d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon. More details on this tester are provided in Section LABEL:sec:unimodalappendix.
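The "sum all but the largest t" modification can be sketched as follows (a sketch under our own naming; `robust_interval_statistic` is not from the paper):

```python
import numpy as np

def robust_interval_statistic(X, q, m, intervals, t):
    """Compute the chi^2-type statistic of Appendix A separately on each
    half-open interval [lo, hi) of the effective support, then sum all but
    the t largest per-interval values, discarding untrusted intervals."""
    per_interval = []
    for lo, hi in intervals:
        Xi, qi = X[lo:hi], q[lo:hi]
        per_interval.append((((Xi - m * qi) ** 2 - Xi) / (m * qi)).sum())
    per_interval.sort()
    # Dropping the t largest values can only decrease the statistic when
    # p is in C; Kolmogorov's maximal inequality shows it stays large when
    # p is far from C.
    return sum(per_interval[: len(per_interval) - t])

# Deterministic sanity check: if the counts exactly equal m * q_i, each
# element contributes -1, so each length-2 interval contributes -2.
q = np.full(4, 0.25)
Z = robust_interval_statistic(np.full(4, 100.0), q, 400, [(0, 2), (2, 4)], t=1)
assert Z == -2.0
```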
Proof of Theorem 2: Theorem 2 is proven by analyzing Algorithm 1. As shown in Appendix A, Z has the following mean and variance:
\mathbb{E}\left[Z\right]=m\cdot\sum_{i\in\mathcal{A}}\frac{(p_{i}-q_{i})^{2}}{q_{i}}=m\cdot\chi^{2}(p_{\mathcal{A}},q_{\mathcal{A}})  (1) 
\operatorname{Var}\left[Z\right]=\sum_{i\in\mathcal{A}}\left[2\frac{p_{i}^{2}}{q_{i}^{2}}+4m\cdot\frac{p_{i}\cdot(p_{i}-q_{i})^{2}}{q_{i}^{2}}\right]  (2) 
where by p_{\mathcal{A}} and q_{\mathcal{A}} we denote respectively the vectors p and q restricted to the coordinates in \mathcal{A}, and we slightly abuse notation when we write \chi^{2}(p_{\mathcal{A}},q_{\mathcal{A}}), as these do not then correspond to probability distributions.
Lemma 2 demonstrates the separation in the means of the statistic Z in the two cases of interest, i.e., p\in\mathcal{C} versus d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, and Lemma 3 shows the separation in the variances in the two cases. These two results are proved in Appendix B.
Lemma 2.
If p\in\mathcal{C}, then \mathbb{E}\left[Z\right]\leq\frac{1}{500}m\varepsilon^{2}. If d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, then \mathbb{E}\left[Z\right]\geq\frac{1}{5}m\varepsilon^{2}.
Lemma 3.
If p\in\mathcal{C}, then \operatorname{Var}\left[Z\right]\leq\frac{1}{500000}m^{2}\varepsilon^{4}. If d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, then \operatorname{Var}\left[Z\right]\leq\frac{1}{100}E[Z]^{2}.
When p\in\mathcal{C}, we have that
\mathbb{E}\left[Z\right]+\sqrt{3}\operatorname{Var}\left[Z\right]^{1/2}\leq\left(\frac{1}{500}+\sqrt{3}\left(\frac{1}{500000}\right)^{1/2}\right)m\varepsilon^{2}\leq\frac{1}{200}m\varepsilon^{2}. 
Thus, Chebyshev’s inequality gives
\Pr\left[Z\geq m\varepsilon^{2}/10\right]\leq\Pr\left[Z\geq m\varepsilon^{2}/200\right]\leq\Pr\left[Z-\mathbb{E}\left[Z\right]\geq\sqrt{3}\operatorname{Var}\left[Z\right]^{1/2}\right]\leq\frac{1}{3}. 
The case for d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon is similar. Here,
\mathbb{E}\left[Z\right]-\sqrt{3}\operatorname{Var}\left[Z\right]^{1/2}\geq\left(1-\sqrt{3}\left(\frac{1}{100}\right)^{1/2}\right)\mathbb{E}\left[Z\right]\geq 3m\varepsilon^{2}/20. 
Therefore,
\Pr\left[Z\leq m\varepsilon^{2}/10\right]\leq\Pr\left[Z\leq 3m\varepsilon^{2}/20\right]\leq\Pr\left[Z-\mathbb{E}\left[Z\right]\leq-\sqrt{3}\operatorname{Var}\left[Z\right]^{1/2}\right]\leq\frac{1}{3}. 
\hfill\qed
4 Lower Bounds
We now prove sharp lower bounds for the classes of distributions we consider. We show that the construction studied by Paninski [Paninski08] to prove lower bounds on testing uniformity can be used to prove lower bounds for the classes we consider. That work considers a class \mathcal{Q} consisting of 2^{n/2} distributions defined as follows. Without loss of generality assume that n is even. For each of the 2^{n/2} vectors z_{0}z_{1}\ldots z_{n/2-1}\in\{-1,1\}^{n/2}, define a distribution q\in\mathcal{Q} over [n] as follows.
\displaystyle q_{i}=\begin{cases}\frac{1+z_{\ell}c\varepsilon}{n}&\text{ for }i=2\ell+1\\ \frac{1-z_{\ell}c\varepsilon}{n}&\text{ for }i=2\ell.\\ \end{cases}  (3) 
Each distribution in \mathcal{Q} is at total variation distance c\varepsilon/2 from U_{n}, the uniform distribution over [n]. By choosing c to be an appropriate constant, Paninski [Paninski08] showed that a distribution picked uniformly at random from \mathcal{Q} cannot be distinguished from U_{n} with probability at least 2/3 using fewer than \Omega(\sqrt{n}/\varepsilon^{2}) samples.
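A generator for this lower-bound family is a one-liner (a sketch; `paninski_perturbation` is our name, and `c_eps` stands for the product c\varepsilon):

```python
import numpy as np

rng = np.random.default_rng(3)

def paninski_perturbation(n, c_eps):
    """Draw one distribution from the family Q: pair up the domain and
    perturb each pair by +/- c_eps/n according to a random sign z_l."""
    z = rng.choice([-1, 1], size=n // 2)
    q = np.empty(n)
    q[0::2] = (1 + z * c_eps) / n      # first point of each pair
    q[1::2] = (1 - z * c_eps) / n      # second point of each pair
    return q

q = paninski_perturbation(n=10, c_eps=0.4)

# Each member of Q is a valid distribution at total variation
# distance c_eps / 2 from uniform.
assert np.isclose(q.sum(), 1.0)
assert np.isclose(0.5 * np.abs(q - 0.1).sum(), 0.2)
```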
Suppose \mathcal{C} is a class of distributions such that

The uniform distribution U_{n} is in \mathcal{C},

For appropriately chosen c, d_{\mathrm{TV}}(\mathcal{C},\mathcal{Q})\geq\varepsilon,
then testing \mathcal{C} is no easier than distinguishing U_{n} from \mathcal{Q}. Invoking [Paninski08] immediately implies that testing the class \mathcal{C} requires \Omega(\sqrt{n}/\varepsilon^{2}) samples.
The lower bounds for all the one-dimensional distribution classes follow directly from this construction, and for testing monotonicity in higher dimensions, we extend this construction to d>1 appropriately. These arguments are proved in Section LABEL:sec:lbappendix, leading to the following lower bounds for testing these classes:
Theorem 3.

For any d\geq 1, any algorithm for testing monotonicity over [n]^{d} requires \Omega(n^{d/2}/\varepsilon^{2}) samples.

For d\geq 1, any algorithm for testing independence over [n_{1}]\times\cdots\times[n_{d}] requires \Omega\left(\frac{(n_{1}\cdot n_{2}\ldots\cdot n_{d})^{1/2}}{\varepsilon^{2}}\right) samples.

Any algorithm for testing unimodality, logconcavity, or monotone hazard rate over [n] requires \Omega(\sqrt{n}/\varepsilon^{2}) samples.
References
Appendix A Moments of the Chi-Squared Statistic
We analyze the mean and variance of the statistic
Z=\sum_{i\in\mathcal{A}}\frac{(X_{i}-mq_{i})^{2}-X_{i}}{mq_{i}}, 
where each X_{i} is independently distributed according to \mathrm{Poisson}(mp_{i}).
We start with the mean:
\displaystyle\mathbb{E}\left[Z\right]  \displaystyle=\sum_{i\in\mathcal{A}}\mathbb{E}\left[\frac{(X_{i}-mq_{i})^{2}-X_{i}}{mq_{i}}\right]  
\displaystyle=\sum_{i\in\mathcal{A}}\frac{\mathbb{E}\left[X_{i}^{2}\right]-2mq_{i}\mathbb{E}\left[X_{i}\right]+m^{2}q_{i}^{2}-\mathbb{E}\left[X_{i}\right]}{mq_{i}}  
\displaystyle=\sum_{i\in\mathcal{A}}\frac{m^{2}p_{i}^{2}+mp_{i}-2m^{2}q_{i}p_{i}+m^{2}q_{i}^{2}-mp_{i}}{mq_{i}}  
\displaystyle=m\sum_{i\in\mathcal{A}}\frac{(p_{i}-q_{i})^{2}}{q_{i}}  
\displaystyle=m\cdot\chi^{2}(p_{\mathcal{A}},q_{\mathcal{A}}) 
Next, we analyze the variance. Let \lambda_{i}=\mathbb{E}\left[X_{i}\right]=mp_{i} and \lambda_{i}^{\prime}=mq_{i}.
\displaystyle\operatorname{Var}\left[Z\right]  \displaystyle=\sum_{i\in\mathcal{A}}\frac{1}{\lambda_{i}^{\prime 2}}\operatorname{Var}\left[(X_{i}-\lambda_{i})^{2}+2(X_{i}-\lambda_{i})(\lambda_{i}-\lambda_{i}^{\prime})-(X_{i}-\lambda_{i})\right]  
\displaystyle=\sum_{i\in\mathcal{A}}\frac{1}{\lambda_{i}^{\prime 2}}\operatorname{Var}\left[(X_{i}-\lambda_{i})^{2}+(X_{i}-\lambda_{i})(2\lambda_{i}-2\lambda_{i}^{\prime}-1)\right]  
\displaystyle=\sum_{i\in\mathcal{A}}\frac{1}{\lambda_{i}^{\prime 2}}\left(\mathbb{E}\left[(X_{i}-\lambda_{i})^{4}+2(X_{i}-\lambda_{i})^{3}(2\lambda_{i}-2\lambda_{i}^{\prime}-1)+(X_{i}-\lambda_{i})^{2}(2\lambda_{i}-2\lambda_{i}^{\prime}-1)^{2}\right]-\lambda_{i}^{2}\right)  
\displaystyle=\sum_{i\in\mathcal{A}}\frac{1}{\lambda_{i}^{\prime 2}}[3\lambda_{i}^{2}+\lambda_{i}+2\lambda_{i}(2\lambda_{i}-2\lambda_{i}^{\prime}-1)+\lambda_{i}(2\lambda_{i}-2\lambda_{i}^{\prime}-1)^{2}-\lambda_{i}^{2}]  
\displaystyle=\sum_{i\in\mathcal{A}}\frac{1}{\lambda_{i}^{\prime 2}}[2\lambda_{i}^{2}+\lambda_{i}+4\lambda_{i}(\lambda_{i}-\lambda_{i}^{\prime})-2\lambda_{i}+\lambda_{i}(4(\lambda_{i}-\lambda_{i}^{\prime})^{2}-4(\lambda_{i}-\lambda_{i}^{\prime})+1)]  
\displaystyle=\sum_{i\in\mathcal{A}}\frac{1}{\lambda_{i}^{\prime 2}}[2\lambda_{i}^{2}+4\lambda_{i}(\lambda_{i}-\lambda_{i}^{\prime})^{2}]  
\displaystyle=\sum_{i\in\mathcal{A}}\left[2\frac{p_{i}^{2}}{q_{i}^{2}}+4m\cdot\frac{p_{i}\cdot(p_{i}-q_{i})^{2}}{q_{i}^{2}}\right]  (4) 
The third equality uses that the random variable inside the variance has expectation \lambda_{i}, and the fourth equality substitutes the values of the central moments of the Poisson distribution.
Appendix B Analysis of our \chi^{2}-Test Statistic
We first prove the key lemmas in the analysis of our \chi^{2}test.
We turn to the latter case. Recall that \mathcal{A}=\{i:q_{i}\geq\varepsilon/50n\}, and thus q(\bar{\mathcal{A}})\leq\varepsilon/50. We first show that d_{\mathrm{TV}}(p_{\mathcal{A}},q_{\mathcal{A}})\geq\frac{6\varepsilon}{25}, where p_{\mathcal{A}},q_{\mathcal{A}} are defined as above and, in our slight abuse of notation, we use d_{\mathrm{TV}}(p_{\mathcal{A}},q_{\mathcal{A}}) for non-probability vectors to denote \frac{1}{2}\|p_{\mathcal{A}}-q_{\mathcal{A}}\|_{1}.
Partitioning the support into \mathcal{A} and \bar{\mathcal{A}}, we have
\displaystyle d_{\mathrm{TV}}(p,q)=d_{\mathrm{TV}}(p_{\mathcal{A}},q_{\mathcal{A}})+d_{\mathrm{TV}}(p_{\bar{\mathcal{A}}},q_{\bar{\mathcal{A}}}).  (5) 
We consider the following cases separately:

p(\bar{\mathcal{A}})\leq\varepsilon/2: In this case,
\displaystyle d_{\mathrm{TV}}(p_{\bar{\mathcal{A}}},q_{\bar{\mathcal{A}}})=\frac{1}{2}\sum_{i\in\bar{\mathcal{A}}}|p_{i}-q_{i}|\leq\frac{1}{2}(p(\bar{\mathcal{A}})+q(\bar{\mathcal{A}}))\leq\frac{1}{2}\left(\frac{\varepsilon}{2}+\frac{\varepsilon}{50}\right)=\frac{13\varepsilon}{50}. Plugging this into (5), and using the fact that d_{\mathrm{TV}}(p,q)\geq\varepsilon, shows that d_{\mathrm{TV}}(p_{\mathcal{A}},q_{\mathcal{A}})\geq\frac{6\varepsilon}{25}.

p(\bar{\mathcal{A}})>\varepsilon/2: In this case, by the reverse triangle inequality,
\displaystyle d_{\mathrm{TV}}(p_{\mathcal{A}},q_{\mathcal{A}})\geq\frac{1}{2}(q(\mathcal{A})-p(\mathcal{A}))\geq\frac{1}{2}\left(\left(1-\frac{\varepsilon}{50}\right)-\left(1-\frac{\varepsilon}{2}\right)\right)=\frac{6\varepsilon}{25}.
By the CauchySchwarz inequality,
\displaystyle\chi^{2}(p_{\mathcal{A}},q_{\mathcal{A}})  \displaystyle\geq 4\frac{d_{\mathrm{TV}}(p_{\mathcal{A}},q_{\mathcal{A}})^{2}}{q(\mathcal{A})}  
\displaystyle\geq\frac{\varepsilon^{2}}{5}. 
We conclude by recalling (1).
\hfill\qed
We now bound the two terms of the variance (2) in turn. The first term can be bounded as follows:
\displaystyle 2\sum_{i\in\mathcal{A}}\frac{p_{i}^{2}}{q_{i}^{2}}  \displaystyle=2\sum_{i\in\mathcal{A}}\left(\frac{(p_{i}-q_{i})^{2}}{q_{i}^{2}}+\frac{2p_{i}q_{i}-q_{i}^{2}}{q_{i}^{2}}\right)  
\displaystyle=2\sum_{i\in\mathcal{A}}\left(\frac{(p_{i}-q_{i})^{2}}{q_{i}^{2}}+\frac{2q_{i}(p_{i}-q_{i})+q_{i}^{2}}{q_{i}^{2}}\right)  
\displaystyle\leq 2n+2\sum_{i\in\mathcal{A}}\left(\frac{(p_{i}-q_{i})^{2}}{q_{i}^{2}}+2\frac{(p_{i}-q_{i})}{q_{i}}\right)  
\displaystyle\leq 4n+4\sum_{i\in\mathcal{A}}\frac{(p_{i}-q_{i})^{2}}{q_{i}^{2}}  
\displaystyle\leq 4n+\frac{200n}{\varepsilon}\sum_{i\in\mathcal{A}}\frac{(p_{i}-q_{i})^{2}}{q_{i}}  
\displaystyle=4n+\frac{200n}{\varepsilon}\cdot\frac{\mathbb{E}[Z]}{m}  
\displaystyle\leq 4n+\frac{1}{100}\sqrt{n}\,\mathbb{E}[Z]  (6) 
The second inequality is the AM–GM inequality, the third inequality uses that q_{i}\geq\frac{\varepsilon}{50n} for all i\in\mathcal{A}, the last equality uses (1), and the final inequality substitutes a value m\geq 20000\frac{\sqrt{n}}{\varepsilon^{2}}.
The second term can be similarly bounded:
\displaystyle 4m\sum_{i\in\mathcal{A}}\frac{p_{i}(p_{i}-q_{i})^{2}}{q_{i}^{2}}  \displaystyle\leq 4m\left(\sum_{i\in\mathcal{A}}\frac{p_{i}^{2}}{q_{i}^{2}}\right)^{1/2}\left(\sum_{i\in\mathcal{A}}\frac{(p_{i}-q_{i})^{4}}{q_{i}^{2}}\right)^{1/2}  
\displaystyle\leq 4m\left(4n+\frac{1}{100}\sqrt{n}\,\mathbb{E}[Z]\right)^{1/2}\left(\sum_{i\in\mathcal{A}}\frac{(p_{i}-q_{i})^{4}}{q_{i}^{2}}\right)^{1/2}  
\displaystyle\leq 4m\left(2\sqrt{n}+\frac{1}{10}n^{1/4}\,\mathbb{E}[Z]^{1/2}\right)\left(\sum_{i\in\mathcal{A}}\frac{(p_{i}-q_{i})^{2}}{q_{i}}\right)  
\displaystyle=\left(8\sqrt{n}+\frac{2}{5}n^{1/4}\,\mathbb{E}[Z]^{1/2}\right)\mathbb{E}[Z] 
The first inequality is CauchySchwarz, the second inequality uses (6), the third inequality uses the monotonicity of the \ell_{p} norms, and the equality uses (1).
Combining the two terms, we get
\operatorname{Var}\left[Z\right]\leq 4n+9\sqrt{n}\,\mathbb{E}\left[Z\right]+\frac{2}{5}n^{1/4}\mathbb{E}\left[Z\right]^{3/2}. 
We now consider the two cases in the statement of our lemma.

When p\in\mathcal{C}, we know from Lemma 2 that \mathbb{E}\left[Z\right]\leq\frac{1}{500}m\varepsilon^{2}. Combined with a choice of m\geq 20000\frac{\sqrt{n}}{\varepsilon^{2}} and the above expression for the variance, this gives:
\operatorname{Var}\left[Z\right]\leq\frac{4}{20000^{2}}m^{2}\varepsilon^{4}+\frac{9}{20000\cdot 500}m^{2}\varepsilon^{4}+\frac{\sqrt{10}}{12500000}m^{2}\varepsilon^{4}\leq\frac{1}{500000}m^{2}\varepsilon^{4}. 
When d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, Lemma 2 and m\geq 20000\frac{\sqrt{n}}{\varepsilon^{2}} give
\mathbb{E}\left[Z\right]\geq\frac{1}{5}m\varepsilon^{2}\geq 4000\sqrt{n}. Combining this with our expression for the variance, we get:
\operatorname{Var}\left[Z\right]\leq\frac{4}{4000^{2}}\mathbb{E}\left[Z\right]^{2}+\frac{9}{4000}\mathbb{E}\left[Z\right]^{2}+\frac{2}{5\sqrt{4000}}\mathbb{E}\left[Z\right]^{2}\leq\frac{1}{100}\mathbb{E}\left[Z\right]^{2}.
\hfill\qed