# Testing Everything

EECS, MIT
costis@mit.edu
Gautam Kamath
EECS, MIT
g@csail.mit.edu
###### Abstract

Given samples from an unknown distribution p, how do we answer basic questions of the form: is p monotone? Is it unimodal? Is it log-concave? These problems have received tremendous attention in statistics, with emphasis on asymptotic analysis. Over the past decade, a number of researchers have studied these problems in a computer science framework, with a focus on designing algorithms whose sample complexity and running time are as small as possible.

Surprisingly, for some of the most basic problems, such as testing monotonicity over a discrete domain such as [n] or the hypergrid [n]^{d}, the known algorithms have highly sub-optimal sample complexity. For example, testing monotonicity over the hypergrid [n]^{d} previously required \tilde{O}(n^{d-1/2}) samples, while the best known lower bound is \Omega(n^{d/2}).

We propose a general framework that resolves the problem of testing for a wide range of classes. In particular, our algorithms are provably information-theoretically optimal and, for many of the classes considered, highly efficient. We optimally resolve the problems of testing monotonicity, log-concavity, monotone hazard rate, and distributions with a small number of modes. As an example, we can test monotonicity over the hypergrid with O(n^{d/2}) samples and time.

At a technical level, we give a two-step approach to testing. The first is a learning step, which tries to approximate the underlying distribution with one from the class of interest. The second step is a simple modification of the \chi^{2} test. Our algorithms are simple to implement, and we compare them empirically in the section on experiments.

## 1 Preliminaries

We use the following probability distances in our paper.

###### Definition 1.

The total variation distance between distributions p and q is defined as

 d_{\mathrm{TV}}(p,q)\stackrel{\mathrm{def}}{=}\sup_{A}|p(A)-q(A)|=\frac{1}{2}\|p-q\|_{1}.

For a subset of the domain, the total variation distance is defined as half of the \ell_{1} distance restricted to the subset.

###### Definition 2.

The \chi^{2}-distance between p and q over [n] is defined by

 \chi^{2}(p,q)\stackrel{\mathrm{def}}{=}\sum_{i\in[n]}\frac{(p_{i}-q_{i})^{2}}{q_{i}}=\left[\sum_{i\in[n]}\frac{p_{i}^{2}}{q_{i}}\right]-1.
###### Definition 3.

The Kolmogorov distance between two probability measures p and q over an ordered set (e.g., \mathbb{R}) with cumulative distribution functions (CDFs) F_{p} and F_{q} is defined as

 d_{\mathrm{K}}(p,q)\stackrel{\mathrm{def}}{=}\sup_{x\in\mathbb{R}}|F_{p}(x)-F_{q}(x)|.
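For concreteness, the three distances above can be computed for distributions over [n] as follows (an illustrative sketch; the NumPy-based helper names are ours):

```python
import numpy as np

def tv_distance(p, q):
    # d_TV(p, q) = (1/2) * ||p - q||_1  (Definition 1)
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def chi2_distance(p, q):
    # chi^2(p, q) = sum_i (p_i - q_i)^2 / q_i  (Definition 2)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return ((p - q) ** 2 / q).sum()

def kolmogorov_distance(p, q):
    # d_K(p, q) = sup_x |F_p(x) - F_q(x)|, via the CDFs of two
    # discrete distributions on a common ordered support (Definition 3)
    return np.abs(np.cumsum(p) - np.cumsum(q)).max()
```

On any pair of distributions, these values satisfy the chain of inequalities in Proposition 1 below.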

Our paper is primarily concerned with testing against classes of distributions, defined formally as follows:

###### Definition 4.

Given \varepsilon\in(0,1] and sample access to a distribution p, an algorithm is said to test a class \mathcal{C} if it has the following guarantees:

• If p\in\mathcal{C}, the algorithm outputs Accept with probability at least 2/3;

• If d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, the algorithm outputs Reject with probability at least 2/3.

The Dvoretzky-Kiefer-Wolfowitz (DKW) inequality gives a generic algorithm for learning any distribution with respect to the Kolmogorov distance [DvoretzkyKW56].

###### Lemma 1.

(See [DvoretzkyKW56, Massart90]) Suppose we have n i.i.d. samples X_{1},\dots,X_{n} from a distribution with CDF F. Let F_{n}(x)\stackrel{\mathrm{def}}{=}\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{\{X_{i}\leq x\}} be the empirical CDF. Then \Pr[d_{\mathrm{K}}(F,F_{n})\geq\varepsilon]\leq 2e^{-2n\varepsilon^{2}}. In particular, if n=\Omega((1/\varepsilon^{2})\cdot\log(1/\delta)), then \Pr[d_{\mathrm{K}}(F,F_{n})\geq\varepsilon]\leq\delta.
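As an illustration, the empirical CDF and the sample size suggested by the DKW tail bound can be sketched as follows (the function names are ours; `dkw_sample_bound` simply inverts the bound 2e^{-2n\varepsilon^{2}}\leq\delta):

```python
import numpy as np

def empirical_cdf(samples):
    # F_n(x) = (1/n) * #{i : X_i <= x}, returned as a callable.
    xs = np.sort(np.asarray(samples))
    n = len(xs)
    return lambda x: np.searchsorted(xs, x, side="right") / n

def dkw_sample_bound(eps, delta):
    # Smallest n with 2 * exp(-2 * n * eps^2) <= delta, i.e.
    # n = ceil(log(2 / delta) / (2 * eps^2)).
    return int(np.ceil(np.log(2.0 / delta) / (2.0 * eps ** 2)))
```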

We note the following useful relationships between these distances [GibbsS02]:

###### Proposition 1.

d_{\mathrm{K}}(p,q)^{2}\leq d_{\mathrm{TV}}(p,q)^{2}\leq\frac{1}{4}\chi^{2}(p,q).

In this paper, we will consider the following classes of distributions:

• Monotone distributions over [n]^{d} (denoted by \mathcal{M}_{n}^{d}), for which i\preceq j implies f_{i}\geq f_{j} (this definition describes monotone non-increasing distributions; by symmetry, identical results hold for monotone non-decreasing distributions);

• Unimodal distributions over [n] (denoted by \mathcal{U}_{n}), for which there exists an i^{*} such that f_{i} is non-decreasing for i\leq i^{*} and non-increasing for i\geq i^{*};

• Log-concave distributions over [n] (denoted by \mathcal{LCD}_{n}), the sub-class of unimodal distributions for which f_{i-1}f_{i+1}\leq f_{i}^{2};

• Monotone hazard rate (MHR) distributions over [n] (denoted by \mathcal{MHR}_{n}), for which i<j implies \frac{f_{i}}{1-F_{i}}\leq\frac{f_{j}}{1-F_{j}}.

###### Definition 5.

An \eta-effective support of a distribution p is any set S such that p(S)\geq 1-\eta.

The flattening of a function f over a subset S of its domain is the function \bar{f} such that \bar{f}_{i}=f(S)/|S| for all i\in S, where f(S)\stackrel{\mathrm{def}}{=}\sum_{i\in S}f_{i}.

###### Definition 6.

Let p be a distribution, and suppose I_{1},\ldots,I_{k} is a partition of the domain into intervals. The flattening of p with respect to I_{1},\ldots,I_{k} is the distribution \bar{p} obtained by flattening p over each interval I_{j}.
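A small sketch of the flattening operation for a distribution over [n] (the function name and interval encoding are ours):

```python
import numpy as np

def flatten(p, intervals):
    # Flattening of p with respect to a partition into intervals:
    # the mass p(I_j) of each interval is spread uniformly over it.
    p = np.asarray(p, dtype=float)
    flat = np.empty_like(p)
    for lo, hi in intervals:  # each interval is the index range [lo, hi)
        flat[lo:hi] = p[lo:hi].sum() / (hi - lo)
    return flat
```

Note that flattening preserves the mass of each interval, so the result is again a distribution.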

### Poisson Sampling

Throughout this paper, we use the standard Poissonization approach. Instead of drawing exactly m samples from a distribution p, we first draw m^{\prime}\sim\mathrm{Poisson}(m), and then draw m^{\prime} samples from p. As a result, the numbers of times the different elements in the support of p occur in the sample become independent, giving much simpler analyses. In particular, the number of times we observe domain element i is distributed as \mathrm{Poisson}(mp_{i}), independently for each i. Since \mathrm{Poisson}(m) is tightly concentrated around m, this additional flexibility comes at only a sub-constant cost in the sample complexity, along with an additive increase in the error probability that is inversely exponential in m.
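The Poissonized sampling process above can be simulated directly: rather than drawing individual samples, each domain element's count is drawn as an independent Poisson(m p_i) variable (a sketch using NumPy's random generator; the function name is ours):

```python
import numpy as np

def poissonized_counts(p, m, seed=0):
    # Under Poissonization, the number of occurrences of domain
    # element i is Poisson(m * p_i), independently across i.
    rng = np.random.default_rng(seed)
    return rng.poisson(m * np.asarray(p, dtype=float))
```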

## 2 Overview

Our algorithm for testing a distribution p can be decomposed into three steps.

### Near-proper learning in \chi^{2}-distance.

Our first step requires a learning algorithm with very specific guarantees. In proper learning, we are given sample access to a distribution p\in\mathcal{C}, where \mathcal{C} is some class of distributions, and we wish to output q\in\mathcal{C} such that p and q are close in total variation distance. In our setting, given sample access to p\in\mathcal{C}, we wish to output q such that q is close to \mathcal{C} in total variation distance, and p and q are close in \chi^{2}-distance on an effective support of p (we also require the algorithm to output a description of an effective support for which this property holds; this requirement can be slightly relaxed, as we show in our results for testing unimodality). From an information-theoretic standpoint, this problem is harder than proper learning, since closeness in \chi^{2}-distance is more stringent than closeness in total variation distance. Nonetheless, this problem can be shown to have sample complexity comparable to proper learning for the structured classes we consider in this paper.

### Computation of distance to class.

The next step is to check whether the hypothesis q is close to the class \mathcal{C}. Since we have an explicit description of q, this step requires no further samples from p, i.e., it is purely computational. If we find that q is far from the class \mathcal{C}, then it must be that p\not\in\mathcal{C}, as otherwise the guarantees from the previous step would imply that q is close to \mathcal{C}. Thus, if q is far from \mathcal{C}, we can output Reject and terminate the algorithm at this point.

### \chi^{2}-testing.

At this point, the previous two steps guarantee that our distribution q is such that:

• If p\in\mathcal{C}, then p and q are close in \chi^{2} distance on a (known) effective support of p;

• If d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, then p and q are far in total variation distance.

We can distinguish between these two cases using O(\sqrt{n}/\varepsilon^{2}) samples with a simple statistical \chi^{2}-test, which we describe in Section 3.

Using the above three-step approach, our tester, as described in the next section, can directly test monotonicity, log-concavity, and monotone hazard rate. With an extra trick, using Kolmogorov’s max inequality, it can also test unimodality.

## 3 A Robust \chi^{2}-\ell_{1} Identity Test

Our main result in this section is Theorem 2. As an immediate corollary, we obtain the following result on testing whether an unknown distribution is close in \chi^{2}-distance or far in \ell_{1} distance to a known distribution:

###### Theorem 1.

For a known distribution q over [n], there exists an algorithm that, given \varepsilon\in(0,1] and sample access to a distribution p over [n], has sample complexity

 O(\sqrt{n}/\varepsilon^{2})

and distinguishes between the cases

• \chi^{2}(p,q)<\varepsilon^{2}/10    versus

• \|p-q\|_{1}>\varepsilon

with probability at least 5/6.

This theorem follows from our main result of this section, stated next, slightly more generally for classes of distributions.

###### Theorem 2.

Suppose we are given \varepsilon\in(0,1], a class of probability distributions \mathcal{C}, sample access to a distribution p over [n], and an explicit description of a distribution q with the following properties:

1. d_{\mathrm{TV}}(q,\mathcal{C})\leq\frac{\varepsilon}{2}.

2. If p\in\mathcal{C}, then \chi^{2}(p,q)\leq\frac{\varepsilon^{2}}{500}.

Then there exists an algorithm with the following guarantees:

• If p\in\mathcal{C}, the algorithm outputs Accept with probability at least 2/3;

• If d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, the algorithm outputs Reject with probability at least 2/3.

The time and sample complexity of this algorithm are O\left(\frac{\sqrt{n}}{\varepsilon^{2}}\right).

###### Remark 1.

As stated, condition 2 of Theorem 2 requires that q is O(\varepsilon^{2})-close in \chi^{2}-distance to p over its entire domain. For the class of monotone distributions, we are able to efficiently obtain such a q, which immediately implies sample-optimal learning algorithms for this class. However, for some classes, we cannot learn a q with such strong guarantees, and we must consider modifications to our base testing algorithm.

For example, for log-concave and monotone hazard rate distributions, we can obtain a distribution q and a set S with the following guarantees:

• If p\in\mathcal{C}, then \chi^{2}(p_{S},q_{S})\leq O(\varepsilon^{2}) and p(S)\geq 1-O(\varepsilon);

• If d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, then d_{\mathrm{TV}}(p,q)\geq\varepsilon/2.

In this scenario, the tester will simply pretend the support of p and q is S, ignoring any samples and support elements in [n]\setminus S. Analysis of this tester is extremely similar to what we present below. In particular, we can still show that the statistic Z will be separated in the two cases. When p\in\mathcal{C}, excluding [n]\setminus S will only reduce Z. On the other hand, when d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, since p(S)\geq 1-O(\varepsilon), p and q must still be far on the remaining support, and we can show that Z is still sufficiently large. Therefore, a small modification allows us to handle this case with the same sample complexity of O(\sqrt{n}/\varepsilon^{2}).

A further modification can handle even weaker learning guarantees. We could handle the previous case because the tester “knows what we don’t know” – it can explicitly ignore the support over which we do not have a \chi^{2}-closeness guarantee. A more difficult case is when there may be a low measure interval hidden in our effective support, over which p and q have a large \chi^{2}-distance. While we may have insufficient samples to reliably identify this interval, it may still have a large effect on our statistic. A naive solution would be to consider a tester which tries all possible “guesses” for this “bad” interval, but a union bound would incur an extra logarithmic factor in the sample complexity. We manage to avoid this cost through a careful analysis involving Kolmogorov’s max inequality, maintaining the O(\sqrt{n}/\varepsilon^{2}) sample complexity even in this more difficult case.

Being more precise, we can handle cases where we can obtain a distribution q and a set of intervals S=\{I_{1},\dots,I_{b}\} with the following guarantees:

• If p\in\mathcal{C}, then p(S)\geq 1-O(\varepsilon), p(I_{j})=\Theta(p(S)/b) for all j\in[b], and there exists a set T\subseteq[b] such that |T|\geq b-t (for t=O(1)) and \chi^{2}(p_{R},q_{R})\leq O(\varepsilon^{2}), where R=\cup_{j\in T}I_{j};

• If d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, then d_{\mathrm{TV}}(p,q)\geq\varepsilon/2.

This allows us to additionally test against the class of unimodal distributions.

The tester requires that an effective support is divided into several intervals of roughly equal measure. It computes our statistic over each of these intervals, and we let our statistic Z be the sum of all but the largest t of these values. In the case when p\in\mathcal{C}, Z will only become smaller by performing this operation. We use Kolmogorov’s maximal inequality to show that Z remains large when d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon. More details on this tester are provided in Section LABEL:sec:unimodal-appendix.

Proof of Theorem 2: Theorem 2 is proven by analyzing Algorithm 1. As shown in Section A, Z has the following mean and variance:

 \mathbb{E}\left[Z\right]=m\cdot\sum_{i\in\mathcal{A}}\frac{(p_{i}-q_{i})^{2}}{q_{i}}=m\cdot\chi^{2}(p_{\mathcal{A}},q_{\mathcal{A}}) (1)
 \operatorname{Var}\left[Z\right]=\sum_{i\in\mathcal{A}}\left[2\frac{p_{i}^{2}}{q_{i}^{2}}+4m\cdot\frac{p_{i}\cdot(p_{i}-q_{i})^{2}}{q_{i}^{2}}\right] (2)

where by p_{\mathcal{A}} and q_{\mathcal{A}} we denote respectively the vectors p and q restricted to the coordinates in \mathcal{A}, and we slightly abuse notation when we write \chi^{2}(p_{\mathcal{A}},q_{\mathcal{A}}), as these do not then correspond to probability distributions.

Lemma 2 demonstrates the separation in the means of the statistic Z in the two cases of interest, i.e., p\in\mathcal{C} versus d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, and Lemma 3 shows the separation in the variances in the two cases. These two results are proved in Section B.

###### Lemma 2.

If p\in\mathcal{C}, then \mathbb{E}\left[Z\right]\leq\frac{1}{500}m\varepsilon^{2}. If d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, then \mathbb{E}\left[Z\right]\geq\frac{1}{5}m\varepsilon^{2}.

###### Lemma 3.

If p\in\mathcal{C}, then \operatorname{Var}\left[Z\right]\leq\frac{1}{500000}m^{2}\varepsilon^{4}. If d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, then \operatorname{Var}\left[Z\right]\leq\frac{1}{100}E[Z]^{2}.

Assuming Lemmas 2 and 3, Theorem 2 is now a simple application of Chebyshev’s inequality.

When p\in\mathcal{C}, we have that

 \mathbb{E}\left[Z\right]+\sqrt{3}\operatorname{Var}\left[Z\right]^{1/2}\leq\left(\frac{1}{500}+\sqrt{3}\left(\frac{1}{500000}\right)^{1/2}\right)m\varepsilon^{2}\leq\frac{1}{200}m\varepsilon^{2}.

Thus, Chebyshev’s inequality gives

 \Pr\left[Z\geq m\varepsilon^{2}/10\right]\leq\Pr\left[Z\geq m\varepsilon^{2}/200\right]\leq\Pr\left[Z-\mathbb{E}\left[Z\right]\geq\sqrt{3}\operatorname{Var}\left[Z\right]^{1/2}\right]\leq\frac{1}{3}.

The case for d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon is similar. Here,

 \mathbb{E}\left[Z\right]-\sqrt{3}\operatorname{Var}\left[Z\right]^{1/2}\geq\left(1-\sqrt{3}\left(\frac{1}{100}\right)^{1/2}\right)\mathbb{E}\left[Z\right]\geq 3m\varepsilon^{2}/20.

Therefore,

 \Pr\left[Z\leq m\varepsilon^{2}/10\right]\leq\Pr\left[Z\leq 3m\varepsilon^{2}/20\right]\leq\Pr\left[Z-\mathbb{E}\left[Z\right]\leq-\sqrt{3}\operatorname{Var}\left[Z\right]^{1/2}\right]\leq\frac{1}{3}.

\hfill\qed
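For reference, the statistic and acceptance threshold analyzed in the proof above can be sketched as follows, restricting to \mathcal{A}=\{i:q_{i}\geq\varepsilon/(50n)\} as in the proof of Lemma 2 (an illustration of the test step only, not the full tester with its learning step; the function name is ours):

```python
import numpy as np

def chi2_test(counts, q, m, eps):
    # Z = sum_{i in A} ((X_i - m q_i)^2 - X_i) / (m q_i), where
    # X_i are Poissonized counts and A = {i : q_i >= eps / (50 n)}.
    # Accept iff Z < m * eps^2 / 10, as in the proof of Theorem 2.
    q = np.asarray(q, dtype=float)
    n = len(q)
    A = q >= eps / (50.0 * n)
    X, mq = np.asarray(counts, dtype=float)[A], m * q[A]
    Z = (((X - mq) ** 2 - X) / mq).sum()
    return "Accept" if Z < m * eps ** 2 / 10.0 else "Reject"
```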

## 4 Lower Bounds

We now prove sharp lower bounds for the classes of distributions we consider. We show that the construction used by Paninski [Paninski08] to prove lower bounds for testing uniformity can also be used to prove lower bounds for the classes we consider. This construction gives a class \mathcal{Q} consisting of 2^{n/2} distributions, defined as follows. Without loss of generality, assume that n is even. For each of the 2^{n/2} vectors z_{0}z_{1}\ldots z_{n/2-1}\in\{-1,1\}^{n/2}, define a distribution q\in\mathcal{Q} over [n] as follows.

 \displaystyle q_{i}=\begin{cases}\frac{(1+z_{\ell}c\varepsilon)}{n}&\text{ for }i=2\ell+1\\ \frac{(1-z_{\ell}c\varepsilon)}{n}&\text{ for }i=2\ell.\end{cases} (3)

Each distribution in \mathcal{Q} is at total variation distance c\varepsilon/2 from U_{n}, the uniform distribution over [n]. By choosing c to be an appropriate constant, Paninski [Paninski08] showed that \Omega(\sqrt{n}/\varepsilon^{2}) samples are required to distinguish, with probability at least 2/3, U_{n} from a distribution picked uniformly at random from \mathcal{Q}.
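The construction in (3) is easy to instantiate (a sketch; the helper name is ours):

```python
import numpy as np

def paninski_distribution(z, c, eps):
    # Given z in {-1, +1}^{n/2}, perturb the uniform distribution:
    # q_{2l+1} = (1 + z_l * c * eps) / n,  q_{2l} = (1 - z_l * c * eps) / n.
    z = np.asarray(z, dtype=float)
    n = 2 * len(z)
    q = np.empty(n)
    q[1::2] = (1 + z * c * eps) / n
    q[0::2] = (1 - z * c * eps) / n
    return q
```

Each such q deviates from 1/n by exactly c\varepsilon/n in every coordinate, so its total variation distance from the uniform distribution is c\varepsilon/2.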

Suppose \mathcal{C} is a class of distributions such that

• The uniform distribution U_{n} is in \mathcal{C},

• For appropriately chosen c, d_{\mathrm{TV}}(\mathcal{C},\mathcal{Q})\geq\varepsilon,

then testing \mathcal{C} is not easier than distinguishing U_{n} from \mathcal{Q}. Invoking [Paninski08] immediately implies that testing the class \mathcal{C} requires \Omega(\sqrt{n}/\varepsilon^{2}) samples.

The lower bounds for all the one dimensional distributions will follow directly from this construction, and for testing monotonicity in higher dimensions, we extend this construction to d\geq 1, appropriately. These arguments are proved in Section LABEL:sec:lb-appendix, leading to the following lower bounds for testing these classes:

###### Theorem 3.

• For any d\geq 1, any algorithm for testing monotonicity over [n]^{d} requires \Omega(n^{d/2}/\varepsilon^{2}) samples.

• For d\geq 1, any algorithm for testing independence over [n_{1}]\times\cdots\times[n_{d}] requires \Omega\left(\frac{(n_{1}\cdot n_{2}\ldots\cdot n_{d})^{1/2}}{\varepsilon^{2}}\right) samples.

• Any algorithm for testing unimodality, log-concavity, or monotone hazard rate over [n] requires \Omega(\sqrt{n}/\varepsilon^{2}) samples.

## Appendix A Moments of the Chi-Squared Statistic

We analyze the mean and variance of the statistic

 Z=\sum_{i\in\mathcal{A}}\frac{(X_{i}-mq_{i})^{2}-X_{i}}{mq_{i}},

where each X_{i} is independently distributed as \mathrm{Poisson}(mp_{i}).

 \displaystyle\mathbb{E}\left[Z\right] \displaystyle=\sum_{i\in\mathcal{A}}\mathbb{E}\left[\frac{(X_{i}-mq_{i})^{2}-X_{i}}{mq_{i}}\right]
 \displaystyle=\sum_{i\in\mathcal{A}}\frac{\mathbb{E}\left[X_{i}^{2}\right]-2mq_{i}\mathbb{E}\left[X_{i}\right]+m^{2}q_{i}^{2}-\mathbb{E}\left[X_{i}\right]}{mq_{i}}
 \displaystyle=\sum_{i\in\mathcal{A}}\frac{m^{2}p_{i}^{2}+mp_{i}-2m^{2}q_{i}p_{i}+m^{2}q_{i}^{2}-mp_{i}}{mq_{i}}
 \displaystyle=m\sum_{i\in\mathcal{A}}\frac{(p_{i}-q_{i})^{2}}{q_{i}}
 \displaystyle=m\cdot\chi^{2}(p_{\mathcal{A}},q_{\mathcal{A}})

Next, we analyze the variance. Let \lambda_{i}=\mathbb{E}\left[X_{i}\right]=mp_{i} and \lambda_{i}^{\prime}=mq_{i}.

 \displaystyle\operatorname{Var}\left[Z\right] \displaystyle=\sum_{i\in\mathcal{A}}\frac{1}{\lambda_{i}^{\prime 2}}\operatorname{Var}\left[(X_{i}-\lambda_{i})^{2}+2(X_{i}-\lambda_{i})(\lambda_{i}-\lambda_{i}^{\prime})-(X_{i}-\lambda_{i})\right]
 \displaystyle=\sum_{i\in\mathcal{A}}\frac{1}{\lambda_{i}^{\prime 2}}\operatorname{Var}\left[(X_{i}-\lambda_{i})^{2}+(X_{i}-\lambda_{i})(2\lambda_{i}-2\lambda_{i}^{\prime}-1)\right]
 \displaystyle=\sum_{i\in\mathcal{A}}\frac{1}{\lambda_{i}^{\prime 2}}\mathbb{E}\left[(X_{i}-\lambda_{i})^{4}+2(X_{i}-\lambda_{i})^{3}(2\lambda_{i}-2\lambda_{i}^{\prime}-1)+(X_{i}-\lambda_{i})^{2}(2\lambda_{i}-2\lambda_{i}^{\prime}-1)^{2}-\lambda_{i}^{2}\right]
 \displaystyle=\sum_{i\in\mathcal{A}}\frac{1}{\lambda_{i}^{\prime 2}}\left[3\lambda_{i}^{2}+\lambda_{i}+2\lambda_{i}(2\lambda_{i}-2\lambda_{i}^{\prime}-1)+\lambda_{i}(2\lambda_{i}-2\lambda_{i}^{\prime}-1)^{2}-\lambda_{i}^{2}\right]
 \displaystyle=\sum_{i\in\mathcal{A}}\frac{1}{\lambda_{i}^{\prime 2}}\left[2\lambda_{i}^{2}+\lambda_{i}+4\lambda_{i}(\lambda_{i}-\lambda_{i}^{\prime})-2\lambda_{i}+\lambda_{i}(4(\lambda_{i}-\lambda_{i}^{\prime})^{2}-4(\lambda_{i}-\lambda_{i}^{\prime})+1)\right]
 \displaystyle=\sum_{i\in\mathcal{A}}\frac{1}{\lambda_{i}^{\prime 2}}\left[2\lambda_{i}^{2}+4\lambda_{i}(\lambda_{i}-\lambda_{i}^{\prime})^{2}\right]
 \displaystyle=\sum_{i\in\mathcal{A}}\left[2\frac{p_{i}^{2}}{q_{i}^{2}}+4m\cdot\frac{p_{i}\cdot(p_{i}-q_{i})^{2}}{q_{i}^{2}}\right] (4)

The third equality uses the identity \operatorname{Var}[W]=\mathbb{E}[W^{2}]-\mathbb{E}[W]^{2} together with the observation that the random variable inside the variance has expectation \lambda_{i}, and the fourth equality substitutes the central moments of the Poisson distribution.

## Appendix B Analysis of our \chi^{2}-Test Statistic

We first prove the key lemmas in the analysis of our \chi^{2}-test.

Proof of Lemma 2: The former case follows directly from (1) and property 2 of q.

We turn to the latter case. Recall that \mathcal{A}=\{i:q_{i}\geq\varepsilon/(50n)\}, and thus q(\bar{\mathcal{A}})\leq\varepsilon/50. We first show that d_{\mathrm{TV}}(p_{\mathcal{A}},q_{\mathcal{A}})\geq\frac{6\varepsilon}{25}, where p_{\mathcal{A}},q_{\mathcal{A}} are defined as above, and in our slight abuse of notation we use d_{\mathrm{TV}}(p_{\mathcal{A}},q_{\mathcal{A}}) for non-probability vectors to denote \frac{1}{2}\|p_{\mathcal{A}}-q_{\mathcal{A}}\|_{1}.

Partitioning the support into \mathcal{A} and \bar{\mathcal{A}}, we have

 \displaystyle d_{\mathrm{TV}}(p,q)=d_{\mathrm{TV}}(p_{\mathcal{A}},q_{\mathcal{A}})+d_{\mathrm{TV}}(p_{\bar{\mathcal{A}}},q_{\bar{\mathcal{A}}}). (5)

We consider the following cases separately:

• p(\bar{\mathcal{A}})\leq\varepsilon/2: In this case,

 \displaystyle d_{\mathrm{TV}}(p_{\bar{\mathcal{A}}},q_{\bar{\mathcal{A}}})=\frac{1}{2}\sum_{i\in\bar{\mathcal{A}}}|p_{i}-q_{i}|\leq\frac{1}{2}(p(\bar{\mathcal{A}})+q(\bar{\mathcal{A}}))\leq\frac{1}{2}\left(\frac{\varepsilon}{2}+\frac{\varepsilon}{50}\right)=\frac{13\varepsilon}{50}.

Plugging this into (5), and using the fact that d_{\mathrm{TV}}(p,q)\geq\varepsilon/2 (which follows from d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, d_{\mathrm{TV}}(q,\mathcal{C})\leq\varepsilon/2, and the triangle inequality), shows that d_{\mathrm{TV}}(p_{\mathcal{A}},q_{\mathcal{A}})\geq\frac{\varepsilon}{2}-\frac{13\varepsilon}{50}=\frac{6\varepsilon}{25}.

• p(\bar{\mathcal{A}})>\varepsilon/2: In this case, by the reverse triangle inequality,

 \displaystyle d_{\mathrm{TV}}(p_{\mathcal{A}},q_{\mathcal{A}})\geq\frac{1}{2}(q(\mathcal{A})-p(\mathcal{A}))\geq\frac{1}{2}((1-\varepsilon/50)-(1-\varepsilon/2))=\frac{6\varepsilon}{25}.

By the Cauchy-Schwarz inequality,

 \displaystyle\chi^{2}(p_{\mathcal{A}},q_{\mathcal{A}})\geq 4\frac{d_{\mathrm{TV}}(p_{\mathcal{A}},q_{\mathcal{A}})^{2}}{q(\mathcal{A})}\geq\frac{\varepsilon^{2}}{5}.

We conclude by recalling (1).

\hfill\qed

Proof of Lemma 3: We bound the terms of (2) separately, starting with the first.

 \displaystyle 2\sum_{i\in\mathcal{A}}\frac{p_{i}^{2}}{q_{i}^{2}} \displaystyle=2\sum_{i\in\mathcal{A}}\left(\frac{(p_{i}-q_{i})^{2}}{q_{i}^{2}}+\frac{2p_{i}q_{i}-q_{i}^{2}}{q_{i}^{2}}\right)
 \displaystyle=2\sum_{i\in\mathcal{A}}\left(\frac{(p_{i}-q_{i})^{2}}{q_{i}^{2}}+\frac{2q_{i}(p_{i}-q_{i})+q_{i}^{2}}{q_{i}^{2}}\right)
 \displaystyle\leq 2n+2\sum_{i\in\mathcal{A}}\left(\frac{(p_{i}-q_{i})^{2}}{q_{i}^{2}}+2\frac{(p_{i}-q_{i})}{q_{i}}\right)
 \displaystyle\leq 4n+4\sum_{i\in\mathcal{A}}\frac{(p_{i}-q_{i})^{2}}{q_{i}^{2}}
 \displaystyle\leq 4n+\frac{200n}{\varepsilon}\sum_{i\in\mathcal{A}}\frac{(p_{i}-q_{i})^{2}}{q_{i}}
 \displaystyle=4n+\frac{200n}{\varepsilon}\cdot\frac{\mathbb{E}[Z]}{m}
 \displaystyle\leq 4n+\frac{1}{100}\sqrt{n}\,\mathbb{E}[Z] (6)

The second inequality is the AM-GM inequality, the third inequality uses that q_{i}\geq\frac{\varepsilon}{50n} for all i\in\mathcal{A}, the last equality uses (1), and the final inequality substitutes a value m\geq 20000\frac{\sqrt{n}}{\varepsilon^{2}}.

The second term can be similarly bounded:

 \displaystyle 4m\sum_{i\in\mathcal{A}}\frac{p_{i}(p_{i}-q_{i})^{2}}{q_{i}^{2}} \displaystyle\leq 4m\left(\sum_{i\in\mathcal{A}}\frac{p_{i}^{2}}{q_{i}^{2}}\right)^{1/2}\left(\sum_{i\in\mathcal{A}}\frac{(p_{i}-q_{i})^{4}}{q_{i}^{2}}\right)^{1/2}
 \displaystyle\leq 4m\left(4n+\frac{1}{100}\sqrt{n}\,\mathbb{E}[Z]\right)^{1/2}\left(\sum_{i\in\mathcal{A}}\frac{(p_{i}-q_{i})^{4}}{q_{i}^{2}}\right)^{1/2}
 \displaystyle\leq 4m\left(2\sqrt{n}+\frac{1}{10}n^{1/4}\mathbb{E}[Z]^{1/2}\right)\left(\sum_{i\in\mathcal{A}}\frac{(p_{i}-q_{i})^{2}}{q_{i}}\right)
 \displaystyle=\left(8\sqrt{n}+\frac{2}{5}n^{1/4}\mathbb{E}[Z]^{1/2}\right)\mathbb{E}[Z]

The first inequality is Cauchy-Schwarz, the second inequality uses (6), the third inequality uses the monotonicity of the \ell_{p} norms, and the equality uses (1).

Combining the two terms, we get

 \operatorname{Var}\left[Z\right]\leq 4n+9\sqrt{n}\,\mathbb{E}\left[Z\right]+\frac{2}{5}n^{1/4}\mathbb{E}\left[Z\right]^{3/2}.

We now consider the two cases in the statement of our lemma.

• When p\in\mathcal{C}, we know from Lemma 2 that \mathbb{E}\left[Z\right]\leq\frac{1}{500}m\varepsilon^{2}. Combined with a choice of m\geq 20000\frac{\sqrt{n}}{\varepsilon^{2}} and the above expression for the variance, this gives:

 \operatorname{Var}\left[Z\right]\leq\frac{4}{20000^{2}}m^{2}\varepsilon^{4}+\frac{9}{20000\cdot 500}m^{2}\varepsilon^{4}+\frac{\sqrt{10}}{12500000}m^{2}\varepsilon^{4}\leq\frac{1}{500000}m^{2}\varepsilon^{4}.
• When d_{\mathrm{TV}}(p,\mathcal{C})\geq\varepsilon, Lemma 2 and m\geq 20000\frac{\sqrt{n}}{\varepsilon^{2}} give:

 \mathbb{E}\left[Z\right]\geq\frac{1}{5}m\varepsilon^{2}\geq 4000\sqrt{n}.

Combining this with our expression for variance we get:

 \operatorname{Var}\left[Z\right]\leq\frac{4}{4000^{2}}\mathbb{E}\left[Z\right]^{2}+\frac{9}{4000}\mathbb{E}\left[Z\right]^{2}+\frac{2}{5\sqrt{4000}}\mathbb{E}\left[Z\right]^{2}\leq\frac{1}{100}\mathbb{E}\left[Z\right]^{2}.

\hfill\qed
