Asymptotically optimal private estimation under mean square loss

Min Ye                           Alexander Barg
Abstract

We consider the minimax estimation problem of a discrete distribution with support size k under locally differential privacy constraints. A privatization scheme is applied to each raw sample independently, and we need to estimate the distribution of the raw samples from the privatized samples. A positive number \epsilon measures the privacy level of a privatization scheme.

In our previous work (arXiv:1702.00610), we proposed a family of new privatization schemes and the corresponding estimator. We also proved that our scheme and estimator are order optimal in the regime e^{\epsilon}\ll k under both \ell_{2}^{2} and \ell_{1} loss. In other words, for a large number of samples the worst-case estimation loss of our scheme was shown to differ from the optimal value by at most a constant factor. In this paper, we eliminate this gap by showing asymptotic optimality of the proposed scheme and estimator under the \ell_{2}^{2} (mean square) loss. More precisely, we show that for any k and \epsilon, the ratio between the worst-case estimation loss of our scheme and the optimal value approaches 1 as the number of samples tends to infinity.

The authors are with the Dept. of ECE and ISR, University of Maryland, College Park, MD 20742. Emails: yeemmi@gmail.com and abarg@umd.edu. Research supported by NSF grants CCF1422955 and CCF1618603.

I Introduction

This paper continues our work [13]. The context of the problem that we consider is related to a major challenge in the statistical analysis of user data, namely, the conflict between learning accurate statistics and protecting sensitive information about the individuals. As in [13], we rely on a particular formalization of user privacy called differential privacy, introduced in [1, 2]. Generally speaking, differential privacy requires that the adversary not be able to reliably infer an individual’s data from public statistics even with access to all the other users’ data. The concept of differential privacy has been developed in two different contexts: the global privacy context (for instance, when institutions release statistics related to groups of people) [3], and the local privacy context when individuals disclose their personal data [4].

In this paper, we consider the minimax estimation problem of a discrete distribution with support size k under locally differential privacy. This problem has been studied in the non-private setting [5, 6], where we can learn the distribution from the raw samples. In the private setting, we need to estimate the distribution of raw samples from the privatized samples which are generated independently from the raw samples according to a conditional distribution Q (also called a privatization scheme). Given a privacy parameter \epsilon>0, we say that Q is \epsilon-locally differentially private if the probabilities of the same output conditional on different inputs differ by a factor of at most e^{\epsilon}. Clearly, smaller \epsilon means that it is more difficult to infer the original data from the privatized samples, and thus leads to higher privacy. For a given \epsilon, our objective is to find the optimal privatization scheme with \epsilon-privacy level to minimize the expected estimation loss for the worst-case distribution. In this paper, we are mainly concerned with the scenario where we have a large number of samples, which captures the modern trend toward “big data” analytics.

I-A Existing results

The following two privatization schemes are the most well-known in the literature: the k-ary Randomized Aggregatable Privacy-Preserving Ordinal Response (k-RAPPOR) scheme [7, 8], and the k-ary Randomized Response (k-RR) scheme [9, 10]. The k-RAPPOR scheme is order optimal in the high privacy regime where \epsilon is very close to 0, and the k-RR scheme is order optimal in the low privacy regime where e^{\epsilon}\approx k [11]. Very recently, a family of privatization schemes and the corresponding estimators were proposed independently by Wang et al. [12] and the present authors [13]. In [13], we further showed that under both \ell_{2}^{2} and \ell_{1} loss, these privatization schemes and the corresponding estimators are order-optimal in the medium to high privacy regimes when e^{\epsilon}\ll k.

Duchi et al. [14] gave an order-optimal lower bound on the minimax private estimation loss for the high privacy regime where \epsilon is very close to 0. In [13], we proved a stronger lower bound which is order-optimal in the whole region e^{\epsilon}\ll k. This lower bound implies that the schemes and the estimators proposed in [12, 13] are order optimal in this regime. Here order-optimal means that the ratio between the true value and the lower bound is upper bounded by a constant (larger than 1) when n and k/e^{\epsilon} both become large enough.

I-B Our contributions

In this paper, we focus on the \ell_{2}^{2} (mean square) loss. We prove an asymptotically tight lower bound on the minimax private estimation loss for all values of k and \epsilon. In other words, for every k and every \epsilon, the ratio between the true value and our lower bound goes to 1 as n goes to infinity. This is a huge improvement over the lower bounds in [13] and [14] for the following two reasons. First, although the lower bounds in [13] and [14] are order-optimal, they differ from the true value by a factor of several hundred; in practice, an improvement of several percentage points is already considered a substantial advance (see, for instance, [11]), so these order-optimal bounds are far from satisfactory. Second, the bounds in [13] and [14] only hold for certain regions of k and \epsilon, while the lower bound in this paper holds for all values of k and \epsilon.

Furthermore, as an immediate consequence of our lower bound, we show that the schemes and the estimators proposed in [12, 13] are asymptotically optimal! In other words, the ratio between the lower bound and the worst-case estimation loss of these schemes and estimators goes to 1 when n goes to infinity.

I-C Organization of the paper

In Section II, we formulate the problem and give a more detailed review of the existing results. Section III is devoted to an overview of the main results of this paper and to illustrating the main ideas behind the proof. Since the proof is very long and technical, we include a short Section IV, where we explain the argument in formal terms while skipping many details. The complete proof is given in Section V, which (together with the Appendices) takes up most of the length of the paper. In Section VI, we point out two possible directions for future research.

II Problem formulation and existing results

Notation: Let \mathscr{X}=\{1,2,\dots,k\} be the source alphabet and let \textbf{p}=(p_{1},p_{2},\dots,p_{k}) be a probability distribution on \mathscr{X}. Denote by \Delta_{k}=\{\textbf{p}\in\mathbb{R}^{k}: p_{i}\geq 0 \text{ for } i=1,2,\dots,k,\ \sum_{i=1}^{k}p_{i}=1\} the k-dimensional probability simplex. Let X be a random variable (RV) that takes values on \mathscr{X} according to \textbf{p}, so that p_{i}=P(X=i). Denote by X^{n}=(X^{(1)},X^{(2)},\dots,X^{(n)}) the vector formed of n independent copies of the RV X.

II-A Problem formulation

In the classical (non-private) distribution estimation problem, we are given direct access to i.i.d. samples \{X^{(i)}\}_{i=1}^{n} drawn according to some unknown distribution {\textbf{{p}}}\in\Delta_{k}. Our goal is to estimate p based on the samples [6]. We define an estimator \hat{{\textbf{{p}}}} as a function \hat{{\textbf{{p}}}}:{\mathscr{X}}^{n}\to\mathbb{R}^{k}, and assess its quality in terms of the worst-case risk (expected loss)

\sup_{\textbf{p}\in\Delta_{k}}\underset{X^{n}\sim\textbf{p}^{n}}{\mathbb{E}}\,\ell(\hat{\textbf{p}}(X^{n}),\textbf{p}),

where \ell is some loss function. The minimax risk is defined as the solution of the following saddlepoint problem:

r_{k,n}^{\ell}:=\inf_{\hat{\textbf{p}}}\sup_{\textbf{p}\in\Delta_{k}}\underset{X^{n}\sim\textbf{p}^{n}}{\mathbb{E}}\,\ell(\hat{\textbf{p}}(X^{n}),\textbf{p}).

In the private distribution estimation problem, we can no longer access the raw samples \{X^{(i)}\}_{i=1}^{n}. Instead, we estimate the distribution p from the privatized samples \{Y^{(i)}\}_{i=1}^{n}, obtained by applying a privatization mechanism Q independently to each raw sample X^{(i)}. A privatization mechanism (also called privatization scheme) {\textbf{{Q}}}:{\mathscr{X}}\to{\mathscr{Y}} is simply a conditional distribution {\textbf{{Q}}}_{Y|X}. The privatized samples Y^{(i)} take values in a set {\mathscr{Y}} (the “output alphabet”) that does not have to be the same as {\mathscr{X}}.

The quantities \{Y^{(i)}\}_{i=1}^{n} are i.i.d. samples drawn according to the marginal distribution m given by

\textbf{m}(S)=\sum_{i=1}^{k}\textbf{Q}(S|i)p_{i} (1)

for any S\in\sigma({\mathscr{Y}}), where \sigma({\mathscr{Y}}) denotes an appropriate \sigma-algebra on {\mathscr{Y}}. In accordance with this setting, the estimator \hat{{\textbf{{p}}}} is a measurable function \hat{{\textbf{{p}}}}:{\mathscr{Y}}^{n}\to\mathbb{R}^{k}. We assess the quality of the privatization scheme Q and the corresponding estimator \hat{{\textbf{{p}}}} by the worst-case risk

r_{k,n}^{\ell}(\textbf{Q},\hat{\textbf{p}}):=\sup_{\textbf{p}\in\Delta_{k}}\underset{Y^{n}\sim\textbf{m}^{n}}{\mathbb{E}}\,\ell(\hat{\textbf{p}}(Y^{n}),\textbf{p}),

where {\textbf{{m}}}^{n} is the n-fold product distribution and m is given by (1). Define the minimax risk of the privatization scheme Q as

r_{k,n}^{\ell}(\textbf{Q}):=\inf_{\hat{\textbf{p}}}r_{k,n}^{\ell}(\textbf{Q},\hat{\textbf{p}}). (2)
Definition II.1.

For a given \epsilon>0, a privatization mechanism {\textbf{{Q}}}:{\mathscr{X}}\to{\mathscr{Y}} is said to be \epsilon-locally differentially private if

\sup_{S\in\sigma(\mathscr{Y})}\frac{\textbf{Q}(Y\in S|X=x)}{\textbf{Q}(Y\in S|X=x^{\prime})}\leq e^{\epsilon}\text{ for all }x,x^{\prime}\in\mathscr{X}. (3)

Denote by {\mathscr{D}}_{\epsilon} the set of all \epsilon-locally differentially private mechanisms. Given a privacy level \epsilon, we seek to find the optimal {\textbf{{Q}}}\in{\mathscr{D}}_{\epsilon} with the smallest possible minimax risk among all the \epsilon-locally differentially private mechanisms. Accordingly, define the \epsilon-private minimax risk as

r_{\epsilon,k,n}^{\ell}:=\inf_{\textbf{Q}\in\mathscr{D}_{\epsilon}}r_{k,n}^{\ell}(\textbf{Q}). (4)

As already mentioned, we will limit ourselves to \ell=\ell_{2}^{2}.

Main Problem: Suppose that the cardinality k of the source alphabet is known to the estimator. We would like to find the asymptotic growth rate of r_{\epsilon,k,n}^{\ell_{2}^{2}} as n\to\infty and to construct an asymptotically optimal privatization mechanism and a corresponding estimator of p from the privatized samples.

It is this problem that we address—and resolve—in this paper. Specifically, we prove a lower bound on r_{\epsilon,k,n}^{\ell_{2}^{2}}, which implies that the mechanism and the corresponding estimator proposed in [13] are asymptotically optimal for the private estimation problem.

II-B Previous results

In this section we briefly review known results that are relevant to our problem. In Sect. I-A we mentioned several papers that have considered it, viz., [9, 7, 8, 10, 11, 12, 14]. In this section we discuss only the results of [13] since they subsume the (earlier) results of the mentioned references, and since they are formulated in the form convenient for our presentation.

Let {\mathscr{D}}_{\epsilon,F} be the set of \epsilon-locally differentially private schemes with finite output alphabet. Let

\mathscr{D}_{\epsilon,E}=\Big\{\textbf{Q}\in\mathscr{D}_{\epsilon,F}:\frac{\textbf{Q}(y|x)}{\min_{x^{\prime}\in\mathscr{X}}\textbf{Q}(y|x^{\prime})}\in\{1,e^{\epsilon}\}\text{ for all }x\in\mathscr{X}\text{ and all }y\in\mathscr{Y}\Big\}. (5)

In [13, Theorem IV.5], we have shown that

r_{\epsilon,k,n}^{\ell_{2}^{2}}=\inf_{\textbf{Q}\in\mathscr{D}_{\epsilon,E}}r_{k,n}^{\ell_{2}^{2}}(\textbf{Q}). (6)

As a result, we limit ourselves below to schemes \textbf{Q}\in\mathscr{D}_{\epsilon,E}. For such schemes, since the output alphabet is finite, we can write the marginal distribution \textbf{m} in (1) as a vector \textbf{m}=(\sum_{j=1}^{k}p_{j}\textbf{Q}(y|j),\ y\in\mathscr{Y}). We will also use the shorthand notation \textbf{m}=\textbf{p}\textbf{Q} to denote this vector.

In [13], we introduced a family of privatization schemes which are parameterized by the integer d\in\{1,2,\dots,k-1\}. Given k and d, let the output alphabet be {\mathscr{Y}}_{k,d}=\{y\in\{0,1\}^{k}:\sum_{i=1}^{k}y_{i}=d\}, so |{\mathscr{Y}}_{k,d}|=\binom{k}{d}.

Definition II.2 ([13]).

Consider the following privatization scheme:

\textbf{Q}_{k,\epsilon,d}(y|i)=\frac{e^{\epsilon}y_{i}+(1-y_{i})}{\binom{k-1}{d-1}e^{\epsilon}+\binom{k-1}{d}} (7)

for all y\in{\mathscr{Y}}_{k,d} and all i\in{\mathscr{X}}. The corresponding empirical estimator of p under {\textbf{{Q}}}_{k,\epsilon,d} is defined as follows:

\hat{p}_{i}=\Big(\frac{(k-1)e^{\epsilon}+\frac{(k-1)(k-d)}{d}}{(k-d)(e^{\epsilon}-1)}\Big)\frac{T_{i}}{n}-\frac{(d-1)e^{\epsilon}+k-d}{(k-d)(e^{\epsilon}-1)}, (8)

where T_{i}=\sum_{j=1}^{n}Y_{i}^{(j)} is the number of privatized samples whose i-th coordinate is 1.

It is easy to verify that {\textbf{{Q}}}_{k,\epsilon,d} is \epsilon-locally differentially private. The estimation loss under {\textbf{{Q}}}_{k,\epsilon,d} and the empirical estimator is calculated in the following proposition.
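For concreteness, here is a minimal Python sketch of the mechanism \textbf{Q}_{k,\epsilon,d} in (7) and the empirical estimator (8). The helper names and the two-stage sampling (first decide whether the coordinate of the raw symbol is set, then fill the remaining coordinates uniformly) are our own illustration of the scheme, not code from [13].

import math
import random
import numpy as np

def privatize(x, k, eps, d, rng=random):
    # Sample y in {0,1}^k with exactly d ones according to Q_{k,eps,d}(y|x) in (7).
    # Outputs with y_x = 1 carry total mass C(k-1,d-1)e^eps; outputs with y_x = 0 carry C(k-1,d).
    a = math.comb(k - 1, d - 1) * math.exp(eps)
    b = math.comb(k - 1, d)
    y = [0] * k
    others = [j for j in range(k) if j != x]
    if rng.random() < a / (a + b):
        y[x] = 1
        ones = rng.sample(others, d - 1)
    else:
        ones = rng.sample(others, d)
    for j in ones:
        y[j] = 1
    return y

def empirical_estimate(Y, k, eps, d):
    # Empirical estimator (8); Y is an n x k array of privatized samples, T_i = column sums.
    n = len(Y)
    T = np.sum(Y, axis=0)
    e = math.exp(eps)
    scale = ((k - 1) * e + (k - 1) * (k - d) / d) / ((k - d) * (e - 1))
    shift = ((d - 1) * e + k - d) / ((k - d) * (e - 1))
    return scale * T / n - shift

For example, privatize(x, k, eps, d) returns one privatized sample, and empirical_estimate applied to n such samples returns \hat{\textbf{p}}.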

Proposition II.3.

[13, Prop. III.1] Suppose that the privatization scheme is {\textbf{{Q}}}_{k,\epsilon,d} and the empirical estimator is given by (8). Let {\textbf{{m}}}={\textbf{{p}}}{\textbf{{Q}}}_{k,\epsilon,d}. For all \epsilon,n and k, we have that

\underset{Y^{n}\sim\textbf{m}^{n}}{\mathbb{E}}\,\ell_{2}^{2}(\hat{\textbf{p}}(Y^{n}),\textbf{p})=\frac{1}{n}\Big(\frac{(d(k-2)+1)e^{2\epsilon}}{(k-d)(e^{\epsilon}-1)^{2}}+\frac{2(k-2)e^{\epsilon}}{(e^{\epsilon}-1)^{2}}+\frac{(k-2)(k-d)+1}{d(e^{\epsilon}-1)^{2}}-\sum_{i=1}^{k}p_{i}^{2}\Big). (9)

The sum \sum_{i}p_{i}^{2} is maximized for {\textbf{{p}}}_{U}=(1/k,1/k,\dots,1/k), so the worst-case distribution is the uniform one. Substituting {\textbf{{p}}}_{U} in (9), we obtain

r_{k,n}^{\ell_{2}^{2}}(\textbf{Q}_{k,\epsilon,d},\hat{\textbf{p}})=\underset{Y^{n}\sim\textbf{m}_{U}^{n}}{\mathbb{E}}\,\ell_{2}^{2}(\hat{\textbf{p}}(Y^{n}),\textbf{p}_{U})=\frac{(k-1)^{2}}{nk(e^{\epsilon}-1)^{2}}\,\frac{(de^{\epsilon}+k-d)^{2}}{d(k-d)}. (10)

Given k and \epsilon, define

d^{\ast}=d^{\ast}(k,\epsilon):=\operatorname*{arg\,min}_{1\leq d\leq k-1}\frac{(de^{\epsilon}+k-d)^{2}}{d(k-d)}, (11)

where the ties are resolved arbitrarily. We obtain

r_{k,n}^{\ell_{2}^{2}}(\textbf{Q}_{k,\epsilon,d^{\ast}},\hat{\textbf{p}})=\min_{1\leq d\leq k-1}r_{k,n}^{\ell_{2}^{2}}(\textbf{Q}_{k,\epsilon,d},\hat{\textbf{p}}). (12)

By differentiation in (11) we find that d^{\ast} can only take two possible values given in the next proposition.

Proposition II.4.
d^{\ast}=\lceil k/(e^{\epsilon}+1)\rceil\text{ or }\lfloor k/(e^{\epsilon}+1)\rfloor.

Therefore, when k/(e^{\epsilon}+1)\leq 1, d^{\ast}=1; when k/(e^{\epsilon}+1)>1, a simple comparison can determine the value of d^{\ast}.
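As a quick illustration of (11) and Proposition II.4, the following Python sketch (our own helper functions, not part of [13]) evaluates the objective at the floor and ceiling of k/(e^{\epsilon}+1) and returns a minimizer d^{\ast}.

import math

def objective(d, k, eps):
    # The quantity (d e^eps + k - d)^2 / (d (k - d)) minimized in (11).
    e = math.exp(eps)
    return (d * e + k - d) ** 2 / (d * (k - d))

def d_star(k, eps):
    # Proposition II.4: d* is the ceiling or the floor of k/(e^eps + 1), clipped to {1, ..., k-1}.
    t = k / (math.exp(eps) + 1)
    cands = {min(max(math.floor(t), 1), k - 1), min(max(math.ceil(t), 1), k - 1)}
    return min(cands, key=lambda d: objective(d, k, eps))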

Clearly, r_{k,n}^{\ell_{2}^{2}}(\textbf{Q}_{k,\epsilon,d^{\ast}},\hat{\textbf{p}}) serves as an upper bound on r_{\epsilon,k,n}^{\ell_{2}^{2}}. In [13], we also proved an order-optimal lower bound on r_{\epsilon,k,n}^{\ell_{2}^{2}} in the regime e^{\epsilon}\ll k for n large enough. Combining the upper and lower bounds, we obtain the following theorem.

Theorem II.5.

[13] Let e^{\epsilon}\ll k. For n large enough,

r_{\epsilon,k,n}^{\ell_{2}^{2}}=\Theta\Big(\frac{ke^{\epsilon}}{n(e^{\epsilon}-1)^{2}}\Big).

III Overview of the results and the main ideas of the proof

III-A Main result of the paper

Let

M(k,\epsilon):=\frac{(k-1)^{2}}{k(e^{\epsilon}-1)^{2}}\,\frac{(d^{\ast}e^{\epsilon}+k-d^{\ast})^{2}}{d^{\ast}(k-d^{\ast})} (13)

where d^{\ast} is defined in (11). Note that r_{k,n}^{\ell_{2}^{2}}(\textbf{Q}_{k,\epsilon,d^{\ast}},\hat{\textbf{p}})=\frac{1}{n}M(k,\epsilon).
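Continuing the illustrative Python sketch from Section II, M(k,\epsilon) can be evaluated numerically as follows (assuming the hypothetical d_star helper defined earlier).

import math

def M_const(k, eps):
    # M(k, eps) from (13); by (10), n * r_{k,n}(Q_{k,eps,d*}, p_hat) equals this constant.
    d = d_star(k, eps)
    e = math.exp(eps)
    return (k - 1) ** 2 / (k * (e - 1) ** 2) * (d * e + k - d) ** 2 / (d * (k - d))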

Theorem III.1.

For every k and \epsilon, there are a positive constant C(k,\epsilon)>0 and an integer N(k,\epsilon) such that when n\geq N(k,\epsilon),

r_{\epsilon,k,n}^{\ell_{2}^{2}}\geq\frac{1}{n}M(k,\epsilon)-\frac{C(k,\epsilon)}{n^{14/13}}. (14)

This result together with (12) and (10) implies that

r_{\epsilon,k,n}^{\ell_{2}^{2}}=r_{k,n}^{\ell_{2}^{2}}(\textbf{Q}_{k,\epsilon,d^{\ast}},\hat{\textbf{p}})-O(n^{-14/13}). (15)

This theorem completely determines the dominant term of r_{\epsilon,k,n}^{\ell_{2}^{2}}. It also gives an upper bound on the order of the remainder term. Moreover, this theorem also implies that our scheme {\textbf{{Q}}}_{k,\epsilon,d^{\ast}} and the empirical estimator \hat{{\textbf{{p}}}} defined in (8), both proposed in [13], are asymptotically optimal for the Main Problem.

III-B Main ideas of the proof

In this subsection we illustrate the main ideas that lead to determining the dominant term of r_{\epsilon,k,n}^{\ell_{2}^{2}}. In view of (10)–(12) we need to show that \lim_{n\to\infty}n\,r_{\epsilon,k,n}^{\ell_{2}^{2}}=M(k,\epsilon). Clearly, r_{\epsilon,k,n}^{\ell_{2}^{2}}\leq r_{k,n}^{\ell_{2}^{2}}(\textbf{Q}_{k,\epsilon,d^{\ast}},\hat{\textbf{p}}) for all n\geq 1, implying that \limsup_{n\to\infty}n\,r_{\epsilon,k,n}^{\ell_{2}^{2}}\leq M(k,\epsilon). Therefore, we only need to show the lower bound

\liminf_{n\to\infty}n\,r_{\epsilon,k,n}^{\ell_{2}^{2}}=\liminf_{n\to\infty}n\inf_{\textbf{Q}\in\mathscr{D}_{\epsilon,E}}r_{k,n}^{\ell_{2}^{2}}(\textbf{Q})\geq M(k,\epsilon). (16)

Lower bounds on r_{\epsilon,k,n}^{\ell_{2}^{2}} can be derived using Assouad’s method; see [14, 13]. More specifically, we can choose a finite subset of the probability simplex and assume that the probability distributions only come from this finite set, thereby reducing the original estimation problem to a hypothesis testing problem. This approach enables us to derive the correct scaling order of r_{\epsilon,k,n}^{\ell_{2}^{2}}; see Theorem II.5. Deriving the correct constant in front of the main term is more problematic because Assouad’s method relies on reducing a continuous domain (the probability simplex) to a finite set.

In this paper, we use a different approach to obtain an asymptotically tight lower bound in (16). Since the worst-case estimation loss is always lower bounded by the average estimation loss, the minimax risk r_{\epsilon,k,n}^{\ell_{2}^{2}} can be bounded below by the Bayes estimation loss. More specifically, we assume that the probability distributions are chosen uniformly from a small neighborhood of the uniform distribution {\textbf{{p}}}_{U}. Surprisingly, the lower bound on the Bayes estimation loss turns out to be asymptotically the same as the worst-case estimation loss of our scheme and estimator. In other words, the ratio between these two quantities goes to 1 when n goes to infinity.

In order to obtain the lower bound on the Bayes estimation loss, we refine a classical method in asymptotic statistics, namely, local asymptotic normality (LAN) of the posterior distribution [15, 16, 17]. We briefly describe the implications of LAN for our problem and explain why the classical approach does not directly apply to it. Our objective is to estimate \textbf{p} from the privatized samples Y^{n}. In the Bayesian setup, we assume that \textbf{p} is drawn uniformly from \mathscr{P}, a very small neighborhood of \textbf{p}_{U}. Let \textbf{P}=[P_{1},P_{2},\dots,P_{k}] denote the random vector that corresponds to \textbf{p}. Applying the LAN method of [15, 16, 17], one can show that when the radius of \mathscr{P} is of order n^{-1/2}, with large probability the conditional distribution of \textbf{P} given Y^{n} approaches a jointly Gaussian distribution as n goes to infinity, and the covariance matrix \Sigma=\Sigma(n,\textbf{Q}) of this Gaussian distribution is completely determined by n and the privatization scheme \textbf{Q}. Note that \Sigma is independent of the value of Y^{n}. We further note that the top-left (k-1)\times(k-1) submatrix of \Sigma is the inverse of the Fisher information matrix computed with respect to the parameters p_{1},p_{2},\dots,p_{k-1} (a detailed discussion of this issue appears in Appendix J). It is clear that the trace of \Sigma serves as an asymptotic lower bound on r_{k,n}^{\ell_{2}^{2}}(\textbf{Q}). At the same time, for all \textbf{Q}\in\mathscr{D}_{\epsilon,E}, we can show that

\operatorname{tr}(\Sigma)\geq r_{k,n}^{\ell_{2}^{2}}(\textbf{Q}_{k,\epsilon,d^{\ast}},\hat{\textbf{p}}),

where \textbf{Q}_{k,\epsilon,d^{\ast}} and \hat{\textbf{p}} are given respectively in (7) and (8). Therefore,

\liminf_{n\to\infty}n\,r_{k,n}^{\ell_{2}^{2}}(\textbf{Q})\geq M(k,\epsilon)\text{ for all }\textbf{Q}\in\mathscr{D}_{\epsilon,E}.

This inequality gives a strong intuition of why (16) should hold. However, it does not imply (16) because a pointwise asymptotic lower bound does not imply a uniform asymptotic lower bound. For this reason, we cannot directly use the classical approach and must develop more delicate arguments that prove (16) and (15). Another feature of our approach worth mentioning is that, unlike the methods in [15, 16, 17], our proof is completely elementary.

IV Sketch of the proof of Theorem III.1

In the previous section we explained a general approach to the proof of (16). The formal proof that we will present is rather long and technical. For this reason, in this section we outline a “road map” to help the readers to follow our arguments.

For reader’s convenience we made a short table of our main notation; see Table I. Most, although not all, definitions from this table are also given in the main text.

The following conventions about our notation are made with reference to Table I. The vector u is chosen to be (k-1)-dimensional (u is a function of {\textbf{{p}}}, but we omit the dependence for notational simplicity). At the same time, using the normalization condition, we define u_{k}=u_{k}({\textbf{{u}}}):=-\sum_{j=1}^{k-1}u_{j}. Clearly, u_{k}=p_{k}-1/k. The same convention is used for the random vector U and for the vector \tilde{{\textbf{{u}}}} both of which appear later in the paper.

The following obvious relations will be repeatedly used in the proof:

\sum_{i=1}^{L}q_{i}=\sum_{i=1}^{L}q_{ij}=1\text{ for all }j\in[k],\quad\sum_{j=1}^{k}u_{j}=0,\quad\sum_{i=1}^{L}t_{i}(y^{n})=n. (17)
TABLE I: Main notation
\mathscr{X}=\{1,2,\dots,k\} | source alphabet
\textbf{p}=(p_{1},\dots,p_{k}) | distribution on \mathscr{X}, where p_{j}=P(X=j)
\mathscr{D}_{\epsilon,E} | set of privatization mechanisms of the form (5)
\textbf{u}=(u_{1},\dots,u_{k-1}),\ u_{j}:=p_{j}-\frac{1}{k},\ j=1,\dots,k-1 | difference between the first k-1 coordinates of \textbf{p} and the uniform distribution
u_{k}=u_{k}(\textbf{u}):=-\sum_{j=1}^{k-1}u_{j} | by definition u_{k}=p_{k}-1/k
L^{\prime} | cardinality of the original output alphabet
\mathscr{Y} | original output alphabet; WLOG we assume \mathscr{Y}=\{1,2,\dots,L^{\prime}\}
L | the number of equivalence classes \{A_{1},A_{2},\dots,A_{L}\} in \mathscr{Y}
\{A_{1},A_{2},\dots,A_{L}\} | equivalent output alphabet after symbol merging
\textbf{Q}=(\textbf{Q}(i|j))_{i\in[L^{\prime}],j\in[k]} | privatization mechanism (conditional distribution)
q_{ij},\ i\in[L],\ j\in[k] | conditional probability of observing an output symbol in A_{i} if the raw sample is j
q_{i}:=\frac{1}{k}\sum_{j}q_{ij},\ i\in[L] | by definition q_{i}=P(Y\in A_{i}) when \textbf{p} is the uniform distribution over \mathscr{X}
(t_{i}(y^{n}),\ i=1,\dots,L) | composition of the observed vector; t_{i}(y^{n}) is the number of occurrences of symbol A_{i} in y^{n}
\textbf{v}=(v_{1},\dots,v_{L}) | v_{i}:=t_{i}(y^{n})-nq_{i},\ i\in[L]
\textbf{w}=\textbf{w}(\textbf{v},\textbf{Q})\in\mathbb{R}^{k-1} | vector with coordinates \sum_{i=1}^{L}\frac{(q_{im}-q_{ik})v_{i}}{q_{i}},\ m=1,\dots,k-1
\textbf{U}=(U_{1},U_{2},\dots,U_{k-1}) | random vector corresponding to \textbf{u}=(u_{1},\dots,u_{k-1})
U_{k}=U_{k}(\textbf{U}):=-\sum_{j=1}^{k-1}U_{j} | random variable corresponding to u_{k}
\textbf{V}=(V_{1},\dots,V_{L}) | random vector corresponding to \textbf{v}=(v_{1},\dots,v_{L})
B(\alpha) | ellipsoid (18) of “radius” \alpha
B_{1}:=B(n^{-5/13}),\ B_{2}:=B(n^{-5/13}-3n^{-6/13}/\delta_{0})

Below we study distributions supported on ellipsoids, and we use the following generic notation: Given \alpha>0, let us define an ellipsoid

B(\alpha)=\Big\{\textbf{u}\in\mathbb{R}^{k-1}:\sum_{i=1}^{k-1}u_{i}^{2}+\Big(\sum_{i=1}^{k-1}u_{i}\Big)^{2}<\alpha^{2}\Big\}. (18)

For future use we note that \textbf{u}\in B(\alpha) if and only if \textbf{u}^{T}(I+J)\textbf{u}<\alpha^{2}, where I is the identity matrix and J is the all-ones matrix.
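As a small illustration (our own helper, not part of the proof), the quadratic-form characterization gives a one-line membership test for B(\alpha).

import numpy as np

def in_ellipsoid(u, alpha):
    # u in B(alpha) iff u^T (I + J) u = sum_i u_i^2 + (sum_i u_i)^2 < alpha^2, see (18).
    u = np.asarray(u, dtype=float)
    return float(u @ u + u.sum() ** 2) < alpha ** 2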

On account of (6), to prove (14) it suffices to show that for every k and \epsilon, there are a positive constant C(k,\epsilon)>0 and an integer N(k,\epsilon) such that when n\geq N(k,\epsilon),

\inf_{\textbf{Q}\in\mathscr{D}_{\epsilon,E}}r_{k,n}^{\ell_{2}^{2}}(\textbf{Q})\geq\frac{1}{n}M(k,\epsilon)-\frac{C(k,\epsilon)}{n^{14/13}}. (19)

In the rest of this paper we will prove (19). It is important to note that, because of the infimum on the distribution Q, the constants C(k,\epsilon) and N(k,\epsilon) should not depend on it. It is for this reason that the classical LAN approach does not directly apply. Below we provide more details about this.

IV-A Output alphabet reduction

As mentioned above, we use Bayes estimation loss to bound r_{\epsilon,k,n}^{\ell_{2}^{2}} below, and we assume that the probability distributions are chosen uniformly from a small neighborhood of the uniform distribution {\textbf{{p}}}_{U}. Recalling the definition of the vector u and u_{k} in Table I, we note that estimating p is equivalent to estimating (u_{1},\dots,u_{k}). Let U be the random vector that corresponds to u. We assume that U is uniformly distributed over the ellipsoid

B_{1}=B\Big(\frac{1}{n^{5/13}}\Big). (20)

Suppose that the size of the original output alphabet \mathscr{Y} is some integer L^{\prime}, which is independent of k and \epsilon and can take an arbitrarily large value. Recall that our objective is (19). If L^{\prime} enters the estimates of the estimation loss, then we cannot bound the loss by a function of k and \epsilon alone, which is something we would like to avoid. This can be done using the following simple observation. Without loss of generality, we assume that \mathscr{Y}=\{1,2,\dots,L^{\prime}\}. Since \textbf{Q}\in\mathscr{D}_{\epsilon,E} (see (5)), for every i\in\{1,\dots,L^{\prime}\}, the vector (\textbf{Q}(i|j),\ j=1,\dots,k) is proportional to one of the vectors in the set \{1,e^{\epsilon}\}^{k}. It is easy to see that we can merge into one symbol all the output symbols that correspond to proportional vectors, thereby reducing the size of the output alphabet without affecting the Bayes estimation loss. Suppose that, upon merging all such symbols, the size of the output alphabet becomes L. Clearly, L\leq 2^{k}. We will henceforth assume that the output alphabet is \{A_{1},A_{2},\dots,A_{L}\} and let q_{ij} denote the conditional probability of observing A_{i} if the raw sample is j\in[k].

IV-B Gaussian approximation of the posterior pdf f_{U|Y^{n}}

IV-B1 Approximation

For i=1,2,\dots,L, let q_{i}:=\frac{1}{k}\sum_{j=1}^{k}q_{ij}, and let t_{i}=t_{i}(y^{n}) be the number of times that symbol A_{i} appears in y^{n}. Define v_{i}:=v_{i}(y^{n})=t_{i}(y^{n})-nq_{i} for i=1,\dots,L. Then for {\textbf{{u}}}\in B_{1}, we have

f_{\textbf{U}|Y^{n}}(\textbf{u}|y^{n})\;\propto\;\prod_{i=1}^{L}\Big(q_{i}+\sum_{j=1}^{k}u_{j}q_{ij}\Big)^{t_{i}}\;\propto\;\exp\Big(\sum_{i=1}^{L}(nq_{i}+v_{i})\log\Big(1+\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}\Big)\Big). (21)

Since our objective is to estimate u_{1},u_{2},\dots,u_{k}, we view all the factors that do not contain u as constants in the formula above. Let us introduce a notation for the exponent of the right-hand side:

g(\textbf{u},y^{n})=\sum_{i=1}^{L}(nq_{i}+v_{i})\log\Big(1+\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}\Big). (22)

Let V be a random vector corresponding to the vector {\textbf{{v}}}=(v_{1},\dots,v_{L}). Since P(Y=A_{i})=\sum_{j=1}^{k}p_{j}q_{ij}, the expectation of V_{i} equals

\mathbb{E}V_{i}=n\sum_{j=1}^{k}p_{j}q_{ij}-\frac{n}{k}\sum_{j=1}^{k}q_{ij}=\sum_{j=1}^{k}nu_{j}q_{ij}.

Assuming that {\textbf{{u}}}\in B_{1}, we therefore conclude that \mathbb{E}V_{i}=O(n^{8/13}). By definition, \operatorname{Var}(V_{i})=\operatorname{Var}(t_{i}(Y^{n})) for all i. According to the Central Limit Theorem, when n is large, \operatorname{Var}(V_{i})=O(n). As a consequence, when n is large, with large probability, V_{i}=O(n^{8/13}). This fact together with the relation

\log\Big(1+\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}\Big)\approx\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}-\frac{1}{2}\Big(\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}\Big)^{2}

gives us the following approximation of g({\textbf{{u}}},y^{n}):

g(\textbf{u},y^{n})\approx\sum_{i=1}^{L}v_{i}\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}-\frac{1}{2}\sum_{i=1}^{L}nq_{i}\Big(\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}\Big)^{2}=-\frac{1}{2}h_{\textbf{v}}(\textbf{u})+\sum_{i=1}^{L}\frac{v_{i}^{2}}{2nq_{i}},

where for {\textbf{{v}}}=(v_{1},v_{2},\dots,v_{L}), the function h_{{\textbf{{v}}}}:\mathbb{R}^{k-1}\to\mathbb{R} is defined as

h_{\textbf{v}}(\textbf{u})=\sum_{i=1}^{L}\frac{n}{q_{i}}\Big(\sum_{j=1}^{k}u_{j}q_{ij}-\frac{v_{i}}{n}\Big)^{2}\text{ for all }\textbf{u}\in\mathbb{R}^{k-1}. (23)

Thus, for large n the density function of the posterior distribution is approximately given by

f_{\textbf{U}|Y^{n}}(\textbf{u}|y^{n})\;\propto\;\exp\Big(-\frac{1}{2}h_{\textbf{v}}(\textbf{u})\Big),

and h_{\textbf{v}}(\textbf{u}) is a quadratic function of \textbf{u}. Had \textbf{u} taken values in the entire space \mathbb{R}^{k-1}, we would be able to conclude that the posterior distribution of \textbf{U} given Y^{n} is approximately Gaussian. Supposing that this is the case, define the matrix

\Phi=\Phi(n,\textbf{Q}):=\sum_{i=1}^{L}\frac{n}{q_{i}}\big(q_{i1}-q_{ik},\dots,q_{i,k-1}-q_{ik}\big)^{T}\big(q_{i1}-q_{ik},\dots,q_{i,k-1}-q_{ik}\big),

and the vector {\textbf{{w}}}\in\mathbb{R}^{k-1}

\textbf{w}=\textbf{w}(\textbf{v},\textbf{Q}):=\Big(\sum_{i=1}^{L}\frac{(q_{i1}-q_{ik})v_{i}}{q_{i}},\sum_{i=1}^{L}\frac{(q_{i2}-q_{ik})v_{i}}{q_{i}},\dots,\sum_{i=1}^{L}\frac{(q_{i,k-1}-q_{ik})v_{i}}{q_{i}}\Big)^{T}.

Then the covariance and the vector of means of this Gaussian distribution are given by \Phi^{-1} and \Phi^{-1}{\textbf{{w}}}. Note that \Phi is independent of the value of Y^{n}. At the same time, since w depends on the value of v, and v is a function of y^{n}, the mean vector \Phi^{-1}{\textbf{{w}}} does vary with the realization of Y^{n}.
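For readers who want to experiment, the following Python sketch (our own code; q is the vector of q_{i}, qij the L\times k array of q_{ij}, and v the vector of v_{i}) computes \Phi and \textbf{w}, together with \operatorname{tr}(\Phi^{-1})+\textbf{1}^{T}\Phi^{-1}\textbf{1}, the quantity that appears below as the approximate optimal Bayes loss.

import numpy as np

def phi_and_w(n, q, qij, v):
    # Phi(n, Q) = sum_i (n / q_i) z_i^T z_i with z_i = (q_{i1}-q_{ik}, ..., q_{i,k-1}-q_{ik});
    # w has coordinates sum_i (q_{im} - q_{ik}) v_i / q_i, m = 1, ..., k-1.
    q = np.asarray(q, dtype=float)
    qij = np.asarray(qij, dtype=float)
    v = np.asarray(v, dtype=float)
    L, k = qij.shape
    Z = qij[:, :k - 1] - qij[:, [k - 1]]
    Phi = n * (Z / q[:, None]).T @ Z
    w = (Z * (v / q)[:, None]).sum(axis=0)
    return Phi, w

def bayes_loss_proxy(Phi):
    # tr(Phi^{-1}) + 1^T Phi^{-1} 1: the approximate sum of the posterior variances of U_1, ..., U_k.
    Pinv = np.linalg.inv(Phi)
    return np.trace(Pinv) + Pinv.sum()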

IV-B2 Large deviations

The above argument does not directly apply because u is limited to the ellipsoid B_{1}. In order to claim that the posterior distribution can be indeed approximated as Gaussian, we need to show that the entire mass is concentrated in B_{1}, i.e., that under the Gaussian distribution {\mathcal{N}}(\Phi^{-1}{\textbf{{w}}},\Phi^{-1}) the probability P({\textbf{{U}}}\not\in B_{1})\approx 0.

To build intuition, let us consider the one-dimensional case. Let h(x)=\frac{(x-\mu)^{2}}{2\sigma^{2}} be the (absolute value of the) exponent of the Gaussian pdf. By the Chernoff bound, for X\sim{\mathcal{N}}(\mu,\sigma^{2}) we have

P(|X-\mu|\geq t)\leq 2\exp(-\frac{t^{2}}{2\sigma^{2}}). (24)

This bound immediately implies that for any \tilde{x}\in\mathbb{R}

P(\sqrt{h(X)}-\sqrt{h(\tilde{x})}\geq t)\leq 2\exp(-t^{2}).

The situation is similar in the multi-dimensional case. Indeed, one can show that for {\textbf{{U}}}\sim{\mathcal{N}}(\Phi^{-1}{\textbf{{w}}},\Phi^{-1}) we have the following inequality: for any \tilde{{\textbf{{u}}}}\in\mathbb{R}^{k-1} and any \alpha>0

P\big(\sqrt{h_{\textbf{v}}(\textbf{U})}-\sqrt{h_{\textbf{v}}(\tilde{\textbf{u}})}>n^{\alpha}\big)\leq\exp(-n^{\alpha})

for a large enough n. Now our task is to show that for almost all y^{n}\in{\mathscr{Y}}^{n}, we can find a \tilde{{\textbf{{u}}}}=\tilde{{\textbf{{u}}}}(y^{n}) such that

B_{1}^{c}\subseteq\{\textbf{u}\in\mathbb{R}^{k-1}:\sqrt{h_{\textbf{v}}(\textbf{u})}-\sqrt{h_{\textbf{v}}(\tilde{\textbf{u}})}>n^{\alpha}\} (25)

for some \alpha>0, or more precisely, that (25) holds for all y^{n} in a subset E_{2}\subseteq{\mathscr{Y}}^{n} such that P(Y^{n}\in E_{2})\approx 1.

Toward that end, we define another ellipsoid

B_{2}=B\Big(\frac{1}{n^{5/13}}-\frac{3/\delta_{0}}{n^{6/13}}\Big) (26)

(see (18)), where \delta_{0} is a constant which will be specified later. Observe that the ratio between the radii of B_{2} and B_{1} approaches 1 when n is large. Thus, P(\textbf{U}\in B_{2})\approx P(\textbf{U}\in B_{1}), and since P(\textbf{U}\in B_{1})=1, we have P(\textbf{U}\in B_{2})\approx 1. Conditional on the event \textbf{U}=\tilde{\textbf{u}}\in B_{2}, we have (recall the convention that u_{k}=-\sum_{i=1}^{k-1}u_{i} and \tilde{u}_{k}=-\sum_{i=1}^{k-1}\tilde{u}_{i})

\mathbb{E}\Big(\sum_{i=1}^{L}\frac{n}{q_{i}}\Big(\sum_{j=1}^{k}\tilde{u}_{j}q_{ij}-\frac{V_{i}}{n}\Big)^{2}\Big)=\frac{1}{n}\sum_{i=1}^{L}\frac{1}{q_{i}}\operatorname{Var}\big(t_{i}(Y^{n})\big)=O(1),

so for large n

\sqrt{\sum_{i=1}^{L}\frac{n}{q_{i}}\Big(\sum_{j=1}^{k}\tilde{u}_{j}q_{ij}-\frac{V_{i}(Y^{n})}{n}\Big)^{2}}<n^{1/26}

with large probability. We can phrase this as follows: conditional on {\textbf{{U}}}\in B_{2}, for almost all y^{n}\in{\mathscr{Y}}^{n} we can find \tilde{{\textbf{{u}}}}=\tilde{{\textbf{{u}}}}(y^{n})\in B_{2} such that

\sqrt{\sum_{i=1}^{L}\frac{n}{q_{i}}\Big(\sum_{j=1}^{k}\tilde{u}_{j}q_{ij}-\frac{v_{i}(y^{n})}{n}\Big)^{2}}<n^{1/26}.

Since P({\textbf{{U}}}\in B_{2})\approx 1, this is the same as saying that for almost all y^{n}\in{\mathscr{Y}}^{n}, we can find \tilde{{\textbf{{u}}}}=\tilde{{\textbf{{u}}}}(y^{n})\in B_{2} such that

\sqrt{h_{\textbf{v}}(\tilde{\textbf{u}})}=\sqrt{\sum_{i=1}^{L}\frac{n}{q_{i}}\Big(\sum_{j=1}^{k}\tilde{u}_{j}q_{ij}-\frac{v_{i}(y^{n})}{n}\Big)^{2}}<n^{1/26}. (27)

By the triangle inequality, for any \tilde{{\textbf{{u}}}}\in B_{2} and any {\textbf{{u}}}\in B_{1}^{c}, we have

\sqrt{\sum_{i=1}^{k}(u_{i}-\tilde{u}_{i})^{2}}>\frac{3/\delta_{0}}{n^{6/13}}. (28)

Our next goal is to use this inequality to bound below the quantity \big(\sum_{i=1}^{L}\frac{1}{q_{i}}\big(\sum_{j=1}^{k}(u_{j}-\tilde{u}_{j})q_{ij}\big)^{2}\big)^{1/2}. Let us introduce the quantity

\delta=\delta(\textbf{Q}):=\min_{\textbf{u}\in\mathbb{R}^{k-1}:\,\sum_{i=1}^{k}u_{i}^{2}=1}\Big(\sum_{i=1}^{L}\frac{1}{q_{i}}\Big(\sum_{j=1}^{k}u_{j}q_{ij}\Big)^{2}\Big)^{1/2} (29)

Intuitively, \delta measures how well Q can distinguish between different u’s. Our argument proceeds differently depending on whether \delta\geq\delta_{0} or not.
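Computationally, \delta^{2} is the smallest generalized eigenvalue of the pair (A,I+J), where A is the matrix of the quadratic form \sum_{i}(1/q_{i})(\sum_{j}u_{j}q_{ij})^{2} on \mathbb{R}^{k-1} and I+J encodes the constraint \sum_{i=1}^{k}u_{i}^{2}=1. The Python sketch below (our own helper, same array conventions as before) makes this concrete.

import numpy as np

def delta_Q(q, qij):
    # delta(Q) in (29): minimize (sum_i (1/q_i)(sum_j u_j q_ij)^2)^{1/2} over u with
    # sum_{i=1}^k u_i^2 = 1, i.e. u^T (I+J) u = 1 (recall u_k = -sum_{j<k} u_j).
    q = np.asarray(q, dtype=float)
    qij = np.asarray(qij, dtype=float)
    L, k = qij.shape
    Z = qij[:, :k - 1] - qij[:, [k - 1]]
    A = (Z / q[:, None]).T @ Z                      # quadratic form sum_i (1/q_i)(z_i . u)^2
    B = np.eye(k - 1) + np.ones((k - 1, k - 1))     # u^T (I+J) u = sum_{i=1}^k u_i^2
    w_B, V_B = np.linalg.eigh(B)
    B_inv_half = V_B @ np.diag(w_B ** -0.5) @ V_B.T
    lam_min = np.linalg.eigvalsh(B_inv_half @ A @ B_inv_half).min()
    return float(np.sqrt(max(lam_min, 0.0)))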

If \delta is small, then there is a pair u and {\textbf{{u}}}^{\prime} that are well-separated from each other by the \ell_{2} distance, but the posterior probabilities of u and {\textbf{{u}}}^{\prime} given y^{n} are close to each other. Thus, the case \delta<\delta_{0} for a sufficiently small constant \delta_{0} can be handled by a straightforward application of the Le Cam method, and the main obstacle is represented by the opposite case.

For \delta\geq\delta_{0}, according to (28) and (29), for any \tilde{{\textbf{{u}}}}\in B_{2} and any {\textbf{{u}}}\in B_{1}^{c},

\sqrt{\sum_{i=1}^{L}\frac{n}{q_{i}}\Big(\sum_{j=1}^{k}\tilde{u}_{j}q_{ij}-u_{j}q_{ij}\Big)^{2}}=n^{1/2}\sqrt{\sum_{i=1}^{L}\frac{1}{q_{i}}\Big(\sum_{j=1}^{k}(\tilde{u}_{j}-u_{j})q_{ij}\Big)^{2}}\geq n^{1/2}\delta\sqrt{\sum_{i=1}^{k}(u_{i}-\tilde{u}_{i})^{2}}\geq n^{1/2}\delta_{0}\,\frac{3/\delta_{0}}{n^{6/13}}=3n^{1/26}. (30)

Combining (27) and (30) and using the triangle inequality, we conclude that for almost all y^{n}\in{\mathscr{Y}}^{n} and any {\textbf{{u}}}\in B_{1}^{c}

\sqrt{h_{\textbf{v}}(\textbf{u})}=\sqrt{\sum_{i=1}^{L}\frac{n}{q_{i}}\Big(\sum_{j=1}^{k}u_{j}q_{ij}-\frac{v_{i}(y^{n})}{n}\Big)^{2}}\geq 2n^{1/26}. (31)

Now combining (31) with (27), we deduce that for almost all y^{n}\in{\mathscr{Y}}^{n}, we can find a \tilde{{\textbf{{u}}}}=\tilde{{\textbf{{u}}}}(y^{n}) such that

B_{1}^{c}\subseteq\{\textbf{u}\in\mathbb{R}^{k-1}:\sqrt{h_{\textbf{v}}(\textbf{u})}-\sqrt{h_{\textbf{v}}(\tilde{\textbf{u}})}>n^{1/26}\}.

Once the details are filled in, this will establish (25), and thus we will be able to conclude that for almost all y^{n}\in{\mathscr{Y}}^{n}, the posterior distribution of U given Y^{n}=y^{n} is very close to {\mathcal{N}}(\Phi^{-1}{\textbf{{w}}},\Phi^{-1}).

It is a standard fact that under \ell_{2}^{2} loss, the optimal Bayes estimator for u is \mathbb{E}({\textbf{{U}}}|Y^{n}). Therefore, the optimal Bayes estimation loss is \sum_{i=1}^{k}\mathbb{E}(U_{i}-\mathbb{E}(U_{i}|Y^{n}))^{2}, and this is equal to the sum of the variances of the posterior distributions of U_{i} given Y^{n}. We just showed that the posterior distribution is very close to {\mathcal{N}}(\Phi^{-1}{\textbf{{w}}},\Phi^{-1}). Therefore, \sum_{i=1}^{k}\mathbb{E}(U_{i}-\mathbb{E}(U_{i}|Y^{n}))^{2} can be approximated as \operatorname{tr}(\Phi^{-1})+{\textbf{{1}}}^{T}\Phi^{-1}{\textbf{{1}}}, where 1 is the column vector of k-1 ones. More precisely, we can show that

r_{k,n}^{\ell_{2}^{2}}(\textbf{Q})\geq\operatorname{tr}(\Phi^{-1})+\textbf{1}^{T}\Phi^{-1}\textbf{1}-\big|O\big(n^{-14/13}\big)\big|.

Then we will prove that for all {\textbf{{Q}}}\in{\mathscr{D}}_{\epsilon,E},

\operatorname{tr}(\Phi^{-1})+\textbf{1}^{T}\Phi^{-1}\textbf{1}\geq\frac{1}{n}M(k,\epsilon),

and since by (11), (12), r_{k,n}^{\ell_{2}^{2}}(\textbf{Q}_{k,\epsilon,d^{\ast}},\hat{\textbf{p}})=\frac{1}{n}M(k,\epsilon), this will prove Theorem III.1.

V Asymptotically tight lower bound: Proof of Theorem III.1

In this section we develop the plan of attack outlined in Section IV. Since the proof is rather long, we divide it into several steps, isolating each of them in its own subsection.

V-A Output alphabet reduction

Here we fill in the details left out in Sec. IV-A. Given {\textbf{{Q}}}\in{\mathscr{D}}_{\epsilon,E}, assume without loss of generality that the output alphabet is {\mathscr{Y}}=\{1,2,\dots,L^{\prime}\} for some integer L^{\prime}. Define an equivalence relation “\equiv” on {\mathscr{Y}} as follows: for i_{1},i_{2}\in\{1,2,\dots,L^{\prime}\}, we say that i_{1}\equiv i_{2} if

\frac{\textbf{Q}(i_{1}|j)}{\sum_{j^{\prime}=1}^{k}\textbf{Q}(i_{1}|j^{\prime})}=\frac{\textbf{Q}(i_{2}|j)}{\sum_{j^{\prime}=1}^{k}\textbf{Q}(i_{2}|j^{\prime})}\text{ for all }j\in[k]. (32)

In other words, we say that i_{1}\equiv i_{2} if the vectors (\textbf{Q}(i_{1}|1),\textbf{Q}(i_{1}|2),\dots,\textbf{Q}(i_{1}|k)) and (\textbf{Q}(i_{2}|1),\textbf{Q}(i_{2}|2),\dots,\textbf{Q}(i_{2}|k)) are proportional to each other. It is easy to verify that \equiv is indeed an equivalence relation, and therefore it induces a partition of \mathscr{Y} into L disjoint equivalence classes \{A_{i}\}_{i=1}^{L}, so \mathscr{Y}=\cup_{i=1}^{L}A_{i}. By definition of \mathscr{D}_{\epsilon,E}, for any i\in\mathscr{Y}, the vector (\textbf{Q}(i|1),\textbf{Q}(i|2),\dots,\textbf{Q}(i|k)) is proportional to one of the vectors in \{1,e^{\epsilon}\}^{k}, which implies that L\leq 2^{k}.
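A minimal Python sketch of this merging step (our own helper; Q is an L^{\prime}\times k array whose i-th row is (\textbf{Q}(i|1),\dots,\textbf{Q}(i|k))): rows are grouped by their normalized form, exactly as in (32), and the rows in each class are summed, giving the L\times k array of q_{ij}=\textbf{Q}(A_{i}|j).

import numpy as np

def merge_proportional_outputs(Q, decimals=12):
    # Group output symbols whose conditional-probability rows are proportional (relation (32))
    # and sum each group; returns the L x k array of q_{ij} = Q(A_i | j).
    Q = np.asarray(Q, dtype=float)
    classes = {}
    for row in Q:
        key = tuple(np.round(row / row.sum(), decimals))  # proportional rows share this key
        classes[key] = classes.get(key, np.zeros_like(row)) + row
    return np.vstack(list(classes.values()))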

Next we will show that for the purposes of this proof, the original output alphabet {\mathscr{Y}} can be replaced with the alphabet \{A_{1},\dots,A_{L}\} with only minor notational changes. We briefly discuss them below, with the overall goal of writing out the pdf f_{U|Y^{n}} as in (21)-(22).

For i\in\{1,2,\dots,L^{\prime}\} and y^{n}\in{\mathscr{Y}}^{n}, let t^{\prime}_{i}(y^{n}) be the number of times that symbol i appears in y^{n}, and let

q^{\prime}_{i}=\frac{1}{k}\sum_{j=1}^{k}{\textbf{{Q}}}(i|j). (33)

For i\in\{1,2,\dots,L\} and j\in\{1,2,\dots,k\}, define

t_{i}(y^{n})=\sum_{a\in A_{i}}t^{\prime}_{a}(y^{n}),\quad q_{i}=\sum_{a\in A_{i}}q^{\prime}_{a},\quad q_{ij}=\textbf{Q}(A_{i}|j)=\sum_{a\in A_{i}}\textbf{Q}(a|j). (34)

By definition of the equivalence relation (32), we have

\frac{\textbf{Q}(a|j)}{q^{\prime}_{a}}=\frac{q_{ij}}{q_{i}}\text{ for all }a\in A_{i}\text{ and }j\in[k]. (35)

For i\in\{1,2,\dots,L^{\prime}\} and y^{n}\in{\mathscr{Y}}^{n}, let

\textbf{v}^{\prime}(y^{n})=(v^{\prime}_{1}(y^{n}),v^{\prime}_{2}(y^{n}),\dots,v^{\prime}_{L^{\prime}}(y^{n}))=(t^{\prime}_{1}(y^{n})-nq^{\prime}_{1},t^{\prime}_{2}(y^{n})-nq^{\prime}_{2},\dots,t^{\prime}_{L^{\prime}}(y^{n})-nq^{\prime}_{L^{\prime}}).

Since \sum_{i=1}^{L^{\prime}}t^{\prime}_{i}(y^{n})=\sum_{i=1}^{L^{\prime}}nq^{\prime}_{i}=n, we have \sum_{i=1}^{L^{\prime}}v^{\prime}_{i}(y^{n})=0 for every y^{n}\in\mathscr{Y}^{n}. We also define the random vector

\textbf{V}^{\prime}=(V^{\prime}_{1},V^{\prime}_{2},\dots,V^{\prime}_{L^{\prime}}):=(t^{\prime}_{1}(Y^{n})-nq^{\prime}_{1},t^{\prime}_{2}(Y^{n})-nq^{\prime}_{2},\dots,t^{\prime}_{L^{\prime}}(Y^{n})-nq^{\prime}_{L^{\prime}}).

For i\in[L] and y^{n}\in{\mathscr{Y}}^{n}, let

\textbf{v}(y^{n})=(v_{1}(y^{n}),v_{2}(y^{n}),\dots,v_{L}(y^{n}))=(t_{1}(y^{n})-nq_{1},t_{2}(y^{n})-nq_{2},\dots,t_{L}(y^{n})-nq_{L}).

Similarly, \sum_{i=1}^{L}v_{i}(y^{n})=0 for every y^{n}\in{\mathscr{Y}}^{n}. Moreover, by definition,

v_{i}(y^{n})=\sum_{a\in A_{i}}v^{\prime}_{a}(y^{n})\text{ for all }y^{n}\in\mathscr{Y}^{n}. (36)

We also define the random vector

\textbf{V}=(V_{1},V_{2},\dots,V_{L})=(t_{1}(Y^{n})-nq_{1},t_{2}(Y^{n})-nq_{2},\dots,t_{L}(Y^{n})-nq_{L}).

For simplicity of notation, from now on we write \textbf{v}^{\prime}(y^{n}) and \textbf{v}(y^{n}) as \textbf{v}^{\prime} and \textbf{v}, respectively. Similarly, we abbreviate v^{\prime}_{i}(y^{n}) and v_{i}(y^{n}) as v^{\prime}_{i} and v_{i}, respectively. In Proposition V.1 below, we will show that in order to prove the lower bound in Theorem III.1, we only need the quantities q_{i},v_{i},q_{ij},\ i\in[L],\ j\in[k]. Therefore, abusing notation, we will write \mathscr{Y} as \{A_{1},\dots,A_{L}\} and remove L^{\prime} and all the quantities associated with it from consideration in most parts of this proof.

Next let us switch our attention to the pdf f_{\textbf{U}|Y^{n}}. Given \textbf{p}=(p_{1},\dots,p_{k})\in\Delta_{k}, we define a (k-1)-dimensional vector \textbf{u}=(u_{1},u_{2},\dots,u_{k-1})=(p_{1}-1/k,p_{2}-1/k,\dots,p_{k-1}-1/k) and let \textbf{U}=(U_{1},\dots,U_{k-1}) be the corresponding random vector. Clearly, p_{k}=1/k-\sum_{i=1}^{k-1}u_{i}. As a result, estimating \textbf{p} is equivalent to estimating (u_{1},u_{2},\dots,u_{k-1},-\sum_{i=1}^{k-1}u_{i}), so from now on we will focus on estimating the latter. Given \textbf{u}\in\mathbb{R}^{k-1}, define u_{k}(\textbf{u})=-\sum_{i=1}^{k-1}u_{i} and define the corresponding random variable U_{k}(\textbf{U})=-\sum_{i=1}^{k-1}U_{i}. Below we write these quantities as u_{k} and U_{k}, respectively, for simplicity of notation.

We only consider distributions of \textbf{U} supported on a small ellipsoid centered at the origin, i.e., distributions of \textbf{p} concentrated around the uniform distribution (\frac{1}{k},\frac{1}{k},\dots,\frac{1}{k}). We use the Bayes estimation loss to bound below the minimax estimation loss r_{k,n}^{\ell_{2}^{2}}(\textbf{Q}). More specifically, we assume that the random vector \textbf{U} is uniformly distributed over the ellipsoid B_{1}=B(\frac{1}{n^{5/13}}).

We have the following proposition.

Proposition V.1.

Assume that \textbf{U} is uniformly distributed over the ellipsoid B_{1} and let f_{\textbf{U},Y^{n}} be the density of the joint distribution. For \textbf{u}\in B_{1} we have f_{\textbf{U},Y^{n}}(\textbf{u},y^{n})=C_{\textbf{v}^{\prime}}\exp(g(\textbf{u},y^{n})), where

g(\textbf{u},y^{n})=\sum_{i=1}^{L}(nq_{i}+v_{i})\log\Big(1+\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}\Big), (37)

and C_{\textbf{v}^{\prime}}=\frac{1}{\text{Vol}(B_{1})}\prod_{a=1}^{L^{\prime}}(q^{\prime}_{a})^{nq^{\prime}_{a}+v^{\prime}_{a}}.

Proof.

We can write the joint distribution density function of the random vectors U and Y^{n} as follows:

f_{\textbf{U},Y^{n}}(\textbf{u},y^{n})=\begin{cases}\frac{1}{C_{0}}\prod_{a=1}^{L^{\prime}}\big(q^{\prime}_{a}+\sum_{j=1}^{k}u_{j}\textbf{Q}(a|j)\big)^{t_{a}^{\prime}(y^{n})}&\text{if }\textbf{u}\in B_{1},\\ 0&\text{if }\textbf{u}\notin B_{1},\end{cases}

where C_{0} is the volume of B_{1}. For {\textbf{{u}}}\in B_{1}, we have

\begin{aligned}
f_{\textbf{U},Y^{n}}(\textbf{u},y^{n})&=\frac{1}{C_{0}}\prod_{a=1}^{L^{\prime}}(q^{\prime}_{a})^{nq^{\prime}_{a}+v^{\prime}_{a}}\prod_{a=1}^{L^{\prime}}\Big(1+\sum_{j=1}^{k}\frac{u_{j}\textbf{Q}(a|j)}{q^{\prime}_{a}}\Big)^{nq^{\prime}_{a}+v^{\prime}_{a}}\\
&=C_{\textbf{v}^{\prime}}\prod_{a=1}^{L^{\prime}}\exp\Big\{(nq^{\prime}_{a}+v^{\prime}_{a})\log\Big(1+\sum_{j=1}^{k}\frac{u_{j}\textbf{Q}(a|j)}{q^{\prime}_{a}}\Big)\Big\}\\
&=C_{\textbf{v}^{\prime}}\exp\Big(\sum_{a=1}^{L^{\prime}}(nq^{\prime}_{a}+v^{\prime}_{a})\log\Big(1+\sum_{j=1}^{k}\frac{u_{j}\textbf{Q}(a|j)}{q^{\prime}_{a}}\Big)\Big)\\
&=C_{\textbf{v}^{\prime}}\exp\Big(\sum_{i=1}^{L}\sum_{a\in A_{i}}(nq^{\prime}_{a}+v^{\prime}_{a})\log\Big(1+\sum_{j=1}^{k}\frac{u_{j}\textbf{Q}(a|j)}{q^{\prime}_{a}}\Big)\Big)\\
&=C_{\textbf{v}^{\prime}}\exp\Big(\sum_{i=1}^{L}(nq_{i}+v_{i})\log\Big(1+\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}\Big)\Big),
\end{aligned}

where C_{\textbf{v}^{\prime}}=\frac{1}{C_{0}}\prod_{a=1}^{L^{\prime}}(q^{\prime}_{a})^{nq^{\prime}_{a}+v^{\prime}_{a}} is a constant that depends on \textbf{v}^{\prime} but not on \textbf{u}, and the last equality follows from (35) and (36). ∎

V-B Approximating g({\textbf{{u}}},y^{n})

As already said, we will assume that the output alphabet of Q has the form \{A_{1},A_{2},\dots,A_{L}\} and will use the auxiliary quantities (the composition, etc.) associated with it according to their definitions in (34) and (36). Let

g_{2}(\textbf{u},y^{n}):=\sum_{i=1}^{L}v_{i}\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}-\frac{1}{2}\sum_{i=1}^{L}nq_{i}\Big(\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}\Big)^{2}. (38)

Further, let E_{1}\subseteq{\mathscr{Y}}^{n} be defined as follows:

E_{1}=\Big\{y^{n}:\sum_{i=1}^{L}|v_{i}|<2kn^{8/13}\Big\}.

We will show that when n is large, the difference between g({\textbf{{u}}},y^{n}) and g_{2}({\textbf{{u}}},y^{n}) is small for all {\textbf{{u}}}\in B_{1} and y^{n}\in E_{1}.

Proposition V.2.

Let g({\textbf{{u}}},y^{n}) be as defined in (37). Then there is an integer N(k,\epsilon) such that for every y^{n}\in E_{1},{\textbf{{u}}}\in B_{1} and n>N(k,\epsilon),

|g(\textbf{u},y^{n})-g_{2}(\textbf{u},y^{n})|<\frac{2k^{3}}{n^{2/13}}. (39)

Consequently, for all such y^{n},{\textbf{{u}}} and n,

|\exp(g(\textbf{u},y^{n}))-\exp(g_{2}(\textbf{u},y^{n}))|\leq\frac{4k^{3}}{n^{2/13}}\exp(g_{2}(\textbf{u},y^{n})). (40)
Proof.

1. We begin with approximating g({\textbf{{u}}},y^{n}) as follows:

g_{1}(\textbf{u},y^{n})=\sum_{i=1}^{L}(nq_{i}+v_{i})\Big(\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}-\frac{1}{2}\Big(\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}\Big)^{2}\Big).

For {\textbf{{u}}}\in B_{1}, we have |u_{j}|<\frac{1}{n^{5/13}} for all j\in\{1,2,\dots,k\} (see (18)), and so |\sum_{j=1}^{k}u_{j}q_{ij}|<\frac{1}{n^{5/13}}\sum_{j=1}^{k}q_{ij}. Using this inequality together with the definition of q_{i} (see Table I) we obtain

\Big|\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}\Big|<\frac{k}{n^{5/13}},\quad i=1,2,\dots,L. (41)

Given k,\epsilon, there is an integer N(k,\epsilon) such that for all n>N(k,\epsilon) we can bound the difference of g({\textbf{{u}}},y^{n}) and g_{1}({\textbf{{u}}},y^{n}) as follows: for all {\textbf{{u}}}\in B_{1} and y^{n}\in{\mathscr{Y}}^{n},

|g(\textbf{u},y^{n})-g_{1}(\textbf{u},y^{n})|\leq\sum_{i=1}^{L}(nq_{i}+v_{i})\Big|\log\Big(1+\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}\Big)-\Big(\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}-\frac{1}{2}\Big(\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}\Big)^{2}\Big)\Big|\overset{(a)}{\leq}\sum_{i=1}^{L}(nq_{i}+v_{i})\Big|\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}\Big|^{3}<\frac{k^{3}}{n^{15/13}}\sum_{i=1}^{L}(nq_{i}+v_{i})=\frac{k^{3}}{n^{2/13}}, (42)

where (a) follows from Prop. A.1 (Appendix A).

2. Multiplying out in the definition of g_{1}({\textbf{{u}}},y^{n}), we can simplify this expression as follows:

g_{1}(\textbf{u},y^{n})=n\sum_{i=1}^{L}\sum_{j=1}^{k}u_{j}q_{ij}+\sum_{i=1}^{L}v_{i}\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}-\frac{1}{2}\sum_{i=1}^{L}nq_{i}\Big(\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}\Big)^{2}-\frac{1}{2}\sum_{i=1}^{L}v_{i}\Big(\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}\Big)^{2}=\sum_{i=1}^{L}v_{i}\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}-\frac{1}{2}\sum_{i=1}^{L}nq_{i}\Big(\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}\Big)^{2}-\frac{1}{2}\sum_{i=1}^{L}v_{i}\Big(\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}\Big)^{2},

where the second equality follows because \sum_{i=1}^{L}q_{ij}=1 for all j=1,2,\dots,k and \sum_{j=1}^{k}u_{j}=0 (see (17)), and so \sum_{i=1}^{L}\sum_{j=1}^{k}u_{j}q_{ij}=0.

We bound the difference of g_{1}({\textbf{{u}}},y^{n}) and g_{2}({\textbf{{u}}},y^{n}) as follows:

\big|g_{1}(\textbf{u},y^{n})-g_{2}(\textbf{u},y^{n})\big|=\frac{1}{2}\Big|\sum_{i=1}^{L}v_{i}\Big(\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}\Big)^{2}\Big|<\frac{k^{2}}{2n^{10/13}}\|\textbf{v}\|_{1}\text{ for all }\textbf{u}\in B_{1}, (43)

where the inequality follows from (41). As an immediate consequence, for every y^{n}\in E_{1},

\big|g_{1}(\textbf{u},y^{n})-g_{2}(\textbf{u},y^{n})\big|\leq\frac{k^{3}}{n^{2/13}}\text{ for all }\textbf{u}\in B_{1}.

Combined with (42), we deduce that for every y^{n}\in E_{1},{\textbf{{u}}}\in B_{1}

\Big|g(\textbf{u},y^{n})-g_{2}(\textbf{u},y^{n})\Big|<\frac{2k^{3}}{n^{2/13}}.

Therefore, for all y^{n}\in E_{1},{\textbf{{u}}}\in B_{1}

\big|\exp(g(\textbf{u},y^{n}))-\exp(g_{2}(\textbf{u},y^{n}))\big|\leq\exp(g_{2}(\textbf{u},y^{n}))\max\Big\{\exp\Big(\frac{2k^{3}}{n^{2/13}}\Big)-1,\,1-\exp\Big(\frac{-2k^{3}}{n^{2/13}}\Big)\Big\}\leq\frac{4k^{3}}{n^{2/13}}\exp(g_{2}(\textbf{u},y^{n})), (44)

where the last inequality holds for all n\geq N(k,\epsilon) for a suitably chosen N(k,\epsilon), and it follows from the fact that |e^{x}-1|\leq 2|x| for all x\leq 1/2. ∎

Next we show that P(Y^{n}\in E_{1}) is close to 1 when n is large enough.

Proposition V.3.

There is an integer N(k,\epsilon) such that for all n>N(k,\epsilon),

P(Y^{n}\in E_{1})>1-\frac{1}{n^{1/13}}. (45)
Proof.

Given {\textbf{{u}}}\in B_{1}, define the event

E_{\textbf{u}}=\bigcap_{i=1}^{L}\Big\{y^{n}:\Big|v_{i}-\sum_{j=1}^{k}nu_{j}q_{ij}\Big|<\frac{kn^{8/13}}{2^{k}}\Big\}.

(Recall that v is a function of y^{n}). To prove the proposition, we first show that E_{{\textbf{{u}}}}\subseteq E_{1} for all {\textbf{{u}}}\in B_{1}. Then we show that P(Y^{n}\in(E_{{\textbf{{u}}}})^{c}|{\textbf{{U}}}={\textbf{{u}}}) is small for all {\textbf{{u}}}\in B_{1}.

By the triangle inequality, we have

\|\textbf{v}\|_{1}\leq\sum_{i=1}^{L}\Big|\sum_{j=1}^{k}nu_{j}q_{ij}\Big|+\sum_{i=1}^{L}\Big|v_{i}-\sum_{j=1}^{k}nu_{j}q_{ij}\Big|.

Further,

\sum_{i=1}^{L}\Big|\sum_{j=1}^{k}nu_{j}q_{ij}\Big|\leq\sum_{i=1}^{L}\sum_{j=1}^{k}n|u_{j}|q_{ij}\leq n^{8/13}\sum_{i=1}^{L}\sum_{j=1}^{k}q_{ij}=kn^{8/13},

and trivially,

\sum_{i=1}^{L}\Big|v_{i}-\sum_{j=1}^{k}nu_{j}q_{ij}\Big|\leq 2^{k}\max_{i\in[L]}\Big|v_{i}-\sum_{j=1}^{k}nu_{j}q_{ij}\Big|.

Combining these estimates, we obtain

\|\textbf{v}\|_{1}\leq kn^{8/13}+2^{k}\max_{1\leq i\leq L}\Big|v_{i}-\sum_{j=1}^{k}nu_{j}q_{ij}\Big|\text{ for all }\textbf{u}\in B_{1}.

As a result, for all {\textbf{{u}}}\in B_{1}

E_{{\textbf{{u}}}}\subseteq E_{1}. (46)

Note that, conditional on {\textbf{{U}}}={\textbf{{u}}}, the random variable t_{i}(Y^{n}) has binomial distribution B(n,q_{i}+\sum_{j=1}^{k}u_{j}q_{ij}). Using Hoeffding’s inequality, for every {\textbf{{u}}}\in B_{1},i=1,\dots,L we have

P\Big\{\Big|V_{i}-\sum_{j=1}^{k}nu_{j}q_{ij}\Big|\geq\frac{kn^{8/13}}{2^{k}}\,\Big|\,\textbf{U}=\textbf{u}\Big\}=P\Big\{\Big|\frac{t_{i}(Y^{n})}{n}-\Big(q_{i}+\sum_{j=1}^{k}u_{j}q_{ij}\Big)\Big|\geq\frac{k}{2^{k}n^{5/13}}\,\Big|\,\textbf{U}=\textbf{u}\Big\}\leq 2\exp\Big(-\frac{k^{2}n^{3/13}}{2^{2k-1}}\Big). (47)

Combining (46) and (47), for every {\textbf{{u}}}\in B_{1} we have

P(Y^{n}\in(E_{1})^{c}|\textbf{U}=\textbf{u})\leq P(Y^{n}\in(E_{\textbf{u}})^{c}|\textbf{U}=\textbf{u})\overset{(a)}{\leq}\sum_{i=1}^{L}P\Big\{\Big|V_{i}-\sum_{j=1}^{k}nu_{j}q_{ij}\Big|\geq\frac{kn^{8/13}}{2^{k}}\,\Big|\,\textbf{U}=\textbf{u}\Big\}\leq 2L\exp\Big\{-\frac{k^{2}n^{3/13}}{2^{2k-1}}\Big\}\overset{(b)}{\leq}2^{k+1}\exp\Big\{-\frac{k^{2}n^{3/13}}{2^{2k-1}}\Big\},

where (a) follows from the union bound and (b) follows from the fact that L\leq 2^{k}. Therefore,

P(Y^{n}\in E_{1})\geq 1-2^{k+1}\exp\Big\{-\frac{k^{2}n^{3/13}}{2^{2k-1}}\Big\}>1-\frac{1}{n^{1/13}},

where the last inequality holds for all n\geq N(k,\epsilon) for a suitably chosen N(k,\epsilon). ∎

V-C Gaussian approximation to the posterior distribution

The expression (38) for the function g_{2}({\textbf{{u}}},y^{n}) can also be written as

g_{2}(\textbf{u},y^{n})=-\sum_{i=1}^{L}\frac{n}{2q_{i}}\Big(\sum_{j=1}^{k}u_{j}q_{ij}-\frac{v_{i}}{n}\Big)^{2}+\sum_{i=1}^{L}\frac{v_{i}^{2}}{2nq_{i}}.

Given {\textbf{{v}}}\in\mathbb{R}^{L}, let us define a function h_{{\textbf{{v}}}}:\mathbb{R}^{k-1}\to\mathbb{R} and a constant C_{{\textbf{{v}}}} as

h_{\textbf{v}}(\textbf{u})=\sum_{i=1}^{L}\frac{n}{q_{i}}\Big(\sum_{j=1}^{k}u_{j}q_{ij}-\frac{v_{i}}{n}\Big)^{2}\text{ for all }\textbf{u}\in\mathbb{R}^{k-1},\quad C_{\textbf{v}}=\exp\Big(\sum_{i=1}^{L}\frac{v_{i}^{2}}{2nq_{i}}\Big). (48)

(see (23); note that C_{\textbf{{v}}} does not depend on u). Using this notation, we have

g_{2}(\textbf{u},y^{n})=-\frac{1}{2}h_{\textbf{v}}(\textbf{u})+\sum_{i=1}^{L}\frac{v_{i}^{2}}{2nq_{i}},\quad\exp(g_{2}(\textbf{u},y^{n}))=C_{\textbf{v}}\exp(-h_{\textbf{v}}(\textbf{u})/2). (49)

In order to estimate u from y^{n}, we need to find the conditional distribution f_{{\textbf{{U}}}|Y^{n}}({\textbf{{u}}}|y^{n}). As a first step, we need to calculate

P_{Y^{n}}(y^{n})=\int_{\mathbb{R}^{k-1}}f_{\textbf{U},Y^{n}}(\textbf{u},y^{n})\,d\textbf{u}=C_{\textbf{v}^{\prime}}\int_{B_{1}}\exp(g(\textbf{u},y^{n}))\,d\textbf{u},

where C_{\textbf{v}^{\prime}} is defined in Prop. V.1; however, evaluating this integral exactly appears difficult (while it is possible to find its asymptotics using, for instance, the multi-dimensional version of the Laplace method [18], controlling the error terms presents a problem, so we proceed in an ad hoc way). According to (40) and (45), when n is sufficiently large, with large probability the ratio between \exp(g_{2}(\textbf{u},y^{n})) and \exp(g(\textbf{u},y^{n})) is very close to 1. So we can use the following integral to approximate P_{Y^{n}}(y^{n}):

G(y^{n})=C_{{\textbf{{v}}}^{\prime}}\int_{B_{1}}\exp(g_{2}({\textbf{{u}}},y^{n}))d{\textbf{{u}}}=C_{{\textbf{{v}}}^{\prime}}C_{{\textbf{{v}}}}\int_{B_{1}}\exp\Big{(}-\frac{1}{2}h_{{\textbf{{v}}}}({\textbf{{u}}})\Big{)}d{\textbf{{u}}}. (50)

The integrand in (50) is proportional to the probability density function of some Gaussian distribution. We will make use of this property to obtain an approximation to G(y^{n}), which is in turn an approximation of P_{Y^{n}}(y^{n}).

Define

\delta=\delta({\textbf{{Q}}})=\min_{{\textbf{{u}}}\in\mathbb{R}^{k-1}:\sum_{i=1}^{k}u_{i}^{2}=1}\Big{(}\sum_{i=1}^{L}\frac{1}{q_{i}}\Big{(}\sum_{j=1}^{k}u_{j}q_{ij}\Big{)}^{2}\Big{)}^{1/2} (51)

(repeated here from (29)). Note that we are minimizing a continuous function over a compact set, so the minimum is attained. It is easy to verify that for all {\textbf{{u}}}\in\mathbb{R}^{k-1}

\Big{(}\sum_{i=1}^{L}\frac{1}{q_{i}}\Big{(}\sum_{j=1}^{k}u_{j}q_{ij}\Big{)}^{2}\Big{)}^{1/2}\geq\delta\Big{(}\sum_{i=1}^{k}u_{i}^{2}\Big{)}^{1/2}. (52)

Given k and \epsilon, let us define the constant

\delta_{0}=\delta_{0}(k,\epsilon):=\sqrt{\frac{1}{32M(k,\epsilon)}} (53)

where M(k,\epsilon) is given by (13). The remainder of the proof of Theorem III.1 depends on whether \delta\geq\delta_{0} or not. We divide the proof into these two cases because without this division, we can only prove a weaker version of (19), where the constants C(k,\epsilon) and N(k,\epsilon) are replaced with constants that depend on Q.

V-D Case 1: \delta\geq\delta_{0}

We use the Bayes estimation loss to bound below r_{k,n}^{\ell_{2}^{2}}({\textbf{{Q}}}). It is well known that under \ell_{2}^{2} loss, the optimal Bayes estimator for u_{i} is \mathbb{E}(U_{i}|Y^{n}). Therefore, the optimal Bayes estimation loss is

\sum_{i=1}^{k}\mathbb{E}(U_{i}-\mathbb{E}(U_{i}|Y^{n}))^{2}, (54)

and this is equal to the sum of the variances of the posterior distributions of U_{i} given Y^{n}. The main idea is to approximate the posterior distribution of U given Y^{n} by a multivariate Gaussian distribution. More specifically, we define a (k-1)\times(k-1) matrix \Phi(n,{\textbf{{Q}}}):=\sum_{i=1}^{L}{\textbf{{z}}}_{i}^{T}{\textbf{{z}}}_{i}, where

{\textbf{{z}}}_{i}=\sqrt{\frac{n}{q_{i}}}(q_{i,1}-q_{i,k},q_{i,2}-q_{i,k},% \dots,q_{i,k-1}-q_{i,k}). (55)

Equivalently, \Phi(n,{\textbf{{Q}}}) is defined by its associated quadratic form as follows:

{\textbf{{u}}}^{T}\Phi(n,{\textbf{{Q}}}){\textbf{{u}}}=\sum_{i=1}^{L}\frac{n}{% q_{i}}\Big{(}\sum_{j=1}^{k}u_{j}q_{ij}\Big{)}^{2}=\sum_{i=1}^{L}\frac{n}{q_{i}% }\Big{(}\sum_{j=1}^{k-1}u_{j}(q_{ij}-q_{ik})\Big{)}^{2}\text{~{}for all~{}}{% \textbf{{u}}}\in\mathbb{R}^{k-1}. (56)

Since \delta\geq\delta_{0}>0, we have {\textbf{{u}}}^{T}\Phi(n,{\textbf{{Q}}}){\textbf{{u}}}>0 for all {\textbf{{u}}}\neq{\textbf{{0}}}, which shows that \Phi(n,{\textbf{{Q}}}) is positive definite. For simplicity of notation, we write it as \Phi, omitting the arguments. We will show that when n is large enough, there is a set E_{2}\subseteq{\mathscr{Y}}^{n} with P(Y^{n}\in E_{2}) close to 1, such that conditional on every y^{n}\in E_{1}\cap E_{2}, the covariance matrix of U is close to \Phi^{-1}. This will enable us to conclude that \sum_{i=1}^{k}\mathbb{E}(U_{i}-\mathbb{E}(U_{i}|Y^{n}))^{2} can be approximated as \operatorname{tr}(\Phi^{-1})+{\textbf{{1}}}^{T}\Phi^{-1}{\textbf{{1}}}, where 1 is the all-ones column vector of length k-1.
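
The quantity \operatorname{tr}(\Phi^{-1})+{\textbf{{1}}}^{T}\Phi^{-1}{\textbf{{1}}} is easy to evaluate numerically. The following sketch builds \Phi(n,{\textbf{{Q}}}) from (55) for a generic column-stochastic matrix Q; it assumes q_{ij}=Q(i|j) and q_{i}=\frac{1}{k}\sum_{j}Q(i|j) (the output probability under the uniform input), which is consistent with the binomial parameter q_{i}+\sum_{j}u_{j}q_{ij} used above. The random Q is for illustration only and need not be \epsilon-private.

```python
import numpy as np

def phi_matrix(Q, n):
    """Build Phi(n, Q) = sum_i z_i^T z_i as in (55)-(56).

    Q is an L x k matrix with Q[i, j] = Q(i|j) (columns sum to 1);
    q_i is taken as the output probability under the uniform input,
    q_i = (1/k) * sum_j Q[i, j]  (an assumption consistent with the
    binomial parameter q_i + sum_j u_j q_{ij} used in the proof).
    """
    L, k = Q.shape
    q = Q.mean(axis=1)                       # q_i, length L
    # z_i = sqrt(n/q_i) * (q_{i,1}-q_{i,k}, ..., q_{i,k-1}-q_{i,k})
    Z = np.sqrt(n / q)[:, None] * (Q[:, :k - 1] - Q[:, [k - 1]])
    return Z.T @ Z                           # (k-1) x (k-1) sum of outer products

def bayes_variance_term(Q, n):
    """tr(Phi^{-1}) + 1^T Phi^{-1} 1, the approximation to the Bayes loss."""
    Phi_inv = np.linalg.inv(phi_matrix(Q, n))
    one = np.ones(Phi_inv.shape[0])
    return np.trace(Phi_inv) + one @ Phi_inv @ one

# Example: a random column-stochastic Q (illustration only; after the
# column normalization it is not necessarily epsilon-private).
rng = np.random.default_rng(1)
k, L, n, eps = 4, 6, 10**5, 1.0
Q = rng.uniform(1.0, np.exp(eps), size=(L, k))
Q /= Q.sum(axis=0, keepdims=True)
print(bayes_variance_term(Q, n))
```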

More precisely, our proof of Case 1 consists of two steps, summarized in the following proposition.

Proposition V.4.

There are two positive constants C_{3}(k,\epsilon),C_{4}(k,\epsilon) and an integer N(k,\epsilon) such that when n\geq N(k,\epsilon), the following lower bound holds for all {\textbf{{Q}}}\in{\mathscr{D}}_{\epsilon,E}:

r_{k,n}^{\ell_{2}^{2}}({\textbf{{Q}}})\geq\Big{(}1-\frac{C_{4}(k,\epsilon)}{n^{1/13}}\Big{)}(\operatorname{tr}(\Phi^{-1})+{\textbf{{1}}}^{T}\Phi^{-1}{\textbf{{1}}})-\frac{C_{3}(k,\epsilon)}{n^{14/13}}. (57)

Furthermore, for all {\textbf{{Q}}}\in{\mathscr{D}}_{\epsilon,E} and all positive integers n,

\operatorname{tr}(\Phi^{-1})+{\textbf{{1}}}^{T}\Phi^{-1}{\textbf{{1}}}\geq\frac{1}{n}M(k,\epsilon). (58)

Once both (57) and (58) are established, this will complete the proof of Theorem III.1 for Case 1 because of (13), (14) and (19).

V-D1 Proof of (57)

In order to prove inequality (57), we develop a refined version of the Local Asymptotic Normality approach [17, 16]222A detailed discussion of the connection between our proof and LAN appears in Appendix J.. Consider the ellipsoid (see (18))

B_{2}=B\Big{(}\frac{1}{n^{5/13}}-\frac{3/\delta_{0}}{n^{6/13}}\Big{)},

and define a subset E_{2}\subseteq{\mathscr{Y}}^{n} as follows:

E_{2}=\Big{\{}y^{n}:\exists\tilde{{\textbf{{u}}}}\in B_{2}\text{~{}such that~{% }}\sum_{i=1}^{L}\frac{n}{q_{i}}\Big{(}\sum_{j=1}^{k}\tilde{u}_{j}q_{ij}-\frac{% v_{i}}{n}\Big{)}^{2}<n^{1/13}\Big{\}}.

Recall our convention that \tilde{u}_{k}=\tilde{u}_{k}(\tilde{{\textbf{{u}}}})=-\sum_{i=1}^{k-1}\tilde{u}% _{i}, which stems from the fact that we are working with PMFs.

Proposition V.5.

There is an integer N(k,\epsilon) such that for all n>N(k,\epsilon),

P(Y^{n}\in E_{2})\geq 1-\frac{2^{k+1}+3k/\delta_{0}}{n^{1/13}}. (59)

The proof is given in Appendix B.

Our goal will be to show that for y^{n}\in E_{2}, the ratio between G(y^{n}) in (50) and

H(y^{n})=C_{{\textbf{{v}}}^{\prime}}C_{{\textbf{{v}}}}\int_{\mathbb{R}^{k-1}}% \exp\Big{(}-\frac{1}{2}h_{{\textbf{{v}}}}({\textbf{{u}}})\Big{)}d{\textbf{{u}}} (60)

is very close to 1. Specifically, we have

Proposition V.6.

There is an integer N(k,\epsilon) such that for all n>N(k,\epsilon) and all y^{n}\in E_{2},

\Big{(}1-\frac{256}{n^{2/13}}\Big{)}H(y^{n})\leq G(y^{n})\leq H(y^{n}) (61)

The upper bound on G(y^{n}) in (61) is obvious. To prove the lower bound, we need several auxiliary definitions and propositions.

Given \alpha>0 and \tilde{{\textbf{{u}}}}\in\mathbb{R}^{k-1}, define the set

E_{{\textbf{{v}}}}(\tilde{{\textbf{{u}}}},\alpha)=\{{\textbf{{u}}}\in\mathbb{R% }^{k-1}:\sqrt{h_{{\textbf{{v}}}}({\textbf{{u}}})}-\sqrt{h_{{\textbf{{v}}}}(% \tilde{{\textbf{{u}}}})}>\alpha\}. (62)
Proposition V.7.

For every y^{n}\in E_{2} and \alpha\geq n^{-5/13}, there exists \tilde{{\textbf{{u}}}}\in B_{2} such that the ellipsoid B(\alpha) defined in (18) satisfies

(B(\alpha))^{c}\subseteq E_{{\textbf{{v}}}}(\tilde{{\textbf{{u}}}},\delta_{0}n% ^{1/2}(\alpha-n^{-5/13})+n^{1/26}). (63)

The proof is given in Appendix C.

Proposition V.8.

Let \Phi be an s\times s positive definite matrix, and let t be a column vector in \mathbb{R}^{s}. Define a quadratic function h:\mathbb{R}^{s}\to\mathbb{R} as follows:

h({\textbf{{u}}})=({\textbf{{u}}}-{\textbf{{t}}})^{T}\Phi({\textbf{{u}}}-{% \textbf{{t}}})+C, (64)

where C\geq 0 is a nonnegative constant. For a given \tilde{{\textbf{{u}}}}\in\mathbb{R}^{s}, define the set

E(\tilde{{\textbf{{u}}}},\alpha)=\{{\textbf{{u}}}\in\mathbb{R}^{s}:\sqrt{h({% \textbf{{u}}})}-\sqrt{h(\tilde{{\textbf{{u}}}})}>\alpha\}.

For all \alpha\geq\sqrt{s} and all \tilde{{\textbf{{u}}}}\in\mathbb{R}^{s} the following inequality holds true:

\frac{\int_{E(\tilde{{\textbf{{u}}}},\alpha)}\exp(-\frac{1}{2}h({\textbf{{u}}}% ))d{\textbf{{u}}}}{\int_{\mathbb{R}^{s}}\exp(-\frac{1}{2}h({\textbf{{u}}}))d{% \textbf{{u}}}}\leq e^{-\frac{(\alpha-\sqrt{s})^{2}}{2}}. (65)

The proof is given in Appendix D.

Proof of Proposition V.6: Our plan is to use Prop. V.8 for the function h_{{\textbf{{v}}}} defined in (48) in order to estimate G(y^{n}). First let us show that h_{{\textbf{{v}}}} can be written as a quadratic form of the type (64). Define a vector {\textbf{{w}}}\in\mathbb{R}^{k-1} as follows:

{\textbf{{w}}}={\textbf{{w}}}({\textbf{{v}}},{\textbf{{Q}}})=\Big{(}\sum_{i=1}% ^{L}\frac{(q_{im}-q_{ik})v_{i}}{q_{i}},m=1,\dots,k-1\Big{)}^{T}. (66)

Then we have

\displaystyle h_{{\textbf{{v}}}}({\textbf{{u}}}) \displaystyle=\sum_{i=1}^{L}\frac{n}{q_{i}}\Big{(}\sum_{j=1}^{k}u_{j}q_{ij}% \Big{)}^{2}-2\sum_{i=1}^{L}\sum_{j=1}^{k}\frac{v_{i}q_{ij}}{q_{i}}u_{j}+\sum_{% i=1}^{L}\frac{v_{i}^{2}}{nq_{i}}
\displaystyle={\textbf{{u}}}^{T}\Phi{\textbf{{u}}}-2\sum_{j=1}^{k}\Big{(}\sum_% {i=1}^{L}\frac{q_{ij}v_{i}}{q_{i}}\Big{)}u_{j}+\sum_{i=1}^{L}\frac{v_{i}^{2}}{% nq_{i}}
\displaystyle={\textbf{{u}}}^{T}\Phi{\textbf{{u}}}-2\sum_{j=1}^{k-1}\Big{(}% \sum_{i=1}^{L}\frac{(q_{ij}-q_{ik})v_{i}}{q_{i}}\Big{)}u_{j}+\sum_{i=1}^{L}% \frac{v_{i}^{2}}{nq_{i}}
\displaystyle={\textbf{{u}}}^{T}\Phi{\textbf{{u}}}-{\textbf{{w}}}^{T}{\textbf{% {u}}}-{\textbf{{u}}}^{T}{\textbf{{w}}}+\sum_{i=1}^{L}\frac{v_{i}^{2}}{nq_{i}}
\displaystyle=({\textbf{{u}}}-\Phi^{-1}{\textbf{{w}}})^{T}\Phi({\textbf{{u}}}-% \Phi^{-1}{\textbf{{w}}})-{\textbf{{w}}}^{T}\Phi^{-1}{\textbf{{w}}}+\sum_{i=1}^% {L}\frac{v_{i}^{2}}{nq_{i}}, (67)

where the second equality follows from the definition of \Phi in (56).

Since h_{{\textbf{{v}}}}({\textbf{{u}}})\geq 0 for all {\textbf{{u}}}\in\mathbb{R}^{k-1}, the constant term C:=-{\textbf{{w}}}^{T}\Phi^{-1}{\textbf{{w}}}+\sum_{i=1}^{L}\frac{v_{i}^{2}}{% nq_{i}} is nonnegative. This shows that h_{{\textbf{{v}}}} satisfies the conditions in Prop. V.8.
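
The completing-the-square identity (67) can be checked numerically. The sketch below draws arbitrary q_{i}, q_{ij}, v_{i} (with q_{ij}=Q(i|j) and q_{i}=\frac{1}{k}\sum_{j}q_{ij}, the same assumption as in the sketch after (56)) and verifies that h_{{\textbf{{v}}}}({\textbf{{u}}})=({\textbf{{u}}}-\Phi^{-1}{\textbf{{w}}})^{T}\Phi({\textbf{{u}}}-\Phi^{-1}{\textbf{{w}}})+C with C=-{\textbf{{w}}}^{T}\Phi^{-1}{\textbf{{w}}}+\sum_{i}v_{i}^{2}/(nq_{i}). Since v is drawn arbitrarily rather than from the counts of an actual sample, C need not be nonnegative here; only the algebraic identity is being checked.

```python
import numpy as np

rng = np.random.default_rng(2)
k, L, n = 5, 8, 1000

Q = rng.uniform(1.0, 2.0, size=(L, k))
Q /= Q.sum(axis=0, keepdims=True)             # columns are conditional pmfs
q = Q.mean(axis=1)                            # q_i under the uniform input (assumption)
v = rng.normal(size=L)                        # arbitrary v; only the identity is checked

# Phi as in (56) and w as in (66).
Z = np.sqrt(n / q)[:, None] * (Q[:, :k - 1] - Q[:, [k - 1]])
Phi = Z.T @ Z
w = ((Q[:, :k - 1] - Q[:, [k - 1]]) * (v / q)[:, None]).sum(axis=0)

def h_v(u):
    """h_v(u) from (48), with the convention u_k = -sum(u)."""
    u_full = np.append(u, -u.sum())
    return np.sum(n / q * (Q @ u_full - v / n) ** 2)

u = rng.normal(scale=1e-2, size=k - 1)
Phi_inv_w = np.linalg.solve(Phi, w)
C = -w @ Phi_inv_w + np.sum(v**2 / (n * q))
quad = (u - Phi_inv_w) @ Phi @ (u - Phi_inv_w) + C
print(np.isclose(h_v(u), quad))               # True: identity (67) holds
```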

To estimate G(y^{n}) we first note that, by Prop. V.7, for every y^{n}\in E_{2}, there exists \tilde{{\textbf{{u}}}}\in B_{2} such that (B_{1})^{c}\subseteq E_{{\textbf{{v}}}}(\tilde{{\textbf{{u}}}},n^{1/26}). We then obtain

\displaystyle G(y^{n}) \displaystyle=C_{{\textbf{{v}}}^{\prime}}C_{{\textbf{{v}}}}\int_{B_{1}}\exp% \Big{(}-\frac{1}{2}h_{{\textbf{{v}}}}({\textbf{{u}}})\Big{)}d{\textbf{{u}}}
\displaystyle\geq C_{{\textbf{{v}}}^{\prime}}C_{{\textbf{{v}}}}\Big{(}\int_{% \mathbb{R}^{k-1}}\exp\Big{(}-\frac{1}{2}h_{{\textbf{{v}}}}({\textbf{{u}}})\Big% {)}d{\textbf{{u}}}-\int_{E_{{\textbf{{v}}}}(\tilde{{\textbf{{u}}}},n^{1/26})}% \exp\Big{(}-\frac{1}{2}h_{{\textbf{{v}}}}({\textbf{{u}}})\Big{)}d{\textbf{{u}}% }\Big{)}
\displaystyle\geq H(y^{n})\Big{(}1-\exp\Big{(}-\frac{1}{2}\big{(}n^{1/26}-% \sqrt{k-1}\big{)}^{2}\Big{)}\Big{)}
\displaystyle\geq H(y^{n})\Big{(}1-\frac{256}{n^{2/13}}\Big{)}.

Here the next-to-last step follows from Prop. V.8, and the last inequality holds for all n\geq N(k,\epsilon) as long as we choose N(k,\epsilon) large enough. This proves the lower bound in (61). ∎

We have shown that G(y^{n}) can be approximated by H(y^{n}). Based on this, we can further show that P_{Y^{n}}(y^{n}) can be approximated by H(y^{n}).

Lemma V.9.

There is an integer N(k,\epsilon) such that for all n>N(k,\epsilon) and all y^{n}\in E_{1}\cap E_{2},

\Big{|}P_{Y^{n}}(y^{n})-H(y^{n})\Big{|}\leq\frac{4k^{3}+256}{n^{2/13}}H(y^{n}). (68)
Proof.

For every y^{n}\in E_{1} we have

\displaystyle|P_{Y^{n}}(y^{n})-G(y^{n})| \displaystyle=\Big{|}C_{{\textbf{{v}}}^{\prime}}\int_{B_{1}}\exp(g({\textbf{{u% }}},y^{n}))d{\textbf{{u}}}-C_{{\textbf{{v}}}^{\prime}}\int_{B_{1}}\exp(g_{2}({% \textbf{{u}}},y^{n}))d{\textbf{{u}}}\Big{|}
\displaystyle\leq C_{{\textbf{{v}}}^{\prime}}\int_{B_{1}}|\exp(g({\textbf{{u}}% },y^{n}))-\exp(g_{2}({\textbf{{u}}},y^{n}))|d{\textbf{{u}}}
\displaystyle\overset{(a)}{\leq}\frac{4k^{3}}{n^{2/13}}C_{{\textbf{{v}}}^{% \prime}}\int_{B_{1}}\exp(g_{2}({\textbf{{u}}},y^{n}))d{\textbf{{u}}}=\frac{4k^% {3}}{n^{2/13}}G(y^{n})
\displaystyle\leq\frac{4k^{3}}{n^{2/13}}H(y^{n}),

where (a) follows from (40). On account of (61), this concludes the proof. ∎

Define

f_{1}({\textbf{{u}}},y^{n})=\frac{C_{{\textbf{{v}}}^{\prime}}\exp(g_{2}({% \textbf{{u}}},y^{n}))}{H(y^{n})}. (69)

For every y^{n}\in E_{1}\cap E_{2}, we can approximate the conditional probability density function f_{{\textbf{{U}}}|Y^{n}}({\textbf{{u}}}|Y^{n}=y^{n}) as f_{1}({\textbf{{u}}},y^{n}). At the same time, we will also show that for a fixed y^{n}, f_{1}({\textbf{{u}}},y^{n}) is the probability density function of a Gaussian random vector with mean vector \Phi^{-1}{\textbf{{w}}} and covariance matrix \Phi^{-1}.

The difference is bounded as follows: for all {\textbf{{u}}}\in B_{1},

\displaystyle|f_{{\textbf{{U}}}|Y^{n}}({\textbf{{u}}}|Y^{n}=y^{n})-f_{1}({% \textbf{{u}}},y^{n})|=\Big{|}\frac{C_{{\textbf{{v}}}^{\prime}}\exp(g({\textbf{% {u}}},y^{n}))}{P_{Y^{n}}(y^{n})}-\frac{C_{{\textbf{{v}}}^{\prime}}\exp(g_{2}({% \textbf{{u}}},y^{n}))}{H(y^{n})}\Big{|}
\displaystyle\overset{(a)}{\leq}\frac{C_{{\textbf{{v}}}^{\prime}}\exp(g_{2}({% \textbf{{u}}},y^{n}))}{H(y^{n})}\max\Big{(}\frac{1+4k^{3}/(n^{2/13})}{1-(4k^{3% }+256)/(n^{2/13})}-1,1-\frac{1-4k^{3}/(n^{2/13})}{1+(4k^{3}+256)/(n^{2/13})}% \Big{)}
\displaystyle\overset{(b)}{\leq}\frac{16k^{3}+512}{n^{2/13}}f_{1}({\textbf{{u}% }},y^{n})
\displaystyle\overset{(c)}{=}\frac{C_{1}(k,\epsilon)}{n^{2/13}}f_{1}({\textbf{% {u}}},y^{n}), (70)

where (a) follows from (40) and (68); (b) holds for all n\geq N(k,\epsilon) as long as we take N(k,\epsilon)>(8k^{3}+512)^{13/2}; in (c) we define C_{1}(k,\epsilon):=16k^{3}+512. By definition,

\displaystyle f_{1}({\textbf{{u}}},y^{n}) \displaystyle=\frac{C_{{\textbf{{v}}}^{\prime}}C_{{\textbf{{v}}}}\exp\Big{(}-% \frac{1}{2}h_{{\textbf{{v}}}}({\textbf{{u}}})\Big{)}}{C_{{\textbf{{v}}}^{% \prime}}C_{{\textbf{{v}}}}\int_{\mathbb{R}^{k-1}}\exp\Big{(}-\frac{1}{2}h_{{% \textbf{{v}}}}({\textbf{{u}}})\Big{)}d{\textbf{{u}}}}
\displaystyle\overset{(a)}{=}\frac{\exp\Big{(}-\frac{1}{2}({\textbf{{u}}}-\Phi% ^{-1}{\textbf{{w}}})^{T}\Phi({\textbf{{u}}}-\Phi^{-1}{\textbf{{w}}})\Big{)}}{% \int_{\mathbb{R}^{k-1}}\exp\Big{(}-\frac{1}{2}({\textbf{{u}}}-\Phi^{-1}{% \textbf{{w}}})^{T}\Phi({\textbf{{u}}}-\Phi^{-1}{\textbf{{w}}})d{\textbf{{u}}}% \Big{)}}
\displaystyle=\frac{\sqrt{|\Phi|}}{\sqrt{(2\pi)^{k-1}}}\exp\Big{(}-\frac{1}{2}% ({\textbf{{u}}}-\Phi^{-1}{\textbf{{w}}})^{T}\Phi({\textbf{{u}}}-\Phi^{-1}{% \textbf{{w}}})\Big{)}, (71)

where (a) follows from (67). Thus f_{1} indeed represents a Gaussian pdf.

Recall that our goal is to estimate the Bayes estimation loss (54). Proceeding in this direction, we show next that for every y^{n}\in E_{1}\cap E_{2}, the \ell_{1} distance between \mathbb{E}({\textbf{{U}}}|Y^{n}=y^{n}) and \Phi^{-1}{\textbf{{w}}} is very small.

Lemma V.10.

There is an integer N(k,\epsilon) such that for all n>N(k,\epsilon) and all y^{n}\in E_{1}\cap E_{2},

\|\mathbb{E}({\textbf{{U}}}|Y^{n}=y^{n})-\Phi^{-1}{\textbf{{w}}}\|_{1}\leq% \frac{C_{2}(k,\epsilon)}{n^{7/13}},

where C_{2}(k,\epsilon):=\frac{32k^{4}e^{\epsilon}(k^{3}+32)}{\delta_{0}^{2}}+\sqrt{% k-1}.

Proof.
\displaystyle\|\mathbb{E}({\textbf{{U}}}|Y^{n}=y^{n})-\Phi^{-1}{\textbf{{w}}}% \|_{1}
\displaystyle=\Big{\|}\int_{B_{1}}{\textbf{{u}}}f_{{\textbf{{U}}}|Y^{n}}({% \textbf{{u}}}|Y^{n}=y^{n})d{\textbf{{u}}}-\int_{\mathbb{R}^{k-1}}{\textbf{{u}}% }f_{1}({\textbf{{u}}},y^{n})d{\textbf{{u}}}\Big{\|}_{1}
\displaystyle=\sum_{i=1}^{k-1}\Big{|}\int_{B_{1}}u_{i}f_{{\textbf{{U}}}|Y^{n}}% ({\textbf{{u}}}|Y^{n}=y^{n})d{\textbf{{u}}}-\int_{\mathbb{R}^{k-1}}u_{i}f_{1}(% {\textbf{{u}}},y^{n})d{\textbf{{u}}}\Big{|}
\displaystyle\leq\sum_{i=1}^{k-1}\Big{|}\int_{B_{1}}u_{i}(f_{{\textbf{{U}}}|Y^% {n}}({\textbf{{u}}}|Y^{n}=y^{n})-f_{1}({\textbf{{u}}},y^{n}))d{\textbf{{u}}}% \Big{|}+\sum_{i=1}^{k-1}\Big{|}\int_{(B_{1})^{c}}u_{i}f_{1}({\textbf{{u}}},y^{% n})d{\textbf{{u}}}\Big{|}
\displaystyle\overset{(a)}{\leq}\frac{16k^{3}+512}{n^{2/13}}\sum_{i=1}^{k-1}% \int_{B_{1}}|u_{i}|f_{1}({\textbf{{u}}},y^{n})d{\textbf{{u}}}+\sum_{i=1}^{k-1}% \int_{(B_{1})^{c}}|u_{i}|f_{1}({\textbf{{u}}},y^{n})d{\textbf{{u}}}
\displaystyle\leq\frac{16k^{3}+512}{n^{2/13}}\int_{\mathbb{R}^{k-1}}\|{\textbf% {{u}}}\|_{1}f_{1}({\textbf{{u}}},y^{n})d{\textbf{{u}}}+\int_{(B_{1})^{c}}\|{% \textbf{{u}}}\|_{1}f_{1}({\textbf{{u}}},y^{n})d{\textbf{{u}}}
\displaystyle\overset{(b)}{\leq}\frac{16k^{3}+512}{n^{2/13}}\biggl{(}\sqrt{k-1% }\|\Phi^{-1}{\textbf{{w}}}\|_{2}+\sqrt{\frac{2(k-1)}{\pi}\operatorname{tr}(% \Phi^{-1})}\biggr{)}+\int_{(B_{1})^{c}}\|{\textbf{{u}}}\|_{1}f_{1}({\textbf{{u% }}},y^{n})d{\textbf{{u}}}, (72)

where (a) follows from (70); and (b) is justified in Appendix E.

Now let us write out w from its definition in (66):

\displaystyle\|{\textbf{{w}}}\|_{2} \displaystyle\leq\|{\textbf{{w}}}\|_{1}=\sum_{j=1}^{k-1}\Big{|}\sum_{i=1}^{L}% \frac{(q_{ij}-q_{ik})v_{i}}{q_{i}}\Big{|}
\displaystyle\leq ke^{\epsilon}\sum_{j=1}^{k-1}\sum_{i=1}^{L}|v_{i}|
\displaystyle\leq 2k^{2}(k-1)e^{\epsilon}n^{8/13}\text{~{}for all~{}}y^{n}\in E% _{1},

where the second line follows from the fact that the vector (q_{i,1},q_{i,2},\dots,q_{ik}) is proportional to one of the vectors in \{1,e^{\epsilon}\}^{k} and the third one follows directly from the definition of E_{1}.

Next we claim that all the eigenvalues of \Phi^{-1} are no greater than 1/(n\delta_{0}^{2}). Indeed, according to (52) and (56), for all {\textbf{{u}}}\in\mathbb{R}^{k-1}

{\textbf{{u}}}^{T}\Phi{\textbf{{u}}}=n\sum_{i=1}^{L}\frac{1}{q_{i}}\Big{(}\sum% _{j=1}^{k}u_{j}q_{ij}\Big{)}^{2}\geq n\delta^{2}\Big{(}\sum_{i=1}^{k}u_{i}^{2}% \Big{)}\geq n\delta_{0}^{2}\|{\textbf{{u}}}\|_{2}^{2}. (73)

Using this, we have

\displaystyle\|\Phi^{-1}{\textbf{{w}}}\|_{2}\leq\frac{1}{n\delta_{0}^{2}}\|{% \textbf{{w}}}\|_{2}<\frac{2k^{3}e^{\epsilon}}{n^{5/13}\delta_{0}^{2}}\text{~{}% for all~{}}y^{n}\in E_{1},
\displaystyle\operatorname{tr}(\Phi^{-1})\leq\frac{k-1}{n\delta_{0}^{2}}<\frac% {k}{n^{10/13}\delta_{0}^{4}},

where the last inequality holds for all n\geq N(k,\epsilon) as long as we set N(k,\epsilon) to be large enough. Combining the two inequalities above with (72), we deduce that for every y^{n}\in E_{1}\cap E_{2},

\displaystyle\Big{\|}\mathbb{E}({\textbf{{U}}}|Y^{n} \displaystyle=y^{n})-\Phi^{-1}{\textbf{{w}}}\Big{\|}_{1}
\displaystyle\leq\frac{16k^{3}+512}{n^{2/13}}\frac{2k^{4}e^{\epsilon}}{n^{5/13% }\delta_{0}^{2}}+\int_{(B_{1})^{c}}\|{\textbf{{u}}}\|_{1}f_{1}({\textbf{{u}}},% y^{n})d{\textbf{{u}}}
\displaystyle=\frac{32k^{4}e^{\epsilon}(k^{3}+32)}{n^{7/13}\delta_{0}^{2}}+% \int_{(B_{1})^{c}}\|{\textbf{{u}}}\|_{1}f_{1}({\textbf{{u}}},y^{n})d{\textbf{{% u}}}
\displaystyle\leq\frac{32k^{4}e^{\epsilon}(k^{3}+32)}{n^{7/13}\delta_{0}^{2}}+% \sqrt{k-1}\int_{(B_{1})^{c}}\|{\textbf{{u}}}\|_{2}f_{1}({\textbf{{u}}},y^{n})d% {\textbf{{u}}}
\displaystyle\leq\frac{32k^{4}e^{\epsilon}(k^{3}+32)}{n^{7/13}\delta_{0}^{2}}+% \sqrt{k-1}\int_{(B_{1})^{c}}\Big{(}\sum_{i=1}^{k}u_{i}^{2}\Big{)}^{1/2}f_{1}({% \textbf{{u}}},y^{n})d{\textbf{{u}}}
\displaystyle<\frac{32k^{4}e^{\epsilon}(k^{3}+32)}{n^{7/13}\delta_{0}^{2}}+% \frac{\sqrt{k-1}}{n^{7/13}}, (74)

where the last inequality follows by the estimate (99) proved in Appendix F below. This completes the proof of the lemma. ∎

In the next step we bound the conditional expectation \sum_{i=1}^{k}\mathbb{E}(U_{i}-\mathbb{E}(U_{i}|Y^{n}))^{2} when Y^{n}\in E_{1}\cap E_{2}.

Lemma V.11.

There is an integer N(k,\epsilon) such that for all n>N(k,\epsilon) and all y^{n}\in E_{1}\cap E_{2},

\sum_{i=1}^{k}\mathbb{E}[(U_{i}-\mathbb{E}(U_{i}|Y^{n}=y^{n}))^{2}|Y^{n}=y^{n}% ]\geq\Big{(}1-\frac{C_{1}(k,\epsilon)}{n^{2/13}}\Big{)}\Big{(}\operatorname{tr% }(\Phi^{-1})+{\textbf{{1}}}^{T}\Phi^{-1}{\textbf{{1}}}\Big{)}-\frac{C_{3}(k,% \epsilon)}{n^{14/13}}, (75)

where C_{3}(k,\epsilon):=2(C_{2}(k,\epsilon))^{2}+1.

Proof.

As in Appendix F, given y^{n}\in{\mathscr{Y}}^{n}, define \bar{{\textbf{{U}}}}(y^{n})=(\bar{U}_{1}(y^{n}),\dots,\bar{U}_{k-1}(y^{n})) to be a (k-1)-dimensional Gaussian random vector with density function f_{\bar{{\textbf{{U}}}}(y^{n})}(\cdot)=f_{1}(\cdot,y^{n}), mean vector \Phi^{-1}{\textbf{{w}}} and covariance matrix \Phi^{-1}. By Lemma V.10, for every y^{n}\in E_{1}\cap E_{2} we have

\|\mathbb{E}({\textbf{{U}}}|Y^{n}=y^{n})-\mathbb{E}\big{(}\bar{{\textbf{{U}}}}% (y^{n})\big{)}\|_{1}=\|\mathbb{E}({\textbf{{U}}}|Y^{n}=y^{n})-\Phi^{-1}{% \textbf{{w}}}\|_{1}<\frac{C_{2}(k,\epsilon)}{n^{7/13}}.

As usual, let \bar{U}_{k}(y^{n}):=-\sum_{i=1}^{k-1}\bar{U}_{i}(y^{n}). For every y^{n}\in E_{1}\cap E_{2},

\displaystyle\sum_{i=1}^{k} \displaystyle(\mathbb{E}(U_{i}|Y^{n}=y^{n})-\mathbb{E}\bar{U}_{i}(y^{n}))^{2}
\displaystyle=\sum_{i=1}^{k-1}(\mathbb{E}(U_{i}|Y^{n}=y^{n})-\mathbb{E}\bar{U}% _{i}(y^{n}))^{2}+\Big{(}\sum_{i=1}^{k-1}\mathbb{E}(U_{i}|Y^{n}=y^{n})-\sum_{i=% 1}^{k-1}\mathbb{E}\bar{U}_{i}(y^{n})\Big{)}^{2}
\displaystyle\leq 2\|\mathbb{E}({\textbf{{U}}}|Y^{n}=y^{n})-\mathbb{E}\bar{{% \textbf{{U}}}}(y^{n})\|_{1}^{2}<\frac{2(C_{2}(k,\epsilon))^{2}}{n^{14/13}}
\displaystyle=\frac{C_{3}(k,\epsilon)-1}{n^{14/13}}

(on the second line we use the definition of \bar{U}_{k}). As a result, for every y^{n}\in E_{1}\cap E_{2},

\displaystyle\sum_{i=1}^{k} \displaystyle\mathbb{E}[(U_{i}-\mathbb{E}(U_{i}|Y^{n}=y^{n}))^{2}|Y^{n}=y^{n}]
\displaystyle\overset{(a)}{=}\sum_{i=1}^{k}\mathbb{E}[(U_{i}-\mathbb{E}(\bar{U% }_{i}(y^{n})))^{2}|Y^{n}=y^{n}]-\sum_{i=1}^{k}(\mathbb{E}(U_{i}|Y^{n}=y^{n})-% \mathbb{E}(\bar{U}_{i}(y^{n})))^{2}
\displaystyle>\sum_{i=1}^{k}\mathbb{E}[[(U_{i}-\mathbb{E}(\bar{U}_{i}(y^{n}))]% )^{2}|Y^{n}=y^{n}]-\frac{C_{3}(k,\epsilon)-1}{n^{14/13}}
\displaystyle=\sum_{i=1}^{k}\int_{B_{1}}(u_{i}-\mathbb{E}(\bar{U}_{i}(y^{n})))% ^{2}f_{{\textbf{{U}}}|Y^{n}}({\textbf{{u}}}|Y^{n}=y^{n})d{\textbf{{u}}}-\frac{% C_{3}(k,\epsilon)-1}{n^{14/13}}
\displaystyle\overset{(b)}{\geq}\Big{(}1-\frac{C_{1}(k,\epsilon)}{n^{2/13}}% \Big{)}\sum_{i=1}^{k}\int_{B_{1}}(u_{i}-\mathbb{E}(\bar{U}_{i}(y^{n})))^{2}f_{% 1}({\textbf{{u}}},y^{n})d{\textbf{{u}}}-\frac{C_{3}(k,\epsilon)-1}{n^{14/13}}
\displaystyle\geq\Big{(}1-\frac{C_{1}(k,\epsilon)}{n^{2/13}}\Big{)}\sum_{i=1}^% {k}\int_{\mathbb{R}^{k-1}}(u_{i}-\mathbb{E}(\bar{U}_{i}(y^{n})))^{2}f_{1}({% \textbf{{u}}},y^{n})d{\textbf{{u}}}-\frac{C_{3}(k,\epsilon)-1}{n^{14/13}}
\displaystyle-\sum_{i=1}^{k}\int_{(B_{1})^{c}}(u_{i}-\mathbb{E}(\bar{U}_{i}(y^% {n})))^{2}f_{1}({\textbf{{u}}},y^{n})d{\textbf{{u}}}
\displaystyle\overset{(c)}{=}\Big{(}1-\frac{C_{1}(k,\epsilon)}{n^{2/13}}\Big{)% }\sum_{i=1}^{k}\operatorname{Var}(\bar{U}_{i}(y^{n}))-\frac{C_{3}(k,\epsilon)-% 1}{n^{14/13}}
\displaystyle                                            -\int_{(B_{1})^{c}}% \sum_{i=1}^{k}(u_{i}-\mathbb{E}(\bar{U}_{i}(y^{n})))^{2}f_{1}({\textbf{{u}}},y% ^{n})d{\textbf{{u}}}, (76)

where (a) follows by a straightforward rewriting; in (b) we use (70); (c) follows from the fact that the probability density function of \bar{{\textbf{{U}}}}(y^{n}) is f_{1}(\cdot,y^{n}). Further, since \bar{U}_{k}(y^{n})=-\sum_{i=1}^{k-1}\bar{U}_{i}(y^{n}), we have

\displaystyle\sum_{i=1}^{k}\operatorname{Var}(\bar{U}_{i}(y^{n})) \displaystyle=\sum_{i=1}^{k-1}\operatorname{Var}(\bar{U}_{i}(y^{n}))+% \operatorname{Var}\Big{(}\sum_{i=1}^{k-1}\bar{U}_{i}(y^{n})\Big{)}
\displaystyle=\operatorname{tr}(\Phi^{-1})+{\textbf{{1}}}^{T}\Phi^{-1}{\textbf% {{1}}}. (77)

Finally, the integral in (76) is estimated in (104) in Appendix G. Using this estimate together with (77) in (76), we arrive at the claimed inequality (75). ∎
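
The identity (77) is a purely linear-algebraic fact: appending the coordinate \bar{U}_{k}=-\sum_{i=1}^{k-1}\bar{U}_{i} to a Gaussian vector with covariance \Phi^{-1} contributes the additional variance {\textbf{{1}}}^{T}\Phi^{-1}{\textbf{{1}}}. A minimal check with an arbitrary positive definite \Phi:

```python
import numpy as np

rng = np.random.default_rng(3)
s = 4                                  # plays the role of k-1
A = rng.normal(size=(s, s))
Phi = A @ A.T + s * np.eye(s)          # arbitrary positive definite matrix
Sigma = np.linalg.inv(Phi)             # covariance of (U_1, ..., U_{k-1})

# Extend with U_k = -(U_1 + ... + U_{k-1}): covariance becomes M Sigma M^T.
M = np.vstack([np.eye(s), -np.ones((1, s))])
Sigma_ext = M @ Sigma @ M.T

lhs = np.trace(Sigma_ext)                              # sum of all k variances
rhs = np.trace(Sigma) + np.ones(s) @ Sigma @ np.ones(s)
print(np.isclose(lhs, rhs))                            # True: identity (77)
```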

Now let us take the final step toward our goal of proving (57). Combining (75) with (45) and (59), we bound below the optimal Bayes estimation loss as follows:

\displaystyle\sum_{i=1}^{k}\mathbb{E}\Big{(}U_{i} \displaystyle-\mathbb{E}(U_{i}|Y^{n})\Big{)}^{2}
\displaystyle\geq \displaystyle P(Y^{n}\in E_{1}\cap E_{2})\Big{(}\Big{(}1-\frac{C_{1}(k,% \epsilon)}{n^{2/13}}\Big{)}(\operatorname{tr}(\Phi^{-1})+{\textbf{{1}}}^{T}% \Phi^{-1}{\textbf{{1}}})-\frac{C_{3}(k,\epsilon)}{n^{14/13}}\Big{)}
\displaystyle\geq \displaystyle\Big{(}1-\frac{1+2^{k+1}+3k/\delta_{0}}{n^{1/13}}\Big{)}\Big{(}% \Big{(}1-\frac{C_{1}(k,\epsilon)}{n^{2/13}}\Big{)}(\operatorname{tr}(\Phi^{-1}% )+{\textbf{{1}}}^{T}\Phi^{-1}{\textbf{{1}}})-\frac{C_{3}(k,\epsilon)}{n^{14/13% }}\Big{)}
\displaystyle\geq \displaystyle\Big{(}1-\frac{1+2^{k+1}+3k/\delta_{0}}{n^{1/13}}-\frac{C_{1}(k,% \epsilon)}{n^{2/13}}\Big{)}(\operatorname{tr}(\Phi^{-1})+{\textbf{{1}}}^{T}% \Phi^{-1}{\textbf{{1}}})-\frac{C_{3}(k,\epsilon)}{n^{14/13}}
\displaystyle> \displaystyle\Big{(}1-\frac{C_{4}(k,\epsilon)}{n^{1/13}}\Big{)}(\operatorname{% tr}(\Phi^{-1})+{\textbf{{1}}}^{T}\Phi^{-1}{\textbf{{1}}})-\frac{C_{3}(k,% \epsilon)}{n^{14/13}},

where in the last inequality we define C_{4}(k,\epsilon):=1+2^{k+1}+3k/\delta_{0}+C_{1}(k,\epsilon). This establishes (57) and completes the first step of the proof of Proposition V.4.

V-D2 Proof of (58)

We use the Cauchy-Schwarz inequality to bound \operatorname{tr}(\Phi^{-1})+{\textbf{{1}}}^{T}\Phi^{-1}{\textbf{{1}}}. The result is given in the following proposition, which is proved in Appendix H.

Proposition V.12.
\operatorname{tr}(\Phi^{-1})+{\textbf{{1}}}^{T}\Phi^{-1}{\textbf{{1}}}\geq(k-1)\Big{(}\frac{\sum_{i=1}^{k-1}\Phi_{ii}}{k}-\frac{\sum_{i\neq j}\Phi_{ij}}{k(k-1)}\Big{)}^{-1}. (78)

Now let us further bound the term on right-hand side of (78). Recalling the definition of \Phi in (55), we write it out explicitly and perform a series of straightforward manipulations:

\displaystyle\frac{\sum_{i=1}^{k-1}\Phi_{ii}}{k} \displaystyle-\frac{\sum_{i\neq j}\Phi_{ij}}{k(k-1)}
\displaystyle=\frac{n}{k(k-1)}\sum_{i=1}^{L}\frac{1}{q_{i}}\Big{(}(k-1)\sum_{j% =1}^{k-1}(q_{ij}-q_{ik})^{2}-\Big{(}\sum_{j=1}^{k-1}\Big{(}\sum_{1\leq j^{% \prime}\leq k-1,j^{\prime}\neq j}(q_{ij^{\prime}}-q_{ik})\Big{)}(q_{ij}-q_{ik}% )\Big{)}\Big{)}
\displaystyle=\frac{n}{k(k-1)}\sum_{i=1}^{L}\frac{1}{q_{i}}\Big{(}k\sum_{j=1}^% {k-1}(q_{ij}-q_{ik})^{2}-\Big{(}\sum_{j=1}^{k-1}\Big{(}\sum_{j^{\prime}=1}^{k-% 1}(q_{ij^{\prime}}-q_{ik})\Big{)}(q_{ij}-q_{ik})\Big{)}\Big{)}
\displaystyle=\frac{n}{k(k-1)}\sum_{i=1}^{L}\frac{1}{q_{i}}\Big{(}k\sum_{j=1}^% {k-1}(q_{ij}-q_{ik})^{2}-\Big{(}\sum_{j=1}^{k-1}(q_{ij}-q_{ik})\Big{)}^{2}\Big% {)}
\displaystyle=\frac{n}{k(k-1)}\sum_{i=1}^{L}\frac{1}{q_{i}}\Big{\{}k\sum_{j=1}% ^{k-1}q_{ij}^{2}-2kq_{ik}\sum_{j=1}^{k-1}q_{ij}+k(k-1)q_{ik}^{2}
\displaystyle                      -\Big{(}\Big{(}\sum_{j=1}^{k-1}q_{ij}\Big{)% }^{2}-2(k-1)q_{ik}\sum_{j=1}^{k-1}q_{ij}+(k-1)^{2}q_{ik}^{2}\Big{)}\Big{\}}
\displaystyle=\frac{n}{k(k-1)}\sum_{i=1}^{L}\frac{1}{q_{i}}\Big{(}k\sum_{j=1}^% {k-1}q_{ij}^{2}-2q_{ik}\sum_{j=1}^{k-1}q_{ij}+(k-1)q_{ik}^{2}-\Big{(}\sum_{j=1% }^{k-1}q_{ij}\Big{)}^{2}\Big{)}
\displaystyle=\frac{n}{k(k-1)}\sum_{i=1}^{L}\frac{1}{q_{i}}\Big{(}k\sum_{j=1}^% {k}q_{ij}^{2}-\Big{(}\sum_{j=1}^{k}q_{ij}\Big{)}^{2}\Big{)}
\displaystyle=\frac{n}{k(k-1)}\sum_{i=1}^{L}q_{i}\Big{(}k\sum_{j=1}^{k}\Big{(}% \frac{q_{ij}}{q_{i}}\Big{)}^{2}-k^{2}\Big{)} (79)
\displaystyle\leq\frac{n}{k(k-1)}k^{2}(e^{\epsilon}-1)^{2}\frac{d^{\ast}(k-d^{% \ast})}{(d^{\ast}e^{\epsilon}+k-d^{\ast})^{2}}\sum_{i=1}^{L}q_{i} (80)
\displaystyle=\frac{n}{k-1}k(e^{\epsilon}-1)^{2}\frac{d^{\ast}(k-d^{\ast})}{(d% ^{\ast}e^{\epsilon}+k-d^{\ast})^{2}} (81)

where (79) follows by definition of q_{i} in Table I, and the inequality used to arrive at (80) is proved in (114), Appendix I.

Substituting (81) into (78), we obtain inequality (58).

Now we are ready to prove (19) using (57) and (58). For this, we set

C(k,\epsilon):=C_{4}(k,\epsilon)M(k,\epsilon)+C_{3}(k,\epsilon).

We obtain

r_{k,n}^{\ell_{2}^{2}}({\textbf{{Q}}})\geq\sum_{i=1}^{k}\mathbb{E}(U_{i}-\mathbb{E}(U_{i}|Y^{n}))^{2}\geq\frac{1}{n}M(k,\epsilon)-\frac{C(k,\epsilon)}{n^{14/13}}.

This completes the proof of (19) for the case \delta\geq\delta_{0}.

V-E Case 2: \delta<\delta_{0}

This case is much easier to handle: We can rely on a straightforward application of Le Cam’s method [15] (see also [19, Lemma 1]).

Our goal is again to prove (19). We use standard distance functions on distributions defined on finite sets {\mathscr{Y}}. The KL divergence between two such distributions {\textbf{{m}}}_{1} and {\textbf{{m}}}_{2} is defined as

D_{\operatorname{kl}}({\textbf{{m}}}_{1}\|{\textbf{{m}}}_{2}):=\sum_{y\in{% \mathscr{Y}}}{\textbf{{m}}}_{1}(y)\log\frac{{\textbf{{m}}}_{1}(y)}{{\textbf{{m% }}}_{2}(y)}.

The total variation distance between {\textbf{{m}}}_{1} and {\textbf{{m}}}_{2} is defined as

\|{\textbf{{m}}}_{1}-{\textbf{{m}}}_{2}\|_{\operatorname{TV}}:=\max_{A% \subseteq{\mathscr{Y}}}|{\textbf{{m}}}_{1}(A)-{\textbf{{m}}}_{2}(A)|=\frac{1}{% 2}\sum_{y\in{\mathscr{Y}}}|{\textbf{{m}}}_{1}(y)-{\textbf{{m}}}_{2}(y)|. (82)
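
For concreteness, both distance measures (and Pinsker's inequality \|{\textbf{{m}}}_{1}-{\textbf{{m}}}_{2}\|_{\operatorname{TV}}^{2}\leq\frac{1}{2}D_{\operatorname{kl}}({\textbf{{m}}}_{1}\|{\textbf{{m}}}_{2}), which is invoked below) can be evaluated directly for pmfs on a finite alphabet; the following sketch uses arbitrary example pmfs.

```python
import numpy as np

def kl_divergence(m1, m2):
    """D_kl(m1 || m2) for pmfs on a finite set (natural logarithm)."""
    m1, m2 = np.asarray(m1, float), np.asarray(m2, float)
    mask = m1 > 0
    return np.sum(m1[mask] * np.log(m1[mask] / m2[mask]))

def tv_distance(m1, m2):
    """Total variation distance, eq. (82)."""
    return 0.5 * np.sum(np.abs(np.asarray(m1, float) - np.asarray(m2, float)))

# Example pmfs on a 4-letter alphabet (arbitrary values).
m1 = np.array([0.40, 0.30, 0.20, 0.10])
m2 = np.array([0.25, 0.25, 0.25, 0.25])

tv, kl = tv_distance(m1, m2), kl_divergence(m1, m2)
print(tv, kl, tv**2 <= 0.5 * kl)        # Pinsker's inequality holds here
```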

According to (51), there is \tilde{{\textbf{{u}}}}\in\mathbb{R}^{k-1} such that333we again use the convention that \tilde{u}_{k}=-\sum_{i=1}^{k-1}\tilde{u}_{i}.

\sum_{i=1}^{k}\tilde{u}_{i}^{2}=1\text{~{}and~{}}\Big{(}\sum_{i=1}^{L}\frac{1}% {q_{i}}\Big{(}\sum_{j=1}^{k}\tilde{u}_{j}q_{ij}\Big{)}^{2}\Big{)}^{1/2}=\delta. (83)

With this \tilde{{\textbf{{u}}}} in mind, let

{\textbf{{p}}}_{1}={\textbf{{p}}}_{U}\text{~{}and~{}}{\textbf{{p}}}_{2}={% \textbf{{p}}}_{U}+\tilde{{\textbf{{u}}}}/\sqrt{n\delta^{2}}

where as before, {\textbf{{p}}}_{U}=(1/k,1/k,\dots,1/k) denotes the uniform pmf.

Note that some of the coordinates of \tilde{{\textbf{{u}}}} can be negative, but we can ensure that {\textbf{{p}}}_{2}\in\Delta_{k} for all n\geq N(k,\epsilon) as long as N(k,\epsilon) is sufficiently large. Consequently,

r_{k,n}^{\ell_{2}^{2}}({\textbf{{Q}}})=\inf_{\hat{{\textbf{{p}}}}}\sup_{{% \textbf{{p}}}\in\Delta_{k}}\underset{Y^{n}\sim({\textbf{{p}}}{\textbf{{Q}}})^{% n}}{\mathbb{E}}\ell_{2}^{2}(\hat{{\textbf{{p}}}}(Y^{n}),{\textbf{{p}}})\geq% \frac{1}{2}\inf_{\hat{{\textbf{{p}}}}}\Big{(}\underset{Y^{n}\sim{\textbf{{m}}}% _{1}^{n}}{\mathbb{E}}\ell_{2}^{2}(\hat{{\textbf{{p}}}}(Y^{n}),{\textbf{{p}}}_{% 1})+\underset{Y^{n}\sim{\textbf{{m}}}_{2}^{n}}{\mathbb{E}}\ell_{2}^{2}(\hat{{% \textbf{{p}}}}(Y^{n}),{\textbf{{p}}}_{2})\Big{)},

where {\textbf{{m}}}_{1}={\textbf{{p}}}_{1}{\textbf{{Q}}} and {\textbf{{m}}}_{2}={\textbf{{p}}}_{2}{\textbf{{Q}}}.

For a given estimator \hat{{\textbf{{p}}}}:{\mathscr{Y}}^{n}\to\mathbb{R}^{k}, define the set

{\mathscr{K}}_{1}:=\{y^{n}\in{\mathscr{Y}}^{n}:\ell_{2}(\hat{{\textbf{{p}}}}(y% ^{n}),{\textbf{{p}}}_{1})\geq\ell_{2}(\hat{{\textbf{{p}}}}(y^{n}),{\textbf{{p}% }}_{2})\}

By the triangle inequality, we have

\displaystyle\ell_{2}(\hat{{\textbf{{p}}}}(y^{n}),{\textbf{{p}}}_{1})\geq\frac% {1}{2}\ell_{2}({\textbf{{p}}}_{1},{\textbf{{p}}}_{2})=\frac{1}{2\sqrt{n\delta^% {2}}}\text{~{}for all~{}}y^{n}\in{\mathscr{K}}_{1},
\displaystyle\ell_{2}(\hat{{\textbf{{p}}}}(y^{n}),{\textbf{{p}}}_{2})\geq\frac% {1}{2}\ell_{2}({\textbf{{p}}}_{1},{\textbf{{p}}}_{2})=\frac{1}{2\sqrt{n\delta^% {2}}}\text{~{}for all~{}}y^{n}\in{\mathscr{K}}_{1}^{c}.

Therefore, for any estimator \hat{{\textbf{{p}}}},

\displaystyle E \displaystyle:=\underset{Y^{n}\sim{\textbf{{m}}}_{1}^{n}}{\mathbb{E}}\ell_{2}^% {2}(\hat{{\textbf{{p}}}}(Y^{n}),{\textbf{{p}}}_{1})+\underset{Y^{n}\sim{% \textbf{{m}}}_{2}^{n}}{\mathbb{E}}\ell_{2}^{2}(\hat{{\textbf{{p}}}}(Y^{n}),{% \textbf{{p}}}_{2})
\displaystyle\geq\frac{1}{4n\delta^{2}}\Big{(}{\textbf{{m}}}_{1}^{n}(Y^{n}\in{% \mathscr{K}}_{1})+{\textbf{{m}}}_{2}^{n}(Y^{n}\in{\mathscr{K}}_{1}^{c})\Big{)}
\displaystyle=\frac{1}{4n\delta^{2}}\Big{(}1-{\textbf{{m}}}_{2}^{n}(Y^{n}\in{% \mathscr{K}}_{1})+{\textbf{{m}}}_{1}^{n}(Y^{n}\in{\mathscr{K}}_{1})\Big{)}
\displaystyle\geq\frac{1}{4n\delta^{2}}\Big{(}1-\sup_{{\mathscr{K}}\subseteq{% \mathscr{Y}}^{n}}\Big{(}{\textbf{{m}}}_{2}^{n}(Y^{n}\in{\mathscr{K}})-{\textbf% {{m}}}_{1}^{n}(Y^{n}\in{\mathscr{K}})\Big{)}\Big{)}
\displaystyle=\frac{1}{4n\delta^{2}}\Big{(}1-\|{\textbf{{m}}}_{2}^{n}-{\textbf% {{m}}}_{1}^{n}\|_{\operatorname{TV}}\Big{)},

where the last step follows by definition (82). Using Pinsker’s inequality444Pinsker’s inequality asserts that \|P_{1}-P_{2}\|_{\operatorname{TV}}^{2}\leq\frac{1}{2}D_{\operatorname{kl}}(P_% {1}\|P_{2}) for any two probability measures P_{1},P_{2}., we obtain

\displaystyle E \displaystyle\geq\frac{1}{4n\delta^{2}}\Big{(}1-\sqrt{\frac{1}{2}D_{% \operatorname{kl}}({\textbf{{m}}}_{2}^{n}\|{\textbf{{m}}}_{1}^{n})}\Big{)}=% \frac{1}{4n\delta^{2}}\Big{(}1-\sqrt{\frac{n}{2}D_{\operatorname{kl}}({\textbf% {{m}}}_{2}\|{\textbf{{m}}}_{1})}\Big{)}.

Let us write out this expression explicitly, recalling the fact that the original output alphabet is {\mathscr{Y}}=\{1,2,\dots,L^{\prime}\} and using in succession (33), (35), (34):

\displaystyle E \displaystyle\geq\frac{1}{4n\delta^{2}}\Biggl{(}1-\sqrt{\frac{n}{2}\sum_{i=1}^% {L^{\prime}}\Biggl{\{}\Big{(}\sum_{j=1}^{k}{\textbf{{Q}}}(i|j)\Big{(}\frac{1}{% k}+\frac{\tilde{u}_{j}}{\sqrt{n\delta^{2}}}\Big{)}\Big{)}\log\frac{\sum_{j=1}^% {k}{\textbf{{Q}}}(i|j)\Big{(}\frac{1}{k}+\frac{\tilde{u}_{j}}{\sqrt{n\delta^{2% }}}\Big{)}}{\sum_{j=1}^{k}{\textbf{{Q}}}(i|j)\frac{1}{k}}\Biggr{\}}}\Biggr{)}
\displaystyle=\frac{1}{4n\delta^{2}}\Biggl{(}1-\sqrt{\frac{n}{2}\sum_{i=1}^{L^% {\prime}}\Biggl{\{}\Big{(}q_{i}^{\prime}+\sum_{j=1}^{k}{\textbf{{Q}}}(i|j)% \frac{\tilde{u}_{j}}{\sqrt{n\delta^{2}}}\Big{)}\log\frac{q_{i}^{\prime}+\sum_{% j=1}^{k}{\textbf{{Q}}}(i|j)\frac{\tilde{u}_{j}}{\sqrt{n\delta^{2}}}}{q_{i}^{% \prime}}\Biggr{\}}}\Biggr{)}
\displaystyle=\frac{1}{4n\delta^{2}}\Biggl{(}1-\sqrt{\frac{n}{2}\sum_{i=1}^{L^% {\prime}}\Biggl{\{}q_{i}^{\prime}\Big{(}1+\sum_{j=1}^{k}\frac{{\textbf{{Q}}}(i% |j)}{q_{i}^{\prime}}\frac{\tilde{u}_{j}}{\sqrt{n\delta^{2}}}\Big{)}\log\Big{(}% 1+\sum_{j=1}^{k}\frac{{\textbf{{Q}}}(i|j)}{q_{i}^{\prime}}\frac{\tilde{u}_{j}}% {\sqrt{n\delta^{2}}}\Big{)}\Biggr{\}}}\Biggr{)}
\displaystyle=\frac{1}{4n\delta^{2}}\Biggl{(}1-\sqrt{\frac{n}{2}\sum_{i=1}^{L}% \Biggl{\{}\Big{(}\sum_{a\in A_{i}}q_{a}^{\prime}\Big{)}\Big{(}1+\sum_{j=1}^{k}% \frac{q_{ij}}{q_{i}}\frac{\tilde{u}_{j}}{\sqrt{n\delta^{2}}}\Big{)}\log\Big{(}% 1+\sum_{j=1}^{k}\frac{q_{ij}}{q_{i}}\frac{\tilde{u}_{j}}{\sqrt{n\delta^{2}}}% \Big{)}\Biggr{\}}}\Biggr{)}
\displaystyle=\frac{1}{4n\delta^{2}}\Biggl{(}1-\sqrt{\frac{n}{2}\sum_{i=1}^{L}% \Biggl{\{}\Big{(}q_{i}+\sum_{j=1}^{k}q_{ij}\frac{\tilde{u}_{j}}{\sqrt{n\delta^% {2}}}\Big{)}\log\Big{(}1+\sum_{j=1}^{k}\frac{q_{ij}}{q_{i}}\frac{\tilde{u}_{j}% }{\sqrt{n\delta^{2}}}\Big{)}\Biggr{\}}}\Biggr{)}.

Bounding the logarithm above using \log(1+x)\leq x, we further obtain

\displaystyle E \displaystyle\geq\frac{1}{4n\delta^{2}}\Biggl{(}1-\sqrt{\frac{n}{2}\sum_{i=1}^% {L}\Biggl{\{}\Big{(}q_{i}+\sum_{j=1}^{k}q_{ij}\frac{\tilde{u}_{j}}{\sqrt{n% \delta^{2}}}\Big{)}\Big{(}\sum_{j=1}^{k}\frac{q_{ij}}{q_{i}}\frac{\tilde{u}_{j% }}{\sqrt{n\delta^{2}}}\Big{)}\Biggr{\}}}\Biggr{)}
\displaystyle=\frac{1}{4n\delta^{2}}\Biggl{(}1-\sqrt{\frac{n}{2}\Big{(}\sum_{i% =1}^{L}\sum_{j=1}^{k}q_{ij}\frac{\tilde{u}_{j}}{\sqrt{n\delta^{2}}}+\sum_{i=1}% ^{L}\frac{1}{q_{i}}\Big{(}\sum_{j=1}^{k}q_{ij}\frac{\tilde{u}_{j}}{\sqrt{n% \delta^{2}}}\Big{)}^{2}\Big{)}}\Biggr{)}
\displaystyle=\frac{1}{4n\delta^{2}}\Biggl{(}1-\sqrt{\frac{1}{2\delta^{2}}\sum% _{i=1}^{L}\frac{1}{q_{i}}\Big{(}\sum_{j=1}^{k}q_{ij}\tilde{u}_{j}\Big{)}^{2}}% \Biggr{)} (84)
\displaystyle=\frac{1}{4n\delta^{2}}\Big{(}1-\sqrt{\frac{1}{2}}\Big{)}>\frac{1% }{16n\delta^{2}}>\frac{1}{16n\delta_{0}^{2}} (85)
\displaystyle=\frac{2}{n}M(k,\epsilon) (86)
\displaystyle{\geq\frac{1}{n}M(k,\epsilon)-\frac{C(k,\epsilon)}{n^{14/13}}},

where (84) follows from (17), namely from the equality \sum_{i=1}^{L}\sum_{j=1}^{k}q_{ij}\tilde{u}_{j}=0; (85) follows from the definition of \tilde{{\textbf{{u}}}} in (83); (86) follows from (53). (In the last step we make a somewhat arbitrary transition to match the inequality (57) established for the case \delta\geq\delta_{0}.)

This completes the proof of (19) for the case \delta<\delta_{0}.

VI Outlook: Open questions

VI-A Uniqueness of the optimal privatization scheme

We believe that our privatization scheme {\textbf{{Q}}}_{k,\epsilon,d^{\ast}} is essentially the unique optimal choice. More precisely, recall that we reduced the output alphabet from the original set to the set of equivalence classes; see Sec. V-A. We conjecture that, under the assumption that {\textbf{{Q}}}\in{\mathscr{D}}_{\epsilon,E} and after the alphabet reduction, any scheme Q that is different from {\textbf{{Q}}}_{k,\epsilon,d^{\ast}}, will entail a strictly larger estimation loss value. In other words, for n large enough

r_{k,n}^{\ell_{2}^{2}}({\textbf{{Q}}})\geq r_{k,n}^{\ell_{2}^{2}}({\textbf{{Q}% }}_{k,\epsilon,d^{\ast}},\hat{{\textbf{{p}}}}),

where \hat{{\textbf{{p}}}} is given in (8). Formally, we can phrase this conjecture as follows:

Conjecture 1.

For {\textbf{{Q}}}\in{\mathscr{D}}_{\epsilon,E},

\lim_{n\to\infty}nr_{k,n}^{\ell_{2}^{2}}({\textbf{{Q}}})=M(k,\epsilon) (87)

if and only if {\textbf{{Q}}}={\textbf{{Q}}}_{k,\epsilon,d^{\ast}} (after accounting for the alphabet reduction).

Let us list some necessary conditions for this to hold.

(i) From the proof in Section V we know that

\liminf_{n\to\infty}nr_{k,n}^{\ell_{2}^{2}}({\textbf{{Q}}})\geq n(% \operatorname{tr}(\Phi^{-1})+{\textbf{{1}}}^{T}\Phi^{-1}{\textbf{{1}}}). (88)

Therefore, a necessary condition for (87) to hold is

\operatorname{tr}(\Phi^{-1})+{\textbf{{1}}}^{T}\Phi^{-1}{\textbf{{1}}}=\frac{1% }{n}M(k,\epsilon). (89)

According to (80) and (114), equality in (89) implies that any (asymptotically) optimal privatization scheme satisfies the following condition: each vector (Q(i|j),j=1,\dots,k) is proportional to one of the vectors in the set \{(v_{1},v_{2},\dots,v_{k})\in\{1,e^{\epsilon}\}^{k}:v_{1}+\dots+v_{k}=d^{\ast}e^{\epsilon}+k-d^{\ast}\}, i.e., after normalization it contains exactly d^{\ast} entries equal to e^{\epsilon} and k-d^{\ast} entries equal to 1.

(ii) The converse part of Prop. H.1 (specifically, the claim in (109)) gives another set of necessary conditions for (89) to hold.

(iii) Note that (88) is obtained by choosing p from a neighborhood of the uniform distribution {\textbf{{p}}}_{U}. A similar bound can be obtained by choosing p in the neighborhood of any point in the probability simplex \Delta_{k}. Formally speaking, the observable random variables in our problem are Y^{n}, and the unknown parameters are (p_{1},p_{2},\dots,p_{k-1}). Denote by I(p_{1},p_{2},\dots,p_{k-1}) the Fisher information matrix of this parametric family. As shown in Appendix J,

\Phi(n,{\textbf{{Q}}})=I(1/k,1/k,\dots,1/k).

Similarly to (88), one can show that

\liminf_{n\to\infty}nr_{k,n}^{\ell_{2}^{2}}({\textbf{{Q}}})\geq n\operatorname% {tr}((I(p_{1},p_{2},\dots,p_{k-1}))^{-1}+{\textbf{{1}}}^{T}(I(p_{1},p_{2},% \dots,p_{k-1}))^{-1}{\textbf{{1}}})

for all (p_{1},p_{2},\dots,p_{k-1},1-p_{1}-\dots-p_{k-1})\in\Delta_{k}. Therefore another set of necessary conditions for (87) to hold is

\operatorname{tr}((I(p_{1},p_{2},\dots,p_{k-1}))^{-1})+{\textbf{{1}}}^{T}(I(p_% {1},p_{2},\dots,p_{k-1}))^{-1}{\textbf{{1}}}\leq\frac{1}{n}M(k,\epsilon)

for all (p_{1},p_{2},\dots,p_{k-1},1-p_{1}-\dots-p_{k-1})\in\Delta_{k}.

We believe that the conditions listed above imply the uniqueness claim of the privatization mechanism {\textbf{{Q}}}_{k,\epsilon,d^{\ast}}.

We conclude by suggesting an even stronger conjecture which we also believe to be true.

Conjecture 2.

For all Q with finite output alphabet,

\lim_{n\to\infty}nr_{k,n}^{\ell_{2}^{2}}({\textbf{{Q}}})=M(k,\epsilon)

if and only if {\textbf{{Q}}}={\textbf{{Q}}}_{k,\epsilon,d^{\ast}} (after accounting for the alphabet reduction).

VI-B Asymptotically tight lower bound for the \ell_{1} loss

Another open question is to find an asymptotically optimal privatization mechanism/estimation procedure for the \ell_{1} loss. We similarly believe that our privatization scheme {\textbf{{Q}}}_{k,\epsilon,d^{\ast}} and the empirical estimator \hat{{\textbf{{p}}}} given by (8) are asymptotically optimal in this case as well. More precisely, in [13], we have shown that

r_{k,n}^{\ell_{1}}({\textbf{{Q}}}_{k,\epsilon,d},\hat{{\textbf{{p}}}})=\frac{k-1}{e^{\epsilon}-1}\sqrt{\frac{2}{\pi n}}\sqrt{\frac{(de^{\epsilon}+k-d)^{2}}{d(k-d)}}+o\Big{(}\frac{1}{\sqrt{n}}\Big{)}.

It is clear that

r_{k,n}^{\ell_{1}}({\textbf{{Q}}}_{k,\epsilon,d^{\ast}},\hat{{\textbf{{p}}}})=% \min_{1\leq d\leq k-1}r_{k,n}^{\ell_{1}}({\textbf{{Q}}}_{k,\epsilon,d},\hat{{% \textbf{{p}}}}).

Our conjecture is

\lim_{n\to\infty}\sqrt{n}r_{\epsilon,k,n}^{\ell_{1}}=\frac{k-1}{e^{\epsilon}-1}\sqrt{\frac{2}{\pi}}\sqrt{\frac{(d^{\ast}e^{\epsilon}+k-d^{\ast})^{2}}{d^{\ast}(k-d^{\ast})}}. (90)
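
Assuming the asymptotic formula for r_{k,n}^{\ell_{1}}({\textbf{{Q}}}_{k,\epsilon,d},\hat{{\textbf{{p}}}}) quoted above, the leading constant in (90) can be evaluated for each d and minimized numerically; the reported minimizer is a numerical stand-in for d^{\ast} under that formula, not a new definition of it.

```python
import numpy as np

def l1_leading_constant(k, eps, d):
    """Leading constant of sqrt(n) * r^{ell_1}(Q_{k,eps,d}, p_hat) quoted above."""
    return (k - 1) / (np.exp(eps) - 1) * np.sqrt(2 / np.pi) * np.sqrt(
        (d * np.exp(eps) + k - d) ** 2 / (d * (k - d)))

k, eps = 10, 1.0
constants = {d: l1_leading_constant(k, eps, d) for d in range(1, k)}
d_min = min(constants, key=constants.get)     # numerical stand-in for d*
print(d_min, constants[d_min])
```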

Within the frame of our approach, the obstacle in the way of proving this conjecture can be described as follows. Given a positive definite matrix M, let S(M):=\sqrt{\sum_{i}M_{ii}}. Similarly to (88), one can show that

\liminf_{n\to\infty}\sqrt{n}r_{k,n}^{\ell_{1}}({\textbf{{Q}}})\geq\sqrt{n}(S(% \Phi^{-1})+({\textbf{{1}}}^{T}\Phi^{-1}{\textbf{{1}}})^{1/2}).

To prove (90), we need a lower bound on the quantity S(\Phi^{-1})+({\textbf{{1}}}^{T}\Phi^{-1}{\textbf{{1}}})^{1/2}, which is similar to (78). So far we have not been able to find a useful bound of this kind.

acknowledgement

We are grateful to our colleague Itzhak Tamo for his help with the proof of Prop. H.1.

Appendices

Appendix A A calculus inequality

Proposition A.1.

For all x\geq-\frac{2}{3}, we have the following inequality:

\Big{|}\log(1+x)-(x-\frac{x^{2}}{2})\Big{|}\leq\Big{|}x^{3}\Big{|}
Proof.

Let

h_{1}(x)=\log(1+x)-(x-\frac{x^{2}}{2}),\text{~{}and~{}}h_{2}(x)=\Big{|}\log(1+% x)-(x-\frac{x^{2}}{2})\Big{|}-\Big{|}x^{3}\Big{|}.

Then

h^{\prime}_{1}(x)=\frac{x^{2}}{x+1}\geq 0\text{~{}for all~{}}x>-1.

Since h_{1}(0)=0, we have h_{1}(x)\geq 0 for all x>0, and h_{1}(x)\leq 0 for all -1<x<0. Consequently,

h_{2}(x)=h_{1}(x)-x^{3}\text{~{}for~{}}x>0,\text{~{}and~{}}h_{2}(x)=x^{3}-h_{1% }(x)\text{~{}for~{}}-1<x<0.

As a result,

h^{\prime}_{2}(x)=\frac{x^{2}}{x+1}-3x^{2}<0\text{~{}for~{}}x>0,\text{~{}and~{% }}h^{\prime}_{2}(x)=3x^{2}-\frac{x^{2}}{x+1}\geq 0\text{~{}for~{}}-\frac{2}{3}% \leq x<0.

Since h_{2}(0)=0, we conclude that h_{2}(x)\leq 0 for all x\geq-\frac{2}{3}.
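
A quick numerical confirmation of Prop. A.1 on a grid of points x\geq-\frac{2}{3} (a sanity check, not a substitute for the proof):

```python
import numpy as np

x = np.linspace(-2 / 3, 10, 100001)
lhs = np.abs(np.log1p(x) - (x - x**2 / 2))    # |log(1+x) - (x - x^2/2)|
print(np.all(lhs <= np.abs(x)**3 + 1e-12))    # True on the whole grid
```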

Appendix B Proof of Proposition V.5

We need the following simple proposition about the volume of ellipsoids.

Proposition B.1.

Let \Lambda be an s\times s positive definite matrix. Let \{{\textbf{{u}}}\in\mathbb{R}^{s}:{\textbf{{u}}}^{T}\Lambda{\textbf{{u}}}\leq% \alpha_{1}^{2}\} and \{{\textbf{{u}}}\in\mathbb{R}^{s}:{\textbf{{u}}}^{T}\Lambda{\textbf{{u}}}\leq% \alpha_{2}^{2}\} be two ellipsoids. Then the ratio of their volumes equals (\alpha_{1}/\alpha_{2})^{s}.

Proof.

Consider the ellipsoid B_{\Lambda}(\alpha):=\{{\textbf{{u}}}\in\mathbb{R}^{s}:{\textbf{{u}}}^{T}% \Lambda{\textbf{{u}}}\leq\alpha^{2}\}. Let \lambda_{1},\dots,\lambda_{s} be the eigenvalues of \Lambda, taken with multiplicities. Since \Lambda is positive definite, we can diagonalize it as \Lambda=P^{T}DP, where P is an orthogonal matrix and D=\text{diag}(\lambda_{1},\dots,\lambda_{s}). Then

\displaystyle\text{Vol}(B_{\Lambda}(\alpha)) \displaystyle=\int_{{\textbf{{u}}}^{T}\Lambda{\textbf{{u}}}\leq\alpha^{2}}d{% \textbf{{u}}}=\int_{{\textbf{{v}}}^{T}{\textbf{{v}}}\leq 1}|\alpha P^{T}D^{-1/% 2}|d{\textbf{{v}}}=\alpha^{s}|D^{-1/2}|\frac{\pi^{s/2}}{\Gamma(\frac{s}{2}+1)}
\displaystyle=\alpha^{s}\frac{\pi^{s/2}}{\Gamma(\frac{s}{2}+1)}\prod_{i=1}^{s}% \lambda_{i}^{-1/2}, (91)

where the second equality follows by a change of variable {\textbf{{v}}}=\frac{1}{\alpha}D^{1/2}P{\textbf{{u}}}, and the third one uses the expression for the volume of the unit sphere in \mathbb{R}^{s}. The proposition follows immediately from (91). ∎
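
The ratio (\alpha_{1}/\alpha_{2})^{s} can also be checked by Monte Carlo: sample points uniformly from a box containing the larger ellipsoid and compare the hit counts for the two ellipsoids. A small sketch for s=2 with an arbitrary positive definite \Lambda:

```python
import numpy as np

rng = np.random.default_rng(4)
s = 2
A = rng.normal(size=(s, s))
Lam = A @ A.T + np.eye(s)                  # arbitrary positive definite Lambda
a1, a2 = 1.0, 0.6

# Bounding box for the larger ellipsoid {u : u^T Lam u <= a1^2}:
# the extent along coordinate i is a1 * sqrt((Lam^{-1})_{ii}).
half_widths = a1 * np.sqrt(np.diag(np.linalg.inv(Lam)))
pts = rng.uniform(-half_widths, half_widths, size=(10**6, s))
quad = np.einsum('ni,ij,nj->n', pts, Lam, pts)

ratio_mc = np.mean(quad <= a1**2) / np.mean(quad <= a2**2)
print(ratio_mc, (a1 / a2) ** s)            # the two numbers should be close
```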

Now we are ready to prove Prop. V.5.

Proof of Prop. V.5. As mentioned before, conditional on {\textbf{{U}}}={\textbf{{u}}}, the random variable t_{i}(Y^{n}) has binomial distribution B(n,q_{i}+\sum_{j=1}^{k}u_{j}q_{ij}). With this in mind, using (17) several times, we have for all {\textbf{{u}}}\in B_{1} the following upper bound:

\displaystyle\mathbb{E} \displaystyle\Big{[}\sum_{i=1}^{L}\frac{n}{q_{i}}\Big{(}\sum_{j=1}^{k}u_{j}q_{% ij}-\frac{V_{i}}{n}\Big{)}^{2}\Big{|}{\textbf{{U}}}={\textbf{{u}}}\Big{]}
\displaystyle=\mathbb{E}\Big{[}\sum_{i=1}^{L}\frac{n}{q_{i}}\Big{(}\frac{t_{i}% (Y^{n})}{n}-\Big{(}q_{i}+\sum_{j=1}^{k}u_{j}q_{ij}\Big{)}\Big{)}^{2}\Big{|}{% \textbf{{U}}}={\textbf{{u}}}\Big{]}
\displaystyle=\sum_{i=1}^{L}\frac{1}{q_{i}}\Big{(}q_{i}+\sum_{j=1}^{k}u_{j}q_{% ij}\Big{)}\Big{(}1-q_{i}-\sum_{j=1}^{k}u_{j}q_{ij}\Big{)}
\displaystyle\leq\sum_{i=1}^{L}\frac{1}{q_{i}}q_{i}\Big{(}1-q_{i}-\sum_{j=1}^{% k}u_{j}q_{ij}\Big{)}+\sum_{i=1}^{L}\frac{1}{q_{i}}\Big{|}\sum_{j=1}^{k}u_{j}q_% {ij}\Big{|}\Big{(}1-q_{i}-\sum_{j=1}^{k}u_{j}q_{ij}\Big{)}
\displaystyle=L-1+\sum_{i=1}^{L}\Big{|}\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}% \Big{|}\Big{(}1-q_{i}-\sum_{j=1}^{k}u_{j}q_{ij}\Big{)}
\displaystyle\overset{(a)}{\leq}L-1+\frac{k}{n^{5/13}}\sum_{i=1}^{L}\Big{(}1-q% _{i}-\sum_{j=1}^{k}u_{j}q_{ij}\Big{)}
\displaystyle=(L-1)\Big{(}1+\frac{k}{n^{5/13}}\Big{)}<2L\leq 2^{k+1}, (92)

where (a) follows from (41).

For all {\textbf{{u}}}\in B_{2}, we have

\displaystyle P(Y^{n}\in E_{2}|{\textbf{{U}}}={\textbf{{u}}}) \displaystyle\geq P\Big{(}\sum_{i=1}^{L}\frac{n}{q_{i}}\Big{(}\sum_{j=1}^{k}u_% {j}q_{ij}-\frac{V_{i}}{n}\Big{)}^{2}<n^{1/13}\Big{|}{\textbf{{U}}}={\textbf{{u% }}}\Big{)}
\displaystyle=1-P\Big{(}\sum_{i=1}^{L}\frac{n}{q_{i}}\Big{(}\sum_{j=1}^{k}u_{j% }q_{ij}-\frac{V_{i}}{n}\Big{)}^{2}\geq n^{1/13}\Big{|}{\textbf{{U}}}={\textbf{% {u}}}\Big{)}
\displaystyle\geq 1-\frac{2^{k+1}}{n^{1/13}}, (93)

where the last step follows by the Markov inequality and (92). By our assumption, P_{U}(B_{1})=1, so from (20),(26), and Prop. B.1 we obtain

P({\textbf{{U}}}\in B_{2})=\frac{\text{Vol}(B_{2})}{\text{Vol}(B_{1})}=\Big{(}1-\frac{3/\delta_{0}}{n^{1/13}}\Big{)}^{k-1}\geq 1-\frac{3(k-1)/\delta_{0}}{n^{1/13}}, (94)

where the inequality follows from the fact that (1-x)^{k}\geq 1-kx for all 0\leq x<1. This fact is applicable because 3/(\delta_{0}n^{1/13})<1 for all n\geq N(k,\epsilon) as long as we set N(k,\epsilon)>(3/\delta_{0})^{13}. Now using (93) and (94) we obtain

\displaystyle P(Y^{n}\in E_{2}) \displaystyle=\int_{B_{1}}P(Y^{n}\in E_{2}|{\textbf{{U}}}={\textbf{{u}}})f_{{% \textbf{{U}}}}({\textbf{{u}}})d{\textbf{{u}}}\geq\int_{B_{2}}P(Y^{n}\in E_{2}|% {\textbf{{U}}}={\textbf{{u}}})f_{{\textbf{{U}}}}({\textbf{{u}}})d{\textbf{{u}}}
\displaystyle\geq\Big{(}1-\frac{2^{k+1}}{n^{1/13}}\Big{)}\int_{B_{2}}f_{{% \textbf{{U}}}}({\textbf{{u}}})d{\textbf{{u}}}=\Big{(}1-\frac{2^{k+1}}{n^{1/13}% }\Big{)}P({\textbf{{U}}}\in B_{2})
\displaystyle\geq 1-\frac{2^{k+1}+3k/\delta_{0}}{n^{1/13}}.

This completes the proof of Prop. V.5.

Appendix C Proof of Proposition V.7

For every y^{n}\in E_{2}, there exists \tilde{{\textbf{{u}}}}\in B_{2} such that

\sqrt{h_{{\textbf{{v}}}}(\tilde{{\textbf{{u}}}})}=\Big{(}\sum_{i=1}^{L}\frac{n% }{q_{i}}\Big{(}\sum_{j=1}^{k}\tilde{u}_{j}q_{ij}-\frac{v_{i}}{n}\Big{)}^{2}% \Big{)}^{1/2}<n^{1/26}. (95)

By definition of B_{2}, we have \Big{(}\sum_{i=1}^{k}\tilde{u}_{i}^{2}\Big{)}^{1/2}<\frac{1}{n^{5/13}}-\frac{3/\delta_{0}}{n^{6/13}}. By the triangle inequality,

\Big{(}\sum_{i=1}^{k}(u_{i}-\tilde{u}_{i})^{2}\Big{)}^{1/2}\geq\Big{(}\sum_{i=% 1}^{k}u_{i}^{2}\Big{)}^{1/2}-\Big{(}\sum_{i=1}^{k}\tilde{u}_{i}^{2}\Big{)}^{1/% 2}>\alpha-\frac{1}{n^{5/13}}+\frac{3/\delta_{0}}{n^{6/13}}\text{~{}for all~{}}% {\textbf{{u}}}\notin B(\alpha).

By (52),

\Big{(}\sum_{i=1}^{L}\frac{1}{q_{i}}\Big{(}\sum_{j=1}^{k}(u_{j}-\tilde{u}_{j})% q_{ij}\Big{)}^{2}\Big{)}^{1/2}\geq\delta_{0}\Big{(}\sum_{i=1}^{k}(u_{i}-\tilde% {u}_{i})^{2}\Big{)}^{1/2}>\delta_{0}(\alpha-\frac{1}{n^{5/13}})+\frac{3}{n^{6/% 13}}\text{~{}for all~{}}{\textbf{{u}}}\notin B(\alpha).

As a result,

\Big{(}\sum_{i=1}^{L}\frac{n}{q_{i}}\Big{(}\sum_{j=1}^{k}(u_{j}-\tilde{u}_{j})% q_{ij}\Big{)}^{2}\Big{)}^{1/2}>\delta_{0}n^{1/2}(\alpha-n^{-5/13})+3n^{1/26}% \text{~{}for all~{}}{\textbf{{u}}}\notin B(\alpha).

Again by the triangle inequality,

\displaystyle\sqrt{h_{{\textbf{{v}}}}({\textbf{{u}}})}= \displaystyle\Big{(}\sum_{i=1}^{L}\frac{n}{q_{i}}\Big{(}\sum_{j=1}^{k}u_{j}q_{% ij}-\frac{v_{i}}{n}\Big{)}^{2}\Big{)}^{1/2} (96)
\displaystyle\geq \displaystyle\Big{(}\sum_{i=1}^{L}\frac{n}{q_{i}}\Big{(}\sum_{j=1}^{k}(u_{j}-% \tilde{u}_{j})q_{ij}\Big{)}^{2}\Big{)}^{1/2}-\Big{(}\sum_{i=1}^{L}\frac{n}{q_{% i}}\Big{(}\sum_{j=1}^{k}\tilde{u}_{j}q_{ij}-\frac{v_{i}}{n}\Big{)}^{2}\Big{)}^% {1/2}
\displaystyle> \displaystyle\delta_{0}n^{1/2}(\alpha-n^{-5/13})+2n^{1/26}\text{~{}for all~{}}% {\textbf{{u}}}\notin B(\alpha).

Combining (95) and (96), we have

\sqrt{h_{{\textbf{{v}}}}({\textbf{{u}}})}-\sqrt{h_{{\textbf{{v}}}}(\tilde{{% \textbf{{u}}}})}>\delta_{0}n^{1/2}(\alpha-n^{-5/13})+n^{1/26}\text{~{}for all~% {}}{\textbf{{u}}}\notin B(\alpha).

Thus we conclude that for every y^{n}\in E_{2}, there exists \tilde{{\textbf{{u}}}}\in B_{2} such that

(B(\alpha))^{c}\subseteq E_{{\textbf{{v}}}}(\tilde{{\textbf{{u}}}},\delta_{0}n% ^{1/2}(\alpha-n^{-5/13})+n^{1/26}).

This completes the proof of Prop. V.7.

Appendix D Proof of Proposition V.8

We will rely on a concentration result for random Gaussian vectors.

Proposition D.1.

Let {\textbf{{Z}}}\sim{\mathcal{N}}({\textbf{{0}}},I_{s}) be a standard Gaussian random vector. We have

P(\|{\textbf{{Z}}}\|_{2}\geq\alpha)\leq e^{-\frac{(\alpha-\sqrt{s})^{2}}{2}}% \text{~{}for all~{}}\alpha\geq\sqrt{s}. (97)

This is a very special case of the following general concentration inequality due to Sudakov and Tsirel’son [20] and Borell [21] (see also Pisier and Maurey [22, page 176]). Recall that a function f:\mathbb{R}^{s}\to\mathbb{R} is called \rho-Lipschitz if

|f({\textbf{{z}}})-f({\textbf{{z}}}^{\prime})|\leq\rho\|{\textbf{{z}}}-{% \textbf{{z}}}^{\prime}\|_{2}\text{~{}for all~{}}{\textbf{{z}}},{\textbf{{z}}}^% {\prime}\in\mathbb{R}^{s}.
Theorem D.2 ([20, 21]).

For a \rho-Lipschitz function f:\mathbb{R}^{s}\to\mathbb{R} and a standard Gaussian random vector {\textbf{{Z}}}\sim{\mathcal{N}}({\textbf{{0}}},I_{s}), the random variable f({\textbf{{Z}}}) satisfies the following concentration inequality:

P(f({\textbf{{Z}}})-\mathbb{E}f({\textbf{{Z}}})\geq\alpha)\leq e^{-\frac{% \alpha^{2}}{2\rho^{2}}}\text{~{}for all~{}}\alpha\geq 0. (98)

Proof of Prop. D.1: Take f above to be f({\textbf{{z}}})=\|{\textbf{{z}}}\|_{2} for all {\textbf{{z}}}\in\mathbb{R}^{s}, and note that it is 1-Lipschitz by the triangle inequality, so (98) holds true. We have

P(\|{\textbf{{Z}}}\|_{2}-\mathbb{E}\|{\textbf{{Z}}}\|_{2}\geq\alpha)\leq e^{-% \frac{\alpha^{2}}{2}}\text{~{}for all~{}}\alpha\geq 0.

By Jensen’s inequality,

\mathbb{E}\|{\textbf{{Z}}}\|_{2}=\mathbb{E}\sqrt{Z_{1}^{2}+\dots+Z_{s}^{2}}% \leq\sqrt{\mathbb{E}(Z_{1}^{2}+\dots+Z_{s}^{2})}=\sqrt{s}.

Therefore,

P(\|{\textbf{{Z}}}\|_{2}-\sqrt{s}\geq\alpha)\leq P(\|{\textbf{{Z}}}\|_{2}-% \mathbb{E}\|{\textbf{{Z}}}\|_{2}\geq\alpha)\leq e^{-\frac{\alpha^{2}}{2}}\text% {~{}for all~{}}\alpha\geq 0.
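
A Monte Carlo illustration of the bound (97): for a standard Gaussian vector the empirical tail of \|{\textbf{{Z}}}\|_{2} should sit below e^{-(\alpha-\sqrt{s})^{2}/2} for every \alpha\geq\sqrt{s} (the values of s and \alpha below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(5)
s, trials = 10, 10**6
norms = np.linalg.norm(rng.normal(size=(trials, s)), axis=1)

for alpha in (np.sqrt(s) + 0.5, np.sqrt(s) + 1.0, np.sqrt(s) + 2.0):
    empirical = np.mean(norms >= alpha)
    bound = np.exp(-(alpha - np.sqrt(s)) ** 2 / 2)
    print(f"alpha={alpha:.2f}  empirical={empirical:.4f}  bound={bound:.4f}")
```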

Proof of Prop. V.8. Observe that \sqrt{h(\tilde{{\textbf{{u}}}})}\geq\sqrt{h({\textbf{{t}}})}=\sqrt{C} for all \tilde{{\textbf{{u}}}}\in\mathbb{R}^{s}, and so E(\tilde{{\textbf{{u}}}},\alpha)\subseteq E({\textbf{{t}}},\alpha). Define the following set:

E(\alpha)=\{{\textbf{{u}}}\in\mathbb{R}^{s}:\sqrt{({\textbf{{u}}}-{\textbf{{t}% }})^{T}\Phi({\textbf{{u}}}-{\textbf{{t}}})}>\alpha\}.

Since \sqrt{h({\textbf{{u}}})}\leq\sqrt{({\textbf{{u}}}-{\textbf{{t}}})^{T}\Phi({% \textbf{{u}}}-{\textbf{{t}}})}+\sqrt{C} for all {\textbf{{u}}}\in\mathbb{R}^{s}, we have E({\textbf{{t}}},\alpha)\subseteq E(\alpha). Let {\textbf{{U}}}\sim{\mathcal{N}}({\textbf{{t}}},\Phi^{-1}) be an s-dimensional Gaussian random vector with mean vector t and covariance matrix \Phi^{-1}. Let {\textbf{{Z}}}=\Phi^{1/2}({\textbf{{U}}}-{\textbf{{t}}}). Then {\textbf{{Z}}}\sim{\mathcal{N}}({\textbf{{0}}},I_{s}). Indeed, the mean vector of Z is trivially zero, and the covariance matrix is found as

\mathbb{E}({\textbf{{Z}}}{\textbf{{Z}}}^{T})=\mathbb{E}(\Phi^{1/2}({\textbf{{U% }}}-{\textbf{{t}}})({\textbf{{U}}}-{\textbf{{t}}})^{T}\Phi^{1/2})=\Phi^{1/2}% \Phi^{-1}\Phi^{1/2}=I_{s},

respectively. We obtain, for all \alpha\geq\sqrt{s},

\displaystyle\frac{\int_{E(\tilde{{\textbf{{u}}}},\alpha)}\exp(-\frac{1}{2}h({% \textbf{{u}}}))d{\textbf{{u}}}}{\int_{\mathbb{R}^{s}}\exp(-\frac{1}{2}h({% \textbf{{u}}}))d{\textbf{{u}}}} \displaystyle=\frac{\int_{E(\tilde{{\textbf{{u}}}},\alpha)}\exp(-\frac{1}{2}({% \textbf{{u}}}-{\textbf{{t}}})^{T}\Phi({\textbf{{u}}}-{\textbf{{t}}}))d{\textbf% {{u}}}}{\int_{\mathbb{R}^{s}}\exp(-\frac{1}{2}({\textbf{{u}}}-{\textbf{{t}}})^% {T}\Phi({\textbf{{u}}}-{\textbf{{t}}}))d{\textbf{{u}}}}
\displaystyle=P({\textbf{{U}}}\in E(\tilde{{\textbf{{u}}}},\alpha))\leq P({% \textbf{{U}}}\in E(\alpha))
\displaystyle=P(\sqrt{({\textbf{{U}}}-{\textbf{{t}}})^{T}\Phi({\textbf{{U}}}-{% \textbf{{t}}})}>\alpha)
\displaystyle=P(\sqrt{{\textbf{{Z}}}^{T}{\textbf{{Z}}}}>\alpha)
\displaystyle\leq e^{-\frac{(\alpha-\sqrt{s})^{2}}{2}}.

where the last step follows by (97). ∎

Appendix E Proof of Step (b) in Eq. (72)

Let \Lambda be an s\times s positive definite matrix, and let t be a column vector in \mathbb{R}^{s}. It is easily seen that for a Gaussian random variable X\sim{\mathcal{N}}(\mu,\sigma^{2}),

\mathbb{E}|X|\leq|\mu|+\sqrt{\frac{2\sigma^{2}}{\pi}}.

Therefore, we have

\displaystyle\|{\textbf{{U}}}\|_{1} \displaystyle\leq\|{\textbf{{t}}}\|_{1}+\sqrt{\frac{2}{\pi}}\sum_{i=1}^{s}% \sqrt{\operatorname{Var}(U_{i})}\leq\sqrt{s}\|{\textbf{{t}}}\|_{2}+\sqrt{\frac% {2s}{\pi}}\Big{(}\sum_{i=1}^{s}\operatorname{Var}(U_{i})\Big{)}^{1/2}
\displaystyle=\sqrt{s}\|{\textbf{{t}}}\|_{2}+\sqrt{\frac{2s}{\pi}\operatorname% {tr}(\Lambda)}

which is what we used to obtain the last line in (72).
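
The Gaussian moment bound \mathbb{E}|X|\leq|\mu|+\sqrt{2\sigma^{2}/\pi} follows from \mathbb{E}|X|\leq|\mu|+\mathbb{E}|X-\mu| together with \mathbb{E}|X-\mu|=\sigma\sqrt{2/\pi}; a quick Monte Carlo check with arbitrary \mu and \sigma:

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma = -1.3, 2.0
x = rng.normal(mu, sigma, size=10**6)
# Empirical E|X| versus the bound |mu| + sqrt(2*sigma^2/pi).
print(np.mean(np.abs(x)), abs(mu) + np.sqrt(2 * sigma**2 / np.pi))
```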

Appendix F Proof of (74)

To prove the last estimate in (74), we will show that there exists an integer N(k,\epsilon) such that for every n\geq N(k,\epsilon) and every y^{n}\in E_{2},

\int_{(B_{1})^{c}}\Big{(}\sum_{i=1}^{k}u_{i}^{2}\Big{)}^{1/2}f_{1}({\textbf{{u% }}},y^{n})d{\textbf{{u}}}<\frac{1}{n^{7/13}}. (99)

Given y^{n}\in{\mathscr{Y}}^{n}, define \bar{{\textbf{{U}}}}(y^{n})=(\bar{U}_{1}(y^{n}),\dots,\bar{U}_{k-1}(y^{n})) as a (k-1)-dimensional Gaussian random vector with density function f_{\bar{{\textbf{{U}}}}(y^{n})}(\cdot)=f_{1}(\cdot,y^{n}), mean vector \Phi^{-1}{\textbf{{w}}} and covariance matrix \Phi^{-1}. Note \Phi^{-1}{\textbf{{w}}} depends on y^{n}. According to (48), (49), (60), and (69),

f_{1}(\bar{{\textbf{{u}}}},y^{n})=\frac{\exp(-\frac{1}{2}h_{{\textbf{{v}}}}(% \bar{{\textbf{{u}}}}))}{\int_{\mathbb{R}^{k-1}}\exp(-\frac{1}{2}h_{{\textbf{{v% }}}}({\textbf{{u}}}))d{\textbf{{u}}}}.

With this, we can use the result of Prop. V.8, taking s=k-1. Namely, for all \alpha\geq\sqrt{k-1},\tilde{{\textbf{{u}}}}\in{\mathbb{R}}^{k-1}, inequality (65) gives the estimate

P\big{(}\bar{{\textbf{{U}}}}(y^{n})\in E_{{\textbf{{v}}}}(\tilde{{\textbf{{u}}% }},\alpha)\big{)}=\int_{E_{{\textbf{{v}}}}(\tilde{{\textbf{{u}}}},\alpha)}f_{1% }(\bar{{\textbf{{u}}}},y^{n})d\bar{{\textbf{{u}}}}\leq\exp\Big{(}-\frac{1}{2}(% \alpha-\sqrt{k-1})^{2}\Big{)}.

For a given y^{n}\in E_{2}, take \tilde{{\textbf{{u}}}} whose existence is established in Prop. V.7. Using (63), we deduce that for every y^{n}\in E_{2} and \alpha\geq n^{-5/13}

P\big{(}\bar{{\textbf{{U}}}}(y^{n})\in(B(\alpha))^{c}\big{)}\leq\exp\Big{\{}-% \frac{1}{2}\Big{(}\delta_{0}n^{1/2}(\alpha-n^{-5/13})+n^{1/26}-\sqrt{k-1}\Big{% )}^{2}\Big{\}}. (100)

At this point we wish to ensure that n^{1/26}-\sqrt{k-1}>0. Since n is taken to be sufficiently large, in particular n\geq N(k,\epsilon), it suffices to ensure that N(k,\epsilon)>k^{13}, which entails no loss of generality.

Define a random variable

\bar{Z}(y^{n})=\Big{(}\sum_{i=1}^{k-1}\big{(}\bar{U}_{i}(y^{n})\big{)}^{2}+% \big{(}\sum_{i=1}^{k-1}\bar{U}_{i}(y^{n})\big{)}^{2}\Big{)}^{1/2}.

Observe that

\{\bar{{\textbf{{U}}}}(y^{n})\in(B(\alpha))^{c}\}=\{\bar{Z}(y^{n})\geq\alpha\}

and thus by (100), for every y^{n}\in E_{2} and \alpha\geq n^{-5/13},

P\Big{(}\bar{Z}(y^{n})\geq\alpha\Big{)}\leq\exp\Big{\{}-\frac{1}{2}\Big{(}% \delta_{0}n^{1/2}(\alpha-n^{-5/13})+n^{1/26}-\sqrt{k-1}\Big{)}^{2}\Big{\}}. (101)

Note that this implies that \lim_{\bar{z}\to\infty}\bar{z}P(\bar{Z}(y^{n})\geq\bar{z})=0. Then we have, for every y^{n}\in E_{2},

\displaystyle\int_{(B_{1})^{c}}\Big{(}\sum_{i=1}^{k}u_{i}^{2}\Big{)}^{1/2} \displaystyle f_{1}({\textbf{{u}}},y^{n})d{\textbf{{u}}}={\int_{\bar{{\textbf{% {u}}}}\in(B_{1})^{c}}\Big{(}\sum_{i=1}^{k-1}\bar{u}_{i}^{2}+\Big{(}\sum_{i=1}^% {k-1}\bar{u}_{i}\Big{)}^{2}\Big{)}^{1/2}f_{\bar{{\textbf{{U}}}}(y^{n})}(\bar{{% \textbf{{u}}}})d\bar{{\textbf{{u}}}}}
\displaystyle=\int_{\bar{z}\geq n^{-5/13}}\bar{z}f_{\bar{Z}(y^{n})}(\bar{z})d% \bar{z}=\int_{\bar{z}\geq n^{-5/13}}\bar{z}dP(\bar{Z}(y^{n})\leq\bar{z})
\displaystyle=-\int_{\bar{z}\geq n^{-5/13}}\bar{z}dP(\bar{Z}(y^{n})\geq\bar{z})
\displaystyle=-\Big{(}\bar{z}P(\bar{Z}(y^{n})\geq\bar{z})\Big{|}_{n^{-5/13}}^{% \infty}-\int_{\bar{z}\geq n^{-5/13}}P(\bar{Z}(y^{n})\geq\bar{z})d\bar{z}\Big{)}
\displaystyle=n^{-5/13}P(\bar{Z}(y^{n})\geq n^{-5/13})+\int_{\bar{z}\geq n^{-5% /13}}P(\bar{Z}(y^{n})\geq\bar{z})d\bar{z}-\lim_{\bar{z}\to\infty}\bar{z}P(\bar% {Z}(y^{n})\geq\bar{z})
\displaystyle\leq n^{-5/13}\exp\Big{(}-\frac{1}{2}\Big{(}n^{1/26}-\sqrt{k-1}% \Big{)}^{2}\Big{)}
\displaystyle           +\int_{\bar{z}\geq n^{-5/13}}\exp\Big{(}-\frac{1}{2}% \Big{(}\delta_{0}n^{1/2}(\bar{z}-n^{-5/13})+n^{1/26}-\sqrt{k-1}\Big{)}^{2}\Big% {)}d\bar{z}, (102)

where in the last two steps we used (101).

To bound the second term on the last line of (102), we interpret the integrand as proportional to the density of a (univariate) Gaussian random variable Z with mean n^{-5/13}-\delta_{0}^{-1}n^{-6/13}+\delta_{0}^{-1}n^{-1/2}\sqrt{k-1} and variance \delta_{0}^{-2}n^{-1}. Then

\displaystyle\int_{\bar{z}\geq n^{-5/13}} \displaystyle\exp\Big{(}-\frac{1}{2}\Big{(}\delta_{0}n^{1/2}(\bar{z}-n^{-5/13}% )+n^{1/26}-\sqrt{k-1}\Big{)}^{2}\Big{)}d\bar{z}
\displaystyle=\frac{\sqrt{2\pi}}{\delta_{0}\sqrt{n}}\int_{z\geq n^{-5/13}}% \frac{\delta_{0}\sqrt{n}}{\sqrt{2\pi}}\exp\Big{(}-\frac{\delta_{0}^{2}n}{2}% \Big{(}z-\big{(}n^{-5/13}-\delta_{0}^{-1}n^{-6/13}+\delta_{0}^{-1}n^{-1/2}% \sqrt{k-1}\big{)}\Big{)}^{2}\Big{)}dz
\displaystyle=\frac{\sqrt{2\pi}}{\delta_{0}\sqrt{n}}P(Z\geq n^{-5/13})=\frac{% \sqrt{2\pi}}{\delta_{0}\sqrt{n}}P\Big{(}Z-\mathbb{E}Z\geq\delta_{0}^{-1}n^{-6/% 13}-\delta_{0}^{-1}n^{-1/2}\sqrt{k-1}\Big{)}
\displaystyle{\leq}\frac{\sqrt{2\pi}}{\delta_{0}\sqrt{n}}\exp\Big{(}-\frac{% \delta_{0}^{2}n}{2}\big{(}\delta_{0}^{-1}n^{-6/13}-\delta_{0}^{-1}n^{-1/2}% \sqrt{k-1}\big{)}^{2}\Big{)}
\displaystyle=\frac{\sqrt{2\pi}}{\delta_{0}\sqrt{n}}\exp\Big{(}-\frac{1}{2}% \big{(}n^{1/26}-\sqrt{k-1}\big{)}^{2}\Big{)}, (103)

where the inequality follows from the Gaussian tail bound (a one-sided version of (24)). It remains to use (103) in (102): We obtain that for every y^{n}\in E_{2},

\displaystyle\int_{(B_{1})^{c}}\Big{(}\sum_{i=1}^{k}u_{i}^{2}\Big{)}^{1/2}f_{1% }({\textbf{{u}}},y^{n})d{\textbf{{u}}} \displaystyle\leq\Big{(}\frac{1}{n^{5/13}}+\frac{\sqrt{2\pi}}{\delta_{0}\sqrt{% n}}\Big{)}\exp\Big{(}-\frac{1}{2}\big{(}n^{1/26}-\sqrt{k-1}\big{)}^{2}\Big{)}
\displaystyle<\frac{1}{n^{7/13}},

where the last inequality holds for all n\geq N(k,\epsilon) as long as N(k,\epsilon) is large enough.
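
The reparametrization used in (103) is easy to confirm numerically: the integrand is, up to the factor \sqrt{2\pi}/(\delta_{0}\sqrt{n}), the density of the Gaussian random variable Z introduced before (103). The sketch below checks this, together with the tail bound in the last step, for illustrative values of n, k and \delta_{0} (these are placeholders chosen only so that n^{1/26}>\sqrt{k-1}; they are not values derived in the paper):

import numpy as np
from scipy import stats, integrate

n, k, delta0 = 10**7, 3, 0.2              # placeholder values with n^{1/26} > sqrt(k-1)
a = n ** (-5 / 13)                        # the lower integration limit n^{-5/13}
b = n ** (1 / 26) - np.sqrt(k - 1)        # the positive shift n^{1/26} - sqrt(k-1)
c = delta0 * np.sqrt(n)

# Left-hand side of (103): direct numerical integration (truncated 20 std devs out).
lhs, _ = integrate.quad(lambda z: np.exp(-0.5 * (c * (z - a) + b) ** 2), a, a + 20 / c)

# Right-hand side: (sqrt(2*pi)/c) * P(Z >= a) with Z ~ N(a - b/c, 1/c^2).
Z = stats.norm(loc=a - b / c, scale=1 / c)
rhs = np.sqrt(2 * np.pi) / c * Z.sf(a)

# Gaussian tail bound used in the last step of (103).
bound = np.sqrt(2 * np.pi) / c * np.exp(-0.5 * b ** 2)

print(lhs, rhs, bound)                    # lhs equals rhs up to quadrature error; both <= bound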

Appendix G Bounding the integral in (76)

Let

I:=\int_{(B_{1})^{c}}\sum_{i=1}^{k}\Big{(}u_{i}-\mathbb{E}\big{(}\bar{U}_{i}(y% ^{n})\big{)}\Big{)}^{2}f_{1}({\textbf{{u}}},y^{n})d{\textbf{{u}}}

Here we prove the following bound:

There exists an integer N(k,\epsilon) such that for every n\geq N(k,\epsilon) and every y^{n}\in E_{2},

I\leq\frac{1}{n^{14/13}}. (104)

Using \bar{U}_{k}=-\sum_{i=1}^{k-1}\bar{U}_{i}, we have

\displaystyle I=\int_{\bar{{\textbf{{u}}}}\in(B_{1})^{c}}\sum_{i=1}^{k-1}(\bar% {u}_{i} \displaystyle-\mathbb{E}(\bar{U}_{i}(y^{n})))^{2}f_{1}(\bar{{\textbf{{u}}}},y^% {n})d\bar{{\textbf{{u}}}}
\displaystyle+\int_{\bar{{\textbf{{u}}}}\in(B_{1})^{c}}\Big{(}\sum_{i=1}^{k-1}% (\bar{u}_{i}-\mathbb{E}(\bar{U}_{i}(y^{n})))\Big{)}^{2}f_{1}(\bar{{\textbf{{u}% }}},y^{n})d\bar{{\textbf{{u}}}}

Since (\sum_{i=1}^{m}x_{i})^{2}\leq m\sum_{i=1}^{m}x_{i}^{2}, we obtain

I\leq k\int_{\bar{{\textbf{{u}}}}\in(B_{1})^{c}}\sum_{i=1}^{k-1}\Big{(}\bar{u}% _{i}-\mathbb{E}(\bar{U}_{i}(y^{n}))\Big{)}^{2}f_{1}(\bar{{\textbf{{u}}}},y^{n}% )d\bar{{\textbf{{u}}}}

Since \mathbb{E}(\bar{{\textbf{{U}}}}(y^{n}))=\Phi^{-1}{\textbf{{w}}}, we can use inequality (73) which says that for every {\textbf{{z}}}\in{\mathbb{R}}^{k-1}, {\textbf{{z}}}^{T}\Phi{\textbf{{z}}}\geq n\delta_{0}^{2}\|{\textbf{{z}}}\|^{2}. Taking {\textbf{{z}}}=(\bar{{\textbf{{u}}}}-\Phi^{-1}{\textbf{{w}}}) we can continue as follows:

\displaystyle I \displaystyle\leq k\int_{\bar{{\textbf{{u}}}}\in(B_{1})^{c}}(\bar{{\textbf{{u}% }}}-\Phi^{-1}{\textbf{{w}}})^{T}(\bar{{\textbf{{u}}}}-\Phi^{-1}{\textbf{{w}}})% f_{1}(\bar{{\textbf{{u}}}},y^{n})d\bar{{\textbf{{u}}}}
\displaystyle\leq\frac{k}{n\delta_{0}^{2}}\int_{\bar{{\textbf{{u}}}}\in(B_{1})^{c}}(\bar{{\textbf{{u}}}}-\Phi^{-1}{\textbf{{w}}})^{T}\Phi(\bar{{\textbf{{u}}}}-\Phi^{-1}{\textbf{{w}}})f_{1}(\bar{{\textbf{{u}}}},y^{n})d\bar{{\textbf{{u}}}}. (105)

To bound the integral in (105), for a given \alpha\geq 0, we define another set

E_{{\textbf{{v}}}}(\alpha)=\{{\textbf{{u}}}\in\mathbb{R}^{k-1}:({\textbf{{u}}}% -\Phi^{-1}{\textbf{{w}}})^{T}\Phi({\textbf{{u}}}-\Phi^{-1}{\textbf{{w}}})>% \alpha\}.

Now turning to the remark after (67), we note that h_{{\textbf{{v}}}}(\Phi^{-1}{\textbf{{w}}})=C\geq 0. Thus for every \tilde{{\textbf{{u}}}}\in\mathbb{R}^{k-1} we obtain

\displaystyle({\textbf{{u}}}-\Phi^{-1}{\textbf{{w}}})^{T}\Phi({\textbf{{u}}}-% \Phi^{-1}{\textbf{{w}}}) \displaystyle=h_{{\textbf{{v}}}}({\textbf{{u}}})-h_{{\textbf{{v}}}}(\Phi^{-1}{% \textbf{{w}}})
\displaystyle\geq h_{{\textbf{{v}}}}({\textbf{{u}}})-h_{{\textbf{{v}}}}(\tilde% {{\textbf{{u}}}})\geq\big{(}\sqrt{h_{{\textbf{{v}}}}({\textbf{{u}}})}-\sqrt{h_% {{\textbf{{v}}}}(\tilde{{\textbf{{u}}}})}\big{)}^{2}.

As a result, for every \tilde{{\textbf{{u}}}}\in\mathbb{R}^{k-1},

E_{{\textbf{{v}}}}(\tilde{{\textbf{{u}}}},\alpha)\subseteq E_{{\textbf{{v}}}}(% \alpha^{2}),\quad\alpha\geq 0.

According to Prop. V.7, for every y^{n}\in E_{2}, there exists \tilde{{\textbf{{u}}}}\in B_{2} such that (B_{1})^{c}\subseteq E_{{\textbf{{v}}}}(\tilde{{\textbf{{u}}}},n^{1/26}). Therefore for all y^{n}\in E_{2}, (B_{1})^{c}\subseteq E_{{\textbf{{v}}}}(n^{1/13}). Consequently, for every y^{n}\in E_{2},

\displaystyle\int_{(B_{1})^{c}} \displaystyle(\bar{{\textbf{{u}}}}-\Phi^{-1}{\textbf{{w}}})^{T}\Phi(\bar{{% \textbf{{u}}}}-\Phi^{-1}{\textbf{{w}}})f_{1}(\bar{{\textbf{{u}}}},y^{n})d\bar{% {\textbf{{u}}}}
\displaystyle\leq\int_{E_{{\textbf{{v}}}}(n^{1/13})}(\bar{{\textbf{{u}}}}-\Phi% ^{-1}{\textbf{{w}}})^{T}\Phi(\bar{{\textbf{{u}}}}-\Phi^{-1}{\textbf{{w}}})f_{1% }(\bar{{\textbf{{u}}}},y^{n})d\bar{{\textbf{{u}}}}
\displaystyle\overset{(a)}{=}\int_{E_{{\textbf{{v}}}}(n^{1/13})}(\bar{{\textbf% {{u}}}}-\Phi^{-1}{\textbf{{w}}})^{T}\Phi(\bar{{\textbf{{u}}}}-\Phi^{-1}{% \textbf{{w}}})\sqrt{\frac{{|\Phi|}}{{(2\pi)^{k-1}}}}\exp\Big{(}-\frac{1}{2}(% \bar{{\textbf{{u}}}}-\Phi^{-1}{\textbf{{w}}})^{T}\Phi(\bar{{\textbf{{u}}}}-% \Phi^{-1}{\textbf{{w}}})\Big{)}d\bar{{\textbf{{u}}}}
\displaystyle\overset{(b)}{=}\frac{1}{\sqrt{(2\pi)^{k-1}}}\int_{{\textbf{{z}}}% :\;{\textbf{{z}}}^{T}{\textbf{{z}}}>n^{1/13}}{\textbf{{z}}}^{T}{\textbf{{z}}}% \exp\Big{(}-\frac{1}{2}{\textbf{{z}}}^{T}{\textbf{{z}}}\Big{)}d{\textbf{{z}}}, (106)

where (a) follows from (71); (b) follows from the change of variable {\textbf{{z}}}=\Phi^{1/2}(\bar{{\textbf{{u}}}}-\Phi^{-1}{\textbf{{w}}}) and the fact that |\Phi^{1/2}|=|\Phi|^{1/2}. Let {\textbf{{Z}}}\sim{\mathcal{N}}({\textbf{{0}}},I_{k-1}) be a standard Gaussian random vector. We proceed similarly to (102):

\displaystyle\frac{1}{\sqrt{(2\pi)^{k-1}}} \displaystyle\int_{\{{\textbf{{z}}}:\;{\textbf{{z}}}^{T}{\textbf{{z}}}>n^{1/13% }\}}{\textbf{{z}}}^{T}{\textbf{{z}}}\exp\Big{(}-\frac{1}{2}{\textbf{{z}}}^{T}{% \textbf{{z}}}\Big{)}d{\textbf{{z}}}
\displaystyle=\int_{a=n^{1/13}}^{\infty}adP({\textbf{{Z}}}^{T}{\textbf{{Z}}}% \leq a)
\displaystyle=-\int_{n^{1/13}}^{\infty}adP({\textbf{{Z}}}^{T}{\textbf{{Z}}}>a)% =-\Big{(}aP({\textbf{{Z}}}^{T}{\textbf{{Z}}}>a)\Big{|}_{n^{1/13}}^{\infty}-% \int_{n^{1/13}}^{\infty}P({\textbf{{Z}}}^{T}{\textbf{{Z}}}>a)da\Big{)}
\displaystyle=n^{1/13}P({\textbf{{Z}}}^{T}{\textbf{{Z}}}>n^{1/13})+\int_{n^{1/% 13}}^{\infty}P({\textbf{{Z}}}^{T}{\textbf{{Z}}}>a)da-\lim_{a\to\infty}aP({% \textbf{{Z}}}^{T}{\textbf{{Z}}}>a)
\displaystyle\overset{(a)}{\leq}n^{1/13}\exp\Big{(}-\frac{1}{2}(n^{1/26}-\sqrt% {k-1})^{2}\Big{)}+\int_{n^{1/13}}^{\infty}\exp\Big{(}-\frac{1}{2}(\sqrt{a}-% \sqrt{k-1})^{2}\Big{)}da
\displaystyle\overset{(b)}{\leq}n^{1/13}\exp\Big{(}-\frac{1}{2}(n^{1/26}-\sqrt% {k-1})^{2}\Big{)}+\int_{n^{1/13}}^{\infty}\exp\Big{(}-\frac{a}{8}\Big{)}da
\displaystyle=n^{1/13}\exp\Big{(}-\frac{1}{2}(n^{1/26}-\sqrt{k-1})^{2}\Big{)}+% 8\exp\Big{(}-\frac{n^{1/13}}{8}\Big{)}, (107)

where in (a) we used Prop. D.1 and the fact that \lim_{a\to\infty}aP({\textbf{{Z}}}^{T}{\textbf{{Z}}}>a)=0, and in (b) we used the inequality (\sqrt{a}-\sqrt{k-1})^{2}\geq a/4, which is valid for a\geq 4(k-1) and thus holds on the entire range of integration once n^{1/13}\geq 4(k-1), i.e., once n is greater than a suitably chosen value N(k,\epsilon).
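
Step (a) rests on a tail bound of the form P({\textbf{{Z}}}^{T}{\textbf{{Z}}}>a)\leq\exp(-\frac{1}{2}(\sqrt{a}-\sqrt{k-1})^{2}) for a\geq k-1, which is how Prop. D.1 is applied here; it also follows from standard concentration of the norm of a Gaussian vector. A small Monte Carlo sanity check (the value of k, the grid of thresholds, and the sample size are arbitrary illustrations):

import numpy as np

rng = np.random.default_rng(1)
k = 5                                     # illustrative support size
m = 200_000                               # Monte Carlo sample size
Z = rng.standard_normal((m, k - 1))       # rows are samples of Z ~ N(0, I_{k-1})
sq_norm = np.sum(Z ** 2, axis=1)          # Z^T Z, chi-square with k-1 degrees of freedom

for a in [k - 1, 2 * (k - 1), 4 * (k - 1), 8 * (k - 1)]:
    empirical = np.mean(sq_norm > a)
    bound = np.exp(-0.5 * (np.sqrt(a) - np.sqrt(k - 1)) ** 2)
    print(a, empirical, bound)            # the empirical tail stays below the bound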

Finally, using (106) and (107) in (105), we conclude that for every y^{n}\in E_{2},

\displaystyle I\leq \displaystyle\frac{k}{n^{12/13}\delta_{0}^{2}}\exp\Big{(}-\frac{1}{2}(n^{1/26}% -\sqrt{k-1})^{2}\Big{)}+\frac{8k}{n\delta_{0}^{2}}\exp\Big{(}-\frac{n^{1/13}}{% 8}\Big{)}
\displaystyle< \displaystyle\frac{1}{n^{14/13}},

and this implies (104) for all n\geq N(k,\epsilon) as long as we take a sufficiently large N(k,\epsilon).
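
To see the last inequality in action, one can tabulate both sides for concrete constants. In the sketch below, k and \delta_{0} are arbitrary placeholders (in particular, \delta_{0} is not a value computed from the paper), so the grid only illustrates that the two-term bound eventually drops below n^{-14/13}; it says nothing about the actual size of N(k,\epsilon):

import numpy as np

k, delta0 = 3, 0.1                        # placeholder constants
for n in [1e12, 1e18, 1e24, 1e30, 1e36]:
    term1 = k / (n ** (12 / 13) * delta0 ** 2) * np.exp(-0.5 * (n ** (1 / 26) - np.sqrt(k - 1)) ** 2)
    term2 = 8 * k / (n * delta0 ** 2) * np.exp(-(n ** (1 / 13)) / 8)
    print(f"n={n:.0e}  bound={term1 + term2:.3e}  n^(-14/13)={n ** (-14 / 13):.3e}")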

Appendix H Proof of Proposition V.12

In this section we prove a somewhat stronger result which immediately implies Prop. V.12.

Proposition H.1.

Let A=(a_{ij}) be a (k-1)\times(k-1) positive definite matrix and let B=A^{-1}. Then

(\operatorname{tr}(B)+{\textbf{{1}}}^{T}B{\textbf{{1}}})\Big{(}\frac{\sum_{i=1% }^{k-1}a_{ii}}{k}-\frac{\sum_{i\neq j}a_{ij}}{k(k-1)}\Big{)}\geq k-1. (108)

The equality holds if and only if

a_{ij}=\begin{cases}2a&\text{if }i=j\\ a&\text{otherwise}.\end{cases} (109)

for some a>0.

Proof.

Let \lambda_{1},\dots,\lambda_{k-1} be the eigenvalues of A and let v_{1},v_{2},\dots,v_{k-1} be the corresponding eigenvectors. Since A\succ 0 (in particular, A is symmetric), the eigenvectors v_{1},\dots,v_{k-1} can be chosen to form an orthonormal basis of \mathbb{R}^{k-1}. Let us expand the all-ones vector in \mathbb{R}^{k-1} in this basis:

{\textbf{{1}}}=\sum_{i=1}^{k-1}\alpha_{i}v_{i}.

It is clear that

\sum_{i=1}^{k-1}\alpha_{i}^{2}=k-1 (110)

and the coefficients \alpha_{i} satisfy 0\leq\alpha_{i}^{2}\leq k-1,i=1,\dots,k-1. Furthermore, A=\sum_{i=1}^{k-1}\lambda_{i}v_{i}v_{i}^{T} and B=\sum_{i=1}^{k-1}\frac{1}{\lambda_{i}}v_{i}v_{i}^{T}. We obtain

\operatorname{tr}(B)+{\textbf{{1}}}^{T}B{\textbf{{1}}}=\sum_{i=1}^{k-1}\frac{1% }{\lambda_{i}}+\sum_{i=1}^{k-1}\frac{\alpha_{i}^{2}}{\lambda_{i}}

and since

\frac{\sum_{i=1}^{k-1}a_{ii}}{k}-\frac{\sum_{i\neq j}a_{ij}}{k(k-1)}=\frac{% \operatorname{tr}(A)}{k-1}-\frac{{\textbf{{1}}}^{T}A{\textbf{{1}}}}{k(k-1)}=% \frac{1}{k(k-1)}\Big{(}k\sum_{i=1}^{k-1}\lambda_{i}-\sum_{i=1}^{k-1}\lambda_{i% }\alpha_{i}^{2}\Big{)},

the left-hand side of (108) can be written as

\displaystyle\Big{(}\sum_{i=1}^{k-1}\frac{1}{\lambda_{i}}+\sum_{i=1}^{k-1}% \frac{\alpha_{i}^{2}}{\lambda_{i}}\Big{)}\Big{(}\frac{\sum_{i=1}^{k-1}\lambda_% {i}}{k-1} \displaystyle-\frac{\sum_{i=1}^{k-1}\lambda_{i}\alpha_{i}^{2}}{k(k-1)}\Big{)}=% \frac{1}{k(k-1)}\Big{(}\sum_{i=1}^{k-1}\frac{1+\alpha_{i}^{2}}{\lambda_{i}}% \Big{)}\Big{(}\sum_{i=1}^{k-1}\lambda_{i}(k-\alpha_{i}^{2})\Big{)}
\displaystyle\geq\frac{1}{k(k-1)}\Big{(}\sum_{i=1}^{k-1}\sqrt{(1+\alpha_{i}^{2% })(k-\alpha_{i}^{2})}\Big{)}^{2} (111)
\displaystyle\geq\frac{1}{k(k-1)}\Big{(}\sum_{i=1}^{k-1}\sqrt{k}\Big{)}^{2}=k-1, (112)

where (111) follows from the Cauchy-Schwarz inequality and (112) follows from the fact that (1+x)(k-x)\geq k for all 0\leq x\leq k-1. This completes the proof of (108).

It is easy to verify that the equality in (108) holds if (109) holds. Let us prove the “only if” part. Note that equality in (112) holds only if \alpha_{i}^{2}=0 or k-1 for all i=1,2,\dots,k-1. By (110) this means that one of the \alpha_{i}’s is \sqrt{k-1}, and all the other \alpha_{i}’s are 0. Without loss of generality, suppose that \alpha_{1}=\sqrt{k-1} and \alpha_{i}=0 for 2\leq i\leq k-1, and thus

v_{1}=\frac{1}{\sqrt{k-1}}{\textbf{{1}}}. (113)

Moreover, if (111) holds with equality, then

\frac{\lambda_{1}^{2}(k-\alpha_{1}^{2})}{1+\alpha_{1}^{2}}=\frac{\lambda_{2}^{% 2}(k-\alpha_{2}^{2})}{1+\alpha_{2}^{2}}=\dots=\frac{\lambda_{k-1}^{2}(k-\alpha% _{k-1}^{2})}{1+\alpha_{k-1}^{2}}

and thus

\lambda_{2}=\lambda_{3}=\dots=\lambda_{k-1}=\frac{\lambda_{1}}{k}.

Let \tilde{J} be a (k-1)\times(k-1) matrix given by

\tilde{J}=\frac{\lambda_{1}}{k}\left[\begin{array}[]{cccc}1&1&\dots&1\\ 1&1&\dots&1\\ \vdots&\vdots&\vdots&\vdots\\ 1&1&\dots&1\end{array}\right]

By (113), the basis vectors v_{i},i=2,3,\dots,k-1 are orthogonal to {\textbf{{1}}}. This implies that

(A-\tilde{J})v_{i}=\frac{\lambda_{1}}{k}v_{i},\quad i=2,3,\dots,k-1.

On the other hand, \tilde{J}v_{1}=\frac{(k-1)\lambda_{1}}{k}v_{1}. Therefore, (A-\tilde{J})v_{1}=\frac{\lambda_{1}}{k}v_{1}. As a result, (A-\tilde{J})v_{i}=\frac{\lambda_{1}}{k}v_{i} for all i=1,2,\dots,k-1. Since v_{1},\dots,v_{k-1} span \mathbb{R}^{k-1}, we deduce that A-\tilde{J}=\frac{\lambda_{1}}{k}I_{k-1}, where I_{k-1} is the identity matrix. This implies that

a_{ij}=\begin{cases}2\frac{\lambda_{1}}{k}&\text{if }i=j,\\ \frac{\lambda_{1}}{k}&\text{otherwise.}\end{cases}

Since \lambda_{1}>0, the matrix A has the form (109) with a=\lambda_{1}/k, and this concludes the proof. ∎
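
Proposition H.1 is also straightforward to check numerically. The sketch below evaluates the left-hand side of (108) for a randomly generated positive definite matrix, where the inequality is strict in general, and for a matrix of the form (109), where it holds with equality (the dimension and the value of a are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
k = 6                                     # illustrative value; A is (k-1) x (k-1)

def lhs_108(A, k):
    """Left-hand side of (108) for a (k-1) x (k-1) positive definite matrix A."""
    B = np.linalg.inv(A)
    one = np.ones(k - 1)
    first = np.trace(B) + one @ B @ one
    off_diag_sum = A.sum() - np.trace(A)
    second = np.trace(A) / k - off_diag_sum / (k * (k - 1))
    return first * second

# Random positive definite A: strict inequality in general.
M = rng.standard_normal((k - 1, k - 1))
A_random = M @ M.T + np.eye(k - 1)
print(lhs_108(A_random, k), ">=", k - 1)

# Matrix of the form (109): 2a on the diagonal, a off the diagonal -> equality.
a = 0.7
A_equality = a * (np.ones((k - 1, k - 1)) + np.eye(k - 1))
print(lhs_108(A_equality, k), "==", k - 1)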

Appendix I Proof of (80)

Let d^{\ast} be as defined in (11). To prove (80), we only need to prove the following inequality:

\sum_{j=1}^{k}\Big{(}\frac{q_{ij}}{q_{i}}\Big{)}^{2}\leq k\Big{(}1+(e^{% \epsilon}-1)^{2}\frac{d^{\ast}(k-d^{\ast})}{(d^{\ast}e^{\epsilon}+k-d^{\ast})^% {2}}\Big{)},\;\;i=1,2,\dots,L. (114)
Proof.

Since {\textbf{{Q}}}\in{\mathscr{D}}_{\epsilon,E}, by definition we have

\frac{q_{ij}}{\min_{j^{\prime}\in[k]}q_{ij^{\prime}}}\in\{1,e^{\epsilon}\}

for all i\in[L] and all j\in[k]. Given i\in[L], define

d_{i}=|\{j:\frac{q_{ij}}{\min_{j^{\prime}\in[k]}q_{ij^{\prime}}}=e^{\epsilon}% \}|.

Then

|\{j:q_{ij}=\min_{j^{\prime}\in[k]}q_{ij^{\prime}}\}|=k-d_{i}.

Therefore,

\displaystyle\sum_{j=1}^{k}\big{(}\frac{q_{ij}}{q_{i}}\big{)}^{2} \displaystyle=d_{i}\big{(}\frac{ke^{\epsilon}}{d_{i}e^{\epsilon}+k-d_{i}}\big{% )}^{2}+(k-d_{i})\big{(}\frac{k}{d_{i}e^{\epsilon}+k-d_{i}}\big{)}^{2}
\displaystyle=k^{2}\frac{d_{i}e^{2\epsilon}+k-d_{i}}{(d_{i}e^{\epsilon}+k-d_{i% })^{2}}
\displaystyle=k\frac{(d_{i}e^{2\epsilon}+k-d_{i})(d_{i}+k-d_{i})}{(d_{i}e^{% \epsilon}+k-d_{i})^{2}}
\displaystyle=k\frac{d_{i}^{2}e^{2\epsilon}+d_{i}(k-d_{i})(e^{2\epsilon}+1)+(k% -d_{i})^{2}}{d_{i}^{2}e^{2\epsilon}+2d_{i}(k-d_{i})e^{\epsilon}+(k-d_{i})^{2}}
\displaystyle=k\Big{(}1+\frac{d_{i}(k-d_{i})(e^{\epsilon}-1)^{2}}{d_{i}^{2}e^{% 2\epsilon}+2d_{i}(k-d_{i})e^{\epsilon}+(k-d_{i})^{2}}\Big{)}
\displaystyle=k\Big{(}1+(e^{\epsilon}-1)^{2}\Big{(}2e^{\epsilon}+\frac{d_{i}}{% k-d_{i}}e^{2\epsilon}+\frac{k-d_{i}}{d_{i}}\Big{)}^{-1}\Big{)}
\displaystyle\overset{(a)}{\leq}k\Big{(}1+(e^{\epsilon}-1)^{2}\Big{(}2e^{% \epsilon}+\frac{d^{\ast}}{k-d^{\ast}}e^{2\epsilon}+\frac{k-d^{\ast}}{d^{\ast}}% \Big{)}^{-1}\Big{)}
\displaystyle=k\Big{(}1+(e^{\epsilon}-1)^{2}\frac{d^{\ast}(k-d^{\ast})}{(d^{% \ast}e^{\epsilon}+k-d^{\ast})^{2}}\Big{)},

where (a) follows directly from the definition of d^{\ast} in (11). ∎
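
Every step in the chain above except (a) is elementary algebra, and the resulting identity can be sanity-checked numerically: for d\in\{1,\dots,k-1\} and D=de^{\epsilon}+k-d, one has d\big(\frac{ke^{\epsilon}}{D}\big)^{2}+(k-d)\big(\frac{k}{D}\big)^{2}=k\big(1+(e^{\epsilon}-1)^{2}\frac{d(k-d)}{D^{2}}\big). A minimal check (the values of k and \epsilon are arbitrary):

import numpy as np

k, eps = 7, 1.5                           # illustrative support size and privacy level
for d in range(1, k):
    D = d * np.exp(eps) + k - d
    lhs = d * (k * np.exp(eps) / D) ** 2 + (k - d) * (k / D) ** 2
    rhs = k * (1 + (np.exp(eps) - 1) ** 2 * d * (k - d) / D ** 2)
    assert np.isclose(lhs, rhs)
print("identity holds for all d = 1, ..., k-1")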

Appendix J Relation to Local Asymptotic Normality and Fisher information matrix

Let {\textbf{{p}}}=(p_{1},p_{2},\dots,p_{k-1},1-\sum_{i=1}^{k-1}p_{i}) be a distribution (probability mass function) on {\mathscr{X}}. Denote by P(y^{n};p_{1},p_{2},\dots,p_{k-1}) the probability mass function of a random vector Y^{n} formed of i.i.d. samples Y^{(i)} drawn according to the distribution pQ. Recall that in the beginning of Section V, we assumed that the output alphabet of Q is {\mathscr{Y}}=\{1,2,\dots,L^{\prime}\}, and we defined t^{\prime}_{i}(y^{n}) to be the number of times that symbol i appears in y^{n}. Therefore,

\displaystyle\log \displaystyle P(y^{n};p_{1},p_{2},\dots,p_{k-1})=\sum_{i=1}^{L^{\prime}}t_{i}^% {\prime}(y^{n})\log\Big{(}{\textbf{{Q}}}(i|k)+\sum_{j=1}^{k-1}p_{j}\big{(}{% \textbf{{Q}}}(i|j)-{\textbf{{Q}}}(i|k)\big{)}\Big{)}
\displaystyle=\sum_{i=1}^{L^{\prime}}t_{i}^{\prime}(y^{n})\log q_{i}^{\prime}+% \sum_{i=1}^{L^{\prime}}t_{i}^{\prime}(y^{n})\log\Big{(}\frac{{\textbf{{Q}}}(i|% k)}{q_{i}^{\prime}}+\sum_{j=1}^{k-1}p_{j}\Big{(}\frac{{\textbf{{Q}}}(i|j)}{q_{% i}^{\prime}}-\frac{{\textbf{{Q}}}(i|k)}{q_{i}^{\prime}}\Big{)}\Big{)}
\displaystyle=\log P(y^{n};1/k,1/k,\dots,1/k)+\sum_{i=1}^{L}\sum_{a\in A_{i}}t% _{a}^{\prime}(y^{n})\log\Big{(}\frac{{\textbf{{Q}}}(a|k)}{q_{a}^{\prime}}+\sum% _{j=1}^{k-1}p_{j}\Big{(}\frac{{\textbf{{Q}}}(a|j)}{q_{a}^{\prime}}-\frac{{% \textbf{{Q}}}(a|k)}{q_{a}^{\prime}}\Big{)}\Big{)}
\displaystyle=\log P(y^{n};1/k,1/k,\dots,1/k)+\sum_{i=1}^{L}t_{i}(y^{n})\log% \Big{(}\frac{q_{ik}}{q_{i}}+\sum_{j=1}^{k-1}p_{j}\Big{(}\frac{q_{ij}}{q_{i}}-% \frac{q_{ik}}{q_{i}}\Big{)}\Big{)}, (115)

where the last equality follows from (34) and (35). Therefore, the function g({\textbf{{u}}},y^{n}) defined in (37) can be written as

g({\textbf{{u}}},y^{n})=\log\frac{P(y^{n};1/k+u_{1},1/k+u_{2},\dots,1/k+u_{k-1% })}{P(y^{n};1/k,1/k,\dots,1/k)}. (116)

In Section V, one of the main steps in the proof of (19) is to approximate g({\textbf{{u}}},y^{n}) as g_{2}({\textbf{{u}}},y^{n}). According to (38) and (56),

g_{2}({\textbf{{u}}},y^{n})=\sum_{i=1}^{L}v_{i}\sum_{j=1}^{k}\frac{u_{j}q_{ij}% }{q_{i}}-\frac{1}{2}\sum_{i=1}^{L}nq_{i}\Big{(}\sum_{j=1}^{k}\frac{u_{j}q_{ij}% }{q_{i}}\Big{)}^{2}=\sum_{i=1}^{L}v_{i}\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}% -\frac{1}{2}{\textbf{{u}}}^{T}\Phi(n,{\textbf{{Q}}}){\textbf{{u}}}.

The observable random variables in our problem are the coordinates of Y^{n}, and the unknown parameters are p_{1},p_{2},\dots,p_{k-1}. Denote by I(p_{1},p_{2},\dots,p_{k-1}) the Fisher information matrix of these parameters based on the observation Y^{n}; at the uniform point its entries are given by

[I(1/k,1/k,\dots,1/k)]_{ij}=-\mathbb{E}\Big{[}\frac{\partial^{2}}{\partial p_{% i}\partial p_{j}}\log P(Y^{n};p_{1},p_{2},\dots,p_{k-1})\Big{]}_{p_{1}=p_{2}=% \dots=p_{k-1}=1/k}. (117)

We claim that

\Phi(n,{\textbf{{Q}}})=I(1/k,1/k,\dots,1/k).

Indeed, by (117) we have

\displaystyle[I(1/k, \displaystyle 1/k,\dots,1/k)]_{ij}
\displaystyle=-\mathbb{E}\Big{[}\frac{\partial^{2}}{\partial p_{i}\partial p_{j}}\Big{(}\sum_{m=1}^{L}t_{m}(Y^{n})\log\Big{(}\frac{q_{mk}}{q_{m}}+\sum_{l=1}^{k-1}p_{l}\Big{(}\frac{q_{ml}}{q_{m}}-\frac{q_{mk}}{q_{m}}\Big{)}\Big{)}\Big{)}\Big{]}_{p_{1}=p_{2}=\dots=p_{k-1}=1/k}
\displaystyle=\mathbb{E}\Big{[}\sum_{m=1}^{L}t_{m}(Y^{n})\frac{(q_{mi}-q_{mk})% (q_{mj}-q_{mk})}{q_{m}^{2}}\Big{]}_{p_{1}=p_{2}=\dots=p_{k-1}=1/k}
\displaystyle=n\sum_{m=1}^{L}\frac{(q_{mi}-q_{mk})(q_{mj}-q_{mk})}{q_{m}},

where the first step follows from (115). Comparing this with (55), we can easily see that \Phi(n,{\textbf{{Q}}})=I(1/k,1/k,\dots,1/k). Furthermore, it is also easy to check that

\sum_{i=1}^{L}v_{i}\sum_{j=1}^{k}\frac{u_{j}q_{ij}}{q_{i}}=\sum_{i=1}^{k-1}u_{% i}\Big{(}\frac{\partial}{\partial p_{i}}\log P(y^{n};p_{1},p_{2},\dots,p_{k-1}% )\Big{|}_{p_{1}=p_{2}=\dots=p_{k-1}=1/k}\Big{)}.

Therefore,

g_{2}({\textbf{{u}}},y^{n})=\sum_{i=1}^{k-1}u_{i}\Big{(}\frac{\partial}{% \partial p_{i}}\log P(y^{n};p_{1},p_{2},\dots,p_{k-1})\Big{|}_{p_{1}=p_{2}=% \dots=p_{k-1}=1/k}\Big{)}-\frac{1}{2}{\textbf{{u}}}^{T}I(1/k,1/k,\dots,1/k){% \textbf{{u}}}.

Combining this with (116), we conclude that g({\textbf{{u}}},y^{n})\approx g_{2}({\textbf{{u}}},y^{n}) for small u can be written as

\displaystyle\log\frac{P(y^{n};1/k+u_{1},1/k+u_{2},\dots,1/k+u_{k-1})}{P(y^{n}% ;1/k,1/k,\dots,1/k)}
\displaystyle\approx \displaystyle\sum_{i=1}^{k-1}u_{i}\Big{(}\frac{\partial}{\partial p_{i}}\log P% (y^{n};p_{1},p_{2},\dots,p_{k-1})\Big{|}_{p_{1}=p_{2}=\dots=p_{k-1}=1/k}\Big{)% }-\frac{1}{2}{\textbf{{u}}}^{T}I(1/k,1/k,\dots,1/k){\textbf{{u}}}.

This relation is a classic way of expressing Le Cam’s result on local asymptotic normality (see [16, Chapter 2, Theorem 1.1]). As we already remarked in Sec. III, we could use this result to give a simpler proof that a relation of the form (16) holds for an individual privatization scheme Q; however, along this route there seems to be no immediate way to establish the uniform bound (16).
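
The identity \Phi(n,{\textbf{{Q}}})=I(1/k,\dots,1/k) can also be confirmed numerically on a small example. In the sketch below, Q is an arbitrary column-stochastic matrix standing in for a privatization scheme (its entries, as well as k, L and n, are placeholders, not objects from the paper), and q_{m} is the output probability under the uniform input distribution, which is how q_{m} enters the derivation above (via \mathbb{E}t_{m}(Y^{n})=nq_{m} at the uniform point). The closed form n\sum_{m}(q_{mi}-q_{mk})(q_{mj}-q_{mk})/q_{m} is compared with the Fisher information computed independently by numerical differentiation of the expected log-likelihood:

import numpy as np

rng = np.random.default_rng(3)
k, L, n = 4, 6, 1000                      # placeholder sizes

# Arbitrary stand-in scheme: Q[m, j] = Q(m|j); each column (fixed input j) sums to 1.
Q = rng.random((L, k))
Q /= Q.sum(axis=0, keepdims=True)

def cell_probs(p_partial):
    """Output probabilities q_m(p) when the input distribution is (p_1,...,p_{k-1}, 1 - sum)."""
    p = np.append(p_partial, 1.0 - p_partial.sum())
    return Q @ p

p0 = np.full(k - 1, 1.0 / k)              # the uniform point p_1 = ... = p_{k-1} = 1/k
q0 = cell_probs(p0)

# Closed form derived above: [Phi]_{ij} = n * sum_m (q_mi - q_mk)(q_mj - q_mk) / q_m.
diff = Q[:, : k - 1] - Q[:, [k - 1]]
Phi = n * (diff / q0[:, None]).T @ diff

# Per-sample expected log-likelihood sum_m q_m(p0) log q_m(p); the factor n is applied below.
def expected_loglik(p_partial):
    return np.sum(q0 * np.log(cell_probs(p_partial)))

# Fisher information at the uniform point via numerical second derivatives.
h = 1e-4
I_fisher = np.zeros((k - 1, k - 1))
for i in range(k - 1):
    for j in range(k - 1):
        ei, ej = np.eye(k - 1)[i], np.eye(k - 1)[j]
        I_fisher[i, j] = -n * (expected_loglik(p0 + h * ei + h * ej)
                               - expected_loglik(p0 + h * ei - h * ej)
                               - expected_loglik(p0 - h * ei + h * ej)
                               + expected_loglik(p0 - h * ei - h * ej)) / (4 * h * h)

print(np.max(np.abs(Phi - I_fisher)) / np.max(np.abs(Phi)))   # tiny: finite-difference error only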

References

  • [1] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of Cryptography Conference.   Springer, 2006, pp. 265–284.
  • [2] C. Dwork, “Differential privacy: A survey of results,” in International Conference on Theory and Applications of Models of Computation.   Springer, 2008, pp. 1–19.
  • [3] A. Ghosh, T. Roughgarden, and M. Sundararajan, “Universally utility-maximizing privacy mechanisms,” SIAM Journal on Computing, vol. 41, no. 6, pp. 1673–1693, 2012.
  • [4] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Local privacy and statistical minimax rates,” in 54th Annual IEEE Symposium on the Foundations of Computer Science (FOCS), 2013, pp. 429–438.
  • [5] S. Kamath, A. Orlitsky, V. Pichapati, and A. T. Suresh, “On learning distributions from their samples,” Journal of Machine Learning Research: Workshop and Conference Proceedings, vol. 40, pp. 1–35, 2015.
  • [6] E. L. Lehmann and G. Casella, Theory of point estimation.   Springer Science & Business Media, 2006.
  • [7] J. Duchi, M. J. Wainwright, and M. I. Jordan, “Local privacy and minimax bounds: Sharp rates for probability estimation,” in Advances in Neural Information Processing Systems, 2013, pp. 1529–1537.
  • [8] Ú. Erlingsson, V. Pihur, and A. Korolova, “RAPPOR: Randomized aggregatable privacy-preserving ordinal response,” in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security.   ACM, 2014, pp. 1054–1067.
  • [9] S. L. Warner, “Randomized response: A survey technique for eliminating evasive answer bias,” Journal of the American Statistical Association, vol. 60, no. 309, pp. 63–69, 1965.
  • [10] P. Kairouz, S. Oh, and P. Viswanath, “Extremal mechanisms for local differential privacy,” Journal of Machine Learning Research, vol. 17, pp. 1–51, 2016.
  • [11] P. Kairouz, K. Bonawitz, and D. Ramage, “Discrete distribution estimation under local privacy,” in Proc. 33rd Int. Conf. Machine Learning, 2016, arXiv:1602.07387.
  • [12] S. Wang, L. Huang, P. Wang, Y. Nie, H. Xu, W. Yang, X. Li, and C. Qiao, “Mutual information optimally local private discrete distribution estimation,” 2016, arXiv:1607.08025.
  • [13] M. Ye and A. Barg, “Optimal schemes for discrete distribution estimation under locally differential privacy,” 2017, arXiv:1702.00610.
  • [14] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Minimax optimal procedures for locally private estimation,” 2016, arXiv:1604.02390.
  • [15] L. Le Cam, Asymptotic methods in statistical decision theory.   Springer Science & Business Media, 2012.
  • [16] I. A. Ibragimov and R. Z. Has’minskii, Statistical Estimation: Asymptotic Theory.   Springer, 1981.
  • [17] J. Hájek, “Local asymptotic minimax and admissibility in estimation,” in Proceedings of the sixth Berkeley symposium on mathematical statistics and probability, vol. 1, 1972, pp. 175–194.
  • [18] R. Wong, Asymptotic approximations of integrals.   SIAM, 2001.
  • [19] B. Yu, “Assouad, Fano, and Le Cam,” in Festschrift for Lucien Le Cam.   Springer, 1997, pp. 423–435.
  • [20] V. N. Sudakov and B. S. Tsirel’son, “Extremal properties of half-spaces for spherically invariant measures,” Journal of Soviet Mathematics, vol. 9, no. 1, pp. 9–18, 1978, translated from Zap. Nauch. Sem., 41, 14-24, 1974.
  • [21] C. Borell, “The Brunn-Minkowski inequality in Gauss space,” Inventiones Mathematicae, vol. 30, no. 2, pp. 207–216, 1975.
  • [22] G. Pisier, “Probabilistic methods in the geometry of Banach spaces,” in Probability and analysis.   Springer, 1986, pp. 167–241.