A Minimax Optimal Algorithm for Crowdsourcing


Thomas Bonald
Telecom ParisTech
thomas.bonald@telecom-paristech.fr
Richard Combes
Centrale-Supelec / L2S
richard.combes@supelec.fr
Abstract

We consider the problem of accurately estimating the reliability of workers based on noisy labels they provide, which is a fundamental question in crowdsourcing. We derive a novel lower bound on the minimax estimation error which applies to any estimation procedure. We further propose Triangular Estimation (TE), an algorithm for estimating the reliability of workers. TE has low complexity, may be implemented in a streaming setting when labels are provided by workers in real time, and does not rely on an iterative procedure. We prove that TE is minimax optimal and matches our lower bound. We conclude by assessing the performance of TE and other state-of-the-art algorithms on both synthetic and real-world data.

 

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1 Introduction

The performance of many machine learning techniques, and in particular data classification, strongly depends on the quality of the labeled data used in the initial training phase. A common way to label new datasets is through crowdsourcing: many workers are asked to label data, typically texts or images, in exchange for a small payment. Of course, crowdsourcing is prone to errors due to the difficulty of some classification tasks, the low payment per task and the repetitive nature of the job. Some workers may even introduce errors on purpose. Thus it is essential to assign the same classification task to several workers and to learn the reliability of each worker through her past activity, so as to minimize the overall error rate and to improve the quality of the labeled dataset.

Learning the reliability of each worker is a tough problem because the true label of each task, the so-called ground truth, is unknown; it is precisely the objective of crowdsourcing to guess the true label. Thus the reliability of each worker must be inferred from the comparison of her labels on some set of tasks with those of other workers on the same set of tasks.

In this paper, we consider binary labels and study the problem of estimating the workers' reliability based on the answers they provide to tasks. We make two novel contributions to that problem:

(i) We derive a lower bound on the minimax estimation error which applies to any estimator of the workers' reliability. In doing so, we identify "hard" instances of the problem, and show that the minimax error depends on two factors: the reliability of the three most informative workers and the mean reliability of all workers.

(ii) We propose TE (Triangular Estimation), a novel algorithm for estimating the reliability of each worker based on the correlations between triplets of workers. We analyze the performance of TE and prove that it is minimax optimal in the sense that it matches the lower bound we previously derived. Unlike most prior work, we provide non-asymptotic performance guarantees which hold even for a finite number of workers and tasks. As our analysis reveals, non-asymptotic performance guarantees require the use of finer concentration arguments than asymptotic ones.

TE has low complexity in terms of memory space and computation time, does not require storing the whole data set in memory, and can be easily applied in a setting in which answers to tasks arrive sequentially, i.e., in a streaming setting. Finally, we compare the performance of TE to state-of-the-art algorithms through numerical experiments using both synthetic and real datasets.

2 Related Work

The first problems of data classification using independent workers appeared in the medical context, where each label refers to the state of a patient (e.g., sick or sane) and the workers are clinicians. (Dawid and Skene, 1979) proposed an expectation-maximization (EM) algorithm, admitting that the accuracy of the estimate was unknown. Several versions and extensions of this algorithm have since been proposed and tested in various settings (Hui and Walter, 1980; Smyth et al., 1995; Albert and Dodd, 2004; Raykar et al., 2010; Liu et al., 2012).

A number of Bayesian techniques have also been proposed and applied to this problem by (Raykar et al., 2010; Welinder and Perona, 2010; Karger et al., 2011; Liu et al., 2012; Karger et al., 2014, 2013) and references therein. Of particular interest is the belief-propagation (BP) algorithm of (Karger et al., 2011), which is provably order-optimal in terms of the number of workers required per task for any given target error rate, in the limit of an infinite number of tasks and an infinite population of workers.

Another family of algorithms is based on the spectral analysis of some matrix representing the correlations between tasks or workers. (Ghosh et al., 2011) work on the task-task matrix whose entries correspond to the number of workers having labeled two tasks in the same manner, while (Dalvi et al., 2013) work on the worker-worker matrix whose entries correspond to the number of tasks labeled in the same manner by two workers. Both obtain performance guarantees by the perturbation analysis of the top eigenvector of the corresponding expected matrix. The BP algorithm of Karger, Oh and Shah is in fact closely related to these spectral algorithms: their message-passing scheme is very similar to the power-iteration method applied to the task-worker matrix, as observed in (Karger et al., 2011).

Two notable recent contributions are (Chao and Dengyong, 2015) and (Zhang et al., 2014). The former provides performance guarantees for two versions of EM, and derives lower bounds on the attainable prediction error (the probability of estimating labels incorrectly). The latter provides lower bounds on the estimation error of the workers’ reliability as well as performance guarantees for an improved version of EM relying on spectral methods in the initialization phase. Our lower bound cannot be compared to that of (Chao and Dengyong, 2015) because it applies to the workers’ reliability and not the prediction error; and our lower bound is tighter than that of (Zhang et al., 2014). Our estimator shares some features of the algorithm proposed by (Zhang et al., 2014) to initialize EM, which suggests that the EM phase itself is not essential to attain minimax optimality.

All these algorithms require the storage of all labels in memory and, to the best of our knowledge, the only known streaming algorithm is the recursive EM algorithm of (Wang et al., 2013), for which no performance guarantees are available.

The remainder of the paper is organized as follows. In section 3 we state the problem and introduce our notation. The important question of identifiability is addressed in section 4. In section 5 we present a lower bound on the minimax error rate of any estimator. In section 6 we present TE, discuss its complexity and prove that it is minimax optimal. In section 7 we present numerical experiments on synthetic and real-world data sets, and section 8 concludes the paper. Due to space constraints, we only provide proof outlines for our two main results in this document. Complete proofs are presented in the appendix.

3 Model

Consider a set of workers facing a collection of tasks. Each task consists in determining the answer to a binary question. The answer to a task, the "ground truth", takes one of two values. We assume that the ground truths of the different tasks are i.i.d. and centered, so that there is no bias towards either answer.

Each worker provides an answer to each task with some probability. When a worker provides an answer, this answer is correct with a probability determined by her reliability, independently of the other workers. A worker with positive reliability tends to provide correct answers, a worker with negative reliability tends to provide incorrect answers, and a worker with zero reliability is non-informative. The reliabilities of all workers form the reliability vector. Both the answer probabilities and the reliability vector are unknown.

The output of a worker for a task takes one of three values: the two possible answers, or a third value corresponding to the absence of an answer. We have:

(1)

Since the workers are independent, their outputs for a given task are independent given the ground truth of that task. The goal is to estimate the ground truth as accurately as possible by designing an estimator that minimizes the error probability. The estimator is adaptive and may be a function of the observed outputs but not of the unknown parameters.
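
To fix ideas, the following is a minimal simulation sketch of this model. The names simulate_labels, theta and alpha are ours, and a common answer probability alpha shared by all workers is a simplifying assumption of the sketch.

```python
import numpy as np

def simulate_labels(theta, alpha, num_tasks, rng=None):
    """Sketch of the model: G[t] in {-1, +1} is the ground truth of task t,
    X[t, k] in {-1, 0, +1} is worker k's label (0 = no answer provided)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta, dtype=float)
    G = rng.choice([-1, 1], size=num_tasks)                     # unbiased ground truth
    answered = rng.random((num_tasks, len(theta))) < alpha      # which labels are provided
    correct = rng.random((num_tasks, len(theta))) < (1 + theta) / 2
    X = np.where(correct, G[:, None], -G[:, None]) * answered
    return X, G
```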

It is well known that, given the answer probabilities and the reliabilities, an optimal estimator of the ground truth is the weighted majority vote (Nitzan and Paroush, 1982; Shapley and Grofman, 1984), namely

(2)

where the weight of each worker is the log-likelihood ratio of her providing a correct answer (possibly infinite), and ties are broken at random by a fair coin taking the two possible answers as values. We prove this result for any reliability vector.
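
As an illustration of this rule, here is a small sketch of the weighted majority vote with the classical log-odds weights of Nitzan and Paroush; the clipping of the reliabilities is only a numerical convenience to keep the weights finite and is not part of the rule itself.

```python
import numpy as np

def weighted_majority(X, theta, rng=None):
    """Weighted majority vote: each answer is weighted by the log-odds that the
    worker is correct, and ties are broken uniformly at random.
    X : (num_tasks, n) array of labels in {-1, 0, +1} (0 = no answer)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.clip(np.asarray(theta, dtype=float), -1 + 1e-12, 1 - 1e-12)
    weights = np.log((1 + theta) / (1 - theta))
    scores = X @ weights
    ties = rng.choice([-1, 1], size=len(scores))
    return np.where(scores != 0, np.sign(scores), ties).astype(int)
```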

Proposition 1

Assuming that the reliability vector is known, the estimator (2) is an optimal estimator of the ground truth.

Proof. Finding an optimal estimator of the ground truth amounts to finding an optimal statistical test between the two hypotheses corresponding to the two possible answers, under a symmetry constraint so that the type I and type II error probabilities are equal. For any vector of outputs, consider its probabilities under the two hypotheses. We have

where the first quantity is the number of answers provided. We deduce the corresponding log-likelihood ratio. By the Neyman-Pearson theorem, for any level of significance, there exist a threshold and a randomization such that the uniformly most powerful test at that level compares the log-likelihood ratio to this threshold and breaks ties with a Bernoulli random variable. By symmetry, the threshold must be zero and the tie-breaking unbiased, which yields the estimator (2), as announced.

This result shows that estimating the true answer reduces to estimating the unknown answer probabilities and reliabilities, which is the focus of the paper. Note that the problem of estimating the reliabilities is important in itself, due to the presence of "spammers" (i.e., workers with low reliability); a good estimator can be used by the crowdsourcing platform to incentivize good workers.

4 Identifiability

Estimating the parameters from the observed outputs is not possible unless we have identifiability, namely there cannot exist two distinct sets of parameters under which the distribution of the outputs is the same. The answer probabilities are clearly identifiable, since each is the probability that the corresponding worker provides an answer. The identifiability of the reliabilities is less obvious. Assume for instance that every worker always provides an answer. It follows from (1) that, for any vector of outputs, with the notation of the proof of Proposition 1:

In particular, two reliability vectors with the same pairwise products cannot be distinguished. Similarly, by symmetry, a reliability vector and its opposite cannot be distinguished. Let:

The first condition states that there are at least 3 informative workers, the second that the average reliability is positive.

Proposition 2

Any parameter in this set is identifiable.

Proof. Any such parameter can be expressed as a function of the covariance matrix of the outputs (section 6 below): the absolute values and the signs of the reliabilities follow from (4) and (5), respectively.

5 Lower bound on the minimax error

The estimation of the answer probabilities is straightforward, so we focus here on the best estimation of the reliabilities one can expect, assuming the answer probabilities are known. Specifically, we derive a lower bound on the minimax error of any estimator of the reliability vector, stated in terms of two quantities that measure how informative the workers are.

Estimation becomes hard when either of these quantities is small. We have the following lower bound on the minimax error. As the proof reveals, the two quantities characterize the difficulty of estimating the absolute value and the sign of the reliabilities, respectively.

Theorem 1 (Minimax error)

Consider any estimator of the reliability vector.

For any identifiable parameter and any number of tasks, we have

with two universal constants.

Outline of proof.

The proof is based on an information-theoretic argument. The main element of the proof is Lemma 1, where we bound the Kullback-Leibler (KL) divergence between the distributions of the observations for two well-chosen pairs of parameters. The pair in statement (i) is hard to distinguish when the most informative workers are barely informative, hence it is hard to estimate the absolute value of the reliabilities. The pair in statement (ii) is also hard to distinguish when this quantity or the average reliability is small, which shows that it is difficult to estimate the sign of the reliabilities. Proving Lemma 1 is involved because of the particular form of the distribution of the observations, and requires careful manipulations of the likelihood ratio. We conclude by reduction to a binary hypothesis test using Lemma 2.

Lemma 1

(i) Let , and .

Then: (ii) Let , define , and Then: .

Lemma 2

(Tsybakov, 2008, Theorem 2.2) Consider any estimator .

For any with we have:

Relation with prior work.

The lower bound derived in (Zhang et al., 2014, Theorem 3) shows that the minimax error of any estimator must exceed a certain rate. Our lower bound is stricter, and shows that the minimax error is in fact larger. Another lower bound was derived in (Chao and Dengyong, 2015, Theorems 3.4 and 3.5), but it concerns the prediction error rate, that is, the probability of estimating the ground truth incorrectly, so that it cannot be easily compared to our result.

6 Triangular estimation

We now present our estimator. The absolute value of the reliability of each worker is estimated through the correlation of her answers with those of the most informative pair of other workers. We refer to this algorithm as triangular estimation (TE). The sign of the reliability of each worker is estimated in a second step.

Covariance matrix.

Consider a sample of outputs under some fixed parameters. We shall see that the reliability vector could be recovered exactly if the covariance matrix of the outputs were perfectly known. For any pair of distinct workers, consider the covariance of their outputs given that both workers provide an answer. In view of (1),

(3)

In particular, for any two distinct workers, this covariance is the product of their reliabilities. We deduce that, for any worker and any pair of other workers whose covariance is non-zero,

(4)

Note that such a pair exists for each worker because at least three workers are informative. To recover the sign of each reliability, we use the fact that the average reliability is positive, and we get

(5)

The TE algorithm consists in estimating the covariance matrix so as to recover the reliability vector from the above expressions.
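
For concreteness, the display below restates the recovery identities implied by the model of Section 3; the symbols (C for the conditional covariances, theta for the reliabilities, (a, b) for the auxiliary pair) are our notation.

```latex
% Notation ours: C_{kl} is the covariance of the answers of workers k and l,
% given that both provide an answer, and \theta_k is the reliability of worker k.
C_{kl} \;=\; \mathbb{E}\bigl[(G\,\theta_k)(G\,\theta_l)\bigr] \;=\; \theta_k\,\theta_l
\qquad (k \neq l),
\qquad\text{hence}\qquad
\theta_k^{2} \;=\; \frac{C_{ka}\,C_{kb}}{C_{ab}}
\quad \text{for any } a, b \neq k \text{ with } C_{ab} \neq 0 .
```

The relative signs of the reliabilities can then be read from the signs of the conditional covariances, and the remaining global sign ambiguity is resolved by the assumption that the average reliability is positive.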

TE algorithm.

At any time, define

(6)

For each worker, find the most informative pair among the other workers and estimate the absolute value of her reliability from the empirical counterpart of (4).

Next, estimate the sign of each reliability from the empirical covariances, using the fact that the average reliability is positive, and combine it with the absolute value to obtain the estimate of the reliability vector.
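
The following is a sketch of a batch version of this estimator under the model of Section 3. The choice of the most informative pair as the pair (excluding the current worker) with the largest empirical covariance in absolute value, and the sign step (relative signs from the empirical covariances, global sign fixed so that the average estimated reliability is positive), are concrete choices made for this sketch and may differ in detail from the exact rules of the algorithm.

```python
import numpy as np

def triangular_estimation(X):
    """Sketch of TE on a batch of labels X (num_tasks, n), entries in {-1, 0, +1}.
    Assumes at least three workers; returns an estimate of the reliability vector."""
    X = np.asarray(X, dtype=float)
    n = X.shape[1]
    answered = (X != 0).astype(float)
    counts = answered.T @ answered                 # tasks answered by both k and l
    C = (X.T @ X) / np.maximum(counts, 1.0)        # empirical conditional covariances
    np.fill_diagonal(C, 0.0)

    theta_abs = np.zeros(n)
    for k in range(n):
        others = [l for l in range(n) if l != k]
        # Most informative pair: largest |C[a, b]| among pairs excluding worker k.
        a, b = max(((a, b) for i, a in enumerate(others) for b in others[i + 1:]),
                   key=lambda ab: abs(C[ab[0], ab[1]]))
        if abs(C[a, b]) > 0:
            theta_abs[k] = min(1.0, np.sqrt(abs(C[k, a] * C[k, b] / C[a, b])))

    # Relative signs from C, then a global flip so that the average reliability is positive.
    ref = int(np.argmax(theta_abs))
    signs = np.sign(C[:, ref])
    signs[ref] = 1.0
    theta = signs * theta_abs
    return theta if theta.sum() >= 0 else -theta
```

On data generated with the simulate_labels sketch above, the output can be plugged into the weighted majority vote to predict the ground truth.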

Complexity.

First note that the TE algorithm is a streaming algorithm, since the empirical covariances can be maintained as running sums updated after each task.

Thus TE only needs to store two worker-by-worker matrices, and its per-task computation time decomposes into: updating the empirical covariances, sorting their entries, computing the absolute values of the reliabilities, and computing their signs.
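
A possible streaming implementation of the covariance estimate, maintaining only two worker-by-worker matrices, is sketched below; the class name and organisation are ours.

```python
import numpy as np

class StreamingCovariance:
    """Running version of the empirical conditional covariances used by TE:
    only co-answer counts and sums of products of labels are stored, so each
    new task costs a single rank-one update and no past labels are kept."""

    def __init__(self, n):
        self.counts = np.zeros((n, n))   # N[k, l]: number of tasks answered by both k and l
        self.sums = np.zeros((n, n))     # S[k, l]: sum of X_k * X_l over those tasks

    def update(self, x):
        """x: length-n vector of labels in {-1, 0, +1} for one task."""
        x = np.asarray(x, dtype=float)
        answered = (x != 0).astype(float)
        self.counts += np.outer(answered, answered)
        self.sums += np.outer(x, x)

    def covariance(self):
        C = self.sums / np.maximum(self.counts, 1.0)
        np.fill_diagonal(C, 0.0)
        return C
```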

Minimax optimality.

The following result shows that the proposed estimator is minimax optimal. Namely the sample complexity of our estimator matches the lower bound up to an additive logarithmic term and a multiplicative constant.

Theorem 2

Denote by the estimator defined above. For any identifiable parameter and any number of tasks, we have

with two universal constants.

Outline of proof.

The TE estimator is a function of the empirical pairwise correlations and of the numbers of tasks answered by each pair of workers. The main difficulty is to prove Lemma 3, a concentration inequality for the empirical correlations.

Lemma 3

For all and all ,

Consider a fixed pair of workers. We dissociate the set of tasks answered by each worker from the actual answers and the truth. We introduce i.i.d. Bernoulli random variables indicating which tasks are answered, and independent random variables representing the answers given that an answer is provided. One may readily check that the resulting sample has the same distribution as the original one. Hence, in distribution:

We prove Lemma 3 by conditioning on the set of answered tasks, and denote by the corresponding conditional probability. We prove that, for all deviation levels:

This quantity is an upper bound on the conditional variance of the empirical correlation, which we control by applying Chernoff's inequality to both of the underlying sums. We get:

Removing the conditioning yields the result. We conclude the proof of Theorem 2 by linking the fluctuations of the empirical correlations to those of the estimator, using Lemma 4.

Lemma 4

If (a) and (b) , then .

Relation with prior work.

Our upper bound improves over (Zhang et al., 2014) as follows. Two conditions are required for the upper bound of (Zhang et al., 2014, Theorem 4) to hold: (i) a condition on the minimal reliability, and (ii) a requirement that the number of workers grow with the other problem parameters, so that it has to be large on hard instances. Our result does not require condition (i) to hold. Further, there are problem instances for which condition (ii) is never satisfied, since no number of workers can meet the required constraint (see the discussion in the supplement); for such instances, our result remains informative. Our result shows that one can obtain a minimax optimal algorithm for crowdsourcing which does not involve any EM step.

The analysis of (Chao and Dengyong, 2015) also requires the number of workers to grow with the other problem parameters, together with conditions on the minimal reliability; see the first and last conditions of (Chao and Dengyong, 2015, Theorem 3.3).

In fact, this threshold seems to mark the transition between "easy" and "hard" instances of the crowdsourcing problem. Indeed, when the number of workers is large and the aggregate reliability is large relative to the noise, the majority vote outputs the truth with high probability by the Central Limit Theorem.
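
A quick way to see this transition numerically is to estimate the error of the unweighted majority vote by Monte Carlo; the sketch below reuses the simulate_labels function introduced in Section 3, with its simplifying assumption of a common answer probability.

```python
import numpy as np

def majority_vote_error(theta, alpha, num_tasks=20000, seed=0):
    """Monte-Carlo estimate of the plain (unweighted) majority-vote error."""
    rng = np.random.default_rng(seed)
    X, G = simulate_labels(theta, alpha, num_tasks, rng)
    votes = X.sum(axis=1)
    ties = rng.choice([-1, 1], size=num_tasks)
    prediction = np.where(votes != 0, np.sign(votes), ties)
    return float(np.mean(prediction != G))
```

Increasing the number of informative workers or their reliability drives this error to zero, while instances with few informative workers hidden among many uninformative ones keep it large, matching the easy/hard distinction above.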

7 Numerical Experiments

Synthetic data. We consider three instances, each specified by the numbers of workers and tasks, the answer probability, and the reliability vector: (i) an instance in which half of the workers share a common positive reliability and the others are non-informative, and (ii)-(iii) two instances designed to be hard, described below.

Instance (i) is an "easy" instance where half of the workers are informative. Instance (ii) is a "hard" instance, the difficulty being to estimate the absolute value of the reliabilities accurately by identifying the informative workers. Instance (iii) is another "hard" instance, where estimating the sign of the components of the reliability vector is difficult. In particular, one must distinguish the reliability vector from its opposite, otherwise a large error occurs.

Both "hard" instances (ii) and (iii) are inspired by our derivation of the lower bound and constitute the hardest instances in the identifiable set. For each instance, we average the performance of the algorithms over independent runs and apply a random permutation of the components of the reliability vector before each run. We consider the following algorithms: KOS (the BP algorithm of (Karger et al., 2011)), Maj (majority voting), Oracle (weighted majority voting with optimal weights, the optimal estimator of the ground truth), RoE (first spectral algorithm of (Dalvi et al., 2013)), EoR (second spectral algorithm of (Dalvi et al., 2013)), GKM (spectral algorithm of (Ghosh et al., 2011)), S-EM (the EM algorithm with spectral initialization of (Zhang et al., 2014), run with one or ten iterations of EM) and TE (our algorithm). We do not present the estimation error of KOS, Maj and Oracle, since these algorithms only predict the ground truth but do not estimate the reliabilities directly.

The results are shown in Tables 1 and 2. The spectral algorithms RoE, EoR and GKM tend to be outperformed by the other algorithms. To perform well, GKM requires the aggregate reliability of the workers to be positive and sufficiently large (see (Ghosh et al., 2011)); when this is not the case, GKM tends to make a sign mistake, causing a large error. Also, the analysis of RoE and EoR assumes that the task-worker graph is a random regular graph (so that the worker-worker matrix has a large spectral gap). Here this assumption is violated and the practical performance suffers noticeably, so that this limitation is not only theoretical. KOS performs consistently well and seems immune to sign ambiguity, see instance (iii). Further, while the analysis of KOS also assumes that the task-worker graph is random regular, its practical performance does not seem sensitive to that assumption. The performance of S-EM is good except when sign estimation is hard (instance (iii)). This seems due to the fact that the initialization of S-EM (see the algorithm description) is not good in this case, so that this limitation is not only theoretical but practical as well. In fact (combining our results and the ideas of (Zhang et al., 2014)), this suggests a new algorithm where one uses EM with TE as the initial value of the reliability vector.

Further, the number of EM iterations brings significant gains in some cases and should affect the universal constants in front of the various error bounds (providing theoretical evidence for this seems non-trivial). TE performs consistently well except on instance (i), which we believe is due to the relatively weak correlations between workers in that instance. In particular, when sign estimation is hard, TE clearly outperforms the competing algorithms. This indeed suggests two regimes for sign estimation: a hard regime and an easy regime.
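
A minimal sketch of that suggestion is given below: an initial reliability estimate (e.g. the TE estimate) is refined by a few EM steps for the symmetric one-coin model of Section 3. This pairing is our illustration of the idea, not an algorithm analysed in the paper, and the function and parameter names are ours.

```python
import numpy as np

def em_refine(X, theta_init, num_steps=10, eps=1e-6):
    """Refine a reliability estimate with EM under the symmetric one-coin model.
    X : (num_tasks, n) array of labels in {-1, 0, +1}, 0 meaning no answer."""
    X = np.asarray(X, dtype=float)
    theta = np.clip(np.asarray(theta_init, dtype=float), -1 + eps, 1 - eps)
    answered = (X != 0)
    for _ in range(num_steps):
        # E-step: posterior probability that the ground truth of each task is +1.
        w = np.log((1 + theta) / (1 - theta))
        q = 1.0 / (1.0 + np.exp(-(X @ w)))
        # M-step: probability that each worker agrees with the (soft) ground truth.
        agree = np.where(X == 1, q[:, None], np.where(X == -1, 1 - q[:, None], 0.0))
        p = agree.sum(axis=0) / np.maximum(answered.sum(axis=0), 1)
        theta = np.clip(2 * p - 1, -1 + eps, 1 - eps)
    # Final prediction by weighted majority vote with the refined reliabilities.
    scores = X @ np.log((1 + theta) / (1 - theta))
    return theta, np.where(scores >= 0, 1, -1)
```

For instance, em_refine(X, triangular_estimation(X)) combines the two sketches given earlier.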

Real-world data. We next consider publicly available data sets (see (Whitehill et al., 2009; Zhou et al., 2015) and summary information in Table 3), each consisting of labels provided by workers and the ground truth. The density is the ratio of the number of labels to the number of possible (task, worker) pairs, which corresponds to the answer probability in our model. The worker degree is the average number of tasks labeled by a worker.

First, for data sets with more than two possible label values, we split the label values into two groups and associate them with the two binary answers; the partition of the labels is given in Table 3. Second, we remove any worker who provides too few labels. Our preliminary numerical experiments (not shown here for concision) show that without this step, none of the studied algorithms consistently matches even the majority vote. Workers with low degree create noise and (to the best of our knowledge) every theoretical analysis of crowdsourcing algorithms assumes that the worker degree is sufficiently large. The performance of the various algorithms is reported in Table 4. No information about the workers' reliability is available, so we only report the prediction error. Further, one cannot compare the algorithms to the Oracle, so the main goal is to outperform the majority vote.
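
A sketch of this preprocessing, assuming the raw data comes as (task, worker, label) triples, is given below; the minimum-degree threshold is an arbitrary placeholder, not the value used in the experiments.

```python
import numpy as np

def preprocess(triples, positive_labels, min_degree=10):
    """Map raw labels to {-1, +1}, drop low-degree workers, build the label matrix.
    positive_labels: set of raw label values mapped to +1 (see Table 3); others map to -1.
    min_degree=10 is a placeholder threshold."""
    triples = [(t, w, 1 if lab in positive_labels else -1) for (t, w, lab) in triples]
    degree = {}
    for _, w, _ in triples:
        degree[w] = degree.get(w, 0) + 1
    triples = [(t, w, lab) for (t, w, lab) in triples if degree[w] >= min_degree]
    tasks = {t: i for i, t in enumerate(sorted({t for t, _, _ in triples}))}
    workers = {w: j for j, w in enumerate(sorted({w for _, w, _ in triples}))}
    X = np.zeros((len(tasks), len(workers)), dtype=int)   # 0 = no answer
    for t, w, lab in triples:
        X[tasks[t], workers[w]] = lab
    return X, tasks, workers
```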

Apart from "Bird" and "Web", none of the algorithms seems able to significantly outperform the majority vote, and they are sometimes noticeably worse. For "Web", which has both the largest number of labels and a high worker degree, there is a significant gain over the majority vote, and TE, despite its low complexity, slightly outperforms S-EM and is competitive with KOS and GKM, which both perform best on this dataset.

Instance RoE EoR GKM S-EM1 S-EM10 TE
(i) 0.3 0.200 0.131 0.146 0.100 0.041 0.134
(i) 0.9 0.274 0.265 0.271 0.022 0.022 0.038
(ii) 0.55 0.551 0.459 0.479 0.045 0.044 0.050
(ii) 0.95 0.528 0.522 0.541 0.034 0.033 0.039
(iii) 0.253 0.222 0.256 0.533 0.389 0.061
(iii) 0.105 0.075 0.085 0.437 0.030 0.045
Table 1: Synthetic data: estimation error.
Instance Oracle Maj KOS RoE EoR GKM S-EM1 S-EM10 TE
(i) 0.3 0.227 0.298 0.228 0.402 0.398 0.374 0.251 0.228 0.250
(i) 0.9 0.004 0.046 0.004 0.217 0.218 0.202 0.004 0.004 0.004
(ii) 0.55 0.284 0.441 0.292 0.496 0.497 0.495 0.284 0.285 0.284
(ii) 0.95 0.219 0.419 0.220 0.495 0.496 0.483 0.219 0.219 0.219
(iii) 0.181 0.472 0.185 0.443 0.455 0.386 0.388 0.404 0.192
(iii) 0.126 0.315 0.133 0.266 0.284 0.207 0.258 0.127 0.128
Table 2: Synthetic data: prediction error.
Data Set # Tasks # Workers # Labels Density Worker Degree Label Domain
Bird 108 39 4,212 1 108 {0} vs {1}
Dog 807 109 8,070 0.09 74 {0,2} vs {1,3}
Duchenne 159 64 1,221 0.12 19 {0} vs {1}
RTE 800 164 8,000 0.06 49 {0} vs {1}
Temp 462 76 4,620 0.13 61 {1} vs {2}
Web 2,653 177 15,539 0.03 88 {1,2,3} vs {4,5}
Table 3: Summary of the real-world datasets.
Data Set Maj KOS RoE EoR GKM S-EM1 S-EM10 TE
Bird 0.24 0.28 0.29 0.29 0.28 0.20 0.28 0.18
Dog 0.18 0.19 0.18 0.18 0.20 0.24 0.17 0.20
Duchenne 0.28 0.30 0.29 0.28 0.29 0.28 0.30 0.26
RTE 0.10 0.50 0.50 0.89 0.49 0.32 0.16 0.38
Temp 0.06 0.43 0.24 0.10 0.43 0.06 0.06 0.08
Web 0.14 0.02 0.13 0.14 0.02 0.04 0.06 0.03
Table 4: Real-world data: prediction error.

8 Conclusion

We have derived a minimax error lower bound for the crowdsourcing problem and have proposed TE, a low-complexity algorithm which matches this lower bound. Our results open several questions of interest. First, while recent work has shown that one can obtain strong theoretical guarantees by combining one step of EM with a well-chosen initialization, we have shown that, at least in the case of binary labels, one can forgo the EM phase altogether and still obtain both minimax optimality and good numerical performance. It would be interesting to know if this is still possible when there are more than two possible labels, and also if one can do so using a streaming algorithm.

References

  • Albert and Dodd [2004] Paul S Albert and Lori E Dodd. A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics, 60(2):427–435, 2004.
  • Chao and Dengyong [2015] Gao Chao and Zhou Dengyong. Minimax optimal convergence rates for estimating ground truth from crowdsourced labels. Tech Report http://arxiv.org/abs/1310.5764, 2015.
  • Dalvi et al. [2013] Nilesh Dalvi, Anirban Dasgupta, Ravi Kumar, and Vibhor Rastogi. Aggregating crowdsourced binary ratings. In Proc. of WWW, 2013.
  • Dawid and Skene [1979] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):20–28, 1979.
  • Ghosh et al. [2011] Arpita Ghosh, Satyen Kale, and R. Preston McAfee. Who moderates the moderators?: crowdsourcing abuse detection in user-generated content. In Proc. of ACM EC, 2011.
  • Hui and Walter [1980] Sui L Hui and Steven D Walter. Estimating the error rates of diagnostic tests. Biometrics, pages 167–171, 1980.
  • Karger et al. [2011] David R. Karger, Sewoong Oh, and Devavrat Shah. Iterative learning for reliable crowdsourcing systems. In Proc. of NIPS, 2011.
  • Karger et al. [2013] David R Karger, Sewoong Oh, and Devavrat Shah. Efficient crowdsourcing for multi-class labeling. ACM SIGMETRICS Performance Evaluation Review, 41(1):81–92, 2013.
  • Karger et al. [2014] David R Karger, Sewoong Oh, and Devavrat Shah. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research, 62(1):1–24, 2014.
  • Liu et al. [2012] Qiang Liu, Jian Peng, and Alex T Ihler. Variational inference for crowdsourcing. In Proc. of NIPS, 2012.
  • Nitzan and Paroush [1982] Shmuel Nitzan and Jacob Paroush. Optimal decision rules in uncertain dichotomous choice situations. International Economic Review, pages 289–297, 1982.
  • Raykar et al. [2010] Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, 2010.
  • Shapley and Grofman [1984] Lloyd Shapley and Bernard Grofman. Optimizing group judgmental accuracy in the presence of interdependencies. Public Choice, 43(3):329–343, 1984.
  • Smyth et al. [1995] Padhraic Smyth, Usama Fayyad, Michael Burl, Pietro Perona, and Pierre Baldi. Inferring ground truth from subjective labelling of venus images. In Proc. of NIPS, 1995.
  • Tsybakov [2008] Alexandre B. Tsybakov. Introduction to non-parametric estimation. Springer, 2008.
  • Wang et al. [2013] Dong Wang, Tarek Abdelzaher, Lance Kaplan, and Charu C Aggarwal. Recursive fact-finding: A streaming approach to truth estimation in crowdsourcing applications. In Proc. of IEEE ICDCS, 2013.
  • Welinder and Perona [2010] Peter Welinder and Pietro Perona. Online crowdsourcing: rating annotators and obtaining cost-effective labels. In Proc. of IEEE CVPR (Workshops), 2010.
  • Whitehill et al. [2009] Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier R Movellan, and Paul L Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Proc. of NIPS, 2009.
  • Zhang et al. [2014] Yuchen Zhang, Xi Chen, Dengyong Zhou, and Michael I Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Proc. of NIPS, 2014.
  • Zhou et al. [2015] Dengyong Zhou, Qiang Liu, John C Platt, Christopher Meek, and Nihar B Shah. Regularized minimax conditional entropy for crowdsourcing. Tech Report, http://arxiv.org/pdf/1503.07240, 2015.

Appendices

We here provide the proofs of the two main results of the paper, as well as a more in-depth discussion of the relation between our upper bound and that of [Zhang et al., 2014].

Appendix A Proof of Theorem 1

We use the following inequality between the Kullback-Leibler and chi-square divergences.

Lemma 5

The Kullback-Leibler divergence of any two discrete distributions satisfies

with .

Proof. Using the inequality , we get

Writing

we deduce


Proof of Theorem 1.

Let be any sample under parameters . We have for any distinct indices ,

Now let and

Observe that for any . Denote by the distributions of under parameters , respectively. We use Lemma 5 to get an upper bound on the Kullback-Leibler divergence between and . Observe that we can restrict the analysis to the case . We calculate for all possible values of :

  • If or then .

  • If or ,

  • If or ,

  • If or ,

  • If or ,

  • Otherwise,

Observing that

we get in cases (b)-(c),

and in cases (d)-(f),

Summing, we obtain

Now let with . Denoting by the distributions of under the respective parameters , we obtain . Since , it follows from [Tsybakov, 2008, Theorem 2.2] that:

Since , we get

Now assume that so that . Let and . Consider the two parameters

Observe that . Denote by and the distributions of under parameters , respectively. Again, we use Lemma 5 to get an upper bound on the Kullback-Leibler divergence between and .

Define , , , , ,