Confidence regions and minimax rates in outlier-robust estimation on the probability simplex

Abstract

We consider the problem of estimating the mean of a distribution supported by the $k$-dimensional probability simplex, in the setting where an $\varepsilon$ fraction of observations are subject to adversarial corruption. A simple particular example is the problem of estimating the distribution of a discrete random variable. Assuming that the discrete variable takes $k$ values, the unknown parameter $\theta$ is a $k$-dimensional vector belonging to the probability simplex. We first describe various settings of contamination and discuss the relations between these settings. We then establish minimax rates when the quality of estimation is measured by the total-variation distance, the Hellinger distance, or the $\mathbb{L}^2$-distance between two probability measures. We also provide confidence regions for the unknown mean that shrink at the minimax rate. Our analysis reveals that the minimax rates associated to these three distances are all different, but they are all attained by the sample average. Furthermore, we show that the latter is adaptive to the possible sparsity of the unknown vector. Some numerical experiments illustrating our theoretical findings are reported.


MSC: Primary 62F35; secondary 62H12.


1 Introduction

Assume that $X_1,\dots,X_n$ are independent random variables taking their values in the $k$-dimensional probability simplex $\Delta^k=\{v\in\mathbb{R}^k:\min_j v_j\ge0,\ \sum_{j=1}^k v_j=1\}$. Our goal is to estimate the unknown vector $\theta^*\in\Delta^k$, the mean of the distribution of the inliers, in the case where the observations are contaminated by outliers. In this introduction, to convey the main messages, we limit ourselves to the Huber contamination model, although our results apply to the more general adversarial contamination. Huber's contamination model assumes that there are two probability measures $P$, $Q$ on $\Delta^k$ and a real $\varepsilon\in[0,1]$ such that each $X_i$ is drawn from the mixture

$(1-\varepsilon)P+\varepsilon Q. \qquad (1)$

This amounts to assuming that a $(1-\varepsilon)$-fraction of observations, called inliers, are drawn from a reference distribution $P$, whereas an $\varepsilon$-fraction of observations are outliers and are drawn from another distribution $Q$. In general, all three parameters $\varepsilon$, $P$ and $Q$ are unknown. The parameter of interest is some functional (such as the mean, the standard deviation, etc.) of the reference distribution $P$, whereas $\varepsilon$ and $Q$ play the role of nuisance parameters.
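To make this mechanism concrete, here is a minimal simulation sketch (our own illustration, not part of the model's definition), where $P$ and $Q$ are taken to be categorical distributions represented by one-hot vectors on the simplex:

```python
import numpy as np

def sample_huber(n, theta, q, eps, seed=None):
    """Draw n one-hot observations from the Huber mixture (1 - eps) P + eps Q,
    where P and Q are categorical distributions on the vertices of the simplex
    with parameter vectors theta (inliers) and q (outliers)."""
    rng = np.random.default_rng(seed)
    is_outlier = rng.random(n) < eps              # Bernoulli(eps) flags: the random set O
    X = np.empty((n, len(theta)))
    for i in range(n):
        X[i] = rng.multinomial(1, q if is_outlier[i] else theta)
    return X, is_outlier
```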

When the unknown parameter lives on the probability simplex, there are many appealing ways of defining the risk. We focus on the following three metrics: the total-variation, Hellinger and $\mathbb{L}^2$ distances, defined for every pair of vectors $\theta,\theta'$ with non-negative entries by

$d_{\rm TV}(\theta,\theta')=\frac12\sum_{j=1}^k|\theta_j-\theta'_j|,\qquad d_{\rm H}(\theta,\theta')=\Big(\frac12\sum_{j=1}^k\big(\sqrt{\theta_j}-\sqrt{\theta'_j}\big)^2\Big)^{1/2},\qquad d_{\mathbb{L}^2}(\theta,\theta')=\Big(\sum_{j=1}^k(\theta_j-\theta'_j)^2\Big)^{1/2}.\qquad(2)$

The Hellinger distance above is well defined when the estimator is non-negative, which will be the case throughout this work. We will further assume that the dimension $k$ may be large, but the vector $\theta^*$ is $s$-sparse, for some $s\in\{1,\dots,k\}$, i.e., $\#\{j:\theta^*_j\neq0\}\le s$. Our main interest is in constructing confidence regions and evaluating the minimax risk

$\mathfrak{R}_\square(n,s,\varepsilon)=\inf_{\widehat\theta}\ \sup\ \mathbf{E}\big[d_\square(\widehat\theta,\theta^*)\big],\qquad(3)$

where the $\inf$ is over all estimators $\widehat\theta$ built upon the observations $X_1,\dots,X_n$ and the $\sup$ is over all distributions $P$, $Q$ on the probability simplex such that the mean $\theta^*$ of $P$ is $s$-sparse. The subscript $\square$ of $\mathfrak{R}$ above refers to the distance used in the risk, so that $\square$ is either TV, H, or $\mathbb{L}^2$.
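In code, the three distances defined in (2) can be transcribed directly (a small helper sketch of ours; the factor $1/2$ in the Hellinger distance matches the normalization used above):

```python
import numpy as np

def tv_dist(p, q):
    # Total-variation distance: half of the L1 distance.
    return 0.5 * np.abs(p - q).sum()

def hellinger_dist(p, q):
    # Hellinger distance; requires non-negative entries,
    # but not that they sum to one.
    return np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum())

def l2_dist(p, q):
    # Euclidean (L2) distance.
    return np.linalg.norm(p - q)
```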

The problem described above arises in many practical situations. One example is an election poll: each participant expresses an intention to vote for one of $k$ candidates. Thus, $\theta^*_j$ is the true proportion of electors of candidate $j$. The results of the poll contain outliers, since some participants prefer to hide their true opinion. Another example, still related to elections, is the problem of counting votes across all constituencies. Each constituency communicates a vector of proportions to a central office, which is in charge of computing the overall proportions. However, in some constituencies (hopefully a small fraction only) the results are rigged. Therefore, the set of observed vectors contains some outliers.

We intend to provide non-asymptotic upper and lower bounds on the minimax risk that match up to numerical constants. In addition, we will provide confidence regions of the form $\{\theta: d_\square(\theta,\bar X_n)\le r_\delta\}$ containing the true parameter with probability at least $1-\delta$ and such that the radius $r_\delta$ goes to zero at the same rate as the corresponding minimax risk.

When there is no outlier, i.e., $\varepsilon=0$, it is well known that the sample mean

$\bar X_n=\frac1n\sum_{i=1}^n X_i\qquad(4)$

is minimax-rate-optimal, and the rates corresponding to the various distances are

$\mathfrak{R}_{\rm TV}(n,s,0)\asymp\Big(\frac sn\Big)^{1/2},\qquad \mathfrak{R}_{\rm H}(n,s,0)\asymp\Big(\frac sn\Big)^{1/2},\qquad \mathfrak{R}_{\mathbb{L}^2}(n,s,0)\asymp\Big(\frac1n\Big)^{1/2}.\qquad(5)$
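The following small Monte Carlo experiment (ours, with arbitrarily chosen values of $k$, $s$ and $n$) illustrates the $\sqrt{s/n}$ decay of the TV error of the sample mean in the outlier-free case:

```python
import numpy as np

rng = np.random.default_rng(0)
k, s, n_rep = 1000, 10, 200
theta = np.zeros(k)
theta[:s] = 1.0 / s                           # an s-sparse parameter vector
for n in (100, 400, 1600):
    X = rng.multinomial(1, theta, size=(n_rep, n))   # n_rep samples of size n
    tv_err = 0.5 * np.abs(X.mean(axis=1) - theta).sum(axis=1)
    print(n, tv_err.mean(), np.sqrt(s / n))   # the mean error tracks sqrt(s/n)
```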

This raises several questions in the setting where the data contain outliers. In particular, the following three questions will be answered in this work:

Q1.

How do the risks depend on $\varepsilon$, $s$ and $n$? What is the largest proportion of outliers $\varepsilon$ for which the minimax rate is the same as in the outlier-free case $\varepsilon=0$?

Q2.

Does the sample mean remain optimal in the contaminated setting?

Q3.

What happens if the unknown parameter $\theta^*$ is $s$-sparse?

The most important step for answering these questions is to show that

$\mathfrak{R}_{\rm TV}(n,s,\varepsilon)\asymp\Big(\frac sn\Big)^{1/2}+\varepsilon,\qquad(6)$
$\mathfrak{R}_{\rm H}(n,s,\varepsilon)\asymp\Big(\frac sn\Big)^{1/2}+\varepsilon^{1/2},\qquad(7)$
$\mathfrak{R}_{\mathbb{L}^2}(n,s,\varepsilon)\asymp\Big(\frac1n\Big)^{1/2}+\varepsilon.\qquad(8)$

It is surprising to see that all three rates are different, which leads to important discrepancies in the answers to the second part of question Q1 for the different distances. Indeed, it turns out that the minimax rate does not deteriorate if the proportion of outliers is smaller than $(s/n)^{1/2}$ for the TV-distance, $s/n$ for the Hellinger distance, and $n^{-1/2}$ for the $\mathbb{L}^2$ distance (as the short computation below makes explicit). Furthermore, we prove that the sample mean is minimax rate optimal. Thus, even when the proportion of outliers $\varepsilon$ and the sparsity $s$ are known, it is not possible to improve upon the sample mean. In addition, we show that all these claims hold true for the adversarial contamination, and we provide corresponding confidence regions.
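These thresholds follow from comparing the two terms in each of the rates (6)-(8):

```latex
% The contamination term is dominated by the sampling term precisely when:
\varepsilon \lesssim \Big(\frac{s}{n}\Big)^{1/2} \ \ (\mathrm{TV}), \qquad
\varepsilon^{1/2} \lesssim \Big(\frac{s}{n}\Big)^{1/2}
  \iff \varepsilon \lesssim \frac{s}{n} \ \ (\mathrm{Hellinger}), \qquad
\varepsilon \lesssim \Big(\frac{1}{n}\Big)^{1/2} \ \ (\mathbb{L}^2).
```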

The rest of the paper is organized as follows. Section 2 introduces different possible ways of modeling data sets contaminated by outliers. Pointers to relevant prior work are given in Section 3. Main theoretical results and their numerical illustration are reported in Section 4 and Section 5, respectively. Section 6 contains a brief summary of the obtained results and their consequences, whereas the proofs are postponed to the appendix.

2 Various models of contamination

Different mathematical frameworks have been used in the literature to model the outliers. We present here five of them, from the most restrictive to the most general, and describe their relationships. We present these frameworks in the general setting where the goal is to estimate the parameter $\theta^*=\theta^*(P)$ of a reference distribution $P$ when an $\varepsilon$ proportion of the observations are outliers.

2.1 Huber’s contamination

The most popular framework for studying robust estimation methods is perhaps that of Huber's contamination. In this framework, there is a distribution $Q$, defined on the same space as the reference distribution $P$, such that all the observations are independent and drawn from the mixture distribution $(1-\varepsilon)P+\varepsilon Q$.

This corresponds to the following mechanism: one decides, with respective probabilities $1-\varepsilon$ and $\varepsilon$, whether a given observation is an inlier or an outlier. If the decision is made in favor of being an inlier, the observation is drawn from $P$; otherwise, it is drawn from $Q$. More formally, if we denote by $O$ the random set of outliers, then, conditionally on $O$,

$X_i\sim\begin{cases}P,& i\notin O,\\ Q,& i\in O,\end{cases}\qquad(9)$

for every $i\in\{1,\dots,n\}$, the observations being conditionally independent given $O$. Furthermore, for every subset $S$ of $\{1,\dots,n\}$, we have $\mathbf{P}(O=S)=\varepsilon^{|S|}(1-\varepsilon)^{n-|S|}$. We denote by $\mathcal{M}_n^{\rm HC}(\Theta,\varepsilon)$ the set of joint probability distributions of random variables $X_1,\dots,X_n$ satisfying the foregoing condition, with the parameter of the reference distribution lying in $\Theta$.

2.2 Huber’s deterministic contamination

The set of outliers as well as the number of outliers in Huber's model of contamination are random. This makes it difficult to compare this model to the others that will be described later in this section. To cope with this, we define here another model, termed Huber's deterministic contamination. As its name indicates, this new model has the advantage of containing a deterministic number of outliers, while at the same time being equivalent to Huber's contamination in a sense that will be made precise below.

We say that the distribution of $(X_1,\dots,X_n)$ belongs to Huber's deterministic contamination model, denoted by $\mathcal{M}_n^{\rm HDC}(\Theta,\varepsilon)$, if there are a set $O\subset\{1,\dots,n\}$ of cardinality at most $n\varepsilon$ and a distribution $Q$ such that (9) is true. The apparent similarity of the models $\mathcal{M}_n^{\rm HC}$ and $\mathcal{M}_n^{\rm HDC}$ can also be formalized mathematically in terms of the orders of magnitude of the minimax risks. To ease notation, we let $\mathcal{R}_d(\widehat\theta_n;\mathcal{M}_n^{\square}(\Theta,\varepsilon))$ be the worst-case risk of an estimator $\widehat\theta_n$, where $\square$ is either HC or HDC. More precisely, we set

$\mathcal{R}_d\big(\widehat\theta_n;\mathcal{M}_n^{\square}(\Theta,\varepsilon)\big)=\sup_{\mathbf P\in\mathcal{M}_n^{\square}(\Theta,\varepsilon)}\mathbf{E}_{\mathbf P}\big[d(\widehat\theta_n,\theta^*)\big].\qquad(10)$

This definition assumes that the parameter space $\Theta$ is endowed with a pseudo-metric $d$. When $\Theta=\{\theta\}$ is a singleton, we write $\mathcal{M}_n^{\square}(\theta,\varepsilon)$ instead of $\mathcal{M}_n^{\square}(\{\theta\},\varepsilon)$.


Proposition 1. Let $\widehat\theta_n$ be an arbitrary estimator of $\theta^*$ and let $D_\Theta=\sup_{\theta,\theta'\in\Theta}d(\theta,\theta')$. For any $\varepsilon\in(0,1)$,

$\mathcal{R}_d\big(\widehat\theta_n;\mathcal{M}_n^{\rm HC}(\Theta,\varepsilon)\big)\ \le\ \mathcal{R}_d\big(\widehat\theta_n;\mathcal{M}_n^{\rm HDC}(\Theta,2\varepsilon)\big)+D_\Theta\,e^{-n\varepsilon/3},\qquad(11)$
$\mathcal{R}_d\big(\widehat\theta_n;\mathcal{M}_n^{\rm HC}(\Theta,\varepsilon)\big)\ \le\ \mathcal{R}_d\big(\widehat\theta_n;\mathcal{M}_n^{\rm HDC}(\Theta,2\varepsilon)\big)+\frac{D_\Theta}{n\varepsilon}.\qquad(12)$

The proof is given in Appendix A.

Recall that $D_\Theta=\sup_{\theta,\theta'\in\Theta}d(\theta,\theta')$ denotes the diameter of $\Theta$. Proposition 1 implies that

$\inf_{\widehat\theta_n}\mathcal{R}_d\big(\widehat\theta_n;\mathcal{M}_n^{\rm HC}(\Theta,\varepsilon)\big)\ \le\ \inf_{\widehat\theta_n}\mathcal{R}_d\big(\widehat\theta_n;\mathcal{M}_n^{\rm HDC}(\Theta,2\varepsilon)\big)+D_\Theta\,e^{-n\varepsilon/3}.\qquad(13)$

When $\Theta$ is bounded, the last term is typically of smaller order than the minimax risk over the HDC model. Therefore, the minimax rate of estimation in Huber's model is not slower than the minimax rate of estimation in Huber's deterministic contamination model. This entails that a lower bound on the minimax risk established in the HC model furnishes a lower bound in the HDC model.

2.3 Oblivious contamination

A third model of contamination that can be of interest is oblivious contamination. In this model, it is assumed that the set $O$, of cardinality at most $n\varepsilon$, and the joint distribution $Q_O$ of the outliers are determined in advance, possibly based on the knowledge of the reference distribution $P$. Then the outliers $\{X_i: i\in O\}$ are drawn randomly from $Q_O$, independently of the inliers $\{X_i: i\notin O\}$. The set of all joint distributions of $(X_1,\dots,X_n)$ generated by such a mechanism will be denoted by $\mathcal{M}_n^{\rm OC}(\Theta,\varepsilon)$. The model of oblivious contamination is strictly more general than that of Huber's deterministic contamination, since it does not assume the outliers to be iid. Therefore, the minimax risk over $\mathcal{M}_n^{\rm OC}$ is larger than the minimax risk over $\mathcal{M}_n^{\rm HDC}$:

$\inf_{\widehat\theta_n}\mathcal{R}_d\big(\widehat\theta_n;\mathcal{M}_n^{\rm OC}(\Theta,\varepsilon)\big)\ \ge\ \inf_{\widehat\theta_n}\mathcal{R}_d\big(\widehat\theta_n;\mathcal{M}_n^{\rm HDC}(\Theta,\varepsilon)\big).\qquad(14)$

The last inequality holds true for any parameter space $\Theta$, any contamination level $\varepsilon$ and any sample size $n$.

2.4 Parameter contamination

In the three models considered above, the contamination acts on the observations. One can also consider the case where the parameters of the distributions of some observations are contaminated. More precisely, for some set $O$ selected in advance (but unobserved), the outliers $\{X_i: i\in O\}$ are independent, and independent of the inliers $\{X_i: i\notin O\}$. Furthermore, each outlier $X_i$, $i\in O$, is drawn from a distribution $P_{\theta_i}$ belonging to the same family as the reference distribution, but corresponding to a contaminated parameter $\theta_i\neq\theta^*$. Thus, the joint distribution of the observations can be written as $\big(\bigotimes_{i\notin O}P_{\theta^*}\big)\otimes\big(\bigotimes_{i\in O}P_{\theta_i}\big)$. The set of all such distributions will be denoted by $\mathcal{M}_n^{\rm PC}(\Theta,\varepsilon)$, where PC refers to “parameter contamination”.

2.5 Adversarial contamination

The last model of contamination described in this work, adversarial contamination, is the most general one. It corresponds to the following two-stage data generation mechanism. In the first stage, iid random variables $Y_1,\dots,Y_n$ are generated from a reference distribution $P_{\theta^*}$. In the second stage, an adversary having access to $(Y_1,\dots,Y_n)$ chooses a (possibly random) set $O$ of (deterministic) cardinality at most $n\varepsilon$ and arbitrarily modifies the data points $\{Y_i: i\in O\}$. The resulting sample, $(X_1,\dots,X_n)$, is revealed to the Statistician. In this model, we have $X_i=Y_i$ for every $i\notin O$. However, since $O$ is random and potentially dependent on $(Y_1,\dots,Y_n)$, it is not true that, conditionally on $O=S$, the variables $\{X_i: i\notin S\}$ are iid drawn from $P_{\theta^*}$ (for a deterministic set $S$ of cardinality at most $n\varepsilon$).
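A schematic implementation of this two-stage mechanism (our illustration; the `attack` function stands in for an arbitrary adversary) may look as follows:

```python
import numpy as np

def adversarial_corrupt(Y, eps, attack, seed=None):
    """Two-stage adversarial contamination: the adversary sees the clean
    sample Y (shape (n, k)) and replaces floor(eps * n) of its rows.
    Unlike in Huber-type models, the replacements may depend on the
    whole clean sample."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    m = int(np.floor(eps * n))
    O = rng.choice(n, size=m, replace=False)   # the corrupted index set
    X = Y.copy()
    X[O] = attack(Y, m)                        # adversary-chosen replacements
    return X

def push_to_vertex(Y, m):
    # Example adversary: move the sample mean toward the vertex e_1.
    e1 = np.zeros(Y.shape[1]); e1[0] = 1.0
    return np.tile(e1, (m, 1))
```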

We denote by $\mathcal{M}_n^{\rm AC}(\Theta,\varepsilon)$ the set of all joint distributions of sequences $(X_1,\dots,X_n)$ generated by the aforementioned two-stage mechanism. This set is larger than the four sets of contaminated distributions introduced above. Therefore, the following inequalities hold:

$\mathcal{R}_d\big(\widehat\theta_n;\mathcal{M}_n^{\rm AC}(\Theta,\varepsilon)\big)\ \ge\ \mathcal{R}_d\big(\widehat\theta_n;\mathcal{M}_n^{\rm OC}(\Theta,\varepsilon)\big)\ \ge\ \mathcal{R}_d\big(\widehat\theta_n;\mathcal{M}_n^{\rm HDC}(\Theta,\varepsilon)\big)\vee\mathcal{R}_d\big(\widehat\theta_n;\mathcal{M}_n^{\rm PC}(\Theta,\varepsilon)\big),\qquad(15)$

for any estimator $\widehat\theta_n$, any parameter space $\Theta$, any $\varepsilon\in(0,1)$ and any pseudo-metric $d$.

Figure 1: Visual representation of the hierarchy between the various contamination models. Note that the inclusion of HC in HDC is somewhat heuristic, being based on the relation between the worst-case risks reported in Proposition 1.

2.6 Minimax risk “in expectation” versus “in deviation”

Most prior work on robust estimation focused on establishing upper bounds on the minimax risk in deviation, as opposed to the minimax risk in expectation defined by (3). One of the reasons for dealing with the deviation is that it makes the minimax risk meaningful for models having a random number of outliers and an unbounded parameter space $\Theta$. The formal justification of this claim is provided by the following result.

Proposition 2. Let $\Theta$ be a parameter space such that $D_\Theta=\sup_{\theta,\theta'\in\Theta}d(\theta,\theta')=\infty$. Then, for every estimator $\widehat\theta_n$, every $n\ge1$ and every $\varepsilon\in(0,1)$, we have $\mathcal{R}_d\big(\widehat\theta_n;\mathcal{M}_n^{\rm HC}(\Theta,\varepsilon)\big)=\infty$.

The proof is given in Appendix A.

This result shows, in particular, that the last term in (13), involving the diameter of $\Theta$, is unavoidable. Such an explosion of the minimax risk occurs because Huber's model allows the number of outliers to be as large as $n$ with a strictly positive probability. One approach to overcoming this shortcoming is to use the minimax risk in deviation. Another approach is to limit theoretical developments to the models HDC, PC, OC or AC, in which the number of outliers is deterministic.

3 Prior work

Robust estimation has been an area of active research in statistics for at least five decades (Huber, 1964; Tukey, 1975; Donoho and Huber, 1983; Donoho and Gasko, 1992; Rousseeuw and Hubert, 1999). Until very recently, theoretical guarantees were almost exclusively formulated in terms of the notions of breakdown point, sensitivity curve, influence function, etc. These notions are well suited for accounting for gross outliers: observations that deviate significantly from the data points representative of an important fraction of the data set.

More recently, various authors investigated (Nguyen and Tran, 2013; Dalalyan and Chen, 2012; Chen et al., 2013) the behavior of the risk of robust estimators as a function of the rate of contamination $\varepsilon$. A general methodology for parametric models subject to Huber's contamination was developed in Chen et al. (2018, 2016). This methodology allows for determining the rate of convergence of the minimax risk as a function of the sample size $n$, the dimension $k$ and the rate of contamination $\varepsilon$. An interesting phenomenon was discovered: in the problem of robust estimation of the Gaussian mean, classic robust estimators such as the coordinatewise median or the geometric median do not attain the optimal rate $(k/n)^{1/2}+\varepsilon$. This rate is provably attained by Tukey's median, the computation of which is costly in a high-dimensional setting.

In the model analyzed in this paper, we find the same minimax rate, $(s/n)^{1/2}+\varepsilon$, only when the total-variation distance is considered. A striking difference is that this rate is attained by the sample mean, which is efficiently computable in any dimension. This property is to some extent similar to the problem of robust density estimation (Liu and Gao, 2017), in which standard kernel estimators are minimax optimal in the contaminated setting.

Computational intractability of Tukey's median motivated a large number of studies aiming at the design of computationally tractable methods with nearly optimal statistical guarantees. Many of these works went beyond Huber's contamination by considering parameter contamination models (Bhatia et al., 2017; Collier and Dalalyan, 2017; Carpentier et al., 2018), oblivious contamination (Feng et al., 2014; Lai et al., 2016) or adversarial contamination (Diakonikolas et al., 2016; Balakrishnan et al., 2017; Diakonikolas et al., 2017, 2018). Interestingly, in the problem of estimating the Gaussian mean, it was proven that the minimax rates under adversarial contamination are within a factor at most logarithmic in $n$ and $1/\varepsilon$ of the minimax rates under Huber's contamination. While each of the aforementioned papers clearly introduced the conditions on the contamination, to our knowledge, none of them described the different possible models and the relationships between them.

Another line of the growing literature on robust estimation aims at robustifying estimators and prediction methods against heavy-tailed distributions; see (Audibert and Catoni, 2011; Minsker, 2015; Donoho and Montanari, 2016; Devroye et al., 2016; Joly et al., 2017; Minsker, 2018; Lugosi and Mendelson, 2019; Lecué and Lerasle, 2017; Chinot et al., 2018). The results of those papers are of a different nature compared to the present work, not only in terms of the goals, but also in terms of the mathematical and algorithmic tools.

4 Minimax rates on the “sparse” simplex and confidence regions

We now specialize the general setting of Section 2 to a reference distribution $P$ with expectation $\theta^*$, defined on the simplex $\Delta^k$. Along with this reference model describing the distribution of inliers, we will use the different contamination models introduced above. More precisely, we will establish upper bounds on the worst-case risk of the sample mean in the most general, adversarial, contamination setting. Then matching lower bounds will be provided for the minimax risks under Huber's contamination.

4.1 Upper bounds: worst-case risk of the sample mean

We denote by $\Delta^k_s$ the set of all $\theta\in\Delta^k$ having at most $s$ non-zero entries.

Theorem 1. There is a universal constant $C>0$ such that, for every triple of positive integers $(k,s,n)$ and for every $\varepsilon\in[0,1]$, the sample mean $\bar X_n$ satisfies

$\sup_{\theta^*\in\Delta^k_s}\mathcal{R}_{d_{\rm TV}}\big(\bar X_n;\mathcal{M}_n^{\rm AC}(\theta^*,\varepsilon)\big)\ \le\ C\Big\{\Big(\frac sn\Big)^{1/2}+\varepsilon\Big\},\qquad(16)$
$\sup_{\theta^*\in\Delta^k_s}\mathcal{R}_{d_{\rm H}}\big(\bar X_n;\mathcal{M}_n^{\rm AC}(\theta^*,\varepsilon)\big)\ \le\ C\Big\{\Big(\frac sn\Big)^{1/2}+\varepsilon^{1/2}\Big\},\qquad(17)$
$\sup_{\theta^*\in\Delta^k_s}\mathcal{R}_{d_{\mathbb{L}^2}}\big(\bar X_n;\mathcal{M}_n^{\rm AC}(\theta^*,\varepsilon)\big)\ \le\ C\Big\{\Big(\frac1n\Big)^{1/2}+\varepsilon\Big\}.\qquad(18)$

The proof is given in Appendix B.

An unexpected and curious phenomenon unveiled by this theorem is that the three rates are all different. As a consequence, the answer to the question “what is the largest possible number of outliers, $o$, that does not impact the minimax rate of estimation of $\theta^*$?” crucially depends on the distance $d$ under consideration. Taking into account the relation $o=n\varepsilon$ between the number of outliers and the contamination rate, we get

$o_{\rm TV}\asymp(sn)^{1/2},\qquad o_{\rm H}\asymp s,\qquad o_{\mathbb{L}^2}\asymp n^{1/2}.\qquad(19)$

Furthermore, all the claims concerning the total-variation distance in the considered model yield corresponding claims for the Wasserstein distances $W_q$, for every $q\in[1,\infty)$. Indeed, one can see an element $\theta$ of $\Delta^k$ as the probability distribution of a random vector taking values in the finite set $\{e_1,\dots,e_k\}$ of vectors of the canonical basis of $\mathbb{R}^k$. Since these vectors satisfy $\|e_j-e_l\|_2=\sqrt2$ for every $j\neq l$, we have

$W_q(\theta,\theta')^q\ =\ \inf_\Gamma\ \mathbf{E}_{(U,V)\sim\Gamma}\big[\|U-V\|_2^q\big]\qquad(20)$
$=\ 2^{q/2}\,\inf_\Gamma\ \mathbf{P}_\Gamma(U\neq V)\ =\ 2^{q/2}\,d_{\rm TV}(\theta,\theta'),\qquad(21)$

where the $\inf$ is over all joint distributions $\Gamma$ of pairs $(U,V)$ having marginal distributions $\theta$ and $\theta'$. This implies that

$\sup_{\theta^*\in\Delta^k_s}\mathcal{R}_{W_q}\big(\bar X_n;\mathcal{M}_n^{\rm AC}(\theta^*,\varepsilon)\big)\ \le\ 2^{1/2}\,C^{1/q}\Big\{\Big(\frac sn\Big)^{1/2}+\varepsilon\Big\}^{1/q}.\qquad(22)$

In addition, since the $\mathbb{L}^1$-norm is an upper bound on the $\mathbb{L}^2$-norm, we have $d_{\mathbb{L}^2}(\theta,\theta')\le 2\,d_{\rm TV}(\theta,\theta')$. Thus, we have obtained upper bounds on the risk of the sample mean for all the commonly used distances on the space of probability measures.

4.2 Lower bounds on the minimax risk

A natural question, answered in the next theorem, is how tight the upper bounds obtained in the last theorem are. More importantly, one can wonder whether there is an estimator whose worst-case risk is of a smaller order than that of the sample mean.

Theorem 2. There are universal constants $c>0$ and $c'>0$ such that, for all integers $n\ge1$, $k\ge2$, $s\in\{2,\dots,k\}$ and for any $\varepsilon\in[0,1)$, we have

$\inf_{\widehat\theta}\ \sup_{\theta^*\in\Delta^k_s}\ \mathcal{R}_{d_{\rm TV}}\big(\widehat\theta;\mathcal{M}_n^{\rm HC}(\theta^*,\varepsilon)\big)\ \ge\ c\Big\{\Big(\frac sn\Big)^{1/2}+\varepsilon\Big\}\wedge c',\qquad(23)$
$\inf_{\widehat\theta}\ \sup_{\theta^*\in\Delta^k_s}\ \mathcal{R}_{d_{\rm H}}\big(\widehat\theta;\mathcal{M}_n^{\rm HC}(\theta^*,\varepsilon)\big)\ \ge\ c\Big\{\Big(\frac sn\Big)^{1/2}+\varepsilon^{1/2}\Big\}\wedge c',\qquad(24)$
$\inf_{\widehat\theta}\ \sup_{\theta^*\in\Delta^k_s}\ \mathcal{R}_{d_{\mathbb{L}^2}}\big(\widehat\theta;\mathcal{M}_n^{\rm HC}(\theta^*,\varepsilon)\big)\ \ge\ c\Big\{\Big(\frac1n\Big)^{1/2}+\varepsilon\Big\}\wedge c',\qquad(25)$

where $\inf_{\widehat\theta}$ stands for the infimum over all measurable functions $\widehat\theta$ from $(\Delta^k)^n$ to $\Delta^k$.

The proof is given in Appendix C.

The main consequence of this theorem is that, whatever the contamination model (among those described in Section 2), the rates obtained for the sample mean in Section 4.1 are minimax optimal. Indeed, Theorem 2 yields this claim for Huber's contamination. For Huber's deterministic contamination and the TV-distance, on the one hand, Theorem 2 gives

$\inf_{\widehat\theta}\sup_{\theta^*\in\Delta^k_s}\mathcal{R}_{d_{\rm TV}}\big(\widehat\theta;\mathcal{M}_n^{\rm HC}(\theta^*,\varepsilon)\big)\ \ge\ c\Big\{\Big(\frac sn\Big)^{1/2}+\varepsilon\Big\}\wedge c'.\qquad(26{-}27)$

On the other hand, in view of Proposition 1, and since the diameter of the simplex measured in total variation equals one,

$\inf_{\widehat\theta}\sup_{\theta^*\in\Delta^k_s}\mathcal{R}_{d_{\rm TV}}\big(\widehat\theta;\mathcal{M}_n^{\rm HDC}(\theta^*,2\varepsilon)\big)\ \ge\ \inf_{\widehat\theta}\sup_{\theta^*\in\Delta^k_s}\mathcal{R}_{d_{\rm TV}}\big(\widehat\theta;\mathcal{M}_n^{\rm HC}(\theta^*,\varepsilon)\big)-e^{-n\varepsilon/3}.\qquad(28{-}29)$

Combining these two inequalities, whenever $e^{-n\varepsilon/3}\le\frac c2\{(s/n)^{1/2}+\varepsilon\}\wedge\frac{c'}2$, we get

$\inf_{\widehat\theta}\sup_{\theta^*\in\Delta^k_s}\mathcal{R}_{d_{\rm TV}}\big(\widehat\theta;\mathcal{M}_n^{\rm HDC}(\theta^*,2\varepsilon)\big)\ \ge\ \frac c2\Big\{\Big(\frac sn\Big)^{1/2}+\varepsilon\Big\}\wedge\frac{c'}2.\qquad(30)$

The same argument can be used to show that all the inequalities of Theorem 2 remain valid for Huber's deterministic contamination model as well. Since the inclusions $\mathcal{M}_n^{\rm HDC}\subset\mathcal{M}_n^{\rm OC}\subset\mathcal{M}_n^{\rm AC}$ hold true, we conclude that the lower bounds obtained for the HC model remain valid for all the other contamination models, so the rates of Theorem 1 are minimax optimal.

The main tool in the proof of Theorem 2 is the following result (Chen et al., 2018, Theorem 5.1). There is a universal constant $c>0$ such that, for every $\varepsilon\in[0,1)$,

$\inf_{\widehat\theta}\ \sup_{\mathcal{M}_n^{\rm HC}(\Theta,\varepsilon)}\ \mathbf{P}\big(d(\widehat\theta,\theta^*)\ge c\,\omega(\varepsilon,\Theta)\big)\ \ge\ c,\qquad(31)$

where $\omega$ is the modulus of continuity defined by $\omega(\varepsilon,\Theta)=\sup\big\{d(\theta,\theta'):\ \theta,\theta'\in\Theta,\ d_{\rm TV}(P_\theta,P_{\theta'})\le\varepsilon/(1-\varepsilon)\big\}$. Choosing $\theta$ and $\theta'$ to differ on only two coordinates, one can check that, for any $\varepsilon$, the moduli satisfy $\omega_{\rm TV}\gtrsim\varepsilon$, $\omega_{\rm H}\gtrsim\varepsilon^{1/2}$ and $\omega_{\mathbb{L}^2}\gtrsim\varepsilon$ (see the computation below). Combining with the lower bounds in the non-contaminated setting, this result yields the claims of Theorem 2. In addition, (20) combined with the results of this section implies that the rate in (22) is minimax optimal.
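For instance, setting $\varepsilon'=\varepsilon/(1-\varepsilon)$ and taking $\theta$ and $\theta'$ that coincide except on two coordinates, with $\theta_1=0$, $\theta'_1=\varepsilon'$, $\theta_2=t$ and $\theta'_2=t-\varepsilon'$ for some $t\ge\varepsilon'$, a direct computation (ours, using the definitions in (2)) gives

```latex
% Two-point lower bounds on the moduli of continuity:
d_{\mathrm{TV}}(\theta,\theta') = \varepsilon', \qquad
d_{\mathrm{H}}(\theta,\theta') \ \ge\ \big(\tfrac12\,\varepsilon'\big)^{1/2}, \qquad
d_{\mathbb{L}^2}(\theta,\theta') \ =\ \sqrt{2}\,\varepsilon'.
```

Hence the three moduli are at least of order $\varepsilon$, $\varepsilon^{1/2}$ and $\varepsilon$, respectively, in line with the contamination terms of (6)-(8).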

4.3 Confidence regions

We have established so far bounds for the expected value of the estimation error. The aim of this section is to present bounds on the estimation error of the sample mean holding with high probability. This also leads to confidence regions for the parameter vector $\theta^*$. To this end, the contamination rate $\varepsilon$ and the sparsity $s$ are assumed to be known. It is an interesting open question whether one can construct optimally shrinking confidence regions for unknown $\varepsilon$ and $s$.

Theorem 3. Let $\delta\in(0,1)$ be the tolerance level. If $n\ge\log(1/\delta)$, then under any contamination model, the regions of $\Delta^k$ defined by each of the following inequalities

$d_{\rm TV}(\theta,\bar X_n)\ \le\ C\Big\{\Big(\frac sn\Big)^{1/2}+\varepsilon+\Big(\frac{\log(1/\delta)}n\Big)^{1/2}\Big\},\qquad(32)$
$d_{\rm H}(\theta,\bar X_n)\ \le\ C\Big\{\Big(\frac sn\Big)^{1/2}+\varepsilon^{1/2}+\Big(\frac{\log(1/\delta)}n\Big)^{1/4}\Big\},\qquad(33)$
$d_{\mathbb{L}^2}(\theta,\bar X_n)\ \le\ C\Big\{\Big(\frac1n\Big)^{1/2}+\varepsilon+\Big(\frac{\log(1/\delta)}n\Big)^{1/2}\Big\},\qquad(34)$

contain $\theta^*$ with probability at least $1-\delta$, where $C>0$ is a universal constant.
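Numerically, the three radii can be evaluated as in the schematic sketch below (ours; the constant `C` is a stand-in for the universal constant of the theorem, whose sharp value matters for actual coverage):

```python
import numpy as np

def confidence_radii(n, s, eps, delta, C=1.0):
    """Radii of the three confidence regions centered at the sample mean,
    following the rate-plus-deviation structure of (32)-(34)."""
    dev = np.log(1.0 / delta) / n
    return {
        "TV": C * (np.sqrt(s / n) + eps + np.sqrt(dev)),
        "H":  C * (np.sqrt(s / n) + np.sqrt(eps) + dev ** 0.25),
        "L2": C * (np.sqrt(1.0 / n) + eps + np.sqrt(dev)),
    }

# Example: radii for n = 1000, s = 10, eps = 0.05, delta = 0.05.
print(confidence_radii(1000, 10, 0.05, 0.05))
```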

To illustrate the shapes of these confidence regions, we depict them in Figure 2 for a three-dimensional example ($k=3$), projected onto the plane containing the probability simplex. The same value of the sample mean is used in all three panels.

Figure 2: The shape of the confidence sets (white regions) for the $\mathbb{L}^2$ (left), TV (center), and Hellinger (right) distances, for a fixed value of the sample mean.

5 Illustration on a numerical example

We provide some numerical experiments illustrating the theoretical results of Section 4. The data set is the collection of 38 books written by Alexandre Dumas (1802-1870) and 38 books written by Emile Zola (1840-1902). To each author, we assign a parameter vector corresponding to the distribution of the number of words contained in the sentences used in the author's books. To be more precise, a sentence containing $j$ words is represented by the vector $e_j$, and if the parameter vector of an author is $\theta$, it means that a sentence used by the author is of size $j$ with probability $\theta_j$. We carried out synthetic experiments in which the reference parameter to estimate is the probability vector of Dumas, while the distribution of outliers is determined by the probability vector of Zola. Ground truths for these parameters are computed from the aforementioned large corpus of their works. Only the dense case ($s=k$) was considered. For various values of $n$ and $\varepsilon$, a contaminated sample was generated by randomly choosing $n$ sentences either from Dumas' works (with probability $1-\varepsilon$) or from Zola's works (with probability $\varepsilon$). The sample mean was computed for this corrupted sample, and the error with respect to Dumas' parameter vector was measured by the three distances TV, $\mathbb{L}^2$ and Hellinger. This experiment was repeated many times for each setting to obtain information on the distribution of the error. Furthermore, by grouping nearby outcomes we created samples of different dimensions, which allows us to illustrate the behavior of the error as a function of $k$.
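The experiment can be summarized by the following sketch (ours; `theta_dumas` and `theta_zola` stand for the corpus-estimated sentence-length distributions, which we do not reproduce here):

```python
import numpy as np

def run_experiment(theta_dumas, theta_zola, n, eps, n_rep=100, seed=0):
    """Repeat the contaminated sampling n_rep times and return quantiles
    of the TV error of the sample mean with respect to Dumas' vector."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_rep):
        from_zola = rng.random(n) < eps        # each sentence: Zola w.p. eps
        X = np.where(from_zola[:, None],
                     rng.multinomial(1, theta_zola, size=n),
                     rng.multinomial(1, theta_dumas, size=n))
        errors.append(0.5 * np.abs(X.mean(axis=0) - theta_dumas).sum())
    return np.quantile(errors, [0.05, 0.5, 0.95])
```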

Figure 3: Estimation error of $\bar X_n$ measured by the total-variation, Hellinger, and $\mathbb{L}^2$ distances as a function of (left panel) the number of observations $n$, for contamination rate 0.2 and a fixed dimension, and (right panel) the dimension $k$, for contamination rate 0.2 and a fixed number of samples. The interval between the 5th and 95th quantiles of the error, obtained from repeated trials, is also depicted for every curve.

The error of $\bar X_n$ as a function of the sample size $n$, the dimension $k$, and the contamination rate $\varepsilon$ is plotted in Figures 4 and 3. These plots conform to the theoretical results. Indeed, the first plot in Figure 3 shows that the error is decreasing in $n$ for each of the three distances. Furthermore, we see that up to some level of $n$ this decay is of order $n^{-1/2}$. The second plot in Figure 3 confirms that the risk grows linearly in $\sqrt k$ for the TV and Hellinger distances, while it remains constant for the $\mathbb{L}^2$ error.

The left panel of Figure 4 suggests that the error grows linearly with the contamination rate. This conforms to our results for the TV and $\mathbb{L}^2$ errors. But it might seem that there is a disagreement with the result for the Hellinger distance, for which the risk was shown to increase at the rate $\varepsilon^{1/2}$ and not $\varepsilon$. This is explained by the fact that the rate $\varepsilon^{1/2}$ corresponds to the worst-case risk, whereas the setting of this experiment does not necessarily represent the worst case. When the parameter vectors of the reference and contamination distributions have disjoint supports (i.e., when these two distributions are at the largest possible distance, which we call an extreme case), the graph of the error as a function of $\varepsilon$ (right panel of Figure 4) is indeed similar to that of the square-root function.

Figure 4: The estimation error of $\bar X_n$ measured by the total-variation, Hellinger, and $\mathbb{L}^2$ distances as a function of the contamination rate (for a fixed dimension and number of samples). On the right, we plot the error with respect to the contamination rate for an extreme case, where the reference and contamination distributions are at the largest possible distance. The interval between the 5th and 95th quantiles of the error, obtained from repeated trials, is also depicted.

6 Summary and conclusion

We have analyzed the problem of robust estimation of the mean of a random vector belonging to the probability simplex. Four measures of accuracy have been considered: the total-variation, Hellinger, Euclidean and Wasserstein distances. In each case, we have established the minimax rates of the expected estimation error under a sparsity assumption. In addition, confidence regions shrinking at the minimax rate have been proposed.

An intriguing observation is that the choice of the distance has a much stronger impact on the rate than the nature of the contamination. Indeed, while the rates for the aforementioned distances are all different, the rate corresponding to one particular distance is not sensitive to the nature of the outliers (ranging from Huber's contamination to the adversarial one). While the rate obtained for the TV-distance coincides with the previously known rate of robustly estimating a Gaussian mean, the rates we have established for the Hellinger and Euclidean distances appear to be new. Interestingly, when the error is measured by the Euclidean distance, the quality of estimation does not deteriorate with increasing dimension.

Appendix A Proofs of propositions

Proof of Proposition 1.

Recall that $O$ denotes the set of outliers in Huber's model. Let $S$ be any subset of $\{1,\dots,n\}$. It follows from the definition of Huber's model that, if $\mathbf{P}_S$ stands for the conditional distribution of $(X_1,\dots,X_n)$ given $O=S$, then $\mathbf{P}_S$ belongs to the model $\mathcal{M}_n^{\rm HDC}(\Theta,|S|/n)$. Therefore, for every $S$ of cardinality at most $2n\varepsilon$, we have

$\mathbf{E}\big[d(\widehat\theta_n,\theta^*)\,\big|\,O=S\big]\ \le\ \sup_{\mathbf P\in\mathcal{M}_n^{\rm HDC}(\Theta,2\varepsilon)}\mathbf{E}_{\mathbf P}\big[d(\widehat\theta_n,\theta^*)\big]\ =\ \mathcal{R}_d\big(\widehat\theta_n;\mathcal{M}_n^{\rm HDC}(\Theta,2\varepsilon)\big).$

This inequality is a direct consequence of the inclusion $\mathcal{M}_n^{\rm HDC}(\Theta,|S|/n)\subset\mathcal{M}_n^{\rm HDC}(\Theta,2\varepsilon)$, valid for $|S|\le2n\varepsilon$. Multiplying by $\mathbf{P}(O=S)$ and summing the obtained inequality over all sets $S$ of cardinality at most $2n\varepsilon$, we get

$\mathbf{E}\big[d(\widehat\theta_n,\theta^*)\,\mathbb{1}(|O|\le2n\varepsilon)\big]\ \le\ \mathcal{R}_d\big(\widehat\theta_n;\mathcal{M}_n^{\rm HDC}(\Theta,2\varepsilon)\big).$

It follows from the multiplicative form of Chernoff's inequality that $\mathbf{P}(|O|>2n\varepsilon)\le e^{-n\varepsilon/3}$. This leads to the last term in inequality (11).

Using the same argument, on the complementary event we get

$\mathbf{E}\big[d(\widehat\theta_n,\theta^*)\,\mathbb{1}(|O|>2n\varepsilon)\big]\ \le\ D_\Theta\,\mathbf{P}(|O|>2n\varepsilon)\ \le\ D_\Theta\,e^{-n\varepsilon/3}.$

This completes the proof of (11).

One can use the same arguments along with Chebyshev's inequality to establish (12). Indeed, repeating the steps above, we arrive at

$\mathcal{R}_d\big(\widehat\theta_n;\mathcal{M}_n^{\rm HC}(\Theta,\varepsilon)\big)\ \le\ \mathcal{R}_d\big(\widehat\theta_n;\mathcal{M}_n^{\rm HDC}(\Theta,2\varepsilon)\big)+D_\Theta\,\mathbf{P}(|O|>2n\varepsilon).$

On the other hand, since $|O|$ is binomial with mean $n\varepsilon$ and variance $n\varepsilon(1-\varepsilon)$, Chebyshev's inequality yields

$\mathbf{P}(|O|>2n\varepsilon)\ \le\ \frac{n\varepsilon(1-\varepsilon)}{(n\varepsilon)^2}\ \le\ \frac1{n\varepsilon},$

and the claim of the proposition follows. ∎

Proof of Proposition 2.

Let $\theta$ and $\theta'$ be two points in $\Theta$, and denote by $\mathbf{P}_{\theta,\theta'}$ the joint distribution of $(X_1,\dots,X_n)$ in Huber's model with reference distribution $P_\theta$ and contamination distribution $Q=P_{\theta'}$. We have

$\mathcal{R}_d\big(\widehat\theta_n;\mathcal{M}_n^{\rm HC}(\Theta,\varepsilon)\big)\ \ge\ \frac12\Big(\mathbf{E}_{\theta,\theta'}\big[d(\widehat\theta_n,\theta)\big]+\mathbf{E}_{\theta',\theta}\big[d(\widehat\theta_n,\theta')\big]\Big).$

To ease writing, assume that $n$ is an even number. Let $S$ be any fixed set of cardinality $n/2$. It is clear that the set of outliers satisfies

$\mathbf{P}(O=S)=\mathbf{P}(O=S^c)=\varepsilon^{n/2}(1-\varepsilon)^{n/2}>0.$

Furthermore, if $(X_1,\dots,X_n)$ is drawn from $\mathbf{P}_{\theta,\theta'}$, then its conditional distribution given $O=S$ is exactly the same as the conditional distribution, given $O=S^c$, of $(X_1,\dots,X_n)$ drawn from $\mathbf{P}_{\theta',\theta}$. This implies that

$\mathbf{E}_{\theta,\theta'}\big[d(\widehat\theta_n,\theta)\big]+\mathbf{E}_{\theta',\theta}\big[d(\widehat\theta_n,\theta')\big]\ \ge\ \varepsilon^{n/2}(1-\varepsilon)^{n/2}\,\mathbf{E}_{\theta,\theta'}\big[d(\widehat\theta_n,\theta)+d(\widehat\theta_n,\theta')\,\big|\,O=S\big]\ \ge\ \varepsilon^{n/2}(1-\varepsilon)^{n/2}\,d(\theta,\theta'),$

where in the last step we have used the triangle inequality. The obtained inequality being true for every pair $(\theta,\theta')$, we can take the supremum to get

$\mathcal{R}_d\big(\widehat\theta_n;\mathcal{M}_n^{\rm HC}(\Theta,\varepsilon)\big)\ \ge\ \frac12\,\varepsilon^{n/2}(1-\varepsilon)^{n/2}\sup_{\theta,\theta'\in\Theta}d(\theta,\theta')\ =\ \infty.$

This completes the proof. ∎

Appendix B Upper bounds on the minimax risk over the sparse simplex

This section is devoted to the proof of the upper bounds on the minimax risks in the discrete model with respect to the various distances.

Proof of Theorem 1.

To ease notation, we set