Recent Advances in Algorithmic High-Dimensional Robust Statistics1footnote 11footnote 1This article is an expanded version of an invited chapter entitled “Robust High-Dimensional Statistics” in the book “Beyond the Worst-Case Analysis of Algorithms” to be published by Cambridge University Press.

# Recent Advances in Algorithmic High-Dimensional Robust Statistics111This article is an expanded version of an invited chapter entitled “Robust High-Dimensional Statistics” in the book “Beyond the Worst-Case Analysis of Algorithms” to be published by Cambridge University Press.

Ilias Diakonikolas
ilias@cs.wisc.edu
Supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship.
Daniel M. Kane
University of California, San Diego
dakane@cs.ucsd.edu
Supported by NSF Award CCF-1553288 (CAREER) and a Sloan Research Fellowship.
###### Abstract

Learning in the presence of outliers is a fundamental problem in statistics. Until recently, all known efficient unsupervised learning algorithms were very sensitive to outliers in high dimensions. In particular, even for the task of robust mean estimation under natural distributional assumptions, no efficient algorithm was known. Recent work in theoretical computer science gave the first efficient robust estimators for a number of fundamental statistical tasks, including mean and covariance estimation. Since then, there has been a flurry of research activity on algorithmic high-dimensional robust estimation in a range of settings. In this survey article, we introduce the core ideas and algorithmic techniques in the emerging area of algorithmic high-dimensional robust statistics with a focus on robust mean estimation. We also provide an overview of the approaches that have led to computationally efficient robust estimators for a range of broader statistical tasks and discuss new directions and opportunities for future work.

## 1 Introduction

### 1.1 Background

Consider the following basic statistical task: Given independent samples from an unknown mean spherical Gaussian distribution on , estimate its mean vector within small -norm. It is not hard to see that the empirical mean has -error at most from with high probability. Moreover, this error upper bound is best possible among all -sample estimators.

The Achilles heel of the empirical estimator is that it crucially relies on the assumption that the observations were generated by a spherical Gaussian. The existence of even a single outlier can arbitrarily compromise this estimator’s performance. However, the Gaussian assumption is only ever approximately valid, as real datasets are typically exposed to some source of contamination. Hence, any estimator that is to be used in practice must be robust in the presence of outliers.

Learning in the presence of outliers is an important goal in statistics and has been studied in the robust statistics community since the 1960s [73, 44] (see [38, 45] for introductory statistical textbooks on the topic). In recent years, the problem of designing robust and computationally efficient estimators for various high-dimensional statistical tasks has become a pressing challenge in a number of data analysis applications. These include the analysis of biological datasets, where natural outliers are common [68, 64, 58] and can contaminate the downstream statistical analysis, and data poisoning attacks in machine learning [4], where even a small fraction of fake data (outliers) can substantially degrade the quality of the learned model [9, 70].

Classical work in robust statistics pinned down the minimax risk of high-dimensional robust estimation in several basic settings of interest. In contrast, until very recently, even the most basic computational questions in this field were poorly understood. For example, the Tukey median [74] is a sample-efficient robust mean estimator for spherical Gaussian distributions. However, it is NP-hard to compute in general [47] and the heuristics proposed to approximate it degrade in the quality of their approximation as the dimension scales. Similar hardness results have been shown [5, 39] for essentially all known classical estimators in robust statistics.

Until recently, all known computationally efficient estimators could only tolerate a negligible fraction of outliers in high dimensions, even for the basic task of mean estimation. Recent work by Diakonikolas, Kamath, Kane, Li, Moitra, and Stewart [21], and by Lai, Rao, and Vempala [57] gave the first efficient robust estimators for various high-dimensional unsupervised learning tasks, including mean and covariance estimation. Specifically, [21] obtained the first polynomial-time robust estimators with dimension-independent error guarantees, i.e., with error scaling only with the fraction of corrupted samples and not with the dimensionality of the data. Since the dissemination of these works [21, 57], there has been significant research activity on designing computationally efficient robust estimators in a variety of settings.

### 1.2 Contamination Model

Throughout this article, we focus on the following model of data contamination that generalizes several other existing models:

###### Definition 1.1 (Strong Contamination Model).

Given a parameter and a distribution family on , the adversary operates as follows: The algorithm specifies a number of samples , and samples are drawn from some unknown . The adversary is allowed to inspect the samples, remove up to of them and replace them with arbitrary points. This modified set of points is then given as input to the algorithm. We say that a set of samples is -corrupted if it is generated by the above process.

The contamination model of Definition 1.1 can be viewed as a semi-random input model: First, nature draws a set of i.i.d. samples from a statistical model of interest, and then an adversary is allowed to change the set in a bounded way to obtain an -corrupted set . The parameter is the proportion of contamination and quantifies the power of the adversary. Intuitively, among our samples, an unknown fraction are generated from a distribution of interest and are called inliers, and the rest are called outliers.

One can consider less powerful adversaries, giving rise to weaker contamination models. An adversary may be (i) adaptive or oblivious to the inliers, (ii) only allowed to add outliers, or only allowed to remove inliers, or allowed to do both. We provide some examples of natural and well-studied such models in the following paragraphs.

In Huber’s contamination model [44], the adversary is oblivious to the inliers and is only allowed to add outliers. More specifically, in Huber’s model, the adversary generates samples from a mixture distribution of the form , where is the unknown target distribution and is an adversarially chosen noise distribution. Another natural contamination model very common in theoretical computer science allows the adversary to perturb the distribution by at most in total variation distance, i.e., the adversary generates samples from a distribution that satisfies . Intuitively, the adversary in this model is oblivious to the inliers and is allowed to both add outliers and remove inliers. This model is very similar to a contamination model proposed by Hampel [37]. We note that contamination in total variation distance is strictly stronger than Huber’s model. More broadly, one can naturally modify this model to study model misspecification with respect to different loss functions (see, e.g., [76]).

In computational learning theory, the contamination model of Definition 1.1 is related to the agnostic model [40, 49], where the goal is to learn a labeling function whose agreement with some underlying target function is close to the best possible, among all functions in some given class. An important difference with our setting is that the agnostic model requires that we fit all the data, while in our robust setting we only want to fit the uncorrupted data.

The strong contamination model can be viewed as the unsupervised analogue of the challenging nasty noise model [13] (itself a strengthening of the malicious model [75, 51]). In the nasty model, an adversary is allowed to corrupt an -fraction of both the labels and the samples, and the goal of the learning algorithm is to output a hypothesis with small misclassification error with respect to the clean data generating distribution.

In robust mean estimation, given an -corrupted set of samples from a well-behaved distribution (e.g., ), we want to output a vector that closely approximates the unknown mean vector . A natural choice of metric between the means for identity covariance distributions is the -error , and in this article we focus on designing robust estimators minimizing . We note however that the same algorithms typically lead to small Mahalanobis distance, i.e., , when the underlying distribution has covariance .

One can analogously define robust estimation for other parameters of high-dimensional distributions (e.g., covariance matrix and higher-order moment tensors) with respect to natural loss functions (e.g., Frobenius norm, spectral norm). A more general statistical task is that of robust density estimation: Given an -corrupted set of samples from an unknown distribution , output a hypothesis distribution (not necessarily in ) such that the total variation distance is minimized. We note that robust density estimation and robust parameter estimation are closely related to each other. For many natural parametric distributions, the latter can be reduced to the former for an appropriate choice of metric between the parameters.

In all these settings, the goal is to design computationally efficient learning algorithms that achieve dimension-independent error, i.e., error that scales only with the contamination parameter , but not with the dimension . The information-theoretic limits of robust estimation depend on our assumptions about the distribution family . In the following subsection, we provide the basic relevant background.

### 1.3 Basic Background: Sample Efficient Robust Estimation

Before we proceed with presenting computationally efficient robust estimators in the next sections, we provide a few basic facts on the information-theoretic limits of robust estimation. For concreteness, we focus here on robust mean estimation. We note that similar arguments can be applied to various other parameter estimation tasks. The interested reader is referred to [21] and to [15, 59] (and references therein) for recent information-theoretic work from the statistics community.

We first note that some assumptions on the underlying distribution family are necessary for robust mean estimation to be information-theoretically possible. Consider for example, the family , where is a probability distribution on the real line with only one point having positive mass and such that . While estimating the mean of an arbitrary distribution in is straightforward without corruptions (by taking samples until we see a sample twice, which must be the true mean), an adversary can erase all information about the mean in an -corrupted sample from . Indeed, an adversary can delete the samples at and move them at an arbitrary location to arbitrarily corrupt the sample mean.

Typical assumptions on the family are either parametric (e.g., could be the family of all Gaussian distributions) or are defined by concentration properties (e.g., each distribution in satisfies sub-gaussian concentration) or conditions on low-degree moments (e.g., each distribution in has appropriately bounded higher-order moments).

Another basic observation is that, in contrast to the uncorrupted setting, in the contaminated setting of Definition 1.1 it is not possible to obtain consistent estimators — that is, estimators with error converging to zero in probability as the sample size increases indefinitely. Typically, there is an information-theoretic limit on the minimum attainable error that depends on the proportion of contamination and structural properties of the underlying distribution family.

In particular, for the robust mean estimation of a high-dimensional Gaussian, we have:

###### Fact 1.2.

For any , any robust estimator for the mean of on , must have -error , even in Huber’s contamination model.

This fact can be shown as follows: Given two distinct distributions and with , the adversary constructs two noise distributions on such that . Consequently, even in the infinite sample regime, any algorithm can at best learn that its samples are coming from , but will be unable to distinguish between the cases where the real distribution is and where it is . This proves Fact 1.2.

If the target distribution is allowed to come from a broader class of distributions (such as allowing any distribution with subgaussian tails, or any distribution with bounded covariance), the situation is even worse. If one can find two distributions and in the desired class with , it becomes information-theoretically impossible for an algorithm to distinguish between the two. However, if the difference between and is concentrated in the tails of the distribution, then and might have very different means. This implies that for the class of distributions with sub-gaussian tails (and identity covariance) we cannot hope to learn the mean to -error better than ; and for the class of distributions with covariance bounded by , we cannot expect to do better than . It turns out that these bounds are information-theoretically optimal, and in fact as we will see, the means of such distributions can be robustly estimated to these errors in many cases.

The problem of robust mean estimation for seems so innocuous that one could naturally wonder why simple approaches do not work. In the one-dimensional case, it is well-known that the median is a robust estimator of the mean, matching the lower bound of Fact 1.2. Specifically, it is easy to show that the median of a multiset of -corrupted samples from a one-dimensional Gaussian satisfies with probability at least .

In high dimensions, the situation is more subtle. There are many reasonable ways to attempt to generalize the median as a robust estimator in high dimensions, but unfortunately, most natural ones lead to -error of in dimensions, even in the infinite sample regime (see, e.g., [21, 57]).

Perhaps the most obvious high-dimensional generalization of the median is the coordinate-wise median. Here the -th coordinate of the output is the median of the -th coordinates of the input samples. This estimator guarantees that every coordinate of the output is within of the corresponding coordinate of the input. This estimator suffices for obtaining small -error, but if one wants -error (which is natural in the case of Gaussians), then there exist instances where the coordinate-wise median has -error as large as . Another potential way to generalize the median to high dimensions is via the geometric median, i.e., the point that minimizes . Unfortunately, the geometric median can also produce -error of if the adversary adds the -fraction of the outliers all off from the mean in the same direction.

A third high-dimensional generalization of the median relies on the observation that taking the median of any univariate projection of our input points gives us an approximation to the projected mean. Finding a mean vector that minimizes the error over the worst direction can actually be used to obtain -error of with high probability. In other words, it is possible to reduce the high-dimensional robust mean estimation problem to a collection of one-dimensional robust mean estimation problems. This is the underlying idea in Tukey’s median [74], which is known to be a robust mean estimator for spherical Gaussians and for more general symmetric distributions. But unfortunately, the Tukey median relies on combining information for infinitely many directions, and is unsurprisingly NP-Hard to compute in general.

The following proposition gives a computationally inefficient robust mean estimator matching the lower bound of Fact 1.2:

###### Proposition 1.3.

There exists an algorithm that, on input an -corrupted set of samples from of size , runs in time, and outputs such that with probability at least , it holds that .

The algorithm to establish Proposition 1.3 proceeds by using a one-dimensional robust mean estimator to estimate , for an appropriate net of unit vectors , and then combines these estimates (by solving a large linear program) to obtain an accurate estimate of . When , our one-dimensional robust mean estimator will be the median, giving the error in Proposition 1.3.

We note that the same procedure is applicable to other distribution families as well (even non-symmetric ones), as long as there is an accurate robust mean estimator for each univariate projection. Specifically, if has tails bounded by those of a Gaussian with covariance , one can use the trimmed mean for each univariate projection. This gives error of . If is only assumed to have bounded covariance matrix (), we can similarly use the trimmed mean, which leads to total error of . By the discussion following Fact 1.2, both these upper bounds are optimal, within constant factors, under the corresponding assumptions.

In Section 2, we provide a unified presentation of two related algorithmic techniques that gave the first computationally efficient algorithms for high-dimensional robust mean estimation. Section 2 is the main technical section of this article and showcases a number of core ideas and techniques that can be applied to several high-dimensional robust estimation tasks. Section 3 provides an overview of recent algorithmic progress for more general robust estimation tasks. Finally, in Section 4 we conclude with a few general directions for future work.

### 1.5 Preliminaries and Notation

For a distribution , we will use the notation to denote that is a sample drawn from . For a finite set , we will write to denote that is drawn from the uniform distribution on . The probability of event will be denoted by .

We will use and to denote the expectation and the variance of random variable . If is multivariate, we will denote by its covariance matrix. We will also use the notation and for the mean and covariance of . Similarly, for a finite set , we will denote by and the sample mean and sample covariance of .

For a vector , we will use to denote its -norm. For a matrix , we will denote by and its spectral and Frobenius norms respectively, and by its trace. We will denote by the Loewner order between matrices.

We will use standard asymptotic notation , . The notation hides logarithmic factors in its argument.

## 2 High-Dimensional Robust Mean Estimation

In this section, we illustrate the main insights underlying recent robust high-dimensional learning algorithms by focusing on the problem of robust mean estimation. The algorithmic techniques presented in this section were introduced in [21, 22]. Here we give a simplified and unified presentation that illustrates the key ideas and the connections between them.

The objective of this section is to provide the intuition and background required to develop robust learning algorithms in an accessible way. As such, we will not attempt to optimize the sample or computational complexities of the algorithms presented, other than to show that they are polynomial in the relevant parameters.

In the problem of robust mean estimation, we are given an -corrupted set of samples from a distribution on and our goal is to approximate the mean of , within small error in -norm. In order for such a goal to be information-theoretically possible, it is required that belongs to a suitably well-behaved family of distributions. A typical assumption is that belongs to a family whose moments are guaranteed to satisfy certain conditions, or equivalently, a family with appropriate concentration properties. In our initial discussion, we will use the running example of a spherical Gaussian, although the results presented here hold in greater generality. That is, the reader is encouraged to imagine that is of the form , for some unknown .

##### Structure of this Section.

In Section 2.1, we discuss the basic intuition underlying the presented approach. In Section 2.2, we will describe a stability condition that is necessary for the algorithms in this section to succeed. In the subsequent subsections, we present two related algorithmic techniques taking advantage of the stability condition in different ways. Specifically, in Section 2.3, we describe an algorithm that relies on convex programming. In Section 2.4, we describe an iterative outlier removal technique, which has been the method of choice in practice. In Section 2.5, we conclude with an overview of the relevant literature.

### 2.1 Key Difficulties and High-Level Intuition

Arguably the most natural idea to robustly estimate the mean of a distribution would be to identify the outliers and output the empirical mean of the remaining points. The key conceptual difficulty is the fact that, in high dimensions, the outliers cannot be identified at an individual level even when they move the mean significantly. In many cases, we can easily identify the “extreme outliers” — via a pruning procedure exploiting the concentration properties of the inliers. Alas, such naive approaches typically do not suffice to obtain non-trivial error guarantees.

The simplest example illustrating this difficulty is that of a high-dimensional spherical Gaussian. Typical samples will be at -distance approximately from the true mean. That is, we can certainly identify as outliers all points of our dataset at distance more than from the coordinate-wise median of the dataset. All other points cannot be removed via such a procedure, as this could result in removing many inliers as well. However, by placing an -fraction of outliers at distance in the same direction from the unknown mean, an adversary can corrupt the sample mean by as much as .

This leaves the algorithm designer with a dilemma of sorts. On the one hand, potential outliers at distance from the unknown mean could lead to large -error, scaling polynomially with . On the other hand, if the adversary places outliers at distance approximately from the true mean in random directions, it may be information-theoretically impossible to distinguish them from the inliers. The way out is the realization that it is in fact not necessary to detect and remove all outliers. It is only required that the algorithm can detect the “consequential outliers”, i.e., the ones that can significantly impact our estimates of the mean.

Let us assume without loss of generality that there no extreme outliers (as these can be removed via pre-processing). Then the only way that the empirical mean can be far from the true mean is if there is a “conspiracy” of many outliers, all producing errors in approximately the same direction. Intuitively, if our corrupted points are at distance from the true mean in random directions, their contributions will on average cancel out, leading to a small error in the sample mean. In conclusion, it suffices to be able to detect these kinds of conspiracies of outliers.

The next key insight is simple and powerful. Let be an -corrupted set of points drawn from . If such a conspiracy of outliers substantially moves the empirical mean of , it must move in some direction. That is, there is a unit vector such that these outliers cause to be large. For this to happen, it must be the case that these outliers are on average far from in the -direction. In particular, if an -fraction of corrupted points in move the sample average of , where is the uniform distribution on , by more than ( should be thought of as small, but substantially larger than ), then on average these corrupted points must have at least . This in turn means that these corrupted points will have a contribution of at least to the variance of . Fortunately, this condition can actually be algorithmically detected! In particular, by computing the top eigenvector of the sample covariance matrix, we can efficiently determine whether or not there is any direction for which the variance of is abnormally large.

The aforementioned discussion leads us to the overall structure of the algorithms we will describe in this section. Starting with an -corrupted set of points (perhaps weighted in some way), we compute the sample covariance matrix and find the eigenvector with largest eigenvalue . If is not much larger than what it should be (in the absence of outliers), by the above discussion, the empirical mean is close to the true mean, and we can return that as an answer. Otherwise, we have obtained a particular direction for which we know that the outliers play an unusual role, i.e., behave significantly differently than the inliers. The distribution of the points projected in the -direction can then be used to perform some sort of outlier removal. The outlier removal procedure can be quite subtle and crucially depends on our distributional assumptions about the clean data.

### 2.2 Good Sets and Stability

In this section, we give a deterministic condition on the uncorrupted data that we call stability (Definition 2.1), which is necessary for the algorithms presented here to succeed. Furthermore, we provide an efficiently checkable condition under which the empirical mean is certifiably close to the true mean (Lemma 2.4).

Let be a set of i.i.d. samples drawn from . We will typically call these sample points good. The adversary can select up to an -fraction of points in and replace them with arbitrary points to obtain an -corrupted set , which is given as input to the algorithm. To establish correctness of an algorithm, we need to show that with high probability over the choice of the set , for any choice of how to corrupt the good samples that the adversary makes, the algorithm will output an accurate estimate of the target mean.

To carry out such an analysis, it is convenient to explicitly state a collection of sufficient deterministic conditions on the set . Specifically, we will define a notion of a “stable” set, quantified by the proportion of contamination and the distribution . The precise stability conditions vary considerably based on the underlying estimation task and the assumptions on the distribution family of the uncorrupted data. Roughly speaking, we require that the uniform distribution over a stable set behaves similarly to the distribution with respect to higher moments and, potentially, tail bounds. Importantly, we require that these conditions hold even after removing an arbitrary -fraction of points in .

The notion of a stable set must have two critical properties: (1) A set of i.i.d. samples from is stable with high probability, when is at least a sufficiently large polynomial in the relevant parameters; and (2) If is a stable set and is obtained from by changing at most an -fraction of the points in , then the algorithm when run on the set will succeed.

The robust mean estimation algorithms that will be presented in this section crucially rely on considering sample means and covariances. The following stability condition is an important ingredient in the success criteria of these algorithms:

###### Definition 2.1 (Stability Condition).

Fix and . A finite set is -stable (with respect to a distribution ) if for every unit vector and every with , the following conditions hold:

1. and

The aforementioned stability condition or a variant thereof is used in every known robust mean estimation algorithm. Definition 2.1 requires that after restricting to a -density subset , the sample mean of is within of the mean of , , and the sample variance of is in every direction. (We note that Definition 2.1 is intended for distributions with covariance or, more generally, . The case of arbitrary known or bounded covariance can be reduced to this case via an appropriate linear transformation of the data.)

The fact that the conditions of Definition 2.1 must hold for every large subset of might make it unclear if they can hold with high probability. However, one can show the following:

###### Proposition 2.2.

A set of i.i.d. samples from an identity covariance sub-gaussian distribution of size is -stable with high probability.

We sketch a proof of Proposition 2.2. The only property required for the proof is that the distribution of the uncorrupted data has identity covariance and sub-gaussian tails in each direction, i.e., the tail probability of each univariate projection is bounded from above by the Gaussian tail.

Fix a direction . To show that the first condition holds for this , we note that we can maximize by removing from the -fraction of points for which is smallest. Since the empirical mean of is close to with high probability, we need to understand how much this quantity is altered by removing the -tail in the -direction. Given our assumptions on the distribution of the uncorrupted data, removing the -tail only changes the mean by . Therefore, if the empirical distribution of , , behaves like a spherical Gaussian in this way, the first condition is satisfied.

The second condition follows via a similar analysis. We can minimize the relevant quantity by removing the -fraction of points with as large as possible. If is distributed like a unit-variance sub-gaussian, the total mass of its square over the -tails is . We have thus established that both conditions hold with high probability for any fixed direction. Showing that the conditions hold with high probability for all directions simultaneously can be shown by an appropriate covering argument.

More generally, one can show quantitatively different stability conditions under various distributional assumptions. In particular, we state the following proposition without proof. (The interested reader is referred to [22] for a proof.)

###### Proposition 2.3.

A set of i.i.d. samples from a distribution with covariance of size is -stable with high probability.

We note that analogous bounds can be easily shown for identity covariance distributions with bounded higher central moments. For example, if our distribution has identity covariance and its -th central moment is bounded from above by a constant, one can show that a set of samples is -stable with high probability.

The aforementioned notion of stability is powerful and suffices for robust mean estimation. Some of the algorithms that will be presented in this section only work under the stability condition, while others require additional conditions beyond stability.

The main reason why stability suffices is quantified in the following lemma:

###### Lemma 2.4 (Certificate for Empirical Mean).

Let be an -stable set with respect to a distribution , for some . Let be an -corrupted version of . Let and be the empirical mean and covariance of . If the largest eigenvalue of is at most , for some , then

###### Proof of Lemma 2.4..

Let and . By replacing with a subset if necessary, we may assume that and . Let represent the empirical means and covariance matrices of and . A simple calculation gives that

 ΣT=(1−ϵ)ΣS′+ϵΣT′+ϵ(1−ϵ)(μS′−μT′)(μS′−μT′)T.

Let be the unit vector in the direction of . We have that

 1+λ≥vTΣTv =(1−ϵ)vTΣS′v+ϵvTΣT′v+ϵ(1−ϵ)vT(μS′−μT′)(μS′−μT′)Tv ≥(1−ϵ)(1−δ2/ϵ)+ϵ(1−ϵ)∥μS′−μT′∥22 ≥1−O(δ2/ϵ)+(ϵ/2)∥μS′−μT′∥22,

where we used the variational characterization of eigenvalues, the fact that is positive semidefinite, and the second stability condition for . By rearranging, we obtain that Therefore, we can write

 ∥μT−μX∥2 =∥(1−ϵ)μS′+ϵμT′−μX∥2=∥μS′−μX+ϵ(μT′−μS′)∥2 ≤∥μS′−μX∥2+ϵ∥μS′−μT′∥2=O(δ)+ϵ⋅O(δ/ϵ+√λ/ϵ) =O(δ+√λϵ),

where we used the first stability condition for and our bound on . ∎

We note that the proof of Lemma 2.4 only used the lower bound part in the second condition of Definition 2.1, i.e., that the sample variance of in each direction is at least . The upper bound part will be crucially used in the design and analysis of our robust mean estimation algorithms in the following sections.

Lemma 2.4 says that if our input set of points is an -corrupted version of any stable set and has bounded sample covariance, the sample mean of closely approximates the true mean. This lemma, or a variant thereof, is a key result in all known robust mean estimation algorithms.

Unfortunately, we are not always guaranteed that the set we are given has this property. In order to deal with this, we will want to find a subset of with bounded covariance and large intersection with . However, for some of the algorithms presented, it will be convenient to find a probability distribution over rather than a subset. For this, we will need a slight generalization of Lemma 2.4.

###### Lemma 2.5.

Let be an -stable set with respect to a distribution , for some with . Let be a probability distribution that differs from , the uniform distribution over , by at most in total variation distance. Let and be the mean and covariance of . If the largest eigenvalue of is at most , for some , then

Note that this subsumes Lemma 2.4 by letting be the uniform distribution over . The proof is essentially identical to that of Lemma 2.4, except that we need to show that the mean and variance of the conditional distribution are approximately correct, whereas in Lemma 2.4 the bounds on the mean and variance of followed directly from stability.

Lemma 2.5 clarifies the goal of our outlier removal procedure. In particular, given our initial -corrupted set , we will attempt to find a distribution supported on so that has no large eigenvalues. The weight , , quantifies our belief whether point is an inlier or an outlier. We will also need to ensure that any such we choose is close to the uniform distribution over .

More concretely, we now describe a framework that captures our robust mean estimation algorithms. We start with the following definition:

###### Definition 2.6.

Let be a -stable set with respect to and let be an -corrupted version of . Let be the set of all probability distributions supported on , where , for all .

We note that any distribution in differs from , the uniform distribution on , by at most . Indeed, for , we have that:

 dTV(US,W) =∑x∈Tmax{W(x)−US(x),0} =∑x∈S∩Tmax{W(x)−1/|T|,0}+∑x∈T∖SW(x) ≤∑x∈S∩Tϵ|T|(1−ϵ)+∑x∈T∖S1|T|(1−ϵ) ≤|T|(ϵ|T|(1−ϵ))+ϵ|T|(1|T|(1−ϵ)) =2ϵ1−ϵ≤3ϵ.

Therefore, if we find with having no large eigenvalues, Lemma 2.5 implies that is a good approximation to . Fortunately, we know that such a exists! In particular, if we take to be , the uniform distribution over , the largest eigenvalue is at most , and thus we achieve -error .

At this point, we have an inefficient algorithm for approximating : Find any with bounded covariance. The remaining question is how we can efficiently find one. There are two basic algorithmic techniques to achieve this, that we present in the subsequent subsections.

The first algorithmic technique we will describe is based on convex programming. We will call this the unknown convex programming method. Note that is a convex set and that finding a point in that has bounded covariance is almost a convex program. It is not quite a convex program, because the variance of , for fixed , is not a convex function of . However, one can show that given a with variance in some direction significantly larger than , we can efficiently construct a hyperplane separating from (recall that is the uniform distribution over ) (Section 2.3). This method has the advantage of naturally working under only the stability assumption. On the other hand, as it relies on the ellipsoid algorithm, it is quite slow (although polynomial time).

Our second technique, which we will call filtering, is an iterative outlier removal method that is typically faster, as it relies primarily on spectral techniques. The main idea of the method is the following: If does not have large eigenvalues, then the empirical mean is close to the true mean. Otherwise, there is some unit vector such that is substantially larger than it should be. This can only be the case if assigns substantial mass to elements of that have values of very far from the true mean of . This observation allows us to perform some kind of outlier removal, in particular by removing (or down-weighting) the points that have inappropriately large. An important conceptual property is that one cannot afford to remove only outliers, but it is possible to ensure that more outliers are removed than inliers. Given a where has a large eigenvalue, one filtering step gives a new distribution that is closer to than was. Repeating the process eventually gives a with no large eigenvalues. The filtering method and its variations are discussed in Section 2.4.

### 2.3 The Unknown Convex Programming Method

By Lemma 2.5, it suffices to find a distribution with having no large eigenvalues. We note that this condition almost defines a convex program. This is because is a convex set of probability distributions and the bounded covariance condition says that for all unit vectors . Unfortunately, the variance is not quite linear in . (If we instead had , where is some fixed vector, this would be linear in .) However, we will show that a unit vector for which is too large, can be used to obtain a separation oracle, i.e., a linear function for which .

Suppose that we identify a unit vector such that , where for a sufficiently large universal constant . Applying Lemma 2.5 to the one-dimensional projection , gives

Let and note that is a linear function of the probability distribution with . We can write

 L(W∗) =EW∗[|v⋅(W∗−μW)|2]=Var[v⋅W∗]+|v⋅(μW−μW∗)|2 ≤1+δ2/ϵ+2|v⋅(μW−μX)|2+2|v⋅(μW∗−μX)|2 ≤1+O(δ2/ϵ+ϵλ)<1+λ=L(W).

In summary, we have an explicit convex set of probability distributions from which we want to find one with eigenvalues bounded by . Given any which does not satisfy this condition, we can produce a linear function that separates from . Using the ellipsoid algorithm, we obtain the following general theorem:

###### Theorem 2.7.

Let be a -stable set with respect to a distribution and let be an -corrupted version of . There exists a polynomial time algorithm which given returns such that .

##### Implications for Concrete Distribution Families.

Combining Theorem 2.7 with corresponding stability bounds, we obtain a number of concrete applications for various distribution families of interest. Using Proposition 2.2, we obtain:

###### Corollary 2.8 (Identity Covariance Sub-gaussian Distributions).

Let be an -corrupted set of samples of size from an identity covariance sub-gaussian distribution on . There exists a polynomial time algorithm which given returns such that with high probability .

We note that Corollary 2.8 can be immediately adapted for identity covariance distributions satisfying weaker concentration assumptions. For example, if satisfies sub-exponential concentration in each direction, we obtain an efficient robust mean estimation algorithm with -error of . If has identity covariance and bounded -th central moments, , we obtain error . These error bounds are information-theoretically optimal, within constant factors.

For distributions with unknown bounded covariance, using Proposition 2.3 we obtain:

###### Corollary 2.9 (Unknown Bounded Covariance Distributions).

Let be an -corrupted set of samples of size from a distribution on with unknown bounded covariance . There exists a polynomial time algorithm which given returns such that with high probability .

By the discussion following Fact 1.2, the above error bound is also information-theoretically optimal, within constant factors.

### 2.4 The Filtering Method

As in the unknown convex programming method, the goal of the filtering method is to find a distribution so that has bounded eigenvalues. Given a , either has bounded eigenvalues (in which case the weighted empirical mean works) or there is a direction in which is too large. In the latter case, the projections must behave very differently from the projections or . In particular, since an -fraction of outliers are causing a much larger increase in the standard deviation, this means that the distribution of will have many “extreme points” — more than one would expect to find in . This fact allows us to identify a non-empty subset of extreme points the majority of which are outliers. These points can then be removed (or down-weighted) in order to “clean up” our sample. Formally, given a without bounded eigenvalues, we can efficiently find a so that is closer to than was. Iterating this procedure eventually terminates giving a with bounded eigenvalues.

We note that while it may be conceptually useful to consider the above scheme for general distributions over points, in most cases it suffices to consider only given as the uniform distribution over some set of points. The filtering step in this case consists of replacing the set by some subset , where . To guarantee progress towards (the uniform distribution over ), it suffices to ensure that at most a third of the elements of are also in , or equivalently that at least two thirds of the removed points are outliers (perhaps in expectation). The algorithm will terminate when the current set of points has bounded empirical covariance, and the output will be the empirical mean of .

Before we proceed with a more detailed technical discussion, we note that there are several possible ways to implement the filtering step, and that the method used has a significant impact on the analysis. In general, a filtering step removes all points that are “far” from the sample mean in a large variance direction. However, the precise way that this is quantified can vary in important ways.

#### 2.4.1 Basic Filtering

In this subsection, we present a filtering method that yields efficient robust mean estimators with optimal error bounds for identity covariance (or, more generally, known covariance) distributions whose univariate projections satisfy appropriate tail bounds. For the purpose of this section, we will restrict ourselves to the Gaussian setting. We note however that this method immediately extends to distributions with weaker concentration properties, e.g., sub-exponential or even inverse polynomial concentration, with appropriate modifications.

We note that the filtering method presented here requires an additional condition on our good set of samples, on top of the stability condition. This is quantified in the following definition:

###### Definition 2.10.

A set is tail-bound-good (with respect to ) if for any unit vector , and any , we have

 Prx∼uS[|v⋅(x−μX)|>2t+2]≤e−t2/2. (1)

Since any univariate projection of is distributed like a standard Gaussian, Equation (1) should hold if the uniform distribution over were replaced by . It can be shown that this condition holds with high probability if consists of i.i.d. random samples from of a sufficiently large polynomial size [21].

Intuitively, the additional tail condition of Definition 2.10 is needed to guarantee that the filtering method used here will remove more outliers than inliers. Formally, we have the following:

###### Lemma 2.11.

Let be a sufficiently small constant. Let be both -stable and tail-bound-good with respect to , with , for a sufficiently large constant. Let be such that and assume we are given a unit vector for which . There exists a polynomial time algorithm that returns a subset satisfying .

To see why Lemma 2.11 suffices for our purposes, note that by replacing by , we obtain a less noisy version of than was. In particular, it is easy to see that the size of the symmetric difference between and is strictly smaller than the size of the symmetric difference between and . From this it follows that the hypothesis still holds when is replaced by , allowing us to iterate this process until we are left with a set with small variance.

###### Proof.

Let . By applying Lemma 2.4 to the set , we get that . By (1), it follows that We claim that there exists a threshold such that

 Prx∼uT[|v⋅(x−μT)|>2t0+2+c√λϵ]>4e−t20/2, (2)

where the constants have not been optimized. Given this claim, the set will satisfy the conditions of the lemma.

To prove our claim, we analyze the variance of and note that much of the excess must be due to points in . In particular, by our assumption on the variance in the -direction, , where . The contribution from the points is at most

 ∑x∈S|v⋅(x−μT)|2 =|S|(Var[v⋅S]+|v⋅(μT−μS)|2)≤|S|(1+δ2/ϵ+2c2λϵ) ≤|T|(1+2c2λϵ+3λ/5),

where the first inequality uses the stability of , and the last uses that . If is sufficiently small relative to , it follows that On the other hand, by definition we have:

 ∑x∈T∖S|v⋅(x−μT)|2=|T|∫∞02tPrx∼uT[|v⋅(x−μT)|>t,x∉S]dt. (3)

Assume for the sake of contradiction that there is no for which Equation (2) is satisfied. Then the RHS of (3) is at most

 |T|(∫2+c√λϵ+10√log(1/ϵ)02tPrx∼uT[x∉S]dt+∫∞2+c√λϵ+10√log(1/ϵ)2tPrx∼uT[|v⋅(x−μT)|>t]dt) ≤ |T|(ϵ(2+c√λϵ+10√log(1/ϵ))2+∫∞5√log(1/ϵ)16(2t+2+c√λϵ)e−t2/2dt) ≤ |T|(O(c2λϵ2+ϵlog(1/ϵ))+O(ϵ2(√log(1/ϵ)+c√λϵ))) ≤ |T|O(c2λϵ2+(δ2/ϵ)/c)<|T|λ/3,

which is a contradiction. Therefore, the tail bounds and the concentration violation together imply the existence of such a (which can be efficiently computed). ∎

#### 2.4.2 Randomized Filtering

The basic filtering method of the previous subsection is deterministic, relying on the violation of a concentration inequality satisfied by the inliers. In some settings, deterministic filtering seems to fail to give optimal results and we require the filtering procedure to be randomized. A concrete such setting is when the uncorrupted distribution is only assumed to have bounded covariance.

The main idea of randomized filtering is simple: Suppose we can identify a non-negative function , defined on the samples , for which (under some high probability condition on the inliers) it holds that , where is an -corrupted set of samples and is the corresponding set of inliers. Then we can create a randomized filter by removing each sample point with probability proportional to . This ensures that the expected number of outliers removed is at least the expected number of inliers removed. The analysis of such a randomized filter is slightly more subtle, so we will discuss it in the following paragraphs.

The key property the above randomized filter ensures is that the sequence of random variables (where “inliers” are points in and “outliers” points in ) across iterations is a sub-martingale. Since the total number of outliers removed across all iterations accounts for at most an -fraction of the total samples, this means that with probability at least , at no point does the algorithm remove more than a -fraction of the inliers. A formal statement follows:

###### Theorem 2.12.

Let be a -stable set (with respect to ) and be an -corrupted version of . Suppose that given any with for which has an eigenvalue bigger than , for some , there is an efficient algorithm that computes a non-zero function such that . Then there exists a polynomial time randomized algorithm that computes a vector that with probability at least satisfies

The algorithm is described in pseudocode below:

Algorithm Randomized Filtering Compute and its largest eigenvalue . If , return . Else Compute as guaranteed in the theorem statement. Remove each with probability and return to Step with the new set .

###### Proof of Theorem 2.12.

First, it is easy to see that this algorithm runs in polynomial time. Indeed, as the point attaining the maximum value of is definitely removed in each filtering iteration, each iteration reduces by at least one. To establish correctness, we will show that, with probability at least , at each iteration of the algorithm it holds . Assuming this claim, Lemma 2.4 implies that our final error will be as desired.

To prove the desired claim, we consider the sequence of random variables across the iterations of the algorithm. We note that, initially, and that cannot drop below . Finally, we note that at each stage of the algorithm decreases by , and that the expectation of this quantity is

 ∑x∈S∖Tf(x)−∑x∈T∖Sf(x)=2∑x∈S∩Tf(x)−∑x∈Tf(x)≤0.

This means that is a sub-martingale (at least until we reach a point where ). However, if we set a stopping time at the first occasion where this condition fails, we note that the expectation of is at most . Since it is at least , this means that with probability at least it is never more than , which would imply that throughout the algorithm. If this is the case, the inequality will continue to hold throughout our algorithm, thus eventually yielding such a set with the variance of bounded. By Lemma 2.4, the mean of this will be a suitable estimate for the true mean. ∎

##### Methods of Point Removal.

The randomized filtering method described above only requires that each point is removed with probability , without any assumption of independence. Therefore, given an , there are several ways to implement this scheme. A few natural ones are given here:

• Randomized Thresholding: Perhaps the easiest method for implementing our randomized filter is generating a uniform random number and removing all points for which . This method is practically useful in many applications. Finding the set of such points is often fairly easy, as this condition may well correspond to a simple threshold.

• Independent Removal: Each is removed independently with probability . This scheme has the advantage of leading to less variance in . A careful analysis of the random walk involved allows one to reduce the failure probability to .

• Deterministic Reweighting: Instead of removing points, this scheme allows for weighted sets of points. In particular, each point will be assigned a weight in and we will consider weighted means and covariances. Instead of removing a point with probability proportional to , we can remove a fraction of ’s weight proportional to . This ensures that the appropriate weighted version of is definitely non-increasing, implying deterministic correctness of the algorithm.

##### Practical Considerations.

While the aforementioned point removal methods have similar theoretical guarantees, recent implementations [24] suggest that they have different practical performance on real datasets. The deterministic reweighting method is somewhat slower in practice as its worst-case runtime and its typical runtime are comparable. In more detail, one can guarantee termination by setting the constant of proportionality so that at each step at least one of the non-zero weights is set to zero. However, in practical circumstances, we will not be able to do better. That is, the algorithm may well be forced to undergo iterations. On the other hand, the randomized versions of the algorithm are likely to remove several points of at each filtering step.

Another reason why the randomized versions may be preferable has to do with the quality of the results. The randomized algorithms only produce bad results when there is a chance that ends up being very large. However, since is a sub-martingale, this will only ever be the case if there is a corresponding possibility that will be exceptionally small. Thus, although the randomized algorithms may have a probability of giving worse results some of the time, this will only happen if a corresponding fraction of the time, they also give better results than the theory guarantees. This consideration suggests that the randomized thresholding procedure might have advantages over the independent removal procedure precisely because it has a higher probability of failure. This has been observed experimentally in [24]: In real datasets (poisoned with a constant fraction of adversarial outliers), the number of iterations of randomized filtering is typically bounded by a small constant.

#### 2.4.3 Universal Filtering

In this subsection, we show how to use randomized filtering to construct a universal filter that works under only the stability condition (Definition 2.1) — not requiring the tail-bound condition of the basic filter (Lemma 2.11). Formally, we show:

###### Proposition 2.13.

Let be an -stable set for sufficiently small constants and at least a sufficiently large multiple of . Let be an -corrupted version of . Suppose that has largest eigenvalue . Then there exists a computationally efficient algorithm that, on input , computes a non-zero function satisfying .

By combining Theorem 2.12 and Proposition 2.13, we obtain a filter-based algorithm establishing Theorem 2.7.

###### Proof of Proposition 2.13..

The algorithm to construct is the following: We start by computing the sample mean and the top (unit) eigenvector of . For , we let . Let be the set of elements of on which is largest. We define to be for , and for .

Our basic plan of attack is as follows: First, we note that the sum of over is the variance of , , which is substantially larger than the sum of over , which is approximately the variance of , . Therefore, the sum of over the elements of must be quite large. In fact, using the stability condition, we can show that the latter quantity must be larger than the sum of the largest values of over . However, since , we have that

We now proceed with the detailed analysis. First, note that

 ∑x∈Tg(x)=|T|Var[v⋅T]=|T|(1+λ).

Moreover, for any with , we have that

 ∑x∈S′g(x)=|S′|(Var[v⋅S′]+(v⋅(μT−μS′))2). (4)

By the second stability condition, we have that . Furthermore, the stability condition and Lemma 2.4 give

 ∥μT−μS′∥2≤∥μT−μX∥2+∥μX−μS′∥2=O(δ+√ϵλ).

Since , combining the above gives that . Moreover, since and since takes its largest values on points , we have that

 ∑x∈Tf(x)=∑x∈Lg(x)≥∑x∈T∖Sg(x)≥(16/3)|S|δ2/ϵ.

Comparing the results of Equation (4) with