Outlier-Robust High-Dimensional Sparse Estimation
via Iterative Filtering
We study high-dimensional sparse estimation tasks in a robust setting where a constant fraction of the dataset is adversarially corrupted. Specifically, we focus on the fundamental problems of robust sparse mean estimation and robust sparse PCA. We give the first practically viable robust estimators for these problems. In more detail, our algorithms are sample and computationally efficient and achieve near-optimal robustness guarantees. In contrast to prior provable algorithms which relied on the ellipsoid method, our algorithms use spectral techniques to iteratively remove outliers from the dataset. Our experimental evaluation on synthetic data shows that our algorithms are scalable and significantly outperform a range of previous approaches, nearly matching the best error rate without corruptions.
The task of leveraging sparsity to extract meaningful information from high-dimensional datasets is a fundamental problem of significant practical importance, motivated by a range of data analysis applications. Various formalizations of this general problem have been investigated in statistics and machine learning for at least the past two decades, see, e.g., [Hastie15] for a recent textbook on the topic. This paper focuses on the unsupervised setting and in particular on estimating the parameters of a high-dimensional distribution under sparsity assumptions. Concretely, we study the problems of sparse mean estimation and sparse PCA under natural data generating models.
The classical setup in statistics is that the data was generated by a probabilistic model of a given type. This is a simplifying assumption that is only approximately valid, as real datasets are typically exposed to some source of contamination. The field of robust statistics [Huber64, Huber09, HampelEtalBook86] aims to design estimators that are robust in the presence of model misspecification. In recent years, designing computationally efficient robust estimators for high-dimensional settings has become a pressing challenge in a number of applications. These include the analysis of biological datasets, where natural outliers are common [RP-Gen02, Pas-MG10, Li-Science08] and can contaminate the downstream statistical analysis, and data poisoning attacks [Barreno2010], where even a small fraction of fake data (outliers) can substantially degrade the learned model [BiggioNL12, SteinhardtKL17].
This discussion motivates the design of robust estimators that can tolerate a constant fraction of adversarially corrupted data. We will use the following model of corruptions (see, e.g., [DKKLMS16]):
Given and a family of distributions on , the adversary operates as follows: The algorithm specifies some number of samples , and samples are drawn from some (unknown) . The adversary is allowed to inspect the samples, removes of them, and replaces them with arbitrary points. This set of points is then given to the algorithm. We say that a set of samples is -corrupted if it is generated by the above process.
Our model of corruptions generalizes several other robustness models, including Huber’s contamination model [Huber64] and the malicious PAC model [Valiant:85, keali93].
In the context of robust sparse mean estimation, we are given an -corrupted set of samples from an unknown mean Gaussian distribution , where is assumed to be -sparse, and the goal is to output a hypothesis vector that approximates in -norm. In the context of robust sparse PCA (in the spiked covariance model), we are given an -corrupted set of samples from , where is assumed to be -sparse and the goal is to approximate . In both settings, we would like to design computationally efficient estimators with sample complexity , i.e., close to the information theoretic minimum, that achieve near-optimal error guarantees.
Until recently, even for the simplest high-dimensional parameter estimation settings, no polynomial time robust learning algorithms with dimension-independent error guarantees were known. Two concurrent works [DKKLMS16, LaiRV16] made the first progress on this front for the unsupervised setting. Specifically, [DKKLMS16, LaiRV16] gave the first polynomial time algorithms for robustly learning the mean and covariance of high-dimensional Gaussians and other models. These works focused on the dense regime and as a result did not obtain algorithms with sublinear sample complexity in the sparse setting. Building on [DKKLMS16], more recent work [BDLS17] obtained sample efficient polynomial time algorithms for the robust sparse setting, and in particular for the problems of robust sparse mean estimation and robust sparse PCA studied in this paper. These algorithms are based the unknown convex programming methodology of [DKKLMS16] and in particular inherently rely on the ellipsoid algorithm. Moreover, the separation oracle required for the ellipsoid algorithm turns out to be another convex program — corresponding to an SDP to solve sparse PCA. As a consequence, the running time of these algorithms, while polynomially bounded, is impractically high.
1.2 Our Results and Techniques
The main contribution of this paper is the design of significantly faster robust estimators for the aforementioned high-dimensional sparse problems. More specifically, our algorithms are iterative and each iteration involves a simple spectral operation (computing the largest eigenvalue of an approximate matrix). Our algorithms achieve the same error guarantee as [BDLS17] with similar sample complexity. At the technical level, we enhance the iterative filtering methodology of [DKKLMS16] to the sparse setting, which we believe is of independent interest and could lead to faster algorithms for other robust sparse estimation tasks as well.
For robust sparse mean estimation, we show:
Theorem 1.2 (Robust Sparse Mean Estimation).
Let be a Gaussian distribution on with unknown -sparse mean vector , and . Let be an -corrupted set of samples from of size . There exists an algorithm that, on input , , and runs in polynomial time returns such that with probability at least it holds
Some comments are in order. First, the sample complexity of our algorithm is asymptotically the same as that of [BDLS17], and matches the lower bound of [DKS17-sq] against Statistical Query algorithms for this problem. The major advantage of our algorithm over [BDLS17] is that while their algorithm made use of the ellipsoid method, ours uses only spectral techniques and is scalable.
For robust sparse PCA in the spiked covariance model, we show:
Theorem 1.3 (Robust Sparse PCA).
Let be a Gaussian distribution on with spiked covariance for an unknown -sparse unit vector , and . For , let be an -corrupted set of samples from of size . There exists an algorithm that, on input , , and , runs in polynomial time and returns such that with probability at least we have that .
The sample complexity upper bound in Theorem 1.3 is somewhat worse than the information theoretic optimum of . While the ellipsoid-based algorithm of [BDLS17] achieves near-optimal sample complexity (within logarithmic factors), our algorithm is practically viable as it only uses spectral operations. We also note that the sample complexity in our above theorem is not known to be optimal for our algorithm. It seems quite plausible, via a tighter analysis, that our algorithm in fact has near-optimal sample complexity as well.
For both of our algorithms, in the most interesting regime of , the running time per iteration is dominated by the computation of the empirical covariance matrix. The number of iterations is at most , although it typically is much smaller, so both algorithms take at most time.
1.3 Related Work
There is extensive literature on exploiting sparsity in statistical estimation (see, e.g., [Hastie15]). In this section, we summarize the related work that is directly related to the results of this paper. Sparse mean estimation is arguably one of the most fundamental sparse estimation tasks and is closely related to the Gaussian sequence model [Tsybakov08, J17]. The task of sparse PCA in the spiked covariance model, initiated in [J01], has been extensively investigated (see Chapter 8 of [Hastie15] and references therein). In this work, we design algorithms for the aforementioned problems that are robust to a constant fraction of outliers.
Learning in the presence of outliers is an important goal in statistics studied since the 1960s [Huber64]. See, e.g., [Huber09, HampelEtalBook86] for book-length introductions in robust statistics. Until recently, all known computationally efficient high-dimensional estimators could tolerate a negligible fraction of outliers, even for the task of mean estimation. Recent work [DKKLMS16, LaiRV16] gave the first efficient robust estimators for basic high-dimensional unsupervised tasks, including mean and covariance estimation. Since the dissemination of [DKKLMS16, LaiRV16], there has been a flurry of research activity on computationally efficient robust learning in high dimensions [BDLS17, CSV17, DKK+17, DKS17-sq, DiakonikolasKKLMS18, SteinhardtCV18, DiakonikolasKS18-mixtures, DiakonikolasKS18-nasty, HopkinsL18, KothariSS18, PrasadSBR2018, DiakonikolasKKLSS2018sever, KlivansKM18, DKS19-lr, LSLC18-sparse, ChengDKS18, ChengDR18, CDGW19].
In the context of robust sparse estimation, [BDLS17] obtained sample-efficient and polynomial time algorithms for robust sparse mean estimation and robust sparse PCA. The main difference between [BDLS17] and the results of this paper is that the [BDLS17] algorithms use the ellipsoid method (whose separation oracle is an SDP). Hence, these algorithms are prohibitively slow for practical applications. More recent work [LSLC18] gave an iterative method for robust sparse mean estimation, which however requires multiple solutions to a convex relaxation for sparse PCA in each iteration. Finally, [LLC19] proposed an algorithm for robust sparse mean estimation via iterative trimmed hard thresholding. While this algorithm seems practically viable in terms of runtime, it can only tolerate – i.e., sub-constant – fraction of corruptions.
1.4 Paper Organization
In Section 2, we describe our algorithms and provide a detailed sketch of their analysis. In Section 6, we report detailed experiments demonstrating the performance of our algorithms on synthetic data in various parameter regimes. Due to space limitations, the full proofs of correctness for our algorithms can be found in the full version of this paper.
In this section, we describe our algorithms in tandem with a detailed outline of the intuition behind them and a sketch of their analysis. Due to space limitations, the proof of correctness is deferred to the full version of our paper.
At a high-level, our algorithms use the iterative filtering methodology of [DKKLMS16]. The main idea is to iteratively remove a small subset of the dataset, so that eventually we have removed all the important outliers and the standard estimator (i.e., the estimator we would have used in the noiseless case) works. Before we explain our new ideas that enhance the filtering methodology to the sparse setting, we provide a brief technical description of the approach.
Overview of Iterative Filtering.
The basic idea of iterative filtering [DKKLMS16] is the following: In a given iteration, carefully pick some test statistic (such as for a well-chosen ). If there were no outliers, this statistic would follow a nice distribution (with good concentration properties). This allows us to do some sort of statistical hypothesis testing of the “null hypothesis” that each is an inlier, rejecting it (and believing that is an outlier) if is far from the expected distribution. Because there are a large number of such hypotheses, one uses a procedure reminiscent of the Benjamini-Hochberg procedure [benjamini1995controlling] to find a candidate set of outliers with low false discovery rate (FDR), i.e., a set with more outliers than inliers in expectation. This procedure looks for a threshold such that the fraction of points with test statistic above is at least a constant factor more than it “should” be. If such a threshold is found, those points are mostly outliers and can be safely removed. The key goal is to judiciously design a test statistic such that either the outliers aren’t particularly important—so the naive empirical solution is adequate—or at least one point will be filtered out.
In other words, the goal is to find a test statistic such that, if the distribution of the test statistic is “close” to what it would be in the outlier-free world, then the outliers cannot perturb the answer too much. An additional complication is that the test statistics depend on the data (such as , where is the principal component of the data) making the distribution on inliers also nontrivial. This consideration drives the sample complexity of the algorithms.
In the algorithms we describe below, we use a specific parameterized notion of a good set. We define these precisely in the supplementary material, briefly, any large enough sample drawn from the uncorrupted distribution will satisfy the structural properties required for the set to be good.
We now describe how to design such test statistics for our two sparse settings.
Before we describe our algorithms, we set up some notation. We define to be the thresholding operator that keeps the entries of with the largest magnitude and sets the rest to 0. For a finite set , we will use to mean that is chosen uniformly at random from . For and , let denote the matrix restricted to the submatrix.
Robust Sparse Mean Estimation.
Here we briefly describe the motivation and analysis of Algorithm 1, describing a single iteration of our filter for the robust sparse mean setting.
In order to estimate the -sparse mean , it suffices to ensure that our estimate has small for any -sparse unit vector . The now-standard idea in robust statistics [DKKLMS16] is that if a small number of corrupted samples suffice to cause a large change in our estimate of , then this must lead to a substantial increase in the sample variance of , which we can detect.
Thus, a very basic form of a robust algorithm might be to compute a sample covariance matrix , and let be the -sparse unit vector that maximizes . If this number is close to , it certifies that our estimate — obtained by truncating the sample mean to its -largest entries — is a good estimate of the true mean . If not, this will allow us to filter our sample set by throwing away the values where is furthest from the true mean. This procedure guarantees that we have removed more corrupted samples than uncorrupted ones. We then repeat the filter until the empirical variance in every sparse direction is close to .
Unfortunately, the optimization problem of finding the optimal is computationally challenging, requiring a convex program. To circumvent the need for a convex program, we notice that is large only if has large entries on the non-zero entries of . Thus, if the largest entries of had small -norm, this would certify that no such bad existed and would allow us to return the truncated sample mean. In case these entries have large -norm, we show that we can produce a filter that removes more bad samples than good ones. Let be the matrix consisting of the large entries of (for the moment assume that they are all off diagonal, but this is not needed). We know that the sample mean of . On the other hand, if approximates on the entries in question, we would have that . This means that if is reasonably large, an -fraction of corrupted points changed the mean of from to . This means that many of these errors must have had . This becomes very unlikely for good samples if is much larger than (by standard results on the concentration of Gaussian polynomials). Thus, if is approximately on these coordinates, we can produce a filter. To ensure this, we can use existing filter-based algorithms to approximate the mean on these coordinates. This results in Algorithm 1. For the analysis, we note that if the entries of are small, then must be small for any unit -sparse , which certifies that the truncated sample mean is good. Otherwise, we can filter the samples using the first kind of filter. This ensures that our mean estimate is sufficiently close to the true mean that we can then filter using the second kind of filter.
It is not hard to show that the above works if we are given sufficiently many samples, but to obtain a tight analysis of the sample complexity, we need a number of subtle technical ideas. The detailed analysis of the sample complexity is deferred to the full version of our paper.
Robust Sparse PCA
Here we briefly describe the motivation and analysis of Algorithm 2, describing a single iteration of our filter for the sparse PCA setting.
Note that estimating the -sparse vector is equivalent to estimating . In fact, estimating to error in Frobenius norm allows one to estimate within error in -norm. Thus, we focus on he task of robustly approximating the mean of .
Our algorithm is going to take advantage of one fact about that even errors cannot hide: that is large. This is because removing uncorrupted samples cannot reduce the variance by much more than an -fraction, and adding samples can only increase it. This means that an adversary attempting to fool our algorithm can only do so by creating other directions where the variance is large, or simply by adding other large entries to the sample covariance matrix in order to make it hard to find this particular -sparse eigenvector. In either case, the adversary is creating large entries in the empirical mean of that should not be there. This suggests that the largest entries of the empirical mean of , whether errors or not, will be of great importance.
These large entries will tell us where to focus our attention. In particular, we can find the largest entries of the empirical mean of and attempt to filter based on them. When we do so, one of two things will happen: Either we remove bad samples and make progress or we verify that these entries ought to be large, and thus must come from the support of . In particular, when we reach the second case, since the adversary cannot shrink the empirical variance of by much, almost all of the entries on the support of must remain large, and this can be captured by our algorithm.
The above algorithm works under a set of deterministic conditions on the good set of samples that are satisfied with high probability with samples. Our current analysis does not establish the information-theoretically optimal sample size of , though we believe that this plausible via a tighter analysis.
We note that a naive implementation of this algorithm will achieve error in our final estimate for , while our goal is to obtain error. To achieve this, we need to overcome two difficulties: First, when trying to filter on subsets of its coordinates, we do not know the true variance of , and thus cannot expect to obtain error. This is fixed with a bootstrapping method similar to that in [Kane18] to estimate the covariance of a Gaussian. In particular, we do not know a priori, but after we run the algorithm, we obtain an approximation to , which gives an approximation to . This in turn lets us get a better approximation to and a better approximation to ; and so on.
We will use the following notation and definitions.
For , let . Throughout this paper, for , we will use to denote its Euclidean norm. If , we will use to denote its spectral norm, to denote its Frobenius norm, and to denote its trace. We will also let and denote the PSD ordering on matrices. For a finite multiset , we will write to denote that is drawn from the empirical distribution defined by . Given finite multisets and we let be the size of the symmetric difference of and divided by the cardinality of .
For and , let be the vector with , , and otherwise. We denote by the thresholding operator that keeps the entries of with largest magnitude (breaking ties arbitrarily) and sets the rest to . For and , let denote the matrix restricted to the sub-matrix. For , then we will use to denote the matrix restricted to the elements whose entries are in
Let denote the Kronecker delta function. We will denote . The notation and hides logarithmic factors in the argument.
4 Robust Sparse Mean Estimation
Let be an identity covariance Gaussian distribution on with unknown -sparse mean vector , and . Let be an -corrupted set of samples from of size . There exists an efficient algorithm that, on input , , , and , returns a mean vector such that with probability at least it holds
4.1 Proof of Theorem 4.1
In this section, we describe and analyze our algorithm establishing Theorem 4.1. We start by formalizing the set of deterministic conditions on the good data under which our algorithm succeeds:
Fix and . A multiset of points in is -good with respect to if, for and , the following conditions hold:
For all , , and for all , .
For all and , we have .
For all -sparse unit vectors , we have that:
For all ,
For all homogeneous111Recall that a degree- polynomial is called homogeneous if its non-zero terms are all of degree exactly . degree- polynomials with and at most terms, we have that:
For all ,
Our first lemma says that a sufficiently large set of samples from is good with high probability:
A set of samples from is -good (with respect to ) with probability at least .
Our algorithm iteratively applies the procedure Robust-Sparse-Mean (Algorithm 1). The crux of the proof is the following performance guarantee of Robust-Sparse-Mean:
Algorithm 1 has the following performance guarantee: On input a multiset of points in such that , where is an -good set with respect to , procedure Robust-Sparse-Mean returns one of the following:
A mean vector such that , or
A multiset satisfying .
We note that our overall algorithm terminates after at most iterations of Algorithm 1, in which case it returns a candidate mean vector satisfying the first condition of Proposition 4.4. Note that the initial -corrupted set satisfies . If is the multiset returned after the -th iteration, then we have that .
In the rest of this section, we prove Proposition 4.4.
We start by showing the first part of Proposition 4.4. Note that Algorithm 1 outputs a candidate mean vector only if . We start with the following lemma:
If , then for any with , we have .
Fix with . By definition, is the Frobenius norm of the corresponding sub-matrix on . Note that this is the -norm of a set of diagonal entries and off-diagonal entries of . By construction, is the set that maximizes this norm, and therefore
Given this bound, we leverage a proof technique from [DKKLMS16] showing that a bound on the spectral norm of the covariance implies a -error bound on the mean. This implication is not explicitly stated in [DKKLMS16], but follows directly from the arguments in Section 5.1.2 of that work. In particular, the analysis of the “small spectral norm” case in that section shows that , from which the desired claim follows. This completes the proof of Lemma 4.5. ∎
Let . If , then .
For vectors , let denote the set of coordinates on which is non-zero and denote the set of coordinates on which is non-zero and is zero. Setting and in Lemma 4.5, we get that and .
If has or fewer non-zero coordinates, then and and we are done. Otherwise, has exactly non-zero coordinates and so . Since the nonzero coordinates of are the largest magnitude coordinates of , for any and , we have that . Since , at least one coordinate must have . Therefore, for any , we have that .
Thus, we have
where the second inequality used that .
Since , by the triangle inequality we have that . Finally, we have that
concluding the proof. ∎
We now analyze the complementary case that . In this case, we apply two different filters, a linear filter (Steps 7-10), and a quadratic filter (Steps 11-13). To prove the second part of Proposition 4.4, we will show that at least one of these two filters: (i) removes at least one point, and (ii) it removes more corrupted than uncorrupted points.
The analysis in the case of the linear filter follows by a reduction to the linear filter in [DKKLMS16] for the non-sparse setting (see Proposition 5.5 in Section 5.1 of that work). More specifically, the linear filter in Steps 7-10 is essentially identical to the linear filter of [DKKLMS16] restricted to the matrix . We note that Definition 4.2 implies that every restriction to coordinates satisfies the properties of the good set in the sense of [DKKLMS16] (Definition 5.2(i)-(ii) of that work). This implies that the analysis of the linear filter from [DKKLMS16] holds in our case, establishing the desired properties. Since the linear filter removes more corrupted points than uncorrupted points, it will remove at most a fraction of the points over all the iterations.
If the condition of the linear filter does not apply, i.e., if , the aforementioned analysis in [DKKLMS16] implies . In this case, we show that the second filter behaves appropriately.
Let be the polynomial considered in the quadratic filter. We start with the following technical lemma analyzing the expectation and variance of under various distributions:
The following hold true:
Let and . We have
Therefore, . Similarly,
We have thus shown (ii) and the first part of (i).
We now proceed to show the first part of (iii). Note that
Thus, we can write
This proves all the statements about expectations.
We now analyze the variance of for and . Since is symmetric, we can write for an orthogonal matrix and a diagonal matrix . Note that is distributed as . Under these substitutions, , the variance of which is the same as the variance of and so
Similarly, to estimate the variance of we see we just need to estimate the variance of where . Let and .
This gives us
This completes the proof of Lemma 4.7. ∎
Suppose that we find a threshold such that Step 12 of the algorithm holds, i.e., the quadratic filter applies. Then we can show that Step 13 removes more bad points than good points. This follows from standard arguments, by combining Definition 4.2(iv)(b) with our upper bound for from Lemma 4.7. Let . By Definition 4.2(iv)(b), for the good set , we have that for , . Lemma 4.7(iii) implies that . Therefore, shifting by we obtain the following corollary:
We have that:
For , .
Condition (ii) implies that the fraction of points in that violate the quadratic filter condition is less than the fraction of points in that violate the same condition. Therefore, the quadratic filter removes more corrupted points than uncorrupted ones.
It remains to show that if Algorithm 1 does not terminate in Step 6 and the linear filter does not apply, then the quadratic filter necessarily applies. To establish this, we need a couple more technical lemmas. We first show that the expectation of over the set of good samples that are removed is small:
We have that .
By a similar argument, we can show that if the quadratic filter does not apply, then the remaining points in contribute a small amount to the expectation of .
Suppose that for all , we have . Then, we have that .
By combining the above, we obtain the following corollary, completing the analysis of our algorithm: