High Dimensional Robust Sparse Regression

# High Dimensional Robust Sparse Regression

Liu Liu
liuliu@utexas.edu
Yanyao Shen
shenyanyao@utexas.edu
Tianyang Li
lty@cs.utexas.edu
Constantine Caramanis
constantine@utexas.edu

The University of Texas at Austin
###### Abstract

We provide a novel – and to the best of our knowledge, the first – algorithm for high dimensional sparse regression with constant fraction of corruptions in explanatory and/or response variables. Our algorithm recovers the true sparse parameters with sub-linear sample complexity, in the presence of a constant fraction of arbitrary corruptions. Our main contribution is a robust variant of Iterative Hard Thresholding. Using this, we provide accurate estimators: when the covariance matrix in sparse regression is identity, our error guarantee is near information-theoretically optimal. We then deal with robust sparse regression with unknown structured covariance matrix. We propose a filtering algorithm which consists of a novel randomized outlier removal technique for robust sparse mean estimation that may be of interest in its own right: the filtering algorithm is flexible enough to deal with unknown covariance. Also, it is orderwise more efficient computationally than the ellipsoid algorithm. Using sub-linear sample complexity, our algorithm achieves the best known (and first) error guarantee. We demonstrate the effectiveness on large-scale sparse regression problems with arbitrary corruptions.

## 1 Introduction

Learning in the presence of arbitrarily (even adversarially) corrupted outliers in the training data has a long history in Robust Statistics [27, 23, 42], and has recently received much renewed attention. The high dimensional setting poses particular challenges as outlier removal via preprocessing is essentially impossible when the number of variables scales with the number of samples. We propose a computationally efficient estimator for outlier-robust sparse regression that has near-optimal sample complexity, and is the first algorithm resilient to a constant fraction of arbitrary outliers with corrupted covariates and/or response variables. Unless we specifically mention otherwise, all future mentions of outliers mean corruptions in covariates and/or response variables.

We assume that the authentic samples are independent and identically distributed (i.i.d.) drawn from an uncorrupted distribution , where represents the linear model , where , and is the true parameter (see Section 1.3 for complete details and definitions). To model the corruptions, the adversary can choose an arbitrary -fraction of the authentic samples, and replace them with arbitrary values. We refer to the observations after corruption as -corrupted samples (Definition 1.1). This corruption model allows the adversary to select an -fraction of authentic samples to delete and corrupt, hence it is stronger than Huber’s -contamination model [26], where the adversary independently corrupts each sample with probability .

Outlier-robust regression is a classical problem within robust statistics (e.g., [39]), yet even in the low-dimensional setting, efficient algorithms robust to corruption in the covariates have proved elusive, until recent breakthroughs in [38, 15] and [31], which built on important results in Robust Mean Estimation [13, 33] and Sums of Squares [4], respectively.

In the sparse setting, the parameter we seek to recover is also -sparse, and a key goal is to provide recovery guarantees with sample complexity scaling with , and sublinearly with . Without outliers, by now classical results (e.g., [18]) show that samples from a i.i.d sub-Gaussian distribution are enough to give recovery guarantees on with and without additive noise. These strong assumptions on the probabilistic distribution are necessary, since in the worst case, sparse recovery is known to be NP-hard [3, 48].

Sparsity recovery with a constant fraction of arbitrary corruption is fundamentally hard. For instance, to the best of our knowledge, there’s no previous work can provide exact recovery for sparse linear equations with arbitrary corruption in polynomial time. In contrast, a simple exhaustive search can easily enumerate the samples and recover the sparse parameter in exponential time.

In this work, we seek to give an efficient, sample-complexity optimal algorithm that recovers to within accuracy depending on (the fraction of outliers). In the case of no additive noise, we are interested in algorithms that can guarantee exact recovery, independent of .

### 1.1 Related work

The last 10 years have seen a resurgence in interest in robust statistics, including the problem of resilience to outliers in the data. Important problems attacked have included PCA [32, 47, 46, 33, 13], and more recently robust regression (as in this paper) [38, 15, 17, 31] and robust mean estimation [13, 33, 1], among others. We focus now on the recent work most related to the present paper.

Robust regression. Earlier work in robust regression considers corruption only in the output, and shows that algorithms nearly as efficient as for regression without outliers succeeds in parameter recovery, even with a constant fraction of outliers [34, 37, 5, 6, 30]. Yet these algorithms (and their analysis) focus on corruption in , and do not seem to extend to the setting of corrupted covariates – the setting of this work. In the low dimensional setting, there has been remarkable recent progress. The work in [31] shows that the Sum of Squares (SOS) based semidefinite hierarchy can be used for solving robust regression. Essentially concurrent to the SOS work, [12, 24, 38, 15] use robust gradient descent for empirical risk minimization, by using robust mean estimation as a subroutine to compute robust gradients at each iteration. [17] uses filtering algorithm [13] for robust regression. Computationally, these latter works scale better than the algorithms in [31], as although the Sum of Squares SDP framework gives polynomial time algorithms, they are often not practical [25].

Much less appears to be known in the high-dimensional regime. Exponential time algorithm, such as [21, 29], optimizes Tukey depth [42, 10]. Their results reveal that handling a constant fraction of outliers () is actually minimax-optimal. Work in [11] first provided a polynomial time algorithm for this problem. They show that replacing the standard inner product in Matching Pursuit with a trimmed version, one can recover from an -fraction of outliers, with . Very recently, [36] considered more general sparsity constrained -estimation by using a trimmed estimator in each step of gradient descent, yet the robustness guarantee is still sub-optimal. Another approach follows as a byproduct of a recent algorithm for robust sparse mean estimation, in [1]. However, their error guarantee scales with , and moreover, does not provide exact recovery in the adversarial corruption case without stochastic noise (i.e., noise variance ). We note that this is an inevitable consequence of their approach, as they directly use sparse mean estimation on , rather than considering Maximum Likelihood Estimation.

Robust mean estimation. The idea in [38, 15, 17] is to leverage recent breakthroughs in robust mean estimation. Very recently, [13, 33] provided the first robust mean estimation algorithms that can handle a constant fraction of outliers (though [33] incurs a small (logarithmic) dependence in the dimension). Following their work, [1] extended the ellipsoid algorithm from [13] to robust sparse mean estimation in high dimensions. They show that -sparse mean estimation in with a constant fraction of outliers can be done with samples. The term appears to be necessary, as follows from an oracle-based lower bound [16].

### 1.2 Main contributions

• Our result is a robust variant of Iterative Hard Thresholding (IHT) [8]. We provide a deterministic stability result showing that IHT works with any robust sparse mean estimation algorithm. We show our robust IHT does not accumulate the errors of a (any) robust sparse mean estimation subroutine for computing the gradient. Specifically, robust IHT produces a final solution whose error is orderwise the same as the error guaranteed by an single use of the robust mean estimation subroutine. We refer to Definition 2.1 and Theorem 2.1 for the precise statement. Thus our result can be viewed as a meta-theorem that can be coupled with any robust sparse mean estimator.

• Coupling robust IHT with a robust sparse mean estimation subroutine based on a version of the ellipsoid algorithm given and analyzed in [1], our results Corollary 3.1 show that given -corrupted sparse regression samples with identity covariance, we recover within additive error (which is minimax optimal [21]). The proof of the ellipsoid algorithm’s performance in [1] hinges on obtaining an upper bound on the sparse operator norm (their Lemmas A.2 and A.3). As we show (see Appendix B), the statement of Lemma A.3 seems to be incorrect, and the general approach of upper bounding the sparse operator norm may not work. Nevertheless, the algorithm performance they claim is correct, as we show through a different avenue (see Lemma D.3 in Section D.3).

Using this ellipsoid algorithm, In particular, we obtain exact recovery if either the fraction of outliers goes to zero (this is just ordinary sparse regression), or in the presence of a constant fraction of outliers but with the additive noise term going to zero (this is the case of robust sparse linear equations). To the best of our knowledge, this is the first result that shows exact recovery for robust sparse linear equations with a constant fraction of outliers. This is the content of Section 3.

• For robust sparse regression with unknown covariance matrix, we consider the wide class of sparse covariance matrices [7]. We then prove a result that may be of interest in its own right: we provide a novel robust sparse mean estimation algorithm that is based on a filtering algorithm for sequentially screening and removing potential outliers. We show that the filtering algorithm is flexible enough to deal with unknown covariance, whereas the ellipsoid algorithm cannot. It also runs a factor of faster than the ellipsoid algorithm. If the covariance matrix is sufficiently sparse, our filtering algorithm gives a robust sparse mean estimation algorithm, that can then be coupled with our meta-theorem. Together, these two guarantee recovery of within an additive error of . In the case of unknown covariance, this is the best (and in fact, only) result we are aware of for robust sparse regression. We note that it can be applied to the case of known and identity covariance, though it is weaker than the optimal results we obtain using the computationally more expensive ellipsoid algorithm. Nevertheless, in both cases (unknown sparse, or known identity) the result is strong enough to guarantee exact recovery when either or goes to zero. We demonstrate the practical effectiveness of our filtering algorithm in Appendix H. This is the content of Section 4 and Section 5.

### 1.3 Setup, Notation and Outline

In this subsection, we formally define the corruption model and the sparse regression model. We first introduce the -corrupted samples described above:

###### Definition 1.1 (ϵ-corrupted samples).

Let be i.i.d. observations follow from a distribution . The -corrupted samples are generated by the following process: an adversary chooses an arbitrary -fraction of the samples in and modifies them with arbitrary values. After the corruption, we use to denote the observations, and use to denote the corruptions.

The parameter represents the fraction of outliers. Throughout, we assume that it is a (small) constant, independent of dimension or other problem parameters. Furthermore, we assume that the distribution is the standard Gaussian-design AWGN linear model.

###### Model 1.1.

The observations follow from the linear model , where is the model parameter, and assumed to be -sparse. We assume that and , where is the normalized covariance matrix with for all . We denote as the smallest eigenvalue of , and as its largest eigenvalue. They are assumed to be universal constants in this paper, and we denote the constant .

As in [1], we pre-process by removing “obvious” outliers; we henceforth assume that all authentic and corrupted points are within a radius bounded by a polynomial in , and .

Notation. We denote the hard thresholding operator of sparsity by . We define the -sparse operator norm as , where is not required to be positive-semidefinite (p.s.d.). We use trace inner produce to denote . We use to denote the expectation operator obtained by the uniform distribution over all samples in a set . Finally, we use the notation to hide the dependency on , and to hide the dependency on in our bounds.

## 2 Hard thresholding with robust gradient estimation

In this section, we present our method of using robust sparse gradient updates in IHT. We then show statistical recovery guarantees given any accurate robust sparse gradient estimation, which is formally defined in Definition 2.1.

We define the notation for the stochastic gradient corresponding to the point , and the population gradient for based on Model 1.1, where is the distribution of the authentic points. Since , the population mean of all authentic gradients is given by

In the uncorrupted case where all samples follow from Model 1.1, a single iteration of IHT updates via . Here, the hard thresholding operator selects the largest elements in magnitude, and the parameter is proportional to (specified in Theorem 2.1). However, given -corrupted samples according to Definition 1.1, the IHT update based on empirical average of all gradient samples can be arbitrarily bad.

The key goal in this paper is to find a robust estimate to replace in each step of IHT, with sample complexity sub-linear in the dimension . For instance, we consider robust sparse regression with . Then, is guaranteed to be -sparse in each iteration of IHT. In this case, given -corrupted samples, we can use a robust sparse mean estimator to recover the unknown true from , with sub-linear sample complexity.

More generally, we propose Robust Sparse Gradient Estimator (RSGE) for gradient estimation given -corrupted samples, as defined in Definition 2.1, which guarantees that the deviation between the robust estimate and true , with sample complexity . For a fixed -sparse parameter , we drop the superscript without abuse of notation, and use in place of , and in place of ; denotes the population gradient over the authentic samples’ distribution , at the point .

###### Definition 2.1 (ψ(ϵ)-Rsge).

Given -corrupted samples from Model 1.1, we call a -RSGE, if given , guarantees , with probability at least .

Here, we use to denote the sample complexity as a function of , and note that the definition of RSGE does not require to be identity matrix. The parameters and will be specified by concrete robust sparse mean estimators in subsequent sections. Equipped with Definition 2.1, we propose Algorithm 1, which takes any RSGE as a subroutine in line 7, and runs a robust variant of IHT with the estimated sparse gradient at each iteration in line 8.111Our results require sample splitting to maintain independence between subsequent iterations, though we believe this is an artifact of our analysis. Similar approach has been used in [2, 38] for theoretical analysis. We do not use sample splitting technique in the experiments.

### 2.1 Global linear convergence and parameter recovery guarantees

In each single IHT update step, RSGE introduces a controlled amount of error. Theorem 2.1 gives a global linear convergence guarantee for Algorithm 1 by showing that IHT does not accumulate too much error. In particular, we are able to recover within error given any -RSGE subroutine. We give the proof of Theorem 2.1 in Appendix A. The hyper-parameter guarantees global linear convergence of IHT when (when ). This setup has been used in [28, 40], and is proved to be necessary in [35]. Note that Theorem 2.1 is a deterministic stability result in nature, and we obtain probabilistic results by certifying the RSGE condition.

###### Theorem 2.1 (Meta-theorem).

Suppose we observe -corrupted samples from Model 1.1. Algorithm 1, with -RSGE defined in Definition 2.1, with step size outputs , such that with probability at least , by setting and . The sample complexity is .

## 3 Robust sparse regression with near-optimal guarantee

In this section, we provide near optimal statistical guarantee for robust sparse regression when the covariance matrix is identity. Under the assumption , [1] proposes a robust sparse regression estimator based on robust sparse mean estimation on , leveraging the fact that . With sample complexity , this algorithm produces a such that , with probability at least . Using Theorem 2.1, we show that we can obtain significantly stronger statistical guarantees which are statistically optimal; in particular, our guarantees are independent of and yield exact recovery when .

### 3.1 RSGE via the ellipsoid algorithm

More specifically, the ellipsoid-based robust sparse mean estimation algorithm [1] deals with outliers by trying to optimize the set of weights on each of the samples in – ideally outliers would receive lower weight and hence their impact would be minimized. Since the set of weights is convex, this can be approached using a separation oracle Algorithm 2. The Algorithm 2 depends on a convex relaxation of Sparse PCA, and the hard thresholding parameter is , as the population mean of all authentic gradient samples is guaranteed to be -sparse. In line 4 and 5, we calculate the weighted mean and covariance based on a hard thresholding operator. In line 6 of Algorithm 2, with each call to the relaxation of Sparse PCA, we obtain an optimal value, , and optimal solution, , to the problem:

 λ∗=maxHTr((ˆΣ−F(ˆG))⋅H),% subject to H≽0,∥H∥1,1≤˜k,Tr(H)=1. (1)

Here, , are weighted first and second order moment estimates from -corrupted samples, and is a function with closed-form

 F(ˆG)=∥ˆG∥22Id+ˆGˆG⊤+σ2Id. (2)

For eq. 2, given the population mean , we have , which calculates the underlying true covariance matrix. We provide more details about the calculation of , as well as some smoothness properties, in Appendix C.

The key component in the separation oracle Algorithm 2 is to use convex relaxation of Sparse PCA eq. 1. This idea generalizes existing work on using PCA to detect outliers in low dimensional robust mean estimation [13, 33]. To gain some intuition for eq. 1, if is an outlier, then the optimal solution of eq. 1, , may detect the direction of this outlier. And this outlier will be down-weighted in the output of Algorithm 2 by the separating hyperplane. Finally, Algorithm 2 will terminate with (line 7) and output the robust sparse mean estimation of the gradients .

Indeed, the ellipsoid-algorithm-based robust sparse mean estimator gives a RSGE, which we can combine with Theorem 2.1 to obtain stronger results. We state these as Corollary 3.1. We note again that the analysis in [1] has a flaw. Their Lemma A.3 is incorrect, as our counterexample in Appendix B demonstrates. We provide a correct route of analysis in Lemma D.3 of Appendix D.

### 3.2 Near-optimal statistical guarantees

###### Corollary 3.1.

Suppose we observe -corrupted samples from Model 1.1 with . By setting , if we use the ellipsoid algorithm for robust sparse gradient estimation with , it requires samples, and guarantees . Hence, Algorithm 1 outputs , such that , with probability at least , by setting .

For a desired error level , we only require sample complexity . Hence, we can achieve statistical error . Our error bound is nearly optimal compared to the information-theoretically optimal in [21], as the term is necessary by an oracle-based SQ lower bound [16].

Proof sketch of Corollary 3.1 The key to the proof relies on showing that controls the quality of the weights of the current iteration, i.e., small means good weights and thus a good current solution. Showing this relies on using to control . Lemma A.3 in [1] claims that . As we show in Appendix B, however, this need not hold. This is because the trace norm maximization eq. 1 is not a valid convex relaxation for the -sparse operator norm when the term is not p.s.d. (which indeed it need not be). We provide a different line of analysis in Lemma D.3, essentially showing that even without the claimed (incorrect) bound, can still provide the control we need. With the corrected analysis for , the ellipsoid algorithm guarantees with probability at least . Therefore, the algorithm provides an -RSGE.

## 4 Robust sparse mean estimation via filtering

From a computational viewpoint, the time complexity of Algorithm 1 depends on the RSGE in each iterate. The time complexity of the ellipsoid algorithm is indeed polynomial in the dimension, but it requires calls to a relaxation of Sparse PCA ([9]). In this section, we introduce a faster algorithm as a RSGE, which only requires calls of Sparse PCA (recall that only scales with ). Importantly, this RSGE is flexible enough to deal with unknown covariance matrix, yet the ellipsoid algorithm cannot. Before we move to the result for unknown covariance matrix in Section 5, we first introduce Algorithm 3 and analyze its performance when the covariance is identity. These supporting Lemmas will be later used in the unknown case.

Our proposed RSGE (Algorithm 3) attempts to remove one outlier at each iteration, as long as a good solution has not already been identified. It first estimates the gradient by hard thresholding (line 4) and then estimates the corresponding sample covariance matrix (line 5). By solving (a relaxation of) Sparse PCA, we obtain a scalar as well as a matrix . If is smaller than the predetermined threshold , we have a certificate that the effect of the outliers is well-controlled (specified in eq. 5). Otherwise, we compute a score for each sample based on , and discard one of the samples according to a probability distribution where each sample’s probability of being discarded is proportional to the score we have computed 222Although we remove one sample in Algorithm 3, our theoretical analysis naturally extend to removing constant number of outliers. This speeds up the algorithm in practice, yet shares the same computational complexity. Algorithm 3 can be used for other robust sparse functional estimation problems (e.g., robust sparse mean estimation for , where is -sparse). To use Algorithm 3 as a RSGE given gradient samples (denoted as ), we call Algorithm 3 repeatedly on and then on its output, , until it returns a robust estimator . The next theorem provides guarantees on this iterative application of Algorithm 3.

###### Theorem 4.1.

Suppose we observe -corrupted samples from Model 1.1 with . Let be an -corrupted set of gradient samples . By setting , if we run Algorithm 3 iteratively with initial set , and subsequently on , and use , 333Similar to [13, 14, 15, 1], our results seem to require this side information. then this repeated use of Algorithm 3 will stop after at most iterations, and output , such that , with probability at least . Here, is a constant depending on , where is a constant.

Thus, Theorem 4.1 shows that with high probability, Algorithm 3 provides a Robust Sparse Gradient Estimator where . For example, we can take . Combining now with Theorem 2.1, we obtain an error guarantee for robust sparse regression.

###### Corollary 4.1.

Suppose we observe -corrupted samples from Model 1.1 with . Under the same setting as Theorem 4.1, if we use Algorithm 3 for robust sparse gradient estimation, it requires samples, and , then we have with probability at least .

Similar to Section 3, we can achieve statistical error . The scaling of in Corollary 4.1 is . These guarantees are worse than achieved by ellipsoid methods. Nevertheless, this result is strong enough to guarantee exact recovery when either or goes to zero. The simulation of robust estimation for the filtering algorithm is in Appendix H.

The key step in Algorithm 3 is outlier removal eq. 4 based on the solution of Sparse PCA’s convex relaxation eq. 3. We describe the outlier removal below, and then give the proofs in Appendix E and Appendix F.

Outlier removal guarantees in Algorithm 3. We denote samples in the input set as . This input set can be partitioned into two parts: , and . Lemma 4.1 shows that Algorithm 3 can return a guaranteed gradient estimate, or the outlier removal step eq. 4 is likely to discard an outlier. The guarantee on the outlier removal step eq. 4 hinges on the fact that if is less than , we can show eq. 4 is likely to remove an outlier.

###### Lemma 4.1.

Suppose we observe -corrupted samples from Model 1.1 with . Let be an -corrupted set . Algorithm 3 computes that satisfies

 (5)

If , then with probability at least , we have , where is defined in line 8, is a constant depending on , and is a constant.

The proofs are collected in Appendix E. In a nutshell, eq. 5 is a natural convex relaxation for the sparsity constraint . On the other hand, when , the contribution of is relatively small, which can be obtained through concentration inequalities for the samples in . Based on Lemma 4.1, if , then the RHS of eq. 5 is bounded, leading to the error guarantee of . On the other hand, if , we can show that eq. 4 is more likely to throw out samples of rather than . Iteratively applying Algorithm 3 on the remaining samples, we can remove those outliers with large effect, and keep the remaining outliers’ effect well-controlled. This leads to the final bounds in Theorem 4.1.

## 5 Robust sparse regression with unknown covariance

In this section, we consider robust sparse regression with unknown covariance matrix , which has additional sparsity structure. Formally, we define the sparse covariance matrices as follows:

###### Model 5.1 (Sparse covariance matrices).

In Model 1.1, the authentic covariates are drawn from . We assume that each row and column of is -sparse, but the positions of the non-zero entries are unknown.

Model 5.1 is widely studied in high dimensional statistics [7, 20, 45]. Under Model 5.1, for the population gradient , where we use to denote the -sparse vector , we can guarantee the . Hence, we can use the filtering algorithm (Algorithm 3) with as a RSGE for robust sparse regression with unknown . When the covariance is unknown, we cannot evaluate a priori, thus the ellipsoid algorithm is not applicable to this case. And we provide error guarantees as follows.

###### Theorem 5.1.

Suppose we observe -corrupted samples from Model 1.1, where the covariates ’s follow from Model 5.1. If we use Algorithm 3 for robust sparse gradient estimation, it requires samples, and , then, we have, with probability at least .

The proof of Theorem 5.1 is collected in Appendix G, and the main technique hinges on previous analysis for the identity covariance case (Theorem 4.1 and Lemma 4.1). In the case of unknown covariance, this is the best (and in fact, only) recovery guarantee we are aware of for robust sparse regression. We show the performance of robust estimation using our filtering algorithm with unknown covariance in Appendix H, and we observe same linear convergence as Section 4.

## 6 Acknowledgments

The authors would like to thank Simon S. Du for helpful discussions.

## References

• [1] Sivaraman Balakrishnan, Simon S. Du, Jerry Li, and Aarti Singh. Computationally efficient robust sparse estimation in high dimensions. In Proceedings of the 2017 Conference on Learning Theory, 2017.
• [2] Sivaraman Balakrishnan, Martin J Wainwright, and Bin Yu. Statistical guarantees for the em algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120, 2017.
• [3] Afonso S. Bandeira, Edgar Dobriban, Dustin G. Mixon, and William F. Sawin. Certifying the restricted isometry property is hard. IEEE Transactions on Information Theory, 59(6):3448–3450, 2013.
• [4] Boaz Barak and David Steurer. Proofs, beliefs, and algorithms through the lens of sum-of-squares. Course notes: http://www. sumofsquares. org/public/index. html, 2016.
• [5] Kush Bhatia, Prateek Jain, and Purushottam Kar. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems, pages 721–729, 2015.
• [6] Kush Bhatia, Prateek Jain, and Purushottam Kar. Consistent robust regression. In Advances in Neural Information Processing Systems, pages 2107–2116, 2017.
• [7] Peter J Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577–2604, 2008.
• [8] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis, 27(3):265–274, 2009.
• [9] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
• [10] Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance and scatter matrix estimation under huberâs contamination model. Ann. Statist., 46(5):1932–1960, 10 2018.
• [11] Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust sparse regression under adversarial corruption. In International Conference on Machine Learning, pages 774–782, 2013.
• [12] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):44, 2017.
• [13] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 655–664. IEEE, 2016.
• [14] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In International Conference on Machine Learning, pages 999–1008, 2017.
• [15] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018.
• [16] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures. In Foundations of Computer Science (FOCS), 2017 IEEE 58th Annual Symposium on, pages 73–84. IEEE, 2017.
• [17] Ilias Diakonikolas, Weihao Kong, and Alistair Stewart. Efficient algorithms and lower bounds for robust linear regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2745–2754. SIAM, 2019.
• [18] David L Donoho. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.
• [19] Alexandre dâAspremont, Francis Bach, and Laurent El Ghaoui. Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9(Jul):1269–1294, 2008.
• [20] Noureddine El Karoui. Operator norm consistent estimation of large-dimensional sparse covariance matrices. The Annals of Statistics, 36(6):2717–2756, 2008.
• [21] Chao Gao. Robust regression via mutivariate regression depth. arXiv preprint arXiv:1702.04656, 2017.
• [22] Michael Grant, Stephen Boyd, and Yinyu Ye. Cvx: Matlab software for disciplined convex programming, 2008.
• [23] Frank R Hampel, Elvezio M Ronchetti, Peter J Rousseeuw, and Werner A Stahel. Robust statistics: the approach based on influence functions, volume 196. John Wiley & Sons, 2011.
• [24] Matthew J Holland and Kazushi Ikeda. Efficient learning with robust gradient descent. arXiv preprint arXiv:1706.00182, 2017.
• [25] Samuel B Hopkins, Tselil Schramm, Jonathan Shi, and David Steurer. Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 178–191. ACM, 2016.
• [26] Peter J Huber. Robust estimation of a location parameter. The annals of mathematical statistics, pages 73–101, 1964.
• [27] Peter J Huber. Robust statistics. In International Encyclopedia of Statistical Science, pages 1248–1251. Springer, 2011.
• [28] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional m-estimation. In Advances in Neural Information Processing Systems, pages 685–693, 2014.
• [29] David S Johnson and Franco P Preparata. The densest hemisphere problem. Theoretical Computer Science, 6(1):93–107, 1978.
• [30] Sushrut Karmalkar and Eric Price. Compressed sensing with adversarial sparse noise via l1 regression. arXiv preprint arXiv:1809.08055, 2018.
• [31] Adam Klivans, Pravesh K. Kothari, and Raghu Meka. Efficient Algorithms for Outlier-Robust Regression. arXiv preprint arXiv:1803.03241, 2018.
• [32] Adam R Klivans, Philip M Long, and Rocco A Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10(Dec):2715–2740, 2009.
• [33] Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 665–674. IEEE, 2016.
• [34] Xiaodong Li. Compressed sensing and matrix completion with constant proportion of corruptions. Constructive Approximation, 37(1):73–99, 2013.
• [35] Haoyang Liu and Rina Foygel Barber. Between hard and soft thresholding: optimal iterative thresholding algorithms. arXiv preprint arXiv:1804.08841, 2018.
• [36] Liu Liu, Tianyang Li, and Constantine Caramanis. High dimensional robust -estimation: Arbitrary corruption and heavy tails. arXiv preprint arXiv:1901.08237, 2019.
• [37] Nam H Nguyen and Trac D Tran. Exact recoverability from dense corrupted observations via l1-minimization. IEEE transactions on information theory, 59(4):2017–2035, 2013.
• [39] Peter J Rousseeuw and Annick M Leroy. Robust regression and outlier detection, volume 589. John wiley & sons, 2005.
• [40] Jie Shen and Ping Li. A tight bound of hard thresholding. The Journal of Machine Learning Research, 18(1):7650–7691, 2017.
• [41] Charles M Stein. Estimation of the mean of a multivariate normal distribution. The annals of Statistics, pages 1135–1151, 1981.
• [42] John W Tukey. Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, volume 2, pages 523–531, 1975.
• [43] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
• [44] Vincent Q Vu, Juhee Cho, Jing Lei, and Karl Rohe. Fantope projection and selection: A near-optimal convex relaxation of sparse PCA. In Advances in neural information processing systems, pages 2670–2678, 2013.
• [45] Martin Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019.
• [46] Huan Xu, Constantine Caramanis, and Shie Mannor. Outlier-robust PCA: the high-dimensional case. IEEE transactions on information theory, 59(1):546–572, 2013.
• [47] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047–3064, 2012.
• [48] Yuchen Zhang, Martin J Wainwright, and Michael I Jordan. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. In Conference on Learning Theory, pages 921–948, 2014.
\setstretch

1.25

#### Notation.

In proofs, we let denote the Kronecker product, and for a vector , we denote the outer product by . We define the infinity norm for a matrix as . Given index set , is the vector restricted to indices . Similarly, is the sub-matrix on indices . we use to denote constants that are independent of dimension, but whose value can change from line to line.

## Appendix A Proofs for the meta-theorem

In this section, we prove the global linear convergence guarantee given the Definition 2.1. In each iteration of Algorithm 1, we use to update

 βt+1=Pk′(βt−ηˆGt),

where is a fixed step size. Given the condition in RSGE’s definition, we show that Algorithm 1 linearly converges to a neighborhood around with error at most .

First, we introduce a supporting Lemma from [40], which bounds the distance between and in each iteration of Algorithm 1.

###### Lemma A.1 (Theorem 1 in [40]).

Let be an arbitrary vector and be any -sparse signal. For any , we have the following bound:

We choose the hard thresholding parameter , hence

###### Theorem A.1 (Theorem 2.1).

Suppose we observe -corrupted samples from Model 1.1. Algorithm 1, with -RSGE defined in Definition 2.1, with step size outputs , such that

 ∥∥ˆβ−β∗∥∥2=O(√ψ(ϵ)),

with probability at least , by setting and . The sample complexity is .

###### Proof.

By splitting samples into sets (each set has sample size ), Algorithm 1 collects a fresh batch of samples with size at each iteration . Definition 2.1 shows that for the fixed gradient expectation , the estimate for the gradient satisfies:

 ∥∥ˆGt−Gt∥∥22≤α(ϵ)∥∥Gt∥∥22+ψ(ϵ) (6)

with probability at least , where is determined by .

Letting , we study the -th iteration of Algorithm 1. Based on Lemma A.1, we have

 ∥∥βt+1−β∗∥∥2 ≤√ζ∥∥βt−ηˆG−β∗∥∥2 =√ζ∥∥βt−ηG−β∗+η(G−ˆG)∥∥2 (i)≤√ζ∥∥(Id−ηΣ)(βt−β∗)∥∥2+√ζη√α(ϵ)∥G∥22+ψ(ϵ)

where (i) follows from the theoretical guarantee of RSGE, and (ii) follows from the basic inequality for non-negative .

By setting , we have

 ∥∥βt+1−β∗∥∥2 ≤√ζ∥∥(Id−ηΣ)(βt−β∗)∥∥2+√ζη√α(ϵ)∥∥Σ(βt−β∗)∥∥2+√ζη√ψ(ϵ) ≤√ζ(1−1cκ+√α(ϵ))∥∥βt−β∗∥∥2+√ζη√ψ(ϵ) (7)

When is a small enough constant, we have , then

 √ζ(1−1cκ+√α(ϵ)) ≤√ζ(1−12cκ) ≤√1+ρ+√(4+ρ)ρ2(1−12cκ)

Plugging in the parameter in Lemma A.1, we have

 √ζ(1−1cκ+√α(ϵ))≤1−110cκ

Together with eq. 7, we have the recursion

 ∥∥βt+1−β∗∥∥2≤(1−110cκ)∥∥βt−β∗∥∥2+√ζη√ψ(ϵ).

By solving this recursion and using a union bound, we have

with probability at least .

By the definition of and , we have

## Appendix B Correcting Lemma A.3 in [1]’s proof

A key part of the proof of the main theorem in [1] is to obtain an upper bound on the -sparse operator norm. Specifically, their Lemmas A.2 and A.3 aim to show:

 λ∗≥∥∥ ∥∥|S|∑i=1wi(gi−ˆG(w))⊗2−F(ˆG(w))∥∥ ∥∥˜k,op≥∥∥P˜k(˜Δ(w))∥∥225ϵ, (8)

where , 444The are weights, and these are defined precisely in Section D, but are not required for the present discussion or counterexample., and recall is the solution to the SDP as given in Algorithm 3.

Lemma A.3 asserts the first inequality above, and Lemma A.2 the second. As we show below, Lemma A.3 cannot be correct. Specifically, the issue is that the quantity inside the second term in eq. 8 may not be positive semidefinite. In this case, the convex optimization problem whose solution is is not a valid relaxation, and hence the they obtain need not a valid upper bound. Indeed, we give a simple example below that illustrates precisely this potential issue.

Fortunately, not all is lost – indeed, as our results imply, the main results in [1] is correct. The key is to show that while does not upper bound the sparse operator norm, it does, however, upper bound the quantity

 max∥v∥2=1,∥v∥0≤˜kv⊤⎛⎝|S|∑i=1wi(gi−ˆG(w))⊗2−F(ˆG(w))⎞⎠v. (9)

We show this in Appendix D. More specifically, in Lemma D.3, we replace the -sparse operator norm in the second term of eq. 8 by the term in eq. 9. We show this can be used to complete the proof in Section D.4.

We now provide a counterexample that shows the first inequality in (8) cannot hold. The main argument is that the convex relaxation for sparse PCA is a valid upper bound of the sparse operator norm only for positive semidefinite matrices. Specifically, denoting as the matrix in eq. 9, [1] solves the following convex program:

Since is no longer a p.s.d. matrix, the trace maximization above may not be a valid convex relaxation, and thus not an upper bound. Let us consider a specific example, in robust sparse mean estimation for , where function is a fixed identity matrix . We choose , , and . Suppose we observe data to be , , and the weights for and are the same. Then, we can compute the following matrices as:

 ˆΣ=[1.5625000],F=[1001],E=ˆΣ−F=[0.562500−1].

It is clear that . Solving the convex relaxation or gives answer and the corresponding , which is clearly not an upper bound of . Hence cannot hold in general.

## Appendix C Covariance smoothness properties in robust sparse mean estimation

When the covariance is identity, the ellipsoid algorithm requires a closed form expression of the true covariance function . Indeed, the ellipsoid-based robust sparse mean estimation algorithm uses the covariance structure given by to detect outliers. The accuracy of robust sparse mean estimation explicitly depends on the properties of . and are two important properties of , related to its smoothness. We first provide a closed-form expression for , and then define precisely smoothness parameters and , and show how these can be controlled.

#### Closed form expression of F(G).

###### Lemma C.1.

Suppose we observe i.i.d. samples from the distribution in Model 1.1 with , we have the covariance of gradient as

 Cov(g)=Ezi∼P((gi−G)(gi−G)⊤)=∥G∥22Id+GG⊤+σ2Id.

Since , and