High Dimensional Robust Estimation: Arbitrary Corruption and Heavy Tails
Abstract
We consider the problem of sparsity-constrained M-estimation when both explanatory and response variables have heavy tails (bounded 4th moments), or when an ε-fraction of the samples is arbitrarily corrupted. We focus on the sparse, high-dimensional regime where the number of variables p exceeds the sample size n. We define a natural condition we call the Robust Descent Condition (RDC), and show that if a gradient estimator satisfies the RDC, then Robust Hard Thresholding (IHT using this gradient estimator) is guaranteed to obtain good statistical rates. The contribution of this paper is in showing that the RDC is a flexible enough concept to recover known results and to obtain new robustness results. Specifically, the new results include: (a) for sparse high-dimensional linear and logistic regression with heavy-tailed (bounded 4th moment) explanatory and response variables, a linear-time-computable median-of-means gradient estimator satisfies the RDC, and hence Robust Hard Thresholding is minimax optimal; (b) when, instead of heavy tails, an ε-fraction of the explanatory and response variables is arbitrarily corrupted, a near-linear-time computable trimmed gradient estimator satisfies the RDC, and hence Robust Hard Thresholding is again minimax optimal. We demonstrate the effectiveness of our approach on sparse linear regression, logistic regression, and sparse precision matrix estimation, on synthetic and real-world US equities data.
1 Introduction
M-estimation is a standard technique for statistical estimation [vdV00]. The past decade has seen successful extensions of M-estimation to the high-dimensional setting with sparsity (or other low-dimensional structure), e.g., using the Lasso [Tib96, BvdG11, HTW15, Wai19]. Yet sparse modeling in high dimensions is NP-hard in the worst case [BDMS13, ZWJ14]. Thus theoretical sparse recovery guarantees for most computationally tractable approaches (e.g., ℓ1 minimization [Don06, CRT04, Wai09], Iterative Hard Thresholding [BD09]) rely on strong assumptions on the probabilistic models of the data, such as sub-Gaussianity. Under such assumptions, these approaches achieve the minimax rate for sparse regression [RWY11].
Meanwhile, statistical estimation with heavy-tailed outliers or even arbitrary corruptions has long been a focus in robust statistics [Box53, Tuk75, Hub11, HRRS11]. (Following [Min18, FWZ16], by heavy-tailed we mean satisfying only weak moment bounds, specifically bounded 4th moments, as opposed to sub-exponential or sub-Gaussian tails.) But heavy tails and arbitrary corruptions in the data violate the assumptions required for convergence of the usual algorithms. A central question, then, is what assumptions suffice to enable efficient and robust algorithms for high-dimensional estimation under heavy tails or arbitrary corruption.
Huber’s seminal work [Hub64] and more modern follow-up work [Loh17] consider replacing the classical least-squares risk minimization objective with a robust counterpart (e.g., the Huber loss). Other approaches (e.g., [Li13]) consider regularization-based robustness. However, when there are outliers in the explanatory variables (covariates), these approaches do not seem to succeed [CCM13]. Approaches combining recent advances in robust mean estimation with gradient descent have proved remarkably powerful in the low-dimensional setting [PSBR18, KKM18, DKK18], but in high dimensions they have so far only addressed the setting where the covariance of the explanatory variables is the identity, or sparse [BDLS17, LSLC18]. Meanwhile, flexible and statistically optimal approaches ([Gao17]) have relied on intractable estimators such as Tukey depth.
For the heavy-tailed setting, another line of research considers estimators such as Median-of-Means (MOM) [NY83, JVV86, AMS99, Min15] and Catoni’s mean estimator [Cat12, Min18], which use only weak moment assumptions. [Min15, BJL15, HS16] generalized these ideas to M-estimation, yet it is not clear whether these approaches apply to the high-dimensional setting with heavy-tailed covariates.
Main Contributions. In this paper, we develop a sufficient condition that when satisfied, guarantees that an efficient algorithm (a variant of IHT) achieves the minimax optimal statistical rate. We show that our condition is flexible enough to apply to a number of important highdimensional estimation problems under either heavy tails, or arbitrary corruption of the data. Specifically:

We consider two models. For our arbitrary corruption model, we assume that an adversary replaces an arbitrary ε-fraction of the authentic samples with arbitrary values (Definition 2.1). For the heavy-tailed model, we assume our data (response and covariates) satisfy only weak moment assumptions (Definition 2.2), without sub-Gaussian or sub-exponential concentration bounds.

We propose a notion that we call the Robust Descent Condition (RDC). Given any gradient estimator that satisfies the RDC, we define RHT, Robust Hard Thresholding (Algorithm 1), for sparsity-constrained M-estimation, and prove that Algorithm 1 converges linearly to a minimax statistically optimal solution. Thus the RDC and Robust Hard Thresholding form the basis for a deterministic Meta-Theorem (Theorem 3.1) that guarantees estimation error rates as soon as the RDC property of any gradient estimator can be certified.

We then obtain non-asymptotic bounds by certifying the RDC for different robust gradient estimators under various statistical models. (A) For corruptions in both response and explanatory variables, we show the trimmed gradient estimator satisfies the RDC. Thus our algorithm RHT has minimax-optimal statistical error, and tolerates an ε-fraction of outliers, where the tolerable ε is nearly independent of the ambient dimension p, which is important in the high-dimensional regime. (B) In the heavy-tailed regime, we use the Median-of-Means (MOM) gradient estimator. Our RHT algorithm obtains the sharpest available error bound, in fact nearly matching the results in the sub-Gaussian case. With either of these gradient estimators, our algorithm is computationally efficient, nearly matching vanilla gradient descent. It is in particular much faster than algorithms relying on sparse PCA relaxations as subroutines ([BDLS17, LSLC18]).

We use Robust Hard Thresholding for neighborhood selection [MB06] to estimate Gaussian graphical models, and provide model selection guarantees under adversarial corruption of the data; these results enjoy robustness guarantees similar to those for sparse regression.

We demonstrate the effectiveness of Robust Hard Thresholding on both arbitrarily corrupted/heavy tailed synthetic data and (unmodified) real data.
A concrete illustration of 3(B) above: consider a sparse linear regression problem without noise (sparse linear equations), with the scaling n ≍ k log p. When the covariates are sub-Gaussian, Lasso succeeds in exact recovery with high probability (as expected). When the covariates have only bounded 4th moments, we do not expect Lasso to succeed, and indeed experiments indicate this. Moreover, to the best of our knowledge, no previous efficient algorithm with this sample size can guarantee exact recovery in this observation model ([FWZ16] has a statistical rate depending on the norm of the parameter θ*, and thus exact recovery is not guaranteed). Our contributions show that Robust Hard Thresholding using MOM achieves this (see also simulations in Figure 6(b)).
Related work
Sparse regression with arbitrary corruptions or heavy tails. Several works on robustness of high-dimensional problems consider heavy-tailed distributions or arbitrary corruptions only in the response variables [Li13, BJK15, BJK17, Loh17, KP18, HS16, Min15, CLZL]. These algorithms cannot be trivially extended to the setting with heavy tails or corruptions in the explanatory variables. Another line [ACG13, VMX17, YLA18, SS18] focuses on alternating minimization approaches extending Least Trimmed Squares [Rou84]. However, these methods have only local convergence guarantees, and cannot handle arbitrary corruptions.
[CCM13] was one of the first papers to provide guarantees for sparse regression with arbitrary outliers in both response and explanatory variables by trimming the design matrix. Similar trimming techniques are used in [FWZ16] for heavy tails in response and explanatory variables. Those results are specific to sparse regression, however, and cannot be easily extended to general M-estimation problems. Moreover, even for linear regression, the statistical rates are not minimax optimal. [LM16] uses Median-of-Means tournaments to deal with heavy tails in the explanatory variables and obtains near-optimal rates; however, Median-of-Means tournaments are not known to be computationally tractable. [LL17] deals with heavy tails and outliers in the explanatory variables, but requires higher-order moment bounds in the isotropic design case. [Gao17] optimizes Tukey depth [Tuk75, CGR18] for robust sparse regression under the Huber contamination model; the resulting estimator is minimax optimal and can handle a constant fraction of outliers, but computing Tukey depth is intractable [JP78]. Recent results [BDLS17, LSLC18] leverage robust sparse mean estimation in robust sparse regression. Their algorithms are computationally tractable and can tolerate a constant fraction of outliers, but they require very restrictive assumptions on the covariance matrix (identity or sparse), which precludes their use in applications such as graphical model estimation.
Robust estimation via robust gradient descent. [CSX17, HI17] and later [YCRB18a] first leveraged the idea of using robust mean estimation in each step of gradient descent, via a subroutine such as the geometric median. A similar approach using more sophisticated robust mean estimation methods was later proposed in [PSBR18, DKK18, YCRB18b, SX18, Hol18]. These methods all focus on low-dimensional robust estimation. [LSLC18] extended the approach to the high-dimensional setting (though it is limited to identity or sparse covariances). Even though the corrupted fraction can be made independent of the ambient dimension by using sophisticated robust mean estimation algorithms [DKK16, LRV16, SCV17] or the sum-of-squares framework [KKM18], these algorithms (except [LSLC18]) are not applicable to the high-dimensional setting (n ≪ p), as they require sample size at least proportional to the dimension p.
Robust estimation of graphical models. A line of research using a robustified covariance matrix in Gaussian graphical models [LHY12b, WG17, LT18] leverages GLasso [FHT08] or CLIME [CLL11] to estimate the sparse precision matrix. These robust methods are restricted to Gaussian graphical model estimation, and their techniques do not generalize to other estimation problems.
Notation. We denote the hard thresholding operator of sparsity k by H_k(·), and denote the Euclidean projection onto the ℓ2 ball of radius r by Π_r(·). We use Ê to denote the empirical expectation, i.e., the expectation under the uniform distribution over the n samples {z_i}_{i=1}^n.
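For concreteness, a minimal sketch of the hard thresholding operator H_k (a pure-Python illustration with hypothetical names; vectors are plain lists):

```python
def hard_threshold(v, k):
    """H_k(v): keep the k largest-magnitude coordinates of v, zero the rest."""
    if k >= len(v):
        return list(v)
    # indices of the k entries with largest |v_i| (stable under ties)
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    return [v[i] if i in keep else 0.0 for i in range(len(v))]
```

Note that H_k is a (non-convex) projection onto the set of k-sparse vectors; it is the only non-smooth step in Algorithm 1.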
2 Problem formulation
We now define the corruption and heavy-tail models, and sparsity-constrained M-estimation.
Definition 2.1 (ε-corrupted samples).
Let {z_i}_{i=1}^n be i.i.d. observations with distribution P. We say that a collection of samples is ε-corrupted if an adversary chooses an arbitrary ε-fraction of the samples and modifies them with arbitrary values.
This corruption model allows corruptions in both explanatory and response variables in regression problems, where we observe z_i = (x_i, y_i). Definition 2.1 also allows the adversary to select an ε-fraction of the samples to delete and replace with arbitrary values.
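As an illustration (not part of the formal model), an ε-corruption of a dataset can be simulated as follows; the helper name and the choice of adversary are ours, and a real adversary may inspect all the data before corrupting:

```python
import random

def eps_corrupt(samples, eps, adversary, seed=0):
    """Replace an eps-fraction of samples with adversarial values.
    `adversary` is any callable mapping a sample to its corrupted version;
    here corrupted indices are chosen at random for simplicity."""
    rng = random.Random(seed)
    out = list(samples)
    for i in rng.sample(range(len(out)), int(eps * len(out))):
        out[i] = adversary(out[i])
    return out
```

For example, `eps_corrupt(data, 0.1, lambda z: (z[0], 1e6))` plants gross outliers in 10% of the responses.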
Definition 2.2 (heavy-tailed samples).
For a random vector x with mean μ and covariance Σ, we say that x has bounded 4th moment if there is a universal constant C4 such that, for every unit vector v, E[⟨v, x − μ⟩^4] ≤ C4 (E[⟨v, x − μ⟩^2])^2.
Definition 2.2 allows heavy tails in both explanatory and response variables. For example, in Corollary 4.3, we study linear regression with bounded 4th moments for the covariates x, and bounded variance for the noise (and hence the response y).
Let ℓ(θ; z) be a convex and differentiable loss function. Our target is the unknown k*-sparse population minimizer θ* = argmin_{∥θ∥_0 ≤ k*} R(θ), where R(θ) = E[ℓ(θ; z)] is the population risk. Note that this definition of θ* allows model misspecification. The following Definition 2.3 states our general assumptions on the population risk.
Definition 2.3 (Strong convexity/smoothness).
For the population risk R, we assume μ I ⪯ ∇²R(θ) ⪯ L I for all θ, where μ is the strong-convexity parameter and L is the smoothness parameter. The condition number is κ = L/μ.
A well-known result [NRWY12] considers ERM with the convex relaxation from ℓ0 to ℓ1, certifying the RSC condition for sub-Gaussian ensembles; this yields uniform convergence of the empirical risk. From an optimization viewpoint, existing results show that gradient descent algorithms equipped with soft thresholding [ANW12] or hard thresholding [BD09, JTK14, SL17, YLZ18, LB18] have a linear convergence rate, and achieve known minimax lower bounds for statistical estimation [RWY11, ZWJ14].
Given ε-corrupted or heavy-tailed samples, running ERM on the entire input dataset, min_{∥θ∥_0 ≤ k} Ê[ℓ(θ; z)], cannot guarantee uniform convergence of the empirical risk, and can produce arbitrarily bad estimates under corruption. The next two sections outline the main results of this paper, addressing this problem.
3 Robust sparse estimation via Robust Hard Thresholding
We introduce our meta-algorithm, Robust Hard Thresholding, which essentially uses a robust gradient estimator to run IHT. We require several definitions to specify the algorithm and describe its guarantees. We use ĝ(θ) as a placeholder for the gradient estimate at θ, obtained from whichever robust gradient estimator we are using, and let ∇R(θ) denote the population gradient. We write ĝ and ∇R when the context is clear.
Many previous works ([CSX17, HI17, PSBR18, DKK18, YCRB18a, YCRB18b, SX18]) have provided algorithms for obtaining robust gradient estimators, then used them as subroutines in robust gradient descent. However, those results require controlling the ℓ2 error ∥ĝ(θ) − ∇R(θ)∥_2, and do not readily extend to high dimensions, as sufficiently controlling this error seems to require n = Ω(p). A recent work [LSLC18] on robust sparse linear regression uses a robust sparse mean estimator [BDLS17] to control this error with sample complexity polynomial in the sparsity and only logarithmic in p. However, their algorithm requires the restrictive assumption that Σ is the identity or sparse, and thus cannot be extended to more general M-estimation problems.
To address this issue, we propose Robust Hard Thresholding (Algorithm 1), which applies hard thresholding after each robust gradient update. (Our theory requires splitting samples across different iterations to maintain independence between iterations; we believe this is an artifact of the analysis, and do not use it in our experiments. [BWY17, PSBR18] use a similar device for theoretical analysis.) In line 7, we use a gradient estimator to obtain the robust gradient estimate ĝ(θ_t). In line 8, we update the parameter by hard thresholding, θ_{t+1} = H_k(θ_t − η ĝ(θ_t)), where the step size η, proportional to 1/L, is determined by Definition 2.3. A key observation in line 8 is that, in each step of IHT, the iterate is k-sparse, and thus the perturbation from outliers or heavy tails depends only on the IHT sparsity k instead of the ambient dimension p. Based on a careful analysis of the hard thresholding operator in each iteration, we show that rather than controlling ∥ĝ − ∇R∥_2, it is enough to control a weaker quantity: this is what we call the Robust Descent Condition (Definition 3.1), defined next; it plays a key role in obtaining sharp rates of convergence for various statistical models.
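A minimal sketch of the resulting procedure for the least-squares loss (pure Python, illustrative names, no sample splitting; the `estimate` argument is any coordinate-wise robust location estimator, e.g. a trimmed mean or median-of-means):

```python
def hard_threshold(v, k):
    """Keep the k largest-magnitude coordinates, zero the rest."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    return [v[i] if i in keep else 0.0 for i in range(len(v))]

def robust_hard_thresholding(X, y, k, estimate, step=0.1, iters=200):
    """IHT where the sample-average gradient is replaced, coordinate by
    coordinate, by a robust location estimate of the per-sample gradients."""
    p = len(X[0])
    theta = [0.0] * p
    for _ in range(iters):
        # per-sample gradients of 0.5 * (<x, theta> - y)^2
        grads = []
        for xi, yi in zip(X, y):
            r = sum(t * xj for t, xj in zip(theta, xi)) - yi
            grads.append([r * xj for xj in xi])
        g = [estimate([gr[j] for gr in grads]) for j in range(p)]
        theta = hard_threshold([t - step * gj for t, gj in zip(theta, g)], k)
    return theta
```

With `estimate = statistics.mean` this reduces to plain IHT; swapping in a trimmed mean or median-of-means gives the robust variants analyzed below.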
Robust Descent Condition
The Robust Descent Condition, eq. (1), provides an upper bound on the inner product between the robust gradient estimate's error and the distance to the population optimum. This is a natural notion: it controls the potential loss of progress from using a robust gradient update instead of the population gradient.
Definition 3.1 (Robust Descent Condition (RDC)).
For the population gradient ∇R(θ) at θ, a robust gradient estimator ĝ satisfies the robust descent condition with parameter α(n, δ) if, for any k-sparse θ,

⟨ĝ(θ) − ∇R(θ), θ − θ*⟩ ≤ α(n, δ) ∥θ − θ*∥_2.  (1)
We begin with a Meta-Theorem for Algorithm 1 that holds under the Robust Descent Condition (Definition 3.1) and the assumptions on the population risk (Definition 2.3). In Theorem 3.1, we prove Algorithm 1's global convergence and its statistical guarantees. The proofs are collected in Appendix B.
Theorem 3.1 (Meta-Theorem).
Suppose we observe n samples from a statistical model with population risk satisfying Definition 2.3. If a robust gradient estimator satisfies the Robust Descent Condition (Definition 3.1) with parameter α(n, δ), then Algorithm 1, with the hard-thresholding sparsity k set to a suitable constant multiple (depending on κ) of k*, outputs θ̂ such that ∥θ̂ − θ*∥_2 ≲ α(n, δ)/μ.
We note that Theorem 3.1 is deterministic in nature. In the sequel, we omit the logarithmic factor in the sample complexity due to sample splitting. We obtain high-probability results by certifying that the RDC holds for particular robust gradient estimators under various statistical models. To obtain the minimax estimation error rate from Theorem 3.1, the key step is providing a robust gradient estimator with a sufficiently small α(n, δ) in the definition of the RDC.
Section 4 uses the RDC and Theorem 3.1 to obtain new results for sparse regression under heavy tails or arbitrary corruption. Before moving to this, we observe that the RDC and Theorem 3.1 also recover existing results in the literature. Some immediate examples follow:
Uncorrupted gradient satisfies the RDC. Suppose the samples follow sparse linear regression with sub-Gaussian covariates and noise. The empirical average of the gradient samples satisfies eq. (1) with α(n, δ) ≍ σ√(k log p / n), via standard bounds on the empirical gradient error [LW11]. Plugging this into Theorem 3.1 recovers the well-known minimax rate for sparse linear regression [RWY11].
RSGE implies RDC. When Σ is the identity or sparse, [BDLS17] and [LSLC18], respectively, provide robust sparse gradient estimators (RSGE) which upper bound ∥ĝ(θ) − ∇R(θ)∥_2 for a constant fraction of corrupted samples. Noting that ⟨ĝ − ∇R, θ − θ*⟩ ≤ ∥ĝ − ∇R∥_2 ∥θ − θ*∥_2, we see that an RSGE implies the RDC, and hence any RSGE can be used in Algorithm 1. Plugging the ℓ2 error bounds of [BDLS17] (for Σ = I) and [LSLC18] (for unknown sparse Σ) in for α(n, δ) in our theorem recovers the results in those papers. (It remains an open question to obtain an RSGE tolerating a constant fraction of outliers for robust sparse regression with arbitrary covariance Σ.)
4 Main Results: Using the RDC and Algorithm 1
In the remainder of the paper, we use Theorem 3.1 and the RDC to analyze two well-known and computationally efficient robust mean estimation subroutines that have been used in the low-dimensional setting: the trimmed mean estimator and the MOM estimator. We show that both achieve a sufficiently small α(n, δ) in the definition of the RDC. This leads to minimax estimation error under arbitrary corruption or heavy tails.
4.1 Gradient estimation
The trimmed mean and MOM estimators have been successfully applied to robustify gradient descent [YCRB18a, PSBR18] in the low-dimensional setting. They have not been used in the high-dimensional regime, however, because until now we have not had the machinery to analyze their algorithmic convergence, statistical rates, and minimax optimality in that setting.
To show they satisfy the RDC with a sufficiently small α(n, δ), we observe that, by Hölder's inequality applied to the LHS of eq. (1), ⟨ĝ(θ) − ∇R(θ), θ − θ*⟩ ≤ ∥ĝ(θ) − ∇R(θ)∥_∞ ∥θ − θ*∥_1. In Algorithm 1, the hard thresholding step enforces sparsity of θ, so ∥θ − θ*∥_1 ≲ √k ∥θ − θ*∥_2. Therefore, controlling α(n, δ) amounts to bounding the infinity norm of the robust gradient estimate's error.
In Section 4.2, we show that coordinate-wise robust mean estimation certifies the RDC with α(n, δ) small enough to guarantee minimax rates. Specifically, we show this for the trimmed gradient estimator under arbitrary corruption, and for the MOM gradient estimator under heavy-tailed distributions.
Definition 4.1.
Given gradient samples {∇ℓ(θ; z_i)}_{i=1}^n, for each dimension j ∈ {1, …, p}:
Trimmed gradient estimator: remove the largest and smallest γ-fraction of the elements of {[∇ℓ(θ; z_i)]_j}_{i=1}^n, and output the mean of the remaining terms. We choose the trimming level γ = c ε for a constant c ≥ 1, and require γ to be at most a small constant γ0 < 1/2.
MOM gradient estimator: partition the n samples into m blocks, compute the sample mean of {[∇ℓ(θ; z_i)]_j} within each block, and take the median of these m block means. (Without loss of generality, we assume the number of blocks m divides n; m is chosen proportional to log(1/δ), as in [HS16].)
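One-dimensional sketches of the two estimators (applied per gradient coordinate in Algorithm 1); the trimming level and block count below are illustrative choices, not the theoretically tuned ones:

```python
import statistics

def trimmed_mean(xs, trim_frac):
    """Drop the trim_frac-fraction largest and smallest values, average the rest."""
    m = int(trim_frac * len(xs))
    s = sorted(xs)
    core = s[m:len(s) - m] if m > 0 else s
    return sum(core) / len(core)

def median_of_means(xs, num_blocks):
    """Split xs into num_blocks contiguous blocks; return the median of block means."""
    b = len(xs) // num_blocks  # assume num_blocks divides len(xs)
    means = [sum(xs[i * b:(i + 1) * b]) / b for i in range(num_blocks)]
    return statistics.median(means)
```

A single gross outlier moves the sample mean arbitrarily far, but leaves both of these estimators at the bulk of the data, which is exactly the per-coordinate behavior the ℓ∞ bound in Section 4.1 requires.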
4.2 Statistical guarantees
In this section, we instantiate our results for some typical statistical models.
Model 4.1 (Sparse linear regression).
Samples (x_i, y_i) are drawn from the linear model y = ⟨x, θ*⟩ + ξ, with θ* being k*-sparse. We assume the x_i's are i.i.d. with normalized covariance matrix Σ, with Σ_jj ≤ 1 for all j, and that the stochastic noise ξ has mean 0 and variance σ².
Model 4.2 (Sparse logistic regression).
Samples (x_i, y_i) are drawn from a binary classification model, where the binary label y ∈ {−1, +1} follows the conditional distribution P(y = 1 | x) = 1/(1 + exp(−⟨x, θ*⟩)), with θ* being k*-sparse. We assume the x_i's are i.i.d. with normalized covariance matrix Σ, where Σ_jj ≤ 1 for all j.
To obtain the following corollaries, we first certify the RDC for the relevant robust gradient estimator over random ensembles with corruption or heavy tails, and then invoke Theorem 3.1. We collect the results for gradient estimation in Appendix A, and the proofs of the corollaries in Appendix B.
Arbitrary corruption case.
Based on Theorem 3.1, we first provide concrete results for the arbitrary corruption case (Definition 2.1), where the covariates and response variables of the authentic distribution are assumed to be sub-Gaussian.
Corollary 4.1.
Suppose we observe n ε-corrupted (Definition 2.1) sub-Gaussian samples from the sparse linear regression model (Model 4.1). Under conditions on n and ε made explicit in Appendix A, with probability at least 1 − δ, Algorithm 1 with the trimmed gradient estimator satisfies the RDC with an α(n, δ) for which Theorem 3.1 yields a minimax-optimal bound on ∥θ̂ − θ*∥_2.
Time complexity. Corollary 4.1 gives a global linear convergence rate. Each iteration costs O(np log n) operations to compute the coordinate-wise trimmed mean (a sort per coordinate), a logarithmic overhead over vanilla gradient descent [Bub15].
Statistical accuracy and robustness. Compared with [CCM13, BDLS17], our statistical error rate is minimax optimal [RWY11, ZWJ14] and has no dependence on ∥θ*∥_2. Furthermore, the upper bound on the tolerable corruption fraction ε is nearly independent of the dimension p, which underwrites Algorithm 1's robustness in high dimensions.
Corollary 4.2.
Suppose we observe n ε-corrupted (Definition 2.1) sub-Gaussian samples from the sparse logistic regression model (Model 4.2). With probability at least 1 − δ, Algorithm 1 with the trimmed gradient estimator satisfies the RDC with an α(n, δ) for which Theorem 3.1 yields a minimax-optimal bound on ∥θ̂ − θ*∥_2.
Statistical accuracy and robustness. Under the sparse Gaussian linear discriminant analysis model (a typical instance of Model 4.2), Algorithm 1 achieves the statistical minimax rate [LPR15, LYCR17].
Heavytailed distribution case.
We next turn to the heavy-tailed case (Definition 2.2).
Corollary 4.3.
Suppose we observe n samples from the sparse linear regression model (Model 4.1) with bounded-4th-moment covariates. Under a condition on n made explicit in Appendix A, with probability at least 1 − δ, Algorithm 1 with the MOM gradient estimator satisfies the RDC with an α(n, δ) for which Theorem 3.1 yields a minimax-optimal bound on ∥θ̂ − θ*∥_2.
Time complexity. As in Corollary 4.1, Corollary 4.3 gives global linear convergence. Each iteration costs O(np) operations, the same as vanilla gradient descent [Bub15].
Statistical accuracy. [LM16] uses Median-of-Means tournaments to handle sparse linear regression with bounded-moment covariates, obtaining near-optimal rates. We obtain similar rates, but our algorithm is efficient, whereas Median-of-Means tournaments are not known to be computationally tractable. [FWZ16, Zhu17] address the same problem by truncating and shrinking the data to certify the RSC condition. Their results require bounds on higher moments of the noise ξ, and their final error depends on ∥θ*∥_2. Our estimation error bounds exactly recover the optimal sub-Gaussian bounds for sparse regression [NRWY12, Wai19], and moreover, we obtain exact recovery when the noise variance σ² = 0.
Corollary 4.4.
Suppose we observe n samples from the sparse logistic regression model (Model 4.2) with bounded-4th-moment covariates. With probability at least 1 − δ, Algorithm 1 with the MOM gradient estimator satisfies the RDC with an α(n, δ) for which Theorem 3.1 yields a minimax-optimal bound on ∥θ̂ − θ*∥_2.
4.3 Sparsity recovery and Gaussian graphical model estimation
We next demonstrate the sparsity recovery performance of Algorithm 1 for graphical model learning [MB06, Wai09, RWL10, RWRY11, BvdG11, HTW15]. Our sparsity recovery guarantees hold under both heavy tails and arbitrary corruption, though in this section we present only the arbitrary corruption case.
We use supp_k(θ) to denote the set of the top k indices of θ with the largest magnitudes. Let θ*_min denote the smallest absolute value of the nonzero elements of θ*. To control the false-negative rate, Corollary 4.5 shows that, under a condition on θ*_min, supp_{k*}(θ̂) is exactly supp(θ*). The proofs are given in Appendix C. The sparsity recovery guarantee for sparse logistic regression is similar and is omitted due to space constraints. Existing results on sparsity recovery for ℓ1-regularized estimators [Wai09, LSRC15] do not require the RSC condition, but instead require an irrepresentability condition, which is stronger. In the uncorrupted case, Corollary 4.5 requires the same θ*_min condition as IHT for sparsity recovery [YLZ18].
Corollary 4.5.
Under the same conditions as in Corollary 4.1, together with a condition on the minimum signal strength θ*_min (made explicit in Appendix C), Algorithm 1 with the trimmed gradient estimator guarantees supp_{k*}(θ̂) = supp(θ*), with probability at least 1 − δ.
We now consider sparse precision matrix estimation for Gaussian graphical models, where the sparsity pattern of the precision matrix encodes the conditional independence relationships [KFB09, WJ08].
Model 4.3 (Sparse precision matrix estimation).
Under the contamination model of Definition 2.1, authentic samples are drawn from a multivariate Gaussian distribution N(0, Σ). We assume that each row of the precision matrix Θ* = Σ^{-1} is sparse: each node has at most k* edges.
Given uncorrupted samples from the Gaussian graphical model, the neighborhood selection (NS) algorithm [MB06] solves a convex relaxation of the following sparsity-constrained optimization, regressing each variable against its neighbors:
θ^(a) = argmin_{∥θ∥_0 ≤ k} E[(X_a − ⟨θ, X_{−a}⟩)²],  (2)
where X_a denotes the a-th coordinate of X, and −a denotes the index set {1, …, p} \ {a}. Let Θ*_{−a,a} denote the a-th column of Θ* with the diagonal entry removed, and let Θ*_{aa} denote the a-th diagonal element. Then the sparsity pattern of Θ*_{−a,a} can be estimated through supp(θ^(a)), since θ^(a) = −Θ*_{−a,a}/Θ*_{aa}. Details on the connection between θ^(a) and Θ* are given in Appendix C.
However, given ε-corrupted samples from the Gaussian graphical model, this procedure fails [LHY12b, WG17]. To address this, we propose Robust NS (Algorithm 2 in Appendix C), which robustifies neighborhood selection [MB06] by using Robust Hard Thresholding (with the least-squares loss) in place of eq. (2). As in Corollary 4.5, a condition on the minimum signal strength guarantees consistent edge selection.
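A sketch of the neighborhood-selection wiring (the `sparse_regress` argument stands in for Robust Hard Thresholding with least-squares loss; all names here are ours, and the "OR" rule for combining neighborhoods is one common convention):

```python
def neighborhood_select(X, k, sparse_regress):
    """Regress each coordinate a on the remaining coordinates; declare an
    edge {a, t} whenever either regression selects the other variable.
    sparse_regress(X_rest, y, k) returns a k-sparse coefficient vector."""
    p = len(X[0])
    edges = set()
    for a in range(p):
        y = [row[a] for row in X]
        rest = [[row[i] for i in range(p) if i != a] for row in X]
        beta = sparse_regress(rest, y, k)
        cols = [i for i in range(p) if i != a]
        for col, b in zip(cols, beta):
            if b != 0:
                edges.add(frozenset((a, col)))
    return edges
```

Plugging a robust sparse regressor into `sparse_regress` gives the robust variant; plugging in an ordinary sparse regressor recovers classical NS.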
Corollary 4.6.
Under the same conditions as in Corollary 4.1, together with a condition on the minimum signal strength of the precision matrix (made explicit in Appendix C), Robust NS (Algorithm 2) achieves consistent edge selection, with probability at least 1 − δ.
As in Corollary 4.1, the tolerable corruption fraction ε is nearly independent of the dimension p, which underwrites Robust NS's guarantees in high dimensions. Other Gaussian graphical model selection algorithms include GLasso [FHT08] and CLIME [CLL11]. Experimental comparisons with robustified versions of these algorithms are presented in Section D.4.
5 Experiments
We provide the complete details for our experiment setup in Appendix D.
Sparse regression with arbitrary corruption. We generate samples from the sparse regression model (Model 4.1) with a Toeplitz covariance Σ. The stochastic noise is Gaussian, and we vary the noise level σ across simulations. We add an ε-fraction of outliers, and track the parameter error ∥θ_t − θ*∥_2 in each iteration. The left plot of Figure 6 shows Algorithm 1's linear convergence; the error curves flatten out at the final error level. Furthermore, Algorithm 1 achieves machine precision when σ = 0, i.e., exact recovery of θ*.
Sparse regression with heavy tails. We consider log-normal covariates and noise (a typical example of heavy tails) in Model 4.1, with the same Toeplitz covariance Σ after normalization. We fix the dimensions and vary the sample size n. For the log-normal samples, we run Algorithm 1 with MOM, and vanilla Lasso. We then regenerate standard Gaussian samples of the same dimensions and run vanilla Lasso. Each curve in the right plot of Figure 6 is the average of 50 trials. Algorithm 1 with MOM significantly improves on vanilla Lasso on log-normal data, and matches the performance of Lasso on sub-Gaussian data.
Real-data experiments. We next apply Algorithm 2 to a US equities dataset [LHY12a, ZLR12], which is heavy-tailed and has many outliers [dP18]. The dataset contains 1,257 daily closing prices of 452 stocks (variables). It is well known that stocks from the same sector tend to cluster together [Kin66]. We therefore use Robust NS (Algorithm 2) to construct an undirected graph among the stocks. Graphs estimated by different algorithms are shown in Figure 6: stocks from the same sector are clustered together, and the cluster centers are easily identified. We also compare Algorithm 2 to the baseline NS approach (as in the ideal setting); stocks from Information Technology (colored purple) are much better clustered by Algorithm 2.
References
 [ACG13] Andreas Alfons, Christophe Croux, and Sarah Gelper. Sparse least trimmed squares regression for analyzing highdimensional large data sets. The Annals of Applied Statistics, pages 226–248, 2013.
 [AMS99] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and system sciences, 58(1):137–147, 1999.
 [ANW12] Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright. Fast global convergence of gradient methods for highdimensional statistical recovery. Ann. Statist., 40(5):2452–2482, 10 2012.
 [BD09] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis, 27(3):265–274, 2009.
 [BDLS17] Sivaraman Balakrishnan, Simon S. Du, Jerry Li, and Aarti Singh. Computationally efficient robust sparse estimation in high dimensions. In Proceedings of the 2017 Conference on Learning Theory, 2017.
 [BDMS13] Afonso S. Bandeira, Edgar Dobriban, Dustin G. Mixon, and William F. Sawin. Certifying the restricted isometry property is hard. IEEE Transactions on Information Theory, 59(6):3448–3450, 2013.
 [BJK15] Kush Bhatia, Prateek Jain, and Purushottam Kar. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems, pages 721–729, 2015.
 [BJK17] Kush Bhatia, Prateek Jain, and Purushottam Kar. Consistent robust regression. In Advances in Neural Information Processing Systems, pages 2107–2116, 2017.
 [BJL15] Christian T Brownlees, Emilien Joly, and Gabor Lugosi. Empirical risk minimization for heavytailed losses. Annals of Statistics, 43(6):2507–2536, 2015.
 [Box53] George EP Box. Nonnormality and tests on variances. Biometrika, 40(3/4):318–335, 1953.
 [Bub15] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(34):231–357, 2015.
 [BvdG11] Peter Bühlmann and Sara van de Geer. Statistics for highdimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
 [BWY17] Sivaraman Balakrishnan, Martin J Wainwright, and Bin Yu. Statistical guarantees for the em algorithm: From population to samplebased analysis. The Annals of Statistics, 45(1):77–120, 2017.
 [Cat12] Olivier Catoni. Challenging the empirical mean and empirical variance: a deviation study. In Annales de l’IHP Probabilités et statistiques, volume 48, pages 1148–1185, 2012.
 [CCM13] Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust sparse regression under adversarial corruption. In International Conference on Machine Learning, pages 774–782, 2013.
 [CGR18] Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance and scatter matrix estimation under Huber’s contamination model. Ann. Statist., 46(5):1932–1960, 10 2018.
 [CLL11] Tony Cai, Weidong Liu, and Xi Luo. A constrained minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607, 2011.
 [CLZL] Yuejie Chi, Yuanxin Li, Huishuai Zhang, and Yingbin Liang. Mediantruncated gradient descent: A robust and scalable nonconvex approach for signal estimation.
 [CRT04] Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52:489–509, 2004.
 [CSX17] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):44, 2017.
 [DKK16] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 655–664. IEEE, 2016.
 [DKK18] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018.
 [Don06] David L Donoho. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.
 [dP18] M. L. de Prado. Advances in Financial Machine Learning. Wiley, 2018.
 [FHT08] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
 [FWZ16] Jianqing Fan, Weichen Wang, and Ziwei Zhu. A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. arXiv preprint arXiv:1603.08315, 2016.
 [Gao17] Chao Gao. Robust regression via multivariate regression depth. arXiv preprint arXiv:1702.04656, 2017.
 [HI17] Matthew J Holland and Kazushi Ikeda. Efficient learning with robust gradient descent. arXiv preprint arXiv:1706.00182, 2017.
 [Hol18] Matthew J Holland. Robust descent using smoothed multiplicative noise. arXiv preprint arXiv:1810.06207, 2018.
 [HRRS11] Frank R Hampel, Elvezio M Ronchetti, Peter J Rousseeuw, and Werner A Stahel. Robust statistics: the approach based on influence functions, volume 196. John Wiley & Sons, 2011.
 [HS16] Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research, 17(1):543–582, 2016.
 [HTW15] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC press, 2015.
 [Hub64] Peter J Huber. Robust estimation of a location parameter. The annals of mathematical statistics, pages 73–101, 1964.
 [Hub11] Peter J Huber. Robust statistics. In International Encyclopedia of Statistical Science, pages 1248–1251. Springer, 2011.
 [JP78] David Johnson and Franco Preparata. The densest hemisphere problem. Theoretical Computer Science, 6(1):93–107, 1978.
 [JTK14] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional M-estimation. In Advances in Neural Information Processing Systems, pages 685–693, 2014.
 [JVV86] Mark R Jerrum, Leslie G Valiant, and Vijay V Vazirani. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169–188, 1986.
 [KFB09] Daphne Koller, Nir Friedman, and Francis Bach. Probabilistic graphical models: principles and techniques. MIT press, 2009.
 [Kin66] Benjamin King. Market and industry factors in stock price behavior. the Journal of Business, 39(1):139–190, 1966.
 [KKM18] Adam Klivans, Pravesh K. Kothari, and Raghu Meka. Efficient Algorithms for Outlier-Robust Regression. arXiv preprint arXiv:1803.03241, 2018.
 [KP18] Sushrut Karmalkar and Eric Price. Compressed sensing with adversarial sparse noise via l1 regression. arXiv preprint arXiv:1809.08055, 2018.
 [LB18] Haoyang Liu and Rina Foygel Barber. Between hard and soft thresholding: optimal iterative thresholding algorithms. arXiv preprint arXiv:1804.08841, 2018.
 [LHY12a] Han Liu, Fang Han, Ming Yuan, John Lafferty, and Larry Wasserman. The nonparanormal skeptic. arXiv preprint arXiv:1206.6488, 2012.
 [LHY12b] Han Liu, Fang Han, Ming Yuan, John Lafferty, Larry Wasserman, et al. High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40(4):2293–2326, 2012.
 [Li13] Xiaodong Li. Compressed sensing and matrix completion with constant proportion of corruptions. Constructive Approximation, 37(1):73–99, 2013.
 [LL17] Guillaume Lecué and Matthieu Lerasle. Robust machine learning by median-of-means: theory and practice. arXiv preprint arXiv:1711.10306, 2017.
 [LM16] Gabor Lugosi and Shahar Mendelson. Risk minimization by median-of-means tournaments. arXiv preprint arXiv:1608.00757, 2016.
 [Loh17] Po-Ling Loh. Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. The Annals of Statistics, 45(2):866–896, 2017.
 [LPR15] Tianyang Li, Adarsh Prasad, and Pradeep K Ravikumar. Fast classification rates for high-dimensional Gaussian generative models. In Advances in Neural Information Processing Systems, pages 1054–1062, 2015.
 [LRV16] Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 665–674. IEEE, 2016.
 [LSLC18] Liu Liu, Yanyao Shen, Tianyang Li, and Constantine Caramanis. High dimensional robust sparse regression. arXiv preprint arXiv:1805.11643, 2018.
 [LSRC15] Yen-Huan Li, Jonathan Scarlett, Pradeep Ravikumar, and Volkan Cevher. Sparsistency of ℓ1-regularized M-estimators. In AISTATS, 2015.
 [LT18] Po-Ling Loh and Xin Lu Tan. High-dimensional robust precision matrix estimation: Cellwise corruption under ε-contamination. Electronic Journal of Statistics, 12(1):1429–1467, 2018.
 [LW11] Po-Ling Loh and Martin J Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. In Advances in Neural Information Processing Systems, pages 2726–2734, 2011.
 [LYCR17] Tianyang Li, Xinyang Yi, Constantine Caramanis, and Pradeep Ravikumar. Minimax Gaussian classification & clustering. In Artificial Intelligence and Statistics, pages 1–9, 2017.
 [MB06] Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
 [Min15] Stanislav Minsker. Geometric median and robust estimation in banach spaces. Bernoulli, 21(4):2308–2335, 2015.
 [Min18] Stanislav Minsker. Sub-Gaussian estimators of the mean of a random matrix with heavy-tailed entries. The Annals of Statistics, 46(6A):2871–2903, 2018.
 [NRWY12] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
 [NY83] Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.
 [PSBR18] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.
 [Rou84] Peter J Rousseeuw. Least median of squares regression. Journal of the American statistical association, 79(388):871–880, 1984.
 [RWL10] Pradeep Ravikumar, Martin J Wainwright, and John D Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.
 [RWRY11] Pradeep Ravikumar, Martin J Wainwright, Garvesh Raskutti, and Bin Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
 [RWY10] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11(Aug):2241–2259, 2010.
 [RWY11] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.
 [SCV17] Jacob Steinhardt, Moses Charikar, and Gregory Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. arXiv preprint arXiv:1703.04940, 2017.
 [SL17] Jie Shen and Ping Li. A tight bound of hard thresholding. The Journal of Machine Learning Research, 18(1):7650–7691, 2017.
 [SS18] Yanyao Shen and Sujay Sanghavi. Iteratively learning from the best. arXiv preprint arXiv:1810.11874, 2018.
 [SX18] Lili Su and Jiaming Xu. Securing distributed machine learning in high dimensions. arXiv preprint arXiv:1804.10140, 2018.
 [Tib96] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
 [Tuk75] John W Tukey. Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, volume 2, pages 523–531, 1975.
 [vdV00] Aad van der Vaart. Asymptotic statistics. Cambridge University Press, 2000.
 [VMX17] Daniel Vainsencher, Shie Mannor, and Huan Xu. Ignoring is a bliss: Learning with large noise through reweighting-minimization. In Conference on Learning Theory, pages 1849–1881, 2017.
 [Wai09] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
 [Wai19] Martin Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019.
 [WG17] Lingxiao Wang and Quanquan Gu. Robust Gaussian graphical model estimation with arbitrary corruption. In International Conference on Machine Learning, pages 3617–3626, 2017.
 [WJ08] Martin Wainwright and Michael Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
 [YCRB18a] Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5650–5659. PMLR, 2018.
 [YCRB18b] Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Defending against saddle point attack in Byzantine-robust distributed learning. arXiv preprint arXiv:1806.05358, 2018.
 [YL15] Eunho Yang and Aurélie C Lozano. Robust Gaussian graphical modeling with the trimmed graphical lasso. In Advances in Neural Information Processing Systems, pages 2602–2610, 2015.
 [YLA18] Eunho Yang, Aurélie C Lozano, and Aleksandr Aravkin. A general family of trimmed estimators for robust high-dimensional data analysis. Electronic Journal of Statistics, 12(2):3519–3553, 2018.
 [YLZ18] XiaoTong Yuan, Ping Li, and Tong Zhang. Gradient hard thresholding pursuit. Journal of Machine Learning Research, 18(166):1–43, 2018.
 [Zhu17] Ziwei Zhu. Taming the heavy-tailed features by shrinkage and clipping. arXiv preprint arXiv:1710.09020, 2017.
 [ZLR12] Tuo Zhao, Han Liu, Kathryn Roeder, John Lafferty, and Larry Wasserman. The huge package for high-dimensional undirected graph estimation in R. Journal of Machine Learning Research, 13(Apr):1059–1062, 2012.
 [ZWJ14] Yuchen Zhang, Martin J Wainwright, and Michael I Jordan. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. In Conference on Learning Theory, pages 921–948, 2014.
1.2 Notations in the Appendix
In our proofs, the exponent in the tail bounds is arbitrary, and can be changed to any other larger constant without affecting the results. Throughout, $C, C_1, C_2, \ldots$ denote universal constants, and they may change from line to line.
Appendix A Proofs for the gradient estimators
In Robust Hard Thresholding (Algorithm 1), we use either the trimmed gradient estimator or the MOM gradient estimator. In Theorem 3.1, the key quantity controlling the statistical rates of convergence is the Robust Descent Condition (Definition 3.1).
In this section, we provide one direct route to verifying the Robust Descent Condition: bounding the infinity norm of the robust gradient estimate (Propositions A.1 and A.2). By Hölder's inequality, we have
$$\langle \widehat{g}(\theta) - \nabla \bar{\mathcal{L}}(\theta),\, \theta - \theta^* \rangle \le \big\| \widehat{g}(\theta) - \nabla \bar{\mathcal{L}}(\theta) \big\|_\infty \, \| \theta - \theta^* \|_1.$$
Later, in Appendix B, we will leverage Propositions A.1 and A.2 in verifying the Robust Descent Condition for the trimmed/MOM gradient estimators under sparse linear/logistic regression. Together with Theorem 3.1, this completes the proofs of Theorems 4.1–4.4.
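This route can be sanity-checked numerically. The sketch below (illustrative only, with synthetic data of our choosing; not the paper's code) verifies the chain $\langle v, u \rangle \le \|v\|_\infty \|u\|_1 \le \sqrt{k}\,\|v\|_\infty \|u\|_2$ for a $k$-sparse vector $u$, which is how the infinity-norm bound on the gradient error feeds the Robust Descent Condition:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 100, 5
err = rng.normal(size=d)                  # stand-in for the gradient-estimation error
u = np.zeros(d)
u[rng.choice(d, size=k, replace=False)] = rng.normal(size=k)  # k-sparse u = theta - theta*

inner = abs(err @ u)
holder = np.max(np.abs(err)) * np.sum(np.abs(u))               # Hölder: |<v,u>| <= ||v||_inf ||u||_1
sparse = np.max(np.abs(err)) * np.sqrt(k) * np.linalg.norm(u)  # ||u||_1 <= sqrt(k) ||u||_2 for k-sparse u
print(inner <= holder <= sparse)
```

The second inequality is what converts a coordinate-wise (infinity-norm) guarantee into an $\ell_2$-type bound at only a $\sqrt{k}$ price, which is the source of the $\sqrt{k \log d / n}$ rates.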
Proposition A.1.
Suppose we observe corrupted sub-Gaussian samples (Definition 2.1). With probability at least $1 - d^{-3}$, the coordinate-wise trimmed gradient estimator guarantees
$$\big\| \widehat{g}(\theta) - \nabla \bar{\mathcal{L}}(\theta) \big\|_\infty \le C \big( \|\theta - \theta^*\|_2 + \sigma \big) \left( \varepsilon \log(1/\varepsilon) + \sqrt{\frac{\log d}{n}} \right)$$
for sparse linear regression (Theorem 4.1), and
$$\big\| \widehat{g}(\theta) - \nabla \bar{\mathcal{L}}(\theta) \big\|_\infty \le C \left( \varepsilon \log(1/\varepsilon) + \sqrt{\frac{\log d}{n}} \right)$$
for sparse logistic regression (Theorem 4.2).
Proposition A.2.
Suppose we observe samples from the heavy-tailed model with bounded-4th-moment covariates. With probability at least $1 - d^{-3}$, the coordinate-wise Median-of-Means gradient estimator guarantees
$$\big\| \widehat{g}(\theta) - \nabla \bar{\mathcal{L}}(\theta) \big\|_\infty \le C \big( \|\theta - \theta^*\|_2 + \sigma \big) \sqrt{\frac{\log d}{n}}$$
for sparse linear regression, and
$$\big\| \widehat{g}(\theta) - \nabla \bar{\mathcal{L}}(\theta) \big\|_\infty \le C \sqrt{\frac{\log d}{n}}$$
for sparse logistic regression.
A.1 Proofs for the MOM gradient estimator
We first prove Proposition A.2. Proposition A.1, for the trimmed gradient estimator applied to corrupted sub-Gaussian samples, has the same dependency on $n$ and $d$. The proof of Proposition A.1 leverages standard concentration bounds for sub-Gaussian samples, and then uses trimming to control the effect of outliers.
Proof of Proposition A.2.
For the squared loss, we have $\nabla \mathcal{L}_i(\theta) = (\langle x_i, \theta \rangle - y_i) x_i$, where we omit the subscript $i$ in the proof. We denote $u = \theta - \theta^*$, and bound the operator norm of the covariance of the gradient samples:
$$\big\| \mathrm{Cov}\big( \nabla \mathcal{L}(\theta) \big) \big\|_2 \le 2 \sup_{\|v\|_2 = 1} \mathbb{E}\big[ \langle x, u \rangle^2 \langle x, v \rangle^2 \big] + 2 \sigma^2 \sup_{\|v\|_2 = 1} \mathbb{E}\big[ \langle x, v \rangle^2 \big] \overset{(i)}{\le} 2 \sup_{\|v\|_2 = 1} \sqrt{ \mathbb{E}\big[ \langle x, u \rangle^4 \big] \, \mathbb{E}\big[ \langle x, v \rangle^4 \big] } + 2 \sigma^2 \sup_{\|v\|_2 = 1} \mathbb{E}\big[ \langle x, v \rangle^2 \big] \overset{(ii)}{\le} C \big( \|u\|_2^2 + \sigma^2 \big),$$
where (i) follows from Hölder's inequality, and (ii) follows from the 4th-moment bound assumption.
Hence, by using the coordinate-wise Median-of-Means gradient estimator, for each coordinate $j$ we have
$$\big| \mathrm{MOM}\big( \{ \nabla \mathcal{L}_i(\theta)^j \} \big) - \nabla \bar{\mathcal{L}}(\theta)^j \big| \overset{(i)}{\le} C \big( \|\theta - \theta^*\|_2 + \sigma \big) \sqrt{\frac{\log d}{n}}$$
with probability at least $1 - d^{-4}$, where (i) follows from Proposition 5 in [HS16]. Applying union bounds on all $d$ indexes, we have $\| \widehat{g}(\theta) - \nabla \bar{\mathcal{L}}(\theta) \|_\infty \le C (\|\theta - \theta^*\|_2 + \sigma) \sqrt{\log d / n}$ with probability at least $1 - d^{-3}$.
For the logistic loss, the gradient for one sample can be computed as $\nabla \mathcal{L}_i(\theta) = \big( \psi(\langle x_i, \theta \rangle) - y_i \big) x_i$ with $\psi(t) = 1/(1+e^{-t})$, where we omit the subscript $i$ in the proof.
Since $|\psi(\langle x, \theta \rangle) - y| \le 1$ and $\mathbb{E}[\langle x, e_j \rangle^4] \le C$, we directly have $\mathrm{Var}\big( \nabla \mathcal{L}(\theta)^j \big) \le C$. Similar to the case of the squared loss, we have $\| \widehat{g}(\theta) - \nabla \bar{\mathcal{L}}(\theta) \|_\infty \le C \sqrt{\log d / n}$, with probability at least $1 - d^{-3}$.
∎
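The estimator analyzed above is simple to implement. The following NumPy sketch is an illustrative implementation under assumed conventions (block count, random block assignment, a Student-$t$ data model with finite 4th moments); it is not the authors' code:

```python
import numpy as np

def mom_gradient(grads, num_blocks, rng):
    """Coordinate-wise median-of-means over per-sample gradients.

    grads: (n, d) array whose rows are per-sample gradients.
    Randomly partitions the rows into `num_blocks` blocks, averages
    within each block, then takes the coordinate-wise median of the
    block means.
    """
    idx = rng.permutation(grads.shape[0])
    blocks = np.array_split(grads[idx], num_blocks)
    block_means = np.stack([b.mean(axis=0) for b in blocks])  # (num_blocks, d)
    return np.median(block_means, axis=0)

# Heavy-tailed sparse linear regression: t(5) covariates and noise
# (df = 5 gives finite 4th moments, matching the assumption above).
rng = np.random.default_rng(0)
n, d = 2000, 50
X = rng.standard_t(df=5, size=(n, d))
theta_star = np.zeros(d); theta_star[:5] = 1.0
y = X @ theta_star + rng.standard_t(df=5, size=n)

theta = np.zeros(d)
grads = (X @ theta - y)[:, None] * X     # per-sample squared-loss gradients
g_hat = mom_gradient(grads, num_blocks=20, rng=rng)
print(np.max(np.abs(g_hat - grads.mean(axis=0))))  # deviation from the naive mean
```

Each block mean is a cheap but fragile estimate; the coordinate-wise median discards block means dragged away by heavy-tailed samples, which is exactly the mechanism behind the $\sqrt{\log d / n}$ rate above.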
A.2 Proofs for the trimmed gradient estimator
We then turn to the trimmed gradient estimator for the case of arbitrary corruption. Before we proceed to the trimmed estimator, let us first revisit the definition and tail bounds of sub-exponential random variables, as they will be used in sparse linear regression, where the gradient's distribution is indeed sub-exponential under the sub-Gaussian assumptions of Theorem 4.1.
We first present standard concentration inequalities ([Wai19]).
Definition A.1 (Sub-exponential random variables).
A random variable $X$ with mean $\mu = \mathbb{E}[X]$ is sub-exponential if there are non-negative parameters $(\nu, \alpha)$ such that
$$\mathbb{E}\big[ e^{\lambda (X - \mu)} \big] \le e^{\nu^2 \lambda^2 / 2} \quad \text{for all } |\lambda| < \frac{1}{\alpha}.$$
Lemma A.1 (Bernstein's inequality).
Suppose that $X_i$, $i = 1, \ldots, n$, are i.i.d. sub-exponential random variables with parameters $(\nu, \alpha)$. Then
$$\mathbb{P}\left[ \frac{1}{n} \sum_{i=1}^n (X_i - \mu) \ge t \right] \le \begin{cases} e^{-n t^2 / (2\nu^2)}, & 0 \le t \le \nu^2/\alpha, \\ e^{-n t / (2\alpha)}, & t > \nu^2/\alpha. \end{cases}$$
We also have a two-sided tail bound:
$$\mathbb{P}\left[ \left| \frac{1}{n} \sum_{i=1}^n (X_i - \mu) \right| \ge t \right] \le 2 \exp\left( - \frac{n}{2} \min\left\{ \frac{t^2}{\nu^2}, \frac{t}{\alpha} \right\} \right).$$
We define the trimmed mean estimator for one-dimensional samples, and denote it by $\mathrm{trmean}_\alpha(\cdot)$.
Definition A.2 (Trimmed mean estimator).
Given a set of corrupted samples $\{ \widetilde{x}_i \}_{i=1}^n$, the trimmed mean estimator $\mathrm{trmean}_\alpha(\{ \widetilde{x}_i \})$ removes the largest and smallest $\alpha$ fraction of elements in $\{ \widetilde{x}_i \}$, and calculates the mean of the remaining terms. We choose $\alpha = c \varepsilon$, for a constant $c \ge 1$. We also require that $\alpha \le 1/2 - c'$, for some small constant $c' > 0$.
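Definition A.2 translates directly into code. The sketch below is illustrative only (the $10^6$-valued outlier model and the parameter choices are ours); it shows the naive mean being destroyed by an $\varepsilon$ fraction of outliers while the trimmed mean survives:

```python
import numpy as np

def trimmed_mean(samples, alpha):
    """Coordinate-wise trimmed mean: for each coordinate, drop the
    largest and smallest alpha fraction of values, average the rest.

    samples: (n, d) array; alpha in [0, 0.5).
    """
    n = samples.shape[0]
    k = int(np.ceil(alpha * n))
    s = np.sort(samples, axis=0)          # sort each coordinate independently
    kept = s[k:n - k] if k > 0 else s     # remove k smallest and k largest
    return kept.mean(axis=0)

rng = np.random.default_rng(1)
n, d, eps = 1000, 10, 0.05
clean = rng.normal(size=(n, d))           # authentic samples, mean 0
corrupted = clean.copy()
m = int(eps * n)
corrupted[:m] = 1e6                       # adversary replaces an eps fraction

print(np.abs(corrupted.mean(axis=0)).max())                  # naive mean: destroyed
print(np.abs(trimmed_mean(corrupted, alpha=2 * eps)).max())  # trimmed mean: near 0
```

Choosing $\alpha = 2\varepsilon$ here mirrors the requirement $\alpha = c\varepsilon$ with $c \ge 1$: trimming strictly more than the corruption fraction guarantees every planted outlier falls in the discarded tails.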
Lemma A.2 shows the guarantees for this robust gradient estimator in each coordinate. We note that Lemma A.2 is stronger than the guarantees for the trimmed mean estimator (Lemma 3) in [YCRB18a]: in our contamination model (Definition 2.1), the adversary may delete an $\varepsilon$ fraction of authentic samples and then add arbitrary outliers, and Lemma A.2 provides guarantees for the trimmed mean estimator on sub-exponential random variables. The trimmed mean estimator is robust enough that it allows the adversary to arbitrarily remove an $\varepsilon$ fraction of data points. We use $\{ g_i^j \}_{i=1}^n$ to denote the samples at the $j$-th coordinate of $\{ g_i \}_{i=1}^n$; we define $\{ \widetilde{g}_i^j \}_{i=1}^n$ in the same way for the corrupted samples.
Lemma A.2.
Suppose we observe corrupted samples from Definition 2.1. For each dimension $j$, we assume the samples in $\{ g_i^j \}_{i=1}^n$ are i.i.d. sub-exponential with mean $\mu^j$ and parameter $\nu$. After the contamination, we observe the $j$-th coordinate samples $\{ \widetilde{g}_i^j \}_{i=1}^n$. Then, the trimmed mean estimator on the $j$-th dimension guarantees
$$\big| \mathrm{trmean}_\alpha\big( \{ \widetilde{g}_i^j \} \big) - \mu^j \big| \le C \nu \left( \varepsilon \log(1/\varepsilon) + \sqrt{\frac{\log d}{n}} \right)$$
with probability at least $1 - d^{-4}$.
We leave the proof of Lemma A.2 to the end of this section. We now present the analysis of the trimmed gradient estimator for sparse linear regression and sparse logistic regression by using Lemma A.2. For the sparse linear regression model with sub-Gaussian covariates, the distribution of the authentic gradients is sub-exponential rather than sub-Gaussian. More specifically, we first prove that when the current parameter iterate is $\theta$, the sub-exponential parameter of every authentic gradient coordinate is $C\nu$, where $\nu = \|\theta - \theta^*\|_2 + \sigma$.
To gain some intuition for this, we can consider the sparse linear equation problem, where $\sigma = 0$. When $\theta = \theta^*$, we exactly recover $\theta^*$, and all stochastic gradients of authentic samples are zero vectors, as all observations are noiseless. It is clear that the sub-exponential parameter is then $\nu = \|\theta - \theta^*\|_2 + \sigma = 0$.
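The heavier-than-Gaussian tails of the gradient coordinates are also easy to see numerically. This small simulation (illustrative only, not part of the proof) compares a product of two independent standard Gaussians, which mimics $\langle x, \theta - \theta^* \rangle \langle x, e_j \rangle$, against a single Gaussian at the same number of standard deviations:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
g1, g2 = rng.normal(size=n), rng.normal(size=n)
prod = g1 * g2                     # sub-exponential: tails decay like exp(-c t)

z = rng.normal(size=n)             # sub-Gaussian baseline: tails like exp(-t^2/2)
t = 5.0
tail_prod = (np.abs(prod / prod.std()) > t).mean()
tail_gauss = (np.abs(z / z.std()) > t).mean()
print(tail_prod, tail_gauss)       # the product's 5-sigma tail mass is far larger
```

This is why the analysis below works with Bernstein-type (sub-exponential) concentration rather than sub-Gaussian bounds, even though the covariates themselves are sub-Gaussian.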
Proof of Proposition A.1.
For any $\theta$, the gradient for one sample can be written as
$$\nabla \mathcal{L}_i(\theta) = \big( \langle x_i, \theta \rangle - y_i \big) x_i = \langle x_i, \theta - \theta^* \rangle x_i - \xi_i x_i,$$
where we omit the subscript $i$ in the proof. For any fixed standard basis vector $e_j$, defining $Z = \langle e_j, \nabla \mathcal{L}(\theta) \rangle$, we have
$$Z = \langle x, \theta - \theta^* \rangle \langle x, e_j \rangle - \xi \langle x, e_j \rangle. \tag{4}$$
To characterize the tail bounds of $Z$, we study the moment generating function $\mathbb{E}[e^{\lambda Z}]$.
We denote by $\epsilon'$ a Rademacher random variable, which is independent of $x$ and $\xi$. Then we can use a standard symmetrization technique [Wai19]:
$$\mathbb{E}\big[ e^{\lambda \langle x, \theta - \theta^* \rangle \langle x, e_j \rangle} \big] \le \mathbb{E}\big[ e^{2 \lambda \epsilon' \langle x, \theta - \theta^* \rangle \langle x, e_j \rangle} \big] \overset{(i)}{=} \mathbb{E}\left[ \sum_{k=0}^{\infty} \frac{ \big( 2 \lambda \epsilon' \langle x, \theta - \theta^* \rangle \langle x, e_j \rangle \big)^k }{ k! } \right] \overset{(ii)}{=} \sum_{k=0}^{\infty} \frac{ (2\lambda)^{2k} \, \mathbb{E}\big[ \big( \langle x, \theta - \theta^* \rangle \langle x, e_j \rangle \big)^{2k} \big] }{ (2k)! },$$
where $(i)$ follows from the exponential function's power series expansion, and $(ii)$ follows from the independence of $\epsilon'$, together with the fact that all odd-moment terms have zero means.
By the Cauchy–Schwarz inequality, we have
$$\mathbb{E}\big[ \big( \langle x, \theta - \theta^* \rangle \langle x, e_j \rangle \big)^{2k} \big] \le \sqrt{ \mathbb{E}\big[ \langle x, \theta - \theta^* \rangle^{4k} \big] \; \mathbb{E}\big[ \langle x, e_j \rangle^{4k} \big] }.$$
It is clear that $\langle x, \theta - \theta^* \rangle$ is a sub-Gaussian random variable with parameter $C \|\theta - \theta^*\|_2$. Since $\|e_j\|_2 = 1$, for any fixed standard basis vector $e_j$, we can conclude that $\langle x, e_j \rangle$ is sub-Gaussian with parameter at most $C$, based on the assumptions of Theorem 4.1. By basic properties of sub-Gaussian random variables [Wai19], we have
$$\mathbb{E}\big[ \langle x, \theta - \theta^* \rangle^{4k} \big] \le 2 \, (2k)! \, \big( C \|\theta - \theta^*\|_2 \big)^{4k}, \qquad \mathbb{E}\big[ \langle x, e_j \rangle^{4k} \big] \le 2 \, (2k)! \, C^{4k}.$$
Hence, we have
$$\mathbb{E}\big[ e^{2 \lambda \epsilon' \langle x, \theta - \theta^* \rangle \langle x, e_j \rangle} \big] \le 2 \sum_{k=0}^{\infty} \big( C \lambda \|\theta - \theta^*\|_2 \big)^{2k}, \tag{5}$$
where the moment bounds above use the elementary inequality $(2k)! \ge 2^k (k!)^2$ (proved by mathematical induction). When $C |\lambda| \|\theta - \theta^*\|_2 \le 1/2$, the geometric series in eq. (5) converges to a value at most $\exp\big( C' \lambda^2 \|\theta - \theta^*\|_2^2 \big)$. Combining this with the noise term $\xi \langle x, e_j \rangle$ in eq. (4), which is the product of two independent sub-Gaussian random variables with parameters $\sigma$ and $C$, we obtain
$$\mathbb{E}\big[ e^{\lambda Z} \big] \le \exp\big( C \lambda^2 \nu^2 \big) \quad \text{for all } |\lambda| \le \frac{1}{C \nu}, \qquad \nu = \|\theta - \theta^*\|_2 + \sigma.$$
That is, $Z$ is a sub-exponential random variable. By choosing $e_j$ as each standard basis vector in $\mathbb{R}^d$, each coordinate of the gradient has sub-exponential parameter $C(\|\theta - \theta^*\|_2 + \sigma)$.
Then, applying Lemma A.2 to this collection of corrupted sub-exponential random variables, we have
$$\big| \mathrm{trmean}_\alpha\big( \{ \widetilde{g}_i^j \} \big) - \nabla \bar{\mathcal{L}}(\theta)^j \big| \le C \big( \|\theta - \theta^*\|_2 + \sigma \big) \left( \varepsilon \log(1/\varepsilon) + \sqrt{\frac{\log d}{n}} \right) \tag{6}$$
with probability at least $1 - d^{-4}$.
Applying union bounds on eq. (6) for all $d$ indexes, we have