High Dimensional Robust M-Estimation: Arbitrary Corruption and Heavy Tails


Liu Liu
liuliu@utexas.edu
   Tianyang Li
lty@cs.utexas.edu
   Constantine Caramanis
constantine@utexas.edu
   The University of Texas at Austin
Abstract

We consider the problem of sparsity-constrained M-estimation when both explanatory and response variables have heavy tails (bounded 4-th moments), or when an ε-fraction of the samples is arbitrarily corrupted. We focus on the k-sparse, high-dimensional regime where the number of variables d may greatly exceed the sample size n. We define a natural condition we call the Robust Descent Condition (RDC), and show that if a gradient estimator satisfies the RDC, then Robust Hard Thresholding (IHT using this gradient estimator) is guaranteed to obtain good statistical rates. The contribution of this paper is in showing that this RDC is a flexible enough concept to recover known results, and obtain new robustness results. Specifically, new results include: (a) For k-sparse high-dimensional linear- and logistic-regression with heavy tailed (bounded 4-th moment) explanatory and response variables, a linear-time-computable median-of-means gradient estimator satisfies the RDC, and hence Robust Hard Thresholding is minimax optimal; (b) When instead of heavy tails we have an ε-fraction of arbitrary corruptions in explanatory and response variables, a near linear-time computable trimmed gradient estimator satisfies the RDC, and hence Robust Hard Thresholding is minimax optimal. We demonstrate the effectiveness of our approach in sparse linear regression, sparse logistic regression, and sparse precision matrix estimation on synthetic and real-world US equities data.

1 Introduction

M-estimation is a standard technique for statistical estimation [vdV00]. The past decade has seen successful extensions of M-estimation to the high dimensional setting with sparsity (or other low-dimensional structure), e.g., using Lasso [Tib96, BvdG11, HTW15, Wai19]. Yet sparse modeling in high dimensions is NP-hard in the worst case [BDMS13, ZWJ14]. Thus theoretical sparse recovery guarantees for most computationally tractable approaches (e.g., ℓ1 minimization [Don06, CRT04, Wai09], Iterative Hard Thresholding [BD09]) rely on strong assumptions on the probabilistic models of the data, such as sub-Gaussianity. Under such assumptions, these approaches achieve the minimax rate for sparse regression [RWY11].

Meanwhile, statistical estimation with heavy tailed outliers or even arbitrary corruptions has long been a focus in robust statistics [Box53, Tuk75, Hub11, HRRS11]. (Following [Min18, FWZ16], by heavy-tail we mean distributions satisfying only weak moment bounds, specifically bounded 4-th order moments, as compared to sub-exponential or sub-Gaussian.) But heavy tails and arbitrary corruptions in the data violate the assumptions required for convergence of the usual algorithms. A central question, then, is what assumptions are sufficient to enable efficient and robust algorithms for high dimensional M-estimation under heavy tails or arbitrary corruption.

Huber’s seminal work [Hub64] and more modern followup work [Loh17] considered replacing the classical least squares risk minimization objective with a robust counterpart (e.g., the Huber loss). Other approaches (e.g., [Li13]) considered regularization-based robustness. However, when there are outliers in the explanatory variables (covariates), these approaches do not seem to succeed [CCM13]. Approaches combining recent advances in robust mean estimation and gradient descent have proved remarkably powerful in the low-dimensional setting [PSBR18, KKM18, DKK18], but in high dimensions they have so far only managed to address the setting where the covariance of the explanatory variables is the identity, or sparse [BDLS17, LSLC18]. Flexible and statistically optimal approaches ([Gao17]) have relied on intractable estimators such as Tukey depth.

For the heavy-tail setting, another line of research considers estimators such as Median-of-Means (MOM) [NY83, JVV86, AMS99, Min15] and Catoni's mean estimator [Cat12, Min18], which only use weak moment assumptions. [Min15, BJL15, HS16] generalized these ideas to M-estimation, yet it is not clear whether these approaches apply to the high-dimensional setting with heavy tailed covariates.

Main Contributions. In this paper, we develop a sufficient condition that when satisfied, guarantees that an efficient algorithm (a variant of IHT) achieves the minimax optimal statistical rate. We show that our condition is flexible enough to apply to a number of important high-dimensional estimation problems under either heavy tails, or arbitrary corruption of the data. Specifically:

  1. We consider two models. For our arbitrary corruption model, we assume that an adversary replaces an arbitrary ε-fraction of the authentic samples with arbitrary values (Definition 2.1). For the heavy-tailed model, we assume our data (response and covariates) satisfy only weak moment assumptions (Definition 2.2), without sub-Gaussian or sub-exponential concentration bounds.

  2. We propose a notion that we call the Robust Descent Condition (RDC). Given any gradient estimator that satisfies the RDC, we define RHT – Robust Hard Thresholding (Algorithm 1) for sparsity constrained M-estimation, and prove that Algorithm 1 converges linearly to a minimax statistically optimal solution. Thus the RDC and Robust Hard Thresholding form the basis for a Deterministic Meta-Theorem (Theorem 3.1) that guarantees estimation error rates as soon as the RDC property of any gradient estimator can be certified.

  3. We then obtain non-asymptotic bounds by certifying the RDC for different robust gradient estimators under various statistical models. (A) For corruptions in both response and explanatory variables, we show the trimmed gradient estimator satisfies the RDC. Thus our algorithm RHT has minimax-optimal statistical error, and tolerates an ε-fraction of outliers, where the allowed fraction is nearly independent of the ambient dimension d, which is important in the high dimensional regime. (B) In the heavy tailed regime, we use the Median-of-Means (MOM) gradient estimator. Our RHT algorithm obtains the sharpest available error bound, in fact nearly matching the results in the sub-Gaussian case. With either of these gradient estimators, our algorithm is computationally efficient, nearly matching vanilla gradient descent. This is in particular much faster than algorithms relying on sparse PCA relaxations as subroutines ([BDLS17, LSLC18]).

  4. We use Robust Hard Thresholding for neighborhood selection [MB06] for estimating Gaussian graphical models, and provide model selection guarantees under adversarial corruption of the data; our results share similar robustness guarantees with sparse regression.

  5. We demonstrate the effectiveness of Robust Hard Thresholding on both arbitrarily corrupted/heavy tailed synthetic data and (unmodified) real data.

A concrete illustration of 3(B) above: Consider a sparse linear regression problem without noise (sparse linear equations), in the high-dimensional scaling where the sample size is far smaller than the dimension. When the covariates are sub-Gaussian, Lasso succeeds in exact recovery with high probability (as expected). When the covariates have only bounded 4-th moments, we do not expect Lasso to succeed, and indeed experiments confirm this. Moreover, to the best of our knowledge, no previous efficient algorithm using a comparable number of samples can guarantee exact recovery in this observation model ([FWZ16] has a statistical rate depending on the norm of the true parameter, and thus exact recovery is not guaranteed even in the noiseless case). Our contributions show that Robust Hard Thresholding using MOM achieves this (see also the simulations in Figure 3(b)).

Related work

Sparse regression with arbitrary corruptions or heavy tails. Several works on robust high-dimensional estimation consider heavy tailed distributions or arbitrary corruptions only in the response variables [Li13, BJK15, BJK17, Loh17, KP18, HS16, Min15, CLZL]. These algorithms cannot be trivially extended to the setting with heavy tails or corruptions in the explanatory variables. Another line [ACG13, VMX17, YLA18, SS18] focuses on alternating minimization approaches which extend Least Trimmed Squares [Rou84]. However, these methods only have local convergence guarantees, and cannot handle arbitrary corruptions.

[CCM13] was one of the first papers to provide guarantees for sparse regression with arbitrary outliers in both response and explanatory variables, by trimming the design matrix. Similar trimming techniques are also used in [FWZ16] for heavy tails in response and explanatory variables. Those results are specific to sparse regression, however, and cannot be easily extended to general M-estimation problems. Moreover, even for linear regression, the statistical rates are not minimax optimal. [LM16] uses Median-of-Means tournaments to deal with heavy tails in the explanatory variables and obtains near optimal rates; however, Median-of-Means tournaments are not known to be computationally tractable. [LL17] deals with heavy tails and outliers in the explanatory variables, but requires higher-order moment bounds in the isotropic design case. [Gao17] optimizes Tukey depth [Tuk75, CGR18] for robust sparse regression under Huber's ε-contamination model; the resulting estimator is minimax optimal and can handle a constant fraction of outliers, but computing Tukey depth is intractable [JP78]. Recent results [BDLS17, LSLC18] leverage robust sparse mean estimation in robust sparse regression. Their algorithms are computationally tractable and can tolerate a constant fraction of outliers, but they require very restrictive assumptions on the covariance matrix (identity or sparse), which precludes their use in applications such as graphical model estimation.

Robust M-estimation via robust gradient descent. The works [CSX17, HI17] and later [YCRB18a] first leveraged the idea of using robust mean estimation in each step of gradient descent, using a subroutine such as the geometric median. A similar approach using more sophisticated robust mean estimation methods was later proposed in [PSBR18, DKK18, YCRB18b, SX18, Hol18] for robust gradient descent. These methods all focus on low dimensional robust M-estimation. The work [LSLC18] extended the approach to the high-dimensional setting (though it is limited to identity or sparse covariances). Even though the corrupted fraction can be made independent of the ambient dimension by using sophisticated robust mean estimation algorithms [DKK16, LRV16, SCV17] or the sum-of-squares framework [KKM18], these algorithms (except [LSLC18]) are not applicable to the high dimensional setting where the dimension exceeds the sample size, as they require a number of samples at least proportional to the dimension.

Robust estimation of graphical models. A line of research using a robustified covariance matrix in Gaussian graphical models [LHY12b, WG17, LT18] leverages GLasso [FHT08] or CLIME [CLL11] to estimate the sparse precision matrix. These robust methods are restricted to Gaussian graphical model estimation, and their techniques cannot be generalized to other M-estimation problems.

Notation. We denote the hard thresholding operator of sparsity k by HT_k(·), which keeps the k largest-magnitude coordinates and sets the rest to zero, and denote the Euclidean projection onto the ℓ2 ball of radius R by Π_R(·). We use Ê_n to denote the expectation operator given by the uniform distribution over the n observed samples.

2 Problem formulation

We now define the corruption and heavy-tail models, and sparsity constrained M-estimation.

Definition 2.1 (ε-corrupted samples).

Let {z_i}_{i=1}^n be i.i.d. observations drawn from a distribution P. We say that a collection of samples is ε-corrupted if an adversary chooses an arbitrary ε-fraction of the samples and modifies them with arbitrary values.

This corruption model allows corruptions in both explanatory and response variables in regression problems, where we observe pairs z_i = (x_i, y_i). Definition 2.1 also allows the adversary to select an ε-fraction of samples to delete and corrupt.

Definition 2.2 (heavy-tailed samples).

For a random vector x with mean μ and covariance Σ, we say that x has bounded 4-th moment if there is a universal constant C₄ such that, for every unit vector v, we have E[⟨v, x − μ⟩⁴] ≤ C₄ (E[⟨v, x − μ⟩²])².

Definition 2.2 allows heavy tails in both explanatory and response variables. For example, in Corollary 4.3, we study linear regression with bounded 4-th moments for the covariates and bounded variance for the noise.

Let ℓ(θ; z) be a convex and differentiable loss function. Our target is the unknown sparse population minimizer θ*, and we write R̄(θ) = E_{z∼P}[ℓ(θ; z)] for the population risk, so that θ* = argmin_{‖θ‖₀ ≤ k*} R̄(θ). Note that this definition of θ* allows model misspecification. The following Definition 2.3 provides general assumptions on the population risk.

Definition 2.3 (Strong convexity/smoothness).

For the population risk R̄, we assume that for all θ₁, θ₂,
μ/2 ‖θ₁ − θ₂‖₂² ≤ R̄(θ₁) − R̄(θ₂) − ⟨∇R̄(θ₂), θ₁ − θ₂⟩ ≤ L/2 ‖θ₁ − θ₂‖₂²,
where μ is the strong-convexity parameter and L is the smoothness parameter. The condition number is κ = L/μ.

A well known result [NRWY12] considers ERM with a convex relaxation from the ℓ0 constraint to the ℓ1 norm, by certifying the restricted strong convexity (RSC) condition for sub-Gaussian ensembles – this yields uniform convergence of the empirical risk. From an optimization viewpoint, existing results show that gradient descent algorithms equipped with soft-thresholding [ANW12] or hard-thresholding [BD09, JTK14, SL17, YLZ18, LB18] have a linear convergence rate, and achieve known minimax lower bounds in statistical estimation [RWY11, ZWJ14].

Given n samples that may be heavy tailed or corrupted, running ERM on the entire input dataset, θ̂ = argmin_{‖θ‖₀ ≤ k} Ê_n[ℓ(θ; z_i)], cannot guarantee uniform convergence of the empirical risk, and can be arbitrarily bad for ε-corrupted samples. The next two sections outline the main results of this paper, which address this problem.

3 Robust sparse estimation via Robust Hard Thresholding

We introduce our meta-algorithm, Robust Hard Thresholding, which essentially uses a robust gradient estimator to run IHT. We require several definitions to specify the algorithm and describe its guarantees. We use Ĝ(θ) as a placeholder for the robust estimate of the gradient at θ, obtained from whichever robust gradient estimator we are using. Let ∇R̄(θ) denote the population gradient. We write Ĝ and ∇R̄ when the context is clear.

Many previous works ([CSX17, HI17, PSBR18, DKK18, YCRB18a, YCRB18b, SX18]) have provided algorithms for obtaining robust gradient estimators, which are then used as subroutines in robust gradient algorithms. However, those results require controlling the ℓ2 error ‖Ĝ(θ) − ∇R̄(θ)‖₂, and do not readily extend to high dimensions, as sufficiently controlling this quantity seems to require a sample size growing with the ambient dimension d. A recent work [LSLC18] on robust sparse linear regression uses a robust sparse mean estimator [BDLS17] to control this ℓ2 error with sample complexity that scales only logarithmically with d. However, their algorithm requires the restrictive assumption that the covariance is identity or sparse, and thus cannot be extended to more general M-estimation problems.

To address this issue, we propose Robust Hard Thresholding (Algorithm 1), which applies hard thresholding after each robust gradient update. (Our theory requires splitting samples across different iterations to maintain independence between iterations. We believe this is an artifact of the analysis, and do not use sample splitting in our experiments; [BWY17, PSBR18] use a similar device for theoretical analysis.) In line 7, we use a gradient estimator to obtain the robust gradient estimate Ĝ(θ^{t−1}). In line 8, we update the parameter and hard threshold it to sparsity k, where the hyper-parameter k is chosen proportional to the true sparsity k*, with a constant depending on the condition number κ specified in Definition 2.3. A key observation in line 8 is that, in each step of IHT, the iterate is sparse, and thus the perturbation from outliers or heavy tails only depends on IHT's sparsity level instead of the ambient dimension d. Based on a careful analysis of the hard thresholding operator in each iteration, we show that rather than controlling ‖Ĝ(θ) − ∇R̄(θ)‖₂, it is enough to control a weaker quantity: this is what we call the Robust Descent Condition (Definition 3.1), which we define next; it plays a key role in obtaining sharp rates of convergence for various types of statistical models. (A Python sketch of the overall loop is given after Algorithm 1.)

1:  Input: Data samples {z_i}_{i=1}^n, gradient estimator Ĝ.
2:  Output: The estimate θ̂.
3:  Parameters: Hard thresholding parameter k, step size η.
4:  Split the samples into T subsets {Z_t}_{t=1}^T, each of size n/T. Initialize with θ⁰ = 0.
5:  for t = 1 to T do
6:     At the current θ^{t−1}, calculate all gradients for the current samples: ∇ℓ(θ^{t−1}; z_i), for z_i ∈ Z_t.
7:     Apply the gradient estimator to these gradient samples to obtain the robust estimate Ĝ(θ^{t−1}).
8:     Update the parameter: θ^t = HT_k(θ^{t−1} − η Ĝ(θ^{t−1})). Then project: θ^t ← Π_R(θ^t).
9:  end for
10:  Output the estimate θ̂ = θ^T.
Algorithm 1 Robust Hard Thresholding
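To make the update loop concrete, the following is a minimal Python sketch of Algorithm 1 for the squared loss; the function names, the reuse of the full data set in every iteration (instead of sample splitting), and the optional projection radius are illustrative choices, not the exact implementation used in our experiments.

import numpy as np

def hard_threshold(theta, k):
    # Keep the k largest-magnitude coordinates of theta; zero out the rest.
    out = np.zeros_like(theta)
    if k > 0:
        idx = np.argpartition(np.abs(theta), -k)[-k:]
        out[idx] = theta[idx]
    return out

def robust_hard_thresholding(X, y, grad_estimator, k, step_size, n_iters, radius=None):
    # Sketch of Algorithm 1 (Robust Hard Thresholding) for the squared loss.
    # grad_estimator maps an (m, d) array of per-sample gradients to a length-d vector.
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        residuals = X @ theta - y            # per-sample residuals, shape (n,)
        grads = residuals[:, None] * X       # per-sample gradients (x_i^T theta - y_i) x_i
        g_hat = grad_estimator(grads)        # robust aggregate of the gradients
        theta = hard_threshold(theta - step_size * g_hat, k)
        if radius is not None:               # optional projection onto the ell_2 ball
            norm = np.linalg.norm(theta)
            if norm > radius:
                theta *= radius / norm
    return theta

With grad_estimator = lambda G: G.mean(axis=0), this reduces to vanilla IHT; the robust estimators of Definition 4.1 are sketched in Section 4.1.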

Robust Descent Condition

The Robust Descent Condition, eq. (1), provides an upper bound on the inner product of the gradient-estimation error with the direction toward the population optimum. This is a natural quantity for controlling the progress obtained by using a robust gradient update instead of the population gradient.

Definition 3.1 (α(n, δ)-Robust Descent Condition (RDC)).

For the population gradient ∇R̄(θ) at θ, a robust gradient estimator Ĝ satisfies the robust descent condition if, for any sparse θ,

⟨Ĝ(θ) − ∇R̄(θ), θ − θ*⟩ ≤ α(n, δ) ‖θ − θ*‖₂.   (1)

We begin with a Meta-Theorem for Algorithm 1 that holds under the Robust Descent Condition (Definition 3.1) and the assumptions on the population risk (Definition 2.3). In Theorem 3.1, we prove Algorithm 1's global convergence and its statistical guarantees. The proofs are collected in Appendix B.

Theorem 3.1 (Meta-Theorem).

Suppose we observe n samples from a statistical model with population risk satisfying Definition 2.3. If a robust gradient estimator satisfies the α(n, δ)-Robust Descent Condition (Definition 3.1) with α sufficiently small relative to the strong-convexity parameter μ, then Algorithm 1, with hard-thresholding sparsity k proportional to k* (with a constant depending on the condition number κ) and a suitable step size η, converges linearly and outputs θ̂ satisfying ‖θ̂ − θ*‖₂ = O(α/μ), where α is evaluated at the per-iteration sample size n/T and confidence level δ/T.

We note that Theorem 3.1 is deterministic in nature. In the sequel, we omit the logarithmic factor in the sample complexity incurred by sample splitting. We obtain high probability results by certifying that the RDC holds for specific robust gradient estimators under various statistical models. To obtain the minimax estimation error rate from Theorem 3.1, the key step is providing a robust gradient estimator with a sufficiently small α(n, δ) in the definition of the RDC.

Section 4 uses the RDC and Theorem 3.1 to obtain new results for sparse regression under heavy tails or arbitrary corruption. Before moving to this, we observe that the RDC and Theorem 3.1 can also be used to recover existing results in the literature. Some immediate examples are as follows:

Uncorrupted gradient satisfies the RDC. Suppose the samples follow sparse linear regression with sub-Gaussian covariates and noise with variance σ². The empirical average of the gradient samples satisfies eq. (1) with α(n, δ) of order σ√(k log d / n) (under boundedness constraints on θ and θ*, using the deviation bounds of [LW11]). Plugging this into Theorem 3.1 recovers the well-known minimax rate for sparse linear regression [RWY11].
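As a quick check on the scaling of α in this example, the following derivation (a sketch under the sub-Gaussian assumptions above, with k denoting the relevant sparsity level of θ − θ*) shows how an ℓ∞ deviation bound for the averaged gradient translates into the RDC parameter:

\[
\langle \widehat G(\theta) - \nabla \bar R(\theta),\, \theta - \theta^\ast \rangle
\;\le\; \|\widehat G(\theta) - \nabla \bar R(\theta)\|_\infty \,\|\theta - \theta^\ast\|_1
\;\le\; \sqrt{2k}\,\|\widehat G(\theta) - \nabla \bar R(\theta)\|_\infty \,\|\theta - \theta^\ast\|_2 ,
\]

so an ℓ∞ deviation of order \(\sigma\sqrt{\log d / n}\) (standard for sub-Gaussian designs) gives \(\alpha(n, \delta)\) of order \(\sigma\sqrt{k \log d / n}\).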

RSGE implies RDC. When the covariance Σ is the identity or is sparse, [BDLS17] and [LSLC18], respectively, provide robust sparse gradient estimators (RSGE) which upper bound the ℓ2 error ‖Ĝ(θ) − ∇R̄(θ)‖₂ for a constant fraction of corrupted samples. Noting that ⟨Ĝ(θ) − ∇R̄(θ), θ − θ*⟩ ≤ ‖Ĝ(θ) − ∇R̄(θ)‖₂ ‖θ − θ*‖₂, we observe that an RSGE implies the RDC. Hence any RSGE can be used in Algorithm 1. The RSGE for identity covariance in [BDLS17] and the RSGE for unknown sparse covariance in [LSLC18] each certify the RDC with the values of α and the sample complexities established in those papers; plugging these values into our theorem recovers their results. (It remains an open question to obtain an RSGE tolerating a constant fraction of outliers for robust sparse regression with arbitrary covariance Σ.)

4 Main Results: Using the RDC and Algorithm 1

In the remainder of the paper, we use Theorem 3.1 and the RDC to analyze two well-known and computationally efficient robust mean estimation subroutines that have been used in the low-dimensional setting: the trimmed mean estimator and the MOM estimator. We show that both achieve a sufficiently small α(n, δ) in the definition of the RDC. This leads to the minimax estimation error in the case of arbitrary corruptions or heavy tails.

4.1 Gradient estimation

The trimmed mean and MOM estimators have been successfully applied to robustify gradient descent [YCRB18a, PSBR18] in the low dimensional setting. They have not been used in the high dimensional regime, however, because until now we have not had the machinery to analyze their algorithmic convergence, statistical rates and minimax optimality in the high dimensional setting.

To show that they satisfy the RDC with a sufficiently small α, we observe that by Hölder's inequality applied to the LHS of eq. (1),
⟨Ĝ(θ) − ∇R̄(θ), θ − θ*⟩ ≤ ‖Ĝ(θ) − ∇R̄(θ)‖_∞ ‖θ − θ*‖₁.
In Algorithm 1, the hard thresholding step enforces sparsity of θ, so θ − θ* is sparse and ‖θ − θ*‖₁ ≲ √k ‖θ − θ*‖₂. Therefore, controlling α amounts to bounding the infinity norm of the error of the robust gradient estimate.

In Section 4.2, we show that by using coordinate-wise robust mean estimation, we can certify the RDC with an α small enough to guarantee minimax rates. Specifically, we show this for the trimmed gradient estimator under arbitrary corruption, and for the MOM gradient estimator under heavy tailed distributions.

Definition 4.1.

Given n gradient samples {∇ℓ(θ; z_i)}_{i=1}^n, for each dimension j ∈ {1, …, d}:

Trimmed gradient estimator: remove the largest and smallest cε-fraction of the elements {⟨∇ℓ(θ; z_i), e_j⟩}_{i=1}^n, and calculate the mean of the remaining terms. Here the trimming level cε is chosen for a constant c ≥ 1, and we require cε ≤ 1/2 − c′ for a small constant c′ > 0.

MOM gradient estimator: partition {⟨∇ℓ(θ; z_i), e_j⟩}_{i=1}^n into b blocks, compute the sample mean within each block, and then take the median of these b block means. (Without loss of generality, we assume the number of blocks b divides n; b is chosen as in [HS16].)
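For concreteness, here is a minimal NumPy sketch of the two coordinate-wise estimators of Definition 4.1, operating on an (n, d) array of per-sample gradients; the trimming fraction and the number of blocks are left as user-chosen parameters, and the implementations are illustrative rather than tuned to the constants in our analysis.

import numpy as np

def trimmed_mean_estimator(grads, trim_frac):
    # Coordinate-wise trimmed mean: for each coordinate, drop the trim_frac largest
    # and trim_frac smallest values, then average the rest.
    n, d = grads.shape
    m = int(np.floor(trim_frac * n))
    sorted_grads = np.sort(grads, axis=0)
    kept = sorted_grads[m:n - m, :] if m > 0 else sorted_grads
    return kept.mean(axis=0)

def mom_estimator(grads, n_blocks):
    # Coordinate-wise median-of-means: partition the rows into n_blocks blocks,
    # average within each block, and take the coordinate-wise median of the block means.
    n, d = grads.shape
    usable = (n // n_blocks) * n_blocks      # drop the remainder for simplicity
    blocks = grads[:usable].reshape(n_blocks, usable // n_blocks, d)
    block_means = blocks.mean(axis=1)        # shape (n_blocks, d)
    return np.median(block_means, axis=0)

Either function can be passed as grad_estimator to the Robust Hard Thresholding sketch above, e.g. grad_estimator = lambda G: mom_estimator(G, n_blocks=20).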

4.2 Statistical guarantees

In this section, we consider some typical models for general M-estimation.

Model 4.1 (Sparse linear regression).

Samples (x_i, y_i) are drawn from the linear model y_i = ⟨x_i, β*⟩ + ξ_i, with β* being k*-sparse. We assume that the x_i's are i.i.d. with normalized covariance matrix Σ, with Σ_jj ≤ 1 for all j, and that the stochastic noise ξ_i has mean 0 and variance σ².

Model 4.2 (Sparse logistic regression).

Samples (x_i, y_i) are drawn from a binary classification model in which the binary label y_i ∈ {0, 1} follows the conditional probability distribution P(y_i = 1 | x_i) = 1/(1 + exp(−⟨x_i, β*⟩)), with β* being k*-sparse. We assume that the x_i's are i.i.d. with normalized covariance matrix Σ, where Σ_jj ≤ 1 for all j.
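For reference, the per-sample gradients that the robust estimators aggregate in these two models are the standard ones for the squared and logistic losses:

\[
\nabla \ell(\theta; x_i, y_i) = \big(\langle x_i, \theta\rangle - y_i\big)\, x_i
\quad \text{(Model 4.1)},
\qquad
\nabla \ell(\theta; x_i, y_i) = \Big(\frac{1}{1 + e^{-\langle x_i, \theta\rangle}} - y_i\Big)\, x_i
\quad \text{(Model 4.2)}.
\]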

To obtain the following corollaries, we first certify the RDC for the relevant robust gradient estimator over random ensembles with corruption or heavy tails, and then invoke Theorem 3.1. We collect the results on gradient estimation in Appendix A, and the proofs of the corollaries in Appendix B.

Arbitrary corruption case.

Based on Theorem 3.1, we first provide concrete results for the arbitrary corruption case (Definition 2.1), where the covariates and response variables of the authentic distribution are assumed to be sub-Gaussian.

Corollary 4.1.

Suppose we observe ε-corrupted (Definition 2.1) sub-Gaussian samples from the sparse linear regression model (Model 4.1). Under suitable conditions on the sample size n and on the corruption fraction ε (with the allowed ε nearly independent of the dimension d), with probability at least 1 − δ, Algorithm 1 with the trimmed gradient estimator satisfies the RDC with an α(n, δ) for which Theorem 3.1 yields a minimax-optimal bound on ‖θ̂ − β*‖₂.

Time complexity. Corollary 4.1 gives a global linear convergence rate. In each iteration, computing the coordinate-wise trimmed mean takes O(nd log n) operations (sorting each coordinate), so we incur only a logarithmic overhead compared to vanilla gradient descent [Bub15].

Statistical accuracy and robustness. Compared with [CCM13, BDLS17], our statistical error rate is minimax optimal [RWY11, ZWJ14] and has no dependence on the norm of β*. Furthermore, the upper bound on the tolerable corruption fraction ε is nearly independent of the dimension d, which guarantees Algorithm 1's robustness in high dimensions.

Corollary 4.2.

Suppose we observe ε-corrupted (Definition 2.1) sub-Gaussian samples from the sparse logistic regression model (Model 4.2). With probability at least 1 − δ, Algorithm 1 with the trimmed gradient estimator satisfies the RDC, and Theorem 3.1 then provides the corresponding estimation-error bound for ‖θ̂ − β*‖₂.

Statistical accuracy and robustness. Under the sparse Gaussian linear discriminant analysis model (a typical example of Model 4.2), Algorithm 1 achieves the statistical minimax rate [LPR15, LYCR17].

Heavy-tailed distribution case.

We next turn to the heavy tailed distribution case (Definition 2.2).

Corollary 4.3.

Suppose we observe n samples from the sparse linear regression model (Model 4.1) with bounded 4-th moment covariates. Under a suitable condition on the sample size, with probability at least 1 − δ, Algorithm 1 with the MOM gradient estimator satisfies the RDC, and Theorem 3.1 then yields an estimation-error bound for ‖θ̂ − β*‖₂ matching the sub-Gaussian rate.

Time complexity. Similar to Corollary 4.1, Corollary 4.3 gives global linear convergence. In each iteration, computing the coordinate-wise MOM estimate takes O(nd) operations – the same order as vanilla gradient descent [Bub15].

Statistical accuracy. [LM16] uses Median-of-Means tournaments to deal with sparse linear regression under bounded moment assumptions on the covariates, and obtains near optimal rates. We obtain similar rates, but our algorithm is efficient, whereas Median-of-Means tournaments are not known to be computationally tractable. [FWZ16, Zhu17] deal with the same problem by truncating and shrinking the data to certify the RSC condition. Their results require boundedness of higher moments of the noise, and the final error depends on the norm of β*. Our estimation error bounds exactly recover the optimal sub-Gaussian bounds for sparse regression [NRWY12, Wai19], and moreover, we obtain exact recovery when the noise variance σ² = 0.

Corollary 4.4.

Suppose we observe n samples from the sparse logistic regression model (Model 4.2) with bounded 4-th moment covariates. With probability at least 1 − δ, Algorithm 1 with the MOM gradient estimator satisfies the RDC, and Theorem 3.1 then provides the corresponding estimation-error bound for ‖θ̂ − β*‖₂.

4.3 Sparsity recovery and Gaussian graphical model estimation

We next demonstrate the sparsity recovery performance of Algorithm 1 for graphical model learning [MB06, Wai09, RWL10, RWRY11, BvdG11, HTW15]. Our sparsity recovery guarantees hold for both heavy tails and arbitrary corruption, though we only present results in the case of arbitrary corruption in this section.

We use supp_k(θ̂) to denote the set of k indices of θ̂ with the largest magnitude. Let β*_min denote the smallest absolute value of the nonzero elements of β*. To control the false negative rate, Corollary 4.5 shows that under a β_min-condition, supp_{k*}(θ̂) is exactly supp(β*). The proofs are given in Appendix C. The sparsity recovery guarantee for sparse logistic regression is similar and is omitted due to space constraints. Existing results on sparsity recovery for regularized estimators [Wai09, LSRC15] do not require the RSC condition, but instead require an irrepresentability condition, which is stronger. In the uncorrupted case, Corollary 4.5 imposes the same β_min-condition as IHT for sparsity recovery [YLZ18].
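Extracting the estimated support used in Corollary 4.5 amounts to taking the top-k* coordinates of the final iterate; the short sketch below (with illustrative function names) performs this step and checks exact support recovery.

import numpy as np

def top_k_support(theta_hat, k):
    # Indices of the k largest-magnitude entries of theta_hat.
    return set(np.argpartition(np.abs(theta_hat), -k)[-k:])

def support_recovered(theta_hat, beta_star):
    # Exact support recovery: top-k* indices of theta_hat versus supp(beta_star).
    true_support = set(np.flatnonzero(beta_star))
    return top_k_support(theta_hat, len(true_support)) == true_support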

Corollary 4.5.

Under the same conditions as in Corollary 4.1, together with a β_min-condition on the nonzero entries of β*, Algorithm 1 with the trimmed gradient estimator guarantees that supp_{k*}(θ̂) = supp(β*), with probability at least 1 − δ.

We consider sparse precision matrix estimation for Gaussian graphical models. The sparsity pattern of its precision matrix matches the conditional independence relationships [KFB09, WJ08].

Model 4.3 (Sparse precision matrix estimation).

Under the contamination model of Definition 2.1, authentic samples are drawn from a multivariate Gaussian distribution N(0, Σ). We assume that each row of the precision matrix Ω = Σ⁻¹ is k*-sparse – each node has at most k* edges.

For uncorrupted samples drawn from the Gaussian graphical model, the neighborhood selection (NS) algorithm [MB06] solves a convex relaxation of the following sparsity constrained optimization, which regresses each variable against its neighbors:

β^{(j)} = argmin_{‖β‖₀ ≤ k*} E[(X_j − ⟨X_{\j}, β⟩)²],   (2)

where X_j denotes the j-th coordinate of X, and X_{\j} denotes the coordinates indexed by {1, …, d} \ {j}. Let Ω_{\j, j} denote the j-th column of Ω with its diagonal entry removed, and let Ω_{jj} denote the j-th diagonal element of Ω. Then β^{(j)} = −Ω_{\j, j}/Ω_{jj}, so the sparsity pattern of Ω can be estimated through the supports of the β^{(j)}'s. Details on the connection between β^{(j)} and Ω are given in Appendix C.

However, given ε-corrupted samples from the Gaussian graphical model, this procedure will fail [LHY12b, WG17]. To address this issue, we propose Robust NS (Algorithm 2 in Appendix C), which robustifies neighborhood selection [MB06] by using Robust Hard Thresholding (with the least squares loss) to solve a robust version of eq. (2) for each node. Similar to Corollary 4.5, a β_min-condition guarantees consistent edge selection.
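A minimal sketch of the per-node construction in Robust NS follows. It assumes a robust k-sparse regression routine such as the robust_hard_thresholding sketch given after Algorithm 1; the routine name and the "or" symmetrization rule used to combine the two directions of each edge are illustrative choices, and Algorithm 2 in Appendix C gives the precise procedure.

import numpy as np

def robust_neighborhood_selection(X, sparse_regress, k):
    # Estimate the edge set of a Gaussian graphical model from an (n, d) data matrix X.
    # For each node j, robustly regress column j on the remaining columns with sparsity k,
    # and declare an edge whenever either endpoint's regression selects the other variable.
    n, d = X.shape
    edges = set()
    for j in range(d):
        others = [i for i in range(d) if i != j]
        beta_j = sparse_regress(X[:, others], X[:, j], k)  # robust k-sparse regression
        for pos, i in enumerate(others):
            if beta_j[pos] != 0:
                edges.add((min(i, j), max(i, j)))
    return edges

Here sparse_regress can be, for example, a wrapper around robust_hard_thresholding with the trimmed gradient estimator.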

Corollary 4.6.

Under the same conditions as in Corollary 4.1, together with a β_min-condition on the nonzero entries of each β^{(j)}, Robust NS (Algorithm 2) achieves consistent edge selection, with probability at least 1 − δ.

Similar to Corollary 4.1, the tolerable corruption fraction ε is nearly independent of the dimension d, which provides guarantees for Robust NS in high dimensions. Other Gaussian graphical model selection algorithms include GLasso [FHT08] and CLIME [CLL11]. Experimental details comparing robustified versions of these algorithms are presented in Section D.4.

5 Experiments

We provide the complete details for our experiment setup in Appendix D.

Figure 3: In the left plot, the corruption level ε is fixed and we run Algorithm 1 with trimming for different noise levels σ. In the right plot, we consider log-normal samples and run Algorithm 1 with MOM for different sample sizes n, compared with baselines (Lasso on heavy tailed data, and Lasso on sub-Gaussian data).

Figure 6: Graphs estimated from the S&P 500 stock data by Algorithm 2 and by the vanilla NS approach. Variables are colored according to their sectors; in particular, stocks from the Information Technology sector are colored purple.

Sparse regression with arbitrary corruption. We generate samples from the sparse linear regression model (Model 4.1) with a Toeplitz covariance matrix Σ. The stochastic noise is Gaussian, and we vary the noise level σ across simulations. We add an ε-fraction of outliers and track the parameter error ‖θ^t − β*‖₂ at each iteration. The left plot of Figure 3 shows Algorithm 1's linear convergence, with the error curves flattening out at the final statistical error level. Furthermore, Algorithm 1 achieves machine precision when σ = 0, i.e., it exactly recovers β*.
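The exact parameters of this setup are given in Appendix D; the snippet below is a hedged sketch of the kind of corrupted synthetic data used here (Toeplitz covariance, Gaussian noise, and a crude outlier model chosen purely for illustration).

import numpy as np

def make_corrupted_regression(n, d, k, eps, sigma, rho=0.5, outlier_scale=10.0, seed=0):
    # Generate eps-corrupted sparse linear regression data with Toeplitz covariance.
    rng = np.random.default_rng(seed)
    cov = rho ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))  # Toeplitz Sigma
    X = rng.multivariate_normal(np.zeros(d), cov, size=n)
    beta = np.zeros(d)
    beta[rng.choice(d, size=k, replace=False)] = rng.choice([-1.0, 1.0], size=k)
    y = X @ beta + sigma * rng.standard_normal(n)
    n_bad = int(eps * n)                                   # adversary replaces an eps-fraction
    bad = rng.choice(n, size=n_bad, replace=False)
    X[bad] = outlier_scale * rng.standard_normal((n_bad, d))
    y[bad] = outlier_scale * rng.standard_normal(n_bad)
    return X, y, beta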

Sparse regression with heavy tails. We consider a log-normal distribution (a typical example of heavy tails) in Model 4.1: the covariates and the noise are obtained by exponentiating Gaussian random variables, with the underlying Gaussian covariance given by the same Toeplitz matrix as above. We fix the dimension and sparsity and vary the sample size n. For the log-normal samples, we run Algorithm 1 with MOM as well as vanilla Lasso. We then re-generate standard Gaussian samples of the same dimensions and run vanilla Lasso. Each curve in the right plot of Figure 3 is the average of 50 trials. Algorithm 1 with MOM significantly improves over vanilla Lasso on log-normal data, and matches the performance of Lasso on sub-Gaussian data.
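Similarly, a minimal sketch of heavy-tailed (log-normal) data generation, with the specific scales left as illustrative choices rather than the exact settings of Appendix D:

import numpy as np

def make_lognormal_regression(n, d, k, rho=0.5, seed=0):
    # Sparse linear regression with log-normal (heavy-tailed) covariates and noise.
    rng = np.random.default_rng(seed)
    cov = rho ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))  # Toeplitz
    X = np.exp(rng.multivariate_normal(np.zeros(d), cov, size=n))       # log-normal entries
    beta = np.zeros(d)
    beta[rng.choice(d, size=k, replace=False)] = 1.0
    noise = np.exp(rng.standard_normal(n))
    y = X @ beta + (noise - noise.mean())                               # roughly centered noise
    return X, y, beta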

Real data experiments. We next apply Algorithm 2 to a US equities dataset [LHY12a, ZLR12], which is heavy-tailed and has many outliers [dP18]. The dataset contains 1,257 daily closing prices for 452 stocks (variables). It is well known that stocks from the same sector tend to cluster together [Kin66]. We therefore use Robust NS (Algorithm 2) to construct an undirected graph among the stocks. Graphs estimated by the different algorithms are shown in Figure 6. Stocks from the same sector are clustered together, and the cluster centers are easily identified. We also compare Algorithm 2 to the baseline NS approach (designed for the ideal, uncorrupted setting); stocks from Information Technology (colored purple) are much better clustered by Algorithm 2.

References

  • [ACG13] Andreas Alfons, Christophe Croux, and Sarah Gelper. Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics, pages 226–248, 2013.
  • [AMS99] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and system sciences, 58(1):137–147, 1999.
  • [ANW12] Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. Ann. Statist., 40(5):2452–2482, 10 2012.
  • [BD09] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis, 27(3):265–274, 2009.
  • [BDLS17] Sivaraman Balakrishnan, Simon S. Du, Jerry Li, and Aarti Singh. Computationally efficient robust sparse estimation in high dimensions. In Proceedings of the 2017 Conference on Learning Theory, 2017.
  • [BDMS13] Afonso S. Bandeira, Edgar Dobriban, Dustin G. Mixon, and William F. Sawin. Certifying the restricted isometry property is hard. IEEE Transactions on Information Theory, 59(6):3448–3450, 2013.
  • [BJK15] Kush Bhatia, Prateek Jain, and Purushottam Kar. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems, pages 721–729, 2015.
  • [BJK17] Kush Bhatia, Prateek Jain, and Purushottam Kar. Consistent robust regression. In Advances in Neural Information Processing Systems, pages 2107–2116, 2017.
  • [BJL15] Christian T Brownlees, Emilien Joly, and Gabor Lugosi. Empirical risk minimization for heavy-tailed losses. Annals of Statistics, 43(6):2507–2536, 2015.
  • [Box53] George EP Box. Non-normality and tests on variances. Biometrika, 40(3/4):318–335, 1953.
  • [Bub15] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
  • [BvdG11] Peter Bühlmann and Sara van de Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
  • [BWY17] Sivaraman Balakrishnan, Martin J Wainwright, and Bin Yu. Statistical guarantees for the em algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120, 2017.
  • [Cat12] Olivier Catoni. Challenging the empirical mean and empirical variance: a deviation study. In Annales de l’IHP Probabilités et statistiques, volume 48, pages 1148–1185, 2012.
  • [CCM13] Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust sparse regression under adversarial corruption. In International Conference on Machine Learning, pages 774–782, 2013.
  • [CGR18] Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance and scatter matrix estimation under huber’s contamination model. Ann. Statist., 46(5):1932–1960, 10 2018.
  • [CLL11] Tony Cai, Weidong Liu, and Xi Luo. A constrained minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607, 2011.
  • [CLZL] Yuejie Chi, Yuanxin Li, Huishuai Zhang, and Yingbin Liang. Median-truncated gradient descent: A robust and scalable nonconvex approach for signal estimation.
  • [CRT04] Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52:489–509, 2004.
  • [CSX17] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):44, 2017.
  • [DKK16] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 655–664. IEEE, 2016.
  • [DKK18] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018.
  • [Don06] David L Donoho. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.
  • [dP18] M. L. de Prado. Advances in Financial Machine Learning. Wiley, 2018.
  • [FHT08] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
  • [FWZ16] Jianqing Fan, Weichen Wang, and Ziwei Zhu. A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. arXiv preprint arXiv:1603.08315, 2016.
  • [Gao17] Chao Gao. Robust regression via mutivariate regression depth. arXiv preprint arXiv:1702.04656, 2017.
  • [HI17] Matthew J Holland and Kazushi Ikeda. Efficient learning with robust gradient descent. arXiv preprint arXiv:1706.00182, 2017.
  • [Hol18] Matthew J Holland. Robust descent using smoothed multiplicative noise. arXiv preprint arXiv:1810.06207, 2018.
  • [HRRS11] Frank R Hampel, Elvezio M Ronchetti, Peter J Rousseeuw, and Werner A Stahel. Robust statistics: the approach based on influence functions, volume 196. John Wiley & Sons, 2011.
  • [HS16] Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research, 17(1):543–582, 2016.
  • [HTW15] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC press, 2015.
  • [Hub64] Peter J Huber. Robust estimation of a location parameter. The annals of mathematical statistics, pages 73–101, 1964.
  • [Hub11] Peter J Huber. Robust statistics. In International Encyclopedia of Statistical Science, pages 1248–1251. Springer, 2011.
  • [JP78] David Johnson and Franco Preparata. The densest hemisphere problem. Theoretical Computer Science, 6(1):93–107, 1978.
  • [JTK14] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional M-estimation. In Advances in Neural Information Processing Systems, pages 685–693, 2014.
  • [JVV86] Mark R Jerrum, Leslie G Valiant, and Vijay V Vazirani. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169–188, 1986.
  • [KFB09] Daphne Koller, Nir Friedman, and Francis Bach. Probabilistic graphical models: principles and techniques. MIT press, 2009.
  • [Kin66] Benjamin King. Market and industry factors in stock price behavior. the Journal of Business, 39(1):139–190, 1966.
  • [KKM18] Adam Klivans, Pravesh K. Kothari, and Raghu Meka. Efficient Algorithms for Outlier-Robust Regression. arXiv preprint arXiv:1803.03241, 2018.
  • [KP18] Sushrut Karmalkar and Eric Price. Compressed sensing with adversarial sparse noise via l1 regression. arXiv preprint arXiv:1809.08055, 2018.
  • [LB18] Haoyang Liu and Rina Foygel Barber. Between hard and soft thresholding: optimal iterative thresholding algorithms. arXiv preprint arXiv:1804.08841, 2018.
  • [LHY12a] Han Liu, Fang Han, Ming Yuan, John Lafferty, and Larry Wasserman. The nonparanormal skeptic. arXiv preprint arXiv:1206.6488, 2012.
  • [LHY12b] Han Liu, Fang Han, Ming Yuan, John Lafferty, Larry Wasserman, et al. High-dimensional semiparametric gaussian copula graphical models. The Annals of Statistics, 40(4):2293–2326, 2012.
  • [Li13] Xiaodong Li. Compressed sensing and matrix completion with constant proportion of corruptions. Constructive Approximation, 37(1):73–99, 2013.
  • [LL17] Guillaume Lecué and Matthieu Lerasle. Robust machine learning by median-of-means: theory and practice. arXiv preprint arXiv:1711.10306, 2017.
  • [LM16] Gabor Lugosi and Shahar Mendelson. Risk minimization by median-of-means tournaments. arXiv preprint arXiv:1608.00757, 2016.
  • [Loh17] Po-Ling Loh. Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. The Annals of Statistics, 45(2):866–896, 2017.
  • [LPR15] Tianyang Li, Adarsh Prasad, and Pradeep K Ravikumar. Fast classification rates for high-dimensional gaussian generative models. In Advances in Neural Information Processing Systems, pages 1054–1062, 2015.
  • [LRV16] Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 665–674. IEEE, 2016.
  • [LSLC18] Liu Liu, Yanyao Shen, Tianyang Li, and Constantine Caramanis. High dimensional robust sparse regression. arXiv preprint arXiv:1805.11643, 2018.
  • [LSRC15] Yen-Huan Li, Jonathan Scarlett, Pradeep Ravikumar, and Volkan Cevher. Sparsistency of ℓ1-regularized M-estimators. In AISTATS, 2015.
  • [LT18] Po-Ling Loh and Xin Lu Tan. High-dimensional robust precision matrix estimation: Cellwise corruption under ε-contamination. Electronic Journal of Statistics, 12(1):1429–1467, 2018.
  • [LW11] Po-Ling Loh and Martin J Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, pages 2726–2734, 2011.
  • [LYCR17] Tianyang Li, Xinyang Yi, Constantine Caramanis, and Pradeep Ravikumar. Minimax gaussian classification & clustering. In Artificial Intelligence and Statistics, pages 1–9, 2017.
  • [MB06] Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The annals of statistics, 34(3):1436–1462, 2006.
  • [Min15] Stanislav Minsker. Geometric median and robust estimation in banach spaces. Bernoulli, 21(4):2308–2335, 2015.
  • [Min18] Stanislav Minsker. Sub-gaussian estimators of the mean of a random matrix with heavy-tailed entries. The Annals of Statistics, 46(6A):2871–2903, 2018.
  • [NRWY12] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
  • [NY83] Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.
  • [PSBR18] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.
  • [Rou84] Peter J Rousseeuw. Least median of squares regression. Journal of the American statistical association, 79(388):871–880, 1984.
  • [RWL10] Pradeep Ravikumar, Martin J Wainwright, and John D Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.
  • [RWRY11] Pradeep Ravikumar, Martin J Wainwright, Garvesh Raskutti, and Bin Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
  • [RWY10] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated gaussian designs. Journal of Machine Learning Research, 11(Aug):2241–2259, 2010.
  • [RWY11] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.
  • [SCV17] Jacob Steinhardt, Moses Charikar, and Gregory Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. arXiv preprint arXiv:1703.04940, 2017.
  • [SL17] Jie Shen and Ping Li. A tight bound of hard thresholding. The Journal of Machine Learning Research, 18(1):7650–7691, 2017.
  • [SS18] Yanyao Shen and Sujay Sanghavi. Iteratively learning from the best. arXiv preprint arXiv:1810.11874, 2018.
  • [SX18] Lili Su and Jiaming Xu. Securing distributed machine learning in high dimensions. arXiv preprint arXiv:1804.10140, 2018.
  • [Tib96] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
  • [Tuk75] John W Tukey. Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, volume 2, pages 523–531, 1975.
  • [vdV00] Aad van der Vaart. Asymptotic statistics. Cambridge University Press, 2000.
  • [VMX17] Daniel Vainsencher, Shie Mannor, and Huan Xu. Ignoring is a bliss: Learning with large noise through reweighting-minimization. In Conference on Learning Theory, pages 1849–1881, 2017.
  • [Wai09] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
  • [Wai19] Martin Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019.
  • [WG17] Lingxiao Wang and Quanquan Gu. Robust gaussian graphical model estimation with arbitrary corruption. In International Conference on Machine Learning, pages 3617–3626, 2017.
  • [WJ08] Martin Wainwright and Michael Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
  • [YCRB18a] Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5650–5659. PMLR, 10–15 Jul 2018.
  • [YCRB18b] Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Defending against saddle point attack in byzantine-robust distributed learning. arXiv preprint arXiv:1806.05358, 2018.
  • [YL15] Eunho Yang and Aurélie C Lozano. Robust gaussian graphical modeling with the trimmed graphical lasso. In Advances in Neural Information Processing Systems, pages 2602–2610, 2015.
  • [YLA18] Eunho Yang, Aurélie C Lozano, and Aleksandr Aravkin. A general family of trimmed estimators for robust high-dimensional data analysis. Electronic Journal of Statistics, 12(2):3519–3553, 2018.
  • [YLZ18] Xiao-Tong Yuan, Ping Li, and Tong Zhang. Gradient hard thresholding pursuit. Journal of Machine Learning Research, 18(166):1–43, 2018.
  • [Zhu17] Ziwei Zhu. Taming the heavy-tailed features by shrinkage and clipping. arXiv preprint arXiv:1710.09020, 2017.
  • [ZLR12] Tuo Zhao, Han Liu, Kathryn Roeder, John Lafferty, and Larry Wasserman. The huge package for high-dimensional undirected graph estimation in r. Journal of Machine Learning Research, 13(Apr):1059–1062, 2012.
  • [ZWJ14] Yuchen Zhang, Martin J Wainwright, and Michael I Jordan. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. In Conference on Learning Theory, pages 921–948, 2014.

Notation in the appendix.

In our proofs, the exponent in the tail-probability bounds is arbitrary and can be changed to any larger constant without affecting the results. C, c, c₁, c₂, … denote universal constants, and they may change from line to line.

Appendix A Proofs for the gradient estimators

In Robust Hard Thresholding (Algorithm 1), we use either the trimmed gradient estimator or the MOM gradient estimator. In Theorem 3.1, the key quantity controlling the statistical rate of convergence is the α(n, δ) in the Robust Descent Condition (Definition 3.1).

By Hölder's inequality, we have
⟨Ĝ(θ) − ∇R̄(θ), θ − θ*⟩ ≤ ‖Ĝ(θ) − ∇R̄(θ)‖_∞ ‖θ − θ*‖₁.

In this section, we provide one direct route to the upper bound required by the Robust Descent Condition, namely bounding the infinity norm of the error of the robust gradient estimate (Propositions A.1 and A.2).

Later, in Appendix B, we leverage Propositions A.1 and A.2 to verify the Robust Descent Condition for the trimmed/MOM gradient estimators under sparse linear/logistic regression. Together with Theorem 3.1, this completes the proofs of Corollaries 4.1–4.4.

Proposition A.1.

Suppose we observe ε-corrupted sub-Gaussian samples (Definition 2.1). With probability at least 1 − δ, the coordinate-wise trimmed gradient estimator guarantees a bound on ‖Ĝ(θ) − ∇R̄(θ)‖_∞ of order ε plus the sub-Gaussian concentration rate (up to logarithmic factors):

  • for sparse linear regression (Model 4.1), with a scale factor proportional to ‖θ − β*‖₂ + σ;

  • for sparse logistic regression (Model 4.2), with an absolute-constant scale factor.

Proposition A.2.

Suppose we observe samples from the heavy tailed model with bounded 4-th moment covariates. With probability at least 1 − δ, the coordinate-wise Median-of-Means gradient estimator guarantees a bound on ‖Ĝ(θ) − ∇R̄(θ)‖_∞ of order the coordinate-wise standard deviation of the gradient times √(log(d/δ)/n):

  • for sparse linear regression (Model 4.1);

  • for sparse logistic regression (Model 4.2).

A.1 Proofs for the MOM gradient estimator

We first prove Proposition A.2. Proposition A.1, for the trimmed gradient estimator under ε-corrupted sub-Gaussian samples, has the same dependence on n and d; its proof leverages standard concentration bounds for sub-Gaussian samples, and then uses trimming to control the effect of outliers.

Proof of Proposition A.2.

For the squared loss, the gradient of one sample is ∇ℓ(θ; x, y) = (⟨x, θ⟩ − y)x = (⟨x, θ − β*⟩ − ξ)x, where we omit the sample subscript in the proof. We bound the variance of each coordinate of the gradient: for a standard basis vector e_j,

Var(⟨∇ℓ(θ), e_j⟩) ≤ 2 E[⟨x, θ − β*⟩² ⟨x, e_j⟩²] + 2 E[ξ² ⟨x, e_j⟩²] ≲ ‖θ − β*‖_Σ² + σ²,

where (i) the first term is handled by Hölder's (Cauchy–Schwarz) inequality and (ii) the bounded 4-th moment assumption, while the second term uses the independence of ξ and x. Here ‖u‖_Σ² = u^⊤ Σ u.

Hence, by using the coordinate-wise Median-of-Means gradient estimator, for each coordinate j the estimate deviates from the j-th coordinate of the population gradient by at most a constant multiple of the coordinate standard deviation times √(log(1/δ′)/n), with probability at least 1 − δ′; this deviation bound follows from Proposition 5 in [HS16]. Applying a union bound over all d coordinates (taking δ′ = δ/d), we obtain the claimed ℓ∞ bound with probability at least 1 − δ.

For the logistic loss, the gradient of one sample is ∇ℓ(θ; x, y) = (1/(1 + exp(−⟨x, θ⟩)) − y)x, where we again omit the sample subscript.

Since the multiplier 1/(1 + exp(−⟨x, θ⟩)) − y is bounded in [−1, 1] and x has bounded 4-th moments, each gradient coordinate directly has bounded variance. Arguing as in the squared-loss case, we obtain the same form of ℓ∞ bound with probability at least 1 − δ.
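As an illustrative aside (not part of the proof), the following quick numerical check, with arbitrary parameters, compares the tail behavior of the empirical mean and the median-of-means on a one-dimensional heavy-tailed (Pareto) sample; the high-quantile error of MOM is typically far more stable.

import numpy as np

rng = np.random.default_rng(0)
a, n, n_blocks, n_trials = 2.1, 2000, 40, 200
true_mean = a / (a - 1.0)   # mean of a Pareto distribution with shape a and minimum 1

mean_err, mom_err = [], []
for _ in range(n_trials):
    x = rng.pareto(a, size=n) + 1.0                       # heavy-tailed samples
    block_means = x[: (n // n_blocks) * n_blocks].reshape(n_blocks, -1).mean(axis=1)
    mean_err.append(abs(x.mean() - true_mean))
    mom_err.append(abs(np.median(block_means) - true_mean))

print("95th-percentile error, empirical mean:", np.quantile(mean_err, 0.95))
print("95th-percentile error, median-of-means:", np.quantile(mom_err, 0.95))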

A.2 Proofs for the trimmed gradient estimator

We now turn to the trimmed gradient estimator for the case of arbitrary corruption. Before proceeding, we first revisit the definition and tail bounds of sub-exponential random variables, as they will be used for sparse linear regression, where the gradient's distribution is indeed sub-exponential under the sub-Gaussian assumptions of Model 4.1.

We first present standard concentration inequalities ([Wai19]).

Definition A.1 (Sub-exponential random variables).

A random variable X with mean μ is sub-exponential with non-negative parameters (ν, b) if E[exp(λ(X − μ))] ≤ exp(ν² λ² / 2) for all |λ| < 1/b.

Lemma A.1 (Bernstein’s inequality).

Suppose that X₁, …, Xₙ are i.i.d. sub-exponential random variables with parameters (ν, b). Then

P( (1/n) Σ_{i=1}^n (X_i − μ) ≥ t ) ≤ exp( −(n/2) min(t²/ν², t/b) ).

We also have the two-sided tail bound

P( |(1/n) Σ_{i=1}^n (X_i − μ)| ≥ t ) ≤ 2 exp( −(n/2) min(t²/ν², t/b) ).

We define the ε-trimmed mean estimator for one-dimensional samples, and denote it by trim_ε(·).

Definition A.2 (ε-trimmed mean estimator).

Given a set of ε-corrupted samples {x_i}_{i=1}^n ⊂ ℝ, the trimmed mean estimator removes the largest and smallest cε-fraction of elements in {x_i}_{i=1}^n and calculates the mean of the remaining terms. We choose the trimming level cε for a constant c ≥ 1, and require that cε ≤ 1/2 − c′ for some small constant c′ > 0.

Lemma A.2 gives the guarantee for this robust estimator in each coordinate. We note that Lemma A.2 is stronger than the guarantee for the trimmed mean estimator (Lemma 3) in [YCRB18a]: in our contamination model (Definition 2.1), the adversary may delete an ε-fraction of the authentic samples and then add arbitrary outliers, and Lemma A.2 still provides guarantees for the trimmed mean estimator on sub-exponential random variables. In other words, the trimmed mean estimator is robust enough that the adversary is allowed to arbitrarily remove an ε-fraction of the data points. For the j-th coordinate, we write the collection of authentic gradient samples in that coordinate as {⟨∇ℓ(θ; z_i), e_j⟩}_{i=1}^n; the corrupted counterpart is defined in the same way.

Lemma A.2.

Suppose we observe ε-corrupted samples as in Definition 2.1. For each dimension j, assume the authentic samples in the j-th coordinate are i.i.d. (ν, b)-sub-exponential with mean μ_j, and let the observed samples in that coordinate be their ε-corrupted counterparts. Then the trimmed mean estimator on the j-th dimension deviates from μ_j by at most an error that scales with the sub-exponential parameters, the corruption fraction ε, and √(log(1/δ)/n) (up to logarithmic factors), with probability at least 1 − δ.

We defer the proof of Lemma A.2 to the end of this section. We now analyze the trimmed gradient estimator for sparse linear regression and sparse logistic regression using Lemma A.2. For the sparse linear regression model with sub-Gaussian covariates, the distribution of the authentic gradients is sub-exponential rather than sub-Gaussian. More specifically, we first prove that when the current parameter iterate is θ, each coordinate of an authentic gradient is sub-exponential with parameter proportional to ‖θ − β*‖₂ + σ.

To gain some intuition for this, consider the sparse linear equation problem, where σ = 0. When θ = β*, we exactly recover β*, and all stochastic gradients of authentic samples are exactly zero vectors, since all observations are noiseless. It is then clear that the sub-exponential parameter is 0.
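In symbols, the decomposition underlying this intuition (for the squared loss of Model 4.1) is

\[
\nabla\ell(\theta; x, y) \;=\; \big(\langle x, \theta\rangle - y\big)\,x \;=\; \big(\langle x, \theta - \beta^\ast\rangle - \xi\big)\,x ,
\]

which vanishes identically when \(\theta = \beta^\ast\) and \(\sigma = 0\), and whose coordinates inherit a sub-exponential scale proportional to \(\|\theta - \beta^\ast\|_2 + \sigma\).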

Proof of Proposition A.1.

For any θ, the gradient of one sample can be written as

∇ℓ(θ; x, y) = (⟨x, θ⟩ − y)x = (⟨x, θ − β*⟩ − ξ)x,

where we omit the sample subscript in the proof. For any fixed standard basis vector e_j, define

Z := ⟨∇ℓ(θ; x, y), e_j⟩ = (⟨x, θ − β*⟩ − ξ)⟨x, e_j⟩.   (4)

To characterize the tail behavior of Z, we study its moment generating function E[exp(λZ)].

Let ε′ denote a Rademacher random variable independent of x and ξ. Using a standard symmetrization technique [Wai19], together with the power series expansion of the exponential function and the fact that, by independence of ε′, all odd moments of the symmetrized terms have zero mean, it suffices to control the even moments E[Z^{2m}].

By the Cauchy–Schwarz inequality, E[Z^{2m}] ≤ √(E[(⟨x, θ − β*⟩ − ξ)^{4m}]) · √(E[⟨x, e_j⟩^{4m}]). The variable ⟨x, θ − β*⟩ is sub-Gaussian with parameter proportional to ‖θ − β*‖₂; since ξ is sub-Gaussian with parameter σ and independent of x, the weighted sum ⟨x, θ − β*⟩ − ξ is sub-Gaussian with parameter of order ‖θ − β*‖₂ + σ. For any fixed standard basis vector e_j, ⟨x, e_j⟩ is sub-Gaussian with parameter at most a constant, by the normalization Σ_jj ≤ 1 in Model 4.1. By basic moment properties of sub-Gaussian random variables [Wai19], we therefore have

E[Z^{2m}] ≤ (2m)! (C(‖θ − β*‖₂ + σ))^{2m} for all m ≥ 1,   (5)

for a universal constant C, where the factorial bound can be verified by induction on m. Consequently, for |λ| smaller than a constant multiple of 1/(‖θ − β*‖₂ + σ), the series defining E[exp(λ(Z − E[Z]))] converges, with a bound of the form exp(c λ² (‖θ − β*‖₂ + σ)²). Hence,

Z is a sub-exponential random variable. Choosing e_j to range over all coordinates, each coordinate of the gradient is sub-exponential with parameter proportional to ‖θ − β*‖₂ + σ.

Then, applying Lemma A.2 to this collection of ε-corrupted sub-exponential random variables, we obtain, for each coordinate j,

|Ĝ(θ)_j − ∇R̄(θ)_j| ≤ (the bound of Lemma A.2 with sub-exponential parameter proportional to ‖θ − β*‖₂ + σ),   (6)

with probability at least 1 − δ/d.

Applying a union bound over eq. (6) for all d coordinates yields the claimed ℓ∞ bound for sparse linear regression in Proposition A.1.