User-Friendly Covariance Estimation for Heavy-Tailed Distributions: A Survey and Recent Results
We offer a survey of selected recent results on covariance estimation for heavy-tailed distributions. By unifying scattered ideas appeared in the literature, we propose user-friendly methods that facilitate practical implementation. Specifically, we introduce element-wise and spectrum-wise truncation operators, as well as their -estimator counterparts, to robustify the sample covariance matrix. Different from the classical notion of robustness which is typically characterized by the breakdown property, we focus on the tail robustness that is evidenced by the nonasymptotic deviation property of the estimator for data with only finite fourth moments. The key observation is that the robustification parameter needs to adapt to the sample size, dimensionality and moment to achieve optimal tradeoff between bias and robustness. Furthermore, to facilitate their practical use, we propose tuning-free procedures that automatically calibrate the tuning parameters. Applications to a series of structured models in high dimensions, including the bandable covariance estimation, sparse precision matrix estimation, low-rank covariance estimation, and covariance estimation and multiple testing under factor models are revisited. Numerical examples lend strong support to our proposed methodology.
Keywords: Covariance estimation, heavy-tailed data, -estimation, nonasymptotics, tail robustness, truncation.
Covariance matrix has played a significant role in multivariate statistics. The estimation of covariance matrix, therefore, serves as a building block for many important statistical methods, including principal component analysis, discriminant analysis, clustering analysis and regression analysis, among many others. Recently, estimating structured large covariance matrices, such as bandable, sparse and low-rank matrices, has attracted ever-growing attention in statistics and machine learning (Bickel and Levina, 2008a, b; Cai, Ren and Zhou, 2016; Fan, Liao and Liu, 2016). It also finds applications in a wide range of applications, ranging from functional magnetic resonance imaging (fMRI), analysis of gene expression arrays to risk management and portfolio allocation.
Theoretical properties of large covariance matrix estimators in the literature often hinges upon the Gaussian or sub-Gaussian111A random variable is said to have sub-Gaussian tails if there exists constants and such that for all . assumption (Vershynin, 2012). See, for example, Theorem 1 in Bickel and Levina (2008a). Such an assumption is typically very restrictive in practice. For example, a recent fMRI study by Eklund et al. (2016) reports that most of the common software packages, such as SPM and FSL, for fMRI analysis can result in inflated false-positive rates up to 70% under 5% nominal levels, and questioned a number of fMRI studies among 40,000 or so studies according to PubMed. Their result suggested that
The principal cause of the invalid cluster inferences is spatial autocorrelation functions that do not follow the assumed Gaussian shape.
Eklund et al. (2016) plot the the empirical spatial autocorrelation functions, which clearly exhibit heavier tails (Eklund et al., 2016, Page 7902). Violation of the Gaussianity has also been found in genomic studies (Liu et al., 2003; Purdom et al., 2005) and quantitative finance (Fama, 1963; Mandelbrot, 1967). It is therefore imperial to develop robust inferential procedures that are insensitive to restrictive distributional assumptions.
We are interested in obtaining estimators that admit tight nonasymptotic deviation guarantees under weak assumptions, such as existence of moments of only low order. Heavy-tailed distribution is a viable model for data contaminated by outliers that are typically encountered in applications. Due to heavy tailedness, there is a higher chance that some observations are sampled far away from the population mean. We refer to these outlying data points as distributional outliers. A procedure that is robust against such outliers is called a tail-robust procedure, and is evidenced by its better finite-sample performance than a non-robust method in the presence of heavy-tailed distributions. In this paper, by unifying scattered ideas appeared in the literature, we provide a general framework for constructing user-friendly tail-robust covariance estimators. Specifically, we propose element-wise and spectrum-wise truncation operators, as well as their -estimator counterparts, with adaptively chosen robustification parameters. Theoretically, we establish nonasymptotic deviation bounds and demonstrate that the robustification parameters should adapt to the sample size, dimensionality and noise level for optimal tradeoff between bias and robustness. Our goal is to obtain estimators that are computationally efficient and easily implementable in applications. To this end, we propose data-driven schemes to calibrate these robustification parameters, making our proposal computational friendly. Finally, we discuss a series of applications to structured models in high dimensions, including the bandable covariance estimation, sparse precision matrix estimation, low-rank covariance estimation, and covariance estimation, and multiple testing under factor models.
Our definition of robustness is different from the conventional perspective under Huber’s -contamination model (Huber, 1964), where the focus has been on developing robust procedures with high breakdown point. The breakdown point (Hampel, 1971) of an estimator is defined roughly as the proportion of arbitrary outliers the estimator can tolerate before it produces arbitrarily large estimates. Since the seminar work of Tukey (1975), a number of depth-based procedures have been developed; see, for example, Liu (1990), Zuo and Serfling (2000), Mizera (2002) and Salibian-Barrera and Zamar (2002), among others. Another line of work focuses on robust and resistant -estimators, including the least median of squares and least trimmed squares (Rousseeuw, 1984), the S-estimator (Rousseeuw and Yohai, 1984) and the MM-estimator (Yohai, 1987). We refer to Portnoy and He (2000) for a literature review on classical robust statistics, and Chen et al. (2018) for recent development on nonasymptotic analysis under contamination models.
The rest of the paper proceeds as follows. We start with a motivating example in Section 2, which reveals the weakness of sample covariance matrix. In Section 3, we introduce two types of generic robust covariance estimators and establish their deviation bounds under different norms of interest. The finite-sample performance of the proposed estimators, both element-wise and spectrum-wise, depends on a careful tuning of the robustification parameter, which should adapt to the noise level for bias-robustness tradeoff. We also discuss a median-of-means estimator, which is automatically tuning-free at the cost of slightly stronger conditions. For practical implementation, in Section 4 we propose a data-driven scheme to choose the key tuning parameters. Section 5 presents various applications to estimating structured covariance and precision matrices, multiple testing under factor models with heavy-tailed error. Numerical studies are provided in Section 6. We end this paper with a discussion in Section 7.
1.1 Overview of the previous work
In the past several decades, there have been active developments of robust covariance estimators in the absence of normalty. Examples include the Minimum Covariance Determinant (MCD) estimator, the Minimum Volume Ellipsoid (MVE) estimator, Maronna’s (Maronna, 1976), and Tyler’s (Tyler, 1987) -estimators of multivariate scatter matrix. We refer to Hubert, Rousseeuw and Van Aelst (2008) for a comprehensive review. Asymptotic properties of these methods have been established for the family of elliptically symmetric distributions; see, for example, Davies (1992), Butler, Davies and Jhun (1993) and Zhang, Cheng and Singer (2016), among others. However, the above estimators rely on either a parametric assumption or a shape constraint on the sampling distribution. Under a general setting where neither is assumed, robust covariance estimation remains a challenging problem.
The work of Catoni (2012) triggers a growing interest in developing tail-robust estimators, which are characterized by tight nonasymptotic deviation analyses, rather than the mean squared errors. The current state-of-the-art methods for covariance estimation with heavy-tailed data include those of Catoni (2016), Minsker (2018), Minsker and Wei (2018), Avella-Medina et al. (2018), Mendelson and Zhivotovskiy (2018). From a spectrum-wise perspective, Catoni (2016) constructs a robust estimator of the Gram matrix of a random vector or a covariance matrix via estimating the quadratic forms uniformly over the unit sphere in , and obtains error bounds for the operator norm. More recently, Mendelson and Zhivotovskiy (2018) build a different robust covariance estimator that admits dimension-free deviation bounds under the finite kurtosis condition. Both constructions, however, involve brute-force search over every direction in a -dimensional -net, and thus are computationally intractable. From an element-wise perspective, Avella-Medina et al. (2018) combine robust estimates of the first and second moments to obtain variance estimators. In practice, three potential drawbacks of this approach are: (i) the accumulated error consists of those from estimating the first and second moments, which may be significant; (ii) the diagonal variance estimators are not necessarily positive and therefore additional adjustments are required; and (iii) using the cross-validation to calibrate a total number of tuning parameters is computationally expensive.
Building upon the ideas in Minsker (2018) and Avella-Medina et al. (2018), we propose user-friendly tail-robust covariance estimators that enjoy desirable finite-sample deviation bounds under weak moment conditions. The constructed estimators only involve simple truncation techniques and are computationally more friendly. Through a novel data-driven tuning scheme, we are able to efficiently compute these robust estimators for large-scale problems in practice. These two points distinguish our work from others in the literature. The proposed robust estimators serve as building blocks for estimating large structured covariance and precision matrices, and we illustrate their board applicabilities in a series of problems.
We adopt the following notation throughout the paper. For any and a matrix , we define the max-norm , the Frobenius norm and the operator norm
where for , and . In particular, it holds and . Moreover, we write for the spectral norm and use to denote the effective rank of a nonnegative definite matrix , where is the trace of . When is symmetric, we have where are eigenvalues of . For any matrix and an index set , we use to denote the complement of , and to denote the submatrix of with entries indexed by . For a real-valued random variable , let be the kurtosis of , defined as , where and .
2 Motivating example: A challenge of heavy-tailedness
Suppose we observe a sample of independent and identically distributed (i.i.d.) data vectors from a random vector with mean and covariance matrix . To recognize the difficulty of mean and covariance estimation for heavy-tailed distributions, we first provide a lower bound for the deviation of the empirical mean under the -norm in .
For any and , there exists some distribution in with mean and covariance matrix such that the empirical mean of i.i.d. observations from this distribution satisfies, with probability at least , that
The above proposition, which is a multivariate extension of Proposition 6.2 in Catoni (2012), provides a lower bound under the -norm for estimating a mean vector via the empirical mean. Integrating the union bound and Chebyshev’s inequality, we obtain that with probability at least ,
Together, this upper bound and (2.1) show that the worst case deviations of the empirical means are polynomial at best under the -norm in the presence of heavy-tailed distributions. We will see a later a modified estimator can achieve an exponential-type deviation bound, and thus the sample mean is suboptimal.
To better appreciate the above motivation, we perform a toy numerical study on covariance matrix estimation. Let be i.i.d. data vectors drawn from , where ’s are independent from a centered Gamma distribution so that and . We compare the performance of three methods: the sample covariance matrix, the element-wise truncated covariance estimator and the spectrum-wise truncated estimator. The latter two are tail-robust covariance estimators that will be introduced in Sections 3.1 and 3.2 respectively. Their performances are assessed by the max-norm error. We take and let increase from to with a step size of . The results are based on simulations. Figure 1 displays the average (line) and the spread (dots) of estimation errors for each method as the dimension increases. We see that the sample covariance estimator has not only the largest average error but also the highest variability in all the settings. This example demonstrates that the sample covariance matrix suffers from poor finite-sample performance when the data are heavy-tailed.
3 Tail-robust covariance estimation
3.1 Element-wise truncated estimator
We consider the same setting as in the previous section. For mean estimation, we notice that the suboptimality of under -norm is due to the fact that the tail probability of decays only polynomially. A simple yet natural idea for improvement is to truncate the data to eliminate outliers caused by the noise, so that each entry of the resultant estimator has a sub-Gaussian tail. To realize this idea, we introduce the following truncation function/operator, which is closely related to the Huber loss, to be discussed later.
Definition 3.1 (Truncation operator).
Let be a truncation operator given by
where the truncation parameter is also referred to as the robustification parameter that balances bias and robustness.
For illustration purpose, we assume so that . We apply the truncation operator above to each entry of , and then take the average to obtain
where are robustification parameters. When the mean vector is unspecified, a straightforward approach is to first estimate the mean vector using existing robust methods (Minsker, 2015; Lugosi and Mendelson, 2017), and then employ as robust estimates of the second moments. Estimating the first and second moments separately will unavoidably introduce additional tuning parameters, which increases both statistical variability and computational complexity of the method. In what follows, we propose to use the pairwise difference approach, which is free of mean estimation.
Let and define the paired data
which are identically distributed from a random vector with mean 0 and covariance matrix . It is easy to check that the sample covariance matrix, with , can be expressed using paired data as
Following the same argument from the last section, we apply the truncation operator to entry-wise, and then take the average to obtain
Collecting these marginal estimators, we define the following element-wise truncated covariance matrix estimator
where is a symmetric parameter matrix. can be viewed as a truncated version of the sample covariance matrix . We assume and define , the largest integer no greater than . Moreover, let be a symmetric matrix with
For any , the truncated estimator given in (3.3) with
Theorem 3.1 indicates that, with properly tuned parameter matrix , the resulting covariance matrix estimator achieves element-wise tail robustness against heavy-tailed distributions: provided the fourth moments are bounded, each entry of concentrates tightly around its mean so that the maximum error scales as . Element-wise, we are able to accurately estimate at high confidence levels even when the dimension grows with the sample size exponentially fast.
3.2 Spectrum-wise truncated estimator
In this section, we propose and study a tail-robust covariance estimator that is robust in the spectrum domain. Following the general robustification technique developed in Minsker (2018), we directly apply the truncation operator to matrices in their spectrum domain. To this end, we need the following definition of matrix functional.
Definition 3.2 (Matrix functional).
Given a real-valued function defined on and a symmetric with eigenvalue decomposition such that , is defined as , where .
Following the same rational as in the previous section, we propose a spectrum-wise truncated covariance estimator based on the pairwise difference approach:
This alternative expression renders the computation almost effortless. The following theorem provides an exponential-type concentration inequality for in operator norm, which is a useful complement to (Minsker, 2018, Remark 8). In a similar spirit to Theorem 3.1, our next theorem shows that achieves an exponential-type concentration bound under operator norm for heavy-tailed data with finite operator-wise fourth moment, that is,
is finite. We refer to Minsker and Wei (2018) for a general treatment of robust modifications of -statistics.
For any , the estimator with
satisfies that with probability at least ,
To better understand the above result, note that can be written as
which is well-defined if the fourth moments are finite. Define the maximum kurtosis of as . Then, it follows that
The following result is a direct consequence of Theorem 3.2: admits exponential-type concentration for data with finite kurtosises.
Assume that is finite. Then, for any , the estimator as in Theorem 3.2 satisfies
with probability at least .
Theoretically, the current state-of-the-art estimator proposed by Mendelson and Zhivotovskiy (2018) achieves a more refined deviation bound with in (B.9) improved to ; in particular, the deviations are controlled by the spectral norm instead of the (possibly much larger) trace . Estimators admitting such deviations guarantees are often called “sub-Gaussian” as they achieve performance similar to the sample covariance obtained from data with multivariate normal distribution. Unfortunately, the aforementioned estimator is computationally intractable. In a recent work, Hopkins (2018) proposed a multivariate mean estimator that achieves theoretically optimal performance in the -norm, and is computationally tractable at the same time. It remains an open question to design a polynomial-time algorithm capable of computing the near-optimal estimator efficiently.
3.3 An -estimation viewpoint
In this section, we discuss alternative tail-robust covariance estimators from an -estimation perspective and study both the element-wise and spectrum-wise truncated estimators. The connection with truncated covariance estimators is discussed at the end of this section. To proceed, we revisit the definition of Huber loss.
Definition 3.3 (Huber loss).
Compared with the squared error loss, large values of are down-weighted in the Huber loss, by which it introduces robustness. Generally, the minimizer is biased from the model parameters, and letting grow reduces the bias. In other words, the robustification parameter quantifies the tradeoff between bias and robustness. As observed in Sun, Zhou and Fan (2017), in order to achieve optimal tradeoff between bias and robustness, should adapt to the sample size, dimension and noise level.
Starting with the element-wise method, we define marginal estimators
where are robustification parameters satisfying . When , even though the minimization is over , it turns out the solution is still positive almost surely and therefore provides a reasonable estimator of . To see this, for each , define and note that for any and ,
This implies that , which is strictly positive as long as there are no tied observations. Again, collecting these marginal estimators, we obtain a Huber-type -estimator
where . Our main result in this section indicates that achieves fast concentration under the max-norm for data with finite fourth moments. A version of Theorem 3.3 follows from Theorem 3.1 in Minsker and Wei (2018) by setting , although the univariate case requires slightly weaker assumptions and is of independent interest.
Let be a symmetric matrix satisfying
For any , the covariance estimator given in (3.13) with
as long as .
The -estimator counterpart of the spectrum truncated covariance estimator is first proposed by Minsker (2018) using a different robust loss function. Later, Minsker and Wei (2018) investigate robust modifications of empirical risk minimization of U-statistics. In line with the previous element-wise -estimator, we restrict our attention to the Huber loss and consider
which is a natural robust variant of the sample covariance matrix
Define the matrix that satisfies
The following result is modifed from Corollary 4.1 in Minsker and Wei (2018).
Assume that there exists some such that . Then for any and , the -estimator with satisfies
with probability at least as long as , where are absolute constants.
where is given in (3.1). From this point of view, the truncated estimator given in (3.6) can be viewed as the first step of the gradient descent iteration for solving optimization problem (3.17) initiated at . This procedure enjoys a nice contraction property, as demonstrated by Lemma 3.2 in Minsker and Wei (2018). However, since the difference matrix for each is no longer rank-one, we need to perform a singular value decomposition to compute the matrix functional for .
We close this section with a discussion on the connection and difference between truncated estimators and -estimators. Both types of estimators achieve tail robustness through bias-robustness tradeoff, either element-wise or spectrum-wise. The difference lies in the quantity where the truncation is applied to. Intuitively, -estimators truncate symmetrically around the true expectation as shown in (3.12) and (3.17), while the truncation estimators simply truncate symmetrically around zero as in (3.3) and (3.6). Due to smaller finite-sample bias, -estimators are expected to outperform the simple truncated estimators. However, since the optimal choice of the robustification parameter is often much larger than the population moments in magnitude either element-wise or spectrum-wise, the difference between truncation estimators and -estimators becomes insignificant when is large. Therefore, we advocate using the simple truncated estimator primarily due to its simplicity and computational efficiency.
3.4 Median-of-means estimator
Truncation-based approaches described in the previous sections require knowledge of robustification parameters . Adaptation and tuning of these parameters will be discussed in Section 4 below. Here, we suggest another method that does not need any tuning but requires stronger assumptions, namely, existence of moments of order . This method is based on the median-of-means (MOM) technique (Nemirovsky and Yudin, 1983; Devroye et al., 2016; Minsker and Strawn, 2017). To this end, assume that the index set is partitioned into disjoint groups (partitioning scheme is assumed to be independent of ) such that the cardinalities satisfy for . For each , let and
be the sample covariance evaluated over the data in group . Then, for all , the MOM estimator of is defined via
where is the entry in the -th row and -th column of . This leads to
Let for . The following result provides a deviation bound for the MOM estimator under the max-norm.
Assume that and that for all . Then there exists depending only on the fourth and sixth central moments of the entries of such that
for all such that .
The only user-defined parameter in the definition of is the number of subgroups . The bound above shows that, provided , the term in (3.19) is of smaller order, and we obtain an estimator that admits tight deviation bounds for a wide range of . In this sense, estimator is essentially tuning-free (say, one could set .
Application of the MOM construction to large covariance estimation problems has been explored by Avella-Medina et al. (2018). However, results obtained in that work are insufficient to conclude that MOM estimators are “tuning-free”. In particular, in (Avella-Medina et al., 2018) authors considered weaker assumptions (fourth moments instead of sixth) and obtained weaker results that (i) are not uniform in the confidence parameter , hence the estimator has to be recomputed for each value of this parameter and (ii) result in error bounds that are controlled by the fourth moments instead of .
4 Automatic tuning of robustification parameters
For all the proposed tail-robust estimators besides the median-of-means, the robustification parameter needs to adapt properly to the sample size, dimension and noise level in order to achieve optimal tradeoff between bias and robustness in finite samples. An intuitive yet computationally expensive idea is to use cross-validation. Another approach is based on Lepski’s method (Lepski and Spokoiny, 1997); this approach yields adaptive estimators with provable guarantees (Minsker, 2018; Minsker and Wei, 2018), however, it is also not computationally efficient. In this section, we propose tuning-free approaches for constructing both truncated and -estimators that have low computational costs. Our nonasymptotic analysis provides useful guidance on the proposed choice of key tuning parameters.
4.1 Adaptive truncated estimator
In this section we introduce a data-driven procedure that automatically tunes the robustification parameters in the element-wise truncated covariance estimator. Practically, each individual parameter can be tuned by cross-validation from a grid search. This, however, will incur extensive computational cost even when the dimension is moderately large.
Instead, we propose a data-driven procedure that automatically calibrates the parameters. This procedure is motivated by the theoretical properties established in Theorem 3.1. For presentation simplicity, we fix and define such that . Then can be written as . In view of (3.4), an “ideal” choice of is
where is prespecified to control the confidence level and will be discussed later. A naive estimator of is the empirical second moment , which tends to overestimate the true value when data are leptokurtic. Intuitively, a well-chosen makes a good estimator of , and meanwhile, we expect the empirical truncated second moment to be a reasonable estimate of as well. Plugging this empirical truncated second moment into (4.1) yields
We then solve the above equation to obtain , a data-drive choice of . By Proposition 3 in Wang et al. (2018), equation (4.2) has a unique solution as long as . We characterize the theoretical properties of this tuning method in a companion paper (Wang et al., 2018).
For the choice of , on one hand, because it controls the confidence level according to (3.5), we should let be sufficiently large so that the estimator is concentrated around the true value with high probability. On the other hand, also appears in the deviation bound that corresponds to the width of the confidence interval, it should not grow too fast as a function of . In practice, we recommend using (or equivalently, ), a typical slowly varying function of .
To implement the spectrum-wise truncated covariance estimator in practice, note that there is only one tuning parameter whose theoretically optimal scale is
Motivated by the data-driven tuning scheme for the element-wise estimator, we choose by solving the equation
where as before we take .
4.2 Adaptive Huber-type -estimator
To construct a data-driven estimator for the adaptive Huber estimator, we follow the same rationale from the previous subsection. Note that the optimal now depends on instead of the second moment , it is therefore conservative to directly apply the above data-driven method in this case. Instead, we propose to calibrate and estimate simultaneously by solving the following system of equations
for and . Via a similar argument, it can be shown that the equation has a unique solution as long as ; for any , the equation also has a unique solution. Starting with an initial estimate , which is the sample variance estimator of , we iteratively solve and for until convergence. The resultant solution, denoted by with slight abuse of notation, is then referred to as the adaptive Huber estimator of . We then obtain the data-adaptive Huber covariance matrix estimator as . See Algorithm 1 for the pseudo-algorithm of this data-driven procedure.
5 Applications to structured covariance estimation
The robustness properties of the element-wise and spectrum-wise truncation estimators are demonstrated in Theorems 3.1 and 3.2. In particular, the exponential-type concentration bounds are essential for obtaining reasonable estimators of high-dimensional structured covariance and concentration matrix. In this section, we apply the proposed generic robust methods to the estimation of bandable covariance matrix, sparse precision matrix, low-rank covariance matrix, covariance matrix and multiple testing under factor models.
5.1 Bandable covariance matrix estimation
Motivated by applications such as climate studies and spectroscopy, in which there is a natural order/metric on the index set of variables , one can expect that a large distance implies near independence. We characterize this feature by the following class of bandable covariance matrices considered in Bickel and Levina (2008a) and Cai, Zhang and Zhou (2010):
Here are regarded as universal constants and the parameter specifies the decay rate of to zero as for each row.
Under sub-Gaussian conditions, Cai, Zhang and Zhou (2010) establish minimax-optimal estimation over under the spectral norm loss. More specifically, they proposed a tapering estimator based on the pivotal sample covariance matrix as follows: , where the positive integer specifies the bandwidth, , , whenever , , , and denotes the sample covariance. When has sub-Gaussian tails, it is shown in Cai, Zhang and Zhou (2010) that with an optimal choice of bandwidth , achieves the minimax convergence rate under the spectral norm that is of order .
When demonstrates general heavy tails with finite fourth moments, it is still unknown whether the (nearly) same optimal rate can be achieved over unless additional distributional assumptions are imposed, such as elliptical symmetry (Mitra and Zhang, 2014; Chen et al., 2018). To close this gap, we propose a robust covariance estimator based on spectrum-wise truncated covariance estimator introduced in Section 3.2. We first define a few notations and one matrix operator to facilitate the illustration of our procedure. Let be a subvector of given in (3.2), where and on the superscript indicate the location and dimension, respectively. Then the spectrum-wise truncated estimator of the corresponding principal submatrix of is defined as
In addition, we define an expanding operator which puts a small matrix onto a large zero matrix: for an matrix , define the matrix , where indicates the location and
Our robust covariance estimator is then defined as