Robust Covariance Estimation for Approximate Factor Models
Abstract
In this paper, we study robust covariance estimation under the approximate factor model with observed factors. We propose a novel framework to first estimate the initial joint covariance matrix of the observed data and the factors, and then use it to recover the covariance matrix of the observed data. We prove that once the initial matrix estimator is good enough to maintain the elementwise optimal rate, the whole procedure will generate an estimated covariance with desired properties. For data with only bounded fourth moments, we propose to use Huber loss minimization to give the initial joint covariance estimation. This approach is applicable to a much wider range of distributions, including subGaussian and elliptical distributions. We also present an asymptotic result for Huber’s Mestimator with a diverging parameter. The conclusions are demonstrated by extensive simulations and real data analysis.
Keywords: Robust covariance matrix, Approximate factor model, Mestimator.
1 Introduction
The problem of estimating a covariance matrix and its inverse has been fundamental in many areas of statistics, including principal component analysis (PCA), linear discriminative analysis for classification, and undirected graphical models, just to name a few. The intense research in high dimensional statistics has contributed a stream of papers related to covariance matrix estimation, including sparse principal component analysis (Johnstone and Lu, 2009; Amini and Wainwright, 2008; Vu and Lei, 2012; Birnbaum et al., 2013; Berthet and Rigollet, 2013; Ma, 2013; Cai et al., 2013), sparse covariance estimation (Bickel and Levina, 2008; Cai and Liu, 2011; Cai et al., 2010; Lam and Fan, 2009; Ravikumar et al., 2011) and factor model analysis (Stock and Watson, 2002; Bai, 2003; Fan et al., 2008, 2013, 2014; Onatski, 2012). A strong interest in precision matrix estimation (undirected graphical model) has also emerged in the statistics community following the pioneering works in Meinshausen and Bühlmann (2006) and Friedman et al. (2008). In the application aspect, many areas such as portfolio allocation (Fan et al., 2008), have benefited from this continuing research.
In the high dimensional setting, the number of variables is comparable or greater than the sample size . This dimensionality poses a challenge to the estimation of covariance matrices. It has been shown in Johnstone and Lu (2009) that the empirical covariance matrix behaves poorly, and sparsity of leading eigenvectors is assumed to circumvent this issue. Following this work, a flourishing literature on sparse PCA has developed indepth analysis and refined algorithms; see Vu and Lei (2012); Berthet and Rigollet (2013); Ma (2013). Taking a different route, Bickel and Levina (2008) advocated thresholding as a regularization approach to estimate a sparse matrix, in the sense that most entries of the matrix are close to zero.
Another challenge in highdimensional statistics is that the measurements can not have light tails, as large scale data are often obtained by using (bio)imaging technologies that have a limited precision. Moreover, it is well known that financial returns exhibit heavy tails. These invalidate the fundamental assumptions in highdimensional statistics that data have subGaussian or subexponential tails, popularly imposed in most of the aforementioned papers. Significant relax of the above assumption requires some new ideas and forms the subject of this paper.
Recently, motivated by FamaFrench model (Fama and French, 1993) from financial econometrics, Fan et al. (2008) and Fan et al. (2013) considered the covariance structure of the static approximate factor model, which models the covariance matrix by a lowrank signal matrix and a sparse noise matrix. The same model will also be the focus of this paper. The model assumes existence of several lowdimensional factors that drives a large panel data , that is
(1.1) 
where ’s are the common factors and ’s are their corresponding factor loadings. The noises ’s, known as the idiosyncratic component, are uncorrelated with the factors . Here is relatively small compared with and . We will treat as fixed independent of and throughout this paper. When the factors are known, this model subsumes the wellknown CAPM model (Sharpe, 1964; Lintner, 1965) and FamaFrench model (Fama and French, 1993). When is unobserved, the model tries to recover the underlying factors for the movements of the whole panel data. Here the approximate factor model means that the covariance of is sparse, including the strict factor model in which is diagonal as a special case. In addition, “static” is on the contrary of the dynamic model which takes into account the time lag and allows more general infinite dimensional representations (Forni et al., 2000; Forni and Lippi, 2001).
The covariance matrix of the outcome from model (1.1) can be written as
(1.2) 
where consisting of in each row is the loading matrix, is the covariance of and is the sparse covariance matrix for . Here we assume the process of is stationary so that do not depend on time. When factors are unknown, Fan et al. (2013) proposed applying PCA to obtain an estimate of the low rank part and sparse part . The crucial assumption is that the factors are pervasive, meaning that the factors have nonnegligible effects on a large amount of dimensions of the outcomes. Fan and Wang (2015) gives more explanation from random matrix theories and aims to relax the pervasiveness assumption in applications such as risk management and estimation of the false discovery proportion. See Onatski (2012) for more discussions on strong and semistrong factors.
In this paper, we consider estimating simply with known factors. The main focus of the paper is on robustness instead of factor recovery. Under exponential tails of the factors and noises, Fan et al. (2011) proposed the idea of performing thresholding on the estimate of , obtained from the sample covariance of the residuals of multiple regression (1.1). The legitimacy of this approach hinges on the assumption that the tails of the factor and error distributions are exponential decay, which is likely to be violated in practice, especially in the financial applications. Thus, the need to extend the applicability of this approach beyond wellbehaved noise has driven further research such as Fan et al. (2015), in which they assume that has an elliptical distribution (Fang et al., 1990).
This paper studies the model (1.1) under a much more relaxed condition, that the random variables and only have finite fourth moments. The main observation that motivates our method is that, the joint covariance matrix of supplies sufficient information to estimate and . To estimate the joint covariance matrix in a robust way, the classical idea that dates back to Huber (1964) proves to be vital and effective. The novelty here is that we let the parameter diverges in order to control the bias in highdimensional applications. The Huber loss function with a diverging parameter, together with other similar functions, has been shown to produce concentration bounds for Mestimators, when the random variables have fat tails; see for example Catoni (2012) and Fan et al. (2016). This point will be clarified in Sections 2 and 3. The Mestimators considered here have additional merits in asymptotic analysis, which is studied in Section 3.3.
This paper can be placed in the broader context of low rank plus sparse representation. In the past few years, robust principal component analysis has received much attention among statisticians, applied mathematicians and computer scientists. Their focus is on identifying the low rank component and sparse component from a corrupted matrix (Chandrasekaran et al., 2011; Candès et al., 2011; Xu et al., 2010). However, the matrices considered therein do not come from random samples, and as a result, neither estimation nor inference are involved. Agarwal et al. (2012) does consider the noisy decomposition, but still, it focuses more on identifying and separating the low rank part and sparse part. In spite of connections with the robust PCA literature, such as the incoherence condition (see Section 2), this paper and its predecessors are more engaged in disentangling “true signal” from noise, in order to improve estimation of covariance matrices. In this respect, they bear more similarity with the literature of covariance matrix estimation.
We make a few notational definitions before presenting the main results. For a general matrix , the max norm of , or the entrywise maximum absolute value, is denoted as . The operator norm of is whereas the Frobenius norm is . If is furthermore symmetric, we denote as the largest eigenvalue, as the largest one, and as the smallest one. In the paper, is a generic constant that may differ from line to line in the assumptions and also derivation of our theories.
The paper is organized as follows. In Section 2, we present the procedure for robust covariance estimation when only finite fourth moment is assumed for both factors and noises without specific distribution family assumption. The theoretical justification will be provided in Section 3. Simulations will be carried out in Section 4 to demonstrate the effectiveness of the proposed procedure. We also conduct real data analysis on portfolio risk of S&P stocks via FamaFrench model in Section 5. Technical proofs will be delayed to the appendix.
2 Robust covariance estimation
Consider the factor model (1.1) again with observed factors. It can be written in the vector form as
(2.1) 
where , are the factors for , is the unknown loading matrix and is uncorrelated with the factors. We assume that have zero mean and independent for . A motivating example from economic and financial studies is the classical FamaFrench model, where ’s represent excess returns of stocks in the market and ’s are interpreted as common factors driving the market. It is more natural to allow for weak temporal dependence such as mixing as in the work of Fan et al. (2014). Though possible, we assume independence in this paper for the sake of simplicity of analysis.
2.1 Assumptions
We now state the main assumptions of the model. Let be the covariance of , and the covariance of . A covariance decomposition shows that , the covariance of , comprises two parts,
(2.2) 
We assume that is sparse and the sparsity level is measured through
(2.3) 
If , is defined to be , i.e. the exact sparsity. An intuitive justification of the sparsity measurement stems from modeling of the covariance structure: after taking out the common factors, the rest only has weak crosssectional dependence. In addition, we assume that , as well as , is bounded away from and . In the case of degenerate , we can always consider rescaling the factors and reduce the number of observed factors to meet the requirement of nonvanishing minimum eigenvalue of . This leads to our first assumption.
Assumption 2.1.
There exists a constant such that and , where is a matrix with being a constant.
Furthermore, it is observed by Stock and Watson (2002) that the factors are pervasive in the sense that the low rank part of (2.2) is the dominant component of ; more specifically, the top eigenvalues grow linearly as . This motivates the following assumption:
Assumption 2.2.
(i) There exists a constant such that .
(ii) The elements of are uniformly bounded by a constant .
Note first assumption (ii) implies that . So together with (i), the above assumption requires leading eigenvalues to grow with an order of . This assumption is satisfied by the approximate factor model, since by Weyl’s inequality, if the main term is bounded from below. Furthermore, if we assume that each row of is iid from the same distribution with a finite second moment, it is not hard to see satisfies such a condition. Consequently, it is natural to assume is lower bounded for .
Assumption (ii) is related to the matrix incoherence condition. In fact, when grows linearly with , the condition that is bounded is equivalent to the incoherence of eigenvectors of being bounded, which is standard in the matrix completion literature (Candès and Recht, 2009) and the robust PCA literature (Chandrasekaran et al., 2011).
We now consider the moment assumption of random variables in model (1.1).
Assumption 2.3.
is iid with mean zero and bounded fourth moments. That is, there exists a constant such that and .
The independence assumption can be relaxed to mixing conditions, but we do not pursue this direction in the current paper. We are going to establish our results based on the general distribution family with only bounded fourth moment in the above assumption.
2.2 Robust estimation procedure
The basic idea we propose is to estimate the covariance matrix of the joint vector instead of just that of , although it is our target. The covariance of the concatenated dimensional vector contains all the information we need to recover the lowranks and sparse structure. Observe that the covariance matrix can be expressed as
Any method which yields an estimate of as an initial estimator or estimates of could be used to infer the unknown and . Specifically, using the estimator , we can readily obtain an estimator of through the identity
Subsequently, we can subtract the estimator of from to obtain . With the sparsity structure of assumed in Section 2.1, the wellstudied thresholding (Bickel and Levina, 2008; Rothman et al., 2009; Cai and Liu, 2011) can be employed. Applying thresholding to , we obtain a thresholded matrix with guaranteed error in terms of max norm and operator norm. The final step is to add up with the estimator of from to produce the final for .
Due to the fact that we only have bounded fourth moments for factors and errors, a straightforward idea to estimate the covariance matrix is through robust methodology. For the sake of simplicity, we assume has zero mean, so the covariance matrix of takes the form . We shall use the Mestimator proposed in Catoni (2012) and Fan et al. (2016), where the authors proved the concentration property in the estimation of population mean of a random variable with only a finite second moment. In essence, minimizing a suitable loss function, say Huber loss, yields an estimator of the population mean with deviation of order . The Huber loss reads
(2.4) 
Choosing , where is an upper bound of the standard deviation of the random variable of interest, Fan et al. (2016) showed that the minimizer satisfies
(2.5) 
when where . This finite sample result holds for any distributions with bounded second moments, including asymmetric distributions generated by . The diverging parameter is chosen to reduce the biases of the estimator for asymmetric distributions and hence we require a finite second moment. In our covariance matrix estimation, we will take to be the square of a random variable or products of two random variables. When applying this method to estimate elementwise, we expect to achieve elementwise errors of , where the logarithmic term is incurred when we bound the errors uniformly. The formal result will be given in Section 3.
In an earlier work, Catoni (2012) proposed solving the equation where the strictly increasing satisfies . For and , Catoni (2012) proved that
when and where is an upper bound of the standard deviation. This estimator can also be used for covariance estimation, though it usually has a larger bias as shown in Fan et al. (2016).
The whole procedure can be presented in the following steps:

For each entry of the covariance matrix , obtain a robust estimator by solving a convex minimization problem (through, for example, NewtonRapson method):
where is chosen as discussed above and

Produce the final estimator for :
Note in the above steps, the choice of the parameters (in the definition of ) and are not yet specified and will be discussed in Section 3.
Before delving into the analysis of the procedure, we first deviate to look at a technical issue. Recall that is an estimator of , by Weyl’s inequality,
Since both matrices are of low dimensionality, as long as we are able to estimate every entry of accurate enough (see Lemma 3.1 below), vanishes with high probability as diverges. Since is invertible with high probability, there is no major issue implementing the procedure. In cases where positive semidefinite (psd) matrix is expected, we replace the matrix with its nearest positive semidefinite version. We can do this projection for either or . For example, for , we solve the following optimization problem:
(2.6) 
and simply employ as a surrogate of . Observe that
Thus, apart from a different constant, inherits all the desired properties of , and we are able to safely replace with without modifying our estimation procedure. Moreover, (2.6) can be cast into the semidefinite programming problem below,
(2.7) 
which can be easily solved by a semidefinite programming solver, e.g. Grant et al. (2008).
3 Theoretical analysis
In this section, we will show the theoretical properties of our robust estimator under bounded fourth moments. We will also show that when the data are known to be generated from more restricted families (e.g. subGaussian), commonly used estimators such as sample covariance estimator suffices as an initial estimator in Step .
3.1 General theoretical properties
From the above discussion on Mestimators and their concentration results, it is immediate to have the following lemma.
Lemma 3.1.
Suppose that a dimensional random vector is centered and has finite fourth moments, i.e. , for . Letting and be Huber’s estimator with parameter , then there exists a universal constant such that for any and , with probability ,
(3.1) 
where is a predetermined parameter satisfying .
In practice, we do not know any of the fourth moments in advance. To pick up a good , one possibility is to try a sequence of geometrically increasing , as studied in Catoni (2012). Similar to Fan et al. (2015), we may also use empirical variance to give a rough bound of .
Recall that is a dimensional vector concatenating and . From Assumption 2.3, there is a constant as a uniform bound for . This leads to the following result.
Corollary 3.1.
Suppose that is an estimator of covariance matrix , whose entries are Huber’s estimators with parameter . Then there exists a universal constant such that for any and , with probability ,
(3.2) 
where is a predetermined parameter satisfying .
So after Step 1 of the proposed procedure, we obtain an estimator that achieves optimal rate of elementwise convergence. With , we proceed to establish results of estimation errors of our concern. We will establish convergence rates for both and . The key theorem that links the estimation error under elementwise max norm with that under other norms is stated as follows.
Theorem 3.1.
Note that this theorem provides a nice interface that connects maxnorm guarantee with the desired convergence rate . Therefore, any robust method that attains the elementwise optimal convergence rate as in Corollary 3.1 can be used in Step 1 instead of the current Mestimator approach.
3.2 Estimators under more restricted distributional assumptions
We analyzed theoretical properties of the robust procedure in the previous subsection under the assumption of only bounded fourth moments. Theorem 3.1 shows that any estimator that achieves the optimal max norm convergence rate could serve as an initial pilot estimator for to be used in Step 2 and Step 3 of our procedure. Thus the procedure depends on the distribution assumption 2.3 only through Step 1 where a proper estimator is proposed. Sometimes, we do have more information on the shape of the distributions of factors and noises. For example, if the distribution of has a subGaussian tail, the sample covariance matrix attains the optimal elementwise maximal rate for estimating .
In an earlier work, Fan et al. (2011) proposed to simply regress observations on in order to obtain
(3.8) 
where and . Then they threshold the matrix where and . This regression procedure is equivalent to applying directly in Step 1 and also equivalent to solving a leastsquare minimization problem, and thus suffers from robustness issue when the data come from heavytailed distributions. All the convergence rates achieved in Theorem 3.1 are identical with Fan et al. (2011) where subGaussian tails are assumed.
As we explained, if is subGaussian distributed, instead of can be used. If and exhibit heavy tails, another widely used assumption is tdistribution, which is included in the elliptical distribution family. The elliptical distribution is defined as follows. Let and with . A dimensional random vector has an elliptical distribution, denoted by , if it has a representation (Fang et al., 1990)
(3.9) 
where is a uniform random vector on the unit sphere in , is a scalar random variable independent of , is a deterministic matrix satisfying . To make the representation (3.9) identifiable, we require so that . Here we also assume continuous elliptical distributions with .
If and are uncorrelated and jointly elliptical, i.e., , then a well known good estimator for the correlation matrix of is marginal Kendall’s tau. Kendall’s tau correlation coefficient is defined as
(3.10) 
whose population counterpart is
(3.11) 
For the elliptical family, the key identity relates Pearson correlation with Kendall’s correlation (Fang et al., 1990). Using , Han and Liu (2014) showed that is an accurate estimate of , achieving . Let where is the correlation matrix and is a diagonal matrix consisting of standard deviations for each dimension. We construct by separately estimating and . As before, if fourth moment exists, we estimate by only considering in Step 1.
Therefore, if is elliptically distributed, can be used as the initial pilot estimator for in Step 1. Note that is much more computationally efficient than . However, for general heavytailed distributions, there is no simple way to connect the usual correlation with Kendall’s correlation. Thus we should favor instead. We will compare the three estimators , and throughly through simulations in Section 4.
3.3 Asymptotics of robust mean estimators
In this section we look further into robust mean estimators. Though the result we shall present is asymptotic and not essential for our main theorem 3.1, it is interesting in its own right and deserves some treatment.
Perhaps the best known result of Huber’s mean estimator is the asymptotic minimax theory. In Huber (1964), Huber considered the socalled contamination model:
where is a known distribution, is fixed and is the family of symmetric distributions. Let be the minimizer of , where for , and for , where is fixed. In the special case where is Gaussian, Huber’s result shows that with appropriate choice of , Huber’s estimator minimizes the maximal asymptotic variance among all translation invariant estimators, the maximum being taken over .
One problem with contamination model is that it makes sense only when we assume symmetry of , if is the quantity we are interested in. In contrast, Catoni (2012) and Fan et al. (2016) studied a different family, in which distributions have finite second moments. Bickel (1976) called them ‘local’ and ‘global’ models respectively, and offered a detailed discussion.
This paper, along with the preceding two papers (Catoni, 2012; Fan et al., 2016), studies robustness in the sense of the second model. The technical novelty primarily lies in the nice concentration property, which is a powerful tool in high dimensional statistics. This requires the parameter of to grow with , versus being kept fixed, such that the condition in Corollary 3.1 is satisfied. It turns out that, in addition to the concentration property, we can establish results regarding its asymptotic behaviors in an exact manner.
Let for and for ; its derivative . Let us write . Denote as a solution of , which is unique when is sufficiently large, and a solution of . We have the following theorem.
Theorem 3.2.
Suppose that is drawn from some distribution with mean and finite variance . Suppose is any sequence with . Then, as ,
and moreover
This theorem gives a decomposition of error into two components: variance and bias. The rate of bias depends on the distribution and . When the distribution is either symmetric or , the second component is , a negligible quantity compared with the asymptotic variance. While Huber’s approach needs the symmetric restriction, there is no need for our estimator. This theorem also lends credibility to the biasvariance tradeoff we observed in the simulation (see Section 4.1).
It is worth comparing the above Huber loss minimization with another candidate for robust mean estimation called “medianofmeans” estimator given by Hsu and Sabato (2014). The method, as its name suggests, first divides samples into subgroups and calculates means for each subgroup, then take the median of those means as the final estimator. The first step basically symmetrizes the distribution by the central limit theorem and the second step is to robustify the procedure. According to Hsu and Sabato (2014), if we choose and elementwisely estimate , similar to (2.5), with probability , we have
Although “medianofmeans” has the desired concentration property, unlike our estimator here, its asymptotic behavior differs from the empirical mean estimator, and as a consequence, it is not asymptotically efficient when the distribution is Gaussian. Therefore, regarding efficiency, we prefer our proposed procedure in Section 2.2.
4 Simulations
We now present simulation results to demonstrate improvement of the proposed robust method over the leastsquare based method (Fan et al., 2008, 2011) and Kendall’s tau based method (Han and Liu, 2014; Fan et al., 2015) when factors and errors are heavytailed and even elliptically distributed.
However, one must be cautious of the choice of the tuning parameter , since it plays an important role in the quality of the robust estimates. Out of this concern, we shall discuss the intricacy of choosing parameter before presenting the performance of robust estimates of covariance matrices.
4.1 Robust estimates of variances and covariances
For random variables with zero mean that may potentially exhibits heavytailed behavior, the sample mean of is not good enough for our estimation purpose. Though being unbiased, in the high dimensional setting, there is no guarantee that multiple sample means stay close to the true values simultaneously.
As shown in theoretical analysis, this problem is alleviated for robust estimators constructed through Mestimators, whose influence functions grow slowly at extreme values. The desired concentration property in (3.2) depends on the choice of parameter , which decides the range outside which large values cease to become more influential. However, in practice, we have to make a good guess of as the theory suggests; even so, we may be too conservative in the choice of .
To show this, we plot in Figure 1 the histograms of our estimates of in runs, where is generated from a tdistribution with degree of freedom . The first three histograms show the estimates constructed from Huber’s Mestimator, with parameter
(4.1) 
where is respectively, and the last histogram is the usual sample estimate (or ). The quality of estimates ranges from large biases to large variances. We also plot in Figure 2 the histograms of estimates of , where is generated from a multivariate tdistribution with and an identity scale matrix. The only difference is that in (4.1), the variance of is replaced by the covariance of .
From Figure 1, we observe a biasvariance tradeoff phenomenon as varies. This is also consistent with the theory in Section 3.3. When is small, the robust method underestimate the variance, yielding a large bias due to the asymmetric of the distribution of . As increases, a larger variance is traded for a smaller bias, until , in which case the robust estimator simply becomes the sample mean.
For the covariance estimation, Figure 2 exhibits a different phenomenon. Since the distribution of is symmetric for , there is no bias incurred when is small. Since the variance is smaller when is smaller, we have a net gain in terms of the quality of estimates. In the extreme case where is zero, we are actually estimating the median. Fortunately, under distributional symmetry, the mean and the median are the same.
The simple simulations help us to understand how to choose in practice: if the distribution is close to a symmetric one, one can choose aggressively, i.e. making smaller; otherwise, a conservative is preferred.
4.2 Covariance matrix estimation
We implemented the robust estimation procedure with three initial pilot estimators , and . We simulated samples of from a multivariate tdistribution with covariance matrix diag and various degrees of freedom. Each row of is independently sampled from a standard normal distribution. The population covariance matrix of is . For running from to and , we calculated errors of the robust procedure in different norms. As suggested by the experiments in the previous section, we chose a larger parameter to estimate the diagonal elements of , and a smaller one to estimate its offdiagonal elements. We used the thresholding parameter .
The estimation errors are gauged in the following norms: , and as shown in Theorem 3.1. We considered two different settings: (1) is generated from multivariate t distribution with very heavy (), medium heavy (), and light ( or Gaussian) tail; (2) is elementwise iid onedimensional t distribution with degree of freedom and . They are separately plotted in Figures 3 and 4. The estimation errors of applying sample covariance matrix are used as the baseline for comparison. For example, if is used to measure performance, the blue curve represents ratio while the black curve represents ratio where are respectively estimators given by the robust procedure with initial pilot estimators for . Therefore if the ratio curve moves below , the method is better than naive sample estimator given in Fan et al. (2011) and vice versa. The more it gets below , the more robust the procedure is against heavytailed randomness.
The first setting (Figure 3) represents a heavytailed elliptical distribution, where we expect the two robust methods work better than the sample covariance based method, especially in the case of extremely heavy tails (solid lines for ). As expected, both black curves and blue curves under the three measures behave visibly better (smaller than ). On the other hand, if data are indeed Gaussian (dotted line for ), the method with sample covariance performs better under most measures (greater than ). Nevertheless, our robust method still performs comparably with the sample covariance method, as the median error ratio stays around whereas Kendall’s tau method can be much worse than the sample covariance method. A plausible explanation is that the variance reduced compensates for the bias incurred in our procedure. In addition, the IQR plots tell us the proposed robust method is indeed more stable than Kendall’s tau.
The second setting (Figure 4) provides an example of nonelliptical distributed heavytailed data. We can see that the performance of the robust method dominates the other two methods, which verifies the approach in this paper especially when data comes from a general heavytailed distribution. While our method is able to deal with more general distributions, Kendall’s tau method does not apply to distributions outside the elliptical family, which excludes the elementwise iid distribution in this setting. This explains why under various measures, our robust method is better than Kendall’s tau method by a clear margin. Note that even in the first setting where the data are indeed elliptical, with proper tuning, the proposed robust methods can still outperform Kendall’s tau.
5 Real data analysis
In this section, we will look into financial historical data during 2005  2013, and assess to what extent our factor model characterizes the data.
The dataset we used in our analysis consists of daily returns of stocks, all of which are large market capitalization constituents of S&P index, collected without missing values from 2005 to 2013. This dataset has also been used in Fan et al. (2014), where they investigated how covariates (e.g. size, volume) could be utilized to help estimate factors and factor loadings, whereas the focus of the current paper is to develop robust methods in the presence of heavy tailed data.
In addition, we collected factors data for the same period, where the factors are calculated according to FamaFrench threefactor model (Fama and French, 1993). After centering, the panel matrix we will use for analysis, is a by matrix , in addition to a factor matrix of size by . Here is the number of daily returns and is the number of stocks.
5.1 Heavy tailedness
First, we look at how the daily returns are distributed. Especially, we are interested in the behaviors of their tails. In Figure 5, we made QQ plots that compare the distribution of with either Gaussian distribution or tdistributions. In the four plots, the base distributions are Gaussian distribution, and tdistribution with varying degree of freedom, ranging from to . We also fit a line for each plot, showing how much the return data deviate from the base distribution. It is clear that the data has a tail heavier than that of a Gaussian distribution, and that tdistribution with is almost in alignment with the return data. Similarly, we made the QQ plots for the factors in Figure 6. The plots also show that distribution is better in terms of fitting the data; however, the tails are even heavier, and tdistribution seems to best fit the data.
5.2 Spiked covariance structure
We now consider how the covariance matrix of returns looks like, since a spiked covariance structure would justify the pervasiveness assumption. To find the spectral structure, we calculated eigenvalues of the sample covariance matrix , and made a histogram based on logarithmic scale (see the left panel Figure 7). In the histogram, the counts in the rightmost four bins are , , and , representing only a few large eigenvalues, which is a strong signal of a spiked structure. We also plotted the proportion of residue eigenvalues , against in the right panel of Figure 7. The top eigenvalues account for a major part of the variances, which lends weight to the pervasive assumption.
The spiked covariance structure has been studied in Paul (2007), Johnstone and Lu (2009) and many other papers, but under their regime, the top eigenvalues or “spiked” eigenvalues do not grow with the dimension. In this paper, the spiked eigenvalues have stronger signals, and thus are easier to be separated from the rest of eigenvalues. In this respect, the connotation of “spiked covariance structure” is closer to that in Fan and Wang (2015). As empirical evidence, this phenomenon also buttresses the motivation of study in Fan and Wang (2015).
5.3 Portfolio risk estimation
We consider portfolio risk estimation. To be specific, for a portfolio with weight vector on all the market assets, its risk is measured by quantity where is the true covariance of excess returns of all the assets. Note that might be time varying. Here we consider a class of weights with gross exposure , that is and . We consider four scenarios . Note that represents the case of no short selling and the other values measure different levels of exposure to short selling.
To assess how well our robust estimator performs compared with sample covariance, we calculated the covariance estimators and , using the daily data of preceding months, where is our robust covariance estimator and is the sample covariance, for every trading day from 2006 to 2013. We indexed those dates by where runs from to . Let be the excess return of the following trading day after . For a weight vector , the error we used to gauge the two approaches is
Note the biasvariance decomposition where . The first term measures the systematic risk that cannot be reduced while the second term is the estimation error for the risk of portfolio .
To generate multiple random weights with gross exposure , we adopted the strategy used in Fan and Yao (2015): (1) for each index let (long) with probability and (short) with probability ; (2) generate iid by exponential distribution; (3) for , let and for , let . We made a set of scatter plots in Figure 8, in which the xaxis represents and the yaxis . In addition, we highlighted in the first plot the point with uniform weights (i.e. ), which serves as a benchmark for comparison. The dashed line shows where the two approaches have the same performance. Clearly, for all the robust approach has smaller risk errors, and therefore has better empirical performance in estimating portfolio risks.
Appendix A Appendix
Proof of Theorem 3.1.
Since we have robust estimator such that , we clearly know achieve the same rate. Using this, let us first prove . Obviously,
(A.1) 
because the multiplication is along the fixed dimension and each element is estimated with the rate of convergence . Also , therefore is good enough to estimate with error in max norm.