# Robust high dimensional factor models with applications to statistical machine learning^{1}^{1}1The authors gratefully acknowledge NSF grants DMS-1712591 and DMS-1662139 and NIH grant R01-GM072611.

###### Abstract

Factor models are a class of powerful statistical models that have been widely used to deal with dependent measurements that arise frequently from various applications from genomics and neuroscience to economics and finance. As data are collected at an ever-growing scale, statistical machine learning faces some new challenges: high dimensionality, strong dependence among observed variables, heavy-tailed variables and heterogeneity. High-dimensional robust factor analysis serves as a powerful toolkit to conquer these challenges.

This paper gives a selective overview on recent advance on high-dimensional factor models and their applications to statistics including Factor-Adjusted Robust Model selection (FarmSelect) and Factor-Adjusted Robust Multiple testing (FarmTest). We show that classical methods, especially principal component analysis (PCA), can be tailored to many new problems and provide powerful tools for statistical estimation and inference. We highlight PCA and its connections to matrix perturbation theory, robust statistics, random projection, false discovery rate, etc., and illustrate through several applications how insights from these fields yield solutions to modern challenges. We also present far-reaching connections between factor models and popular statistical learning problems, including network analysis and low-rank matrix recovery.

Key Words: Factor model, PCA, covariance estimation, perturbation bounds, robustness, random sketch, FarmSelect, FarmTest

## 1 Introduction

In modern data analytics, dependence across high-dimensional outcomes or measurements is ubiquitous. For example, stocks within the same industry exhibit significantly correlated returns, housing prices of a country depend on various economic factors, gene expressions can be stimulated by cytokines. Ignoring such dependence structure can produce significant systematic bias and yields inefficient statistical results and misleading insights. The problems are more severe for high-dimensional big data, where dependence, non-Gaussianity and heterogeneity of measurements are common.

Factor models aim to capture such dependence by assuming several variates or “factors”, usually much fewer than the outcomes, that drive the dependence of the entire outcomes (Lawley and Maxwell, 1962; Stock and Watson, 2002). Stemming from the early works on measuring human abilities (Spearman, 1927), factor models have become one of the most popular and powerful tools in multivariate analysis and have made profound impact in the past century on psychology (Bartlett, 1938; McCrae and John, 1992), economics and finance (Chamberlain and Rothschild, 1982; Fama and French, 1993; Stock and Watson, 2002; Bai and Ng, 2002), biology (Hirzel et al., 2002; Hochreiter et al., 2006; Leek and Storey, 2008), etc. Suppose are i.i.d. -dimensional random vectors, which may represent financial returns, housing prices, gene expressions, etc. The generic factor model assumes that

(1) |

where , is the mean vector, is the matrix of factor loadings, stores -dimensional vectors of common factors with , and represents the error terms (a.k.a. idiosyncratic components), which has mean zero and is uncorrelated with or independent of . We emphasize that, for most of our discussions in the paper (except Section 3.1), only are observable, and the goal is to infer and through . Here we use the name “factor model” to refer to a general concept where the idiosyncratic components are allowed to be weakly correlated. This is also known as the “approximate factor model” in the literature, in contrast to the “strict factor model” where the idiosyncratic components are assumed to be uncorrelated.

Note that the model (1) has identifiability issues: given any invertible matrix , simultaneously replacing with and with does not change the observation . To resolve this ambiguity issue, the following identifiability assumption is usually imposed:

###### Assumption 1.1 (Identifiability).

is diagonal and .

Other identifiability assumptions as well as detailed discussions can be found in Bai and Li (2012) and Fan et al. (2013).

Factor analysis is closely related to principal component analysis (PCA), which breaks down the covariance matrix into a set of orthogonal components and identifies the subspace that explains the most variation of the data (Pearson, 1901; Hotelling, 1933). In this selective review, we will mainly leverage PCA, or more generally, spectral methods, to estimate the factors and the loading matrix in (1). Other popular estimators, mostly based on the maximum likelihood principle, can be found in Lawley and Maxwell (1962); Anderson and Amemiya (1988); Bai and Li (2012), etc. The covariance matrix of consists of two components: and . Intuitively, when the contribution of the covariance from the error term is negligible compared with those from the factor term , the top- eigenspace (namely, the space spanned by top eigenvectors) of the sample covariance of should be well aligned with the column space of . This can be seen from the assumption that , which occurs frequently in high-dimensional statistics (Fan et al., 2013).

Here is our main message: applying PCA to well-crafted covariance matrices (including vanilla sample covariance matrices and their robust version) consistently estimates the factors and loadings, as long as the signal-to-noise ratio is large enough. The core theoretical challenge is to characterize how idiosyncratic covariance perturb the eigenstructure of the factor covariance . In addition, the situation is more complicated with the presence of heavy-tailed data, missing data, computational constraints, heterogeneity, etc.

The rest of the paper is devoted to solutions to these challenges and a wide range of applications to statistical machine learning problems. In Section 2, we will elucidate the relationship between factor models and PCA and present several useful deterministic perturbation bounds for eigenspaces. We will also discuss robust covariance inputs for the PCA procedure to guard against corruption from heavy-tailed data. Exploiting the factor structure of the data helps solve many statistical and machine learning problems. In Section 3, we will see how the factor models and PCA can be applied to high-dimensional covariance estimation, regression, multiple testing and model selection. In Section 4, we demonstrate the connection between PCA and a wide range of machine learning problems including Gaussian mixture models, community detection, matrix completion, etc. We will develop useful tools and establish strong theoretical guarantees for our proposed methods.

Here we collect all the notations for future convenience. We use to refer to . We adopt the convention of using regular letters for scalars and using bold-face letters for vectors or matrices. For , and , we define , , where , and . For a matrix , we use and to denote its operator norm (spectral norm), Frobenius norm, entry-wise (element-wise) max-norm, and vector norm, respectively. To be more specific, the last two norms are defined by and . Let denote the identity matrix, denote the -dimensional all-one vector, and denote the indicator of event , i.e., if happens, and otherwise. We use to refer to the normal distribution with mean vector and covariance matrix . For two nonnegative numbers and that possibly depend on and , we use the notation and to mean for some constant , and the notation and to mean for some constant . We write if both and hold. For a sequence of random variables and a sequence of nonnegative deterministic numbers , we write if for any , there exists and such that holds for all ; and we write if for any and , there exists such that holds for all . We omit the subscripts when it does not cause confusion.

## 2 Factor models and PCA

### 2.1 Relationship between PCA and factor models in high dimensions

Under model (1) with the identifiability condition, is given by

(2) |

Intuitively, if the magnitude of dominates , the top- eigenspace of should be approximately aligned with the column space of . Naturally we expect a large gap between the eigenvalues of and to be important for estimating the column space of through PCA (see Figure 1). On the other hand, if this gap is small compared with the eigenvalues of , it is known that PCA leads to inconsistent estimation (Johnstone and Lu, 2009). The above discussion motivates a simple vanilla PCA-based method for estimating and as follows (assuming the Identifiability Assumption).

Step 1. Obtain an estimator and of and , e.g., the sample mean and covariance matrix or their robust versions.

Step 2. Compute the eigen-decomposition of . Let be the top eigenvalues and be their corresponding eigenvectors. Set and .

Step 3. Obtain PCA estimators and , namely, consists of the top- rescaled eigenvectors of and is just the rescaled projection of onto the space spanned by the eigen-space: .

Let us provide some intuitions for the estimators in Step 3. Recall that is the th column of . Then, by model (1), . In the high-dimensional setting, the second term is averaged out when is weakly dependent across its component. This along with the identifiability condition delivers that

(3) |

Now, we estimate by and hence by and by . Using the substitution method, we obtain the estimators in Step 3.

The above heuristic also reveals that
the PCA-based methods work well if the effect of the factors outweighs the noise. To quantify this, we introduce a form of Pervasiveness Assumption from the factor model literature. While this assumption is strong^{4}^{4}4There is a weaker assumption, under which (1) is usually called the weak factor model; see Onatski (2012)., it simplifies our discussion and captures the above intuition well: it holds when the factor loadings are random samples from a nondegenerate population (Fan et al., 2013).

###### Assumption 2.1 (Pervasiveness).

The first eigenvalues of have order , whereas .

Note that under the Identifiability Assumption 1.1. The first part of this assumption holds when each factor influences a non-vanishing proportion of outcomes. Mathematically speaking, it means that for any , the average of squared loadings of the th factor satisfies (right panel of Figure 1). This holds with high probability if, for example, are i.i.d. realizations from a non-degenerate distribution, but we will not make such assumption in this paper. The second part of the assumption is reasonable, as cross-sectional correlation becomes weak after we take out the common factors. Typically, if is a sparse matrix, the norm bound holds; see Section 3.1 for details. Under this Pervasiveness Assumption, the first eigenvalues of will be well separated with the rest of eigenvalues. By the Davis-Kahan theorem (Davis and Kahan, 1970), which we present as Theorem 2.1, we can consistently estimate the column space of through the top- eigenspace of . This explains why we can apply PCA to factor model analysis (Fan et al., 2013).

Though factor models and PCA are not identical (see Jolliffe, 1986), they are approximately the same for high-dimensional problems with the pervasiveness assumption(Fan et al., 2013). Thus, PCA-based ideas are important components of estimation and inference for factor models. In later sections (especially Section 4), we discuss statistical and machine learning problems with factor-model-type structures. There PCA is able to achieve consistent estimation even when the Pervasiveness Assumption is weakened—and somewhat surprisingly—PCA can work well down to the information limit. For perspectives from random matrix theory, see Baik et al. (2005); Paul (2007); Johnstone and Lu (2009); Benaych-Georges and Nadakuditi (2011); O’Rourke et al. (2016); Wang and Fan (2017), among others.

### 2.2 Estimating the number of factors

In high-dimensional factor models, if the factors are unobserved, we need to choose the number of factors before estimating the loading matrix, factors, etc. The number can be usually estimated from the eigenvalues of the the sample covariance matrix or its robust version. With certain conditions such as separation of the top eigenvalues from the others, the estimation is consistent. Classical methods include likelihood ratio tests (Bartlett, 1950), the scree plot (Cattell, 1966), parallel analysis (Horn, 1965), etc. Here, we introduce a few recent methods: the first one is based on the eigenvalue ratio, the second on eigenvalue differences, and the third on the eigenvalue magnitude.

For simplicity, let us use the sample covariance and arrange its eigenvalues in descending order: , where (the remaining eigenvalues, if any, are zero). Lam and Yao (2012) and Ahn and Horenstein (2013) proposed an estimator based on ratios of consecutive eigenvalues. For a pre-determined , the eigenvalue ratio estimator is

Intuitively, when the signal eigenvalues are well separated from the other eigenvalues, the ratio at should be large. Under some conditions, the consistency of this estimator, which does not involve complicated tuning parameters, is established.

In an earlier work, Onatski (2010) proposed to use the differences of consecutive eigenvalues. For a given and pre-determined integer , define

Using a result on eigenvalue empirical distribution from random matrix theory, Onatski (2010) proved consistency of under the Pervasiveness Assumption. The intuition is that, the Pervasiveness Assumption implies that tends on in probability as ; whereas almost surely for because these -s converge to the same limit, which can be determined using random matrix theory. Onatski (2010) also proposed a data-driven way to determine from the empirical eigenvalue distribution of the sample covariance matrix.

A third possibility is to use an information criterion. Define

where is the sample mean, and the equivalence (second equality) is well known. For a given , is interpreted as the scaled sum of squared residuals, which measures how well factors fit the data. A very natural estimator is to find the best such that the following penalized version of is minimized (Bai and Ng, 2002):

and is any consistent estimate of . The upper limit is assumed to be no smaller than , and is typically chosen as or in empirical studies in Bai and Ng (2002). Consistency results are established under more general choices of .

We conclude this section by remarking that in general, it is impossible to consistently estimate if the smallest nonzero eigenvalue is much smaller than , because the ‘signals’ (eigenvalues of ) would not be distinguishable from the the noise (eigenvalues of ). As mentioned before, consistency of PCA is well studied in the random matrix theory literature. See Dobriban (2017) for a recent work that justifies parallel analysis using random matrix theory.

### 2.3 Robust covariance inputs

To extract latent factors and their factor loadings, we need an initial covariance estimator. Given independent observations with mean zero, the sample covariance matrix, namely , is a natural choice to estimate . The finite sample bound on has been well studied in the literature (Vershynin, 2010; Tropp, 2012; Koltchinskii and Lounici, 2017). Before presenting the result from Vershynin (2010), let us review the definition of sub-Gaussian variables.

A random variable is called sub-Gaussian if is finite, in which case this quantity defines a norm called the sub-Gaussian norm. Sub-Gaussian variables include as special cases Gaussian variables, bounded variables, and other variables with tails similar to or lighter than Gaussian tails. For a random vector , we define ; we call sub-Gaussian if is finite.

###### Theorem 2.1.

Let be the covariance matrix of . Assume that are i.i.d. sub-Gaussian random vectors, and denote . Then for any , there exist constants and only depending on such that

(4) |

where .

###### Remark 2.1.

The spectral-norm bound above depends on the ambient dimension , which can be large in high-dimensional scenarios. Interested readers can refer to Koltchinskii and Lounici (2017) for a refined result that only depends on the intrinsic dimension (or effective rank) of .

An important asepect of the above result is the sub-Gaussian concentration in (4), but this depends heavily on the sub-Gaussian or sub-exponential behaviors of observed random vectors. This condition can not be validated in high dimensions when tens of thousands of variables are collected. See Fan et al. (2016b).
When the distribution is heavy-tailed^{5}^{5}5Here, we mean it has second bounded moment when estimating the mean and has bounded fourth moment when estimating the variance., one cannot expect sub-Gaussian or sub-exponential behaviors of the sample covariance in the spectral norm (Catoni, 2012). See also Vershynin (2012) and Srivastava and Vershynin (2013). Therefore, to perform PCA for heavy-tailed data, the sample covariance is not a good choice to begin with. Alternative robust estimators have been constructed to achieve better finite sample performance.

Catoni (2012), Fan et al. (2017b) and Fan et al. (2016b) approached the problem by first considering estimation of a univariate mean from a sequence of i.i.d random variables with variance . In this case, the sample mean provides an estimator but without exponential concentration. Indeed, by Markov inequality, we have , which is tight in general and has a Cauchy tail (in terms of ). On the other hand, if we truncate the data with and compute the mean of the truncated data, then we have (Fan et al., 2016b)

for a universal constant . In other words, the mean of truncated data with only a finite second moment behaves very much the same as the sample mean from the normal data: both estimators have Gaussian tails (in terms of ). This sub-Gaussian concentration is fundamental in high-dimensional statistics as the sample mean is computed tens of thousands or even millions of times.

As an example, estimating the high-dimensional covariance matrix involves univariate mean estimation, since the covariance can be expressed as an expectation: as . Estimating each component by the truncated mean yields a covariance matrix . Assuming the fourth moment is bounded (as the covariance itself are second moments), by using the union bound and the above concentration inequality, we can easily obtain

for any and a constant . In other words, with truncation, when the data have merely bounded fourth moments, we can achieve the same estimation rate as the sample covariance matrix under the Gaussian data.

Fan et al. (2016b) and Minsker (2016) independently proposed shrinkage variants of the sample covariance with sub-Gaussian behavior under the spectral norm, as long as the fourth moments of are finite. For any , Fan et al. (2016b) proposed the following shrinkage sample covariance matrix

(5) |

to estimate , where is the -norm. The following theorem establishes the statistical error rate of in terms of the spectral norm.

###### Theorem 2.2.

Suppose for any unit vector . Then it holds that for any ,

(6) |

where and is a universal constant.

Applying PCA to the robust covariance estimators as described above leads to more reliable estimation of principal eigenspaces in the presence of heavy-tailed data.

### 2.4 Perturbation bounds

In this section, we introduce several perturbation results on eigenspaces, which serve as fundamental technical tools in factor models and related learning problems. For example, in relating the factor loading matrix to the principal components of covariance matrix in (2), one can regard as a perturbation of by an amount of and take and in Theorem 2.3 below. Similarly, we can also regard a covariance matrix estimator as a perturbation of by an amount of .

We will begin with a review of the Davis-Kahan theorem (Davis and Kahan, 1970), which is usually useful for deriving -type bounds (which includes spectral norm bounds) for symmetric matrices. Then, based on this classical result, we introduce entry-wise () bounds, which typically give refined results under structural assumptions. We also derive bounds for rectangular matrices that are similar to Wedin’s theorem (Wedin, 1972). Several recent works on this topic can be found in Yu et al. (2014); Fan et al. (2018b); Koltchinskii and Xia (2016); Abbe et al. (2017); Zhong (2017); Cape et al. (2017); Eldridge et al. (2017).

First, for any two subspaces and of the same dimension in , we choose any with orthonormal columns that span and , respectively. We can measure the closeness between two subspaces though the difference between their projectors:

The above definitions are both proper metrics (or distances) for subspaces and and do not depend on the specific choice of and , since and are projection operators. Importantly, these two metrics are connected to the well-studied notion of canonical angles (or principal angles). Formally, let the singular values of be , and define the canonical angles for . It is often useful to denote the sine of the canonical (principal) angles by , which can be interpreted as a generalization of sine of angles between two vectors. The following identities are well known (Stewart and Sun, 1990).

In some cases, it is convenient to fix a specific choice of and . It is known that for both Frobenius norm and spectral norm,

where is the space of orthogonal matrices of size . The minimizer (best rotation of basis) can be given by the singular value decomposition (SVD) of . For details, see Cape et al. (2017) for example.

Now, we present the Davis-Kahan theorem (Davis and Kahan, 1970).

###### Theorem 2.3.

Suppose are symmetric, and that have orthonormal column vectors which are eigenvectors of and respectively. Let be the set of eigenvalues corresponding to the eigenvectors given in , and let (respectively ) be the set of eigenvalues corresponding to the eigenvectors not given in (respectively ). If there exists an interval and such that and , then for any orthogonal-invariant norm^{6}^{6}6A norm is orthogonal-invariant if for any matrix and any orthogonal matrices and .

This theorem can be generalized to singular vector perturbation for rectangular matrices; see Wedin (1972). A slightly unpleasant feature of this theorem is that depends on the eigenvalues of both and . However, with the help of Weyl’s inequality, we can immediately obtain a corollary that does not involve the eigenvalues of . Let denote the th largest eigenvalue of a real symmetric matrix. Recall that Weyl’s inequality bounds the differences between the eigenvalues of and :

(7) |

This inequality suggests that, if the eigenvalues in have the same ranks (in descending order) as those in , then and are similar. Below we state our corollary, whose proof is in the appendix.

###### Corollary 2.1.

Assume the setup of the above theorem, and suppose the eigenvalues in have the same ranks as those in . If and for some , then

We can then use to obtain a bound under the Frobenius norm. In the special case where and , reduce to vectors, we can choose , and the above corollary translates into

(8) |

We can now see that the factor model and PCA are approximately the same with sufficiently large eigen-gap. Indeed, under Identifiability Assumption 1.1, we have . Applying Weyl’s inequality and Corollary 2.1 to (as ) and (as ), we can easily control the eigenvalue/eigenvector differences by and the eigengap, which is comparably small under Pervasiveness Assumption 2.1. This difference can be interpreted as the bias incurred by PCA on approximating factor models.

Furthermore, given any covariance estimator , we can similarly apply the above results by setting and to bound the difference between the estimated eigenvalues/eigenvectors and the population counterparts. Note that the above corollary gives us an upper bound on the subspace estimation error in terms of the ratio .

Next, we consider entry-wise bounds on the eigenvectors. For simplicity, here we only consider eigenvectors corresponding to unique eigenvalues rather than the general eigenspace. Often, we want to have a bound on each entry of the eigenvector difference , instead of an norm bound, which is an average-type result. In many cases, none of these entries has dominant perturbation, but the Davis-Kahan’s theorem falls short of providing a reasonable bound (the naïve bound gives a suboptimal result).

Some recent papers (Abbe et al., 2017) have addressed this problem, and in particular, entry-wise bounds of the following form are established.

where is related to the structure of the statistical problem and typically can be as small as , which is very desirable in high-dimensional setting. The small term is often related to independence pattern of the data, which is typically small under mild independence conditions.

We illustrate this idea in Figure 2 through a simulated data example (left) and a real data example (right), both of which have factor-type structure. For the left plot, we generated a network data according to the stochastic block model with blocks (communities), each having nodes : the adjacency matrix that represents the links between nodes is a symmetric matrix, with upper triangular elements generated independently from Bernoulli trials (diagonal elements are taken as 0), with the edge probability for two nodes within blocks and otherwise. Our task is to classify (cluster) these two communities based on the adjacency matrix. We used the second eigenvector (that is, corresponding to the second largest eigenvalue) of the adjacency matrix as a classifier. The left panel of Figure 2 represents the values of the coordinates (or entries) in the y-axis against the indices in the x-axis. For comparison, the second eigenvector of the expectation of the adjacency matrix—which is of interest but unknown—have entries taking values only in , depending on the unknown nature of which block a vertex belongs to (this statement is not hard to verify). We used the horizontal line to represent these ideal values: they indicate exactly the membership of each vertex. Clearly, the magnitude of entry-wise perturbation is . Therefore, we can use as an estimate of and classify all nodes with the same sign as the same community. See Section 4.2 for more details.

For the right plot, we used daily return data of stocks that are constituents of S&P 500 index from 2012.1.1–2017.12.31. We considered stocks with exactly records and excluded stocks with incomplete/missing values, which resulted in stocks. Then, we calculated the sample covariance matrix using the data in the entire period, and computed two leading eigenvectors (note that they span the column space of ) and plotted the coordinates (entries) using small dots. Stocks with an coordinate smaller than quantile or larger than quantile are potentially outlying values and are not shown in the plot. In addition, we also highlighted the fluctuation of six stocks during three time windows: 2012.1–2015.12, 2013.1–2016.12 and 2014.1–2017.12, with different big markers. That is, for each of the three time windows, we re-computed the covariance matrices and the two leading eigenvectors, and then highlighted coordinates that correspond to the six major stocks. Clearly, the magnitude for these stocks is small, which is roughly , and the fluctuation of coordinates is also very small. Both plots suggest an interesting phenomenon of eigenvectors in high dimensions: entry-wise behavior of eigenvectors can be benign under factor model structure.

To state our results rigorously, let us suppose that are symmetric matrices, with and . Let the eigen-decomposition of and be

(9) |

Here the eigenvalues and are the largest ones of and , respectively, in terms of absolute values. Both sequences are sorted in descending order. are eigenvalues of whose absolute values are smaller than . The eigenvectors and are normalized to have unit norms.

Here are allowed to take negative values. Thanks to Weyl’s inequality, and are well-separated when the size of perturbation is not too large. In addition, we have the freedom to choose signs for eigenvectors, since they are not uniquely defined. Later, we will use ‘up to sign’ to signify that our statement is true for at least one choice of sign. With the conventions and , we define the eigen-gap as

(10) |

which is the smallest distance between and other eigenvalues (including ). This definition coincides with the (usual) eigen-gap in Corollary 2.1 in the special case where we are interested in a single eigenvalue and its associated eigenvector.

We now present an entry-wise perturbation result. Let us first look at only one eigenvector. In this case, when is small, heuristically,

holds uniformly for each entry. When , that is, is unbiased, this gives the first-order approximation (rather than bounds on the difference ) of the random vector . Abbe et al. (2017) proves rigorously this result and generalizes to eigenspaces. The key technique for the proof is similar to Theorem 2.4 below, which simplifies the one in Abbe et al. (2017) in various ways but holds under more general conditions. It is stated in a deterministic way, and can be powerful if there is certain structural independence in the perturbation matrix . A self-contained proof can be found in the appendix.

For each , let be a modification of with the th row and th column zeroed out, i.e.,

We also define , and denote its eigenvalues and eigenvectors by and , respectively. This construction is related to the leave-one-out technique in probability and statistics. For recent papers using this technique, see Bean et al. (2013); Zhong and Boumal (2018); Abbe et al. (2017) for example.

###### Theorem 2.4.

Fix any . Suppose that , and that the eigen-gap as defined in (10) satisfies . Then, up to sign,

(11) |

where is the th column of .

To understand this theorem, let us compare it with the standard bound (Theorem 2.3) , which implies . The first term of the upper bound in (11) says the perturbation on the th entry can be much smaller, because the factor , always bounded by , can be usually much smaller. For example, if ’s are uniformly distributed on the unit sphere, then this factor is typically of order . This factor is related to the notion of incoherence in Candès and Recht (2009); Candès et al. (2011), etc.

The second term of the upper bound in (11) is typically much smaller than , especially under certain independence assumption. For example, if is independent of other entries, then, by construction, and are independent. If, moreover, entries of are i.i.d. standard Gaussian, is of order , whereas typically scales with . This gives a bound for the th entry, and can be extended to an bound if we are willing to make independence assumption for all (which is typical for random graphs for example).

We remark that this result can be generalized to perturbation bounds for eigenspaces (Abbe et al., 2017), and the conditions on eigenvalues can be relaxed using certain random matrix assumptions (Koltchinskii and Xia, 2016; O’Rourke et al., 2017; Zhong, 2017).

Now, we extend this perturbation result to singular vectors of rectangular matrices. Suppose satisfy and . Let the SVD of and be^{7}^{7}7Here, we prefer using to refer to the singular vectors (not to be confused with the noise term in factor models). The same applies to Section 4.

where and are respectively non-increasing in , and and are all normalized to have unit norm. As before, let have largest absolute values. Similar to (10), we adopt the conventions , and define the eigen-gap as

(12) |

For and , we define unit vectors and by replacing certain row or column of with zeros. To be specific, in our expression , if we replace the th row of by zeros, then the normalized right singular vectors of the resulting perturbed matrix are denoted by ; and if we replace the th column of by zeros, then the normalized left singular vectors of the resulting perturbed matrix are denoted by .

###### Corollary 2.2.

Fix any . Suppose that , and that . Then, up to sign,

where is the th row vector of , and is the th column vector of .

If we view as the data matrix (or observation) , then, the low rank matrix can be interpreted as . The above result provides a tool of studying estimation errors of the singular subspace of this low rank matrix. Note that can be interpreted as the result of removing the idiosyncratic error of the th observation, and as the result of removing the th covariate of the idiosyncratic error.

To better understand this result, let us consider a very simple case: and each row of is i.i.d. . We are interested in bounding the singular vector difference between the rank- matrix and its noisy observation . This is a spiked matrix model with a single spike. By independence between and as well as elementary properties of Gaussian variables, Corollary 2.2 implies that with probability , up to sign,

(13) |

Random matrix theory gives with high probability. Our perturbation inequality (Corollary 2.1) implies that . This upper bound is much larger than the two terms in (13), as is typically much smaller than in high dimensions. Thus, (13) gives a better entry-wise control over the counterpart.

Beyond this simple case, there are many desirable features of Corollary 2.2. First of all, we allow to be moderately large, in which case, as mentioned before, the factor is related to the incoherence structure in the matrix completion and robust PCA literature. Secondly, the result holds deterministically, so random matrices are also applicable. Finally, the result holds for each and , and thus it is useful even if the entries of are not independent, e.g. when a subset of covariates are dependent.

## 3 Applications to High-dimensional Statistics

### 3.1 Covariance estimation

Estimation of high-dimensional covariance matrices has wide applications in modern data analysis. When the dimensionality exceeds the sample size , the sample covariance matrix becomes singular. Structural assumptions are necessary in order to obtain a consistent estimator in this challenging scenario. One typical assumption in the literature is that the population covariance matrix is sparse, with a large fraction of entries being (close to) zero, see Bickel and Levina (2008) and Cai and Liu (2011). In this setting, most variables are nearly uncorrelated. In financial and genetic data, however, the presence of common factors leads to strong dependencies among variables (Fan et al., 2008). The approximate factor model (1) better characterizes this structure and helps construct valid estimates. Under this model, the covariance matrix has decomposition (2), where