Multivariate Analysis of Nonparametric
Estimates of Large Correlation Matrices
We study concentration in spectral norm of nonparametric estimates of correlation matrices. We work within the confines of a Gaussian copula model. Two nonparametric estimators of the correlation matrix, the sine transformations of the Kendall’s tau and Spearman’s rho correlation coefficients, are studied. Expected spectral error bounds are obtained for both estimators. A general large deviation bound for the maximum spectral error of a collection of submatrices of a given dimension is also established. These results prove that when both the number of variables and sample size are large, the spectral error of the nonparametric estimators is of no greater order than that of the latent sample covariance matrix, at least when compared with some of the sharpest known error bounds for the latter. As an application, we establish the minimax optimal convergence rate in the estimation of high-dimensional bandable correlation matrices via tapering of these nonparametric estimators. An optimal convergence rate for sparse principal component analysis is also established as another example of possible applications of the main results.
Nonparametric Correlation Matrix Convergence
MSC subject classifications: Primary 62H12, 62G05; secondary 62G20
Keywords: Bandable, Correlation matrix, Gaussian copula, High dimension, Hoeffding decomposition, Kendall’s tau, Nonparametric, Sparse PCA, Spearman’s rho, Spectral norm, Tapering, U-statistics
We consider iid copies of a -dimensional Gaussian random vector . We define . We assume that ’s are centered and marginally scaled, so that and the correlation matrix is given by with 1 in the diagonal. In this paper, we work within a high-dimensional ‘double asymptotic’ setting where . We assume that instead of , we only observe iid copies , of the transformed variables
where ’s are unknown but strictly increasing. This is a form of the copula model (Sklar, 1959) for the distribution of the data. Because follows a Gaussian distribution, it is a formulation of the Gaussian copula, cf. Bickel et al. (1993) and references therein. A slightly different but equivalent formulation of the Gaussian copula has been referred to as the nonparanormal model (Liu, Lafferty and Wasserman, 2009). Let . Our goal here is to estimate the latent correlation structure using the observed data matrix Y.
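As a concrete illustration of the model just described, the following minimal sketch draws latent Gaussian rows and applies strictly increasing marginal transformations. The sample size, the correlation matrix and the three transformations are arbitrary choices made for this example and are not taken from the paper; the function names are hypothetical.

```python
import numpy as np

def sample_gaussian_copula(n, sigma, transforms, seed=0):
    """Draw n iid rows X ~ N(0, sigma) and return (X, Y) with Y[:, j] = f_j(X[:, j])."""
    rng = np.random.default_rng(seed)
    d = sigma.shape[0]
    x = rng.multivariate_normal(np.zeros(d), sigma, size=n)
    y = np.column_stack([f(x[:, j]) for j, f in enumerate(transforms)])
    return x, y

def column_ranks(a):
    """Rank of each entry within its column (0-based; assumes no ties)."""
    return np.argsort(np.argsort(a, axis=0), axis=0)

# Illustrative latent correlation matrix (unit diagonal) and increasing maps.
sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.5],
                  [0.2, 0.5, 1.0]])
transforms = [np.exp, lambda t: t ** 3, np.arctan]

x, y = sample_gaussian_copula(200, sigma, transforms)

# Strictly increasing maps leave the within-column ranks unchanged.
ranks_agree = bool(np.array_equal(column_ranks(x), column_ranks(y)))
```

The ranks of the observed data coincide with those of the latent data, which is why rank-based statistics remain usable even though the transformations are unknown.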
If we could observe the latent data matrix X, an obvious choice as an estimator would be the sample correlation matrix given by . It is for this reason that we refer to the latent as an oracle estimator. It is also clear that is a sufficient statistic for estimating when X is known. As a consequence, any statistical procedure based on could be summarily described as for some function . In this respect, possesses great utility as an ideal raw estimate that lends itself to further analysis as needed.
However, as noted above, we do not observe X but only Y, whose columns are unknown strictly monotone transformations of the columns of X. Thus the sample correlation matrix based on Y, i.e. , is in general inconsistent in estimating the latent correlation structure . Two candidate nonparametric estimators in such a scenario are considered in this paper: Kendall’s tau, developed in Kendall (1938), and Spearman’s rank correlation coefficient, developed by Charles Spearman in 1904. These are two widely used nonparametric measures of association. Their properties in fixed dimension have been studied in Kendall (1938, 1948), Kruskal (1958) and many others. More recently, in high-dimensional scenarios, correlation matrix estimators based on these measures have been studied in Liu et al. (2012) and Xue and Zou (2012a), among others.
For the rest of this paper, we call the correlation matrix estimator based on Kendall’s tau and call the one based on Spearman’s rho. It will be interesting to study whether for any statistical procedure, say , based on the raw estimate , it is possible to provide justification for the use of or as a viable replacement. It is however cumbersome to study each individual procedure separately. On the other hand, if is sufficiently smooth with respect to some matrix norm, it would suffice to study the accuracy of and as estimates of in such norms.
A complete description of properties of and as estimators of large necessitates the derivation of the distributions of these matrix estimators. It is well known that in the multivariate Gaussian model, follows a Wishart distribution (Anderson, 1958). In contrast, derivation of the distribution of and seems at the present moment intractable. On the other hand, analysis of these nonparametric estimators for each individual element of the correlation matrix has been undertaken before. Both Kendall’s tau and Spearman’s rho are specific instances of U-statistics with bounded kernels. In Hoeffding (1948), the asymptotic normality of these nonparametric estimators for an individual correlation was established. Furthermore, the celebrated Hoeffding (1963) inequality provides large deviation bounds for these estimators as U-statistics with bounded kernels. These results provide tools for studying the concentration of and in the matrix norm and its applications (Liu et al., 2012; Xue and Zou, 2012a) and the corresponding Gaussian copula graphical model (Liu, Han and Zhang, 2012).
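For reference, the large deviation bound mentioned above can be written out explicitly. The display below is the standard form of Hoeffding's (1963) inequality for a scalar U-statistic $U_n$ of order $m$ with mean $\theta$ and kernel taking values in $[a,b]$, followed by its specialization to Kendall's tau with a union bound over entries. The symbols $U_n$, $m$, $\theta$, $a$, $b$, $t$, $d$, $\hat\tau_{jk}$ and $\tau_{jk}$ are notation assumed for this illustration and need not match the paper's.

```latex
\[
  P\bigl( U_n - \theta \ge t \bigr)
  \le \exp\!\left( - \frac{2 \lfloor n/m \rfloor \, t^2}{(b-a)^2} \right),
  \qquad t > 0 .
\]
% Kendall's tau: order m = 2, kernel values in [-1, 1], so b - a = 2.
% Two-sided version combined with a union bound over at most d^2 entries:
\[
  P\Bigl( \max_{1 \le j,k \le d} \bigl| \hat\tau_{jk} - \tau_{jk} \bigr| \ge t \Bigr)
  \le 2 d^2 \exp\!\left( - \frac{\lfloor n/2 \rfloor \, t^2}{2} \right).
\]
```

Taking $t$ of order $\sqrt{(\log d)/n}$ makes the right-hand side small, which is the usual route to elementwise (maximum-norm) error bounds.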
It is important to note that while estimation accuracy in one specific matrix norm could be more appropriate for a certain set of statistical problems, some other set of problems might require accuracy in a different matrix norm. In this paper we focus on the spectral norm, also known as the operator norm. Many statistical problems can be studied with error bounds in the spectral norm of estimated correlation matrices. A primary example is principal component analysis (PCA), since the spectral norm is essential in studying the effects of matrix perturbation on eigenvalues and eigenvectors.
Before beginning the study of convergence of and in the spectral norm, it is worthwhile to note that the convergence rate of the latent sample covariance matrix in the spectral norm has been studied widely and is well established in the literature. A detailed overview and further references can be found in Vershynin (2010) among others. For example, one could derive, from the concentration inequality in Theorem II.13 of Davidson and Szarek (2001), that for with iid rows,
so that the consistency of follows when . Additionally, the concentration inequality also provides a uniform bound on the spectral error for any -dimensional diagonal submatrix for larger . Taking any integer and sets , we have by the union bound
with probability at least . These spectral error bounds are explicit and of sharp order for the latent sample correlation matrix estimate . In this light, it is apt to ask whether and also admit similar error bounds.
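The behaviour of the oracle estimator can be checked numerically. The sketch below is a small simulation under assumptions made only for this example: the latent correlation matrix is the identity, the rows are centered with unit variances (so the oracle estimate is taken as the matrix of averaged outer products), and the values of the dimension and sample sizes are arbitrary. In this setting the spectral error is known to be of order $\sqrt{d/n} + d/n$ up to constants.

```python
import numpy as np

def spectral_norm(a):
    """Operator (spectral) norm: the largest singular value."""
    return float(np.linalg.norm(a, 2))

def oracle_spectral_error(n, d, seed=0):
    """Spectral error of X^T X / n for n iid N(0, I_d) rows (known mean and scale)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    sample_cov = x.T @ x / n
    return spectral_norm(sample_cov - np.eye(d))

d = 20
err_small_n = oracle_spectral_error(200, d)    # d/n = 0.1
err_large_n = oracle_spectral_error(3200, d)   # d/n = 0.00625

# Heuristic reference values sqrt(d/n) + d/n, without constants.
ref_small_n = np.sqrt(d / 200) + d / 200
ref_large_n = np.sqrt(d / 3200) + d / 3200
```

Comparing `err_small_n` with `err_large_n` shows the error shrinking as `d / n` decreases, in line with the order of the reference values.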
In Han and Liu (2013) a rate of was established for in a transelliptical family of distributions (Liu, Han and Zhang, 2012). In separate but simultaneous work by Wegkamp and Zhao (2013), the same rate was established for in an elliptical copula correlation factor model, which can also be viewed as an elliptical copula model. In this paper, we provide non-asymptotic spectrum error bounds in the more restrictive Gaussian copula model for both and which improve the convergence rates of these existing error bounds. In particular, we establish in Theorem 1 expected spectral error bounds to match (1.1), and under mild conditions on the sample size, we establish in Theorem 2 and its corollaries large deviation bounds to match (1). These results establish that in the Gaussian copula model the nonparametric estimators and perform as well as the oracle raw estimator in terms of the order of the spectral error. Consequently, a methodology based on that hinges on a spectrum error bound can be performed with the same rate of convergence if or is used in lieu of the latent .
We discuss two different statistical problems where our results could be applied. The first, a ripe problem for application of spectral error bounds, is the estimation of a large bandable correlation matrix. For high-dimensional data, proper estimation of large bandable involves implementation of various regularization strategies such as banding, tapering and thresholding. These procedures and their properties have been studied in Wu and Pourahmadi (2003), Bickel and Levina (2008a, b), Karoui (2008), Lam and Fan (2009), Cai and Liu (2011), Cai and Zhou (2012), and Cai and Yuan (2012). In particular, Cai, Zhang and Zhou (2010) established the optimal minimax rate of convergence for a tapered version of for certain classes of unknown bandable . In Xue and Zou (2012b), a tapering estimator based on Spearman’s rank correlation was studied for the same class of parameters in the Gaussian copula model. However, the question of whether the nonparametric estimator could attain the optimal rate was not resolved in their paper. Our spectral error bounds imply that the optimal rate is attained if one substitutes with either or .
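The tapering step can be sketched as follows, using the tapering weights of Cai, Zhang and Zhou (2010): for an even bandwidth `k`, the weight equals 1 when $|i-j| \le k/2$, decreases linearly to 0 as $|i-j|$ grows to `k`, and is 0 beyond that. The function names are hypothetical, and the choice of `k` is left to the user; in the setting of this paper the input matrix would be one of the rank-based estimators.

```python
import numpy as np

def tapering_weights(d, k):
    """d x d matrix of tapering weights with bandwidth k (k assumed even and positive)."""
    half = k / 2.0
    lag = np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
    # 2 - lag/half is >= 1 for lag <= half, in (0, 1) for half < lag < k, <= 0 for lag >= k.
    return np.clip(2.0 - lag / half, 0.0, 1.0)

def taper(estimate, k):
    """Entrywise (Schur) product of a square matrix estimate with the tapering weights."""
    return estimate * tapering_weights(estimate.shape[0], k)

# Small illustration: weights for d = 6 and bandwidth k = 4.
weights = tapering_weights(6, 4)
tapered_ones = taper(np.ones((6, 6)), 4)
```

Entries close to the diagonal are kept unchanged, entries far from it are set to zero, and the linear ramp in between avoids the hard cut-off of simple banding.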
The second application involves error bounds in the estimation of the leading eigenvector in PCA both with and without a sparsity assumption on the eigenvector. With the advent and increasing prevalence of high dimensional data, various limitations of traditional procedures have come to the fore. For instance, Johnstone and Lu (2009) showed that when , the principal component of is inconsistent in estimating the leading eigenvector of the true correlation matrix. Several remedies to this problem have been proposed, all being different formulations under the auspices of a general sparse PCA paradigm. In sparse PCA, the eigenvectors corresponding to the largest eigenvalues are assumed to be sparse. A vast array of sparse PCA approaches has been proposed and studied in Jolliffe, Trendafilov and Uddin (2003), Zou, Hastie and Tibshirani (2006), d’Aspremont et al. (2007), Vu and Lei (2012), Ma (2013), and Cai, Ma and Wu (2013) among others. For the elliptical copula family, Han and Liu (2013) established the optimal rate of convergence in sparse PCA with under an additional sign sub-Gaussian condition. We will demonstrate that our spectral error bounds for the nonparametric estimators can be directly applied to study the convergence rates for the principal component direction. In particular, for sparse PCA the minimax rate as described in Vu and Lei (2012) will be established without imposing the sign sub-Gaussian condition.
The paper is organized as follows. In Section 2 we describe the Gaussian copula model and the Kendall’s tau and Spearman’s rho estimators for the correlation matrix. In Section 3, we provide upper bounds for the expected spectral error for these two correlation-matrix estimators in Theorem 1 and outline our analytical strategy. In Section 4, we provide a general large deviation inequality in Theorem 2. In Section 5 we discuss two problems where our results on spectral norm concentration could be utilized. Some of the proofs are relegated to the Appendix.
2 Background & Preliminary Results
We describe the basic data model and define the nonparametric estimates of .
2.1 Data Model and Notation
We consider the Gaussian copula or multivariate nonparametric transformational model
where is a multivariate Gaussian random vector with marginal distribution and are unknown strictly increasing functions. We are interested in estimating the population correlation matrix of , denoted by
based on a sample of iid copies of . Since the absorbs the location and scale of the individual , it is natural to assume and on the marginal distribution.
The observations , , are iid copies of . They can be written as
where are independent copies of in (2.1). We denote by the matrix with rows and quite similarly .
We use the following notation throughout the paper. For vectors , the norm is denoted by , with and . For matrices , the operator norm is denoted by . The operator norm, known as the spectrum norm, is
The vectorized and Frobenius norms are denoted by
For symmetric matrices A, the eigenpair of A is denoted by and , so that and is the leading eigenvector. In addition to and , which denote the expectation and probability measure, we denote by the average over iid copies of variables in (2.3). For example,
The relation will imply for some fixed constant . Finally we denote .
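The matrix norms and the eigenpair convention used above can be computed as in the following minimal sketch. The 2 x 2 matrix is an arbitrary example, and the elementwise maximum is shown as one common choice of vectorized norm; which vectorized norm the paper intends is an assumption here.

```python
import numpy as np

a = np.array([[3.0, 0.0],
              [0.0, -4.0]])

operator_norm = float(np.linalg.norm(a, 2))        # largest singular value
frobenius_norm = float(np.linalg.norm(a, "fro"))   # square root of the sum of squared entries
max_entry_norm = float(np.max(np.abs(a)))          # elementwise maximum (assumed vectorized norm)

# For a symmetric matrix, eigh returns eigenvalues in ascending order,
# so the last column is an eigenvector of the largest eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(a)
leading_eigenvalue = float(eigenvalues[-1])
leading_eigenvector = eigenvectors[:, -1]
```

For symmetric matrices the operator norm equals the largest absolute eigenvalue, which may differ from the largest eigenvalue when the matrix has negative eigenvalues, as in this example.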
2.2 Nonparametric Estimation of Correlation Matrix
The approach we adopt in estimating the correlation matrix in (2.2) is based on Kendall’s tau ($\tau$) or Spearman’s correlation coefficient rho ($\rho$).
With the observations in (2.3), Kendall’s tau is defined as
and Spearman’s rho as
where is the rank of among . In matrix notation,
The population version of Kendall’s tau is given by
while the population version of Spearman’s rho is given by
In matrix notation, the population version of (2.6) is
Since are strictly increasing functions, we have . Thus, Kendall’s tau, Spearman’s rho and their population versions are unchanged if the observed is replaced by the unobserved in their definitions. Since follows a standard normal distribution, we have, from Kendall (1948) and Kruskal (1958), that for ,
This immediately leads to the following correlation matrix estimator by Kendall’s tau,
In the same light we define the correlation matrix estimator by Spearman’s rho as
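A minimal sketch of the two estimators follows, assuming the standard forms of the relations from Kendall (1948) and Kruskal (1958) for a bivariate normal pair with correlation $\sigma$: population Kendall's tau equals $(2/\pi)\arcsin\sigma$ and population Spearman's rho equals $(6/\pi)\arcsin(\sigma/2)$, so the sine transformations are $\sin(\pi\hat\tau/2)$ and $2\sin(\pi\hat\rho/6)$. The function names, the simulated data and the transformations are illustrative assumptions, and the pairwise computation of Kendall's tau below uses memory of order $n^2 d$, so it is only suitable for small samples.

```python
import numpy as np

def kendall_tau_matrix(y):
    """Pairwise Kendall's tau: average sign concordance over ordered pairs of distinct rows."""
    n = y.shape[0]
    signs = np.sign(y[:, None, :] - y[None, :, :])   # shape (n, n, d); zero when rows coincide
    return np.einsum("abj,abk->jk", signs, signs) / (n * (n - 1))

def spearman_rho_matrix(y):
    """Pairwise Spearman's rho: Pearson correlation of column-wise ranks (no ties assumed)."""
    ranks = np.argsort(np.argsort(y, axis=0), axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)

def sigma_hat_tau(y):
    """Sine transformation of the Kendall's tau matrix."""
    return np.sin(0.5 * np.pi * kendall_tau_matrix(y))

def sigma_hat_rho(y):
    """Sine transformation of the Spearman's rho matrix."""
    return 2.0 * np.sin(np.pi / 6.0 * spearman_rho_matrix(y))

# Illustrative data: latent Gaussian rows and monotonically transformed observations.
rng = np.random.default_rng(1)
sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
x = rng.multivariate_normal(np.zeros(2), sigma, size=400)
y = np.column_stack([np.exp(x[:, 0]), x[:, 1] ** 3])

err_tau = float(np.linalg.norm(sigma_hat_tau(y) - sigma, 2))
err_rho = float(np.linalg.norm(sigma_hat_rho(y) - sigma, 2))
```

Both estimators have unit diagonal by construction and give the same value whether they are computed from the observed or from the latent data, since they depend on the data only through ranks and sign concordances.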
The following proposition states a slightly different version of Theorem 2.3 of Wegkamp and Zhao (2013) and a direct application of their argument to Spearman’s rho.
Both matrices and are nonnegative-definite, , and . Consequently,
3 Expected Spectrum Error Bounds
While Spearman’s rho and Kendall’s tau are structurally different, they can be represented neatly as U-statistics of a special type. In this section we develop bounds for the expected spectrum norm of their error via a certain decomposition of such U-statistics. This decomposition also provides an outline of our analysis of the concentration of the spectrum norm and the sparse spectrum norm of the error in subsequent sections.
Given a sequence of observations from a population in , a matrix U-statistic with order and kernels can be written as
Assume that are permutation symmetric and set
with any constants . The Hoeffding decomposition of can be written as
where is an average of iid random matrices with elements
and are matrix U-statistics with completely degenerate kernels of order . We refer to Hoeffding (1948), Hájek, Šidák and Sen (1967), Hájek (1968), Van der Vaart (2000) and Serfling (2009) for detailed exposition on the Hoeffding decomposition and additional references.
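For orientation, the scalar version of the Hoeffding decomposition in its standard textbook form is recorded below. The notation ($h$, $h_\ell$, $h^{(\ell)}$, $U_n^{(\ell)}$, $\theta$) is assumed for this illustration and may differ from the matrix notation and the centering constants used in this paper.

```latex
% Scalar U-statistic of order m with permutation symmetric kernel h and mean theta.
\[
  U_n = \binom{n}{m}^{-1} \sum_{1 \le i_1 < \cdots < i_m \le n} h(X_{i_1},\dots,X_{i_m}),
  \qquad
  \theta = E\, h(X_1,\dots,X_m).
\]
% Conditional expectations of the kernel given the first ell arguments.
\[
  h_\ell(x_1,\dots,x_\ell) = E\, h(x_1,\dots,x_\ell, X_{\ell+1},\dots,X_m),
  \qquad 1 \le \ell \le m .
\]
% Decomposition into U-statistics with completely degenerate kernels.
\[
  U_n - \theta = \sum_{\ell=1}^{m} \binom{m}{\ell} U_n^{(\ell)},
  \qquad
  U_n^{(\ell)} = \binom{n}{\ell}^{-1} \sum_{1 \le i_1 < \cdots < i_\ell \le n}
  h^{(\ell)}(X_{i_1},\dots,X_{i_\ell}),
\]
% The first two degenerate kernels.
\[
  h^{(1)}(x_1) = h_1(x_1) - \theta,
  \qquad
  h^{(2)}(x_1,x_2) = h_2(x_1,x_2) - h_1(x_1) - h_1(x_2) + \theta .
\]
% Orthogonality of the components gives the variance.
\[
  \operatorname{Var}(U_n)
  = \sum_{\ell=1}^{m} \binom{m}{\ell}^{2} \binom{n}{\ell}^{-1}
    E\bigl[ h^{(\ell)}(X_1,\dots,X_\ell) \bigr]^{2} .
\]
```

The term with $\ell = 1$ is an average of iid variables and has variance of order $1/n$, while the term with index $\ell$ has variance of order $n^{-\ell}$, which is why the higher order terms are typically negligible.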
Since the components of the Hoeffding decomposition are orthogonal,
A consequence of the above calculation of variance is
We note that Kendall’s tau and Spearman’s rho are U-statistics of order 2 and 3 respectively, both with kernels satisfying and for . It follows that the high order terms of their Hoeffding decompositions are explicitly bounded by
Now we consider the term . It turns out that in the Gaussian copula model (2.3), the first order kernel for Kendall’s tau can be written as
with , where , and that of Spearman’s rho is of the same form. This motivates a further decomposition of as a sum of and , with
It follows from the definition of the population Spearman’s rho in (2.8) that
Thus, the in (3.7) can be written as the difference between the sample covariance matrix of and its expectation:
Moreover, we will prove that for both Kendall’s tau and Spearman’s rho
with for Kendall’s tau and for Spearman’s rho. Thus, since on the diagonal of and is an average of iid matrices,
Let be the matrix U-statistics of either Kendall’s tau or Spearman’s rho, or as in (2.6) respectively, and the corresponding estimator of in (2.11) and (2.12). It follows from the expansion of the sine function in (2.11) and (2.12) that
with for and for . Thus, the estimators can be decomposed as
where the first two terms are bounded by (3.6) and (3.10) respectively and the third term is explicitly expressed as the difference between a sample covariance matrix and its expectation in (3.7). Moreover, the fourth term can be bounded with a higher order expansion of in (2.11) and (2.12). We note that the fourth term on the right-hand side of (3.13) is not needed if one is interested in studying or without the sine transformation. This analysis leads to the following theorem.
Let and be respectively the Kendall’s tau and Spearman’s rho matrices in (2.6), T and R be their population versions in (2.9), and and be the corresponding estimators in (2.11) and (2.12) for the population correlation matrix in the Gaussian copula model (2.1). Then, for a certain numerical constant and both and
In particular, defining (where is the integer part of ),
for Kendall’s tau, and for Spearman’s rho, with
If , then
Up to a numerical constant factor, Theorem 1 matches the bound (1.1) for the expected spectral error of the oracle sample covariance matrix . While Han and Liu (2013) and Wegkamp and Zhao (2013) focused on large deviation bounds for the spectral error of in the elliptical copula model, a direct application of their results requires for the convergence in spectrum norm. Although their results are of sharper order when , it seems that when , the extra logarithmic factor cannot be removed in their analysis based on the matrix Bernstein inequality (Tropp, 2011).
The proof of Theorem 1 requires a number of inequalities which provide key details of the analysis outlined above the statement of the theorem. These inequalities are crucial for our derivation of large deviation spectrum error bounds as well. We state these inequalities in a sequence of lemmas below and defer their proofs to the Appendix.
Let be the bivariate normal density with mean zero, variance one, and correlation . Define
Let be as in (3.19) and . Based on with iid rows, Spearman’s is a U-statistic of order 3 with a permutation symmetric kernel satisfying
Let be as in (3.7) and . Then,
and with probability at least ,