Robust estimation of principal components from depth-based multivariate rank covariance matrix
Abstract: Analyzing principal components for multivariate data from its spatial sign covariance matrix (SCM) has been proposed as a computationally simple and robust alternative to normal PCA, but it suffers from poor efficiency properties and is actually inadmissible with respect to the maximum likelihood estimator. Here we use data depth-based spatial ranks in place of spatial signs to obtain the orthogonally equivariant Depth Covariance Matrix (DCM) and use its eigenvector estimates for PCA. We derive asymptotic properties of the sample DCM and influence functions of its eigenvectors. The shapes of these influence functions indicate robustness of estimated principal components, and good efficiency properties compared to the SCM. Finite sample simulation studies show that principal components of the sample DCM are robust with respect to deviations from normality, as well as are more efficient than the SCM and its affine equivariant version, Tyler’s shape matrix. Through two real data examples, we also show the effectiveness of DCM-based PCA in analyzing high-dimensional data and outlier detection, and compare it with other methods of robust PCA.
Keywords: Data depth; Principal components analysis; Robustness; Sign covariance matrix; Multivariate ranking
In multivariate analysis, the study of principal components is important since it provides a small number of uncorrelated variables from a potentially larger number of variables, so that these new components explain most of the underlying variability in the original data. In case of multivariate normal distribution, the sample covariance matrix provides the most asymptotically efficient estimates of eigenvectors/ principal components, but it is extremely sensitive to outliers as well as relaxations of the normality assumption. To address this issue, several robust estimators of the population covariance or correlation matrix have been proposed which can be used for Principal Components Analysis (PCA). They can be roughly put into these categories: robust, high breakdown point estimators that are computation-intensive (Rousseeuw, 1985; Maronna et al., 1976); M-estimators that are calculated by simple iterative algorithms but do not necessarily possess high breakdown point (Huber, 1977; Tyler, 1987); and symmetrised estomators that are highly efficient and robust to deviations from normality, but sensitive to outliers and computationally demanding (Dümbgen, 1998; Sirkiä et al., 2007).
When principal components are of interest, one can also estimate the population eigenvectors by analyzing the spatial sign of a multivariate vector: the vector divided by its magnitude, instead of the original data. The covariance matrix of these sign vectors, namely Sign Covariance Matrix (SCM) has the same set of eigenvectors as the covariance matrix of the original population, thus the multivariate sign transformation yields computationally simple and high-breakdown estimates of principal components (Locantore et al., 1999; Visuri et al., 2000). Although the SCM is not affine equivariant, its orthogonal equivariance suffices for the purpose of PCA. However, the resulting estimates are not very efficient, and are in fact asymptotically inadmissible (Magyar and Tyler, 2014), in the sense that there is an estimator (Tyler’s M-estimate of scatter, to be precise) that has uniformly lower asymptotic risk than the SCM.
The nonparametric concept of data-depth had first been proposed by Tukey (1975) when he introduced the halfspace depth. Given a dataset, the depth of a given point in the sample space measures how far inside the data cloud the point exists. An overview of statistical depth functions can be found in (Zuo and Serfling, 2000). Depth-based methods have recently been popular for robust nonparametric classification (Jornsten, 2004; Ghosh and Chaudhuri, 2005; Dutta and Ghosh, 2012; Sguera et al., 2014). In parametric estimation, depth-weighted means (Zuo et al., 2004) and covariance matrices (Zuo and Cui, 2005) provide high-breakdown point as well as efficient estimators, although they do involve choice of a suitable weight function and tuning parameters. In this paper we study the covariance matrix of the multivariate rank vector that is obtained from the data-depth of a point and its spatial sign, paying special attention to its eigenvectors. Specifically, we develop a robust version of principal components analysis for elliptically symmetric distributions based on the eigenvectors of this covariance matrix, and compare it with normal PCA and spherical PCA, i.e. PCA based on eigenvectors of the SCM.
The chapter is arranged in the following fashion. Section 2 provides preliminary theoretical concepts required for developments in the subsequent sections. Section 3 introduces the Depth Covariance Matrix (DCM) and states some basic results related to this. Section 4 provides asymptotic results regarding the sample DCM, calculated using data depths with respect to the empirical distribution function, as well as its eigenvectors and eigenvalues. Section 5 focuses solely on principal component estimation using the sample DCM. We obtain influence functions and asymptotic efficiencies for eigenvectors of the DCM. We also compare their finite sample efficiencies for several multinormal and multivariate -distributions with those of the SCM, Tyler’s scatter matrix and its depth-weighted version through a simulation study. Section 6 presents two applications of the methods we develop on real data. Finally, we wrap up our discussion in Section 7 by giving a summary of our findings and providing some potential future areas of research. Appendices A and B contain all technical details and proofs of the results we derive.
2.1 Spatial signs and sign covariance matrix
Given a vector , its spatial sign is defined as the vector valued function (Möttönen and Oja, 1995):
When is a random vector that follows an elliptic distribution , with a mean vector and covariance matrix , the sign vectors reside on the surface of a -dimensional unit ball centered at . Denote by the covariance matrix of spatial signs, or the Sign Covariance Matrix (SCM). The transformation keeps eigenvectors of population covariance matrix unchanged, and eigenvectors of the sample SCM are -consistent estimators of their population counterparts (Taskinen et al., 2012).
The sign transformation is rotation equivariant, i.e. for any orthogonal matrix , and as a result the SCM is rotation equivariant too, in the sense that . This is not necessarily true in general if is replaced by any non-singular matrix. An affine equivariant version of the sample SCM is obtained as the solution of the following equation:
which turns out to be Tyler’s M-estimator of scatter (Tyler, 1987). In this context, one should note that for scatter matrices, affine equivariance will mean any affine transformation on the original random variable ( non-singular, ) being carried over to the covariance matrix estimate upto a scalar multiple: for some .
2.2 Data depth and outlyingness
For any multivariate distribution belonging to a set of distributions , the depth of a point , say is any real-valued function that provides a ’center outward ordering’ of with respect to (Zuo and Serfling, 2000). Liu (1990) outlines the desirable properties of a statistical depth function:
(D1) Affine invariance: ;
(D2) Maximality at center: for having center of symmetry . This point is called the deepest point of the distribution.;
(D3) Monotonicity with respect to deepest point: , being deepest point of .;
(D4) Vanishing at infinity: as .
In (D2) the types of symmetry considered can be central symmetry, angular symmetry and halfspace symmetry. Also for multimodal probability distributions, i.e. distributions with multiple local maxima in their probability density functions, properties (D2) and (D3) are actually restrictive towards the formulation of a reasonable depth function that captures the shape of the data cloud. In our derivations that follow, we replace these two by a weaker condition:
(D2*) Existence of a maximal point: The maximum depth over all distributions and points is bounded above, i.e. . We denote this point by .
A real-valued function measuring the outlyingness of a point with respect to the data cloud can be seen as the opposite of what data depth does. Indeed, such functions have been used to define several depth functions, for example simplicial depth, projection depth and -depth. Keeping with the spirit of the utility of these functions we name them ‘htped’: literally the reverse of ‘depth’, and give a general definition of such functions as a transformation on any depth function:
Given a random variable following a probability distribution , and a depth function , we define Htped of a point as: as any function of the data depth so that is bounded, monotonically decreasing in and .
For a fixed depth function, there are several choices of a corresponding htped. We develop our theory assuming a general htped function, but for the plots and simulations, fix our htped as , i.e. simply subtract the depth of a point from the maximum possible depth over all points in sample space.
We will be using the following 3 measures of data-depth to obtain our DCMs and compare their performances:
Halfspace depth (HD) (Tukey, 1975) is defined as the minimum probability of all halfspaces containing a point. In our notations,
Mahalanobis depth (MhD) (Liu et al., 1999) is based on the Mahalanobis distance of to with respect to : . It is defined as
note here that can be seen as a valid htped function of with respect to .
Projection depth (PD) (Zuo, 2003) is another depth function based on an outlyingness function. Here that function is
where and are some univariate measures location and scale, respectively. Given this the depth at is defined as .
Computation-wise, MhD is easy to calculate since the sample mean and covariance matrix are generally used as estimates of and , respectively. However this makes MhD less robust with respect to outliers. PD is generally approximated by taking maximum over a number of random projections. There have been several approaches for calculating HD. A recent unpublished paper (Rainer and Mozharovskyi, 2014) provides a general algorithm that computes exact HD in time. In this paper, we shall use inbuilt functions in the R package fda.usc for calculating the above depth functions.
3 Depth-based rank covariance matrix
Consider a vector-valued random variable . Data depth is as much a property of the random variable as it is of the underlying distribution, so for ease of notation while working with transformed random variables, from now on we shall be using to denote the depth of a point . Now, given a depth function (equivalently, an htped function ), transform the original random variable as: , being the spatial sign functional. The transformed random variable can be seen as the multivariate rank corresponding to (e.g. Serfling (2006)). The notion of multivariate ranks goes back to Puri and Sen (1971), where they take the vector consisting of marginal univariate ranks as multivariate rank vector. Subsequent definitions of multivariate ranks were proposed by Möttönen and Oja (1995); Hallin and Paindaveine (2002) and Chernozhukov et al. (2014). Compared to these formulations, our definition of multivariate ranks works for any general depth function, and provides an intuitive extension to any spatial sign-based methodology.
Figure 1 gives an idea of how the multivariate rank vector is distributed when has a bivariate normal distribution. Compared to the spatial sign, which are distributed on the surface of -dimensional unit ball centered at , these spatial ranks have the same direction as original data and reside inside the -dimensional ball around that has radius (which, for the case of halfspace depth, equals 0.5). Any outlying samples situated far away from the data cloud (represented by red points in the figure) are mapped close to the boundary of the -dimensional ball after the rank transformation.
Now consider the spectral decomposition for the covariance matrix of : , being orthogonal and diagonal with positive diagonal elements. Also normalize the original random variable as . In this setup, we can represent the transformed random variable as
is an even function in because of affine invariance, as is . Since is odd in for circularly symmetric , it follows that , and consequently we obtain an expression for the covariance matrix of :
Let the random variable follow an elliptical distribution with center and covariance matrix , its spectral decomposition. Then, given a depth function the covariance matrix of the transformed random variable is
where , so that a diagonal matrix with diagonal entries
The matrix of eigenvectors of the covariance matrix, , remains unchanged in the transformation . As a result, the multivariate rank vectors can be used for robust principal component analysis, which will be outlined in the following sections. However, as one can see in the above expression, the diagonal entries of are not same as those of , i.e. the actual eigenvalues. This is the reason for lack of affine equavariance of the DCM. Following the case of multivariate sign covariance matrices (Taskinen et al., 2012) one can get back the shape components, i.e. original standardized eigenvalues from by an iterative algorithm:
Set , and start with an initial value .
Calculate the next iterate
and standardize its eigenvalues:
Stop if convergence criterion is satisfied. Otherwise set ad go to step 2.
Unlike sign covariance matrices and symmetrized sign covariance matrices (Dümbgen, 1998), however, attempting to derive an affine equivariant counterpart (as opposed to only orthogonal equivariance) of the DCM through an iterative approach analogous to Tyler (1987) will not result in anything new. This is because Tyler’s scatter matrix is defined as the implicit solution to the following equation:
and simply replacing by its multivariate rank counterpart will not change the estimate as and have the same directions. Instead we consider a depth-weighted version of Tyler’s scatter matrix (i.e. weights in right side of (3)) in the simulations in Section 5. The simulations show that it has slightly better finite-sample efficiency than but has same asymptotic performance. We conjecture that its concentration properties can be obtained by taking an approach similar to Soloveychik and Wiesel (2014).
4 Asymptotic results
4.1 The sample DCM
Let us now consider iid random draws from our elliptic distribution , say . For ease of notation, denote . Then, given the depth function and known location center , one can show that the vectorized form of -times the sample DCM: has an asymptotic multivariate normal distribution with mean and a certain covariance matrix by straightforward application of the central limit theorem (CLT). But in practice the population depth function is estimated by the depth function based on the empirical distribution function, . Denote this sample depth by . Here we make the following assumption regarding its relation to :
(D5) Uniform convergence: as .
The assumption that empirical depths converge uniformly at all points to their population versions holds under very mild conditions for several well known depth functions: for example projection depth (Zuo, 2003) and simplicial depth (Dümbgen, 1992). One also needs to replace the known location parameter by some estimator . Examples of robust estimators of location that are relevant here include the spatial median (Haldane, 1948; Brown, 1983), Oja median (Oja, 1983), projection median (Zuo, 2003) etc. Now, given and , to plug them into the sample DCM and still go through with the CLT we need the following result:
Consider a random variable having a continuous and symmetric distribution with location center such that . Given random samples from this distribution, suppose is an estimator of so that . Then with the above notations, and given the assumption (D5) we have
We are now in a position to state the result for consistency of the sample DCM:
Consider iid samples from the distribution in Lemma 4.1. Then, given a depth function and an estimate of center so that ,
In case is elliptical, an elaborate form of the covariance matrix explicitly specifying each of its elements (more directly those of its -rotated version) can be obtained, which is given in Appendix A. This form is useful when deriving limiting distributions of eigenvectors and eigenvalues of the sample DCM.
4.2 Eigenvectors and eigenvalues
Since we are mainly interested in using the DCM for a robust version of principal components analysis, from now on we assume that the eigenvalues of are distinct: to obtain asymptotic distributions of principal components. In the case of eigenvalues with larger than 1 multiplicities, the limiting distributions of eigenprojection matrices can be obtained analogous to those of the sign covariance matrix (Magyar and Tyler, 2014).
We now derive the asymptotic joint distributions of eigenvectors and eigenvalues of the sample DCM. The following result allows us to get these, provided we know the limiting distribution of the sample DCM itself:
(Taskinen et al., 2012) Let be an elliptical distribution with a diagonal covariance matrix , and be any positive definite symmetric matrix such that at the limiting distribution of is a -variate (singular) normal distribution with mean zero. Write the spectral decomposition of as . Then the limiting distributions of and are multivariate (singular) normal and
The first matrix picks only off-diagonal elements of the LHS and the second one only diagonal elements. We shall now use this as well as the form of the asymptotic covariance matrix of the vec of sample DCM, i.e. to obtain limiting variance and covariances of eigenvalues and eigenvectors.
Consider the sample DCM and its spectral decomposition . Then the matrices and have independent distributions. The random variable asymptotically has a -variate normal distribution with mean , and the asymptotic variance and covariance of different columns of are as follows:
where . The vector consisting of diagonal elements of , say asymptotically has a -variate normal distribution with mean and variance-covariance elements:
5 Robustness and efficiency properties
In this section, we first obtain the influence functions of the DCM as well as its eigenvectors and eigenvalues, which are essential to understand how much influence a sample point, especially an infinitesimal contamination, has on any functional on the distribution (Hampel et al., 1986). We also derive the asymptotic efficiencies of individual principal components with respect to those of the original covariance matrix and sign covariance matrix. Unlike affine equivariant estimators of shape, the Asymptotic Relative Efficiency (ARE) of eigenvectors (with respect to any other affine equivariant estimator) can not be simplified as a ratio of two scalar quantities dependent on only the distribution of (e.g. Taskinen et al. (2012); Ollilia et al. (2003)). Finite sample efficiency of the DCM estimates with respect to infinitesimal contamination and heavy-tailed distributions shall also be demonstrated by a simulation study.
5.1 Influence functions
Given any probability distribution , the influence function of any point in the sample space for some functional on the distribution is defined as
where is with an additional mass of at , i.e. ; being the distribution with point mass at . When for some -integrable function , . It now follows that for the DCM,
Following Croux and Haesbroeck (2000), we now get the influence function of the column of :
where . Clearly this influence function will be bounded, which indicates good robustness properties of principal components. Moreover, since the htped function takes small values for points close to the center of the distribution, it does not suffer from the inlier effect that is typical of the SCM and Tyler’s shape matrix. The influence function for the eigenvector estimates of these two matrices (say and , respectively) are as follows:
for . In Figure 2 we consider first eigenvectors of different scatter estimates for the and plot norms of these influence functions for different values of . The plots for SCM and Tyler’s shape matrix demonstrate the ’inlier effect’, i.e. points close to symmetry center and the center itself having high influence. The influence function for the sample covariance matrix is obtained by replacing by in the expression of above, hence is unbounded and the corresponding eigenvector estimators are not robust. In comparison, all three DCMs considered here have a bounded influence function as well as small values of the influence function at ’deep’ points.
5.2 Asymptotic and finite-sample efficiencies
Suppose is a -consistent estimator of the population covariance matrix , which permits a spectral decomposition , where . Then the asymptotic variance of the eigenvectors are (see Theorem 13.5.1 in Anderson (3rd ed. 2003))
For 2 dimensions, this expression can be somewhat simplified. Suppose the two eigenvalues are and . In that case the eigenvalues of the DCM are
and by simple algebra we get
For , Table 1 below considers 6 different elliptic distributions (namely, bivariate with df = and bivariate normal), and summarizes the ARE for first eigenvector of the SCM, Tyler’s scatter matrix and DCM for 3 choices of depth function (HSD-CM, MhD-CM, PD-CM: columns 3-5), as well as their depth-weighted versions mentioned at the end of section 3 (HSD-wCM, MhD-wCM, PD-wCM: columns 6-8). The SCM and Tyler’s M-estimator perform better than the sample covariance matrix only for a bivariate -distribution with df = 5. Estimates based on depth-based covariance matrices have much better performances for all distributions, and continue to be competitive of the sample covariance matrix estimates as the base distribution approaches normality, especially those based on projection depth. Interestingly, depth-weighted versions seem to have better performances than their corresponding DCMs, more so for heavy-tailed distributions.
We now obtain finite sample efficiencies of the three DCMs as well as their depth-weighted affine equivariant counterparts by a simulation study, and compare them with the same from the SCM and Tyler’s scatter matrix. We consider the same 6 elliptical distributions considered in ARE calculations above, and from every distribution draw 10000 samples each for sample sizes . All distributions are centered at , and have covariance matrix . We consider 3 choices of : 2, 3 and 4.
We use the concept of principal angles (Miao and Ben-Israel, 1992) to find out error estimates for the first eigenvector of a scatter matrix. In our case, the first eigenvector will be
For an estimate of the eigenvector, say , error in prediction is measured by the smallest angle between the two lines, i.e. . A smaller absolute value of this angle is equivalent to better prediction. We repeat this 10000 times and calculate the Mean Squared Prediction Angle:
Finally, the finite sample efficiency of some eigenvector estimate relative to that obtained from the sample covariance matrix, say is obtained as:
Tables 2, 3 and 4 give FSE values for , respectively. In general, all the efficiencies increase as the dimension goes up. DCM-based estimators (columns 3-5 in each table) outperform SCM and Tyler’s scatter matrix, and among the 3 depths considered, projection depth seems to give the best results. Its finite sample performances are better than Tyler’s and Huber’s M-estimators of scatter as well as their symmetrized counterparts (see Table 4 in Sirkiä et al. (2007), and quite close to the affine equivariant spatial sign covariance matrix (see Table 2 in Ollilia et al. (2003)) For , the first 5 columns of Table 2 approximate the asymptotic efficiencies in Table 1 well, except for the multivariate -distribution with df = 5. Finally, the depth-weighted iterated versions of these 3 SCMs (columns 6-8 in each table) seem to further better the performance of their corresponding orthogonal equivariant counterparts.
6 Examples in real data analysis
6.1 Bus data
This dataset is available in the R package rrcov, and consists of data on images of 218 buses. The 18 variables here correspond to several features related to these images. Here we extend upon the analysis in Maronna et al. (2006), pp. 213 to compare the classical PCA and 3 different methods of robust PCA (including that using SCM) with our DCM-based method. Similar to the original analysis, we set aside variable 9 and scale the other variables by dividing with their respective median absolute deviations (MAD). This is done because all the variables had much larger standard deviations compared to their MADs, and variable 9 had MAD = 0.
For the sake of uniformity, we use projection depth as our fixed depth function while doing depth-based PCA in our data analysis examples. We compare the outputs of classical (CPCA) and depth-based PCA (DPCA) with the following 3 robust methods: Spherical PCA, i.e. PCA based on the SCM (SPCA), PCs obtained by the ROBPCA algorithm (Hubert et al., 2005), and the Minimum Covariance Determinant (MCD) estimator (Rousseeuw, 1984) (MPCA). Table 5 gives the proportions of variability that are left unexplained after the top components are taken into account in each of the 5 methods. The first PC in classical PCA seems to explain a much higher proportion of variability in original data than robust methods. However, as noted in Maronna et al. (2006) and Hubert et al. (2005), this is a result of the classical variances being inflated due to outliers in the direction of the first principal axis. Among the robust methods, the proportions of unexplained variances are highest for DCM-based PCA for all values of .
Table 6 demonstrates why robust methods actually give a better representation of the underlying data structure than classical PCA here. Each of its column lists different quantiles of the squared orthogonal distance for a sample point from the hyperplane formed by top 3 PCs estimated by the corresponding method. For PCA based on projection-DCM, the estimated principal component subspaces are closer to the data than CPCA for more than 90% of samples, and the distance only becomes larger for higher quantiles. This means that for CPCA, estimated basis vectors of the hyperspace get pulled by extreme outlying points, while the influence of these outliers is very low for DPCA. SPCA and ROBPCA perform very closely in this respect, the percentage of points that have less squared distance than CPCA being between 80% and 90% for both of them. This percentage is only 50% for MPCA, which suggests that the corresponding 3-dimensional subspace estimated by MCD is possibly not an accurate representation of the truth.
|Method of PCA|
|Quantile||Method of PCA|
6.2 Octane data
We now apply our method to a high-dimensional dataset and demonstrate its effectiveness in outlier detection. Due to Esbensen et al. (1994), this dataset consists of 226 variables and 39 observations. Each observation is a gasoline sample with a certain octane number, and have their NIR absorbance spectra measured in 2 nm intervals between 1100 - 1550 nm. There are 6 outliers here: compounds 25, 26 and 36-39, which contain alcohol. Proportions of explained variability by PCs obtained by the two methods are given in Figure 3. Once again, the first PC in CPCA explains a much larger proportion of variance than DPCA. Second and third PCs obtained by DPCA explain higher proportions of variability than the corresponding components in CPCA.
In both methods, the first two PCs explains a large amount (98% for CPCA, 89% for DPCA) of underlying variability. But in DPCA, these PCs are more effective in detecting outliers, which we demonstrate in Figure 4. For any method of PCA with components on a dataset of observations and variables, the score distance (SD) and orthogonal distance (OD) for observation () are defined as: